linux-block.vger.kernel.org archive mirror
* [PATCH V4 00/27] ublk: add UBLK_F_BATCH_IO
@ 2025-11-21  1:58 Ming Lei
  2025-11-21  1:58 ` [PATCH V4 01/27] kfifo: add kfifo_alloc_node() helper for NUMA awareness Ming Lei
                   ` (28 more replies)
  0 siblings, 29 replies; 66+ messages in thread
From: Ming Lei @ 2025-11-21  1:58 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Caleb Sander Mateos, Uday Shankar, Stefani Seibold, Andrew Morton,
	linux-kernel, Ming Lei

Hello,

This patchset adds the UBLK_F_BATCH_IO feature for batched communication
between the kernel and the ublk server:

- Per-queue vs Per-I/O: Commands operate on queues rather than individual I/Os

- Batch processing: Multiple I/Os are handled in a single operation

- Multishot commands: Use io_uring multishot uring_cmd to reduce submission overhead

- Flexible task assignment: Any task can handle any I/O (no per-I/O daemons)

- Better load balancing: Tasks can adjust their workload dynamically

- Helps future optimizations:
	- batched blk-mq tag freeing
	- io-poll support
	- per-task batching to avoid the per-io lock
	- fetch command priority

- Simplified command cancel handling with a per-queue lock

Selftests are provided.
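
For orientation, below is a minimal, illustrative userspace sketch of issuing
one UBLK_U_IO_PREP_IO_CMDS batch uring_cmd. It is not taken from the
selftests: it assumes a liburing recent enough for IORING_OP_URING_CMD, the
helper name prep_batch_cmds() is made up, ring setup (e.g. SQE128) and error
handling are omitted, and the plain 8-byte element (no buffer address, no
zone LBA) is assumed:

#include <liburing.h>
#include <linux/ublk_cmd.h>
#include <stdint.h>
#include <string.h>

static int prep_batch_cmds(struct io_uring *ring, int dev_fd, __u16 q_id,
			   struct ublk_elem_header *elems, __u16 nr_elem)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
	struct ublk_batch_io hdr = {
		.q_id = q_id,
		.flags = 0,			/* no optional element fields */
		.nr_elem = nr_elem,
		.elem_bytes = sizeof(*elems),	/* 8 bytes per element */
	};

	if (!sqe)
		return -1;

	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = IORING_OP_URING_CMD;
	sqe->fd = dev_fd;			/* /dev/ublkcN char device */
	sqe->cmd_op = UBLK_U_IO_PREP_IO_CMDS;
	/* element array lives in plain user memory (see V4 changelog) */
	sqe->addr = (__u64)(uintptr_t)elems;
	/* the 16-byte header goes into the SQE command payload */
	memcpy(sqe->cmd, &hdr, sizeof(hdr));

	return io_uring_submit(ring);
}

UBLK_U_IO_COMMIT_IO_CMDS is built the same way, with each element also
carrying the completion result.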


Performance test results (IOPS) on V3:

- page copy
tools/testing/selftests/ublk/kublk add -t null -q 16 [-b]

- zero copy (--auto_zc)
tools/testing/selftests/ublk/kublk add -t null -q 16 --auto_zc [-b]

- IO test
taskset -c 0-31 fio/t/io_uring -p0 -n $JOBS -r 30 /dev/ublkb0

1) 16 jobs IO
- page copy:             37.77M vs. 42.40M (BATCH_IO), +12%
- zero copy (--auto_zc): 42.83M vs. 44.43M (BATCH_IO), +3.7%

2) single job IO
- page copy:             2.54M vs. 2.60M (BATCH_IO), +2.3%
- zero copy (--auto_zc): 3.13M vs. 3.35M (BATCH_IO), +7%


V4:
	- fix handling when running out of multishot buffer: the request has
	  to be un-prepared for zero copy
	- don't expose unused tag to userspace
	- replace the fixed buffer with a plain user buffer for
	  UBLK_U_IO_PREP_IO_CMDS and UBLK_U_IO_COMMIT_IO_CMDS
	- replace the iov iterator with plain copy_from_user() in
	  ublk_walk_cmd_buf(); the code is simplified and performance improves
	- don't touch sqe->len for UBLK_U_IO_PREP_IO_CMDS and
	  UBLK_U_IO_COMMIT_IO_CMDS (Caleb Sander Mateos)
	- use READ_ONCE() for accessing sqe->addr (Caleb Sander Mateos)
	- assorted patch style fixes (Caleb Sander Mateos)
	- inline __kfifo_alloc() (Caleb Sander Mateos)


V3:
	- rebase on for-6.19/block
	- use blk_mq_end_request_batch() to free requests in batches, only for
	  page copy
	- fix one IO hang caused by memory barrier ordering, and comment on
	  the memory barrier pairing
	- add NUMA-aware kfifo_alloc_node()
	- fix one build warning reported by 0-DAY CI
	- selftests improvements & fixes

V2:
	- ublk_config_io_buf() vs. __ublk_fetch() ordering
	- code style cleanup
	- use READ_ONCE() to cache sqe data because the sqe copy has recently
	  become conditional
	- don't use sqe->len for UBLK_U_IO_PREP_IO_CMDS &
	  UBLK_U_IO_COMMIT_IO_CMDS
	- fix one build warning
	- fix build_user_data()
	- ran performance analysis and found one bug in
	  io_uring_cmd_buffer_select(); the fix has already been posted

Ming Lei (27):
  kfifo: add kfifo_alloc_node() helper for NUMA awareness
  ublk: add parameter `struct io_uring_cmd *` to
    ublk_prep_auto_buf_reg()
  ublk: add `union ublk_io_buf` with improved naming
  ublk: refactor auto buffer register in ublk_dispatch_req()
  ublk: pass const pointer to ublk_queue_is_zoned()
  ublk: add helper of __ublk_fetch()
  ublk: define ublk_ch_batch_io_fops for the coming feature F_BATCH_IO
  ublk: prepare for not tracking task context for command batch
  ublk: add new batch command UBLK_U_IO_PREP_IO_CMDS &
    UBLK_U_IO_COMMIT_IO_CMDS
  ublk: handle UBLK_U_IO_PREP_IO_CMDS
  ublk: handle UBLK_U_IO_COMMIT_IO_CMDS
  ublk: add io events fifo structure
  ublk: add batch I/O dispatch infrastructure
  ublk: add UBLK_U_IO_FETCH_IO_CMDS for batch I/O processing
  ublk: abort requests filled in event kfifo
  ublk: add new feature UBLK_F_BATCH_IO
  ublk: document feature UBLK_F_BATCH_IO
  ublk: implement batch request completion via
    blk_mq_end_request_batch()
  selftests: ublk: fix user_data truncation for tgt_data >= 256
  selftests: ublk: replace assert() with ublk_assert()
  selftests: ublk: add ublk_io_buf_idx() for returning io buffer index
  selftests: ublk: add batch buffer management infrastructure
  selftests: ublk: handle UBLK_U_IO_PREP_IO_CMDS
  selftests: ublk: handle UBLK_U_IO_COMMIT_IO_CMDS
  selftests: ublk: handle UBLK_U_IO_FETCH_IO_CMDS
  selftests: ublk: add --batch/-b for enabling F_BATCH_IO
  selftests: ublk: support arbitrary threads/queues combination

 Documentation/block/ublk.rst                  |   60 +-
 drivers/block/ublk_drv.c                      | 1312 +++++++++++++++--
 include/linux/kfifo.h                         |   34 +-
 include/uapi/linux/ublk_cmd.h                 |   85 ++
 lib/kfifo.c                                   |    8 +-
 tools/testing/selftests/ublk/Makefile         |    7 +-
 tools/testing/selftests/ublk/batch.c          |  604 ++++++++
 tools/testing/selftests/ublk/common.c         |    2 +-
 tools/testing/selftests/ublk/file_backed.c    |   11 +-
 tools/testing/selftests/ublk/kublk.c          |  143 +-
 tools/testing/selftests/ublk/kublk.h          |  195 ++-
 tools/testing/selftests/ublk/null.c           |   18 +-
 tools/testing/selftests/ublk/stripe.c         |   17 +-
 .../testing/selftests/ublk/test_generic_14.sh |   32 +
 .../testing/selftests/ublk/test_generic_15.sh |   30 +
 .../testing/selftests/ublk/test_generic_16.sh |   30 +
 .../testing/selftests/ublk/test_stress_06.sh  |   45 +
 .../testing/selftests/ublk/test_stress_07.sh  |   44 +
 tools/testing/selftests/ublk/utils.h          |   64 +
 19 files changed, 2563 insertions(+), 178 deletions(-)
 create mode 100644 tools/testing/selftests/ublk/batch.c
 create mode 100755 tools/testing/selftests/ublk/test_generic_14.sh
 create mode 100755 tools/testing/selftests/ublk/test_generic_15.sh
 create mode 100755 tools/testing/selftests/ublk/test_generic_16.sh
 create mode 100755 tools/testing/selftests/ublk/test_stress_06.sh
 create mode 100755 tools/testing/selftests/ublk/test_stress_07.sh

-- 
2.47.0


^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH V4 01/27] kfifo: add kfifo_alloc_node() helper for NUMA awareness
  2025-11-21  1:58 [PATCH V4 00/27] ublk: add UBLK_F_BATCH_IO Ming Lei
@ 2025-11-21  1:58 ` Ming Lei
  2025-11-29 19:12   ` Caleb Sander Mateos
  2025-11-21  1:58 ` [PATCH V4 02/27] ublk: add parameter `struct io_uring_cmd *` to ublk_prep_auto_buf_reg() Ming Lei
                   ` (27 subsequent siblings)
  28 siblings, 1 reply; 66+ messages in thread
From: Ming Lei @ 2025-11-21  1:58 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Caleb Sander Mateos, Uday Shankar, Stefani Seibold, Andrew Morton,
	linux-kernel, Ming Lei

Add __kfifo_alloc_node() by refactoring and reusing __kfifo_alloc(),
and define kfifo_alloc_node() macro to support NUMA-aware memory
allocation.

The new __kfifo_alloc_node() function accepts a NUMA node parameter
and uses kmalloc_array_node() instead of kmalloc_array() for
node-specific allocation. The existing __kfifo_alloc() now calls
__kfifo_alloc_node() with NUMA_NO_NODE to maintain backward
compatibility.

This enables users to allocate kfifo buffers on specific NUMA nodes,
which is important for performance in NUMA systems where the kfifo
will be primarily accessed by threads running on specific nodes.
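
For reference, a minimal sketch of how a caller might use the new helper;
the `struct tag_fifo` wrapper and its functions are illustrative only, not
part of this patch:

#include <linux/kfifo.h>
#include <linux/gfp.h>

struct tag_fifo {
	DECLARE_KFIFO_PTR(tags, unsigned short);
};

/* allocate the fifo backing storage on the caller's home NUMA node */
static int tag_fifo_init(struct tag_fifo *tf, unsigned int size, int node)
{
	/* size is rounded up to the next power of 2 internally */
	return kfifo_alloc_node(&tf->tags, size, GFP_KERNEL, node);
}

static void tag_fifo_exit(struct tag_fifo *tf)
{
	kfifo_free(&tf->tags);
}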

Cc: Stefani Seibold <stefani@seibold.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 include/linux/kfifo.h | 34 ++++++++++++++++++++++++++++++++--
 lib/kfifo.c           |  8 ++++----
 2 files changed, 36 insertions(+), 6 deletions(-)

diff --git a/include/linux/kfifo.h b/include/linux/kfifo.h
index fd743d4c4b4b..8b81ac74829c 100644
--- a/include/linux/kfifo.h
+++ b/include/linux/kfifo.h
@@ -369,6 +369,30 @@ __kfifo_int_must_check_helper( \
 }) \
 )
 
+/**
+ * kfifo_alloc_node - dynamically allocates a new fifo buffer on a NUMA node
+ * @fifo: pointer to the fifo
+ * @size: the number of elements in the fifo, this must be a power of 2
+ * @gfp_mask: get_free_pages mask, passed to kmalloc()
+ * @node: NUMA node to allocate memory on
+ *
+ * This macro dynamically allocates a new fifo buffer with NUMA node awareness.
+ *
+ * The number of elements will be rounded-up to a power of 2.
+ * The fifo will be released with kfifo_free().
+ * Return 0 if no error, otherwise an error code.
+ */
+#define kfifo_alloc_node(fifo, size, gfp_mask, node) \
+__kfifo_int_must_check_helper( \
+({ \
+	typeof((fifo) + 1) __tmp = (fifo); \
+	struct __kfifo *__kfifo = &__tmp->kfifo; \
+	__is_kfifo_ptr(__tmp) ? \
+	__kfifo_alloc_node(__kfifo, size, sizeof(*__tmp->type), gfp_mask, node) : \
+	-EINVAL; \
+}) \
+)
+
 /**
  * kfifo_free - frees the fifo
  * @fifo: the fifo to be freed
@@ -899,8 +923,14 @@ __kfifo_uint_must_check_helper( \
 )
 
 
-extern int __kfifo_alloc(struct __kfifo *fifo, unsigned int size,
-	size_t esize, gfp_t gfp_mask);
+extern int __kfifo_alloc_node(struct __kfifo *fifo, unsigned int size,
+	size_t esize, gfp_t gfp_mask, int node);
+
+static inline int __kfifo_alloc(struct __kfifo *fifo, unsigned int size,
+				size_t esize, gfp_t gfp_mask)
+{
+	return __kfifo_alloc_node(fifo, size, esize, gfp_mask, NUMA_NO_NODE);
+}
 
 extern void __kfifo_free(struct __kfifo *fifo);
 
diff --git a/lib/kfifo.c b/lib/kfifo.c
index a8b2eed90599..525e66f8294c 100644
--- a/lib/kfifo.c
+++ b/lib/kfifo.c
@@ -22,8 +22,8 @@ static inline unsigned int kfifo_unused(struct __kfifo *fifo)
 	return (fifo->mask + 1) - (fifo->in - fifo->out);
 }
 
-int __kfifo_alloc(struct __kfifo *fifo, unsigned int size,
-		size_t esize, gfp_t gfp_mask)
+int __kfifo_alloc_node(struct __kfifo *fifo, unsigned int size,
+		size_t esize, gfp_t gfp_mask, int node)
 {
 	/*
 	 * round up to the next power of 2, since our 'let the indices
@@ -41,7 +41,7 @@ int __kfifo_alloc(struct __kfifo *fifo, unsigned int size,
 		return -EINVAL;
 	}
 
-	fifo->data = kmalloc_array(esize, size, gfp_mask);
+	fifo->data = kmalloc_array_node(esize, size, gfp_mask, node);
 
 	if (!fifo->data) {
 		fifo->mask = 0;
@@ -51,7 +51,7 @@ int __kfifo_alloc(struct __kfifo *fifo, unsigned int size,
 
 	return 0;
 }
-EXPORT_SYMBOL(__kfifo_alloc);
+EXPORT_SYMBOL(__kfifo_alloc_node);
 
 void __kfifo_free(struct __kfifo *fifo)
 {
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH V4 02/27] ublk: add parameter `struct io_uring_cmd *` to ublk_prep_auto_buf_reg()
  2025-11-21  1:58 [PATCH V4 00/27] ublk: add UBLK_F_BATCH_IO Ming Lei
  2025-11-21  1:58 ` [PATCH V4 01/27] kfifo: add kfifo_alloc_node() helper for NUMA awareness Ming Lei
@ 2025-11-21  1:58 ` Ming Lei
  2025-11-21  1:58 ` [PATCH V4 03/27] ublk: add `union ublk_io_buf` with improved naming Ming Lei
                   ` (26 subsequent siblings)
  28 siblings, 0 replies; 66+ messages in thread
From: Ming Lei @ 2025-11-21  1:58 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Caleb Sander Mateos, Uday Shankar, Stefani Seibold, Andrew Morton,
	linux-kernel, Ming Lei

Add a `struct io_uring_cmd *` parameter to ublk_prep_auto_buf_reg() and
prepare for reusing this helper for the coming UBLK_F_BATCH_IO feature,
which can fetch & commit one batch of io commands via a single uring_cmd.

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/block/ublk_drv.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 75f210523e52..2884e0687e31 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -1176,11 +1176,12 @@ ublk_auto_buf_reg_fallback(const struct ublk_queue *ubq, struct ublk_io *io)
 }
 
 static bool ublk_auto_buf_reg(const struct ublk_queue *ubq, struct request *req,
-			      struct ublk_io *io, unsigned int issue_flags)
+			      struct ublk_io *io, struct io_uring_cmd *cmd,
+			      unsigned int issue_flags)
 {
 	int ret;
 
-	ret = io_buffer_register_bvec(io->cmd, req, ublk_io_release,
+	ret = io_buffer_register_bvec(cmd, req, ublk_io_release,
 				      io->buf.index, issue_flags);
 	if (ret) {
 		if (io->buf.flags & UBLK_AUTO_BUF_REG_FALLBACK) {
@@ -1192,18 +1193,19 @@ static bool ublk_auto_buf_reg(const struct ublk_queue *ubq, struct request *req,
 	}
 
 	io->task_registered_buffers = 1;
-	io->buf_ctx_handle = io_uring_cmd_ctx_handle(io->cmd);
+	io->buf_ctx_handle = io_uring_cmd_ctx_handle(cmd);
 	io->flags |= UBLK_IO_FLAG_AUTO_BUF_REG;
 	return true;
 }
 
 static bool ublk_prep_auto_buf_reg(struct ublk_queue *ubq,
 				   struct request *req, struct ublk_io *io,
+				   struct io_uring_cmd *cmd,
 				   unsigned int issue_flags)
 {
 	ublk_init_req_ref(ubq, io);
 	if (ublk_support_auto_buf_reg(ubq) && ublk_rq_has_data(req))
-		return ublk_auto_buf_reg(ubq, req, io, issue_flags);
+		return ublk_auto_buf_reg(ubq, req, io, cmd, issue_flags);
 
 	return true;
 }
@@ -1278,7 +1280,7 @@ static void ublk_dispatch_req(struct ublk_queue *ubq,
 	if (!ublk_start_io(ubq, req, io))
 		return;
 
-	if (ublk_prep_auto_buf_reg(ubq, req, io, issue_flags))
+	if (ublk_prep_auto_buf_reg(ubq, req, io, io->cmd, issue_flags))
 		ublk_complete_io_cmd(io, req, UBLK_IO_RES_OK, issue_flags);
 }
 
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH V4 03/27] ublk: add `union ublk_io_buf` with improved naming
  2025-11-21  1:58 [PATCH V4 00/27] ublk: add UBLK_F_BATCH_IO Ming Lei
  2025-11-21  1:58 ` [PATCH V4 01/27] kfifo: add kfifo_alloc_node() helper for NUMA awareness Ming Lei
  2025-11-21  1:58 ` [PATCH V4 02/27] ublk: add parameter `struct io_uring_cmd *` to ublk_prep_auto_buf_reg() Ming Lei
@ 2025-11-21  1:58 ` Ming Lei
  2025-11-21  1:58 ` [PATCH V4 04/27] ublk: refactor auto buffer register in ublk_dispatch_req() Ming Lei
                   ` (25 subsequent siblings)
  28 siblings, 0 replies; 66+ messages in thread
From: Ming Lei @ 2025-11-21  1:58 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Caleb Sander Mateos, Uday Shankar, Stefani Seibold, Andrew Morton,
	linux-kernel, Ming Lei

Add `union ublk_io_buf` to name the anonymous union of struct ublk_io's
addr and buf fields, and apply it in `struct ublk_io` for storing either
the ublk auto buffer registration data or the ublk server io buffer address.

The union uses clear field names:
- `addr`: for regular ublk server io buffer addresses
- `auto_reg`: for ublk auto buffer registration data

This eliminates confusing access patterns and improves code readability.

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/block/ublk_drv.c | 40 ++++++++++++++++++++++------------------
 1 file changed, 22 insertions(+), 18 deletions(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 2884e0687e31..f1fa5ceacdf6 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -155,12 +155,13 @@ struct ublk_uring_cmd_pdu {
  */
 #define UBLK_REFCOUNT_INIT (REFCOUNT_MAX / 2)
 
+union ublk_io_buf {
+	__u64	addr;
+	struct ublk_auto_buf_reg auto_reg;
+};
+
 struct ublk_io {
-	/* userspace buffer address from io cmd */
-	union {
-		__u64	addr;
-		struct ublk_auto_buf_reg buf;
-	};
+	union ublk_io_buf buf;
 	unsigned int flags;
 	int res;
 
@@ -498,7 +499,7 @@ static blk_status_t ublk_setup_iod_zoned(struct ublk_queue *ubq,
 	iod->op_flags = ublk_op | ublk_req_build_flags(req);
 	iod->nr_sectors = blk_rq_sectors(req);
 	iod->start_sector = blk_rq_pos(req);
-	iod->addr = io->addr;
+	iod->addr = io->buf.addr;
 
 	return BLK_STS_OK;
 }
@@ -981,7 +982,7 @@ static unsigned int ublk_map_io(const struct ublk_queue *ubq,
 		struct iov_iter iter;
 		const int dir = ITER_DEST;
 
-		import_ubuf(dir, u64_to_user_ptr(io->addr), rq_bytes, &iter);
+		import_ubuf(dir, u64_to_user_ptr(io->buf.addr), rq_bytes, &iter);
 		return ublk_copy_user_pages(req, 0, &iter, dir);
 	}
 	return rq_bytes;
@@ -1002,7 +1003,7 @@ static unsigned int ublk_unmap_io(bool need_map,
 
 		WARN_ON_ONCE(io->res > rq_bytes);
 
-		import_ubuf(dir, u64_to_user_ptr(io->addr), io->res, &iter);
+		import_ubuf(dir, u64_to_user_ptr(io->buf.addr), io->res, &iter);
 		return ublk_copy_user_pages(req, 0, &iter, dir);
 	}
 	return rq_bytes;
@@ -1068,7 +1069,7 @@ static blk_status_t ublk_setup_iod(struct ublk_queue *ubq, struct request *req)
 	iod->op_flags = ublk_op | ublk_req_build_flags(req);
 	iod->nr_sectors = blk_rq_sectors(req);
 	iod->start_sector = blk_rq_pos(req);
-	iod->addr = io->addr;
+	iod->addr = io->buf.addr;
 
 	return BLK_STS_OK;
 }
@@ -1182,9 +1183,9 @@ static bool ublk_auto_buf_reg(const struct ublk_queue *ubq, struct request *req,
 	int ret;
 
 	ret = io_buffer_register_bvec(cmd, req, ublk_io_release,
-				      io->buf.index, issue_flags);
+				      io->buf.auto_reg.index, issue_flags);
 	if (ret) {
-		if (io->buf.flags & UBLK_AUTO_BUF_REG_FALLBACK) {
+		if (io->buf.auto_reg.flags & UBLK_AUTO_BUF_REG_FALLBACK) {
 			ublk_auto_buf_reg_fallback(ubq, io);
 			return true;
 		}
@@ -1473,7 +1474,7 @@ static void ublk_queue_reinit(struct ublk_device *ub, struct ublk_queue *ubq)
 		 */
 		io->flags &= UBLK_IO_FLAG_CANCELED;
 		io->cmd = NULL;
-		io->addr = 0;
+		io->buf.addr = 0;
 
 		/*
 		 * old task is PF_EXITING, put it now
@@ -2034,13 +2035,16 @@ static inline int ublk_check_cmd_op(u32 cmd_op)
 
 static inline int ublk_set_auto_buf_reg(struct ublk_io *io, struct io_uring_cmd *cmd)
 {
-	io->buf = ublk_sqe_addr_to_auto_buf_reg(READ_ONCE(cmd->sqe->addr));
+	struct ublk_auto_buf_reg buf;
+
+	buf = ublk_sqe_addr_to_auto_buf_reg(READ_ONCE(cmd->sqe->addr));
 
-	if (io->buf.reserved0 || io->buf.reserved1)
+	if (buf.reserved0 || buf.reserved1)
 		return -EINVAL;
 
-	if (io->buf.flags & ~UBLK_AUTO_BUF_REG_F_MASK)
+	if (buf.flags & ~UBLK_AUTO_BUF_REG_F_MASK)
 		return -EINVAL;
+	io->buf.auto_reg = buf;
 	return 0;
 }
 
@@ -2062,7 +2066,7 @@ static int ublk_handle_auto_buf_reg(struct ublk_io *io,
 		 * this ublk request gets stuck.
 		 */
 		if (io->buf_ctx_handle == io_uring_cmd_ctx_handle(cmd))
-			*buf_idx = io->buf.index;
+			*buf_idx = io->buf.auto_reg.index;
 	}
 
 	return ublk_set_auto_buf_reg(io, cmd);
@@ -2090,7 +2094,7 @@ ublk_config_io_buf(const struct ublk_device *ub, struct ublk_io *io,
 	if (ublk_dev_support_auto_buf_reg(ub))
 		return ublk_handle_auto_buf_reg(io, cmd, buf_idx);
 
-	io->addr = buf_addr;
+	io->buf.addr = buf_addr;
 	return 0;
 }
 
@@ -2287,7 +2291,7 @@ static bool ublk_get_data(const struct ublk_queue *ubq, struct ublk_io *io,
 	 */
 	io->flags &= ~UBLK_IO_FLAG_NEED_GET_DATA;
 	/* update iod->addr because ublksrv may have passed a new io buffer */
-	ublk_get_iod(ubq, req->tag)->addr = io->addr;
+	ublk_get_iod(ubq, req->tag)->addr = io->buf.addr;
 	pr_devel("%s: update iod->addr: qid %d tag %d io_flags %x addr %llx\n",
 			__func__, ubq->q_id, req->tag, io->flags,
 			ublk_get_iod(ubq, req->tag)->addr);
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH V4 04/27] ublk: refactor auto buffer register in ublk_dispatch_req()
  2025-11-21  1:58 [PATCH V4 00/27] ublk: add UBLK_F_BATCH_IO Ming Lei
                   ` (2 preceding siblings ...)
  2025-11-21  1:58 ` [PATCH V4 03/27] ublk: add `union ublk_io_buf` with improved naming Ming Lei
@ 2025-11-21  1:58 ` Ming Lei
  2025-11-21  1:58 ` [PATCH V4 05/27] ublk: pass const pointer to ublk_queue_is_zoned() Ming Lei
                   ` (24 subsequent siblings)
  28 siblings, 0 replies; 66+ messages in thread
From: Ming Lei @ 2025-11-21  1:58 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Caleb Sander Mateos, Uday Shankar, Stefani Seibold, Andrew Morton,
	linux-kernel, Ming Lei

Refactor the auto buffer register code to prepare for the batch IO feature.
The main motivation is to group the 'ublk_io' operations together, so that
a per-io lock can be applied to that code block.

The key changes are:
- Rename ublk_auto_buf_reg() as ublk_do_auto_buf_reg()
- Introduce an enum `auto_buf_reg_res` to represent the result of
  the buffer registration attempt (FAIL, FALLBACK, OK).
- Split the existing `ublk_do_auto_buf_reg` function into two:
  - `__ublk_do_auto_buf_reg`: Performs the actual buffer registration
    and returns the `auto_buf_reg_res` status.
  - `ublk_do_auto_buf_reg`: A wrapper that calls the internal function
    and handles the I/O preparation based on the result.
- Introduce `ublk_prep_auto_buf_reg_io` to encapsulate the logic for
  preparing the I/O for completion after buffer registration.
- Pass the `tag` directly to `ublk_auto_buf_reg_fallback` to avoid
  recalculating it.

This refactoring makes the control flow clearer and isolates the different
stages of the auto buffer registration process.

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/block/ublk_drv.c | 64 +++++++++++++++++++++++++++-------------
 1 file changed, 43 insertions(+), 21 deletions(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index f1fa5ceacdf6..b36cd55eceb0 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -1168,17 +1168,37 @@ static inline void __ublk_abort_rq(struct ublk_queue *ubq,
 }
 
 static void
-ublk_auto_buf_reg_fallback(const struct ublk_queue *ubq, struct ublk_io *io)
+ublk_auto_buf_reg_fallback(const struct ublk_queue *ubq, unsigned tag)
 {
-	unsigned tag = io - ubq->ios;
 	struct ublksrv_io_desc *iod = ublk_get_iod(ubq, tag);
 
 	iod->op_flags |= UBLK_IO_F_NEED_REG_BUF;
 }
 
-static bool ublk_auto_buf_reg(const struct ublk_queue *ubq, struct request *req,
-			      struct ublk_io *io, struct io_uring_cmd *cmd,
-			      unsigned int issue_flags)
+enum auto_buf_reg_res {
+	AUTO_BUF_REG_FAIL,
+	AUTO_BUF_REG_FALLBACK,
+	AUTO_BUF_REG_OK,
+};
+
+static void ublk_prep_auto_buf_reg_io(const struct ublk_queue *ubq,
+				      struct request *req, struct ublk_io *io,
+				      struct io_uring_cmd *cmd,
+				      enum auto_buf_reg_res res)
+{
+	if (res == AUTO_BUF_REG_OK) {
+		io->task_registered_buffers = 1;
+		io->buf_ctx_handle = io_uring_cmd_ctx_handle(cmd);
+		io->flags |= UBLK_IO_FLAG_AUTO_BUF_REG;
+	}
+	ublk_init_req_ref(ubq, io);
+	__ublk_prep_compl_io_cmd(io, req);
+}
+
+static enum auto_buf_reg_res
+__ublk_do_auto_buf_reg(const struct ublk_queue *ubq, struct request *req,
+		       struct ublk_io *io, struct io_uring_cmd *cmd,
+		       unsigned int issue_flags)
 {
 	int ret;
 
@@ -1186,29 +1206,27 @@ static bool ublk_auto_buf_reg(const struct ublk_queue *ubq, struct request *req,
 				      io->buf.auto_reg.index, issue_flags);
 	if (ret) {
 		if (io->buf.auto_reg.flags & UBLK_AUTO_BUF_REG_FALLBACK) {
-			ublk_auto_buf_reg_fallback(ubq, io);
-			return true;
+			ublk_auto_buf_reg_fallback(ubq, req->tag);
+			return AUTO_BUF_REG_FALLBACK;
 		}
 		blk_mq_end_request(req, BLK_STS_IOERR);
-		return false;
+		return AUTO_BUF_REG_FAIL;
 	}
 
-	io->task_registered_buffers = 1;
-	io->buf_ctx_handle = io_uring_cmd_ctx_handle(cmd);
-	io->flags |= UBLK_IO_FLAG_AUTO_BUF_REG;
-	return true;
+	return AUTO_BUF_REG_OK;
 }
 
-static bool ublk_prep_auto_buf_reg(struct ublk_queue *ubq,
-				   struct request *req, struct ublk_io *io,
-				   struct io_uring_cmd *cmd,
-				   unsigned int issue_flags)
+static void ublk_do_auto_buf_reg(const struct ublk_queue *ubq, struct request *req,
+				 struct ublk_io *io, struct io_uring_cmd *cmd,
+				 unsigned int issue_flags)
 {
-	ublk_init_req_ref(ubq, io);
-	if (ublk_support_auto_buf_reg(ubq) && ublk_rq_has_data(req))
-		return ublk_auto_buf_reg(ubq, req, io, cmd, issue_flags);
+	enum auto_buf_reg_res res = __ublk_do_auto_buf_reg(ubq, req, io, cmd,
+			issue_flags);
 
-	return true;
+	if (res != AUTO_BUF_REG_FAIL) {
+		ublk_prep_auto_buf_reg_io(ubq, req, io, cmd, res);
+		io_uring_cmd_done(cmd, UBLK_IO_RES_OK, issue_flags);
+	}
 }
 
 static bool ublk_start_io(const struct ublk_queue *ubq, struct request *req,
@@ -1281,8 +1299,12 @@ static void ublk_dispatch_req(struct ublk_queue *ubq,
 	if (!ublk_start_io(ubq, req, io))
 		return;
 
-	if (ublk_prep_auto_buf_reg(ubq, req, io, io->cmd, issue_flags))
+	if (ublk_support_auto_buf_reg(ubq) && ublk_rq_has_data(req)) {
+		ublk_do_auto_buf_reg(ubq, req, io, io->cmd, issue_flags);
+	} else {
+		ublk_init_req_ref(ubq, io);
 		ublk_complete_io_cmd(io, req, UBLK_IO_RES_OK, issue_flags);
+	}
 }
 
 static void ublk_cmd_tw_cb(struct io_uring_cmd *cmd,
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH V4 05/27] ublk: pass const pointer to ublk_queue_is_zoned()
  2025-11-21  1:58 [PATCH V4 00/27] ublk: add UBLK_F_BATCH_IO Ming Lei
                   ` (3 preceding siblings ...)
  2025-11-21  1:58 ` [PATCH V4 04/27] ublk: refactor auto buffer register in ublk_dispatch_req() Ming Lei
@ 2025-11-21  1:58 ` Ming Lei
  2025-11-21  1:58 ` [PATCH V4 06/27] ublk: add helper of __ublk_fetch() Ming Lei
                   ` (23 subsequent siblings)
  28 siblings, 0 replies; 66+ messages in thread
From: Ming Lei @ 2025-11-21  1:58 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Caleb Sander Mateos, Uday Shankar, Stefani Seibold, Andrew Morton,
	linux-kernel, Ming Lei

Pass a const pointer to ublk_queue_is_zoned() because the queue is only read.

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/block/ublk_drv.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index b36cd55eceb0..5e83c1b2a69e 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -265,7 +265,7 @@ static inline bool ublk_dev_is_zoned(const struct ublk_device *ub)
 	return ub->dev_info.flags & UBLK_F_ZONED;
 }
 
-static inline bool ublk_queue_is_zoned(struct ublk_queue *ubq)
+static inline bool ublk_queue_is_zoned(const struct ublk_queue *ubq)
 {
 	return ubq->flags & UBLK_F_ZONED;
 }
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH V4 06/27] ublk: add helper of __ublk_fetch()
  2025-11-21  1:58 [PATCH V4 00/27] ublk: add UBLK_F_BATCH_IO Ming Lei
                   ` (4 preceding siblings ...)
  2025-11-21  1:58 ` [PATCH V4 05/27] ublk: pass const pointer to ublk_queue_is_zoned() Ming Lei
@ 2025-11-21  1:58 ` Ming Lei
  2025-11-21  1:58 ` [PATCH V4 07/27] ublk: define ublk_ch_batch_io_fops for the coming feature F_BATCH_IO Ming Lei
                   ` (22 subsequent siblings)
  28 siblings, 0 replies; 66+ messages in thread
From: Ming Lei @ 2025-11-21  1:58 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Caleb Sander Mateos, Uday Shankar, Stefani Seibold, Andrew Morton,
	linux-kernel, Ming Lei

Add the helper __ublk_fetch() by refactoring ublk_fetch().

Meanwhile, move ublk_config_io_buf() out of __ublk_fetch() to make
the code structure cleaner.

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/block/ublk_drv.c | 46 +++++++++++++++++++++-------------------
 1 file changed, 24 insertions(+), 22 deletions(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 5e83c1b2a69e..dd9c35758a46 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -2234,39 +2234,41 @@ static int ublk_check_fetch_buf(const struct ublk_device *ub, __u64 buf_addr)
 	return 0;
 }
 
-static int ublk_fetch(struct io_uring_cmd *cmd, struct ublk_device *ub,
-		      struct ublk_io *io, __u64 buf_addr)
+static int __ublk_fetch(struct io_uring_cmd *cmd, struct ublk_device *ub,
+			struct ublk_io *io)
 {
-	int ret = 0;
-
-	/*
-	 * When handling FETCH command for setting up ublk uring queue,
-	 * ub->mutex is the innermost lock, and we won't block for handling
-	 * FETCH, so it is fine even for IO_URING_F_NONBLOCK.
-	 */
-	mutex_lock(&ub->mutex);
 	/* UBLK_IO_FETCH_REQ is only allowed before dev is setup */
-	if (ublk_dev_ready(ub)) {
-		ret = -EBUSY;
-		goto out;
-	}
+	if (ublk_dev_ready(ub))
+		return -EBUSY;
 
 	/* allow each command to be FETCHed at most once */
-	if (io->flags & UBLK_IO_FLAG_ACTIVE) {
-		ret = -EINVAL;
-		goto out;
-	}
+	if (io->flags & UBLK_IO_FLAG_ACTIVE)
+		return -EINVAL;
 
 	WARN_ON_ONCE(io->flags & UBLK_IO_FLAG_OWNED_BY_SRV);
 
 	ublk_fill_io_cmd(io, cmd);
-	ret = ublk_config_io_buf(ub, io, cmd, buf_addr, NULL);
-	if (ret)
-		goto out;
 
 	WRITE_ONCE(io->task, get_task_struct(current));
 	ublk_mark_io_ready(ub);
-out:
+
+	return 0;
+}
+
+static int ublk_fetch(struct io_uring_cmd *cmd, struct ublk_device *ub,
+		      struct ublk_io *io, __u64 buf_addr)
+{
+	int ret;
+
+	/*
+	 * When handling FETCH command for setting up ublk uring queue,
+	 * ub->mutex is the innermost lock, and we won't block for handling
+	 * FETCH, so it is fine even for IO_URING_F_NONBLOCK.
+	 */
+	mutex_lock(&ub->mutex);
+	ret = __ublk_fetch(cmd, ub, io);
+	if (!ret)
+		ret = ublk_config_io_buf(ub, io, cmd, buf_addr, NULL);
 	mutex_unlock(&ub->mutex);
 	return ret;
 }
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH V4 07/27] ublk: define ublk_ch_batch_io_fops for the coming feature F_BATCH_IO
  2025-11-21  1:58 [PATCH V4 00/27] ublk: add UBLK_F_BATCH_IO Ming Lei
                   ` (5 preceding siblings ...)
  2025-11-21  1:58 ` [PATCH V4 06/27] ublk: add helper of __ublk_fetch() Ming Lei
@ 2025-11-21  1:58 ` Ming Lei
  2025-11-21  1:58 ` [PATCH V4 08/27] ublk: prepare for not tracking task context for command batch Ming Lei
                   ` (21 subsequent siblings)
  28 siblings, 0 replies; 66+ messages in thread
From: Ming Lei @ 2025-11-21  1:58 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Caleb Sander Mateos, Uday Shankar, Stefani Seibold, Andrew Morton,
	linux-kernel, Ming Lei

Introduce the basic structure for the batched I/O feature in the ublk driver:
add placeholder functions and a new file_operations structure,
ublk_ch_batch_io_fops, which will be used for fetching and committing I/O
commands in batches. For now the feature remains disabled.

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/block/ublk_drv.c | 26 +++++++++++++++++++++++++-
 1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index dd9c35758a46..1fcca52591c3 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -254,6 +254,11 @@ static inline struct request *__ublk_check_and_get_req(struct ublk_device *ub,
 		u16 q_id, u16 tag, struct ublk_io *io, size_t offset);
 static inline unsigned int ublk_req_build_flags(struct request *req);
 
+static inline bool ublk_dev_support_batch_io(const struct ublk_device *ub)
+{
+	return false;
+}
+
 static inline struct ublksrv_io_desc *
 ublk_get_iod(const struct ublk_queue *ubq, unsigned tag)
 {
@@ -2512,6 +2517,12 @@ static int ublk_ch_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
 	return ublk_ch_uring_cmd_local(cmd, issue_flags);
 }
 
+static int ublk_ch_batch_io_uring_cmd(struct io_uring_cmd *cmd,
+				       unsigned int issue_flags)
+{
+	return -EOPNOTSUPP;
+}
+
 static inline bool ublk_check_ubuf_dir(const struct request *req,
 		int ubuf_dir)
 {
@@ -2618,6 +2629,16 @@ static const struct file_operations ublk_ch_fops = {
 	.mmap = ublk_ch_mmap,
 };
 
+static const struct file_operations ublk_ch_batch_io_fops = {
+	.owner = THIS_MODULE,
+	.open = ublk_ch_open,
+	.release = ublk_ch_release,
+	.read_iter = ublk_ch_read_iter,
+	.write_iter = ublk_ch_write_iter,
+	.uring_cmd = ublk_ch_batch_io_uring_cmd,
+	.mmap = ublk_ch_mmap,
+};
+
 static void ublk_deinit_queue(struct ublk_device *ub, int q_id)
 {
 	struct ublk_queue *ubq = ub->queues[q_id];
@@ -2778,7 +2799,10 @@ static int ublk_add_chdev(struct ublk_device *ub)
 	if (ret)
 		goto fail;
 
-	cdev_init(&ub->cdev, &ublk_ch_fops);
+	if (ublk_dev_support_batch_io(ub))
+		cdev_init(&ub->cdev, &ublk_ch_batch_io_fops);
+	else
+		cdev_init(&ub->cdev, &ublk_ch_fops);
 	ret = cdev_device_add(&ub->cdev, dev);
 	if (ret)
 		goto fail;
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH V4 08/27] ublk: prepare for not tracking task context for command batch
  2025-11-21  1:58 [PATCH V4 00/27] ublk: add UBLK_F_BATCH_IO Ming Lei
                   ` (6 preceding siblings ...)
  2025-11-21  1:58 ` [PATCH V4 07/27] ublk: define ublk_ch_batch_io_fops for the coming feature F_BATCH_IO Ming Lei
@ 2025-11-21  1:58 ` Ming Lei
  2025-11-21  1:58 ` [PATCH V4 09/27] ublk: add new batch command UBLK_U_IO_PREP_IO_CMDS & UBLK_U_IO_COMMIT_IO_CMDS Ming Lei
                   ` (20 subsequent siblings)
  28 siblings, 0 replies; 66+ messages in thread
From: Ming Lei @ 2025-11-21  1:58 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Caleb Sander Mateos, Uday Shankar, Stefani Seibold, Andrew Morton,
	linux-kernel, Ming Lei

Batch io is designed to be independent of task context, so don't track the
task context for the batch io feature.

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/block/ublk_drv.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 1fcca52591c3..c62b2f2057fe 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -2254,7 +2254,10 @@ static int __ublk_fetch(struct io_uring_cmd *cmd, struct ublk_device *ub,
 
 	ublk_fill_io_cmd(io, cmd);
 
-	WRITE_ONCE(io->task, get_task_struct(current));
+	if (ublk_dev_support_batch_io(ub))
+		WRITE_ONCE(io->task, NULL);
+	else
+		WRITE_ONCE(io->task, get_task_struct(current));
 	ublk_mark_io_ready(ub);
 
 	return 0;
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH V4 09/27] ublk: add new batch command UBLK_U_IO_PREP_IO_CMDS & UBLK_U_IO_COMMIT_IO_CMDS
  2025-11-21  1:58 [PATCH V4 00/27] ublk: add UBLK_F_BATCH_IO Ming Lei
                   ` (7 preceding siblings ...)
  2025-11-21  1:58 ` [PATCH V4 08/27] ublk: prepare for not tracking task context for command batch Ming Lei
@ 2025-11-21  1:58 ` Ming Lei
  2025-11-29 19:19   ` Caleb Sander Mateos
  2025-11-21  1:58 ` [PATCH V4 10/27] ublk: handle UBLK_U_IO_PREP_IO_CMDS Ming Lei
                   ` (19 subsequent siblings)
  28 siblings, 1 reply; 66+ messages in thread
From: Ming Lei @ 2025-11-21  1:58 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Caleb Sander Mateos, Uday Shankar, Stefani Seibold, Andrew Morton,
	linux-kernel, Ming Lei

Add the new command UBLK_U_IO_PREP_IO_CMDS, which is the batch version of
UBLK_IO_FETCH_REQ.

Add the new command UBLK_U_IO_COMMIT_IO_CMDS, the batch counterpart for
committing io command results only.

The new command header type is `struct ublk_batch_io`.

This patch doesn't implement these commands yet; it only validates the SQE
fields. The per-element size depends on the batch flags, as sketched below.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/block/ublk_drv.c      | 85 ++++++++++++++++++++++++++++++++++-
 include/uapi/linux/ublk_cmd.h | 49 ++++++++++++++++++++
 2 files changed, 133 insertions(+), 1 deletion(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index c62b2f2057fe..21890947ceec 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -85,6 +85,11 @@
 	 UBLK_PARAM_TYPE_DEVT | UBLK_PARAM_TYPE_ZONED |    \
 	 UBLK_PARAM_TYPE_DMA_ALIGN | UBLK_PARAM_TYPE_SEGMENT)
 
+#define UBLK_BATCH_F_ALL  \
+	(UBLK_BATCH_F_HAS_ZONE_LBA | \
+	 UBLK_BATCH_F_HAS_BUF_ADDR | \
+	 UBLK_BATCH_F_AUTO_BUF_REG_FALLBACK)
+
 struct ublk_uring_cmd_pdu {
 	/*
 	 * Store requests in same batch temporarily for queuing them to
@@ -108,6 +113,12 @@ struct ublk_uring_cmd_pdu {
 	u16 tag;
 };
 
+struct ublk_batch_io_data {
+	struct ublk_device *ub;
+	struct io_uring_cmd *cmd;
+	struct ublk_batch_io header;
+};
+
 /*
  * io command is active: sqe cmd is received, and its cqe isn't done
  *
@@ -2520,10 +2531,82 @@ static int ublk_ch_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
 	return ublk_ch_uring_cmd_local(cmd, issue_flags);
 }
 
+static int ublk_check_batch_cmd_flags(const struct ublk_batch_io *uc)
+{
+	unsigned elem_bytes = sizeof(struct ublk_elem_header);
+
+	if (uc->flags & ~UBLK_BATCH_F_ALL)
+		return -EINVAL;
+
+	/* UBLK_BATCH_F_AUTO_BUF_REG_FALLBACK requires buffer index */
+	if ((uc->flags & UBLK_BATCH_F_AUTO_BUF_REG_FALLBACK) &&
+			(uc->flags & UBLK_BATCH_F_HAS_BUF_ADDR))
+		return -EINVAL;
+
+	elem_bytes += (uc->flags & UBLK_BATCH_F_HAS_ZONE_LBA ? sizeof(u64) : 0) +
+		(uc->flags & UBLK_BATCH_F_HAS_BUF_ADDR ? sizeof(u64) : 0);
+	if (uc->elem_bytes != elem_bytes)
+		return -EINVAL;
+	return 0;
+}
+
+static int ublk_check_batch_cmd(const struct ublk_batch_io_data *data)
+{
+
+	const struct ublk_batch_io *uc = &data->header;
+
+	if (uc->nr_elem > data->ub->dev_info.queue_depth)
+		return -E2BIG;
+
+	if ((uc->flags & UBLK_BATCH_F_HAS_ZONE_LBA) &&
+			!ublk_dev_is_zoned(data->ub))
+		return -EINVAL;
+
+	if ((uc->flags & UBLK_BATCH_F_HAS_BUF_ADDR) &&
+			!ublk_dev_need_map_io(data->ub))
+		return -EINVAL;
+
+	if ((uc->flags & UBLK_BATCH_F_AUTO_BUF_REG_FALLBACK) &&
+			!ublk_dev_support_auto_buf_reg(data->ub))
+		return -EINVAL;
+
+	return ublk_check_batch_cmd_flags(uc);
+}
+
 static int ublk_ch_batch_io_uring_cmd(struct io_uring_cmd *cmd,
 				       unsigned int issue_flags)
 {
-	return -EOPNOTSUPP;
+	const struct ublk_batch_io *uc = io_uring_sqe_cmd(cmd->sqe);
+	struct ublk_device *ub = cmd->file->private_data;
+	struct ublk_batch_io_data data = {
+		.ub  = ub,
+		.cmd = cmd,
+		.header = (struct ublk_batch_io) {
+			.q_id = READ_ONCE(uc->q_id),
+			.flags = READ_ONCE(uc->flags),
+			.nr_elem = READ_ONCE(uc->nr_elem),
+			.elem_bytes = READ_ONCE(uc->elem_bytes),
+		},
+	};
+	u32 cmd_op = cmd->cmd_op;
+	int ret = -EINVAL;
+
+	if (data.header.q_id >= ub->dev_info.nr_hw_queues)
+		goto out;
+
+	switch (cmd_op) {
+	case UBLK_U_IO_PREP_IO_CMDS:
+	case UBLK_U_IO_COMMIT_IO_CMDS:
+		ret = ublk_check_batch_cmd(&data);
+		if (ret)
+			goto out;
+		ret = -EOPNOTSUPP;
+		break;
+	default:
+		ret = -EOPNOTSUPP;
+	}
+out:
+	return ret;
 }
 
 static inline bool ublk_check_ubuf_dir(const struct request *req,
diff --git a/include/uapi/linux/ublk_cmd.h b/include/uapi/linux/ublk_cmd.h
index ec77dabba45b..2ce5a496b622 100644
--- a/include/uapi/linux/ublk_cmd.h
+++ b/include/uapi/linux/ublk_cmd.h
@@ -102,6 +102,10 @@
 	_IOWR('u', 0x23, struct ublksrv_io_cmd)
 #define	UBLK_U_IO_UNREGISTER_IO_BUF	\
 	_IOWR('u', 0x24, struct ublksrv_io_cmd)
+#define	UBLK_U_IO_PREP_IO_CMDS	\
+	_IOWR('u', 0x25, struct ublk_batch_io)
+#define	UBLK_U_IO_COMMIT_IO_CMDS	\
+	_IOWR('u', 0x26, struct ublk_batch_io)
 
 /* only ABORT means that no re-fetch */
 #define UBLK_IO_RES_OK			0
@@ -525,6 +529,51 @@ struct ublksrv_io_cmd {
 	};
 };
 
+struct ublk_elem_header {
+	__u16 tag;	/* IO tag */
+
+	/*
+	 * Buffer index for incoming io command, only valid iff
+	 * UBLK_F_AUTO_BUF_REG is set
+	 */
+	__u16 buf_index;
+	__s32 result;	/* I/O completion result (commit only) */
+};
+
+/*
+ * uring_cmd buffer structure for batch commands
+ *
+ * buffer includes multiple elements, which number is specified by
+ * `nr_elem`. Each element buffer is organized in the following order:
+ *
+ * struct ublk_elem_buffer {
+ * 	// Mandatory fields (8 bytes)
+ * 	struct ublk_elem_header header;
+ *
+ * 	// Optional fields (8 bytes each, included based on flags)
+ *
+ * 	// Buffer address (if UBLK_BATCH_F_HAS_BUF_ADDR) for copying data
+ * 	// between ublk request and ublk server buffer
+ * 	__u64 buf_addr;
+ *
+ * 	// returned Zone append LBA (if UBLK_BATCH_F_HAS_ZONE_LBA)
+ * 	__u64 zone_lba;
+ * }
+ *
+ * Used for `UBLK_U_IO_PREP_IO_CMDS` and `UBLK_U_IO_COMMIT_IO_CMDS`
+ */
+struct ublk_batch_io {
+	__u16  q_id;
+#define UBLK_BATCH_F_HAS_ZONE_LBA	(1 << 0)
+#define UBLK_BATCH_F_HAS_BUF_ADDR 	(1 << 1)
+#define UBLK_BATCH_F_AUTO_BUF_REG_FALLBACK	(1 << 2)
+	__u16	flags;
+	__u16	nr_elem;
+	__u8	elem_bytes;
+	__u8	reserved;
+	__u64   reserved2;
+};
+
 struct ublk_param_basic {
 #define UBLK_ATTR_READ_ONLY            (1 << 0)
 #define UBLK_ATTR_ROTATIONAL           (1 << 1)
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH V4 10/27] ublk: handle UBLK_U_IO_PREP_IO_CMDS
  2025-11-21  1:58 [PATCH V4 00/27] ublk: add UBLK_F_BATCH_IO Ming Lei
                   ` (8 preceding siblings ...)
  2025-11-21  1:58 ` [PATCH V4 09/27] ublk: add new batch command UBLK_U_IO_PREP_IO_CMDS & UBLK_U_IO_COMMIT_IO_CMDS Ming Lei
@ 2025-11-21  1:58 ` Ming Lei
  2025-11-29 19:47   ` Caleb Sander Mateos
  2025-11-30 19:25   ` Caleb Sander Mateos
  2025-11-21  1:58 ` [PATCH V4 11/27] ublk: handle UBLK_U_IO_COMMIT_IO_CMDS Ming Lei
                   ` (18 subsequent siblings)
  28 siblings, 2 replies; 66+ messages in thread
From: Ming Lei @ 2025-11-21  1:58 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Caleb Sander Mateos, Uday Shankar, Stefani Seibold, Andrew Morton,
	linux-kernel, Ming Lei

Implement handling of the UBLK_U_IO_PREP_IO_CMDS command, which allows
userspace to prepare a batch of I/O requests.

The core of this change is the `ublk_walk_cmd_buf` function, which iterates
over the elements in the uring_cmd user buffer. For each element, it parses
the I/O details, finds the corresponding `ublk_io` structure, and prepares it
for future dispatch.

Add a per-io lock to protect concurrent delivery and committing.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/block/ublk_drv.c      | 193 +++++++++++++++++++++++++++++++++-
 include/uapi/linux/ublk_cmd.h |   5 +
 2 files changed, 197 insertions(+), 1 deletion(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 21890947ceec..66c77daae955 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -117,6 +117,7 @@ struct ublk_batch_io_data {
 	struct ublk_device *ub;
 	struct io_uring_cmd *cmd;
 	struct ublk_batch_io header;
+	unsigned int issue_flags;
 };
 
 /*
@@ -201,6 +202,7 @@ struct ublk_io {
 	unsigned task_registered_buffers;
 
 	void *buf_ctx_handle;
+	spinlock_t lock;
 } ____cacheline_aligned_in_smp;
 
 struct ublk_queue {
@@ -270,6 +272,16 @@ static inline bool ublk_dev_support_batch_io(const struct ublk_device *ub)
 	return false;
 }
 
+static inline void ublk_io_lock(struct ublk_io *io)
+{
+	spin_lock(&io->lock);
+}
+
+static inline void ublk_io_unlock(struct ublk_io *io)
+{
+	spin_unlock(&io->lock);
+}
+
 static inline struct ublksrv_io_desc *
 ublk_get_iod(const struct ublk_queue *ubq, unsigned tag)
 {
@@ -2531,6 +2543,171 @@ static int ublk_ch_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
 	return ublk_ch_uring_cmd_local(cmd, issue_flags);
 }
 
+static inline __u64 ublk_batch_buf_addr(const struct ublk_batch_io *uc,
+					const struct ublk_elem_header *elem)
+{
+	const void *buf = elem;
+
+	if (uc->flags & UBLK_BATCH_F_HAS_BUF_ADDR)
+		return *(__u64 *)(buf + sizeof(*elem));
+	return 0;
+}
+
+static struct ublk_auto_buf_reg
+ublk_batch_auto_buf_reg(const struct ublk_batch_io *uc,
+			const struct ublk_elem_header *elem)
+{
+	struct ublk_auto_buf_reg reg = {
+		.index = elem->buf_index,
+		.flags = (uc->flags & UBLK_BATCH_F_AUTO_BUF_REG_FALLBACK) ?
+			UBLK_AUTO_BUF_REG_FALLBACK : 0,
+	};
+
+	return reg;
+}
+
+/*
+ * 48 can hold any type of buffer element(8, 16 and 24 bytes) because
+ * it is the least common multiple(LCM) of 8, 16 and 24
+ */
+#define UBLK_CMD_BATCH_TMP_BUF_SZ  (48 * 10)
+struct ublk_batch_io_iter {
+	void __user *uaddr;
+	unsigned done, total;
+	unsigned char elem_bytes;
+	/* copy to this buffer from user space */
+	unsigned char buf[UBLK_CMD_BATCH_TMP_BUF_SZ];
+};
+
+static inline int
+__ublk_walk_cmd_buf(struct ublk_queue *ubq,
+		    struct ublk_batch_io_iter *iter,
+		    const struct ublk_batch_io_data *data,
+		    unsigned bytes,
+		    int (*cb)(struct ublk_queue *q,
+			    const struct ublk_batch_io_data *data,
+			    const struct ublk_elem_header *elem))
+{
+	unsigned int i;
+	int ret = 0;
+
+	for (i = 0; i < bytes; i += iter->elem_bytes) {
+		const struct ublk_elem_header *elem =
+			(const struct ublk_elem_header *)&iter->buf[i];
+
+		if (unlikely(elem->tag >= data->ub->dev_info.queue_depth)) {
+			ret = -EINVAL;
+			break;
+		}
+
+		ret = cb(ubq, data, elem);
+		if (unlikely(ret))
+			break;
+	}
+
+	iter->done += i;
+	return ret;
+}
+
+static int ublk_walk_cmd_buf(struct ublk_batch_io_iter *iter,
+			     const struct ublk_batch_io_data *data,
+			     int (*cb)(struct ublk_queue *q,
+				     const struct ublk_batch_io_data *data,
+				     const struct ublk_elem_header *elem))
+{
+	struct ublk_queue *ubq = ublk_get_queue(data->ub, data->header.q_id);
+	int ret = 0;
+
+	while (iter->done < iter->total) {
+		unsigned int len = min(sizeof(iter->buf), iter->total - iter->done);
+
+		if (copy_from_user(iter->buf, iter->uaddr + iter->done, len)) {
+			pr_warn("ublk%d: read batch cmd buffer failed\n",
+					data->ub->dev_info.dev_id);
+			return -EFAULT;
+		}
+
+		ret = __ublk_walk_cmd_buf(ubq, iter, data, len, cb);
+		if (ret)
+			return ret;
+	}
+	return 0;
+}
+
+static int ublk_batch_unprep_io(struct ublk_queue *ubq,
+				const struct ublk_batch_io_data *data,
+				const struct ublk_elem_header *elem)
+{
+	struct ublk_io *io = &ubq->ios[elem->tag];
+
+	data->ub->nr_io_ready--;
+	ublk_io_lock(io);
+	io->flags = 0;
+	ublk_io_unlock(io);
+	return 0;
+}
+
+static void ublk_batch_revert_prep_cmd(struct ublk_batch_io_iter *iter,
+				       const struct ublk_batch_io_data *data)
+{
+	int ret;
+
+	/* Re-process only what we've already processed, starting from beginning */
+	iter->total = iter->done;
+	iter->done = 0;
+
+	ret = ublk_walk_cmd_buf(iter, data, ublk_batch_unprep_io);
+	WARN_ON_ONCE(ret);
+}
+
+static int ublk_batch_prep_io(struct ublk_queue *ubq,
+			      const struct ublk_batch_io_data *data,
+			      const struct ublk_elem_header *elem)
+{
+	struct ublk_io *io = &ubq->ios[elem->tag];
+	const struct ublk_batch_io *uc = &data->header;
+	union ublk_io_buf buf = { 0 };
+	int ret;
+
+	if (ublk_dev_support_auto_buf_reg(data->ub))
+		buf.auto_reg = ublk_batch_auto_buf_reg(uc, elem);
+	else if (ublk_dev_need_map_io(data->ub)) {
+		buf.addr = ublk_batch_buf_addr(uc, elem);
+
+		ret = ublk_check_fetch_buf(data->ub, buf.addr);
+		if (ret)
+			return ret;
+	}
+
+	ublk_io_lock(io);
+	ret = __ublk_fetch(data->cmd, data->ub, io);
+	if (!ret)
+		io->buf = buf;
+	ublk_io_unlock(io);
+
+	return ret;
+}
+
+static int ublk_handle_batch_prep_cmd(const struct ublk_batch_io_data *data)
+{
+	const struct ublk_batch_io *uc = &data->header;
+	struct io_uring_cmd *cmd = data->cmd;
+	struct ublk_batch_io_iter iter = {
+		.uaddr = u64_to_user_ptr(READ_ONCE(cmd->sqe->addr)),
+		.total = uc->nr_elem * uc->elem_bytes,
+		.elem_bytes = uc->elem_bytes,
+	};
+	int ret;
+
+	mutex_lock(&data->ub->mutex);
+	ret = ublk_walk_cmd_buf(&iter, data, ublk_batch_prep_io);
+
+	if (ret && iter.done)
+		ublk_batch_revert_prep_cmd(&iter, data);
+	mutex_unlock(&data->ub->mutex);
+	return ret;
+}
+
 static int ublk_check_batch_cmd_flags(const struct ublk_batch_io *uc)
 {
 	unsigned elem_bytes = sizeof(struct ublk_elem_header);
@@ -2587,6 +2764,7 @@ static int ublk_ch_batch_io_uring_cmd(struct io_uring_cmd *cmd,
 			.nr_elem = READ_ONCE(uc->nr_elem),
 			.elem_bytes = READ_ONCE(uc->elem_bytes),
 		},
+		.issue_flags = issue_flags,
 	};
 	u32 cmd_op = cmd->cmd_op;
 	int ret = -EINVAL;
@@ -2596,6 +2774,11 @@ static int ublk_ch_batch_io_uring_cmd(struct io_uring_cmd *cmd,
 
 	switch (cmd_op) {
 	case UBLK_U_IO_PREP_IO_CMDS:
+		ret = ublk_check_batch_cmd(&data);
+		if (ret)
+			goto out;
+		ret = ublk_handle_batch_prep_cmd(&data);
+		break;
 	case UBLK_U_IO_COMMIT_IO_CMDS:
 		ret = ublk_check_batch_cmd(&data);
 		if (ret)
@@ -2770,7 +2953,7 @@ static int ublk_init_queue(struct ublk_device *ub, int q_id)
 	struct ublk_queue *ubq;
 	struct page *page;
 	int numa_node;
-	int size;
+	int size, i;
 
 	/* Determine NUMA node based on queue's CPU affinity */
 	numa_node = ublk_get_queue_numa_node(ub, q_id);
@@ -2795,6 +2978,9 @@ static int ublk_init_queue(struct ublk_device *ub, int q_id)
 	}
 	ubq->io_cmd_buf = page_address(page);
 
+	for (i = 0; i < ubq->q_depth; i++)
+		spin_lock_init(&ubq->ios[i].lock);
+
 	ub->queues[q_id] = ubq;
 	ubq->dev = ub;
 	return 0;
@@ -3021,6 +3207,11 @@ static int ublk_ctrl_start_dev(struct ublk_device *ub,
 		return -EINVAL;
 
 	mutex_lock(&ub->mutex);
+	/* device may become not ready in case of F_BATCH */
+	if (!ublk_dev_ready(ub)) {
+		ret = -EINVAL;
+		goto out_unlock;
+	}
 	if (ub->dev_info.state == UBLK_S_DEV_LIVE ||
 	    test_bit(UB_STATE_USED, &ub->state)) {
 		ret = -EEXIST;
diff --git a/include/uapi/linux/ublk_cmd.h b/include/uapi/linux/ublk_cmd.h
index 2ce5a496b622..c96c299057c3 100644
--- a/include/uapi/linux/ublk_cmd.h
+++ b/include/uapi/linux/ublk_cmd.h
@@ -102,6 +102,11 @@
 	_IOWR('u', 0x23, struct ublksrv_io_cmd)
 #define	UBLK_U_IO_UNREGISTER_IO_BUF	\
 	_IOWR('u', 0x24, struct ublksrv_io_cmd)
+
+/*
+ * return 0 if the command is run successfully, otherwise failure code
+ * is returned
+ */
 #define	UBLK_U_IO_PREP_IO_CMDS	\
 	_IOWR('u', 0x25, struct ublk_batch_io)
 #define	UBLK_U_IO_COMMIT_IO_CMDS	\
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH V4 11/27] ublk: handle UBLK_U_IO_COMMIT_IO_CMDS
  2025-11-21  1:58 [PATCH V4 00/27] ublk: add UBLK_F_BATCH_IO Ming Lei
                   ` (9 preceding siblings ...)
  2025-11-21  1:58 ` [PATCH V4 10/27] ublk: handle UBLK_U_IO_PREP_IO_CMDS Ming Lei
@ 2025-11-21  1:58 ` Ming Lei
  2025-11-30 16:39   ` Caleb Sander Mateos
  2025-11-21  1:58 ` [PATCH V4 12/27] ublk: add io events fifo structure Ming Lei
                   ` (17 subsequent siblings)
  28 siblings, 1 reply; 66+ messages in thread
From: Ming Lei @ 2025-11-21  1:58 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Caleb Sander Mateos, Uday Shankar, Stefani Seibold, Andrew Morton,
	linux-kernel, Ming Lei

Handle UBLK_U_IO_COMMIT_IO_CMDS by walking the uring_cmd user buffer:

- read the elements into a temporary buffer in batches

- parse each element and apply it to commit the io result
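
Since a commit may be handled only partially, the ublk server has to resubmit
the uncommitted tail. A hedged sketch of that loop follows; submit_commit_cmd()
is a hypothetical helper which issues one UBLK_U_IO_COMMIT_IO_CMDS uring_cmd
for the given element range and returns its CQE result:

/* hypothetical helper: returns bytes handled (>= 0) or a negative error */
int submit_commit_cmd(unsigned char *elems, unsigned int nr_elem,
		      unsigned int elem_bytes);

static int commit_all(unsigned char *elem_buf, unsigned int nr_elem,
		      unsigned int elem_bytes)
{
	unsigned int total = nr_elem * elem_bytes;
	unsigned int done = 0;

	while (done < total) {
		int ret = submit_commit_cmd(elem_buf + done,
					    (total - done) / elem_bytes,
					    elem_bytes);

		/*
		 * on error nothing in this buffer was handled; treat lack
		 * of forward progress as an error too, to avoid spinning
		 */
		if (ret <= 0)
			return ret ? ret : -EINVAL;
		done += ret;	/* ret == bytes actually committed */
	}
	return 0;
}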

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/block/ublk_drv.c      | 117 ++++++++++++++++++++++++++++++++--
 include/uapi/linux/ublk_cmd.h |   8 +++
 2 files changed, 121 insertions(+), 4 deletions(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 66c77daae955..ea992366af5b 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -2098,9 +2098,9 @@ static inline int ublk_set_auto_buf_reg(struct ublk_io *io, struct io_uring_cmd
 	return 0;
 }
 
-static int ublk_handle_auto_buf_reg(struct ublk_io *io,
-				    struct io_uring_cmd *cmd,
-				    u16 *buf_idx)
+static void __ublk_handle_auto_buf_reg(struct ublk_io *io,
+				       struct io_uring_cmd *cmd,
+				       u16 *buf_idx)
 {
 	if (io->flags & UBLK_IO_FLAG_AUTO_BUF_REG) {
 		io->flags &= ~UBLK_IO_FLAG_AUTO_BUF_REG;
@@ -2118,7 +2118,13 @@ static int ublk_handle_auto_buf_reg(struct ublk_io *io,
 		if (io->buf_ctx_handle == io_uring_cmd_ctx_handle(cmd))
 			*buf_idx = io->buf.auto_reg.index;
 	}
+}
 
+static int ublk_handle_auto_buf_reg(struct ublk_io *io,
+				    struct io_uring_cmd *cmd,
+				    u16 *buf_idx)
+{
+	__ublk_handle_auto_buf_reg(io, cmd, buf_idx);
 	return ublk_set_auto_buf_reg(io, cmd);
 }
 
@@ -2553,6 +2559,17 @@ static inline __u64 ublk_batch_buf_addr(const struct ublk_batch_io *uc,
 	return 0;
 }
 
+static inline __u64 ublk_batch_zone_lba(const struct ublk_batch_io *uc,
+					const struct ublk_elem_header *elem)
+{
+	const void *buf = (const void *)elem;
+
+	if (uc->flags & UBLK_BATCH_F_HAS_ZONE_LBA)
+		return *(__u64 *)(buf + sizeof(*elem) +
+				8 * !!(uc->flags & UBLK_BATCH_F_HAS_BUF_ADDR));
+	return -1;
+}
+
 static struct ublk_auto_buf_reg
 ublk_batch_auto_buf_reg(const struct ublk_batch_io *uc,
 			const struct ublk_elem_header *elem)
@@ -2708,6 +2725,98 @@ static int ublk_handle_batch_prep_cmd(const struct ublk_batch_io_data *data)
 	return ret;
 }
 
+static int ublk_batch_commit_io_check(const struct ublk_queue *ubq,
+				      struct ublk_io *io,
+				      union ublk_io_buf *buf)
+{
+	struct request *req = io->req;
+
+	if (!req)
+		return -EINVAL;
+
+	if (io->flags & UBLK_IO_FLAG_ACTIVE)
+		return -EBUSY;
+
+	if (!(io->flags & UBLK_IO_FLAG_OWNED_BY_SRV))
+		return -EINVAL;
+
+	if (ublk_need_map_io(ubq)) {
+		/*
+		 * COMMIT_AND_FETCH_REQ has to provide IO buffer if
+		 * NEED GET DATA is not enabled or it is Read IO.
+		 */
+		if (!buf->addr && (!ublk_need_get_data(ubq) ||
+					req_op(req) == REQ_OP_READ))
+			return -EINVAL;
+	}
+	return 0;
+}
+
+static int ublk_batch_commit_io(struct ublk_queue *ubq,
+				const struct ublk_batch_io_data *data,
+				const struct ublk_elem_header *elem)
+{
+	struct ublk_io *io = &ubq->ios[elem->tag];
+	const struct ublk_batch_io *uc = &data->header;
+	u16 buf_idx = UBLK_INVALID_BUF_IDX;
+	union ublk_io_buf buf = { 0 };
+	struct request *req = NULL;
+	bool auto_reg = false;
+	bool compl = false;
+	int ret;
+
+	if (ublk_dev_support_auto_buf_reg(data->ub)) {
+		buf.auto_reg = ublk_batch_auto_buf_reg(uc, elem);
+		auto_reg = true;
+	} else if (ublk_dev_need_map_io(data->ub))
+		buf.addr = ublk_batch_buf_addr(uc, elem);
+
+	ublk_io_lock(io);
+	ret = ublk_batch_commit_io_check(ubq, io, &buf);
+	if (!ret) {
+		io->res = elem->result;
+		io->buf = buf;
+		req = ublk_fill_io_cmd(io, data->cmd);
+
+		if (auto_reg)
+			__ublk_handle_auto_buf_reg(io, data->cmd, &buf_idx);
+		compl = ublk_need_complete_req(data->ub, io);
+	}
+	ublk_io_unlock(io);
+
+	if (unlikely(ret)) {
+		pr_warn("%s: dev %u queue %u io %u: commit failure %d\n",
+			__func__, data->ub->dev_info.dev_id, ubq->q_id,
+			elem->tag, ret);
+		return ret;
+	}
+
+	/* can't touch 'ublk_io' any more */
+	if (buf_idx != UBLK_INVALID_BUF_IDX)
+		io_buffer_unregister_bvec(data->cmd, buf_idx, data->issue_flags);
+	if (req_op(req) == REQ_OP_ZONE_APPEND)
+		req->__sector = ublk_batch_zone_lba(uc, elem);
+	if (compl)
+		__ublk_complete_rq(req, io, ublk_dev_need_map_io(data->ub));
+	return 0;
+}
+
+static int ublk_handle_batch_commit_cmd(const struct ublk_batch_io_data *data)
+{
+	const struct ublk_batch_io *uc = &data->header;
+	struct io_uring_cmd *cmd = data->cmd;
+	struct ublk_batch_io_iter iter = {
+		.uaddr = u64_to_user_ptr(READ_ONCE(cmd->sqe->addr)),
+		.total = uc->nr_elem * uc->elem_bytes,
+		.elem_bytes = uc->elem_bytes,
+	};
+	int ret;
+
+	ret = ublk_walk_cmd_buf(&iter, data, ublk_batch_commit_io);
+
+	return iter.done == 0 ? ret : iter.done;
+}
+
 static int ublk_check_batch_cmd_flags(const struct ublk_batch_io *uc)
 {
 	unsigned elem_bytes = sizeof(struct ublk_elem_header);
@@ -2783,7 +2892,7 @@ static int ublk_ch_batch_io_uring_cmd(struct io_uring_cmd *cmd,
 		ret = ublk_check_batch_cmd(&data);
 		if (ret)
 			goto out;
-		ret = -EOPNOTSUPP;
+		ret = ublk_handle_batch_commit_cmd(&data);
 		break;
 	default:
 		ret = -EOPNOTSUPP;
diff --git a/include/uapi/linux/ublk_cmd.h b/include/uapi/linux/ublk_cmd.h
index c96c299057c3..295ec8f34173 100644
--- a/include/uapi/linux/ublk_cmd.h
+++ b/include/uapi/linux/ublk_cmd.h
@@ -109,6 +109,14 @@
  */
 #define	UBLK_U_IO_PREP_IO_CMDS	\
 	_IOWR('u', 0x25, struct ublk_batch_io)
+/*
+ * If failure code is returned, nothing in the command buffer is handled.
+ * Otherwise, the returned value means how many bytes in command buffer
+ * are handled actually, then number of handled IOs can be calculated with
+ * `elem_bytes` for each IO. IOs in the remained bytes are not committed,
+ * userspace has to check return value for dealing with partial committing
+ * correctly.
+ */
 #define	UBLK_U_IO_COMMIT_IO_CMDS	\
 	_IOWR('u', 0x26, struct ublk_batch_io)
 
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH V4 12/27] ublk: add io events fifo structure
  2025-11-21  1:58 [PATCH V4 00/27] ublk: add UBLK_F_BATCH_IO Ming Lei
                   ` (10 preceding siblings ...)
  2025-11-21  1:58 ` [PATCH V4 11/27] ublk: handle UBLK_U_IO_COMMIT_IO_CMDS Ming Lei
@ 2025-11-21  1:58 ` Ming Lei
  2025-11-30 16:53   ` Caleb Sander Mateos
  2025-11-21  1:58 ` [PATCH V4 13/27] ublk: add batch I/O dispatch infrastructure Ming Lei
                   ` (16 subsequent siblings)
  28 siblings, 1 reply; 66+ messages in thread
From: Ming Lei @ 2025-11-21  1:58 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Caleb Sander Mateos, Uday Shankar, Stefani Seibold, Andrew Morton,
	linux-kernel, Ming Lei

Add ublk io events fifo structure and prepare for supporting command
batch, which will use io_uring multishot uring_cmd for fetching one
batch of io commands each time.

One nice property of kfifo is that it supports multiple producers with a
single consumer: only the producer side needs locking, while the single
consumer can stay lockless.

The producers are ublk_queue_rq() and ublk_queue_rqs(), so lock contention
can be eased by configuring a proper blk-mq nr_queues.
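
For illustration only, the locking discipline boils down to the sketch
below; struct evt_queue is a hypothetical stand-in for the fifo/lock pair
added to struct ublk_queue in this patch, not a real driver type:

#include <linux/kfifo.h>
#include <linux/spinlock.h>

/* hypothetical stand-in for the evts_fifo/evts_lock fields added below */
struct evt_queue {
	DECLARE_KFIFO_PTR(fifo, unsigned short);
	spinlock_t lock;
};

/* producers (ublk_queue_rq()/ublk_queue_rqs()) must hold the lock */
static void evt_produce(struct evt_queue *q, unsigned short tag)
{
	spin_lock(&q->lock);
	kfifo_put(&q->fifo, tag);
	spin_unlock(&q->lock);
}

/* the single consumer (task work) may read without taking the lock */
static unsigned int evt_consume(struct evt_queue *q, unsigned short *buf,
				unsigned int n)
{
	return kfifo_out(&q->fifo, buf, n);
}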

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/block/ublk_drv.c | 65 ++++++++++++++++++++++++++++++++++++----
 1 file changed, 60 insertions(+), 5 deletions(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index ea992366af5b..6ff284243630 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -44,6 +44,7 @@
 #include <linux/task_work.h>
 #include <linux/namei.h>
 #include <linux/kref.h>
+#include <linux/kfifo.h>
 #include <uapi/linux/ublk_cmd.h>
 
 #define UBLK_MINORS		(1U << MINORBITS)
@@ -217,6 +218,22 @@ struct ublk_queue {
 	bool fail_io; /* copy of dev->state == UBLK_S_DEV_FAIL_IO */
 	spinlock_t		cancel_lock;
 	struct ublk_device *dev;
+
+	/*
+	 * Inflight ublk request tag is saved in this fifo
+	 *
+	 * There are multiple writer from ublk_queue_rq() or ublk_queue_rqs(),
+	 * so lock is required for storing request tag to fifo
+	 *
+	 * Make sure just one reader for fetching request from task work
+	 * function to ublk server, so no need to grab the lock in reader
+	 * side.
+	 */
+	struct {
+		DECLARE_KFIFO_PTR(evts_fifo, unsigned short);
+		spinlock_t evts_lock;
+	}____cacheline_aligned_in_smp;
+
 	struct ublk_io ios[] __counted_by(q_depth);
 };
 
@@ -282,6 +299,32 @@ static inline void ublk_io_unlock(struct ublk_io *io)
 	spin_unlock(&io->lock);
 }
 
+/* Initialize the queue */
+static inline int ublk_io_evts_init(struct ublk_queue *q, unsigned int size,
+				    int numa_node)
+{
+	spin_lock_init(&q->evts_lock);
+	return kfifo_alloc_node(&q->evts_fifo, size, GFP_KERNEL, numa_node);
+}
+
+/* Check if queue is empty */
+static inline bool ublk_io_evts_empty(const struct ublk_queue *q)
+{
+	return kfifo_is_empty(&q->evts_fifo);
+}
+
+/* Check if queue is full */
+static inline bool ublk_io_evts_full(const struct ublk_queue *q)
+{
+	return kfifo_is_full(&q->evts_fifo);
+}
+
+static inline void ublk_io_evts_deinit(struct ublk_queue *q)
+{
+	WARN_ON_ONCE(!kfifo_is_empty(&q->evts_fifo));
+	kfifo_free(&q->evts_fifo);
+}
+
 static inline struct ublksrv_io_desc *
 ublk_get_iod(const struct ublk_queue *ubq, unsigned tag)
 {
@@ -3038,6 +3081,9 @@ static void ublk_deinit_queue(struct ublk_device *ub, int q_id)
 	if (ubq->io_cmd_buf)
 		free_pages((unsigned long)ubq->io_cmd_buf, get_order(size));
 
+	if (ublk_dev_support_batch_io(ub))
+		ublk_io_evts_deinit(ubq);
+
 	kvfree(ubq);
 	ub->queues[q_id] = NULL;
 }
@@ -3062,7 +3108,7 @@ static int ublk_init_queue(struct ublk_device *ub, int q_id)
 	struct ublk_queue *ubq;
 	struct page *page;
 	int numa_node;
-	int size, i;
+	int size, i, ret = -ENOMEM;
 
 	/* Determine NUMA node based on queue's CPU affinity */
 	numa_node = ublk_get_queue_numa_node(ub, q_id);
@@ -3081,18 +3127,27 @@ static int ublk_init_queue(struct ublk_device *ub, int q_id)
 
 	/* Allocate I/O command buffer on local NUMA node */
 	page = alloc_pages_node(numa_node, gfp_flags, get_order(size));
-	if (!page) {
-		kvfree(ubq);
-		return -ENOMEM;
-	}
+	if (!page)
+		goto fail_nomem;
 	ubq->io_cmd_buf = page_address(page);
 
 	for (i = 0; i < ubq->q_depth; i++)
 		spin_lock_init(&ubq->ios[i].lock);
 
+	if (ublk_dev_support_batch_io(ub)) {
+		ret = ublk_io_evts_init(ubq, ubq->q_depth, numa_node);
+		if (ret)
+			goto fail;
+	}
 	ub->queues[q_id] = ubq;
 	ubq->dev = ub;
+
 	return 0;
+fail:
+	ublk_deinit_queue(ub, q_id);
+fail_nomem:
+	kvfree(ubq);
+	return ret;
 }
 
 static void ublk_deinit_queues(struct ublk_device *ub)
-- 
2.47.0


* [PATCH V4 13/27] ublk: add batch I/O dispatch infrastructure
  2025-11-21  1:58 [PATCH V4 00/27] ublk: add UBLK_F_BATCH_IO Ming Lei
                   ` (11 preceding siblings ...)
  2025-11-21  1:58 ` [PATCH V4 12/27] ublk: add io events fifo structure Ming Lei
@ 2025-11-21  1:58 ` Ming Lei
  2025-11-30 19:24   ` Caleb Sander Mateos
  2025-11-21  1:58 ` [PATCH V4 14/27] ublk: add UBLK_U_IO_FETCH_IO_CMDS for batch I/O processing Ming Lei
                   ` (15 subsequent siblings)
  28 siblings, 1 reply; 66+ messages in thread
From: Ming Lei @ 2025-11-21  1:58 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Caleb Sander Mateos, Uday Shankar, Stefani Seibold, Andrew Morton,
	linux-kernel, Ming Lei

Add infrastructure for delivering I/O commands to ublk server in batches,
preparing for the upcoming UBLK_U_IO_FETCH_IO_CMDS feature.

Key components:

- struct ublk_batch_fcmd: Represents a batch fetch uring_cmd that will
  receive multiple I/O tags in a single operation, using io_uring's
  multishot command for efficient ublk IO delivery.

- ublk_batch_dispatch(): Batch version of ublk_dispatch_req() that:
  * Pulls multiple request tags from the events FIFO (lock-free reader)
  * Prepares each I/O for delivery (including auto buffer registration)
  * Delivers tags to userspace via single uring_cmd notification
  * Handles partial failures by restoring undelivered tags to FIFO

The batch approach significantly reduces notification overhead by
aggregating the delivery of multiple I/Os into a single uring_cmd CQE,
while maintaining the same I/O processing semantics as individual
operations.

Error handling ensures system consistency: if buffer selection or CQE
posting fails, undelivered tags are restored to the FIFO for retry, and
the per-IO state is rolled back as well.
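
As a rough sketch of that restore step (illustrative only; struct
evt_queue is a hypothetical stand-in for the per-queue fifo, not a driver
type), the consumer turns back into a writer on failure and therefore
uses the spinlocked kfifo helper:

#include <linux/kfifo.h>
#include <linux/spinlock.h>

struct evt_queue {	/* hypothetical mirror of the per-queue event fifo */
	DECLARE_KFIFO_PTR(fifo, unsigned short);
	spinlock_t lock;
};

/* push undelivered tags back so a later fetch command can retry them */
static void restore_undelivered(struct evt_queue *q,
				const unsigned short *tags, unsigned int n)
{
	kfifo_in_spinlocked_noirqsave(&q->fifo, tags, n, &q->lock);
}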

This runs in task work context, scheduled via io_uring_cmd_complete_in_task()
or called directly from ->uring_cmd(), enabling efficient batch processing
without blocking the I/O submission path.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/block/ublk_drv.c | 189 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 189 insertions(+)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 6ff284243630..cc9c92d97349 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -91,6 +91,12 @@
 	 UBLK_BATCH_F_HAS_BUF_ADDR | \
 	 UBLK_BATCH_F_AUTO_BUF_REG_FALLBACK)
 
+/* ublk batch fetch uring_cmd */
+struct ublk_batch_fcmd {
+	struct io_uring_cmd *cmd;
+	unsigned short buf_group;
+};
+
 struct ublk_uring_cmd_pdu {
 	/*
 	 * Store requests in same batch temporarily for queuing them to
@@ -168,6 +174,9 @@ struct ublk_batch_io_data {
  */
 #define UBLK_REFCOUNT_INIT (REFCOUNT_MAX / 2)
 
+/* used for UBLK_F_BATCH_IO only */
+#define UBLK_BATCH_IO_UNUSED_TAG	((unsigned short)-1)
+
 union ublk_io_buf {
 	__u64	addr;
 	struct ublk_auto_buf_reg auto_reg;
@@ -616,6 +625,32 @@ static wait_queue_head_t ublk_idr_wq;	/* wait until one idr is freed */
 static DEFINE_MUTEX(ublk_ctl_mutex);
 
 
+static void ublk_batch_deinit_fetch_buf(const struct ublk_batch_io_data *data,
+					struct ublk_batch_fcmd *fcmd,
+					int res)
+{
+	io_uring_cmd_done(fcmd->cmd, res, data->issue_flags);
+	fcmd->cmd = NULL;
+}
+
+static int ublk_batch_fetch_post_cqe(struct ublk_batch_fcmd *fcmd,
+				     struct io_br_sel *sel,
+				     unsigned int issue_flags)
+{
+	if (io_uring_mshot_cmd_post_cqe(fcmd->cmd, sel, issue_flags))
+		return -ENOBUFS;
+	return 0;
+}
+
+static ssize_t ublk_batch_copy_io_tags(struct ublk_batch_fcmd *fcmd,
+				       void __user *buf, const u16 *tag_buf,
+				       unsigned int len)
+{
+	if (copy_to_user(buf, tag_buf, len))
+		return -EFAULT;
+	return len;
+}
+
 #define UBLK_MAX_UBLKS UBLK_MINORS
 
 /*
@@ -1378,6 +1413,160 @@ static void ublk_dispatch_req(struct ublk_queue *ubq,
 	}
 }
 
+static bool __ublk_batch_prep_dispatch(struct ublk_queue *ubq,
+				       const struct ublk_batch_io_data *data,
+				       unsigned short tag)
+{
+	struct ublk_device *ub = data->ub;
+	struct ublk_io *io = &ubq->ios[tag];
+	struct request *req = blk_mq_tag_to_rq(ub->tag_set.tags[ubq->q_id], tag);
+	enum auto_buf_reg_res res = AUTO_BUF_REG_FALLBACK;
+	struct io_uring_cmd *cmd = data->cmd;
+
+	if (!ublk_start_io(ubq, req, io))
+		return false;
+
+	if (ublk_support_auto_buf_reg(ubq) && ublk_rq_has_data(req))
+		res = __ublk_do_auto_buf_reg(ubq, req, io, cmd,
+				data->issue_flags);
+
+	if (res == AUTO_BUF_REG_FAIL)
+		return false;
+
+	ublk_io_lock(io);
+	ublk_prep_auto_buf_reg_io(ubq, req, io, cmd, res);
+	ublk_io_unlock(io);
+
+	return true;
+}
+
+static bool ublk_batch_prep_dispatch(struct ublk_queue *ubq,
+				     const struct ublk_batch_io_data *data,
+				     unsigned short *tag_buf,
+				     unsigned int len)
+{
+	bool has_unused = false;
+	int i;
+
+	for (i = 0; i < len; i += 1) {
+		unsigned short tag = tag_buf[i];
+
+		if (!__ublk_batch_prep_dispatch(ubq, data, tag)) {
+			tag_buf[i] = UBLK_BATCH_IO_UNUSED_TAG;
+			has_unused = true;
+		}
+	}
+
+	return has_unused;
+}
+
+/*
+ * Filter out UBLK_BATCH_IO_UNUSED_TAG entries from tag_buf.
+ * Returns the new length after filtering.
+ */
+static unsigned int ublk_filter_unused_tags(unsigned short *tag_buf,
+					    unsigned int len)
+{
+	unsigned int i, j;
+
+	for (i = 0, j = 0; i < len; i++) {
+		if (tag_buf[i] != UBLK_BATCH_IO_UNUSED_TAG) {
+			if (i != j)
+				tag_buf[j] = tag_buf[i];
+			j++;
+		}
+	}
+
+	return j;
+}
+
+#define MAX_NR_TAG 128
+static int __ublk_batch_dispatch(struct ublk_queue *ubq,
+				 const struct ublk_batch_io_data *data,
+				 struct ublk_batch_fcmd *fcmd)
+{
+	unsigned short tag_buf[MAX_NR_TAG];
+	struct io_br_sel sel;
+	size_t len = 0;
+	bool needs_filter;
+	int ret;
+
+	sel = io_uring_cmd_buffer_select(fcmd->cmd, fcmd->buf_group, &len,
+					 data->issue_flags);
+	if (sel.val < 0)
+		return sel.val;
+	if (!sel.addr)
+		return -ENOBUFS;
+
+	/* single reader needn't lock and sizeof(kfifo element) is 2 bytes */
+	len = min(len, sizeof(tag_buf)) / 2;
+	len = kfifo_out(&ubq->evts_fifo, tag_buf, len);
+
+	needs_filter = ublk_batch_prep_dispatch(ubq, data, tag_buf, len);
+	/* Filter out unused tags before posting to userspace */
+	if (unlikely(needs_filter)) {
+		int new_len = ublk_filter_unused_tags(tag_buf, len);
+
+		if (!new_len)
+			return len;
+		len = new_len;
+	}
+
+	sel.val = ublk_batch_copy_io_tags(fcmd, sel.addr, tag_buf, len * 2);
+	ret = ublk_batch_fetch_post_cqe(fcmd, &sel, data->issue_flags);
+	if (unlikely(ret < 0)) {
+		int i, res;
+
+		/*
+		 * Undo prep state for all IOs since userspace never received them.
+		 * This restores IOs to pre-prepared state so they can be cleanly
+		 * re-prepared when tags are pulled from FIFO again.
+		 */
+		for (i = 0; i < len; i++) {
+			struct ublk_io *io = &ubq->ios[tag_buf[i]];
+			int index = -1;
+
+			ublk_io_lock(io);
+			if (io->flags & UBLK_IO_FLAG_AUTO_BUF_REG)
+				index = io->buf.auto_reg.index;
+			io->flags &= ~(UBLK_IO_FLAG_OWNED_BY_SRV | UBLK_IO_FLAG_AUTO_BUF_REG);
+			io->flags |= UBLK_IO_FLAG_ACTIVE;
+			ublk_io_unlock(io);
+
+			if (index != -1)
+				io_buffer_unregister_bvec(data->cmd, index,
+						data->issue_flags);
+		}
+
+		res = kfifo_in_spinlocked_noirqsave(&ubq->evts_fifo,
+			tag_buf, len, &ubq->evts_lock);
+
+		pr_warn("%s: copy tags or post CQE failure, move back "
+				"tags(%d %zu) ret %d\n", __func__, res, len,
+				ret);
+	}
+	return ret;
+}
+
+static __maybe_unused int
+ublk_batch_dispatch(struct ublk_queue *ubq,
+		    const struct ublk_batch_io_data *data,
+		    struct ublk_batch_fcmd *fcmd)
+{
+	int ret = 0;
+
+	while (!ublk_io_evts_empty(ubq)) {
+		ret = __ublk_batch_dispatch(ubq, data, fcmd);
+		if (ret <= 0)
+			break;
+	}
+
+	if (ret < 0)
+		ublk_batch_deinit_fetch_buf(data, fcmd, ret);
+
+	return ret;
+}
+
 static void ublk_cmd_tw_cb(struct io_uring_cmd *cmd,
 			   unsigned int issue_flags)
 {
-- 
2.47.0


* [PATCH V4 14/27] ublk: add UBLK_U_IO_FETCH_IO_CMDS for batch I/O processing
  2025-11-21  1:58 [PATCH V4 00/27] ublk: add UBLK_F_BATCH_IO Ming Lei
                   ` (12 preceding siblings ...)
  2025-11-21  1:58 ` [PATCH V4 13/27] ublk: add batch I/O dispatch infrastructure Ming Lei
@ 2025-11-21  1:58 ` Ming Lei
  2025-12-01  5:55   ` Caleb Sander Mateos
  2025-11-21  1:58 ` [PATCH V4 15/27] ublk: abort requests filled in event kfifo Ming Lei
                   ` (14 subsequent siblings)
  28 siblings, 1 reply; 66+ messages in thread
From: Ming Lei @ 2025-11-21  1:58 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Caleb Sander Mateos, Uday Shankar, Stefani Seibold, Andrew Morton,
	linux-kernel, Ming Lei

Add UBLK_U_IO_FETCH_IO_CMDS command to enable efficient batch processing
of I/O requests. This multishot uring_cmd allows the ublk server to fetch
multiple I/O commands in a single operation, significantly reducing
submission overhead compared to individual FETCH_REQ* commands.

Key Design Features:

1. Multishot Operation: One UBLK_U_IO_FETCH_IO_CMDS can fetch many I/O
   commands, with the batch size limited by the provided buffer length.

2. Dynamic Load Balancing: Multiple fetch commands can be submitted
   simultaneously, but only one is active at any time. This enables
   efficient load distribution across multiple server task contexts.

3. Implicit State Management: The implementation uses three key variables
   to track state:
   - evts_fifo: Queue of request tags awaiting processing
   - fcmd_head: List of available fetch commands
   - active_fcmd: Currently active fetch command (NULL = none active)

   States are derived implicitly:
   - IDLE: No fetch commands available
   - READY: Fetch commands available, none active
   - ACTIVE: One fetch command processing events (see the sketch below)

4. Lockless Reader Optimization: The active fetch command can read from
   evts_fifo without locking (single reader guarantee), while writers
   (ublk_queue_rq/ublk_queue_rqs) use evts_lock protection. The memory
   barrier pairing plays a key role in this single lockless reader
   optimization.
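
The implicit state from point 3 is never stored anywhere; a hypothetical
helper (purely illustrative, not part of this patch) shows how it is
derived from the two fields:

enum ublk_batch_state { BATCH_IDLE, BATCH_READY, BATCH_ACTIVE };

/* illustrative only: derive the implicit state of a ublk_queue */
static enum ublk_batch_state batch_state_of(const struct ublk_queue *ubq)
{
	if (READ_ONCE(ubq->active_fcmd))
		return BATCH_ACTIVE;	/* one fetch command consuming evts_fifo */
	if (!list_empty(&ubq->fcmd_head))
		return BATCH_READY;	/* fetch commands parked, none active */
	return BATCH_IDLE;		/* events can only pile up in evts_fifo */
}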

Implementation Details:

- ublk_queue_rq() and ublk_queue_rqs() save request tags to evts_fifo
- __ublk_pick_active_fcmd() selects an available fetch command when
  events arrive and no command is currently active
- ublk_batch_dispatch() moves tags from evts_fifo to the fetch command's
  buffer and posts completion via io_uring_mshot_cmd_post_cqe()
- State transitions are coordinated via evts_lock to maintain consistency

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/block/ublk_drv.c      | 412 +++++++++++++++++++++++++++++++---
 include/uapi/linux/ublk_cmd.h |   7 +
 2 files changed, 388 insertions(+), 31 deletions(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index cc9c92d97349..2e5e392c939e 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -93,6 +93,7 @@
 
 /* ublk batch fetch uring_cmd */
 struct ublk_batch_fcmd {
+	struct list_head node;
 	struct io_uring_cmd *cmd;
 	unsigned short buf_group;
 };
@@ -117,7 +118,10 @@ struct ublk_uring_cmd_pdu {
 	 */
 	struct ublk_queue *ubq;
 
-	u16 tag;
+	union {
+		u16 tag;
+		struct ublk_batch_fcmd *fcmd; /* batch io only */
+	};
 };
 
 struct ublk_batch_io_data {
@@ -229,18 +233,36 @@ struct ublk_queue {
 	struct ublk_device *dev;
 
 	/*
-	 * Inflight ublk request tag is saved in this fifo
+	 * Batch I/O State Management:
+	 *
+	 * The batch I/O system uses implicit state management based on the
+	 * combination of three key variables below.
+	 *
+	 * - IDLE: list_empty(&fcmd_head) && !active_fcmd
+	 *   No fetch commands available, events queue in evts_fifo
+	 *
+	 * - READY: !list_empty(&fcmd_head) && !active_fcmd
+	 *   Fetch commands available but none processing events
 	 *
-	 * There are multiple writer from ublk_queue_rq() or ublk_queue_rqs(),
-	 * so lock is required for storing request tag to fifo
+	 * - ACTIVE: active_fcmd
+	 *   One fetch command actively processing events from evts_fifo
 	 *
-	 * Make sure just one reader for fetching request from task work
-	 * function to ublk server, so no need to grab the lock in reader
-	 * side.
+	 * Key Invariants:
+	 * - At most one active_fcmd at any time (single reader)
+	 * - active_fcmd is always from fcmd_head list when non-NULL
+	 * - evts_fifo can be read locklessly by the single active reader
+	 * - All state transitions require evts_lock protection
+	 * - Multiple writers to evts_fifo require lock protection
 	 */
 	struct {
 		DECLARE_KFIFO_PTR(evts_fifo, unsigned short);
 		spinlock_t evts_lock;
+
+		/* List of fetch commands available to process events */
+		struct list_head fcmd_head;
+
+		/* Currently active fetch command (NULL = none active) */
+		struct ublk_batch_fcmd  *active_fcmd;
 	}____cacheline_aligned_in_smp;
 
 	struct ublk_io ios[] __counted_by(q_depth);
@@ -292,12 +314,20 @@ static void ublk_abort_queue(struct ublk_device *ub, struct ublk_queue *ubq);
 static inline struct request *__ublk_check_and_get_req(struct ublk_device *ub,
 		u16 q_id, u16 tag, struct ublk_io *io, size_t offset);
 static inline unsigned int ublk_req_build_flags(struct request *req);
+static void ublk_batch_dispatch(struct ublk_queue *ubq,
+				struct ublk_batch_io_data *data,
+				struct ublk_batch_fcmd *fcmd);
 
 static inline bool ublk_dev_support_batch_io(const struct ublk_device *ub)
 {
 	return false;
 }
 
+static inline bool ublk_support_batch_io(const struct ublk_queue *ubq)
+{
+	return false;
+}
+
 static inline void ublk_io_lock(struct ublk_io *io)
 {
 	spin_lock(&io->lock);
@@ -624,13 +654,45 @@ static wait_queue_head_t ublk_idr_wq;	/* wait until one idr is freed */
 
 static DEFINE_MUTEX(ublk_ctl_mutex);
 
+static struct ublk_batch_fcmd *
+ublk_batch_alloc_fcmd(struct io_uring_cmd *cmd)
+{
+	struct ublk_batch_fcmd *fcmd = kzalloc(sizeof(*fcmd), GFP_NOIO);
+
+	if (fcmd) {
+		fcmd->cmd = cmd;
+		fcmd->buf_group = READ_ONCE(cmd->sqe->buf_index);
+	}
+	return fcmd;
+}
+
+static void ublk_batch_free_fcmd(struct ublk_batch_fcmd *fcmd)
+{
+	kfree(fcmd);
+}
+
+static void __ublk_release_fcmd(struct ublk_queue *ubq)
+{
+	WRITE_ONCE(ubq->active_fcmd, NULL);
+}
 
-static void ublk_batch_deinit_fetch_buf(const struct ublk_batch_io_data *data,
+/*
+ * Nothing can move on, so clear ->active_fcmd, and the caller should stop
+ * dispatching
+ */
+static void ublk_batch_deinit_fetch_buf(struct ublk_queue *ubq,
+					const struct ublk_batch_io_data *data,
 					struct ublk_batch_fcmd *fcmd,
 					int res)
 {
+	spin_lock(&ubq->evts_lock);
+	list_del(&fcmd->node);
+	WARN_ON_ONCE(fcmd != ubq->active_fcmd);
+	__ublk_release_fcmd(ubq);
+	spin_unlock(&ubq->evts_lock);
+
 	io_uring_cmd_done(fcmd->cmd, res, data->issue_flags);
-	fcmd->cmd = NULL;
+	ublk_batch_free_fcmd(fcmd);
 }
 
 static int ublk_batch_fetch_post_cqe(struct ublk_batch_fcmd *fcmd,
@@ -1491,6 +1553,8 @@ static int __ublk_batch_dispatch(struct ublk_queue *ubq,
 	bool needs_filter;
 	int ret;
 
+	WARN_ON_ONCE(data->cmd != fcmd->cmd);
+
 	sel = io_uring_cmd_buffer_select(fcmd->cmd, fcmd->buf_group, &len,
 					 data->issue_flags);
 	if (sel.val < 0)
@@ -1548,23 +1612,94 @@ static int __ublk_batch_dispatch(struct ublk_queue *ubq,
 	return ret;
 }
 
-static __maybe_unused int
-ublk_batch_dispatch(struct ublk_queue *ubq,
-		    const struct ublk_batch_io_data *data,
-		    struct ublk_batch_fcmd *fcmd)
+static struct ublk_batch_fcmd *__ublk_acquire_fcmd(
+		struct ublk_queue *ubq)
+{
+	struct ublk_batch_fcmd *fcmd;
+
+	lockdep_assert_held(&ubq->evts_lock);
+
+	/*
+	 * Order updating ubq->evts_fifo against checking ubq->active_fcmd.
+	 *
+	 * The pair is the smp_mb() in ublk_batch_dispatch().
+	 *
+	 * If ubq->active_fcmd is observed as non-NULL, the newly added tags
+	 * will be visible in ublk_batch_dispatch() thanks to the barrier pairing.
+	 */
+	smp_mb();
+	if (READ_ONCE(ubq->active_fcmd)) {
+		fcmd = NULL;
+	} else {
+		fcmd = list_first_entry_or_null(&ubq->fcmd_head,
+				struct ublk_batch_fcmd, node);
+		WRITE_ONCE(ubq->active_fcmd, fcmd);
+	}
+	return fcmd;
+}
+
+static void ublk_batch_tw_cb(struct io_uring_cmd *cmd,
+			   unsigned int issue_flags)
+{
+	struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
+	struct ublk_batch_fcmd *fcmd = pdu->fcmd;
+	struct ublk_batch_io_data data = {
+		.ub = pdu->ubq->dev,
+		.cmd = fcmd->cmd,
+		.issue_flags = issue_flags,
+	};
+
+	WARN_ON_ONCE(pdu->ubq->active_fcmd != fcmd);
+
+	ublk_batch_dispatch(pdu->ubq, &data, fcmd);
+}
+
+static void ublk_batch_dispatch(struct ublk_queue *ubq,
+				struct ublk_batch_io_data *data,
+				struct ublk_batch_fcmd *fcmd)
 {
+	struct ublk_batch_fcmd *new_fcmd;
+	void *handle;
+	bool empty;
 	int ret = 0;
 
+again:
 	while (!ublk_io_evts_empty(ubq)) {
 		ret = __ublk_batch_dispatch(ubq, data, fcmd);
 		if (ret <= 0)
 			break;
 	}
 
-	if (ret < 0)
-		ublk_batch_deinit_fetch_buf(data, fcmd, ret);
+	if (ret < 0) {
+		ublk_batch_deinit_fetch_buf(ubq, data, fcmd, ret);
+		return;
+	}
 
-	return ret;
+	handle = io_uring_cmd_ctx_handle(fcmd->cmd);
+	__ublk_release_fcmd(ubq);
+	/*
+	 * Order clearing ubq->active_fcmd from __ublk_release_fcmd() and
+	 * checking ubq->evts_fifo.
+	 *
+	 * The pair is the smp_mb() in __ublk_acquire_fcmd().
+	 */
+	smp_mb();
+	empty = ublk_io_evts_empty(ubq);
+	if (likely(empty))
+		return;
+
+	spin_lock(&ubq->evts_lock);
+	new_fcmd = __ublk_acquire_fcmd(ubq);
+	spin_unlock(&ubq->evts_lock);
+
+	if (!new_fcmd)
+		return;
+	if (handle == io_uring_cmd_ctx_handle(new_fcmd->cmd)) {
+		data->cmd = new_fcmd->cmd;
+		fcmd = new_fcmd;
+		goto again;
+	}
+	io_uring_cmd_complete_in_task(new_fcmd->cmd, ublk_batch_tw_cb);
 }
 
 static void ublk_cmd_tw_cb(struct io_uring_cmd *cmd,
@@ -1576,13 +1711,27 @@ static void ublk_cmd_tw_cb(struct io_uring_cmd *cmd,
 	ublk_dispatch_req(ubq, pdu->req, issue_flags);
 }
 
-static void ublk_queue_cmd(struct ublk_queue *ubq, struct request *rq)
+static void ublk_queue_cmd(struct ublk_queue *ubq, struct request *rq, bool last)
 {
-	struct io_uring_cmd *cmd = ubq->ios[rq->tag].cmd;
-	struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
+	if (ublk_support_batch_io(ubq)) {
+		unsigned short tag = rq->tag;
+		struct ublk_batch_fcmd *fcmd = NULL;
 
-	pdu->req = rq;
-	io_uring_cmd_complete_in_task(cmd, ublk_cmd_tw_cb);
+		spin_lock(&ubq->evts_lock);
+		kfifo_put(&ubq->evts_fifo, tag);
+		if (last)
+			fcmd = __ublk_acquire_fcmd(ubq);
+		spin_unlock(&ubq->evts_lock);
+
+		if (fcmd)
+			io_uring_cmd_complete_in_task(fcmd->cmd, ublk_batch_tw_cb);
+	} else {
+		struct io_uring_cmd *cmd = ubq->ios[rq->tag].cmd;
+		struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
+
+		pdu->req = rq;
+		io_uring_cmd_complete_in_task(cmd, ublk_cmd_tw_cb);
+	}
 }
 
 static void ublk_cmd_list_tw_cb(struct io_uring_cmd *cmd,
@@ -1600,14 +1749,44 @@ static void ublk_cmd_list_tw_cb(struct io_uring_cmd *cmd,
 	} while (rq);
 }
 
-static void ublk_queue_cmd_list(struct ublk_io *io, struct rq_list *l)
+static void ublk_batch_queue_cmd_list(struct ublk_queue *ubq, struct rq_list *l)
 {
-	struct io_uring_cmd *cmd = io->cmd;
-	struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
+	unsigned short tags[MAX_NR_TAG];
+	struct ublk_batch_fcmd *fcmd;
+	struct request *rq;
+	unsigned cnt = 0;
+
+	spin_lock(&ubq->evts_lock);
+	rq_list_for_each(l, rq) {
+		tags[cnt++] = (unsigned short)rq->tag;
+		if (cnt >= MAX_NR_TAG) {
+			kfifo_in(&ubq->evts_fifo, tags, cnt);
+			cnt = 0;
+		}
+	}
+	if (cnt)
+		kfifo_in(&ubq->evts_fifo, tags, cnt);
+	fcmd = __ublk_acquire_fcmd(ubq);
+	spin_unlock(&ubq->evts_lock);
 
-	pdu->req_list = rq_list_peek(l);
 	rq_list_init(l);
-	io_uring_cmd_complete_in_task(cmd, ublk_cmd_list_tw_cb);
+	if (fcmd)
+		io_uring_cmd_complete_in_task(fcmd->cmd, ublk_batch_tw_cb);
+}
+
+static void ublk_queue_cmd_list(struct ublk_queue *ubq, struct ublk_io *io,
+				struct rq_list *l, bool batch)
+{
+	if (batch) {
+		ublk_batch_queue_cmd_list(ubq, l);
+	} else {
+		struct io_uring_cmd *cmd = io->cmd;
+		struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
+
+		pdu->req_list = rq_list_peek(l);
+		rq_list_init(l);
+		io_uring_cmd_complete_in_task(cmd, ublk_cmd_list_tw_cb);
+	}
 }
 
 static enum blk_eh_timer_return ublk_timeout(struct request *rq)
@@ -1686,7 +1865,7 @@ static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx,
 		return BLK_STS_OK;
 	}
 
-	ublk_queue_cmd(ubq, rq);
+	ublk_queue_cmd(ubq, rq, bd->last);
 	return BLK_STS_OK;
 }
 
@@ -1698,11 +1877,25 @@ static inline bool ublk_belong_to_same_batch(const struct ublk_io *io,
 		(io->task == io2->task);
 }
 
-static void ublk_queue_rqs(struct rq_list *rqlist)
+static void ublk_commit_rqs(struct blk_mq_hw_ctx *hctx)
+{
+	struct ublk_queue *ubq = hctx->driver_data;
+	struct ublk_batch_fcmd *fcmd;
+
+	spin_lock(&ubq->evts_lock);
+	fcmd = __ublk_acquire_fcmd(ubq);
+	spin_unlock(&ubq->evts_lock);
+
+	if (fcmd)
+		io_uring_cmd_complete_in_task(fcmd->cmd, ublk_batch_tw_cb);
+}
+
+static void __ublk_queue_rqs(struct rq_list *rqlist, bool batch)
 {
 	struct rq_list requeue_list = { };
 	struct rq_list submit_list = { };
 	struct ublk_io *io = NULL;
+	struct ublk_queue *ubq = NULL;
 	struct request *req;
 
 	while ((req = rq_list_pop(rqlist))) {
@@ -1716,16 +1909,27 @@ static void ublk_queue_rqs(struct rq_list *rqlist)
 
 		if (io && !ublk_belong_to_same_batch(io, this_io) &&
 				!rq_list_empty(&submit_list))
-			ublk_queue_cmd_list(io, &submit_list);
+			ublk_queue_cmd_list(ubq, io, &submit_list, batch);
 		io = this_io;
+		ubq = this_q;
 		rq_list_add_tail(&submit_list, req);
 	}
 
 	if (!rq_list_empty(&submit_list))
-		ublk_queue_cmd_list(io, &submit_list);
+		ublk_queue_cmd_list(ubq, io, &submit_list, batch);
 	*rqlist = requeue_list;
 }
 
+static void ublk_queue_rqs(struct rq_list *rqlist)
+{
+	__ublk_queue_rqs(rqlist, false);
+}
+
+static void ublk_batch_queue_rqs(struct rq_list *rqlist)
+{
+	__ublk_queue_rqs(rqlist, true);
+}
+
 static int ublk_init_hctx(struct blk_mq_hw_ctx *hctx, void *driver_data,
 		unsigned int hctx_idx)
 {
@@ -1743,6 +1947,14 @@ static const struct blk_mq_ops ublk_mq_ops = {
 	.timeout	= ublk_timeout,
 };
 
+static const struct blk_mq_ops ublk_batch_mq_ops = {
+	.commit_rqs	= ublk_commit_rqs,
+	.queue_rq       = ublk_queue_rq,
+	.queue_rqs      = ublk_batch_queue_rqs,
+	.init_hctx	= ublk_init_hctx,
+	.timeout	= ublk_timeout,
+};
+
 static void ublk_queue_reinit(struct ublk_device *ub, struct ublk_queue *ubq)
 {
 	int i;
@@ -2120,6 +2332,56 @@ static void ublk_cancel_cmd(struct ublk_queue *ubq, unsigned tag,
 		io_uring_cmd_done(io->cmd, UBLK_IO_RES_ABORT, issue_flags);
 }
 
+static void ublk_batch_cancel_cmd(struct ublk_queue *ubq,
+				  struct ublk_batch_fcmd *fcmd,
+				  unsigned int issue_flags)
+{
+	bool done;
+
+	spin_lock(&ubq->evts_lock);
+	done = (ubq->active_fcmd != fcmd);
+	if (done)
+		list_del(&fcmd->node);
+	spin_unlock(&ubq->evts_lock);
+
+	if (done) {
+		io_uring_cmd_done(fcmd->cmd, UBLK_IO_RES_ABORT, issue_flags);
+		ublk_batch_free_fcmd(fcmd);
+	}
+}
+
+static void ublk_batch_cancel_queue(struct ublk_queue *ubq)
+{
+	LIST_HEAD(fcmd_list);
+
+	spin_lock(&ubq->evts_lock);
+	ubq->force_abort = true;
+	list_splice_init(&ubq->fcmd_head, &fcmd_list);
+	if (ubq->active_fcmd)
+		list_move(&ubq->active_fcmd->node, &ubq->fcmd_head);
+	spin_unlock(&ubq->evts_lock);
+
+	while (!list_empty(&fcmd_list)) {
+		struct ublk_batch_fcmd *fcmd = list_first_entry(&fcmd_list,
+				struct ublk_batch_fcmd, node);
+
+		ublk_batch_cancel_cmd(ubq, fcmd, IO_URING_F_UNLOCKED);
+	}
+}
+
+static void ublk_batch_cancel_fn(struct io_uring_cmd *cmd,
+				 unsigned int issue_flags)
+{
+	struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
+	struct ublk_batch_fcmd *fcmd = pdu->fcmd;
+	struct ublk_queue *ubq = pdu->ubq;
+
+	if (!ubq->canceling)
+		ublk_start_cancel(ubq->dev);
+
+	ublk_batch_cancel_cmd(ubq, fcmd, issue_flags);
+}
+
 /*
  * The ublk char device won't be closed when calling cancel fn, so both
  * ublk device and queue are guaranteed to be live
@@ -2171,6 +2433,11 @@ static void ublk_cancel_queue(struct ublk_queue *ubq)
 {
 	int i;
 
+	if (ublk_support_batch_io(ubq)) {
+		ublk_batch_cancel_queue(ubq);
+		return;
+	}
+
 	for (i = 0; i < ubq->q_depth; i++)
 		ublk_cancel_cmd(ubq, i, IO_URING_F_UNLOCKED);
 }
@@ -3091,6 +3358,74 @@ static int ublk_check_batch_cmd(const struct ublk_batch_io_data *data)
 	return ublk_check_batch_cmd_flags(uc);
 }
 
+static int ublk_batch_attach(struct ublk_queue *ubq,
+			     struct ublk_batch_io_data *data,
+			     struct ublk_batch_fcmd *fcmd)
+{
+	struct ublk_batch_fcmd *new_fcmd = NULL;
+	bool free = false;
+
+	spin_lock(&ubq->evts_lock);
+	if (unlikely(ubq->force_abort || ubq->canceling)) {
+		free = true;
+	} else {
+		list_add_tail(&fcmd->node, &ubq->fcmd_head);
+		new_fcmd = __ublk_acquire_fcmd(ubq);
+	}
+	spin_unlock(&ubq->evts_lock);
+
+	/*
+	 * If the two fetch commands are originated from same io_ring_ctx,
+	 * run batch dispatch directly. Otherwise, schedule task work for
+	 * doing it.
+	 */
+	if (new_fcmd && io_uring_cmd_ctx_handle(new_fcmd->cmd) ==
+			io_uring_cmd_ctx_handle(fcmd->cmd)) {
+		data->cmd = new_fcmd->cmd;
+		ublk_batch_dispatch(ubq, data, new_fcmd);
+	} else if (new_fcmd) {
+		io_uring_cmd_complete_in_task(new_fcmd->cmd,
+				ublk_batch_tw_cb);
+	}
+
+	if (free) {
+		ublk_batch_free_fcmd(fcmd);
+		return -ENODEV;
+	}
+	return -EIOCBQUEUED;
+}
+
+static int ublk_handle_batch_fetch_cmd(struct ublk_batch_io_data *data)
+{
+	struct ublk_queue *ubq = ublk_get_queue(data->ub, data->header.q_id);
+	struct ublk_batch_fcmd *fcmd = ublk_batch_alloc_fcmd(data->cmd);
+	struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(data->cmd);
+
+	if (!fcmd)
+		return -ENOMEM;
+
+	pdu->ubq = ubq;
+	pdu->fcmd = fcmd;
+	io_uring_cmd_mark_cancelable(data->cmd, data->issue_flags);
+
+	return ublk_batch_attach(ubq, data, fcmd);
+}
+
+static int ublk_validate_batch_fetch_cmd(struct ublk_batch_io_data *data,
+					 const struct ublk_batch_io *uc)
+{
+	if (!(data->cmd->flags & IORING_URING_CMD_MULTISHOT))
+		return -EINVAL;
+
+	if (uc->elem_bytes != sizeof(__u16))
+		return -EINVAL;
+
+	if (uc->flags != 0)
+		return -E2BIG;
+
+	return 0;
+}
+
 static int ublk_ch_batch_io_uring_cmd(struct io_uring_cmd *cmd,
 				       unsigned int issue_flags)
 {
@@ -3113,6 +3448,11 @@ static int ublk_ch_batch_io_uring_cmd(struct io_uring_cmd *cmd,
 	if (data.header.q_id >= ub->dev_info.nr_hw_queues)
 		goto out;
 
+	if (unlikely(issue_flags & IO_URING_F_CANCEL)) {
+		ublk_batch_cancel_fn(cmd, issue_flags);
+		return 0;
+	}
+
 	switch (cmd_op) {
 	case UBLK_U_IO_PREP_IO_CMDS:
 		ret = ublk_check_batch_cmd(&data);
@@ -3126,6 +3466,12 @@ static int ublk_ch_batch_io_uring_cmd(struct io_uring_cmd *cmd,
 			goto out;
 		ret = ublk_handle_batch_commit_cmd(&data);
 		break;
+	case UBLK_U_IO_FETCH_IO_CMDS:
+		ret = ublk_validate_batch_fetch_cmd(&data, uc);
+		if (ret)
+			goto out;
+		ret = ublk_handle_batch_fetch_cmd(&data);
+		break;
 	default:
 		ret = -EOPNOTSUPP;
 	}
@@ -3327,6 +3673,7 @@ static int ublk_init_queue(struct ublk_device *ub, int q_id)
 		ret = ublk_io_evts_init(ubq, ubq->q_depth, numa_node);
 		if (ret)
 			goto fail;
+		INIT_LIST_HEAD(&ubq->fcmd_head);
 	}
 	ub->queues[q_id] = ubq;
 	ubq->dev = ub;
@@ -3451,7 +3798,10 @@ static void ublk_align_max_io_size(struct ublk_device *ub)
 
 static int ublk_add_tag_set(struct ublk_device *ub)
 {
-	ub->tag_set.ops = &ublk_mq_ops;
+	if (ublk_dev_support_batch_io(ub))
+		ub->tag_set.ops = &ublk_batch_mq_ops;
+	else
+		ub->tag_set.ops = &ublk_mq_ops;
 	ub->tag_set.nr_hw_queues = ub->dev_info.nr_hw_queues;
 	ub->tag_set.queue_depth = ub->dev_info.queue_depth;
 	ub->tag_set.numa_node = NUMA_NO_NODE;
diff --git a/include/uapi/linux/ublk_cmd.h b/include/uapi/linux/ublk_cmd.h
index 295ec8f34173..cd894c1d188e 100644
--- a/include/uapi/linux/ublk_cmd.h
+++ b/include/uapi/linux/ublk_cmd.h
@@ -120,6 +120,13 @@
 #define	UBLK_U_IO_COMMIT_IO_CMDS	\
 	_IOWR('u', 0x26, struct ublk_batch_io)
 
+/*
+ * Fetch io commands to provided buffer in multishot style,
+ * `IORING_URING_CMD_MULTISHOT` is required for this command.
+ */
+#define	UBLK_U_IO_FETCH_IO_CMDS 	\
+	_IOWR('u', 0x27, struct ublk_batch_io)
+
 /* only ABORT means that no re-fetch */
 #define UBLK_IO_RES_OK			0
 #define UBLK_IO_RES_NEED_GET_DATA	1
-- 
2.47.0


* [PATCH V4 15/27] ublk: abort requests filled in event kfifo
  2025-11-21  1:58 [PATCH V4 00/27] ublk: add UBLK_F_BATCH_IO Ming Lei
                   ` (13 preceding siblings ...)
  2025-11-21  1:58 ` [PATCH V4 14/27] ublk: add UBLK_U_IO_FETCH_IO_CMDS for batch I/O processing Ming Lei
@ 2025-11-21  1:58 ` Ming Lei
  2025-12-01 18:52   ` Caleb Sander Mateos
  2025-12-01 19:00   ` Caleb Sander Mateos
  2025-11-21  1:58 ` [PATCH V4 16/27] ublk: add new feature UBLK_F_BATCH_IO Ming Lei
                   ` (13 subsequent siblings)
  28 siblings, 2 replies; 66+ messages in thread
From: Ming Lei @ 2025-11-21  1:58 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Caleb Sander Mateos, Uday Shankar, Stefani Seibold, Andrew Morton,
	linux-kernel, Ming Lei

In case of BATCH_IO, requests whose tags are still sitting in the event
kfifo don't get a chance to be dispatched any more once the ublk char
device is released, so we have to abort them too.

Add ublk_abort_batch_queue() for aborting these requests.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/block/ublk_drv.c | 26 +++++++++++++++++++++++++-
 1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 2e5e392c939e..849199771f86 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -2241,7 +2241,8 @@ static int ublk_ch_mmap(struct file *filp, struct vm_area_struct *vma)
 static void __ublk_fail_req(struct ublk_device *ub, struct ublk_io *io,
 		struct request *req)
 {
-	WARN_ON_ONCE(io->flags & UBLK_IO_FLAG_ACTIVE);
+	WARN_ON_ONCE(!ublk_dev_support_batch_io(ub) &&
+			io->flags & UBLK_IO_FLAG_ACTIVE);
 
 	if (ublk_nosrv_should_reissue_outstanding(ub))
 		blk_mq_requeue_request(req, false);
@@ -2251,6 +2252,26 @@ static void __ublk_fail_req(struct ublk_device *ub, struct ublk_io *io,
 	}
 }
 
+/*
+ * Request tag may just be filled to event kfifo, not get chance to
+ * dispatch, abort these requests too
+ */
+static void ublk_abort_batch_queue(struct ublk_device *ub,
+				   struct ublk_queue *ubq)
+{
+	while (true) {
+		struct request *req;
+		short tag;
+
+		if (!kfifo_out(&ubq->evts_fifo, &tag, 1))
+			break;
+
+		req = blk_mq_tag_to_rq(ub->tag_set.tags[ubq->q_id], tag);
+		if (req && blk_mq_request_started(req))
+			__ublk_fail_req(ub, &ubq->ios[tag], req);
+	}
+}
+
 /*
  * Called from ublk char device release handler, when any uring_cmd is
  * done, meantime request queue is "quiesced" since all inflight requests
@@ -2269,6 +2290,9 @@ static void ublk_abort_queue(struct ublk_device *ub, struct ublk_queue *ubq)
 		if (io->flags & UBLK_IO_FLAG_OWNED_BY_SRV)
 			__ublk_fail_req(ub, io, io->req);
 	}
+
+	if (ublk_support_batch_io(ubq))
+		ublk_abort_batch_queue(ub, ubq);
 }
 
 static void ublk_start_cancel(struct ublk_device *ub)
-- 
2.47.0


* [PATCH V4 16/27] ublk: add new feature UBLK_F_BATCH_IO
  2025-11-21  1:58 [PATCH V4 00/27] ublk: add UBLK_F_BATCH_IO Ming Lei
                   ` (14 preceding siblings ...)
  2025-11-21  1:58 ` [PATCH V4 15/27] ublk: abort requests filled in event kfifo Ming Lei
@ 2025-11-21  1:58 ` Ming Lei
  2025-12-01 21:16   ` Caleb Sander Mateos
  2025-11-21  1:58 ` [PATCH V4 17/27] ublk: document " Ming Lei
                   ` (12 subsequent siblings)
  28 siblings, 1 reply; 66+ messages in thread
From: Ming Lei @ 2025-11-21  1:58 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Caleb Sander Mateos, Uday Shankar, Stefani Seibold, Andrew Morton,
	linux-kernel, Ming Lei

Add new feature UBLK_F_BATCH_IO which replaces the following two
per-io commands:

	- UBLK_U_IO_FETCH_REQ

	- UBLK_U_IO_COMMIT_AND_FETCH_REQ

with three per-queue batch io uring_cmd:

	- UBLK_U_IO_PREP_IO_CMDS

	- UBLK_U_IO_COMMIT_IO_CMDS

	- UBLK_U_IO_FETCH_IO_CMDS

Then ublk can deliver a batch of io commands to the ublk server in a
single multishot uring_cmd, and multiple commands can be prepared &
committed in batch style via a single uring_cmd, so communication cost
is reduced a lot.

This feature also no longer restricts the task context for the supported
commands, so any allowed uring_cmd can be issued from any task context,
which makes ublk server implementation much easier.

Meanwhile, load balancing becomes much easier to support with this feature.
The command `UBLK_U_IO_FETCH_IO_CMDS` can be issued from multiple task
contexts, so each task can adjust this command's buffer length or the
number of inflight commands to control how much load it handles.

Later, a priority parameter will be added to `UBLK_U_IO_FETCH_IO_CMDS`
for improving load-balance support.

UBLK_U_IO_GET_DATA isn't supported in batch io yet, but it may be
enabled in the future via a batch counterpart.
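
Purely for illustration (not part of this patch), a ublk server could
gate its command set on the new flag roughly as follows; `info` is
assumed to be the `struct ublksrv_ctrl_dev_info` obtained via the
control device:

#include <linux/ublk_cmd.h>

/* hypothetical server-side check of the advertised device flags */
static int use_batch_io(const struct ublksrv_ctrl_dev_info *info)
{
	if (info->flags & UBLK_F_BATCH_IO)
		return 1;	/* use PREP/COMMIT/FETCH_IO_CMDS uring_cmds */
	return 0;		/* fall back to FETCH_REQ / COMMIT_AND_FETCH_REQ */
}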

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/block/ublk_drv.c      | 58 ++++++++++++++++++++++++++++++++---
 include/uapi/linux/ublk_cmd.h | 16 ++++++++++
 2 files changed, 69 insertions(+), 5 deletions(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 849199771f86..90cd1863bc83 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -74,7 +74,8 @@
 		| UBLK_F_AUTO_BUF_REG \
 		| UBLK_F_QUIESCE \
 		| UBLK_F_PER_IO_DAEMON \
-		| UBLK_F_BUF_REG_OFF_DAEMON)
+		| UBLK_F_BUF_REG_OFF_DAEMON \
+		| UBLK_F_BATCH_IO)
 
 #define UBLK_F_ALL_RECOVERY_FLAGS (UBLK_F_USER_RECOVERY \
 		| UBLK_F_USER_RECOVERY_REISSUE \
@@ -320,12 +321,12 @@ static void ublk_batch_dispatch(struct ublk_queue *ubq,
 
 static inline bool ublk_dev_support_batch_io(const struct ublk_device *ub)
 {
-	return false;
+	return ub->dev_info.flags & UBLK_F_BATCH_IO;
 }
 
 static inline bool ublk_support_batch_io(const struct ublk_queue *ubq)
 {
-	return false;
+	return ubq->flags & UBLK_F_BATCH_IO;
 }
 
 static inline void ublk_io_lock(struct ublk_io *io)
@@ -3450,6 +3451,41 @@ static int ublk_validate_batch_fetch_cmd(struct ublk_batch_io_data *data,
 	return 0;
 }
 
+static int ublk_handle_non_batch_cmd(struct io_uring_cmd *cmd,
+				     unsigned int issue_flags)
+{
+	const struct ublksrv_io_cmd *ub_cmd = io_uring_sqe_cmd(cmd->sqe);
+	struct ublk_device *ub = cmd->file->private_data;
+	unsigned tag = READ_ONCE(ub_cmd->tag);
+	unsigned q_id = READ_ONCE(ub_cmd->q_id);
+	unsigned index = READ_ONCE(ub_cmd->addr);
+	struct ublk_queue *ubq;
+	struct ublk_io *io;
+	int ret = -EINVAL;
+
+	if (!ub)
+		return ret;
+
+	if (q_id >= ub->dev_info.nr_hw_queues)
+		return ret;
+
+	ubq = ublk_get_queue(ub, q_id);
+	if (tag >= ubq->q_depth)
+		return ret;
+
+	io = &ubq->ios[tag];
+
+	switch (cmd->cmd_op) {
+	case UBLK_U_IO_REGISTER_IO_BUF:
+		return ublk_register_io_buf(cmd, ub, q_id, tag, io, index,
+				issue_flags);
+	case UBLK_U_IO_UNREGISTER_IO_BUF:
+		return ublk_unregister_io_buf(cmd, ub, index, issue_flags);
+	default:
+		return -EOPNOTSUPP;
+	}
+}
+
 static int ublk_ch_batch_io_uring_cmd(struct io_uring_cmd *cmd,
 				       unsigned int issue_flags)
 {
@@ -3497,7 +3533,8 @@ static int ublk_ch_batch_io_uring_cmd(struct io_uring_cmd *cmd,
 		ret = ublk_handle_batch_fetch_cmd(&data);
 		break;
 	default:
-		ret = -EOPNOTSUPP;
+		ret = ublk_handle_non_batch_cmd(cmd, issue_flags);
+		break;
 	}
 out:
 	return ret;
@@ -4163,9 +4200,13 @@ static int ublk_ctrl_add_dev(const struct ublksrv_ctrl_cmd *header)
 
 	ub->dev_info.flags |= UBLK_F_CMD_IOCTL_ENCODE |
 		UBLK_F_URING_CMD_COMP_IN_TASK |
-		UBLK_F_PER_IO_DAEMON |
+		(ublk_dev_support_batch_io(ub) ? 0 : UBLK_F_PER_IO_DAEMON) |
 		UBLK_F_BUF_REG_OFF_DAEMON;
 
+	/* So far, UBLK_F_PER_IO_DAEMON won't be exposed for BATCH_IO */
+	if (ublk_dev_support_batch_io(ub))
+		ub->dev_info.flags &= ~UBLK_F_PER_IO_DAEMON;
+
 	/* GET_DATA isn't needed any more with USER_COPY or ZERO COPY */
 	if (ub->dev_info.flags & (UBLK_F_USER_COPY | UBLK_F_SUPPORT_ZERO_COPY |
 				UBLK_F_AUTO_BUF_REG))
@@ -4518,6 +4559,13 @@ static int ublk_wait_for_idle_io(struct ublk_device *ub,
 	unsigned int elapsed = 0;
 	int ret;
 
+	/*
+	 * For UBLK_F_BATCH_IO ublk server can get notified with existing
+	 * or new fetch command, so needn't wait any more
+	 */
+	if (ublk_dev_support_batch_io(ub))
+		return 0;
+
 	while (elapsed < timeout_ms && !signal_pending(current)) {
 		unsigned int queues_cancelable = 0;
 		int i;
diff --git a/include/uapi/linux/ublk_cmd.h b/include/uapi/linux/ublk_cmd.h
index cd894c1d188e..5e8b1211b7f4 100644
--- a/include/uapi/linux/ublk_cmd.h
+++ b/include/uapi/linux/ublk_cmd.h
@@ -335,6 +335,22 @@
  */
 #define UBLK_F_BUF_REG_OFF_DAEMON (1ULL << 14)
 
+
+/*
+ * Support the following commands for delivering & committing io command
+ * in batch.
+ *
+ * 	- UBLK_U_IO_PREP_IO_CMDS
+ * 	- UBLK_U_IO_COMMIT_IO_CMDS
+ * 	- UBLK_U_IO_FETCH_IO_CMDS
+ * 	- UBLK_U_IO_REGISTER_IO_BUF
+ * 	- UBLK_U_IO_UNREGISTER_IO_BUF
+ *
+ * The existing UBLK_U_IO_FETCH_REQ, UBLK_U_IO_COMMIT_AND_FETCH_REQ and
+ * UBLK_U_IO_GET_DATA uring_cmd are not supported for this feature.
+ */
+#define UBLK_F_BATCH_IO		(1ULL << 15)
+
 /* device state */
 #define UBLK_S_DEV_DEAD	0
 #define UBLK_S_DEV_LIVE	1
-- 
2.47.0


* [PATCH V4 17/27] ublk: document feature UBLK_F_BATCH_IO
  2025-11-21  1:58 [PATCH V4 00/27] ublk: add UBLK_F_BATCH_IO Ming Lei
                   ` (15 preceding siblings ...)
  2025-11-21  1:58 ` [PATCH V4 16/27] ublk: add new feature UBLK_F_BATCH_IO Ming Lei
@ 2025-11-21  1:58 ` Ming Lei
  2025-12-01 21:46   ` Caleb Sander Mateos
  2025-11-21  1:58 ` [PATCH V4 18/27] ublk: implement batch request completion via blk_mq_end_request_batch() Ming Lei
                   ` (11 subsequent siblings)
  28 siblings, 1 reply; 66+ messages in thread
From: Ming Lei @ 2025-11-21  1:58 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Caleb Sander Mateos, Uday Shankar, Stefani Seibold, Andrew Morton,
	linux-kernel, Ming Lei

Document feature UBLK_F_BATCH_IO.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 Documentation/block/ublk.rst | 60 +++++++++++++++++++++++++++++++++---
 1 file changed, 56 insertions(+), 4 deletions(-)

diff --git a/Documentation/block/ublk.rst b/Documentation/block/ublk.rst
index 8c4030bcabb6..09a5604f8e10 100644
--- a/Documentation/block/ublk.rst
+++ b/Documentation/block/ublk.rst
@@ -260,9 +260,12 @@ The following IO commands are communicated via io_uring passthrough command,
 and each command is only for forwarding the IO and committing the result
 with specified IO tag in the command data:
 
-- ``UBLK_IO_FETCH_REQ``
+Traditional Per-I/O Commands
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-  Sent from the server IO pthread for fetching future incoming IO requests
+- ``UBLK_U_IO_FETCH_REQ``
+
+  Sent from the server I/O pthread for fetching future incoming I/O requests
   destined to ``/dev/ublkb*``. This command is sent only once from the server
   IO pthread for ublk driver to setup IO forward environment.
 
@@ -278,7 +281,7 @@ with specified IO tag in the command data:
   supported by the driver, daemons must be per-queue instead - i.e. all I/Os
   associated to a single qid must be handled by the same task.
 
-- ``UBLK_IO_COMMIT_AND_FETCH_REQ``
+- ``UBLK_U_IO_COMMIT_AND_FETCH_REQ``
 
   When an IO request is destined to ``/dev/ublkb*``, the driver stores
   the IO's ``ublksrv_io_desc`` to the specified mapped area; then the
@@ -293,7 +296,7 @@ with specified IO tag in the command data:
   requests with the same IO tag. That is, ``UBLK_IO_COMMIT_AND_FETCH_REQ``
   is reused for both fetching request and committing back IO result.
 
-- ``UBLK_IO_NEED_GET_DATA``
+- ``UBLK_U_IO_NEED_GET_DATA``
 
   With ``UBLK_F_NEED_GET_DATA`` enabled, the WRITE request will be firstly
   issued to ublk server without data copy. Then, IO backend of ublk server
@@ -322,6 +325,55 @@ with specified IO tag in the command data:
   ``UBLK_IO_COMMIT_AND_FETCH_REQ`` to the server, ublkdrv needs to copy
   the server buffer (pages) read to the IO request pages.
 
+Batch I/O Commands (UBLK_F_BATCH_IO)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The ``UBLK_F_BATCH_IO`` feature provides an alternative high-performance
+I/O handling model that replaces the traditional per-I/O commands with
+per-queue batch commands. This significantly reduces communication overhead
+and enables better load balancing across multiple server tasks.
+
+Key differences from traditional mode:
+
+- **Per-queue vs Per-I/O**: Commands operate on queues rather than individual I/Os
+- **Batch processing**: Multiple I/Os are handled in single operations
+- **Multishot commands**: Use io_uring multishot for reduced submission overhead
+- **Flexible task assignment**: Any task can handle any I/O (no per-I/O daemons)
+- **Better load balancing**: Tasks can adjust their workload dynamically
+
+Batch I/O Commands:
+
+- ``UBLK_U_IO_PREP_IO_CMDS``
+
+  Prepares multiple I/O commands in batch. The server provides a buffer
+  containing multiple I/O descriptors that will be processed together.
+  This reduces the number of individual command submissions required.
+
+- ``UBLK_U_IO_COMMIT_IO_CMDS``
+
+  Commits results for multiple I/O operations in batch. The server provides
+  a buffer containing the results of multiple completed I/Os, allowing
+  efficient bulk completion of requests.
+
+- ``UBLK_U_IO_FETCH_IO_CMDS``
+
+  **Multishot command** for fetching I/O commands in batch. This is the key
+  command that enables high-performance batch processing:
+
+  * Uses io_uring multishot capability for reduced submission overhead
+  * Single command can fetch multiple I/O requests over time
+  * Buffer size determines maximum batch size per operation
+  * Multiple fetch commands can be submitted for load balancing
+  * Only one fetch command is active at any time per queue
+  * Supports dynamic load balancing across multiple server tasks
+
+  Each task can submit ``UBLK_U_IO_FETCH_IO_CMDS`` with different buffer
+  sizes to control how much work it handles. This enables sophisticated
+  load balancing strategies in multi-threaded servers.
+
+Migration: Applications using traditional commands (``UBLK_U_IO_FETCH_REQ``,
+``UBLK_U_IO_COMMIT_AND_FETCH_REQ``) cannot use batch mode simultaneously.
+
 Zero copy
 ---------
 
-- 
2.47.0


* [PATCH V4 18/27] ublk: implement batch request completion via blk_mq_end_request_batch()
  2025-11-21  1:58 [PATCH V4 00/27] ublk: add UBLK_F_BATCH_IO Ming Lei
                   ` (16 preceding siblings ...)
  2025-11-21  1:58 ` [PATCH V4 17/27] ublk: document " Ming Lei
@ 2025-11-21  1:58 ` Ming Lei
  2025-12-01 21:55   ` Caleb Sander Mateos
  2025-11-21  1:58 ` [PATCH V4 19/27] selftests: ublk: fix user_data truncation for tgt_data >= 256 Ming Lei
                   ` (10 subsequent siblings)
  28 siblings, 1 reply; 66+ messages in thread
From: Ming Lei @ 2025-11-21  1:58 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Caleb Sander Mateos, Uday Shankar, Stefani Seibold, Andrew Morton,
	linux-kernel, Ming Lei

Reduce overhead when completing multiple requests in batch I/O mode by
accumulating them in an io_comp_batch structure and completing them
together via blk_mq_end_request_batch(). This minimizes per-request
completion overhead and improves performance for high IOPS workloads.

The implementation adds an io_comp_batch pointer to struct ublk_io and
initializes it in __ublk_fetch(). For batch I/O, the pointer is set to
the batch structure in ublk_batch_commit_io(). The __ublk_complete_rq()
function uses io->iob to call blk_mq_add_to_batch() for batch mode.
After processing all batch I/Os, the completion callback is invoked in
ublk_handle_batch_commit_cmd() to complete all accumulated requests
efficiently.
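
Condensed to its core (a sketch only, not the actual hunk below; 'reqs'
is a hypothetical array of already-committed requests), the pattern is
the standard io_comp_batch one:

#include <linux/blk-mq.h>

static void complete_batch_cb(struct io_comp_batch *iob)
{
	blk_mq_end_request_batch(iob);
}

/* illustrative walk over already-committed requests reqs[0..n) */
static void commit_and_flush(struct request **reqs, int n)
{
	DEFINE_IO_COMP_BATCH(iob);
	int i;

	for (i = 0; i < n; i++) {
		/* queue into the batch; fall back to per-request completion */
		if (!blk_mq_add_to_batch(reqs[i], &iob, false,
					 complete_batch_cb))
			__blk_mq_end_request(reqs[i], BLK_STS_OK);
	}

	if (iob.complete)	/* flush everything accumulated above */
		iob.complete(&iob);
}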

So far this only covers direct completion. For deferred completion (zero
copy, auto buffer registration), ublk_io_release() is often delayed to the
code path that frees the buffer-consuming io_uring request, so this patch
usually doesn't help there; it is also hard to pass the per-task 'struct
io_comp_batch' along for deferred completion.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/block/ublk_drv.c | 30 ++++++++++++++++++++++--------
 1 file changed, 22 insertions(+), 8 deletions(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 90cd1863bc83..a5606c7111a4 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -130,6 +130,7 @@ struct ublk_batch_io_data {
 	struct io_uring_cmd *cmd;
 	struct ublk_batch_io header;
 	unsigned int issue_flags;
+	struct io_comp_batch *iob;
 };
 
 /*
@@ -642,7 +643,12 @@ static blk_status_t ublk_setup_iod_zoned(struct ublk_queue *ubq,
 #endif
 
 static inline void __ublk_complete_rq(struct request *req, struct ublk_io *io,
-				      bool need_map);
+				      bool need_map, struct io_comp_batch *iob);
+
+static void ublk_complete_batch(struct io_comp_batch *iob)
+{
+	blk_mq_end_request_batch(iob);
+}
 
 static dev_t ublk_chr_devt;
 static const struct class ublk_chr_class = {
@@ -912,7 +918,7 @@ static inline void ublk_put_req_ref(struct ublk_io *io, struct request *req)
 		return;
 
 	/* ublk_need_map_io() and ublk_need_req_ref() are mutually exclusive */
-	__ublk_complete_rq(req, io, false);
+	__ublk_complete_rq(req, io, false, NULL);
 }
 
 static inline bool ublk_sub_req_ref(struct ublk_io *io)
@@ -1251,7 +1257,7 @@ static inline struct ublk_uring_cmd_pdu *ublk_get_uring_cmd_pdu(
 
 /* todo: handle partial completion */
 static inline void __ublk_complete_rq(struct request *req, struct ublk_io *io,
-				      bool need_map)
+				      bool need_map, struct io_comp_batch *iob)
 {
 	unsigned int unmapped_bytes;
 	blk_status_t res = BLK_STS_OK;
@@ -1288,8 +1294,11 @@ static inline void __ublk_complete_rq(struct request *req, struct ublk_io *io,
 
 	if (blk_update_request(req, BLK_STS_OK, io->res))
 		blk_mq_requeue_request(req, true);
-	else if (likely(!blk_should_fake_timeout(req->q)))
+	else if (likely(!blk_should_fake_timeout(req->q))) {
+		if (blk_mq_add_to_batch(req, iob, false, ublk_complete_batch))
+			return;
 		__blk_mq_end_request(req, BLK_STS_OK);
+	}
 
 	return;
 exit:
@@ -2249,7 +2258,7 @@ static void __ublk_fail_req(struct ublk_device *ub, struct ublk_io *io,
 		blk_mq_requeue_request(req, false);
 	else {
 		io->res = -EIO;
-		__ublk_complete_rq(req, io, ublk_dev_need_map_io(ub));
+		__ublk_complete_rq(req, io, ublk_dev_need_map_io(ub), NULL);
 	}
 }
 
@@ -2986,7 +2995,7 @@ static int ublk_ch_uring_cmd_local(struct io_uring_cmd *cmd,
 		if (req_op(req) == REQ_OP_ZONE_APPEND)
 			req->__sector = addr;
 		if (compl)
-			__ublk_complete_rq(req, io, ublk_dev_need_map_io(ub));
+			__ublk_complete_rq(req, io, ublk_dev_need_map_io(ub), NULL);
 
 		if (ret)
 			goto out;
@@ -3321,11 +3330,11 @@ static int ublk_batch_commit_io(struct ublk_queue *ubq,
 	if (req_op(req) == REQ_OP_ZONE_APPEND)
 		req->__sector = ublk_batch_zone_lba(uc, elem);
 	if (compl)
-		__ublk_complete_rq(req, io, ublk_dev_need_map_io(data->ub));
+		__ublk_complete_rq(req, io, ublk_dev_need_map_io(data->ub), data->iob);
 	return 0;
 }
 
-static int ublk_handle_batch_commit_cmd(const struct ublk_batch_io_data *data)
+static int ublk_handle_batch_commit_cmd(struct ublk_batch_io_data *data)
 {
 	const struct ublk_batch_io *uc = &data->header;
 	struct io_uring_cmd *cmd = data->cmd;
@@ -3334,10 +3343,15 @@ static int ublk_handle_batch_commit_cmd(const struct ublk_batch_io_data *data)
 		.total = uc->nr_elem * uc->elem_bytes,
 		.elem_bytes = uc->elem_bytes,
 	};
+	DEFINE_IO_COMP_BATCH(iob);
 	int ret;
 
+	data->iob = &iob;
 	ret = ublk_walk_cmd_buf(&iter, data, ublk_batch_commit_io);
 
+	if (iob.complete)
+		iob.complete(&iob);
+
 	return iter.done == 0 ? ret : iter.done;
 }
 
-- 
2.47.0


* [PATCH V4 19/27] selftests: ublk: fix user_data truncation for tgt_data >= 256
  2025-11-21  1:58 [PATCH V4 00/27] ublk: add UBLK_F_BATCH_IO Ming Lei
                   ` (17 preceding siblings ...)
  2025-11-21  1:58 ` [PATCH V4 18/27] ublk: implement batch request completion via blk_mq_end_request_batch() Ming Lei
@ 2025-11-21  1:58 ` Ming Lei
  2025-11-21  1:58 ` [PATCH V4 20/27] selftests: ublk: replace assert() with ublk_assert() Ming Lei
                   ` (9 subsequent siblings)
  28 siblings, 0 replies; 66+ messages in thread
From: Ming Lei @ 2025-11-21  1:58 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Caleb Sander Mateos, Uday Shankar, Stefani Seibold, Andrew Morton,
	linux-kernel, Ming Lei

The build_user_data() function packs multiple fields into a __u64
value using bit shifts. Without explicit __u64 casts before shifting,
the shift operations are performed on 32-bit unsigned integers before
being promoted to 64-bit, causing data loss.

Specifically, when tgt_data >= 256, the expression (tgt_data << 24)
shifts on a 32-bit value, truncating the upper 8 bits before promotion
to __u64. Since tgt_data can be up to 16 bits (assertion allows up to
65535), values >= 256 would have their high byte lost.
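
A tiny worked example of the truncation (values chosen for illustration,
tag/op/q_id omitted for brevity):

	unsigned int tgt_data = 0x123;		/* >= 256, needs more than 8 bits */
	__u64 bad  = tgt_data << 24;		/* 32-bit shift wraps: 0x23000000 */
	__u64 good = (__u64)tgt_data << 24;	/* full value kept: 0x123000000 */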

Add explicit __u64 casts to both op and tgt_data before shifting to
ensure the shift operations happen in 64-bit space, preserving all
bits of the input values.

user_data_to_tgt_data() is only used by stripe.c, where at most 4 member
disks are supported, so it won't trigger this issue.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 tools/testing/selftests/ublk/kublk.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/ublk/kublk.h b/tools/testing/selftests/ublk/kublk.h
index fe42705c6d42..38d80e60e211 100644
--- a/tools/testing/selftests/ublk/kublk.h
+++ b/tools/testing/selftests/ublk/kublk.h
@@ -220,7 +220,7 @@ static inline __u64 build_user_data(unsigned tag, unsigned op,
 	_Static_assert(UBLK_MAX_QUEUES_SHIFT <= 7);
 	assert(!(tag >> 16) && !(op >> 8) && !(tgt_data >> 16) && !(q_id >> 7));
 
-	return tag | (op << 16) | (tgt_data << 24) |
+	return tag | ((__u64)op << 16) | ((__u64)tgt_data << 24) |
 		(__u64)q_id << 56 | (__u64)is_target_io << 63;
 }
 
-- 
2.47.0


* [PATCH V4 20/27] selftests: ublk: replace assert() with ublk_assert()
  2025-11-21  1:58 [PATCH V4 00/27] ublk: add UBLK_F_BATCH_IO Ming Lei
                   ` (18 preceding siblings ...)
  2025-11-21  1:58 ` [PATCH V4 19/27] selftests: ublk: fix user_data truncation for tgt_data >= 256 Ming Lei
@ 2025-11-21  1:58 ` Ming Lei
  2025-11-21  1:58 ` [PATCH V4 21/27] selftests: ublk: add ublk_io_buf_idx() for returning io buffer index Ming Lei
                   ` (8 subsequent siblings)
  28 siblings, 0 replies; 66+ messages in thread
From: Ming Lei @ 2025-11-21  1:58 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Caleb Sander Mateos, Uday Shankar, Stefani Seibold, Andrew Morton,
	linux-kernel, Ming Lei

Replace assert() with ublk_assert() since assertions are often triggered in
the daemon, where nothing may show up in the terminal.

Add ublk_assert() so we can log something to syslog when the assertion is
triggered.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 tools/testing/selftests/ublk/common.c      |  2 +-
 tools/testing/selftests/ublk/file_backed.c |  2 +-
 tools/testing/selftests/ublk/kublk.c       |  2 +-
 tools/testing/selftests/ublk/kublk.h       |  2 +-
 tools/testing/selftests/ublk/stripe.c      | 10 +++++-----
 tools/testing/selftests/ublk/utils.h       | 10 ++++++++++
 6 files changed, 19 insertions(+), 9 deletions(-)

diff --git a/tools/testing/selftests/ublk/common.c b/tools/testing/selftests/ublk/common.c
index 01580a6f8519..4c07bc37eb6d 100644
--- a/tools/testing/selftests/ublk/common.c
+++ b/tools/testing/selftests/ublk/common.c
@@ -16,7 +16,7 @@ int backing_file_tgt_init(struct ublk_dev *dev)
 {
 	int fd, i;
 
-	assert(dev->nr_fds == 1);
+	ublk_assert(dev->nr_fds == 1);
 
 	for (i = 0; i < dev->tgt.nr_backing_files; i++) {
 		char *file = dev->tgt.backing_file[i];
diff --git a/tools/testing/selftests/ublk/file_backed.c b/tools/testing/selftests/ublk/file_backed.c
index cd9fe69ecce2..9e7dd3859ea9 100644
--- a/tools/testing/selftests/ublk/file_backed.c
+++ b/tools/testing/selftests/ublk/file_backed.c
@@ -10,7 +10,7 @@ static enum io_uring_op ublk_to_uring_op(const struct ublksrv_io_desc *iod, int
 		return zc ? IORING_OP_READ_FIXED : IORING_OP_READ;
 	else if (ublk_op == UBLK_IO_OP_WRITE)
 		return zc ? IORING_OP_WRITE_FIXED : IORING_OP_WRITE;
-	assert(0);
+	ublk_assert(0);
 }
 
 static int loop_queue_flush_io(struct ublk_thread *t, struct ublk_queue *q,
diff --git a/tools/testing/selftests/ublk/kublk.c b/tools/testing/selftests/ublk/kublk.c
index f8fa102a627f..bb8da9ff247d 100644
--- a/tools/testing/selftests/ublk/kublk.c
+++ b/tools/testing/selftests/ublk/kublk.c
@@ -750,7 +750,7 @@ static void ublk_handle_uring_cmd(struct ublk_thread *t,
 	}
 
 	if (cqe->res == UBLK_IO_RES_OK) {
-		assert(tag < q->q_depth);
+		ublk_assert(tag < q->q_depth);
 		if (q->tgt_ops->queue_io)
 			q->tgt_ops->queue_io(t, q, tag);
 	} else if (cqe->res == UBLK_IO_RES_NEED_GET_DATA) {
diff --git a/tools/testing/selftests/ublk/kublk.h b/tools/testing/selftests/ublk/kublk.h
index 38d80e60e211..f5c0978f30c2 100644
--- a/tools/testing/selftests/ublk/kublk.h
+++ b/tools/testing/selftests/ublk/kublk.h
@@ -218,7 +218,7 @@ static inline __u64 build_user_data(unsigned tag, unsigned op,
 {
 	/* we only have 7 bits to encode q_id */
 	_Static_assert(UBLK_MAX_QUEUES_SHIFT <= 7);
-	assert(!(tag >> 16) && !(op >> 8) && !(tgt_data >> 16) && !(q_id >> 7));
+	ublk_assert(!(tag >> 16) && !(op >> 8) && !(tgt_data >> 16) && !(q_id >> 7));
 
 	return tag | ((__u64)op << 16) | ((__u64)tgt_data << 24) |
 		(__u64)q_id << 56 | (__u64)is_target_io << 63;
diff --git a/tools/testing/selftests/ublk/stripe.c b/tools/testing/selftests/ublk/stripe.c
index 791fa8dc1651..50874858a829 100644
--- a/tools/testing/selftests/ublk/stripe.c
+++ b/tools/testing/selftests/ublk/stripe.c
@@ -96,12 +96,12 @@ static void calculate_stripe_array(const struct stripe_conf *conf,
 			this->seq = seq;
 			s->nr += 1;
 		} else {
-			assert(seq == this->seq);
-			assert(this->start + this->nr_sects == stripe_off);
+			ublk_assert(seq == this->seq);
+			ublk_assert(this->start + this->nr_sects == stripe_off);
 			this->nr_sects += nr_sects;
 		}
 
-		assert(this->nr_vec < this->cap);
+		ublk_assert(this->nr_vec < this->cap);
 		this->vec[this->nr_vec].iov_base = (void *)(base + done);
 		this->vec[this->nr_vec++].iov_len = nr_sects << 9;
 
@@ -120,7 +120,7 @@ static inline enum io_uring_op stripe_to_uring_op(
 		return zc ? IORING_OP_READV_FIXED : IORING_OP_READV;
 	else if (ublk_op == UBLK_IO_OP_WRITE)
 		return zc ? IORING_OP_WRITEV_FIXED : IORING_OP_WRITEV;
-	assert(0);
+	ublk_assert(0);
 }
 
 static int stripe_queue_tgt_rw_io(struct ublk_thread *t, struct ublk_queue *q,
@@ -318,7 +318,7 @@ static int ublk_stripe_tgt_init(const struct dev_ctx *ctx, struct ublk_dev *dev)
 	if (!dev->tgt.nr_backing_files || dev->tgt.nr_backing_files > NR_STRIPE)
 		return -EINVAL;
 
-	assert(dev->nr_fds == dev->tgt.nr_backing_files + 1);
+	ublk_assert(dev->nr_fds == dev->tgt.nr_backing_files + 1);
 
 	for (i = 0; i < dev->tgt.nr_backing_files; i++)
 		dev->tgt.backing_file_size[i] &= ~((1 << chunk_shift) - 1);
diff --git a/tools/testing/selftests/ublk/utils.h b/tools/testing/selftests/ublk/utils.h
index a852e0b7153e..17eefed73690 100644
--- a/tools/testing/selftests/ublk/utils.h
+++ b/tools/testing/selftests/ublk/utils.h
@@ -43,6 +43,7 @@ static inline void ublk_err(const char *fmt, ...)
 
 	va_start(ap, fmt);
 	vfprintf(stderr, fmt, ap);
+	va_end(ap);
 }
 
 static inline void ublk_log(const char *fmt, ...)
@@ -52,6 +53,7 @@ static inline void ublk_log(const char *fmt, ...)
 
 		va_start(ap, fmt);
 		vfprintf(stdout, fmt, ap);
+		va_end(ap);
 	}
 }
 
@@ -62,7 +64,15 @@ static inline void ublk_dbg(int level, const char *fmt, ...)
 
 		va_start(ap, fmt);
 		vfprintf(stdout, fmt, ap);
+		va_end(ap);
 	}
 }
 
+#define ublk_assert(x)  do { \
+	if (!(x)) {     \
+		ublk_err("%s %d: assert!\n", __func__, __LINE__); \
+		assert(x);      \
+	}       \
+} while (0)
+
 #endif
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH V4 21/27] selftests: ublk: add ublk_io_buf_idx() for returning io buffer index
  2025-11-21  1:58 [PATCH V4 00/27] ublk: add UBLK_F_BATCH_IO Ming Lei
                   ` (19 preceding siblings ...)
  2025-11-21  1:58 ` [PATCH V4 20/27] selftests: ublk: replace assert() with ublk_assert() Ming Lei
@ 2025-11-21  1:58 ` Ming Lei
  2025-11-21  1:58 ` [PATCH V4 22/27] selftests: ublk: add batch buffer management infrastructure Ming Lei
                   ` (7 subsequent siblings)
  28 siblings, 0 replies; 66+ messages in thread
From: Ming Lei @ 2025-11-21  1:58 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Caleb Sander Mateos, Uday Shankar, Stefani Seibold, Andrew Morton,
	linux-kernel, Ming Lei

Since UBLK_F_PER_IO_DAEMON was added, the io buffer index may depend on the
current thread, because the common pattern is to use a per-pthread
io_ring_ctx for issuing ublk uring_cmds.

Add a helper that returns the io buffer index, so the buffer index
implementation details are hidden from target code.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 tools/testing/selftests/ublk/file_backed.c |  9 +++++----
 tools/testing/selftests/ublk/kublk.c       |  9 +++++----
 tools/testing/selftests/ublk/kublk.h       | 10 +++++++++-
 tools/testing/selftests/ublk/null.c        | 18 ++++++++++--------
 tools/testing/selftests/ublk/stripe.c      |  7 ++++---
 5 files changed, 33 insertions(+), 20 deletions(-)

diff --git a/tools/testing/selftests/ublk/file_backed.c b/tools/testing/selftests/ublk/file_backed.c
index 9e7dd3859ea9..58ac59528b74 100644
--- a/tools/testing/selftests/ublk/file_backed.c
+++ b/tools/testing/selftests/ublk/file_backed.c
@@ -36,6 +36,7 @@ static int loop_queue_tgt_rw_io(struct ublk_thread *t, struct ublk_queue *q,
 	enum io_uring_op op = ublk_to_uring_op(iod, zc | auto_zc);
 	struct io_uring_sqe *sqe[3];
 	void *addr = (zc | auto_zc) ? NULL : (void *)iod->addr;
+	unsigned short buf_idx = ublk_io_buf_idx(t, q, tag);
 
 	if (!zc || auto_zc) {
 		ublk_io_alloc_sqes(t, sqe, 1);
@@ -47,7 +48,7 @@ static int loop_queue_tgt_rw_io(struct ublk_thread *t, struct ublk_queue *q,
 				iod->nr_sectors << 9,
 				iod->start_sector << 9);
 		if (auto_zc)
-			sqe[0]->buf_index = tag;
+			sqe[0]->buf_index = buf_idx;
 		io_uring_sqe_set_flags(sqe[0], IOSQE_FIXED_FILE);
 		/* bit63 marks us as tgt io */
 		sqe[0]->user_data = build_user_data(tag, ublk_op, 0, q->q_id, 1);
@@ -56,7 +57,7 @@ static int loop_queue_tgt_rw_io(struct ublk_thread *t, struct ublk_queue *q,
 
 	ublk_io_alloc_sqes(t, sqe, 3);
 
-	io_uring_prep_buf_register(sqe[0], q, tag, q->q_id, ublk_get_io(q, tag)->buf_index);
+	io_uring_prep_buf_register(sqe[0], q, tag, q->q_id, buf_idx);
 	sqe[0]->flags |= IOSQE_CQE_SKIP_SUCCESS | IOSQE_IO_HARDLINK;
 	sqe[0]->user_data = build_user_data(tag,
 			ublk_cmd_op_nr(sqe[0]->cmd_op), 0, q->q_id, 1);
@@ -64,11 +65,11 @@ static int loop_queue_tgt_rw_io(struct ublk_thread *t, struct ublk_queue *q,
 	io_uring_prep_rw(op, sqe[1], ublk_get_registered_fd(q, 1) /*fds[1]*/, 0,
 		iod->nr_sectors << 9,
 		iod->start_sector << 9);
-	sqe[1]->buf_index = tag;
+	sqe[1]->buf_index = buf_idx;
 	sqe[1]->flags |= IOSQE_FIXED_FILE | IOSQE_IO_HARDLINK;
 	sqe[1]->user_data = build_user_data(tag, ublk_op, 0, q->q_id, 1);
 
-	io_uring_prep_buf_unregister(sqe[2], q, tag, q->q_id, ublk_get_io(q, tag)->buf_index);
+	io_uring_prep_buf_unregister(sqe[2], q, tag, q->q_id, buf_idx);
 	sqe[2]->user_data = build_user_data(tag, ublk_cmd_op_nr(sqe[2]->cmd_op), 0, q->q_id, 1);
 
 	return 2;
diff --git a/tools/testing/selftests/ublk/kublk.c b/tools/testing/selftests/ublk/kublk.c
index bb8da9ff247d..1665a7865af4 100644
--- a/tools/testing/selftests/ublk/kublk.c
+++ b/tools/testing/selftests/ublk/kublk.c
@@ -579,16 +579,17 @@ static void ublk_dev_unprep(struct ublk_dev *dev)
 	close(dev->fds[0]);
 }
 
-static void ublk_set_auto_buf_reg(const struct ublk_queue *q,
+static void ublk_set_auto_buf_reg(const struct ublk_thread *t,
+				  const struct ublk_queue *q,
 				  struct io_uring_sqe *sqe,
 				  unsigned short tag)
 {
 	struct ublk_auto_buf_reg buf = {};
 
 	if (q->tgt_ops->buf_index)
-		buf.index = q->tgt_ops->buf_index(q, tag);
+		buf.index = q->tgt_ops->buf_index(t, q, tag);
 	else
-		buf.index = q->ios[tag].buf_index;
+		buf.index = ublk_io_buf_idx(t, q, tag);
 
 	if (ublk_queue_auto_zc_fallback(q))
 		buf.flags = UBLK_AUTO_BUF_REG_FALLBACK;
@@ -655,7 +656,7 @@ int ublk_queue_io_cmd(struct ublk_thread *t, struct ublk_io *io)
 		cmd->addr	= 0;
 
 	if (ublk_queue_use_auto_zc(q))
-		ublk_set_auto_buf_reg(q, sqe[0], io->tag);
+		ublk_set_auto_buf_reg(t, q, sqe[0], io->tag);
 
 	user_data = build_user_data(io->tag, _IOC_NR(cmd_op), 0, q->q_id, 0);
 	io_uring_sqe_set_data64(sqe[0], user_data);
diff --git a/tools/testing/selftests/ublk/kublk.h b/tools/testing/selftests/ublk/kublk.h
index f5c0978f30c2..5b951ad9b03d 100644
--- a/tools/testing/selftests/ublk/kublk.h
+++ b/tools/testing/selftests/ublk/kublk.h
@@ -143,7 +143,8 @@ struct ublk_tgt_ops {
 	void (*usage)(const struct ublk_tgt_ops *ops);
 
 	/* return buffer index for UBLK_F_AUTO_BUF_REG */
-	unsigned short (*buf_index)(const struct ublk_queue *, int tag);
+	unsigned short (*buf_index)(const struct ublk_thread *t,
+			const struct ublk_queue *, int tag);
 };
 
 struct ublk_tgt {
@@ -351,6 +352,13 @@ static inline void ublk_set_sqe_cmd_op(struct io_uring_sqe *sqe, __u32 cmd_op)
 	addr[1] = 0;
 }
 
+static inline unsigned short ublk_io_buf_idx(const struct ublk_thread *t,
+					     const struct ublk_queue *q,
+					     unsigned tag)
+{
+	return q->ios[tag].buf_index;
+}
+
 static inline struct ublk_io *ublk_get_io(struct ublk_queue *q, unsigned tag)
 {
 	return &q->ios[tag];
diff --git a/tools/testing/selftests/ublk/null.c b/tools/testing/selftests/ublk/null.c
index 280043f6b689..819f72ac2da9 100644
--- a/tools/testing/selftests/ublk/null.c
+++ b/tools/testing/selftests/ublk/null.c
@@ -43,12 +43,12 @@ static int ublk_null_tgt_init(const struct dev_ctx *ctx, struct ublk_dev *dev)
 }
 
 static void __setup_nop_io(int tag, const struct ublksrv_io_desc *iod,
-		struct io_uring_sqe *sqe, int q_id)
+		struct io_uring_sqe *sqe, int q_id, unsigned buf_idx)
 {
 	unsigned ublk_op = ublksrv_get_op(iod);
 
 	io_uring_prep_nop(sqe);
-	sqe->buf_index = tag;
+	sqe->buf_index = buf_idx;
 	sqe->flags |= IOSQE_FIXED_FILE;
 	sqe->rw_flags = IORING_NOP_FIXED_BUFFER | IORING_NOP_INJECT_RESULT;
 	sqe->len = iod->nr_sectors << 9; 	/* injected result */
@@ -60,18 +60,19 @@ static int null_queue_zc_io(struct ublk_thread *t, struct ublk_queue *q,
 {
 	const struct ublksrv_io_desc *iod = ublk_get_iod(q, tag);
 	struct io_uring_sqe *sqe[3];
+	unsigned short buf_idx = ublk_io_buf_idx(t, q, tag);
 
 	ublk_io_alloc_sqes(t, sqe, 3);
 
-	io_uring_prep_buf_register(sqe[0], q, tag, q->q_id, ublk_get_io(q, tag)->buf_index);
+	io_uring_prep_buf_register(sqe[0], q, tag, q->q_id, buf_idx);
 	sqe[0]->user_data = build_user_data(tag,
 			ublk_cmd_op_nr(sqe[0]->cmd_op), 0, q->q_id, 1);
 	sqe[0]->flags |= IOSQE_CQE_SKIP_SUCCESS | IOSQE_IO_HARDLINK;
 
-	__setup_nop_io(tag, iod, sqe[1], q->q_id);
+	__setup_nop_io(tag, iod, sqe[1], q->q_id, buf_idx);
 	sqe[1]->flags |= IOSQE_IO_HARDLINK;
 
-	io_uring_prep_buf_unregister(sqe[2], q, tag, q->q_id, ublk_get_io(q, tag)->buf_index);
+	io_uring_prep_buf_unregister(sqe[2], q, tag, q->q_id, buf_idx);
 	sqe[2]->user_data = build_user_data(tag, ublk_cmd_op_nr(sqe[2]->cmd_op), 0, q->q_id, 1);
 
 	// buf register is marked as IOSQE_CQE_SKIP_SUCCESS
@@ -85,7 +86,7 @@ static int null_queue_auto_zc_io(struct ublk_thread *t, struct ublk_queue *q,
 	struct io_uring_sqe *sqe[1];
 
 	ublk_io_alloc_sqes(t, sqe, 1);
-	__setup_nop_io(tag, iod, sqe[0], q->q_id);
+	__setup_nop_io(tag, iod, sqe[0], q->q_id, ublk_io_buf_idx(t, q, tag));
 	return 1;
 }
 
@@ -136,11 +137,12 @@ static int ublk_null_queue_io(struct ublk_thread *t, struct ublk_queue *q,
  * return invalid buffer index for triggering auto buffer register failure,
  * then UBLK_IO_RES_NEED_REG_BUF handling is covered
  */
-static unsigned short ublk_null_buf_index(const struct ublk_queue *q, int tag)
+static unsigned short ublk_null_buf_index(const struct ublk_thread *t,
+		const struct ublk_queue *q, int tag)
 {
 	if (ublk_queue_auto_zc_fallback(q))
 		return (unsigned short)-1;
-	return q->ios[tag].buf_index;
+	return ublk_io_buf_idx(t, q, tag);
 }
 
 const struct ublk_tgt_ops null_tgt_ops = {
diff --git a/tools/testing/selftests/ublk/stripe.c b/tools/testing/selftests/ublk/stripe.c
index 50874858a829..db281a879877 100644
--- a/tools/testing/selftests/ublk/stripe.c
+++ b/tools/testing/selftests/ublk/stripe.c
@@ -135,6 +135,7 @@ static int stripe_queue_tgt_rw_io(struct ublk_thread *t, struct ublk_queue *q,
 	struct ublk_io *io = ublk_get_io(q, tag);
 	int i, extra = zc ? 2 : 0;
 	void *base = (zc | auto_zc) ? NULL : (void *)iod->addr;
+	unsigned short buf_idx = ublk_io_buf_idx(t, q, tag);
 
 	io->private_data = s;
 	calculate_stripe_array(conf, iod, s, base);
@@ -142,7 +143,7 @@ static int stripe_queue_tgt_rw_io(struct ublk_thread *t, struct ublk_queue *q,
 	ublk_io_alloc_sqes(t, sqe, s->nr + extra);
 
 	if (zc) {
-		io_uring_prep_buf_register(sqe[0], q, tag, q->q_id, io->buf_index);
+		io_uring_prep_buf_register(sqe[0], q, tag, q->q_id, buf_idx);
 		sqe[0]->flags |= IOSQE_CQE_SKIP_SUCCESS | IOSQE_IO_HARDLINK;
 		sqe[0]->user_data = build_user_data(tag,
 			ublk_cmd_op_nr(sqe[0]->cmd_op), 0, q->q_id, 1);
@@ -158,7 +159,7 @@ static int stripe_queue_tgt_rw_io(struct ublk_thread *t, struct ublk_queue *q,
 				t->start << 9);
 		io_uring_sqe_set_flags(sqe[i], IOSQE_FIXED_FILE);
 		if (auto_zc || zc) {
-			sqe[i]->buf_index = tag;
+			sqe[i]->buf_index = buf_idx;
 			if (zc)
 				sqe[i]->flags |= IOSQE_IO_HARDLINK;
 		}
@@ -168,7 +169,7 @@ static int stripe_queue_tgt_rw_io(struct ublk_thread *t, struct ublk_queue *q,
 	if (zc) {
 		struct io_uring_sqe *unreg = sqe[s->nr + 1];
 
-		io_uring_prep_buf_unregister(unreg, q, tag, q->q_id, io->buf_index);
+		io_uring_prep_buf_unregister(unreg, q, tag, q->q_id, buf_idx);
 		unreg->user_data = build_user_data(
 			tag, ublk_cmd_op_nr(unreg->cmd_op), 0, q->q_id, 1);
 	}
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH V4 22/27] selftests: ublk: add batch buffer management infrastructure
  2025-11-21  1:58 [PATCH V4 00/27] ublk: add UBLK_F_BATCH_IO Ming Lei
                   ` (20 preceding siblings ...)
  2025-11-21  1:58 ` [PATCH V4 21/27] selftests: ublk: add ublk_io_buf_idx() for returning io buffer index Ming Lei
@ 2025-11-21  1:58 ` Ming Lei
  2025-11-21  1:58 ` [PATCH V4 23/27] selftests: ublk: handle UBLK_U_IO_PREP_IO_CMDS Ming Lei
                   ` (6 subsequent siblings)
  28 siblings, 0 replies; 66+ messages in thread
From: Ming Lei @ 2025-11-21  1:58 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Caleb Sander Mateos, Uday Shankar, Stefani Seibold, Andrew Morton,
	linux-kernel, Ming Lei

Add the foundational infrastructure for UBLK_F_BATCH_IO buffer
management including:

- Allocator utility functions for small-sized per-thread allocations
- Batch buffer allocation and deallocation functions
- Buffer index management for commit buffers
- Thread state management for batch I/O mode
- Buffer size calculation based on device features

This prepares the groundwork for handling batch I/O commands by
establishing the buffer management layer needed for UBLK_U_IO_PREP_IO_CMDS
and UBLK_U_IO_COMMIT_IO_CMDS operations.

The allocator uses CPU sets for efficient per-thread buffer tracking, and
two commit buffers are pre-allocated per thread to handle overlapping
command operations.
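
For reference, a minimal usage sketch of the CPU-set backed allocator (the
helper names match the ones added to utils.h below; the function name and
the slot count of two are illustrative, the latter mirroring the per-thread
commit buffers):

	#include <errno.h>
	#include "utils.h"	/* struct allocator + allocator_*() from this patch */

	int commit_slot_demo(void)
	{
		struct allocator a;
		int slot;

		if (allocator_init(&a, 2))	/* track two commit-buffer slots */
			return -ENOMEM;

		slot = allocator_get(&a);	/* first free slot, -1 if none */
		if (slot >= 0)
			allocator_put(&a, slot);	/* mark it free again */

		allocator_deinit(&a);
		return 0;
	}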

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 tools/testing/selftests/ublk/Makefile |   2 +-
 tools/testing/selftests/ublk/batch.c  | 152 ++++++++++++++++++++++++++
 tools/testing/selftests/ublk/kublk.c  |  26 ++++-
 tools/testing/selftests/ublk/kublk.h  |  53 +++++++++
 tools/testing/selftests/ublk/utils.h  |  54 +++++++++
 5 files changed, 283 insertions(+), 4 deletions(-)
 create mode 100644 tools/testing/selftests/ublk/batch.c

diff --git a/tools/testing/selftests/ublk/Makefile b/tools/testing/selftests/ublk/Makefile
index 770269efe42a..a724276622d0 100644
--- a/tools/testing/selftests/ublk/Makefile
+++ b/tools/testing/selftests/ublk/Makefile
@@ -44,7 +44,7 @@ TEST_GEN_PROGS_EXTENDED = kublk
 
 include ../lib.mk
 
-$(TEST_GEN_PROGS_EXTENDED): kublk.c null.c file_backed.c common.c stripe.c \
+$(TEST_GEN_PROGS_EXTENDED): kublk.c batch.c null.c file_backed.c common.c stripe.c \
 	fault_inject.c
 
 check:
diff --git a/tools/testing/selftests/ublk/batch.c b/tools/testing/selftests/ublk/batch.c
new file mode 100644
index 000000000000..609e6073c9c0
--- /dev/null
+++ b/tools/testing/selftests/ublk/batch.c
@@ -0,0 +1,152 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Description: UBLK_F_BATCH_IO buffer management
+ */
+
+#include "kublk.h"
+
+static inline void *ublk_get_commit_buf(struct ublk_thread *t,
+					unsigned short buf_idx)
+{
+	unsigned idx;
+
+	if (buf_idx < t->commit_buf_start ||
+			buf_idx >= t->commit_buf_start + t->nr_commit_buf)
+		return NULL;
+	idx = buf_idx - t->commit_buf_start;
+	return t->commit_buf + idx * t->commit_buf_size;
+}
+
+/*
+ * Allocate one buffer for UBLK_U_IO_PREP_IO_CMDS or UBLK_U_IO_COMMIT_IO_CMDS
+ *
+ * Buffer index is returned.
+ */
+static inline unsigned short ublk_alloc_commit_buf(struct ublk_thread *t)
+{
+	int idx = allocator_get(&t->commit_buf_alloc);
+
+	if (idx >= 0)
+		return idx + t->commit_buf_start;
+	return UBLKS_T_COMMIT_BUF_INV_IDX;
+}
+
+/*
+ * Free one commit buffer which is used by UBLK_U_IO_PREP_IO_CMDS or
+ * UBLK_U_IO_COMMIT_IO_CMDS
+ */
+static inline void ublk_free_commit_buf(struct ublk_thread *t,
+					 unsigned short i)
+{
+	unsigned short idx = i - t->commit_buf_start;
+
+	ublk_assert(idx < t->nr_commit_buf);
+	ublk_assert(allocator_get_val(&t->commit_buf_alloc, idx) != 0);
+
+	allocator_put(&t->commit_buf_alloc, idx);
+}
+
+static unsigned char ublk_commit_elem_buf_size(struct ublk_dev *dev)
+{
+	if (dev->dev_info.flags & (UBLK_F_SUPPORT_ZERO_COPY | UBLK_F_USER_COPY |
+				UBLK_F_AUTO_BUF_REG))
+		return 8;
+
+	/* one extra 8 bytes for carrying the buffer address */
+	return 16;
+}
+
+static unsigned ublk_commit_buf_size(struct ublk_thread *t)
+{
+	struct ublk_dev *dev = t->dev;
+	unsigned elem_size = ublk_commit_elem_buf_size(dev);
+	unsigned int total = elem_size * dev->dev_info.queue_depth;
+	unsigned int page_sz = getpagesize();
+
+	return round_up(total, page_sz);
+}
+
+static void free_batch_commit_buf(struct ublk_thread *t)
+{
+	if (t->commit_buf) {
+		unsigned buf_size = ublk_commit_buf_size(t);
+		unsigned int total = buf_size * t->nr_commit_buf;
+
+		munlock(t->commit_buf, total);
+		free(t->commit_buf);
+	}
+	allocator_deinit(&t->commit_buf_alloc);
+}
+
+static int alloc_batch_commit_buf(struct ublk_thread *t)
+{
+	unsigned buf_size = ublk_commit_buf_size(t);
+	unsigned int total = buf_size * t->nr_commit_buf;
+	unsigned int page_sz = getpagesize();
+	void *buf = NULL;
+	int ret;
+
+	allocator_init(&t->commit_buf_alloc, t->nr_commit_buf);
+
+	t->commit_buf = NULL;
+	ret = posix_memalign(&buf, page_sz, total);
+	if (ret || !buf)
+		goto fail;
+
+	t->commit_buf = buf;
+
+	/* lock commit buffer pages for fast access */
+	if (mlock(t->commit_buf, total))
+		ublk_err("%s: can't lock commit buffer %s\n", __func__,
+			strerror(errno));
+
+	return 0;
+
+fail:
+	free_batch_commit_buf(t);
+	return ret;
+}
+
+void ublk_batch_prepare(struct ublk_thread *t)
+{
+	/*
+	 * We only handle a single device in this thread context.
+	 *
+	 * All queues share the same feature flags, so use queue 0's flags
+	 * to calculate the uring_cmd flags.
+	 *
+	 * This isn't elegant, but it works well enough so far.
+	 */
+	struct ublk_queue *q = &t->dev->q[0];
+
+	t->commit_buf_elem_size = ublk_commit_elem_buf_size(t->dev);
+	t->commit_buf_size = ublk_commit_buf_size(t);
+	t->commit_buf_start = t->nr_bufs;
+	t->nr_commit_buf = 2;
+	t->nr_bufs += t->nr_commit_buf;
+
+	t->cmd_flags = 0;
+	if (ublk_queue_use_auto_zc(q)) {
+		if (ublk_queue_auto_zc_fallback(q))
+			t->cmd_flags |= UBLK_BATCH_F_AUTO_BUF_REG_FALLBACK;
+	} else if (!ublk_queue_no_buf(q))
+		t->cmd_flags |= UBLK_BATCH_F_HAS_BUF_ADDR;
+
+	t->state |= UBLKS_T_BATCH_IO;
+
+	ublk_log("%s: thread %d commit(nr_bufs %u, buf_size %u, start %u)\n",
+			__func__, t->idx,
+			t->nr_commit_buf, t->commit_buf_size,
+			t->nr_bufs);
+}
+
+int ublk_batch_alloc_buf(struct ublk_thread *t)
+{
+	ublk_assert(t->nr_commit_buf < 16);
+	return alloc_batch_commit_buf(t);
+}
+
+void ublk_batch_free_buf(struct ublk_thread *t)
+{
+	free_batch_commit_buf(t);
+}
diff --git a/tools/testing/selftests/ublk/kublk.c b/tools/testing/selftests/ublk/kublk.c
index 1665a7865af4..29594612edc9 100644
--- a/tools/testing/selftests/ublk/kublk.c
+++ b/tools/testing/selftests/ublk/kublk.c
@@ -423,6 +423,8 @@ static void ublk_thread_deinit(struct ublk_thread *t)
 {
 	io_uring_unregister_buffers(&t->ring);
 
+	ublk_batch_free_buf(t);
+
 	io_uring_unregister_ring_fd(&t->ring);
 
 	if (t->ring.ring_fd > 0) {
@@ -505,15 +507,33 @@ static int ublk_thread_init(struct ublk_thread *t, unsigned long long extra_flag
 		unsigned nr_ios = dev->dev_info.queue_depth * dev->dev_info.nr_hw_queues;
 		unsigned max_nr_ios_per_thread = nr_ios / dev->nthreads;
 		max_nr_ios_per_thread += !!(nr_ios % dev->nthreads);
-		ret = io_uring_register_buffers_sparse(
-			&t->ring, max_nr_ios_per_thread);
+
+		t->nr_bufs = max_nr_ios_per_thread;
+	} else {
+		t->nr_bufs = 0;
+	}
+
+	if (ublk_dev_batch_io(dev))
+		ublk_batch_prepare(t);
+
+	if (t->nr_bufs) {
+		ret = io_uring_register_buffers_sparse(&t->ring, t->nr_bufs);
 		if (ret) {
-			ublk_err("ublk dev %d thread %d register spare buffers failed %d",
+			ublk_err("ublk dev %d thread %d register spare buffers failed %d\n",
 					dev->dev_info.dev_id, t->idx, ret);
 			goto fail;
 		}
 	}
 
+	if (ublk_dev_batch_io(dev)) {
+		ret = ublk_batch_alloc_buf(t);
+		if (ret) {
+			ublk_err("ublk dev %d thread %d alloc batch buf failed %d\n",
+				dev->dev_info.dev_id, t->idx, ret);
+			goto fail;
+		}
+	}
+
 	io_uring_register_ring_fd(&t->ring);
 
 	if (flags & UBLKS_Q_NO_UBLK_FIXED_FD) {
diff --git a/tools/testing/selftests/ublk/kublk.h b/tools/testing/selftests/ublk/kublk.h
index 5b951ad9b03d..e75c28680783 100644
--- a/tools/testing/selftests/ublk/kublk.h
+++ b/tools/testing/selftests/ublk/kublk.h
@@ -174,15 +174,40 @@ struct ublk_queue {
 	struct ublk_io ios[UBLK_QUEUE_DEPTH];
 };
 
+/* align with `ublk_elem_header` */
+struct ublk_batch_elem {
+	__u16 tag;
+	__u16 buf_index;
+	__s32 result;
+	__u64 buf_addr;
+};
+
 struct ublk_thread {
 	struct ublk_dev *dev;
 	unsigned idx;
 
 #define UBLKS_T_STOPPING	(1U << 0)
 #define UBLKS_T_IDLE	(1U << 1)
+#define UBLKS_T_BATCH_IO	(1U << 31) 	/* readonly */
 	unsigned state;
 	unsigned int cmd_inflight;
 	unsigned int io_inflight;
+
+	unsigned short nr_bufs;
+
+	/* the following fields are only used by BATCH_IO */
+	unsigned short commit_buf_start;
+	unsigned char  commit_buf_elem_size;
+	/*
+	 * We only support a single device, so pre-calculate commit/prep flags
+	 */
+	unsigned short cmd_flags;
+	unsigned int   nr_commit_buf;
+	unsigned int   commit_buf_size;
+	void *commit_buf;
+#define UBLKS_T_COMMIT_BUF_INV_IDX  ((unsigned short)-1)
+	struct allocator commit_buf_alloc;
+
 	struct io_uring ring;
 };
 
@@ -203,6 +228,27 @@ struct ublk_dev {
 
 extern int ublk_queue_io_cmd(struct ublk_thread *t, struct ublk_io *io);
 
+static inline int __ublk_use_batch_io(__u64 flags)
+{
+	return flags & UBLK_F_BATCH_IO;
+}
+
+static inline int ublk_queue_batch_io(const struct ublk_queue *q)
+{
+	return __ublk_use_batch_io(q->flags);
+}
+
+static inline int ublk_dev_batch_io(const struct ublk_dev *dev)
+{
+	return __ublk_use_batch_io(dev->dev_info.flags);
+}
+
+/* only work for handle single device in this pthread context */
+static inline int ublk_thread_batch_io(const struct ublk_thread *t)
+{
+	return t->state & UBLKS_T_BATCH_IO;
+}
+
 
 static inline int ublk_io_auto_zc_fallback(const struct ublksrv_io_desc *iod)
 {
@@ -418,6 +464,13 @@ static inline int ublk_queue_no_buf(const struct ublk_queue *q)
 	return ublk_queue_use_zc(q) || ublk_queue_use_auto_zc(q);
 }
 
+/* Initialize batch I/O state and calculate buffer parameters */
+void ublk_batch_prepare(struct ublk_thread *t);
+/* Allocate and register commit buffers for batch operations */
+int ublk_batch_alloc_buf(struct ublk_thread *t);
+/* Free commit buffers and cleanup batch allocator */
+void ublk_batch_free_buf(struct ublk_thread *t);
+
 extern const struct ublk_tgt_ops null_tgt_ops;
 extern const struct ublk_tgt_ops loop_tgt_ops;
 extern const struct ublk_tgt_ops stripe_tgt_ops;
diff --git a/tools/testing/selftests/ublk/utils.h b/tools/testing/selftests/ublk/utils.h
index 17eefed73690..aab522f26167 100644
--- a/tools/testing/selftests/ublk/utils.h
+++ b/tools/testing/selftests/ublk/utils.h
@@ -21,6 +21,60 @@
 #define round_up(val, rnd) \
 	(((val) + ((rnd) - 1)) & ~((rnd) - 1))
 
+/* small sized & per-thread allocator */
+struct allocator {
+	unsigned int size;
+	cpu_set_t *set;
+};
+
+static inline int allocator_init(struct allocator *a, unsigned size)
+{
+	a->set = CPU_ALLOC(size);
+	a->size = size;
+	if (!a->set)
+		return -ENOMEM;
+	CPU_ZERO_S(CPU_ALLOC_SIZE(size), a->set); /* CPU_ALLOC() doesn't zero */
+	return 0;
+}
+
+static inline void allocator_deinit(struct allocator *a)
+{
+	CPU_FREE(a->set);
+	a->set = NULL;
+	a->size = 0;
+}
+
+static inline int allocator_get(struct allocator *a)
+{
+	int i;
+
+	for (i = 0; i < a->size; i += 1) {
+		size_t set_size = CPU_ALLOC_SIZE(a->size);
+
+		if (!CPU_ISSET_S(i, set_size, a->set)) {
+			CPU_SET_S(i, set_size, a->set);
+			return i;
+		}
+	}
+
+	return -1;
+}
+
+static inline void allocator_put(struct allocator *a, int i)
+{
+	size_t set_size = CPU_ALLOC_SIZE(a->size);
+
+	if (i >= 0 && i < a->size)
+		CPU_CLR_S(i, set_size, a->set);
+}
+
+static inline int allocator_get_val(struct allocator *a, int i)
+{
+	size_t set_size = CPU_ALLOC_SIZE(a->size);
+
+	return CPU_ISSET_S(i, set_size, a->set);
+}
+
 static inline unsigned int ilog2(unsigned int x)
 {
 	if (x == 0)
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH V4 23/27] selftests: ublk: handle UBLK_U_IO_PREP_IO_CMDS
  2025-11-21  1:58 [PATCH V4 00/27] ublk: add UBLK_F_BATCH_IO Ming Lei
                   ` (21 preceding siblings ...)
  2025-11-21  1:58 ` [PATCH V4 22/27] selftests: ublk: add batch buffer management infrastructure Ming Lei
@ 2025-11-21  1:58 ` Ming Lei
  2025-11-21  1:58 ` [PATCH V4 24/27] selftests: ublk: handle UBLK_U_IO_COMMIT_IO_CMDS Ming Lei
                   ` (5 subsequent siblings)
  28 siblings, 0 replies; 66+ messages in thread
From: Ming Lei @ 2025-11-21  1:58 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Caleb Sander Mateos, Uday Shankar, Stefani Seibold, Andrew Morton,
	linux-kernel, Ming Lei

Implement support for UBLK_U_IO_PREP_IO_CMDS in the batch I/O framework:

- Add batch command initialization and setup functions
- Implement prep command queueing with proper buffer management
- Add command completion handling for prep and commit commands
- Integrate batch I/O setup into thread initialization
- Update CQE handling to support batch commands

The implementation uses the previously established buffer management
infrastructure to queue UBLK_U_IO_PREP_IO_CMDS commands. Commands are
prepared in the first thread context and use commit buffers for
efficient command batching.

Key changes:
- ublk_batch_queue_prep_io_cmds() prepares I/O command batches
- ublk_batch_compl_cmd() handles batch command completions
- Modified thread setup to use batch operations when enabled
- Enhanced buffer index calculation for batch mode
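
For reference, a standalone sketch of how one prep batch is packed; the
element layout is mirrored locally for illustration and corresponds to the
16-byte form of struct ublk_batch_elem used when buffer addresses are
carried (queue depth and buffers are made up):

	#include <stdint.h>
	#include <stdio.h>

	struct demo_elem {		/* mirrors struct ublk_batch_elem */
		uint16_t tag;
		uint16_t buf_index;
		int32_t  result;
		uint64_t buf_addr;
	};

	int main(void)
	{
		enum { QD = 4 };		/* illustrative queue depth */
		static char bufs[QD][4096];	/* stand-in per-io buffers */
		struct demo_elem prep[QD];
		unsigned i;

		for (i = 0; i < QD; i++) {
			prep[i].tag = i;
			prep[i].buf_index = 0;
			prep[i].result = 0;
			prep[i].buf_addr = (uint64_t)(uintptr_t)bufs[i];
		}

		/*
		 * the whole array is carried by one UBLK_U_IO_PREP_IO_CMDS:
		 * sqe->addr points at prep[], sqe->len is nr_elem * elem_bytes
		 */
		printf("len=%zu (%zu bytes x %d elems)\n",
		       sizeof(prep), sizeof(prep[0]), QD);
		return 0;
	}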

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 tools/testing/selftests/ublk/batch.c | 113 +++++++++++++++++++++++++++
 tools/testing/selftests/ublk/kublk.c |  46 ++++++++---
 tools/testing/selftests/ublk/kublk.h |  22 ++++++
 3 files changed, 171 insertions(+), 10 deletions(-)

diff --git a/tools/testing/selftests/ublk/batch.c b/tools/testing/selftests/ublk/batch.c
index 609e6073c9c0..01f00c21dfdb 100644
--- a/tools/testing/selftests/ublk/batch.c
+++ b/tools/testing/selftests/ublk/batch.c
@@ -150,3 +150,116 @@ void ublk_batch_free_buf(struct ublk_thread *t)
 {
 	free_batch_commit_buf(t);
 }
+
+static void ublk_init_batch_cmd(struct ublk_thread *t, __u16 q_id,
+				struct io_uring_sqe *sqe, unsigned op,
+				unsigned short elem_bytes,
+				unsigned short nr_elem,
+				unsigned short buf_idx)
+{
+	struct ublk_batch_io *cmd;
+	__u64 user_data;
+
+	cmd = (struct ublk_batch_io *)ublk_get_sqe_cmd(sqe);
+
+	ublk_set_sqe_cmd_op(sqe, op);
+
+	sqe->fd	= 0;	/* dev->fds[0] */
+	sqe->opcode	= IORING_OP_URING_CMD;
+	sqe->flags	= IOSQE_FIXED_FILE;
+
+	cmd->q_id	= q_id;
+	cmd->flags	= 0;
+	cmd->reserved 	= 0;
+	cmd->elem_bytes = elem_bytes;
+	cmd->nr_elem	= nr_elem;
+
+	user_data = build_user_data(buf_idx, _IOC_NR(op), 0, q_id, 0);
+	io_uring_sqe_set_data64(sqe, user_data);
+
+	t->cmd_inflight += 1;
+
+	ublk_dbg(UBLK_DBG_IO_CMD, "%s: thread %u qid %d cmd_op %x data %llx "
+			"nr_elem %u elem_bytes %u buf_size %u buf_idx %d "
+			"cmd_inflight %u\n",
+			__func__, t->idx, q_id, op, user_data,
+			cmd->nr_elem, cmd->elem_bytes,
+			nr_elem * elem_bytes, buf_idx, t->cmd_inflight);
+}
+
+static void ublk_setup_commit_sqe(struct ublk_thread *t,
+				  struct io_uring_sqe *sqe,
+				  unsigned short buf_idx)
+{
+	struct ublk_batch_io *cmd;
+
+	cmd = (struct ublk_batch_io *)ublk_get_sqe_cmd(sqe);
+
+	/* Use plain user buffer instead of fixed buffer */
+	cmd->flags |= t->cmd_flags;
+}
+
+int ublk_batch_queue_prep_io_cmds(struct ublk_thread *t, struct ublk_queue *q)
+{
+	unsigned short nr_elem = q->q_depth;
+	unsigned short buf_idx = ublk_alloc_commit_buf(t);
+	struct io_uring_sqe *sqe;
+	void *buf;
+	int i;
+
+	ublk_assert(buf_idx != UBLKS_T_COMMIT_BUF_INV_IDX);
+
+	ublk_io_alloc_sqes(t, &sqe, 1);
+
+	ublk_assert(nr_elem == q->q_depth);
+	buf = ublk_get_commit_buf(t, buf_idx);
+	for (i = 0; i < nr_elem; i++) {
+		struct ublk_batch_elem *elem = (struct ublk_batch_elem *)(
+				buf + i * t->commit_buf_elem_size);
+		struct ublk_io *io = &q->ios[i];
+
+		elem->tag = i;
+		elem->result = 0;
+
+		if (ublk_queue_use_auto_zc(q))
+			elem->buf_index = ublk_batch_io_buf_idx(t, q, i);
+		else if (!ublk_queue_no_buf(q))
+			elem->buf_addr = (__u64)io->buf_addr;
+	}
+
+	sqe->addr = (__u64)buf;
+	sqe->len = t->commit_buf_elem_size * nr_elem;
+
+	ublk_init_batch_cmd(t, q->q_id, sqe, UBLK_U_IO_PREP_IO_CMDS,
+			t->commit_buf_elem_size, nr_elem, buf_idx);
+	ublk_setup_commit_sqe(t, sqe, buf_idx);
+	return 0;
+}
+
+static void ublk_batch_compl_commit_cmd(struct ublk_thread *t,
+					const struct io_uring_cqe *cqe,
+					unsigned op)
+{
+	unsigned short buf_idx = user_data_to_tag(cqe->user_data);
+
+	if (op == _IOC_NR(UBLK_U_IO_PREP_IO_CMDS))
+		ublk_assert(cqe->res == 0);
+	else if (op == _IOC_NR(UBLK_U_IO_COMMIT_IO_CMDS))
+		;//assert(cqe->res == t->commit_buf_size);
+	else
+		ublk_assert(0);
+
+	ublk_free_commit_buf(t, buf_idx);
+}
+
+void ublk_batch_compl_cmd(struct ublk_thread *t,
+			  const struct io_uring_cqe *cqe)
+{
+	unsigned op = user_data_to_op(cqe->user_data);
+
+	if (op == _IOC_NR(UBLK_U_IO_PREP_IO_CMDS) ||
+			op == _IOC_NR(UBLK_U_IO_COMMIT_IO_CMDS)) {
+		ublk_batch_compl_commit_cmd(t, cqe, op);
+		return;
+	}
+}
diff --git a/tools/testing/selftests/ublk/kublk.c b/tools/testing/selftests/ublk/kublk.c
index 29594612edc9..e981fcf18475 100644
--- a/tools/testing/selftests/ublk/kublk.c
+++ b/tools/testing/selftests/ublk/kublk.c
@@ -795,28 +795,32 @@ static void ublk_handle_cqe(struct ublk_thread *t,
 {
 	struct ublk_dev *dev = t->dev;
 	unsigned q_id = user_data_to_q_id(cqe->user_data);
-	struct ublk_queue *q = &dev->q[q_id];
 	unsigned cmd_op = user_data_to_op(cqe->user_data);
 
 	if (cqe->res < 0 && cqe->res != -ENODEV)
-		ublk_err("%s: res %d userdata %llx queue state %x\n", __func__,
-				cqe->res, cqe->user_data, q->flags);
+		ublk_err("%s: res %d userdata %llx thread state %x\n", __func__,
+				cqe->res, cqe->user_data, t->state);
 
-	ublk_dbg(UBLK_DBG_IO_CMD, "%s: res %d (qid %d tag %u cmd_op %u target %d/%d) stopping %d\n",
-			__func__, cqe->res, q->q_id, user_data_to_tag(cqe->user_data),
-			cmd_op, is_target_io(cqe->user_data),
+	ublk_dbg(UBLK_DBG_IO_CMD, "%s: res %d (thread %d qid %d tag %u cmd_op %x "
+			"data %llx target %d/%d) stopping %d\n",
+			__func__, cqe->res, t->idx, q_id,
+			user_data_to_tag(cqe->user_data),
+			cmd_op, cqe->user_data, is_target_io(cqe->user_data),
 			user_data_to_tgt_data(cqe->user_data),
 			(t->state & UBLKS_T_STOPPING));
 
 	/* Don't retrieve io in case of target io */
 	if (is_target_io(cqe->user_data)) {
-		ublksrv_handle_tgt_cqe(t, q, cqe);
+		ublksrv_handle_tgt_cqe(t, &dev->q[q_id], cqe);
 		return;
 	}
 
 	t->cmd_inflight--;
 
-	ublk_handle_uring_cmd(t, q, cqe);
+	if (ublk_thread_batch_io(t))
+		ublk_batch_compl_cmd(t, cqe);
+	else
+		ublk_handle_uring_cmd(t, &dev->q[q_id], cqe);
 }
 
 static int ublk_reap_events_uring(struct ublk_thread *t)
@@ -873,6 +877,22 @@ static void ublk_thread_set_sched_affinity(const struct ublk_thread_info *info)
 				info->dev->dev_info.dev_id, info->idx);
 }
 
+static void ublk_batch_setup_queues(struct ublk_thread *t)
+{
+	int i;
+
+	/* setup all queues in the 1st thread */
+	for (i = 0; i < t->dev->dev_info.nr_hw_queues; i++) {
+		struct ublk_queue *q = &t->dev->q[i];
+		int ret;
+
+		ret = ublk_batch_queue_prep_io_cmds(t, q);
+		ublk_assert(ret == 0);
+		ret = ublk_process_io(t);
+		ublk_assert(ret >= 0);
+	}
+}
+
 static __attribute__((noinline)) int __ublk_io_handler_fn(struct ublk_thread_info *info)
 {
 	struct ublk_thread t = {
@@ -893,8 +913,14 @@ static __attribute__((noinline)) int __ublk_io_handler_fn(struct ublk_thread_inf
 	ublk_dbg(UBLK_DBG_THREAD, "tid %d: ublk dev %d thread %u started\n",
 			gettid(), dev_id, t.idx);
 
-	/* submit all io commands to ublk driver */
-	ublk_submit_fetch_commands(&t);
+	if (!ublk_thread_batch_io(&t)) {
+		/* submit all io commands to ublk driver */
+		ublk_submit_fetch_commands(&t);
+	} else if (!t.idx) {
+		/* prepare all io commands in the 1st thread context */
+		ublk_batch_setup_queues(&t);
+	}
+
 	do {
 		if (ublk_process_io(&t) < 0)
 			break;
diff --git a/tools/testing/selftests/ublk/kublk.h b/tools/testing/selftests/ublk/kublk.h
index e75c28680783..51fad0f4419b 100644
--- a/tools/testing/selftests/ublk/kublk.h
+++ b/tools/testing/selftests/ublk/kublk.h
@@ -398,10 +398,16 @@ static inline void ublk_set_sqe_cmd_op(struct io_uring_sqe *sqe, __u32 cmd_op)
 	addr[1] = 0;
 }
 
+static inline unsigned short ublk_batch_io_buf_idx(
+		const struct ublk_thread *t, const struct ublk_queue *q,
+		unsigned tag);
+
 static inline unsigned short ublk_io_buf_idx(const struct ublk_thread *t,
 					     const struct ublk_queue *q,
 					     unsigned tag)
 {
+	if (ublk_queue_batch_io(q))
+		return ublk_batch_io_buf_idx(t, q, tag);
 	return q->ios[tag].buf_index;
 }
 
@@ -464,6 +470,22 @@ static inline int ublk_queue_no_buf(const struct ublk_queue *q)
 	return ublk_queue_use_zc(q) || ublk_queue_use_auto_zc(q);
 }
 
+/*
+ * Each IO's buffer index has to be calculated by this helper for
+ * UBLKS_T_BATCH_IO
+ */
+static inline unsigned short ublk_batch_io_buf_idx(
+		const struct ublk_thread *t, const struct ublk_queue *q,
+		unsigned tag)
+{
+	return tag;
+}
+
+/* Queue UBLK_U_IO_PREP_IO_CMDS for a specific queue with batch elements */
+int ublk_batch_queue_prep_io_cmds(struct ublk_thread *t, struct ublk_queue *q);
+/* Handle completion of batch I/O commands (prep/commit) */
+void ublk_batch_compl_cmd(struct ublk_thread *t,
+			  const struct io_uring_cqe *cqe);
 /* Initialize batch I/O state and calculate buffer parameters */
 void ublk_batch_prepare(struct ublk_thread *t);
 /* Allocate and register commit buffers for batch operations */
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH V4 24/27] selftests: ublk: handle UBLK_U_IO_COMMIT_IO_CMDS
  2025-11-21  1:58 [PATCH V4 00/27] ublk: add UBLK_F_BATCH_IO Ming Lei
                   ` (22 preceding siblings ...)
  2025-11-21  1:58 ` [PATCH V4 23/27] selftests: ublk: handle UBLK_U_IO_PREP_IO_CMDS Ming Lei
@ 2025-11-21  1:58 ` Ming Lei
  2025-11-21  1:58 ` [PATCH V4 25/27] selftests: ublk: handle UBLK_U_IO_FETCH_IO_CMDS Ming Lei
                   ` (4 subsequent siblings)
  28 siblings, 0 replies; 66+ messages in thread
From: Ming Lei @ 2025-11-21  1:58 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Caleb Sander Mateos, Uday Shankar, Stefani Seibold, Andrew Morton,
	linux-kernel, Ming Lei

Implement UBLK_U_IO_COMMIT_IO_CMDS to enable efficient batched
completion of I/O operations in the batch I/O framework.

This completes the batch I/O infrastructure by adding the commit
phase that notifies the kernel about completed I/O operations:

Key features:
- Batch multiple I/O completions into single UBLK_U_IO_COMMIT_IO_CMDS
- Dynamic commit buffer allocation and management per thread
- Automatic commit buffer preparation before processing events
- Commit buffer submission after processing completed I/Os
- Integration with existing completion workflows

Implementation details:
- ublk_batch_prep_commit() allocates and initializes commit buffers
- ublk_batch_complete_io() adds completed I/Os to current batch
- ublk_batch_commit_io_cmds() submits batched completions to kernel
- Modified ublk_process_io() to handle batch commit lifecycle
- Enhanced ublk_complete_io() to route to batch or legacy completion

The commit buffer stores completion information (tag, result, buffer
details) for multiple I/Os, then submits them all at once, significantly
reducing syscall overhead compared to individual I/O completions.
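
As a rough worked example of the buffer sizing (assuming a 4k page size and
the default queue depth of 128), a single commit command can carry every
outstanding completion of a queue:

	#include <stdio.h>

	#define round_up(val, rnd)	(((val) + ((rnd) - 1)) & ~((rnd) - 1))

	int main(void)
	{
		unsigned elem_size = 8;		/* zc/auto_zc/user_copy element */
		unsigned queue_depth = 128;	/* kublk default */
		unsigned page_sz = 4096;	/* assumed getpagesize() */
		unsigned buf_size = round_up(elem_size * queue_depth, page_sz);

		/* prints: commit buf 4096 bytes, holds up to 512 elems */
		printf("commit buf %u bytes, holds up to %u elems\n",
		       buf_size, buf_size / elem_size);
		return 0;
	}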

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 tools/testing/selftests/ublk/batch.c | 74 ++++++++++++++++++++++++++--
 tools/testing/selftests/ublk/kublk.c |  8 ++-
 tools/testing/selftests/ublk/kublk.h | 69 +++++++++++++++++---------
 3 files changed, 122 insertions(+), 29 deletions(-)

diff --git a/tools/testing/selftests/ublk/batch.c b/tools/testing/selftests/ublk/batch.c
index 01f00c21dfdb..e240d4decedf 100644
--- a/tools/testing/selftests/ublk/batch.c
+++ b/tools/testing/selftests/ublk/batch.c
@@ -174,7 +174,7 @@ static void ublk_init_batch_cmd(struct ublk_thread *t, __u16 q_id,
 	cmd->elem_bytes = elem_bytes;
 	cmd->nr_elem	= nr_elem;
 
-	user_data = build_user_data(buf_idx, _IOC_NR(op), 0, q_id, 0);
+	user_data = build_user_data(buf_idx, _IOC_NR(op), nr_elem, q_id, 0);
 	io_uring_sqe_set_data64(sqe, user_data);
 
 	t->cmd_inflight += 1;
@@ -244,9 +244,11 @@ static void ublk_batch_compl_commit_cmd(struct ublk_thread *t,
 
 	if (op == _IOC_NR(UBLK_U_IO_PREP_IO_CMDS))
 		ublk_assert(cqe->res == 0);
-	else if (op == _IOC_NR(UBLK_U_IO_COMMIT_IO_CMDS))
-		;//assert(cqe->res == t->commit_buf_size);
-	else
+	else if (op == _IOC_NR(UBLK_U_IO_COMMIT_IO_CMDS)) {
+		int nr_elem = user_data_to_tgt_data(cqe->user_data);
+
+		ublk_assert(cqe->res == t->commit_buf_elem_size * nr_elem);
+	} else
 		ublk_assert(0);
 
 	ublk_free_commit_buf(t, buf_idx);
@@ -263,3 +265,67 @@ void ublk_batch_compl_cmd(struct ublk_thread *t,
 		return;
 	}
 }
+
+void ublk_batch_commit_io_cmds(struct ublk_thread *t)
+{
+	struct io_uring_sqe *sqe;
+	unsigned short buf_idx;
+	unsigned short nr_elem = t->commit.done;
+
+	/* nothing to commit */
+	if (!nr_elem) {
+		ublk_free_commit_buf(t, t->commit.buf_idx);
+		return;
+	}
+
+	ublk_io_alloc_sqes(t, &sqe, 1);
+	buf_idx = t->commit.buf_idx;
+	sqe->addr = (__u64)t->commit.elem;
+	sqe->len = nr_elem * t->commit_buf_elem_size;
+
+	/* commit isn't per-queue command */
+	ublk_init_batch_cmd(t, t->commit.q_id, sqe, UBLK_U_IO_COMMIT_IO_CMDS,
+			t->commit_buf_elem_size, nr_elem, buf_idx);
+	ublk_setup_commit_sqe(t, sqe, buf_idx);
+}
+
+static void ublk_batch_init_commit(struct ublk_thread *t,
+				   unsigned short buf_idx)
+{
+	/* so far only support 1:1 queue/thread mapping */
+	t->commit.q_id = t->idx;
+	t->commit.buf_idx = buf_idx;
+	t->commit.elem = ublk_get_commit_buf(t, buf_idx);
+	t->commit.done = 0;
+	t->commit.count = t->commit_buf_size /
+		t->commit_buf_elem_size;
+}
+
+void ublk_batch_prep_commit(struct ublk_thread *t)
+{
+	unsigned short buf_idx = ublk_alloc_commit_buf(t);
+
+	ublk_assert(buf_idx != UBLKS_T_COMMIT_BUF_INV_IDX);
+	ublk_batch_init_commit(t, buf_idx);
+}
+
+void ublk_batch_complete_io(struct ublk_thread *t, struct ublk_queue *q,
+			    unsigned tag, int res)
+{
+	struct batch_commit_buf *cb = &t->commit;
+	struct ublk_batch_elem *elem = (struct ublk_batch_elem *)(cb->elem +
+			cb->done * t->commit_buf_elem_size);
+	struct ublk_io *io = &q->ios[tag];
+
+	ublk_assert(q->q_id == t->commit.q_id);
+
+	elem->tag = tag;
+	elem->buf_index = ublk_batch_io_buf_idx(t, q, tag);
+	elem->result = res;
+
+	if (!ublk_queue_no_buf(q))
+		elem->buf_addr	= (__u64) (uintptr_t) io->buf_addr;
+
+	cb->done += 1;
+	ublk_assert(cb->done <= cb->count);
+}
diff --git a/tools/testing/selftests/ublk/kublk.c b/tools/testing/selftests/ublk/kublk.c
index e981fcf18475..6565e804679c 100644
--- a/tools/testing/selftests/ublk/kublk.c
+++ b/tools/testing/selftests/ublk/kublk.c
@@ -852,7 +852,13 @@ static int ublk_process_io(struct ublk_thread *t)
 		return -ENODEV;
 
 	ret = io_uring_submit_and_wait(&t->ring, 1);
-	reapped = ublk_reap_events_uring(t);
+	if (ublk_thread_batch_io(t)) {
+		ublk_batch_prep_commit(t);
+		reapped = ublk_reap_events_uring(t);
+		ublk_batch_commit_io_cmds(t);
+	} else {
+		reapped = ublk_reap_events_uring(t);
+	}
 
 	ublk_dbg(UBLK_DBG_THREAD, "submit result %d, reapped %d stop %d idle %d\n",
 			ret, reapped, (t->state & UBLKS_T_STOPPING),
diff --git a/tools/testing/selftests/ublk/kublk.h b/tools/testing/selftests/ublk/kublk.h
index 51fad0f4419b..0a355653d64c 100644
--- a/tools/testing/selftests/ublk/kublk.h
+++ b/tools/testing/selftests/ublk/kublk.h
@@ -182,6 +182,14 @@ struct ublk_batch_elem {
 	__u64 buf_addr;
 };
 
+struct batch_commit_buf {
+	unsigned short q_id;
+	unsigned short buf_idx;
+	void *elem;
+	unsigned short done;
+	unsigned short count;
+};
+
 struct ublk_thread {
 	struct ublk_dev *dev;
 	unsigned idx;
@@ -207,6 +215,7 @@ struct ublk_thread {
 	void *commit_buf;
 #define UBLKS_T_COMMIT_BUF_INV_IDX  ((unsigned short)-1)
 	struct allocator commit_buf_alloc;
+	struct batch_commit_buf commit;
 
 	struct io_uring ring;
 };
@@ -416,30 +425,6 @@ static inline struct ublk_io *ublk_get_io(struct ublk_queue *q, unsigned tag)
 	return &q->ios[tag];
 }
 
-static inline int ublk_complete_io(struct ublk_thread *t, struct ublk_queue *q,
-				   unsigned tag, int res)
-{
-	struct ublk_io *io = &q->ios[tag];
-
-	ublk_mark_io_done(io, res);
-
-	return ublk_queue_io_cmd(t, io);
-}
-
-static inline void ublk_queued_tgt_io(struct ublk_thread *t, struct ublk_queue *q,
-				      unsigned tag, int queued)
-{
-	if (queued < 0)
-		ublk_complete_io(t, q, tag, queued);
-	else {
-		struct ublk_io *io = ublk_get_io(q, tag);
-
-		t->io_inflight += queued;
-		io->tgt_ios = queued;
-		io->result = 0;
-	}
-}
-
 static inline int ublk_completed_tgt_io(struct ublk_thread *t,
 					struct ublk_queue *q, unsigned tag)
 {
@@ -493,6 +478,42 @@ int ublk_batch_alloc_buf(struct ublk_thread *t);
 /* Free commit buffers and cleanup batch allocator */
 void ublk_batch_free_buf(struct ublk_thread *t);
 
+/* Prepare a new commit buffer for batching completed I/O operations */
+void ublk_batch_prep_commit(struct ublk_thread *t);
+/* Submit UBLK_U_IO_COMMIT_IO_CMDS with batched completed I/O operations */
+void ublk_batch_commit_io_cmds(struct ublk_thread *t);
+/* Add a completed I/O operation to the current batch commit buffer */
+void ublk_batch_complete_io(struct ublk_thread *t, struct ublk_queue *q,
+			    unsigned tag, int res);
+
+static inline int ublk_complete_io(struct ublk_thread *t, struct ublk_queue *q,
+				   unsigned tag, int res)
+{
+	if (ublk_queue_batch_io(q)) {
+		ublk_batch_complete_io(t, q, tag, res);
+		return 0;
+	} else {
+		struct ublk_io *io = &q->ios[tag];
+
+		ublk_mark_io_done(io, res);
+		return ublk_queue_io_cmd(t, io);
+	}
+}
+
+static inline void ublk_queued_tgt_io(struct ublk_thread *t, struct ublk_queue *q,
+				      unsigned tag, int queued)
+{
+	if (queued < 0)
+		ublk_complete_io(t, q, tag, queued);
+	else {
+		struct ublk_io *io = ublk_get_io(q, tag);
+
+		t->io_inflight += queued;
+		io->tgt_ios = queued;
+		io->result = 0;
+	}
+}
+
 extern const struct ublk_tgt_ops null_tgt_ops;
 extern const struct ublk_tgt_ops loop_tgt_ops;
 extern const struct ublk_tgt_ops stripe_tgt_ops;
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH V4 25/27] selftests: ublk: handle UBLK_U_IO_FETCH_IO_CMDS
  2025-11-21  1:58 [PATCH V4 00/27] ublk: add UBLK_F_BATCH_IO Ming Lei
                   ` (23 preceding siblings ...)
  2025-11-21  1:58 ` [PATCH V4 24/27] selftests: ublk: handle UBLK_U_IO_COMMIT_IO_CMDS Ming Lei
@ 2025-11-21  1:58 ` Ming Lei
  2025-11-21  1:58 ` [PATCH V4 26/27] selftests: ublk: add --batch/-b for enabling F_BATCH_IO Ming Lei
                   ` (3 subsequent siblings)
  28 siblings, 0 replies; 66+ messages in thread
From: Ming Lei @ 2025-11-21  1:58 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Caleb Sander Mateos, Uday Shankar, Stefani Seibold, Andrew Morton,
	linux-kernel, Ming Lei

Add support for UBLK_U_IO_FETCH_IO_CMDS to enable efficient batch
fetching of I/O commands using multishot io_uring operations.

Key improvements:
- Implement multishot UBLK_U_IO_FETCH_IO_CMDS for continuous command fetching
- Add fetch buffer management with page-aligned, mlocked buffers
- Process fetched I/O command tags from kernel-provided buffers
- Integrate fetch operations with existing batch I/O infrastructure
- Significantly reduce uring_cmd issuing overhead through batching

The implementation uses two fetch buffers per thread with automatic
requeuing to maintain continuous I/O command flow. Each fetch operation
retrieves multiple command tags in a single syscall, dramatically
improving performance compared to individual command fetching.

Technical details:
- Fetch buffers are page-aligned and mlocked for optimal performance
- Uses IORING_URING_CMD_MULTISHOT for continuous operation
- Automatic buffer management and requeuing on completion
- Enhanced CQE handling for fetch command completions
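
For reference, a minimal standalone sketch of decoding the tags delivered by
one fetch completion; it mirrors ublk_compl_batch_fetch() below, with an
illustrative little-endian buffer and a made-up function name:

	#include <stdio.h>
	#include <string.h>

	/* cqe->res bytes of 2-byte tags are appended at 'off' in the buffer */
	static void decode_fetched_tags(const unsigned char *fetch_buf,
					unsigned off, unsigned res)
	{
		unsigned end = off + res;
		unsigned i;

		for (i = off; i < end; i += 2) {
			unsigned short tag;

			memcpy(&tag, fetch_buf + i, sizeof(tag));
			/* kublk hands each tag to q->tgt_ops->queue_io() */
			printf("fetched tag %u\n", tag);
		}
	}

	int main(void)
	{
		/* three little-endian tags: 3, 7, 1 */
		unsigned char buf[8] = { 3, 0, 7, 0, 1, 0, 0, 0 };

		decode_fetched_tags(buf, 0, 6);
		return 0;
	}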

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 tools/testing/selftests/ublk/batch.c | 134 ++++++++++++++++++++++++++-
 tools/testing/selftests/ublk/kublk.c |  14 ++-
 tools/testing/selftests/ublk/kublk.h |  13 +++
 3 files changed, 157 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/ublk/batch.c b/tools/testing/selftests/ublk/batch.c
index e240d4decedf..7db91f910944 100644
--- a/tools/testing/selftests/ublk/batch.c
+++ b/tools/testing/selftests/ublk/batch.c
@@ -140,15 +140,63 @@ void ublk_batch_prepare(struct ublk_thread *t)
 			t->nr_bufs);
 }
 
+static void free_batch_fetch_buf(struct ublk_thread *t)
+{
+	int i;
+
+	for (i = 0; i < UBLKS_T_NR_FETCH_BUF; i++) {
+		io_uring_free_buf_ring(&t->ring, t->fetch[i].br, 1, i);
+		munlock(t->fetch[i].fetch_buf, t->fetch[i].fetch_buf_size);
+		free(t->fetch[i].fetch_buf);
+	}
+}
+
+static int alloc_batch_fetch_buf(struct ublk_thread *t)
+{
+	/* page aligned fetch buffer, and it is mlocked for speedup delivery */
+	unsigned pg_sz = getpagesize();
+	unsigned buf_size = round_up(t->dev->dev_info.queue_depth * 2, pg_sz);
+	int ret;
+	int i = 0;
+
+	for (i = 0; i < UBLKS_T_NR_FETCH_BUF; i++) {
+		t->fetch[i].fetch_buf_size = buf_size;
+
+		if (posix_memalign((void **)&t->fetch[i].fetch_buf, pg_sz,
+					t->fetch[i].fetch_buf_size))
+			return -ENOMEM;
+
+		/* lock fetch buffer page for fast fetching */
+		if (mlock(t->fetch[i].fetch_buf, t->fetch[i].fetch_buf_size))
+			ublk_err("%s: can't lock fetch buffer %s\n", __func__,
+				strerror(errno));
+		t->fetch[i].br = io_uring_setup_buf_ring(&t->ring, 1,
+			i, IOU_PBUF_RING_INC, &ret);
+		if (!t->fetch[i].br) {
+			ublk_err("Buffer ring register failed %d\n", ret);
+			return ret;
+		}
+	}
+
+	return 0;
+}
+
 int ublk_batch_alloc_buf(struct ublk_thread *t)
 {
+	int ret;
+
 	ublk_assert(t->nr_commit_buf < 16);
-	return alloc_batch_commit_buf(t);
+
+	ret = alloc_batch_commit_buf(t);
+	if (ret)
+		return ret;
+	return alloc_batch_fetch_buf(t);
 }
 
 void ublk_batch_free_buf(struct ublk_thread *t)
 {
 	free_batch_commit_buf(t);
+	free_batch_fetch_buf(t);
 }
 
 static void ublk_init_batch_cmd(struct ublk_thread *t, __u16 q_id,
@@ -199,6 +247,76 @@ static void ublk_setup_commit_sqe(struct ublk_thread *t,
 	cmd->flags |= t->cmd_flags;
 }
 
+static void ublk_batch_queue_fetch(struct ublk_thread *t,
+				   struct ublk_queue *q,
+				   unsigned short buf_idx)
+{
+	unsigned short nr_elem = t->fetch[buf_idx].fetch_buf_size / 2;
+	struct io_uring_sqe *sqe;
+
+	io_uring_buf_ring_add(t->fetch[buf_idx].br, t->fetch[buf_idx].fetch_buf,
+			t->fetch[buf_idx].fetch_buf_size,
+			0, 0, 0);
+	io_uring_buf_ring_advance(t->fetch[buf_idx].br, 1);
+
+	ublk_io_alloc_sqes(t, &sqe, 1);
+
+	ublk_init_batch_cmd(t, q->q_id, sqe, UBLK_U_IO_FETCH_IO_CMDS, 2, nr_elem,
+			buf_idx);
+
+	sqe->rw_flags = IORING_URING_CMD_MULTISHOT;
+	sqe->buf_group = buf_idx;
+	sqe->flags |= IOSQE_BUFFER_SELECT;
+
+	t->fetch[buf_idx].fetch_buf_off = 0;
+}
+
+void ublk_batch_start_fetch(struct ublk_thread *t,
+			    struct ublk_queue *q)
+{
+	int i;
+
+	for (i = 0; i < UBLKS_T_NR_FETCH_BUF; i++)
+		ublk_batch_queue_fetch(t, q, i);
+}
+
+static unsigned short ublk_compl_batch_fetch(struct ublk_thread *t,
+				   struct ublk_queue *q,
+				   const struct io_uring_cqe *cqe)
+{
+	unsigned short buf_idx = user_data_to_tag(cqe->user_data);
+	unsigned start = t->fetch[buf_idx].fetch_buf_off;
+	unsigned end = start + cqe->res;
+	void *buf = t->fetch[buf_idx].fetch_buf;
+	int i;
+
+	if (cqe->res < 0)
+		return buf_idx;
+
+	if ((end - start) / 2 > q->q_depth) {
+		ublk_err("%s: fetch duplicated ios offset %u count %u\n", __func__, start, cqe->res);
+
+		for (i = start; i < end; i += 2) {
+			unsigned short tag = *(unsigned short *)(buf + i);
+
+			ublk_err("%u ", tag);
+		}
+		ublk_err("\n");
+	}
+
+	for (i = start; i < end; i += 2) {
+		unsigned short tag = *(unsigned short *)(buf + i);
+
+		if (tag >= q->q_depth)
+			ublk_err("%s: bad tag %u\n", __func__, tag);
+
+		if (q->tgt_ops->queue_io)
+			q->tgt_ops->queue_io(t, q, tag);
+	}
+	t->fetch[buf_idx].fetch_buf_off = end;
+	return buf_idx;
+}
+
 int ublk_batch_queue_prep_io_cmds(struct ublk_thread *t, struct ublk_queue *q)
 {
 	unsigned short nr_elem = q->q_depth;
@@ -258,12 +376,26 @@ void ublk_batch_compl_cmd(struct ublk_thread *t,
 			  const struct io_uring_cqe *cqe)
 {
 	unsigned op = user_data_to_op(cqe->user_data);
+	struct ublk_queue *q;
+	unsigned buf_idx;
+	unsigned q_id;
 
 	if (op == _IOC_NR(UBLK_U_IO_PREP_IO_CMDS) ||
 			op == _IOC_NR(UBLK_U_IO_COMMIT_IO_CMDS)) {
 		ublk_batch_compl_commit_cmd(t, cqe, op);
 		return;
 	}
+
+	/* FETCH command is per queue */
+	q_id = user_data_to_q_id(cqe->user_data);
+	q = &t->dev->q[q_id];
+	buf_idx = ublk_compl_batch_fetch(t, q, cqe);
+
+	if (cqe->res < 0 && cqe->res != -ENOBUFS) {
+		t->state |= UBLKS_T_STOPPING;
+	} else if (!(cqe->flags & IORING_CQE_F_MORE) || cqe->res == -ENOBUFS) {
+		ublk_batch_queue_fetch(t, q, buf_idx);
+	}
 }
 
 void ublk_batch_commit_io_cmds(struct ublk_thread *t)
diff --git a/tools/testing/selftests/ublk/kublk.c b/tools/testing/selftests/ublk/kublk.c
index 6565e804679c..cb329c7aebc4 100644
--- a/tools/testing/selftests/ublk/kublk.c
+++ b/tools/testing/selftests/ublk/kublk.c
@@ -493,6 +493,10 @@ static int ublk_thread_init(struct ublk_thread *t, unsigned long long extra_flag
 	int ring_depth = dev->tgt.sq_depth, cq_depth = dev->tgt.cq_depth;
 	int ret;
 
+	/* FETCH_IO_CMDS is multishot, so increase cq depth for BATCH_IO */
+	if (ublk_dev_batch_io(dev))
+		cq_depth += dev->dev_info.queue_depth;
+
 	ret = ublk_setup_ring(&t->ring, ring_depth, cq_depth,
 			IORING_SETUP_COOP_TASKRUN |
 			IORING_SETUP_SINGLE_ISSUER |
@@ -797,7 +801,7 @@ static void ublk_handle_cqe(struct ublk_thread *t,
 	unsigned q_id = user_data_to_q_id(cqe->user_data);
 	unsigned cmd_op = user_data_to_op(cqe->user_data);
 
-	if (cqe->res < 0 && cqe->res != -ENODEV)
+	if (cqe->res < 0 && cqe->res != -ENODEV && cqe->res != -ENOBUFS)
 		ublk_err("%s: res %d userdata %llx thread state %x\n", __func__,
 				cqe->res, cqe->user_data, t->state);
 
@@ -922,9 +926,13 @@ static __attribute__((noinline)) int __ublk_io_handler_fn(struct ublk_thread_inf
 	if (!ublk_thread_batch_io(&t)) {
 		/* submit all io commands to ublk driver */
 		ublk_submit_fetch_commands(&t);
-	} else if (!t.idx) {
+	} else {
+		struct ublk_queue *q = &t.dev->q[t.idx];
+
 		/* prepare all io commands in the 1st thread context */
-		ublk_batch_setup_queues(&t);
+		if (!t.idx)
+			ublk_batch_setup_queues(&t);
+		ublk_batch_start_fetch(&t, q);
 	}
 
 	do {
diff --git a/tools/testing/selftests/ublk/kublk.h b/tools/testing/selftests/ublk/kublk.h
index 0a355653d64c..222501048c24 100644
--- a/tools/testing/selftests/ublk/kublk.h
+++ b/tools/testing/selftests/ublk/kublk.h
@@ -190,6 +190,13 @@ struct batch_commit_buf {
 	unsigned short count;
 };
 
+struct batch_fetch_buf {
+	struct io_uring_buf_ring *br;
+	void *fetch_buf;
+	unsigned int fetch_buf_size;
+	unsigned int fetch_buf_off;
+};
+
 struct ublk_thread {
 	struct ublk_dev *dev;
 	unsigned idx;
@@ -216,6 +223,9 @@ struct ublk_thread {
 #define UBLKS_T_COMMIT_BUF_INV_IDX  ((unsigned short)-1)
 	struct allocator commit_buf_alloc;
 	struct batch_commit_buf commit;
+	/* FETCH_IO_CMDS buffer */
+#define UBLKS_T_NR_FETCH_BUF 	2
+	struct batch_fetch_buf fetch[UBLKS_T_NR_FETCH_BUF];
 
 	struct io_uring ring;
 };
@@ -468,6 +478,9 @@ static inline unsigned short ublk_batch_io_buf_idx(
 
 /* Queue UBLK_U_IO_PREP_IO_CMDS for a specific queue with batch elements */
 int ublk_batch_queue_prep_io_cmds(struct ublk_thread *t, struct ublk_queue *q);
+/* Start fetching I/O commands using multishot UBLK_U_IO_FETCH_IO_CMDS */
+void ublk_batch_start_fetch(struct ublk_thread *t,
+			    struct ublk_queue *q);
 /* Handle completion of batch I/O commands (prep/commit) */
 void ublk_batch_compl_cmd(struct ublk_thread *t,
 			  const struct io_uring_cqe *cqe);
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH V4 26/27] selftests: ublk: add --batch/-b for enabling F_BATCH_IO
  2025-11-21  1:58 [PATCH V4 00/27] ublk: add UBLK_F_BATCH_IO Ming Lei
                   ` (24 preceding siblings ...)
  2025-11-21  1:58 ` [PATCH V4 25/27] selftests: ublk: handle UBLK_U_IO_FETCH_IO_CMDS Ming Lei
@ 2025-11-21  1:58 ` Ming Lei
  2025-11-21  1:58 ` [PATCH V4 27/27] selftests: ublk: support arbitrary threads/queues combination Ming Lei
                   ` (2 subsequent siblings)
  28 siblings, 0 replies; 66+ messages in thread
From: Ming Lei @ 2025-11-21  1:58 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Caleb Sander Mateos, Uday Shankar, Stefani Seibold, Andrew Morton,
	linux-kernel, Ming Lei

Add --batch/-b for enabling F_BATCH_IO.

Add generic_14 to cover its basic functionality.

Add stress_06 and stress_07 to cover stress testing.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 tools/testing/selftests/ublk/Makefile         |  3 ++
 tools/testing/selftests/ublk/kublk.c          | 13 +++++-
 .../testing/selftests/ublk/test_generic_14.sh | 32 +++++++++++++
 .../testing/selftests/ublk/test_stress_06.sh  | 45 +++++++++++++++++++
 .../testing/selftests/ublk/test_stress_07.sh  | 44 ++++++++++++++++++
 5 files changed, 136 insertions(+), 1 deletion(-)
 create mode 100755 tools/testing/selftests/ublk/test_generic_14.sh
 create mode 100755 tools/testing/selftests/ublk/test_stress_06.sh
 create mode 100755 tools/testing/selftests/ublk/test_stress_07.sh

diff --git a/tools/testing/selftests/ublk/Makefile b/tools/testing/selftests/ublk/Makefile
index a724276622d0..cbf57113b1a6 100644
--- a/tools/testing/selftests/ublk/Makefile
+++ b/tools/testing/selftests/ublk/Makefile
@@ -21,6 +21,7 @@ TEST_PROGS += test_generic_10.sh
 TEST_PROGS += test_generic_11.sh
 TEST_PROGS += test_generic_12.sh
 TEST_PROGS += test_generic_13.sh
+TEST_PROGS += test_generic_14.sh
 
 TEST_PROGS += test_null_01.sh
 TEST_PROGS += test_null_02.sh
@@ -39,6 +40,8 @@ TEST_PROGS += test_stress_02.sh
 TEST_PROGS += test_stress_03.sh
 TEST_PROGS += test_stress_04.sh
 TEST_PROGS += test_stress_05.sh
+TEST_PROGS += test_stress_06.sh
+TEST_PROGS += test_stress_07.sh
 
 TEST_GEN_PROGS_EXTENDED = kublk
 
diff --git a/tools/testing/selftests/ublk/kublk.c b/tools/testing/selftests/ublk/kublk.c
index cb329c7aebc4..4c45482a847c 100644
--- a/tools/testing/selftests/ublk/kublk.c
+++ b/tools/testing/selftests/ublk/kublk.c
@@ -1476,6 +1476,7 @@ static int cmd_dev_get_features(void)
 		FEAT_NAME(UBLK_F_QUIESCE),
 		FEAT_NAME(UBLK_F_PER_IO_DAEMON),
 		FEAT_NAME(UBLK_F_BUF_REG_OFF_DAEMON),
+		FEAT_NAME(UBLK_F_BATCH_IO),
 	};
 	struct ublk_dev *dev;
 	__u64 features = 0;
@@ -1571,6 +1572,7 @@ static void __cmd_create_help(char *exe, bool recovery)
 	printf("\t[--foreground] [--quiet] [-z] [--auto_zc] [--auto_zc_fallback] [--debug_mask mask] [-r 0|1 ] [-g]\n");
 	printf("\t[-e 0|1 ] [-i 0|1] [--no_ublk_fixed_fd]\n");
 	printf("\t[--nthreads threads] [--per_io_tasks]\n");
+	printf("\t[--batch|-b]\n");
 	printf("\t[target options] [backfile1] [backfile2] ...\n");
 	printf("\tdefault: nr_queues=2(max 32), depth=128(max 1024), dev_id=-1(auto allocation)\n");
 	printf("\tdefault: nthreads=nr_queues");
@@ -1633,6 +1635,7 @@ int main(int argc, char *argv[])
 		{ "nthreads",		1,	NULL,  0 },
 		{ "per_io_tasks",	0,	NULL,  0 },
 		{ "no_ublk_fixed_fd",	0,	NULL,  0 },
+		{ "batch",              0,      NULL, 'b'},
 		{ 0, 0, 0, 0 }
 	};
 	const struct ublk_tgt_ops *ops = NULL;
@@ -1654,9 +1657,12 @@ int main(int argc, char *argv[])
 
 	opterr = 0;
 	optind = 2;
-	while ((opt = getopt_long(argc, argv, "t:n:d:q:r:e:i:s:gaz",
+	while ((opt = getopt_long(argc, argv, "t:n:d:q:r:e:i:s:gazb",
 				  longopts, &option_idx)) != -1) {
 		switch (opt) {
+		case 'b':
+			ctx.flags |= UBLK_F_BATCH_IO;
+			break;
 		case 'a':
 			ctx.all = 1;
 			break;
@@ -1737,6 +1743,11 @@ int main(int argc, char *argv[])
 		}
 	}
 
+	if (ctx.per_io_tasks && (ctx.flags & UBLK_F_BATCH_IO)) {
+		ublk_err("per_io_task and F_BATCH_IO conflict\n");
+		return -EINVAL;
+	}
+
 	/* auto_zc_fallback depends on F_AUTO_BUF_REG & F_SUPPORT_ZERO_COPY */
 	if (ctx.auto_zc_fallback &&
 	    !((ctx.flags & UBLK_F_AUTO_BUF_REG) &&
diff --git a/tools/testing/selftests/ublk/test_generic_14.sh b/tools/testing/selftests/ublk/test_generic_14.sh
new file mode 100755
index 000000000000..e197961b07f1
--- /dev/null
+++ b/tools/testing/selftests/ublk/test_generic_14.sh
@@ -0,0 +1,32 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+. "$(cd "$(dirname "$0")" && pwd)"/test_common.sh
+
+TID="generic_14"
+ERR_CODE=0
+
+if ! _have_feature "BATCH_IO"; then
+	exit "$UBLK_SKIP_CODE"
+fi
+
+_prep_test "generic" "test basic function of UBLK_F_BATCH_IO"
+
+_create_backfile 0 256M
+_create_backfile 1 256M
+
+dev_id=$(_add_ublk_dev -t loop -q 2 -b "${UBLK_BACKFILES[0]}")
+_check_add_dev $TID $?
+
+if ! _mkfs_mount_test /dev/ublkb"${dev_id}"; then
+	_cleanup_test "generic"
+	_show_result $TID 255
+fi
+
+dev_id=$(_add_ublk_dev -t stripe -b --auto_zc "${UBLK_BACKFILES[0]}" "${UBLK_BACKFILES[1]}")
+_check_add_dev $TID $?
+_mkfs_mount_test /dev/ublkb"${dev_id}"
+ERR_CODE=$?
+
+_cleanup_test "generic"
+_show_result $TID $ERR_CODE
diff --git a/tools/testing/selftests/ublk/test_stress_06.sh b/tools/testing/selftests/ublk/test_stress_06.sh
new file mode 100755
index 000000000000..190db0b4f2ad
--- /dev/null
+++ b/tools/testing/selftests/ublk/test_stress_06.sh
@@ -0,0 +1,45 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+. "$(cd "$(dirname "$0")" && pwd)"/test_common.sh
+TID="stress_06"
+ERR_CODE=0
+
+ublk_io_and_remove()
+{
+	run_io_and_remove "$@"
+	ERR_CODE=$?
+	if [ ${ERR_CODE} -ne 0 ]; then
+		echo "$TID failure: $*"
+		_show_result $TID $ERR_CODE
+	fi
+}
+
+if ! _have_program fio; then
+	exit "$UBLK_SKIP_CODE"
+fi
+
+if ! _have_feature "ZERO_COPY"; then
+	exit "$UBLK_SKIP_CODE"
+fi
+if ! _have_feature "AUTO_BUF_REG"; then
+	exit "$UBLK_SKIP_CODE"
+fi
+if ! _have_feature "BATCH_IO"; then
+	exit "$UBLK_SKIP_CODE"
+fi
+
+_prep_test "stress" "run IO and remove device(zero copy)"
+
+_create_backfile 0 256M
+_create_backfile 1 128M
+_create_backfile 2 128M
+
+ublk_io_and_remove 8G -t null -q 4 -b &
+ublk_io_and_remove 256M -t loop -q 4 --auto_zc -b "${UBLK_BACKFILES[0]}" &
+ublk_io_and_remove 256M -t stripe -q 4 --auto_zc -b "${UBLK_BACKFILES[1]}" "${UBLK_BACKFILES[2]}" &
+ublk_io_and_remove 8G -t null -q 4 -z --auto_zc --auto_zc_fallback -b &
+wait
+
+_cleanup_test "stress"
+_show_result $TID $ERR_CODE
diff --git a/tools/testing/selftests/ublk/test_stress_07.sh b/tools/testing/selftests/ublk/test_stress_07.sh
new file mode 100755
index 000000000000..1b6bdb31da03
--- /dev/null
+++ b/tools/testing/selftests/ublk/test_stress_07.sh
@@ -0,0 +1,44 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+. "$(cd "$(dirname "$0")" && pwd)"/test_common.sh
+TID="stress_07"
+ERR_CODE=0
+
+ublk_io_and_kill_daemon()
+{
+	run_io_and_kill_daemon "$@"
+	ERR_CODE=$?
+	if [ ${ERR_CODE} -ne 0 ]; then
+		echo "$TID failure: $*"
+		_show_result $TID $ERR_CODE
+	fi
+}
+
+if ! _have_program fio; then
+	exit "$UBLK_SKIP_CODE"
+fi
+if ! _have_feature "ZERO_COPY"; then
+	exit "$UBLK_SKIP_CODE"
+fi
+if ! _have_feature "AUTO_BUF_REG"; then
+	exit "$UBLK_SKIP_CODE"
+fi
+if ! _have_feature "BATCH_IO"; then
+	exit "$UBLK_SKIP_CODE"
+fi
+
+_prep_test "stress" "run IO and kill ublk server(zero copy)"
+
+_create_backfile 0 256M
+_create_backfile 1 128M
+_create_backfile 2 128M
+
+ublk_io_and_kill_daemon 8G -t null -q 4 -z -b &
+ublk_io_and_kill_daemon 256M -t loop -q 4 --auto_zc -b "${UBLK_BACKFILES[0]}" &
+ublk_io_and_kill_daemon 256M -t stripe -q 4 -b "${UBLK_BACKFILES[1]}" "${UBLK_BACKFILES[2]}" &
+ublk_io_and_kill_daemon 8G -t null -q 4 -z --auto_zc --auto_zc_fallback -b &
+wait
+
+_cleanup_test "stress"
+_show_result $TID $ERR_CODE
-- 
2.47.0



* [PATCH V4 27/27] selftests: ublk: support arbitrary threads/queues combination
  2025-11-21  1:58 [PATCH V4 00/27] ublk: add UBLK_F_BATCH_IO Ming Lei
                   ` (25 preceding siblings ...)
  2025-11-21  1:58 ` [PATCH V4 26/27] selftests: ublk: add --batch/-b for enabling F_BATCH_IO Ming Lei
@ 2025-11-21  1:58 ` Ming Lei
  2025-11-28 11:59 ` [PATCH V4 00/27] ublk: add UBLK_F_BATCH_IO Ming Lei
  2025-11-28 16:22 ` (subset) " Jens Axboe
  28 siblings, 0 replies; 66+ messages in thread
From: Ming Lei @ 2025-11-21  1:58 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Caleb Sander Mateos, Uday Shankar, Stefani Seibold, Andrew Morton,
	linux-kernel, Ming Lei

Enable flexible thread-to-queue mapping in batch I/O mode to support
arbitrary combinations of threads and queues, improving resource
utilization and scalability.

Key improvements:
- Support N:M thread-to-queue mapping (previously limited to 1:1)
- Dynamic buffer allocation based on actual queue assignment per thread
- Thread-safe queue preparation with spinlock protection
- Intelligent buffer index calculation for multi-queue scenarios
- Enhanced validation for thread/queue combination constraints

Implementation details:
- Add q_thread_map matrix to track queue-to-thread assignments
- Dynamic allocation of commit and fetch buffers per thread
- Round-robin queue assignment algorithm for load balancing
- Per-queue spinlock to prevent race conditions during prep
- Updated buffer index calculation using queue position within thread

This enables efficient configurations like:
- 4 threads serving 1 queue, or 1 thread serving 4 queues
- any other N:M combination for matching resources to the workload
  (see the example commands below)

Testing:
- Added test_generic_15.sh: 4 threads vs 1 queue
- Added test_generic_16.sh: 1 thread vs 4 queues
- Validates correctness across different mapping scenarios
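
For example (a sketch of the intended usage, mirroring the new tests;
target and queue/thread counts are arbitrary), a device can now be driven
by more threads than queues, or the other way around:

	tools/testing/selftests/ublk/kublk add -t null -q 1 --nthreads 4 -b
	tools/testing/selftests/ublk/kublk add -t null -q 4 --nthreads 1 -b

Internally, ublk_batch_setup_map() distributes queues to threads
round-robin, so each thread only preps and fetches for the queues mapped
to it.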

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 tools/testing/selftests/ublk/Makefile         |   2 +
 tools/testing/selftests/ublk/batch.c          | 199 +++++++++++++++---
 tools/testing/selftests/ublk/kublk.c          |  49 ++++-
 tools/testing/selftests/ublk/kublk.h          |  40 +++-
 .../testing/selftests/ublk/test_generic_15.sh |  30 +++
 .../testing/selftests/ublk/test_generic_16.sh |  30 +++
 6 files changed, 302 insertions(+), 48 deletions(-)
 create mode 100755 tools/testing/selftests/ublk/test_generic_15.sh
 create mode 100755 tools/testing/selftests/ublk/test_generic_16.sh

diff --git a/tools/testing/selftests/ublk/Makefile b/tools/testing/selftests/ublk/Makefile
index cbf57113b1a6..3dbd9a857716 100644
--- a/tools/testing/selftests/ublk/Makefile
+++ b/tools/testing/selftests/ublk/Makefile
@@ -22,6 +22,8 @@ TEST_PROGS += test_generic_11.sh
 TEST_PROGS += test_generic_12.sh
 TEST_PROGS += test_generic_13.sh
 TEST_PROGS += test_generic_14.sh
+TEST_PROGS += test_generic_15.sh
+TEST_PROGS += test_generic_16.sh
 
 TEST_PROGS += test_null_01.sh
 TEST_PROGS += test_null_02.sh
diff --git a/tools/testing/selftests/ublk/batch.c b/tools/testing/selftests/ublk/batch.c
index 7db91f910944..db0747f13768 100644
--- a/tools/testing/selftests/ublk/batch.c
+++ b/tools/testing/selftests/ublk/batch.c
@@ -76,6 +76,7 @@ static void free_batch_commit_buf(struct ublk_thread *t)
 		free(t->commit_buf);
 	}
 	allocator_deinit(&t->commit_buf_alloc);
+	free(t->commit);
 }
 
 static int alloc_batch_commit_buf(struct ublk_thread *t)
@@ -84,7 +85,13 @@ static int alloc_batch_commit_buf(struct ublk_thread *t)
 	unsigned int total = buf_size * t->nr_commit_buf;
 	unsigned int page_sz = getpagesize();
 	void *buf = NULL;
-	int ret;
+	int i, ret, j = 0;
+
+	t->commit = calloc(t->nr_queues, sizeof(*t->commit));
+	for (i = 0; i < t->dev->dev_info.nr_hw_queues; i++) {
+		if (t->q_map[i])
+			t->commit[j++].q_id = i;
+	}
 
 	allocator_init(&t->commit_buf_alloc, t->nr_commit_buf);
 
@@ -107,6 +114,17 @@ static int alloc_batch_commit_buf(struct ublk_thread *t)
 	return ret;
 }
 
+static unsigned int ublk_thread_nr_queues(const struct ublk_thread *t)
+{
+	int i;
+	int ret = 0;
+
+	for (i = 0; i < t->dev->dev_info.nr_hw_queues; i++)
+		ret += !!t->q_map[i];
+
+	return ret;
+}
+
 void ublk_batch_prepare(struct ublk_thread *t)
 {
 	/*
@@ -119,10 +137,13 @@ void ublk_batch_prepare(struct ublk_thread *t)
 	 */
 	struct ublk_queue *q = &t->dev->q[0];
 
+	/* cache nr_queues because we don't support dynamic load-balance yet */
+	t->nr_queues = ublk_thread_nr_queues(t);
+
 	t->commit_buf_elem_size = ublk_commit_elem_buf_size(t->dev);
 	t->commit_buf_size = ublk_commit_buf_size(t);
 	t->commit_buf_start = t->nr_bufs;
-	t->nr_commit_buf = 2;
+	t->nr_commit_buf = 2 * t->nr_queues;
 	t->nr_bufs += t->nr_commit_buf;
 
 	t->cmd_flags = 0;
@@ -144,11 +165,12 @@ static void free_batch_fetch_buf(struct ublk_thread *t)
 {
 	int i;
 
-	for (i = 0; i < UBLKS_T_NR_FETCH_BUF; i++) {
+	for (i = 0; i < t->nr_fetch_bufs; i++) {
 		io_uring_free_buf_ring(&t->ring, t->fetch[i].br, 1, i);
 		munlock(t->fetch[i].fetch_buf, t->fetch[i].fetch_buf_size);
 		free(t->fetch[i].fetch_buf);
 	}
+	free(t->fetch);
 }
 
 static int alloc_batch_fetch_buf(struct ublk_thread *t)
@@ -159,7 +181,12 @@ static int alloc_batch_fetch_buf(struct ublk_thread *t)
 	int ret;
 	int i = 0;
 
-	for (i = 0; i < UBLKS_T_NR_FETCH_BUF; i++) {
+	/* double fetch buffer for each queue */
+	t->nr_fetch_bufs = t->nr_queues * 2;
+	t->fetch = calloc(t->nr_fetch_bufs, sizeof(*t->fetch));
+
+	/* allocate one buffer for each queue */
+	for (i = 0; i < t->nr_fetch_bufs; i++) {
 		t->fetch[i].fetch_buf_size = buf_size;
 
 		if (posix_memalign((void **)&t->fetch[i].fetch_buf, pg_sz,
@@ -185,7 +212,7 @@ int ublk_batch_alloc_buf(struct ublk_thread *t)
 {
 	int ret;
 
-	ublk_assert(t->nr_commit_buf < 16);
+	ublk_assert(t->nr_commit_buf < 2 * UBLK_MAX_QUEUES);
 
 	ret = alloc_batch_commit_buf(t);
 	if (ret)
@@ -271,13 +298,20 @@ static void ublk_batch_queue_fetch(struct ublk_thread *t,
 	t->fetch[buf_idx].fetch_buf_off = 0;
 }
 
-void ublk_batch_start_fetch(struct ublk_thread *t,
-			    struct ublk_queue *q)
+void ublk_batch_start_fetch(struct ublk_thread *t)
 {
 	int i;
+	int j = 0;
+
+	for (i = 0; i < t->dev->dev_info.nr_hw_queues; i++) {
+		if (t->q_map[i]) {
+			struct ublk_queue *q = &t->dev->q[i];
 
-	for (i = 0; i < UBLKS_T_NR_FETCH_BUF; i++)
-		ublk_batch_queue_fetch(t, q, i);
+			/* submit two fetch commands for each queue */
+			ublk_batch_queue_fetch(t, q, j++);
+			ublk_batch_queue_fetch(t, q, j++);
+		}
+	}
 }
 
 static unsigned short ublk_compl_batch_fetch(struct ublk_thread *t,
@@ -317,7 +351,7 @@ static unsigned short ublk_compl_batch_fetch(struct ublk_thread *t,
 	return buf_idx;
 }
 
-int ublk_batch_queue_prep_io_cmds(struct ublk_thread *t, struct ublk_queue *q)
+static int __ublk_batch_queue_prep_io_cmds(struct ublk_thread *t, struct ublk_queue *q)
 {
 	unsigned short nr_elem = q->q_depth;
 	unsigned short buf_idx = ublk_alloc_commit_buf(t);
@@ -354,6 +388,22 @@ int ublk_batch_queue_prep_io_cmds(struct ublk_thread *t, struct ublk_queue *q)
 	return 0;
 }
 
+int ublk_batch_queue_prep_io_cmds(struct ublk_thread *t, struct ublk_queue *q)
+{
+	int ret = 0;
+
+	pthread_spin_lock(&q->lock);
+	if (q->flags & UBLKS_Q_PREPARED)
+		goto unlock;
+	ret = __ublk_batch_queue_prep_io_cmds(t, q);
+	if (!ret)
+		q->flags |= UBLKS_Q_PREPARED;
+unlock:
+	pthread_spin_unlock(&q->lock);
+
+	return ret;
+}
+
 static void ublk_batch_compl_commit_cmd(struct ublk_thread *t,
 					const struct io_uring_cqe *cqe,
 					unsigned op)
@@ -398,59 +448,89 @@ void ublk_batch_compl_cmd(struct ublk_thread *t,
 	}
 }
 
-void ublk_batch_commit_io_cmds(struct ublk_thread *t)
+static void __ublk_batch_commit_io_cmds(struct ublk_thread *t,
+					struct batch_commit_buf *cb)
 {
 	struct io_uring_sqe *sqe;
 	unsigned short buf_idx;
-	unsigned short nr_elem = t->commit.done;
+	unsigned short nr_elem = cb->done;
 
 	/* nothing to commit */
 	if (!nr_elem) {
-		ublk_free_commit_buf(t, t->commit.buf_idx);
+		ublk_free_commit_buf(t, cb->buf_idx);
 		return;
 	}
 
 	ublk_io_alloc_sqes(t, &sqe, 1);
-	buf_idx = t->commit.buf_idx;
-	sqe->addr = (__u64)t->commit.elem;
+	buf_idx = cb->buf_idx;
+	sqe->addr = (__u64)cb->elem;
 	sqe->len = nr_elem * t->commit_buf_elem_size;
 
 	/* commit isn't per-queue command */
-	ublk_init_batch_cmd(t, t->commit.q_id, sqe, UBLK_U_IO_COMMIT_IO_CMDS,
+	ublk_init_batch_cmd(t, cb->q_id, sqe, UBLK_U_IO_COMMIT_IO_CMDS,
 			t->commit_buf_elem_size, nr_elem, buf_idx);
 	ublk_setup_commit_sqe(t, sqe, buf_idx);
 }
 
-static void ublk_batch_init_commit(struct ublk_thread *t,
-				   unsigned short buf_idx)
+void ublk_batch_commit_io_cmds(struct ublk_thread *t)
+{
+	int i;
+
+	for (i = 0; i < t->nr_queues; i++) {
+		struct batch_commit_buf *cb = &t->commit[i];
+
+		if (cb->buf_idx != UBLKS_T_COMMIT_BUF_INV_IDX)
+			__ublk_batch_commit_io_cmds(t, cb);
+	}
+
+}
+
+static void __ublk_batch_init_commit(struct ublk_thread *t,
+				     struct batch_commit_buf *cb,
+				     unsigned short buf_idx)
 {
 	/* so far only support 1:1 queue/thread mapping */
-	t->commit.q_id = t->idx;
-	t->commit.buf_idx = buf_idx;
-	t->commit.elem = ublk_get_commit_buf(t, buf_idx);
-	t->commit.done = 0;
-	t->commit.count = t->commit_buf_size /
+	cb->buf_idx = buf_idx;
+	cb->elem = ublk_get_commit_buf(t, buf_idx);
+	cb->done = 0;
+	cb->count = t->commit_buf_size /
 		t->commit_buf_elem_size;
 }
 
-void ublk_batch_prep_commit(struct ublk_thread *t)
+/* COMMIT_IO_CMDS is per-queue command, so use its own commit buffer */
+static void ublk_batch_init_commit(struct ublk_thread *t,
+				   struct batch_commit_buf *cb)
 {
 	unsigned short buf_idx = ublk_alloc_commit_buf(t);
 
 	ublk_assert(buf_idx != UBLKS_T_COMMIT_BUF_INV_IDX);
-	ublk_batch_init_commit(t, buf_idx);
+	ublk_assert(!ublk_batch_commit_prepared(cb));
+
+	__ublk_batch_init_commit(t, cb, buf_idx);
+}
+
+void ublk_batch_prep_commit(struct ublk_thread *t)
+{
+	int i;
+
+	for (i = 0; i < t->nr_queues; i++)
+		t->commit[i].buf_idx = UBLKS_T_COMMIT_BUF_INV_IDX;
 }
 
 void ublk_batch_complete_io(struct ublk_thread *t, struct ublk_queue *q,
 			    unsigned tag, int res)
 {
-	struct batch_commit_buf *cb = &t->commit;
-	struct ublk_batch_elem *elem = (struct ublk_batch_elem *)(cb->elem +
-			cb->done * t->commit_buf_elem_size);
+	unsigned q_t_idx = ublk_queue_idx_in_thread(t, q);
+	struct batch_commit_buf *cb = &t->commit[q_t_idx];
+	struct ublk_batch_elem *elem;
 	struct ublk_io *io = &q->ios[tag];
 
-	ublk_assert(q->q_id == t->commit.q_id);
+	if (!ublk_batch_commit_prepared(cb))
+		ublk_batch_init_commit(t, cb);
+
+	ublk_assert(q->q_id == cb->q_id);
 
+	elem = (struct ublk_batch_elem *)(cb->elem + cb->done * t->commit_buf_elem_size);
 	elem->tag = tag;
 	elem->buf_index = ublk_batch_io_buf_idx(t, q, tag);
 	elem->result = res;
@@ -461,3 +541,64 @@ void ublk_batch_complete_io(struct ublk_thread *t, struct ublk_queue *q,
 	cb->done += 1;
 	ublk_assert(cb->done <= cb->count);
 }
+
+void ublk_batch_setup_map(unsigned char (*q_thread_map)[UBLK_MAX_QUEUES],
+			   int nthreads, int queues)
+{
+	int i, j;
+
+	/*
+	 * Setup round-robin queue-to-thread mapping for arbitrary N:M combinations.
+	 *
+	 * This algorithm distributes queues across threads (and threads across queues)
+	 * in a balanced round-robin fashion to ensure even load distribution.
+	 *
+	 * Examples:
+	 * - 2 threads, 4 queues: T0=[Q0,Q2], T1=[Q1,Q3]
+	 * - 4 threads, 2 queues: T0=[Q0], T1=[Q1], T2=[Q0], T3=[Q1]
+	 * - 3 threads, 3 queues: T0=[Q0], T1=[Q1], T2=[Q2] (1:1 mapping)
+	 *
+	 * Phase 1: Mark which queues each thread handles (boolean mapping)
+	 */
+	for (i = 0, j = 0; i < queues || j < nthreads; i++, j++) {
+		q_thread_map[j % nthreads][i % queues] = 1;
+	}
+
+	/*
+	 * Phase 2: Convert boolean mapping to sequential indices within each thread.
+	 *
+	 * Transform from: q_thread_map[thread][queue] = 1 (handles queue)
+	 * To:             q_thread_map[thread][queue] = N (queue index within thread)
+	 *
+	 * This allows each thread to know the local index of each queue it handles,
+	 * which is essential for buffer allocation and management. For example:
+	 * - Thread 0 handling queues [0,2] becomes: q_thread_map[0][0]=1, q_thread_map[0][2]=2
+	 * - Thread 1 handling queues [1,3] becomes: q_thread_map[1][1]=1, q_thread_map[1][3]=2
+	 */
+	for (j = 0; j < nthreads; j++) {
+		unsigned char seq = 1;
+
+		for (i = 0; i < queues; i++) {
+			if (q_thread_map[j][i])
+				q_thread_map[j][i] = seq++;
+		}
+	}
+
+#if 0
+	for (j = 0; j < nthreads; j++) {
+		printf("thread %0d: ", j);
+		for (i = 0; i < queues; i++) {
+			if (q_thread_map[j][i])
+				printf("%03u ", i);
+		}
+		printf("\n");
+	}
+	printf("\n");
+	for (j = 0; j < nthreads; j++) {
+		for (i = 0; i < queues; i++) {
+			printf("%03u ", q_thread_map[j][i]);
+		}
+		printf("\n");
+	}
+#endif
+}
diff --git a/tools/testing/selftests/ublk/kublk.c b/tools/testing/selftests/ublk/kublk.c
index 4c45482a847c..f88a12b5a368 100644
--- a/tools/testing/selftests/ublk/kublk.c
+++ b/tools/testing/selftests/ublk/kublk.c
@@ -442,6 +442,7 @@ static int ublk_queue_init(struct ublk_queue *q, unsigned long long extra_flags)
 	int cmd_buf_size, io_buf_size;
 	unsigned long off;
 
+	pthread_spin_init(&q->lock, PTHREAD_PROCESS_PRIVATE);
 	q->tgt_ops = dev->tgt.ops;
 	q->flags = 0;
 	q->q_depth = depth;
@@ -495,7 +496,7 @@ static int ublk_thread_init(struct ublk_thread *t, unsigned long long extra_flag
 
 	/* FETCH_IO_CMDS is multishot, so increase cq depth for BATCH_IO */
 	if (ublk_dev_batch_io(dev))
-		cq_depth += dev->dev_info.queue_depth;
+		cq_depth += dev->dev_info.queue_depth * 2;
 
 	ret = ublk_setup_ring(&t->ring, ring_depth, cq_depth,
 			IORING_SETUP_COOP_TASKRUN |
@@ -878,6 +879,7 @@ struct ublk_thread_info {
 	sem_t 			*ready;
 	cpu_set_t 		*affinity;
 	unsigned long long	extra_flags;
+	unsigned char		(*q_thread_map)[UBLK_MAX_QUEUES];
 };
 
 static void ublk_thread_set_sched_affinity(const struct ublk_thread_info *info)
@@ -891,14 +893,18 @@ static void ublk_batch_setup_queues(struct ublk_thread *t)
 {
 	int i;
 
-	/* setup all queues in the 1st thread */
 	for (i = 0; i < t->dev->dev_info.nr_hw_queues; i++) {
 		struct ublk_queue *q = &t->dev->q[i];
 		int ret;
 
+		/*
+		 * Only prepare io commands in the mapped thread context,
+		 * otherwise io command buffer index may not work as expected
+		 */
+		if (t->q_map[i] == 0)
+			continue;
+
 		ret = ublk_batch_queue_prep_io_cmds(t, q);
-		ublk_assert(ret == 0);
-		ret = ublk_process_io(t);
 		ublk_assert(ret >= 0);
 	}
 }
@@ -912,6 +918,10 @@ static __attribute__((noinline)) int __ublk_io_handler_fn(struct ublk_thread_inf
 	int dev_id = info->dev->dev_info.dev_id;
 	int ret;
 
+	/* Copy per-thread queue mapping into thread-local variable */
+	if (info->q_thread_map)
+		memcpy(t.q_map, info->q_thread_map[info->idx], sizeof(t.q_map));
+
 	ret = ublk_thread_init(&t, info->extra_flags);
 	if (ret) {
 		ublk_err("ublk dev %d thread %u init failed\n",
@@ -927,12 +937,8 @@ static __attribute__((noinline)) int __ublk_io_handler_fn(struct ublk_thread_inf
 		/* submit all io commands to ublk driver */
 		ublk_submit_fetch_commands(&t);
 	} else {
-		struct ublk_queue *q = &t.dev->q[t.idx];
-
-		/* prepare all io commands in the 1st thread context */
-		if (!t.idx)
-			ublk_batch_setup_queues(&t);
-		ublk_batch_start_fetch(&t, q);
+		ublk_batch_setup_queues(&t);
+		ublk_batch_start_fetch(&t);
 	}
 
 	do {
@@ -1006,6 +1012,7 @@ static int ublk_start_daemon(const struct dev_ctx *ctx, struct ublk_dev *dev)
 	struct ublk_thread_info *tinfo;
 	unsigned long long extra_flags = 0;
 	cpu_set_t *affinity_buf;
+	unsigned char (*q_thread_map)[UBLK_MAX_QUEUES] = NULL;
 	void *thread_ret;
 	sem_t ready;
 	int ret, i;
@@ -1025,6 +1032,16 @@ static int ublk_start_daemon(const struct dev_ctx *ctx, struct ublk_dev *dev)
 	if (ret)
 		return ret;
 
+	if (ublk_dev_batch_io(dev)) {
+		q_thread_map = calloc(dev->nthreads, sizeof(*q_thread_map));
+		if (!q_thread_map) {
+			ret = -ENOMEM;
+			goto fail;
+		}
+		ublk_batch_setup_map(q_thread_map, dev->nthreads,
+				     dinfo->nr_hw_queues);
+	}
+
 	if (ctx->auto_zc_fallback)
 		extra_flags = UBLKS_Q_AUTO_BUF_REG_FALLBACK;
 	if (ctx->no_ublk_fixed_fd)
@@ -1047,6 +1064,7 @@ static int ublk_start_daemon(const struct dev_ctx *ctx, struct ublk_dev *dev)
 		tinfo[i].idx = i;
 		tinfo[i].ready = &ready;
 		tinfo[i].extra_flags = extra_flags;
+		tinfo[i].q_thread_map = q_thread_map;
 
 		/*
 		 * If threads are not tied 1:1 to queues, setting thread
@@ -1066,6 +1084,7 @@ static int ublk_start_daemon(const struct dev_ctx *ctx, struct ublk_dev *dev)
 	for (i = 0; i < dev->nthreads; i++)
 		sem_wait(&ready);
 	free(affinity_buf);
+	free(q_thread_map);
 
 	/* everything is fine now, start us */
 	if (ctx->recovery)
@@ -1234,7 +1253,8 @@ static int __cmd_dev_add(const struct dev_ctx *ctx)
 		goto fail;
 	}
 
-	if (nthreads != nr_queues && !ctx->per_io_tasks) {
+	if (nthreads != nr_queues && (!ctx->per_io_tasks &&
+				!(ctx->flags & UBLK_F_BATCH_IO))) {
 		ublk_err("%s: threads %u must be same as queues %u if "
 			"not using per_io_tasks\n",
 			__func__, nthreads, nr_queues);
@@ -1758,6 +1778,13 @@ int main(int argc, char *argv[])
 		return -EINVAL;
 	}
 
+	if ((ctx.flags & UBLK_F_AUTO_BUF_REG) &&
+			(ctx.flags & UBLK_F_BATCH_IO) &&
+			(ctx.nthreads > ctx.nr_hw_queues)) {
+		ublk_err("too many threads for F_AUTO_BUF_REG & F_BATCH_IO\n");
+		return -EINVAL;
+	}
+
 	i = optind;
 	while (i < argc && ctx.nr_files < MAX_BACK_FILES) {
 		ctx.files[ctx.nr_files++] = argv[i++];
diff --git a/tools/testing/selftests/ublk/kublk.h b/tools/testing/selftests/ublk/kublk.h
index 222501048c24..565819cf2dfe 100644
--- a/tools/testing/selftests/ublk/kublk.h
+++ b/tools/testing/selftests/ublk/kublk.h
@@ -166,12 +166,16 @@ struct ublk_queue {
 	const struct ublk_tgt_ops *tgt_ops;
 	struct ublksrv_io_desc *io_cmd_buf;
 
-/* borrow one bit of ublk uapi flags, which may never be used */
+/* borrow three bit of ublk uapi flags, which may never be used */
 #define UBLKS_Q_AUTO_BUF_REG_FALLBACK	(1ULL << 63)
 #define UBLKS_Q_NO_UBLK_FIXED_FD	(1ULL << 62)
+#define UBLKS_Q_PREPARED	(1ULL << 61)
 	__u64 flags;
 	int ublk_fd;	/* cached ublk char device fd */
 	struct ublk_io ios[UBLK_QUEUE_DEPTH];
+
+	/* used for prep io commands */
+	pthread_spinlock_t lock;
 };
 
 /* align with `ublk_elem_header` */
@@ -198,8 +202,12 @@ struct batch_fetch_buf {
 };
 
 struct ublk_thread {
+	/* Thread-local copy of queue-to-thread mapping for this thread */
+	unsigned char q_map[UBLK_MAX_QUEUES];
+
 	struct ublk_dev *dev;
-	unsigned idx;
+	unsigned short idx;
+	unsigned short nr_queues;
 
 #define UBLKS_T_STOPPING	(1U << 0)
 #define UBLKS_T_IDLE	(1U << 1)
@@ -222,10 +230,10 @@ struct ublk_thread {
 	void *commit_buf;
 #define UBLKS_T_COMMIT_BUF_INV_IDX  ((unsigned short)-1)
 	struct allocator commit_buf_alloc;
-	struct batch_commit_buf commit;
+	struct batch_commit_buf *commit;
 	/* FETCH_IO_CMDS buffer */
-#define UBLKS_T_NR_FETCH_BUF 	2
-	struct batch_fetch_buf fetch[UBLKS_T_NR_FETCH_BUF];
+	unsigned short nr_fetch_bufs;
+	struct batch_fetch_buf *fetch;
 
 	struct io_uring ring;
 };
@@ -465,6 +473,21 @@ static inline int ublk_queue_no_buf(const struct ublk_queue *q)
 	return ublk_queue_use_zc(q) || ublk_queue_use_auto_zc(q);
 }
 
+static inline int ublk_batch_commit_prepared(struct batch_commit_buf *cb)
+{
+	return cb->buf_idx != UBLKS_T_COMMIT_BUF_INV_IDX;
+}
+
+static inline unsigned ublk_queue_idx_in_thread(const struct ublk_thread *t,
+						const struct ublk_queue *q)
+{
+	unsigned char idx;
+
+	idx = t->q_map[q->q_id];
+	ublk_assert(idx != 0);
+	return idx - 1;
+}
+
 /*
  * Each IO's buffer index has to be calculated by this helper for
  * UBLKS_T_BATCH_IO
@@ -473,14 +496,13 @@ static inline unsigned short ublk_batch_io_buf_idx(
 		const struct ublk_thread *t, const struct ublk_queue *q,
 		unsigned tag)
 {
-	return tag;
+	return ublk_queue_idx_in_thread(t, q) * q->q_depth + tag;
 }
 
 /* Queue UBLK_U_IO_PREP_IO_CMDS for a specific queue with batch elements */
 int ublk_batch_queue_prep_io_cmds(struct ublk_thread *t, struct ublk_queue *q);
 /* Start fetching I/O commands using multishot UBLK_U_IO_FETCH_IO_CMDS */
-void ublk_batch_start_fetch(struct ublk_thread *t,
-			    struct ublk_queue *q);
+void ublk_batch_start_fetch(struct ublk_thread *t);
 /* Handle completion of batch I/O commands (prep/commit) */
 void ublk_batch_compl_cmd(struct ublk_thread *t,
 			  const struct io_uring_cqe *cqe);
@@ -498,6 +520,8 @@ void ublk_batch_commit_io_cmds(struct ublk_thread *t);
 /* Add a completed I/O operation to the current batch commit buffer */
 void ublk_batch_complete_io(struct ublk_thread *t, struct ublk_queue *q,
 			    unsigned tag, int res);
+void ublk_batch_setup_map(unsigned char (*q_thread_map)[UBLK_MAX_QUEUES],
+			   int nthreads, int queues);
 
 static inline int ublk_complete_io(struct ublk_thread *t, struct ublk_queue *q,
 				   unsigned tag, int res)
diff --git a/tools/testing/selftests/ublk/test_generic_15.sh b/tools/testing/selftests/ublk/test_generic_15.sh
new file mode 100755
index 000000000000..0afd037235cf
--- /dev/null
+++ b/tools/testing/selftests/ublk/test_generic_15.sh
@@ -0,0 +1,30 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+. "$(cd "$(dirname "$0")" && pwd)"/test_common.sh
+
+TID="generic_15"
+ERR_CODE=0
+
+if ! _have_feature "BATCH_IO"; then
+	exit "$UBLK_SKIP_CODE"
+fi
+
+if ! _have_program fio; then
+	exit "$UBLK_SKIP_CODE"
+fi
+
+_prep_test "generic" "test UBLK_F_BATCH_IO with 4_threads vs. 1_queues"
+
+_create_backfile 0 512M
+
+dev_id=$(_add_ublk_dev -t loop -q 1 --nthreads 4 -b "${UBLK_BACKFILES[0]}")
+_check_add_dev $TID $?
+
+# run fio over the ublk disk
+fio --name=job1 --filename=/dev/ublkb"${dev_id}" --ioengine=libaio --rw=readwrite \
+	--iodepth=32 --size=100M --numjobs=4 > /dev/null 2>&1
+ERR_CODE=$?
+
+_cleanup_test "generic"
+_show_result $TID $ERR_CODE
diff --git a/tools/testing/selftests/ublk/test_generic_16.sh b/tools/testing/selftests/ublk/test_generic_16.sh
new file mode 100755
index 000000000000..32bcf4a3d0b4
--- /dev/null
+++ b/tools/testing/selftests/ublk/test_generic_16.sh
@@ -0,0 +1,30 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+. "$(cd "$(dirname "$0")" && pwd)"/test_common.sh
+
+TID="generic_16"
+ERR_CODE=0
+
+if ! _have_feature "BATCH_IO"; then
+	exit "$UBLK_SKIP_CODE"
+fi
+
+if ! _have_program fio; then
+	exit "$UBLK_SKIP_CODE"
+fi
+
+_prep_test "generic" "test UBLK_F_BATCH_IO with 1_threads vs. 4_queues"
+
+_create_backfile 0 512M
+
+dev_id=$(_add_ublk_dev -t loop -q 4 --nthreads 1 -b "${UBLK_BACKFILES[0]}")
+_check_add_dev $TID $?
+
+# run fio over the ublk disk
+fio --name=job1 --filename=/dev/ublkb"${dev_id}" --ioengine=libaio --rw=readwrite \
+	--iodepth=32 --size=100M --numjobs=4 > /dev/null 2>&1
+ERR_CODE=$?
+
+_cleanup_test "generic"
+_show_result $TID $ERR_CODE
-- 
2.47.0



* Re: [PATCH V4 00/27] ublk: add UBLK_F_BATCH_IO
  2025-11-21  1:58 [PATCH V4 00/27] ublk: add UBLK_F_BATCH_IO Ming Lei
                   ` (26 preceding siblings ...)
  2025-11-21  1:58 ` [PATCH V4 27/27] selftests: ublk: support arbitrary threads/queues combination Ming Lei
@ 2025-11-28 11:59 ` Ming Lei
  2025-11-28 16:19   ` Jens Axboe
  2025-11-28 16:22 ` (subset) " Jens Axboe
  28 siblings, 1 reply; 66+ messages in thread
From: Ming Lei @ 2025-11-28 11:59 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Caleb Sander Mateos, Uday Shankar, Stefani Seibold, Andrew Morton,
	linux-kernel

On Fri, Nov 21, 2025 at 09:58:22AM +0800, Ming Lei wrote:
> Hello,
> 
> This patchset adds UBLK_F_BATCH_IO feature for communicating between kernel and ublk
> server in batching way:
> 
> - Per-queue vs Per-I/O: Commands operate on queues rather than individual I/Os
> 
> - Batch processing: Multiple I/Os are handled in single operation
> 
> - Multishot commands: Use io_uring multishot for reducing submission overhead
> 
> - Flexible task assignment: Any task can handle any I/O (no per-I/O daemons)
> 
> - Better load balancing: Tasks can adjust their workload dynamically
> 
> - help for future optimizations:
> 	- blk-mq batch tags free
>   	- support io-poll
> 	- per-task batch for avoiding per-io lock
> 	- fetch command priority
> 
> - simplify command cancel process with per-queue lock
> 
> selftest are provided.
> 
> 
> Performance test result(IOPS) on V3:
> 
> - page copy
> 
> tools/testing/selftests/ublk//kublk add -t null -q 16 [-b]
> 
> - zero copy(--auto_zc)
> tools/testing/selftests/ublk//kublk add -t null -q 16 --auto_zc [-b]
> 
> - IO test
> taskset -c 0-31 fio/t/io_uring -p0 -n $JOBS -r 30 /dev/ublkb0
> 
> 1) 16 jobs IO
> - page copy:  			37.77M vs. 42.40M(BATCH_IO), +12%
> - zero copy(--auto_zc): 42.83M vs. 44.43M(BATCH_IO), +3.7%
> 
> 
> 2) single job IO
> - page copy:  			2.54M vs. 2.6M(BATCH_IO),   +2.3%
> - zero copy(--auto_zc): 3.13M vs. 3.35M(BATCH_IO),  +7%
> 
> 
> V4:
> 	- fix handling in case of running out of mshot buffer, request has to
> 	  be un-prepared for zero copy
> 	- don't expose unused tag to userspace
> 	- replace fixed buffer with plain user buffer for
> 	  UBLK_U_IO_PREP_IO_CMDS and UBLK_U_IO_COMMIT_IO_CMDS
> 	- replace iov iterator with plain copy_from_user() for
> 	  ublk_walk_cmd_buf(), code is simplified with performance improvement
> 	- don't touch sqe->len for UBLK_U_IO_PREP_IO_CMDS and
> 	  UBLK_U_IO_COMMIT_IO_CMDS(Caleb Sander Mateos)
> 	- use READ_ONCE() for access sqe->addr (Caleb Sander Mateos)
> 	- all kinds of patch style fix(Caleb Sander Mateos)
> 	- inline __kfifo_alloc() (Caleb Sander Mateos)

Hi Caleb Sander Mateos and Jens,

Caleb has reviewed patches 1 ~ 8, and the driver patches 9 ~ 18 have not
been reviewed yet.

I'd like to hear your thoughts on how to move forward. So far there look to
be several options:

1) merge patches 1 ~ 6 to v6.19 first, as prep patches for BATCH_IO

2) delay the whole patchset to the v6.20 cycle

3) merge the whole patchset to v6.19

I am fine with any of them; which one do you prefer?

BTW, V4 passes all builtin functional and stress tests, and there is just
one small bug fix not posted yet, which can be a follow-up. The new feature
takes a standalone code path, so the regression risk is pretty small.


Thanks,
Ming



* Re: [PATCH V4 00/27] ublk: add UBLK_F_BATCH_IO
  2025-11-28 11:59 ` [PATCH V4 00/27] ublk: add UBLK_F_BATCH_IO Ming Lei
@ 2025-11-28 16:19   ` Jens Axboe
  2025-11-28 19:07     ` Caleb Sander Mateos
  0 siblings, 1 reply; 66+ messages in thread
From: Jens Axboe @ 2025-11-28 16:19 UTC (permalink / raw)
  To: Ming Lei, linux-block
  Cc: Caleb Sander Mateos, Uday Shankar, Stefani Seibold, Andrew Morton,
	linux-kernel

On 11/28/25 4:59 AM, Ming Lei wrote:
> On Fri, Nov 21, 2025 at 09:58:22AM +0800, Ming Lei wrote:
>> Hello,
>>
>> This patchset adds UBLK_F_BATCH_IO feature for communicating between kernel and ublk
>> server in batching way:
>>
>> - Per-queue vs Per-I/O: Commands operate on queues rather than individual I/Os
>>
>> - Batch processing: Multiple I/Os are handled in single operation
>>
>> - Multishot commands: Use io_uring multishot for reducing submission overhead
>>
>> - Flexible task assignment: Any task can handle any I/O (no per-I/O daemons)
>>
>> - Better load balancing: Tasks can adjust their workload dynamically
>>
>> - help for future optimizations:
>> 	- blk-mq batch tags free
>>   	- support io-poll
>> 	- per-task batch for avoiding per-io lock
>> 	- fetch command priority
>>
>> - simplify command cancel process with per-queue lock
>>
>> selftest are provided.
>>
>>
>> Performance test result(IOPS) on V3:
>>
>> - page copy
>>
>> tools/testing/selftests/ublk//kublk add -t null -q 16 [-b]
>>
>> - zero copy(--auto_zc)
>> tools/testing/selftests/ublk//kublk add -t null -q 16 --auto_zc [-b]
>>
>> - IO test
>> taskset -c 0-31 fio/t/io_uring -p0 -n $JOBS -r 30 /dev/ublkb0
>>
>> 1) 16 jobs IO
>> - page copy:  			37.77M vs. 42.40M(BATCH_IO), +12%
>> - zero copy(--auto_zc): 42.83M vs. 44.43M(BATCH_IO), +3.7%
>>
>>
>> 2) single job IO
>> - page copy:  			2.54M vs. 2.6M(BATCH_IO),   +2.3%
>> - zero copy(--auto_zc): 3.13M vs. 3.35M(BATCH_IO),  +7%
>>
>>
>> V4:
>> 	- fix handling in case of running out of mshot buffer, request has to
>> 	  be un-prepared for zero copy
>> 	- don't expose unused tag to userspace
>> 	- replace fixed buffer with plain user buffer for
>> 	  UBLK_U_IO_PREP_IO_CMDS and UBLK_U_IO_COMMIT_IO_CMDS
>> 	- replace iov iterator with plain copy_from_user() for
>> 	  ublk_walk_cmd_buf(), code is simplified with performance improvement
>> 	- don't touch sqe->len for UBLK_U_IO_PREP_IO_CMDS and
>> 	  UBLK_U_IO_COMMIT_IO_CMDS(Caleb Sander Mateos)
>> 	- use READ_ONCE() for access sqe->addr (Caleb Sander Mateos)
>> 	- all kinds of patch style fix(Caleb Sander Mateos)
>> 	- inline __kfifo_alloc() (Caleb Sander Mateos)
> 
> Hi Caleb Sander Mateos and Jens,
> 
> Caleb have reviewed patch 1 ~ patch 8, and driver patch 9 ~ patch 18 are not
> reviewed yet.
> 
> I'd want to hear your idea for how to move on. So far, looks there are
> several ways:
> 
> 1) merge patch 1 ~ patch 6 to v6.19 first, which can be prep patches for BATCH_IO
> 
> 2) delay the whole patchset to v6.20 cycle
> 
> 3) merge the whole patchset to v6.19
> 
> I am fine with either one, which one do you prefer to?
> 
> BTW, V4 pass all builtin function and stress tests, and there is just one small bug
> fix not posted yet, which can be a follow-up. The new feature takes standalone
> code path, so regression risk is pretty small.

I'm fine taking the whole thing for 6.19. Caleb, let me know if you
disagree. I'll queue 1..6 for now, then I can follow up later today with
the rest as needed.

-- 
Jens Axboe


* Re: (subset) [PATCH V4 00/27] ublk: add UBLK_F_BATCH_IO
  2025-11-21  1:58 [PATCH V4 00/27] ublk: add UBLK_F_BATCH_IO Ming Lei
                   ` (27 preceding siblings ...)
  2025-11-28 11:59 ` [PATCH V4 00/27] ublk: add UBLK_F_BATCH_IO Ming Lei
@ 2025-11-28 16:22 ` Jens Axboe
  28 siblings, 0 replies; 66+ messages in thread
From: Jens Axboe @ 2025-11-28 16:22 UTC (permalink / raw)
  To: linux-block, Ming Lei
  Cc: Caleb Sander Mateos, Uday Shankar, Stefani Seibold, Andrew Morton,
	linux-kernel


On Fri, 21 Nov 2025 09:58:22 +0800, Ming Lei wrote:
> This patchset adds UBLK_F_BATCH_IO feature for communicating between kernel and ublk
> server in batching way:
> 
> - Per-queue vs Per-I/O: Commands operate on queues rather than individual I/Os
> 
> - Batch processing: Multiple I/Os are handled in single operation
> 
> [...]

Applied, thanks!

[01/27] kfifo: add kfifo_alloc_node() helper for NUMA awareness
        commit: 9574b21e952256d4fa3c8797c94482a240992d18
[02/27] ublk: add parameter `struct io_uring_cmd *` to ublk_prep_auto_buf_reg()
        commit: 3035b9b46b0611898babc0b96ede65790d3566f7
[03/27] ublk: add `union ublk_io_buf` with improved naming
        commit: 8d61ece156bd4f2b9e7d3b2a374a26d42c7a4a06
[04/27] ublk: refactor auto buffer register in ublk_dispatch_req()
        commit: 0a9beafa7c633e6ff66b05b81eea78231b7e6520
[05/27] ublk: pass const pointer to ublk_queue_is_zoned()
        commit: 3443bab2f8e44e00adaf76ba677d4219416376f2
[06/27] ublk: add helper of __ublk_fetch()
        commit: 28d7a371f021419cb6c3a243f5cf167f88eb51b9

Best regards,
-- 
Jens Axboe





* Re: [PATCH V4 00/27] ublk: add UBLK_F_BATCH_IO
  2025-11-28 16:19   ` Jens Axboe
@ 2025-11-28 19:07     ` Caleb Sander Mateos
  2025-11-29  1:24       ` Ming Lei
  0 siblings, 1 reply; 66+ messages in thread
From: Caleb Sander Mateos @ 2025-11-28 19:07 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Ming Lei, linux-block, Uday Shankar, Stefani Seibold,
	Andrew Morton, linux-kernel

On Fri, Nov 28, 2025 at 8:19 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 11/28/25 4:59 AM, Ming Lei wrote:
> > On Fri, Nov 21, 2025 at 09:58:22AM +0800, Ming Lei wrote:
> >> Hello,
> >>
> >> This patchset adds UBLK_F_BATCH_IO feature for communicating between kernel and ublk
> >> server in batching way:
> >>
> >> - Per-queue vs Per-I/O: Commands operate on queues rather than individual I/Os
> >>
> >> - Batch processing: Multiple I/Os are handled in single operation
> >>
> >> - Multishot commands: Use io_uring multishot for reducing submission overhead
> >>
> >> - Flexible task assignment: Any task can handle any I/O (no per-I/O daemons)
> >>
> >> - Better load balancing: Tasks can adjust their workload dynamically
> >>
> >> - help for future optimizations:
> >>      - blk-mq batch tags free
> >>      - support io-poll
> >>      - per-task batch for avoiding per-io lock
> >>      - fetch command priority
> >>
> >> - simplify command cancel process with per-queue lock
> >>
> >> selftest are provided.
> >>
> >>
> >> Performance test result(IOPS) on V3:
> >>
> >> - page copy
> >>
> >> tools/testing/selftests/ublk//kublk add -t null -q 16 [-b]
> >>
> >> - zero copy(--auto_zc)
> >> tools/testing/selftests/ublk//kublk add -t null -q 16 --auto_zc [-b]
> >>
> >> - IO test
> >> taskset -c 0-31 fio/t/io_uring -p0 -n $JOBS -r 30 /dev/ublkb0
> >>
> >> 1) 16 jobs IO
> >> - page copy:                         37.77M vs. 42.40M(BATCH_IO), +12%
> >> - zero copy(--auto_zc): 42.83M vs. 44.43M(BATCH_IO), +3.7%
> >>
> >>
> >> 2) single job IO
> >> - page copy:                         2.54M vs. 2.6M(BATCH_IO),   +2.3%
> >> - zero copy(--auto_zc): 3.13M vs. 3.35M(BATCH_IO),  +7%
> >>
> >>
> >> V4:
> >>      - fix handling in case of running out of mshot buffer, request has to
> >>        be un-prepared for zero copy
> >>      - don't expose unused tag to userspace
> >>      - replace fixed buffer with plain user buffer for
> >>        UBLK_U_IO_PREP_IO_CMDS and UBLK_U_IO_COMMIT_IO_CMDS
> >>      - replace iov iterator with plain copy_from_user() for
> >>        ublk_walk_cmd_buf(), code is simplified with performance improvement
> >>      - don't touch sqe->len for UBLK_U_IO_PREP_IO_CMDS and
> >>        UBLK_U_IO_COMMIT_IO_CMDS(Caleb Sander Mateos)
> >>      - use READ_ONCE() for access sqe->addr (Caleb Sander Mateos)
> >>      - all kinds of patch style fix(Caleb Sander Mateos)
> >>      - inline __kfifo_alloc() (Caleb Sander Mateos)
> >
> > Hi Caleb Sander Mateos and Jens,
> >
> > Caleb have reviewed patch 1 ~ patch 8, and driver patch 9 ~ patch 18 are not
> > reviewed yet.
> >
> > I'd want to hear your idea for how to move on. So far, looks there are
> > several ways:
> >
> > 1) merge patch 1 ~ patch 6 to v6.19 first, which can be prep patches for BATCH_IO
> >
> > 2) delay the whole patchset to v6.20 cycle
> >
> > 3) merge the whole patchset to v6.19
> >
> > I am fine with either one, which one do you prefer to?
> >
> > BTW, V4 pass all builtin function and stress tests, and there is just one small bug
> > fix not posted yet, which can be a follow-up. The new feature takes standalone
> > code path, so regression risk is pretty small.
>
> I'm fine taking the whole thing for 6.19. Caleb let me know if you
> disagree. I'll queue 1..6 for now, then can follow up later today with
> the rest as needed.

Sorry I haven't gotten around to reviewing the rest of the series yet.
I will try to take a look at them all this weekend. I'm not sure the
batching feature would make sense for our ublk application use case,
but I have no objection to it as long as it doesn't regress the
non-batched ublk behavior/performance.
No problem with queueing up patches 1-6 now (though patch 1 may need
an ack from a kfifo maintainer?).

Thanks,
Caleb
>
> --
> Jens Axboe


* Re: [PATCH V4 00/27] ublk: add UBLK_F_BATCH_IO
  2025-11-28 19:07     ` Caleb Sander Mateos
@ 2025-11-29  1:24       ` Ming Lei
  0 siblings, 0 replies; 66+ messages in thread
From: Ming Lei @ 2025-11-29  1:24 UTC (permalink / raw)
  To: Caleb Sander Mateos
  Cc: Jens Axboe, linux-block, Uday Shankar, Stefani Seibold,
	Andrew Morton, linux-kernel

On Fri, Nov 28, 2025 at 11:07:17AM -0800, Caleb Sander Mateos wrote:
> On Fri, Nov 28, 2025 at 8:19 AM Jens Axboe <axboe@kernel.dk> wrote:
> >
> > On 11/28/25 4:59 AM, Ming Lei wrote:
> > > On Fri, Nov 21, 2025 at 09:58:22AM +0800, Ming Lei wrote:
> > >> Hello,
> > >>
> > >> This patchset adds UBLK_F_BATCH_IO feature for communicating between kernel and ublk
> > >> server in batching way:
> > >>
> > >> - Per-queue vs Per-I/O: Commands operate on queues rather than individual I/Os
> > >>
> > >> - Batch processing: Multiple I/Os are handled in single operation
> > >>
> > >> - Multishot commands: Use io_uring multishot for reducing submission overhead
> > >>
> > >> - Flexible task assignment: Any task can handle any I/O (no per-I/O daemons)
> > >>
> > >> - Better load balancing: Tasks can adjust their workload dynamically
> > >>
> > >> - help for future optimizations:
> > >>      - blk-mq batch tags free
> > >>      - support io-poll
> > >>      - per-task batch for avoiding per-io lock
> > >>      - fetch command priority
> > >>
> > >> - simplify command cancel process with per-queue lock
> > >>
> > >> selftest are provided.
> > >>
> > >>
> > >> Performance test result(IOPS) on V3:
> > >>
> > >> - page copy
> > >>
> > >> tools/testing/selftests/ublk//kublk add -t null -q 16 [-b]
> > >>
> > >> - zero copy(--auto_zc)
> > >> tools/testing/selftests/ublk//kublk add -t null -q 16 --auto_zc [-b]
> > >>
> > >> - IO test
> > >> taskset -c 0-31 fio/t/io_uring -p0 -n $JOBS -r 30 /dev/ublkb0
> > >>
> > >> 1) 16 jobs IO
> > >> - page copy:                         37.77M vs. 42.40M(BATCH_IO), +12%
> > >> - zero copy(--auto_zc): 42.83M vs. 44.43M(BATCH_IO), +3.7%
> > >>
> > >>
> > >> 2) single job IO
> > >> - page copy:                         2.54M vs. 2.6M(BATCH_IO),   +2.3%
> > >> - zero copy(--auto_zc): 3.13M vs. 3.35M(BATCH_IO),  +7%
> > >>
> > >>
> > >> V4:
> > >>      - fix handling in case of running out of mshot buffer, request has to
> > >>        be un-prepared for zero copy
> > >>      - don't expose unused tag to userspace
> > >>      - replace fixed buffer with plain user buffer for
> > >>        UBLK_U_IO_PREP_IO_CMDS and UBLK_U_IO_COMMIT_IO_CMDS
> > >>      - replace iov iterator with plain copy_from_user() for
> > >>        ublk_walk_cmd_buf(), code is simplified with performance improvement
> > >>      - don't touch sqe->len for UBLK_U_IO_PREP_IO_CMDS and
> > >>        UBLK_U_IO_COMMIT_IO_CMDS(Caleb Sander Mateos)
> > >>      - use READ_ONCE() for access sqe->addr (Caleb Sander Mateos)
> > >>      - all kinds of patch style fix(Caleb Sander Mateos)
> > >>      - inline __kfifo_alloc() (Caleb Sander Mateos)
> > >
> > > Hi Caleb Sander Mateos and Jens,
> > >
> > > Caleb have reviewed patch 1 ~ patch 8, and driver patch 9 ~ patch 18 are not
> > > reviewed yet.
> > >
> > > I'd want to hear your idea for how to move on. So far, looks there are
> > > several ways:
> > >
> > > 1) merge patch 1 ~ patch 6 to v6.19 first, which can be prep patches for BATCH_IO
> > >
> > > 2) delay the whole patchset to v6.20 cycle
> > >
> > > 3) merge the whole patchset to v6.19
> > >
> > > I am fine with either one, which one do you prefer to?
> > >
> > > BTW, V4 pass all builtin function and stress tests, and there is just one small bug
> > > fix not posted yet, which can be a follow-up. The new feature takes standalone
> > > code path, so regression risk is pretty small.
> >
> > I'm fine taking the whole thing for 6.19. Caleb let me know if you
> > disagree. I'll queue 1..6 for now, then can follow up later today with
> > the rest as needed.
> 
> Sorry I haven't gotten around to reviewing the rest of the series yet.
> I will try to take a look at them all this weekend. I'm not sure the
> batching feature would make sense for our ublk application use case,
> but I have no objection to it as long as it doesn't regress the
> non-batched ublk behavior/performance.
> No problem with queueing up patches 1-6 now (though patch 1 may need
> an ack from a kfifo maintainer?).

BTW, the BATCH_IO feature brings many good things:

- batch blk-mq completion: the page-copy IO mode has shown a >12% IOPS
  improvement, and there is a chance to apply it to zero copy too in the
  future

- io poll becomes much easier to support: it can be used to poll the nvme
  char/block device for better IOPS

- the io cancel code path becomes less fragile and easier to debug: in a
  typical implementation there are only one or two per-queue FETCH
  (multishot) commands, and the others are just sync one-shot commands

- more chances to improve perf: it saves lots of generic uring_cmd code
  path cost, such as security_uring_cmd()

- a `perf bug fix` for UBLK_F_PER_IO_DAEMON, together with robust load
  balance support

	IOPS is improved by 4X-5X in `fio/t/io_uring -p0 /dev/ublkbN` between:
		./kublk add -t null  --nthreads 8 -q 4 --per_io_tasks
		and
		./kublk add -t null  --nthreads 8 -q 4 -b

- with the per-io lock, the fast io path becomes more robust; it can still
  be bypassed in the future in the per-io-daemon case


The cost is some complexity in the ublk server implementation, which has to
maintain one or two per-queue FETCH buffers and one or two per-queue COMMIT
buffers.


Thanks,
Ming



* Re: [PATCH V4 01/27] kfifo: add kfifo_alloc_node() helper for NUMA awareness
  2025-11-21  1:58 ` [PATCH V4 01/27] kfifo: add kfifo_alloc_node() helper for NUMA awareness Ming Lei
@ 2025-11-29 19:12   ` Caleb Sander Mateos
  2025-12-01  1:46     ` Ming Lei
  0 siblings, 1 reply; 66+ messages in thread
From: Caleb Sander Mateos @ 2025-11-29 19:12 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Uday Shankar, Stefani Seibold,
	Andrew Morton, linux-kernel

On Thu, Nov 20, 2025 at 5:59 PM Ming Lei <ming.lei@redhat.com> wrote:
>
> Add __kfifo_alloc_node() by refactoring and reusing __kfifo_alloc(),
> and define kfifo_alloc_node() macro to support NUMA-aware memory
> allocation.
>
> The new __kfifo_alloc_node() function accepts a NUMA node parameter
> and uses kmalloc_array_node() instead of kmalloc_array() for
> node-specific allocation. The existing __kfifo_alloc() now calls
> __kfifo_alloc_node() with NUMA_NO_NODE to maintain backward
> compatibility.
>
> This enables users to allocate kfifo buffers on specific NUMA nodes,
> which is important for performance in NUMA systems where the kfifo
> will be primarily accessed by threads running on specific nodes.
>
> Cc: Stefani Seibold <stefani@seibold.net>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  include/linux/kfifo.h | 34 ++++++++++++++++++++++++++++++++--
>  lib/kfifo.c           |  8 ++++----
>  2 files changed, 36 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/kfifo.h b/include/linux/kfifo.h
> index fd743d4c4b4b..8b81ac74829c 100644
> --- a/include/linux/kfifo.h
> +++ b/include/linux/kfifo.h
> @@ -369,6 +369,30 @@ __kfifo_int_must_check_helper( \
>  }) \
>  )
>
> +/**
> + * kfifo_alloc_node - dynamically allocates a new fifo buffer on a NUMA node
> + * @fifo: pointer to the fifo
> + * @size: the number of elements in the fifo, this must be a power of 2
> + * @gfp_mask: get_free_pages mask, passed to kmalloc()
> + * @node: NUMA node to allocate memory on
> + *
> + * This macro dynamically allocates a new fifo buffer with NUMA node awareness.
> + *
> + * The number of elements will be rounded-up to a power of 2.
> + * The fifo will be release with kfifo_free().
> + * Return 0 if no error, otherwise an error code.
> + */
> +#define kfifo_alloc_node(fifo, size, gfp_mask, node) \
> +__kfifo_int_must_check_helper( \
> +({ \
> +       typeof((fifo) + 1) __tmp = (fifo); \
> +       struct __kfifo *__kfifo = &__tmp->kfifo; \
> +       __is_kfifo_ptr(__tmp) ? \
> +       __kfifo_alloc_node(__kfifo, size, sizeof(*__tmp->type), gfp_mask, node) : \
> +       -EINVAL; \
> +}) \
> +)

Looks like we could avoid some code duplication by defining
kfifo_alloc(fifo, size, gfp_mask) as kfifo_alloc_node(fifo, size,
gfp_mask, NUMA_NO_NODE). Otherwise, this looks good to me.
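
A minimal sketch of that suggestion, reusing the node-aware macro added in
this patch (illustrative only, not part of the posted series), would be:

	#define kfifo_alloc(fifo, size, gfp_mask) \
		kfifo_alloc_node(fifo, size, gfp_mask, NUMA_NO_NODE)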

Best,
Caleb

> +
>  /**
>   * kfifo_free - frees the fifo
>   * @fifo: the fifo to be freed
> @@ -899,8 +923,14 @@ __kfifo_uint_must_check_helper( \
>  )
>
>
> -extern int __kfifo_alloc(struct __kfifo *fifo, unsigned int size,
> -       size_t esize, gfp_t gfp_mask);
> +extern int __kfifo_alloc_node(struct __kfifo *fifo, unsigned int size,
> +       size_t esize, gfp_t gfp_mask, int node);
> +
> +static inline int __kfifo_alloc(struct __kfifo *fifo, unsigned int size,
> +                               size_t esize, gfp_t gfp_mask)
> +{
> +       return __kfifo_alloc_node(fifo, size, esize, gfp_mask, NUMA_NO_NODE);
> +}
>
>  extern void __kfifo_free(struct __kfifo *fifo);
>
> diff --git a/lib/kfifo.c b/lib/kfifo.c
> index a8b2eed90599..525e66f8294c 100644
> --- a/lib/kfifo.c
> +++ b/lib/kfifo.c
> @@ -22,8 +22,8 @@ static inline unsigned int kfifo_unused(struct __kfifo *fifo)
>         return (fifo->mask + 1) - (fifo->in - fifo->out);
>  }
>
> -int __kfifo_alloc(struct __kfifo *fifo, unsigned int size,
> -               size_t esize, gfp_t gfp_mask)
> +int __kfifo_alloc_node(struct __kfifo *fifo, unsigned int size,
> +               size_t esize, gfp_t gfp_mask, int node)
>  {
>         /*
>          * round up to the next power of 2, since our 'let the indices
> @@ -41,7 +41,7 @@ int __kfifo_alloc(struct __kfifo *fifo, unsigned int size,
>                 return -EINVAL;
>         }
>
> -       fifo->data = kmalloc_array(esize, size, gfp_mask);
> +       fifo->data = kmalloc_array_node(esize, size, gfp_mask, node);
>
>         if (!fifo->data) {
>                 fifo->mask = 0;
> @@ -51,7 +51,7 @@ int __kfifo_alloc(struct __kfifo *fifo, unsigned int size,
>
>         return 0;
>  }
> -EXPORT_SYMBOL(__kfifo_alloc);
> +EXPORT_SYMBOL(__kfifo_alloc_node);
>
>  void __kfifo_free(struct __kfifo *fifo)
>  {
> --
> 2.47.0
>


* Re: [PATCH V4 09/27] ublk: add new batch command UBLK_U_IO_PREP_IO_CMDS & UBLK_U_IO_COMMIT_IO_CMDS
  2025-11-21  1:58 ` [PATCH V4 09/27] ublk: add new batch command UBLK_U_IO_PREP_IO_CMDS & UBLK_U_IO_COMMIT_IO_CMDS Ming Lei
@ 2025-11-29 19:19   ` Caleb Sander Mateos
  0 siblings, 0 replies; 66+ messages in thread
From: Caleb Sander Mateos @ 2025-11-29 19:19 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Uday Shankar, Stefani Seibold,
	Andrew Morton, linux-kernel

On Thu, Nov 20, 2025 at 5:59 PM Ming Lei <ming.lei@redhat.com> wrote:
>
> Add new command UBLK_U_IO_PREP_IO_CMDS, which is the batch version of
> UBLK_IO_FETCH_REQ.
>
> Add new command UBLK_U_IO_COMMIT_IO_CMDS, which is for committing io command
> result only, still the batch version.
>
> The new command header type is `struct ublk_batch_io`.
>
> This patch doesn't actually implement these commands yet, just validates the
> SQE fields.
>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>

> ---
>  drivers/block/ublk_drv.c      | 85 ++++++++++++++++++++++++++++++++++-
>  include/uapi/linux/ublk_cmd.h | 49 ++++++++++++++++++++
>  2 files changed, 133 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> index c62b2f2057fe..21890947ceec 100644
> --- a/drivers/block/ublk_drv.c
> +++ b/drivers/block/ublk_drv.c
> @@ -85,6 +85,11 @@
>          UBLK_PARAM_TYPE_DEVT | UBLK_PARAM_TYPE_ZONED |    \
>          UBLK_PARAM_TYPE_DMA_ALIGN | UBLK_PARAM_TYPE_SEGMENT)
>
> +#define UBLK_BATCH_F_ALL  \
> +       (UBLK_BATCH_F_HAS_ZONE_LBA | \
> +        UBLK_BATCH_F_HAS_BUF_ADDR | \
> +        UBLK_BATCH_F_AUTO_BUF_REG_FALLBACK)
> +
>  struct ublk_uring_cmd_pdu {
>         /*
>          * Store requests in same batch temporarily for queuing them to
> @@ -108,6 +113,12 @@ struct ublk_uring_cmd_pdu {
>         u16 tag;
>  };
>
> +struct ublk_batch_io_data {
> +       struct ublk_device *ub;
> +       struct io_uring_cmd *cmd;
> +       struct ublk_batch_io header;
> +};
> +
>  /*
>   * io command is active: sqe cmd is received, and its cqe isn't done
>   *
> @@ -2520,10 +2531,82 @@ static int ublk_ch_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
>         return ublk_ch_uring_cmd_local(cmd, issue_flags);
>  }
>
> +static int ublk_check_batch_cmd_flags(const struct ublk_batch_io *uc)
> +{
> +       unsigned elem_bytes = sizeof(struct ublk_elem_header);
> +
> +       if (uc->flags & ~UBLK_BATCH_F_ALL)
> +               return -EINVAL;
> +
> +       /* UBLK_BATCH_F_AUTO_BUF_REG_FALLBACK requires buffer index */
> +       if ((uc->flags & UBLK_BATCH_F_AUTO_BUF_REG_FALLBACK) &&
> +                       (uc->flags & UBLK_BATCH_F_HAS_BUF_ADDR))
> +               return -EINVAL;
> +
> +       elem_bytes += (uc->flags & UBLK_BATCH_F_HAS_ZONE_LBA ? sizeof(u64) : 0) +
> +               (uc->flags & UBLK_BATCH_F_HAS_BUF_ADDR ? sizeof(u64) : 0);
> +       if (uc->elem_bytes != elem_bytes)
> +               return -EINVAL;
> +       return 0;
> +}
> +
> +static int ublk_check_batch_cmd(const struct ublk_batch_io_data *data)
> +{
> +
> +       const struct ublk_batch_io *uc = &data->header;
> +
> +       if (uc->nr_elem > data->ub->dev_info.queue_depth)
> +               return -E2BIG;
> +
> +       if ((uc->flags & UBLK_BATCH_F_HAS_ZONE_LBA) &&
> +                       !ublk_dev_is_zoned(data->ub))
> +               return -EINVAL;
> +
> +       if ((uc->flags & UBLK_BATCH_F_HAS_BUF_ADDR) &&
> +                       !ublk_dev_need_map_io(data->ub))
> +               return -EINVAL;
> +
> +       if ((uc->flags & UBLK_BATCH_F_AUTO_BUF_REG_FALLBACK) &&
> +                       !ublk_dev_support_auto_buf_reg(data->ub))
> +               return -EINVAL;
> +
> +       return ublk_check_batch_cmd_flags(uc);
> +}
> +
>  static int ublk_ch_batch_io_uring_cmd(struct io_uring_cmd *cmd,
>                                        unsigned int issue_flags)
>  {
> -       return -EOPNOTSUPP;
> +       const struct ublk_batch_io *uc = io_uring_sqe_cmd(cmd->sqe);
> +       struct ublk_device *ub = cmd->file->private_data;
> +       struct ublk_batch_io_data data = {
> +               .ub  = ub,
> +               .cmd = cmd,
> +               .header = (struct ublk_batch_io) {
> +                       .q_id = READ_ONCE(uc->q_id),
> +                       .flags = READ_ONCE(uc->flags),
> +                       .nr_elem = READ_ONCE(uc->nr_elem),
> +                       .elem_bytes = READ_ONCE(uc->elem_bytes),
> +               },
> +       };
> +       u32 cmd_op = cmd->cmd_op;
> +       int ret = -EINVAL;
> +
> +       if (data.header.q_id >= ub->dev_info.nr_hw_queues)
> +               goto out;
> +
> +       switch (cmd_op) {
> +       case UBLK_U_IO_PREP_IO_CMDS:
> +       case UBLK_U_IO_COMMIT_IO_CMDS:
> +               ret = ublk_check_batch_cmd(&data);
> +               if (ret)
> +                       goto out;
> +               ret = -EOPNOTSUPP;
> +               break;
> +       default:
> +               ret = -EOPNOTSUPP;
> +       }
> +out:
> +       return ret;
>  }
>
>  static inline bool ublk_check_ubuf_dir(const struct request *req,
> diff --git a/include/uapi/linux/ublk_cmd.h b/include/uapi/linux/ublk_cmd.h
> index ec77dabba45b..2ce5a496b622 100644
> --- a/include/uapi/linux/ublk_cmd.h
> +++ b/include/uapi/linux/ublk_cmd.h
> @@ -102,6 +102,10 @@
>         _IOWR('u', 0x23, struct ublksrv_io_cmd)
>  #define        UBLK_U_IO_UNREGISTER_IO_BUF     \
>         _IOWR('u', 0x24, struct ublksrv_io_cmd)
> +#define        UBLK_U_IO_PREP_IO_CMDS  \
> +       _IOWR('u', 0x25, struct ublk_batch_io)
> +#define        UBLK_U_IO_COMMIT_IO_CMDS        \
> +       _IOWR('u', 0x26, struct ublk_batch_io)
>
>  /* only ABORT means that no re-fetch */
>  #define UBLK_IO_RES_OK                 0
> @@ -525,6 +529,51 @@ struct ublksrv_io_cmd {
>         };
>  };
>
> +struct ublk_elem_header {
> +       __u16 tag;      /* IO tag */
> +
> +       /*
> +        * Buffer index for incoming io command, only valid iff
> +        * UBLK_F_AUTO_BUF_REG is set
> +        */
> +       __u16 buf_index;
> +       __s32 result;   /* I/O completion result (commit only) */
> +};
> +
> +/*
> + * uring_cmd buffer structure for batch commands
> + *
> + * The buffer includes multiple elements, whose number is specified by
> + * `nr_elem`. Each element is organized in the following order:
> + *
> + * struct ublk_elem_buffer {
> + *     // Mandatory fields (8 bytes)
> + *     struct ublk_elem_header header;
> + *
> + *     // Optional fields (8 bytes each, included based on flags)
> + *
> + *     // Buffer address (if UBLK_BATCH_F_HAS_BUF_ADDR) for copying data
> + *     // between ublk request and ublk server buffer
> + *     __u64 buf_addr;
> + *
> + *     // returned Zone append LBA (if UBLK_BATCH_F_HAS_ZONE_LBA)
> + *     __u64 zone_lba;
> + * }
> + *
> + * Used for `UBLK_U_IO_PREP_IO_CMDS` and `UBLK_U_IO_COMMIT_IO_CMDS`
> + */
> +struct ublk_batch_io {
> +       __u16  q_id;
> +#define UBLK_BATCH_F_HAS_ZONE_LBA      (1 << 0)
> +#define UBLK_BATCH_F_HAS_BUF_ADDR      (1 << 1)
> +#define UBLK_BATCH_F_AUTO_BUF_REG_FALLBACK     (1 << 2)
> +       __u16   flags;
> +       __u16   nr_elem;
> +       __u8    elem_bytes;
> +       __u8    reserved;
> +       __u64   reserved2;
> +};
> +
>  struct ublk_param_basic {
>  #define UBLK_ATTR_READ_ONLY            (1 << 0)
>  #define UBLK_ATTR_ROTATIONAL           (1 << 1)
> --
> 2.47.0
>
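
For reference, the checks above tie `elem_bytes` directly to the optional
fields: each of UBLK_BATCH_F_HAS_BUF_ADDR and UBLK_BATCH_F_HAS_ZONE_LBA adds
8 bytes on top of the 8-byte ublk_elem_header, so only 8, 16 or 24 pass
ublk_check_batch_cmd_flags(). A minimal userspace-side sketch of sizing the
elements and filling the 16-byte batch header (the helper names here are
illustrative, not part of the series):

#include <linux/ublk_cmd.h>
#include <string.h>

/* illustrative helpers, not from the patchset */
static unsigned int batch_elem_bytes(__u16 flags)
{
        unsigned int bytes = sizeof(struct ublk_elem_header);  /* 8 bytes */

        if (flags & UBLK_BATCH_F_HAS_BUF_ADDR)
                bytes += 8;     /* optional __u64 buf_addr */
        if (flags & UBLK_BATCH_F_HAS_ZONE_LBA)
                bytes += 8;     /* optional __u64 zone_lba */
        return bytes;           /* 8, 16 or 24 */
}

static void init_batch_header(struct ublk_batch_io *uc, __u16 q_id,
                              __u16 flags, __u16 nr_elem)
{
        /* F_AUTO_BUF_REG_FALLBACK and F_HAS_BUF_ADDR are mutually exclusive */
        memset(uc, 0, sizeof(*uc));
        uc->q_id = q_id;                /* must be < nr_hw_queues */
        uc->flags = flags;              /* only bits accepted by the checks above */
        uc->nr_elem = nr_elem;          /* at most queue_depth */
        uc->elem_bytes = batch_elem_bytes(flags);
}

The filled struct itself is what the kernel reads back via io_uring_sqe_cmd()
in ublk_ch_batch_io_uring_cmd() above.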

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH V4 10/27] ublk: handle UBLK_U_IO_PREP_IO_CMDS
  2025-11-21  1:58 ` [PATCH V4 10/27] ublk: handle UBLK_U_IO_PREP_IO_CMDS Ming Lei
@ 2025-11-29 19:47   ` Caleb Sander Mateos
  2025-11-30 19:25   ` Caleb Sander Mateos
  1 sibling, 0 replies; 66+ messages in thread
From: Caleb Sander Mateos @ 2025-11-29 19:47 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Uday Shankar, Stefani Seibold,
	Andrew Morton, linux-kernel

On Thu, Nov 20, 2025 at 5:59 PM Ming Lei <ming.lei@redhat.com> wrote:
>
> This commit implements the handling of the UBLK_U_IO_PREP_IO_CMDS command,
> which allows userspace to prepare a batch of I/O requests.
>
> The core of this change is the `ublk_walk_cmd_buf` function, which iterates
> over the elements in the uring_cmd fixed buffer. For each element, it parses
> the I/O details, finds the corresponding `ublk_io` structure, and prepares it
> for future dispatch.
>
> Add per-io lock for protecting concurrent delivery and committing.
>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  drivers/block/ublk_drv.c      | 193 +++++++++++++++++++++++++++++++++-
>  include/uapi/linux/ublk_cmd.h |   5 +
>  2 files changed, 197 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> index 21890947ceec..66c77daae955 100644
> --- a/drivers/block/ublk_drv.c
> +++ b/drivers/block/ublk_drv.c
> @@ -117,6 +117,7 @@ struct ublk_batch_io_data {
>         struct ublk_device *ub;
>         struct io_uring_cmd *cmd;
>         struct ublk_batch_io header;
> +       unsigned int issue_flags;

This looks unused in this commit. Move it to the previous commit
introducing struct ublk_batch_io_data, or the next commit that uses
issue_flags?

Other than that,
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>

>  };
>
>  /*
> @@ -201,6 +202,7 @@ struct ublk_io {
>         unsigned task_registered_buffers;
>
>         void *buf_ctx_handle;
> +       spinlock_t lock;
>  } ____cacheline_aligned_in_smp;
>
>  struct ublk_queue {
> @@ -270,6 +272,16 @@ static inline bool ublk_dev_support_batch_io(const struct ublk_device *ub)
>         return false;
>  }
>
> +static inline void ublk_io_lock(struct ublk_io *io)
> +{
> +       spin_lock(&io->lock);
> +}
> +
> +static inline void ublk_io_unlock(struct ublk_io *io)
> +{
> +       spin_unlock(&io->lock);
> +}
> +
>  static inline struct ublksrv_io_desc *
>  ublk_get_iod(const struct ublk_queue *ubq, unsigned tag)
>  {
> @@ -2531,6 +2543,171 @@ static int ublk_ch_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
>         return ublk_ch_uring_cmd_local(cmd, issue_flags);
>  }
>
> +static inline __u64 ublk_batch_buf_addr(const struct ublk_batch_io *uc,
> +                                       const struct ublk_elem_header *elem)
> +{
> +       const void *buf = elem;
> +
> +       if (uc->flags & UBLK_BATCH_F_HAS_BUF_ADDR)
> +               return *(__u64 *)(buf + sizeof(*elem));
> +       return 0;
> +}
> +
> +static struct ublk_auto_buf_reg
> +ublk_batch_auto_buf_reg(const struct ublk_batch_io *uc,
> +                       const struct ublk_elem_header *elem)
> +{
> +       struct ublk_auto_buf_reg reg = {
> +               .index = elem->buf_index,
> +               .flags = (uc->flags & UBLK_BATCH_F_AUTO_BUF_REG_FALLBACK) ?
> +                       UBLK_AUTO_BUF_REG_FALLBACK : 0,
> +       };
> +
> +       return reg;
> +}
> +
> +/*
> + * 48 can hold any type of buffer element(8, 16 and 24 bytes) because
> + * it is the least common multiple(LCM) of 8, 16 and 24
> + */
> +#define UBLK_CMD_BATCH_TMP_BUF_SZ  (48 * 10)
> +struct ublk_batch_io_iter {
> +       void __user *uaddr;
> +       unsigned done, total;
> +       unsigned char elem_bytes;
> +       /* copy to this buffer from user space */
> +       unsigned char buf[UBLK_CMD_BATCH_TMP_BUF_SZ];
> +};
> +
> +static inline int
> +__ublk_walk_cmd_buf(struct ublk_queue *ubq,
> +                   struct ublk_batch_io_iter *iter,
> +                   const struct ublk_batch_io_data *data,
> +                   unsigned bytes,
> +                   int (*cb)(struct ublk_queue *q,
> +                           const struct ublk_batch_io_data *data,
> +                           const struct ublk_elem_header *elem))
> +{
> +       unsigned int i;
> +       int ret = 0;
> +
> +       for (i = 0; i < bytes; i += iter->elem_bytes) {
> +               const struct ublk_elem_header *elem =
> +                       (const struct ublk_elem_header *)&iter->buf[i];
> +
> +               if (unlikely(elem->tag >= data->ub->dev_info.queue_depth)) {
> +                       ret = -EINVAL;
> +                       break;
> +               }
> +
> +               ret = cb(ubq, data, elem);
> +               if (unlikely(ret))
> +                       break;
> +       }
> +
> +       iter->done += i;
> +       return ret;
> +}
> +
> +static int ublk_walk_cmd_buf(struct ublk_batch_io_iter *iter,
> +                            const struct ublk_batch_io_data *data,
> +                            int (*cb)(struct ublk_queue *q,
> +                                    const struct ublk_batch_io_data *data,
> +                                    const struct ublk_elem_header *elem))
> +{
> +       struct ublk_queue *ubq = ublk_get_queue(data->ub, data->header.q_id);
> +       int ret = 0;
> +
> +       while (iter->done < iter->total) {
> +               unsigned int len = min(sizeof(iter->buf), iter->total - iter->done);
> +
> +               if (copy_from_user(iter->buf, iter->uaddr + iter->done, len)) {
> +                       pr_warn("ublk%d: read batch cmd buffer failed\n",
> +                                       data->ub->dev_info.dev_id);
> +                       return -EFAULT;
> +               }
> +
> +               ret = __ublk_walk_cmd_buf(ubq, iter, data, len, cb);
> +               if (ret)
> +                       return ret;
> +       }
> +       return 0;
> +}
> +
> +static int ublk_batch_unprep_io(struct ublk_queue *ubq,
> +                               const struct ublk_batch_io_data *data,
> +                               const struct ublk_elem_header *elem)
> +{
> +       struct ublk_io *io = &ubq->ios[elem->tag];
> +
> +       data->ub->nr_io_ready--;
> +       ublk_io_lock(io);
> +       io->flags = 0;
> +       ublk_io_unlock(io);
> +       return 0;
> +}
> +
> +static void ublk_batch_revert_prep_cmd(struct ublk_batch_io_iter *iter,
> +                                      const struct ublk_batch_io_data *data)
> +{
> +       int ret;
> +
> +       /* Re-process only what we've already processed, starting from beginning */
> +       iter->total = iter->done;
> +       iter->done = 0;
> +
> +       ret = ublk_walk_cmd_buf(iter, data, ublk_batch_unprep_io);
> +       WARN_ON_ONCE(ret);
> +}
> +
> +static int ublk_batch_prep_io(struct ublk_queue *ubq,
> +                             const struct ublk_batch_io_data *data,
> +                             const struct ublk_elem_header *elem)
> +{
> +       struct ublk_io *io = &ubq->ios[elem->tag];
> +       const struct ublk_batch_io *uc = &data->header;
> +       union ublk_io_buf buf = { 0 };
> +       int ret;
> +
> +       if (ublk_dev_support_auto_buf_reg(data->ub))
> +               buf.auto_reg = ublk_batch_auto_buf_reg(uc, elem);
> +       else if (ublk_dev_need_map_io(data->ub)) {
> +               buf.addr = ublk_batch_buf_addr(uc, elem);
> +
> +               ret = ublk_check_fetch_buf(data->ub, buf.addr);
> +               if (ret)
> +                       return ret;
> +       }
> +
> +       ublk_io_lock(io);
> +       ret = __ublk_fetch(data->cmd, data->ub, io);
> +       if (!ret)
> +               io->buf = buf;
> +       ublk_io_unlock(io);
> +
> +       return ret;
> +}
> +
> +static int ublk_handle_batch_prep_cmd(const struct ublk_batch_io_data *data)
> +{
> +       const struct ublk_batch_io *uc = &data->header;
> +       struct io_uring_cmd *cmd = data->cmd;
> +       struct ublk_batch_io_iter iter = {
> +               .uaddr = u64_to_user_ptr(READ_ONCE(cmd->sqe->addr)),
> +               .total = uc->nr_elem * uc->elem_bytes,
> +               .elem_bytes = uc->elem_bytes,
> +       };
> +       int ret;
> +
> +       mutex_lock(&data->ub->mutex);
> +       ret = ublk_walk_cmd_buf(&iter, data, ublk_batch_prep_io);
> +
> +       if (ret && iter.done)
> +               ublk_batch_revert_prep_cmd(&iter, data);
> +       mutex_unlock(&data->ub->mutex);
> +       return ret;
> +}
> +
>  static int ublk_check_batch_cmd_flags(const struct ublk_batch_io *uc)
>  {
>         unsigned elem_bytes = sizeof(struct ublk_elem_header);
> @@ -2587,6 +2764,7 @@ static int ublk_ch_batch_io_uring_cmd(struct io_uring_cmd *cmd,
>                         .nr_elem = READ_ONCE(uc->nr_elem),
>                         .elem_bytes = READ_ONCE(uc->elem_bytes),
>                 },
> +               .issue_flags = issue_flags,
>         };
>         u32 cmd_op = cmd->cmd_op;
>         int ret = -EINVAL;
> @@ -2596,6 +2774,11 @@ static int ublk_ch_batch_io_uring_cmd(struct io_uring_cmd *cmd,
>
>         switch (cmd_op) {
>         case UBLK_U_IO_PREP_IO_CMDS:
> +               ret = ublk_check_batch_cmd(&data);
> +               if (ret)
> +                       goto out;
> +               ret = ublk_handle_batch_prep_cmd(&data);
> +               break;
>         case UBLK_U_IO_COMMIT_IO_CMDS:
>                 ret = ublk_check_batch_cmd(&data);
>                 if (ret)
> @@ -2770,7 +2953,7 @@ static int ublk_init_queue(struct ublk_device *ub, int q_id)
>         struct ublk_queue *ubq;
>         struct page *page;
>         int numa_node;
> -       int size;
> +       int size, i;
>
>         /* Determine NUMA node based on queue's CPU affinity */
>         numa_node = ublk_get_queue_numa_node(ub, q_id);
> @@ -2795,6 +2978,9 @@ static int ublk_init_queue(struct ublk_device *ub, int q_id)
>         }
>         ubq->io_cmd_buf = page_address(page);
>
> +       for (i = 0; i < ubq->q_depth; i++)
> +               spin_lock_init(&ubq->ios[i].lock);
> +
>         ub->queues[q_id] = ubq;
>         ubq->dev = ub;
>         return 0;
> @@ -3021,6 +3207,11 @@ static int ublk_ctrl_start_dev(struct ublk_device *ub,
>                 return -EINVAL;
>
>         mutex_lock(&ub->mutex);
> +       /* device may become not ready in case of F_BATCH */
> +       if (!ublk_dev_ready(ub)) {
> +               ret = -EINVAL;
> +               goto out_unlock;
> +       }
>         if (ub->dev_info.state == UBLK_S_DEV_LIVE ||
>             test_bit(UB_STATE_USED, &ub->state)) {
>                 ret = -EEXIST;
> diff --git a/include/uapi/linux/ublk_cmd.h b/include/uapi/linux/ublk_cmd.h
> index 2ce5a496b622..c96c299057c3 100644
> --- a/include/uapi/linux/ublk_cmd.h
> +++ b/include/uapi/linux/ublk_cmd.h
> @@ -102,6 +102,11 @@
>         _IOWR('u', 0x23, struct ublksrv_io_cmd)
>  #define        UBLK_U_IO_UNREGISTER_IO_BUF     \
>         _IOWR('u', 0x24, struct ublksrv_io_cmd)
> +
> +/*
> + * return 0 if the command is run successfully, otherwise failure code
> + * is returned
> + */
>  #define        UBLK_U_IO_PREP_IO_CMDS  \
>         _IOWR('u', 0x25, struct ublk_batch_io)
>  #define        UBLK_U_IO_COMMIT_IO_CMDS        \
> --
> 2.47.0
>
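
To make the element walk concrete, below is a minimal sketch of how a server
might pack a prep batch in user memory; ublk_handle_batch_prep_cmd() reads
nr_elem * elem_bytes bytes from sqe->addr and ublk_walk_cmd_buf() copies them
in chunks of the temporary buffer. The packing helper is an assumption, not
part of the series, and assumes a device using UBLK_BATCH_F_HAS_BUF_ADDR
(i.e. no auto buffer registration):

#include <linux/ublk_cmd.h>
#include <stdint.h>
#include <string.h>

/*
 * Illustrative only: lay out one element per tag exactly as
 * __ublk_walk_cmd_buf() expects it, i.e. the 8-byte header followed
 * by the optional __u64 buffer address.
 */
static size_t pack_prep_elems(unsigned char *dst, const uint16_t *tags,
                              const uint64_t *buf_addrs, unsigned int nr)
{
        const unsigned int elem_bytes = sizeof(struct ublk_elem_header) +
                                        sizeof(uint64_t);
        unsigned int i;

        for (i = 0; i < nr; i++) {
                struct ublk_elem_header hdr = {
                        .tag = tags[i],         /* must be < queue_depth */
                        .buf_index = 0,         /* unused without auto buf reg */
                        .result = 0,            /* ignored by PREP */
                };

                memcpy(dst, &hdr, sizeof(hdr));
                memcpy(dst + sizeof(hdr), &buf_addrs[i], sizeof(uint64_t));
                dst += elem_bytes;
        }
        /* sqe->addr points at the start of this buffer; sqe->len is not used */
        return nr * elem_bytes;
}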

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH V4 11/27] ublk: handle UBLK_U_IO_COMMIT_IO_CMDS
  2025-11-21  1:58 ` [PATCH V4 11/27] ublk: handle UBLK_U_IO_COMMIT_IO_CMDS Ming Lei
@ 2025-11-30 16:39   ` Caleb Sander Mateos
  2025-12-01 10:25     ` Ming Lei
  0 siblings, 1 reply; 66+ messages in thread
From: Caleb Sander Mateos @ 2025-11-30 16:39 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Uday Shankar, Stefani Seibold,
	Andrew Morton, linux-kernel

On Thu, Nov 20, 2025 at 5:59 PM Ming Lei <ming.lei@redhat.com> wrote:
>
> Handle UBLK_U_IO_COMMIT_IO_CMDS by walking the uring_cmd fixed buffer:
>
> - read each element into one temp buffer in batch style
>
> - parse and apply each element for committing io result
>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  drivers/block/ublk_drv.c      | 117 ++++++++++++++++++++++++++++++++--
>  include/uapi/linux/ublk_cmd.h |   8 +++
>  2 files changed, 121 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> index 66c77daae955..ea992366af5b 100644
> --- a/drivers/block/ublk_drv.c
> +++ b/drivers/block/ublk_drv.c
> @@ -2098,9 +2098,9 @@ static inline int ublk_set_auto_buf_reg(struct ublk_io *io, struct io_uring_cmd
>         return 0;
>  }
>
> -static int ublk_handle_auto_buf_reg(struct ublk_io *io,
> -                                   struct io_uring_cmd *cmd,
> -                                   u16 *buf_idx)
> +static void __ublk_handle_auto_buf_reg(struct ublk_io *io,
> +                                      struct io_uring_cmd *cmd,
> +                                      u16 *buf_idx)

The name could be a bit more descriptive. How about "ublk_clear_auto_buf_reg()"?

>  {
>         if (io->flags & UBLK_IO_FLAG_AUTO_BUF_REG) {
>                 io->flags &= ~UBLK_IO_FLAG_AUTO_BUF_REG;
> @@ -2118,7 +2118,13 @@ static int ublk_handle_auto_buf_reg(struct ublk_io *io,
>                 if (io->buf_ctx_handle == io_uring_cmd_ctx_handle(cmd))
>                         *buf_idx = io->buf.auto_reg.index;
>         }
> +}
>
> +static int ublk_handle_auto_buf_reg(struct ublk_io *io,
> +                                   struct io_uring_cmd *cmd,
> +                                   u16 *buf_idx)
> +{
> +       __ublk_handle_auto_buf_reg(io, cmd, buf_idx);
>         return ublk_set_auto_buf_reg(io, cmd);
>  }
>
> @@ -2553,6 +2559,17 @@ static inline __u64 ublk_batch_buf_addr(const struct ublk_batch_io *uc,
>         return 0;
>  }
>
> +static inline __u64 ublk_batch_zone_lba(const struct ublk_batch_io *uc,
> +                                       const struct ublk_elem_header *elem)
> +{
> +       const void *buf = (const void *)elem;

Unnecessary cast

> +
> +       if (uc->flags & UBLK_BATCH_F_HAS_ZONE_LBA)
> +               return *(__u64 *)(buf + sizeof(*elem) +
> +                               8 * !!(uc->flags & UBLK_BATCH_F_HAS_BUF_ADDR));

Cast to a const pointer?


> +       return -1;
> +}
> +
>  static struct ublk_auto_buf_reg
>  ublk_batch_auto_buf_reg(const struct ublk_batch_io *uc,
>                         const struct ublk_elem_header *elem)
> @@ -2708,6 +2725,98 @@ static int ublk_handle_batch_prep_cmd(const struct ublk_batch_io_data *data)
>         return ret;
>  }
>
> +static int ublk_batch_commit_io_check(const struct ublk_queue *ubq,
> +                                     struct ublk_io *io,
> +                                     union ublk_io_buf *buf)
> +{
> +       struct request *req = io->req;
> +
> +       if (!req)
> +               return -EINVAL;

This check seems redundant with the UBLK_IO_FLAG_OWNED_BY_SRV check?

> +
> +       if (io->flags & UBLK_IO_FLAG_ACTIVE)
> +               return -EBUSY;

Aren't UBLK_IO_FLAG_ACTIVE and UBLK_IO_FLAG_OWNED_BY_SRV mutually
exclusive? Then this check is also redundant with the
UBLK_IO_FLAG_OWNED_BY_SRV check.

> +
> +       if (!(io->flags & UBLK_IO_FLAG_OWNED_BY_SRV))
> +               return -EINVAL;
> +
> +       if (ublk_need_map_io(ubq)) {
> +               /*
> +                * COMMIT_AND_FETCH_REQ has to provide IO buffer if
> +                * NEED GET DATA is not enabled or it is Read IO.
> +                */
> +               if (!buf->addr && (!ublk_need_get_data(ubq) ||
> +                                       req_op(req) == REQ_OP_READ))
> +                       return -EINVAL;
> +       }
> +       return 0;
> +}
> +
> +static int ublk_batch_commit_io(struct ublk_queue *ubq,
> +                               const struct ublk_batch_io_data *data,
> +                               const struct ublk_elem_header *elem)
> +{
> +       struct ublk_io *io = &ubq->ios[elem->tag];
> +       const struct ublk_batch_io *uc = &data->header;
> +       u16 buf_idx = UBLK_INVALID_BUF_IDX;
> +       union ublk_io_buf buf = { 0 };
> +       struct request *req = NULL;
> +       bool auto_reg = false;
> +       bool compl = false;
> +       int ret;
> +
> +       if (ublk_dev_support_auto_buf_reg(data->ub)) {
> +               buf.auto_reg = ublk_batch_auto_buf_reg(uc, elem);
> +               auto_reg = true;
> +       } else if (ublk_dev_need_map_io(data->ub))
> +               buf.addr = ublk_batch_buf_addr(uc, elem);
> +
> +       ublk_io_lock(io);
> +       ret = ublk_batch_commit_io_check(ubq, io, &buf);
> +       if (!ret) {
> +               io->res = elem->result;
> +               io->buf = buf;
> +               req = ublk_fill_io_cmd(io, data->cmd);
> +
> +               if (auto_reg)
> +                       __ublk_handle_auto_buf_reg(io, data->cmd, &buf_idx);
> +               compl = ublk_need_complete_req(data->ub, io);
> +       }
> +       ublk_io_unlock(io);
> +
> +       if (unlikely(ret)) {
> +               pr_warn("%s: dev %u queue %u io %u: commit failure %d\n",
> +                       __func__, data->ub->dev_info.dev_id, ubq->q_id,
> +                       elem->tag, ret);

This warning can be triggered by userspace. It should probably be
rate-limited or changed to pr_devel().

Best,
Caleb

> +               return ret;
> +       }
> +
> +       /* can't touch 'ublk_io' any more */
> +       if (buf_idx != UBLK_INVALID_BUF_IDX)
> +               io_buffer_unregister_bvec(data->cmd, buf_idx, data->issue_flags);
> +       if (req_op(req) == REQ_OP_ZONE_APPEND)
> +               req->__sector = ublk_batch_zone_lba(uc, elem);
> +       if (compl)
> +               __ublk_complete_rq(req, io, ublk_dev_need_map_io(data->ub));
> +       return 0;
> +}
> +
> +static int ublk_handle_batch_commit_cmd(const struct ublk_batch_io_data *data)
> +{
> +       const struct ublk_batch_io *uc = &data->header;
> +       struct io_uring_cmd *cmd = data->cmd;
> +       struct ublk_batch_io_iter iter = {
> +               .uaddr = u64_to_user_ptr(READ_ONCE(cmd->sqe->addr)),
> +               .total = uc->nr_elem * uc->elem_bytes,
> +               .elem_bytes = uc->elem_bytes,
> +       };
> +       int ret;
> +
> +       ret = ublk_walk_cmd_buf(&iter, data, ublk_batch_commit_io);
> +
> +       return iter.done == 0 ? ret : iter.done;
> +}
> +
>  static int ublk_check_batch_cmd_flags(const struct ublk_batch_io *uc)
>  {
>         unsigned elem_bytes = sizeof(struct ublk_elem_header);
> @@ -2783,7 +2892,7 @@ static int ublk_ch_batch_io_uring_cmd(struct io_uring_cmd *cmd,
>                 ret = ublk_check_batch_cmd(&data);
>                 if (ret)
>                         goto out;
> -               ret = -EOPNOTSUPP;
> +               ret = ublk_handle_batch_commit_cmd(&data);
>                 break;
>         default:
>                 ret = -EOPNOTSUPP;
> diff --git a/include/uapi/linux/ublk_cmd.h b/include/uapi/linux/ublk_cmd.h
> index c96c299057c3..295ec8f34173 100644
> --- a/include/uapi/linux/ublk_cmd.h
> +++ b/include/uapi/linux/ublk_cmd.h
> @@ -109,6 +109,14 @@
>   */
>  #define        UBLK_U_IO_PREP_IO_CMDS  \
>         _IOWR('u', 0x25, struct ublk_batch_io)
> +/*
> + * If a failure code is returned, nothing in the command buffer has been
> + * handled. Otherwise, the return value is the number of bytes of the
> + * command buffer that were actually handled, so the number of handled IOs
> + * is the return value divided by `elem_bytes`. IOs in the remaining bytes
> + * are not committed, and userspace has to check the return value to handle
> + * partial commit correctly.
> + */
>  #define        UBLK_U_IO_COMMIT_IO_CMDS        \
>         _IOWR('u', 0x26, struct ublk_batch_io)
>
> --
> 2.47.0
>
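
The partial-commit contract documented above leaves the bookkeeping to the
server: a negative CQE result means nothing was applied, while a non-negative
result is the number of bytes that were applied. A minimal sketch of the
completion-side handling, assuming the server kept the element buffer geometry
around (the function name is an assumption):

/* illustrative handling of a UBLK_U_IO_COMMIT_IO_CMDS completion */
static void handle_commit_result(int res, unsigned int total_bytes,
                                 unsigned int elem_bytes)
{
        if (res < 0) {
                /* nothing in the command buffer was committed */
                return;
        }

        if ((unsigned int)res < total_bytes) {
                unsigned int done_ios = res / elem_bytes;

                /*
                 * Only done_ios elements were applied; the remaining
                 * total_bytes - res bytes must be re-sent in a new
                 * UBLK_U_IO_COMMIT_IO_CMDS.
                 */
                (void)done_ios;
        }
        /* res == total_bytes: the whole batch was committed */
}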

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH V4 12/27] ublk: add io events fifo structure
  2025-11-21  1:58 ` [PATCH V4 12/27] ublk: add io events fifo structure Ming Lei
@ 2025-11-30 16:53   ` Caleb Sander Mateos
  2025-12-01  3:04     ` Ming Lei
  0 siblings, 1 reply; 66+ messages in thread
From: Caleb Sander Mateos @ 2025-11-30 16:53 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Uday Shankar, Stefani Seibold,
	Andrew Morton, linux-kernel

On Thu, Nov 20, 2025 at 5:59 PM Ming Lei <ming.lei@redhat.com> wrote:
>
> Add ublk io events fifo structure and prepare for supporting command
> batch, which will use io_uring multishot uring_cmd for fetching one
> batch of io commands each time.
>
> One nice feature of kfifo is that it allows multiple producers with a
> single consumer. We only need to lock the producer side, while the
> single consumer can stay lockless.
>
> The producers actually run from ublk_queue_rq() or ublk_queue_rqs(), so
> lock contention can be eased by setting a proper blk-mq nr_queues.
>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  drivers/block/ublk_drv.c | 65 ++++++++++++++++++++++++++++++++++++----
>  1 file changed, 60 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> index ea992366af5b..6ff284243630 100644
> --- a/drivers/block/ublk_drv.c
> +++ b/drivers/block/ublk_drv.c
> @@ -44,6 +44,7 @@
>  #include <linux/task_work.h>
>  #include <linux/namei.h>
>  #include <linux/kref.h>
> +#include <linux/kfifo.h>
>  #include <uapi/linux/ublk_cmd.h>
>
>  #define UBLK_MINORS            (1U << MINORBITS)
> @@ -217,6 +218,22 @@ struct ublk_queue {
>         bool fail_io; /* copy of dev->state == UBLK_S_DEV_FAIL_IO */
>         spinlock_t              cancel_lock;
>         struct ublk_device *dev;
> +
> +       /*
> +        * Inflight ublk request tag is saved in this fifo
> +        *
> +        * There are multiple writer from ublk_queue_rq() or ublk_queue_rqs(),
> +        * so lock is required for storing request tag to fifo
> +        *
> +        * Make sure just one reader for fetching request from task work
> +        * function to ublk server, so no need to grab the lock in reader
> +        * side.

Can you clarify that this is only used for batch mode?

> +        */
> +       struct {
> +               DECLARE_KFIFO_PTR(evts_fifo, unsigned short);
> +               spinlock_t evts_lock;
> +       }____cacheline_aligned_in_smp;
> +
>         struct ublk_io ios[] __counted_by(q_depth);
>  };
>
> @@ -282,6 +299,32 @@ static inline void ublk_io_unlock(struct ublk_io *io)
>         spin_unlock(&io->lock);
>  }
>
> +/* Initialize the queue */

"queue" -> "events queue"? Otherwise, it sounds like it's referring to
struct ublk_queue.

> +static inline int ublk_io_evts_init(struct ublk_queue *q, unsigned int size,
> +                                   int numa_node)
> +{
> +       spin_lock_init(&q->evts_lock);
> +       return kfifo_alloc_node(&q->evts_fifo, size, GFP_KERNEL, numa_node);
> +}
> +
> +/* Check if queue is empty */
> +static inline bool ublk_io_evts_empty(const struct ublk_queue *q)
> +{
> +       return kfifo_is_empty(&q->evts_fifo);
> +}
> +
> +/* Check if queue is full */
> +static inline bool ublk_io_evts_full(const struct ublk_queue *q)

Function is unused?

> +{
> +       return kfifo_is_full(&q->evts_fifo);
> +}
> +
> +static inline void ublk_io_evts_deinit(struct ublk_queue *q)
> +{
> +       WARN_ON_ONCE(!kfifo_is_empty(&q->evts_fifo));
> +       kfifo_free(&q->evts_fifo);
> +}
> +
>  static inline struct ublksrv_io_desc *
>  ublk_get_iod(const struct ublk_queue *ubq, unsigned tag)
>  {
> @@ -3038,6 +3081,9 @@ static void ublk_deinit_queue(struct ublk_device *ub, int q_id)
>         if (ubq->io_cmd_buf)
>                 free_pages((unsigned long)ubq->io_cmd_buf, get_order(size));
>
> +       if (ublk_dev_support_batch_io(ub))
> +               ublk_io_evts_deinit(ubq);
> +
>         kvfree(ubq);
>         ub->queues[q_id] = NULL;
>  }
> @@ -3062,7 +3108,7 @@ static int ublk_init_queue(struct ublk_device *ub, int q_id)
>         struct ublk_queue *ubq;
>         struct page *page;
>         int numa_node;
> -       int size, i;
> +       int size, i, ret = -ENOMEM;
>
>         /* Determine NUMA node based on queue's CPU affinity */
>         numa_node = ublk_get_queue_numa_node(ub, q_id);
> @@ -3081,18 +3127,27 @@ static int ublk_init_queue(struct ublk_device *ub, int q_id)
>
>         /* Allocate I/O command buffer on local NUMA node */
>         page = alloc_pages_node(numa_node, gfp_flags, get_order(size));
> -       if (!page) {
> -               kvfree(ubq);
> -               return -ENOMEM;
> -       }
> +       if (!page)
> +               goto fail_nomem;
>         ubq->io_cmd_buf = page_address(page);
>
>         for (i = 0; i < ubq->q_depth; i++)
>                 spin_lock_init(&ubq->ios[i].lock);
>
> +       if (ublk_dev_support_batch_io(ub)) {
> +               ret = ublk_io_evts_init(ubq, ubq->q_depth, numa_node);
> +               if (ret)
> +                       goto fail;
> +       }
>         ub->queues[q_id] = ubq;
>         ubq->dev = ub;
> +
>         return 0;
> +fail:
> +       ublk_deinit_queue(ub, q_id);

This is a no-op since ub->queues[q_id] hasn't been assigned yet?

Best,
Caleb

> +fail_nomem:
> +       kvfree(ubq);
> +       return ret;
>  }
>
>  static void ublk_deinit_queues(struct ublk_device *ub)
> --
> 2.47.0
>
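
The producer-locked / consumer-lockless split described in the commit message
is the standard kfifo pattern; a minimal standalone sketch (the struct and
function names are illustrative, and plain kfifo_alloc() stands in for the new
kfifo_alloc_node()):

#include <linux/gfp.h>
#include <linux/kfifo.h>
#include <linux/spinlock.h>

struct tag_events {
        DECLARE_KFIFO_PTR(fifo, unsigned short);
        spinlock_t lock;
};

static int tag_events_init(struct tag_events *ev, unsigned int size)
{
        spin_lock_init(&ev->lock);
        return kfifo_alloc(&ev->fifo, size, GFP_KERNEL);
}

/* multiple producers (->queue_rq()/->queue_rqs()) serialize on the lock */
static void tag_events_push(struct tag_events *ev, unsigned short tag)
{
        kfifo_in_spinlocked_noirqsave(&ev->fifo, &tag, 1, &ev->lock);
}

/* the single consumer (task work) may read without taking the lock */
static unsigned int tag_events_pop(struct tag_events *ev,
                                   unsigned short *buf, unsigned int nr)
{
        return kfifo_out(&ev->fifo, buf, nr);
}

This is the same split the blk-mq submission path (producers) and the dispatch
task work (single consumer) rely on in the following patches.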

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH V4 13/27] ublk: add batch I/O dispatch infrastructure
  2025-11-21  1:58 ` [PATCH V4 13/27] ublk: add batch I/O dispatch infrastructure Ming Lei
@ 2025-11-30 19:24   ` Caleb Sander Mateos
  2025-11-30 21:37     ` Caleb Sander Mateos
  2025-12-01  2:32     ` Ming Lei
  0 siblings, 2 replies; 66+ messages in thread
From: Caleb Sander Mateos @ 2025-11-30 19:24 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Uday Shankar, Stefani Seibold,
	Andrew Morton, linux-kernel

On Thu, Nov 20, 2025 at 6:00 PM Ming Lei <ming.lei@redhat.com> wrote:
>
> Add infrastructure for delivering I/O commands to ublk server in batches,
> preparing for the upcoming UBLK_U_IO_FETCH_IO_CMDS feature.
>
> Key components:
>
> - struct ublk_batch_fcmd: Represents a batch fetch uring_cmd that will
>   receive multiple I/O tags in a single operation, using io_uring's
>   multishot command for efficient ublk IO delivery.
>
> - ublk_batch_dispatch(): Batch version of ublk_dispatch_req() that:
>   * Pulls multiple request tags from the events FIFO (lock-free reader)
>   * Prepares each I/O for delivery (including auto buffer registration)
>   * Delivers tags to userspace via single uring_cmd notification
>   * Handles partial failures by restoring undelivered tags to FIFO
>
> The batch approach significantly reduces notification overhead by aggregating
> multiple I/O completions into single uring_cmd, while maintaining the same
> I/O processing semantics as individual operations.
>
> Error handling ensures system consistency: if buffer selection or CQE
> posting fails, undelivered tags are restored to the FIFO for retry, and
> the IO state has to be restored as well.
>
> This runs in task work context, scheduled via io_uring_cmd_complete_in_task()
> or called directly from ->uring_cmd(), enabling efficient batch processing
> without blocking the I/O submission path.
>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  drivers/block/ublk_drv.c | 189 +++++++++++++++++++++++++++++++++++++++
>  1 file changed, 189 insertions(+)
>
> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> index 6ff284243630..cc9c92d97349 100644
> --- a/drivers/block/ublk_drv.c
> +++ b/drivers/block/ublk_drv.c
> @@ -91,6 +91,12 @@
>          UBLK_BATCH_F_HAS_BUF_ADDR | \
>          UBLK_BATCH_F_AUTO_BUF_REG_FALLBACK)
>
> +/* ublk batch fetch uring_cmd */
> +struct ublk_batch_fcmd {

I would prefer "fetch_cmd" instead of "fcmd" for clarity

> +       struct io_uring_cmd *cmd;
> +       unsigned short buf_group;
> +};
> +
>  struct ublk_uring_cmd_pdu {
>         /*
>          * Store requests in same batch temporarily for queuing them to
> @@ -168,6 +174,9 @@ struct ublk_batch_io_data {
>   */
>  #define UBLK_REFCOUNT_INIT (REFCOUNT_MAX / 2)
>
> +/* used for UBLK_F_BATCH_IO only */
> +#define UBLK_BATCH_IO_UNUSED_TAG       ((unsigned short)-1)
> +
>  union ublk_io_buf {
>         __u64   addr;
>         struct ublk_auto_buf_reg auto_reg;
> @@ -616,6 +625,32 @@ static wait_queue_head_t ublk_idr_wq;      /* wait until one idr is freed */
>  static DEFINE_MUTEX(ublk_ctl_mutex);
>
>
> +static void ublk_batch_deinit_fetch_buf(const struct ublk_batch_io_data *data,
> +                                       struct ublk_batch_fcmd *fcmd,
> +                                       int res)
> +{
> +       io_uring_cmd_done(fcmd->cmd, res, data->issue_flags);
> +       fcmd->cmd = NULL;
> +}
> +
> +static int ublk_batch_fetch_post_cqe(struct ublk_batch_fcmd *fcmd,
> +                                    struct io_br_sel *sel,
> +                                    unsigned int issue_flags)
> +{
> +       if (io_uring_mshot_cmd_post_cqe(fcmd->cmd, sel, issue_flags))
> +               return -ENOBUFS;
> +       return 0;
> +}
> +
> +static ssize_t ublk_batch_copy_io_tags(struct ublk_batch_fcmd *fcmd,
> +                                      void __user *buf, const u16 *tag_buf,
> +                                      unsigned int len)
> +{
> +       if (copy_to_user(buf, tag_buf, len))
> +               return -EFAULT;
> +       return len;
> +}
> +
>  #define UBLK_MAX_UBLKS UBLK_MINORS
>
>  /*
> @@ -1378,6 +1413,160 @@ static void ublk_dispatch_req(struct ublk_queue *ubq,
>         }
>  }
>
> +static bool __ublk_batch_prep_dispatch(struct ublk_queue *ubq,
> +                                      const struct ublk_batch_io_data *data,
> +                                      unsigned short tag)
> +{
> +       struct ublk_device *ub = data->ub;
> +       struct ublk_io *io = &ubq->ios[tag];
> +       struct request *req = blk_mq_tag_to_rq(ub->tag_set.tags[ubq->q_id], tag);
> +       enum auto_buf_reg_res res = AUTO_BUF_REG_FALLBACK;
> +       struct io_uring_cmd *cmd = data->cmd;
> +
> +       if (!ublk_start_io(ubq, req, io))

This doesn't look correct for UBLK_F_NEED_GET_DATA. If that's not
supported in batch mode, then it should probably be disallowed when
creating a batch-mode ublk device. The ublk_need_get_data() check in
ublk_batch_commit_io_check() could also be dropped.

> +               return false;
> +
> +       if (ublk_support_auto_buf_reg(ubq) && ublk_rq_has_data(req))
> +               res = __ublk_do_auto_buf_reg(ubq, req, io, cmd,
> +                               data->issue_flags);

__ublk_do_auto_buf_reg() reads io->buf.auto_reg. That seems racy
without holding the io spinlock.

> +
> +       if (res == AUTO_BUF_REG_FAIL)
> +               return false;

Could be moved into the if (ublk_support_auto_buf_reg(ubq) &&
ublk_rq_has_data(req)) statement since it won't be true otherwise?

> +
> +       ublk_io_lock(io);
> +       ublk_prep_auto_buf_reg_io(ubq, req, io, cmd, res);
> +       ublk_io_unlock(io);
> +
> +       return true;
> +}
> +
> +static bool ublk_batch_prep_dispatch(struct ublk_queue *ubq,
> +                                    const struct ublk_batch_io_data *data,
> +                                    unsigned short *tag_buf,
> +                                    unsigned int len)
> +{
> +       bool has_unused = false;
> +       int i;

unsigned?

> +
> +       for (i = 0; i < len; i += 1) {

i++?

> +               unsigned short tag = tag_buf[i];
> +
> +               if (!__ublk_batch_prep_dispatch(ubq, data, tag)) {
> +                       tag_buf[i] = UBLK_BATCH_IO_UNUSED_TAG;
> +                       has_unused = true;
> +               }
> +       }
> +
> +       return has_unused;
> +}
> +
> +/*
> + * Filter out UBLK_BATCH_IO_UNUSED_TAG entries from tag_buf.
> + * Returns the new length after filtering.
> + */
> +static unsigned int ublk_filter_unused_tags(unsigned short *tag_buf,
> +                                           unsigned int len)
> +{
> +       unsigned int i, j;
> +
> +       for (i = 0, j = 0; i < len; i++) {
> +               if (tag_buf[i] != UBLK_BATCH_IO_UNUSED_TAG) {
> +                       if (i != j)
> +                               tag_buf[j] = tag_buf[i];
> +                       j++;
> +               }
> +       }
> +
> +       return j;
> +}
> +
> +#define MAX_NR_TAG 128
> +static int __ublk_batch_dispatch(struct ublk_queue *ubq,
> +                                const struct ublk_batch_io_data *data,
> +                                struct ublk_batch_fcmd *fcmd)
> +{
> +       unsigned short tag_buf[MAX_NR_TAG];
> +       struct io_br_sel sel;
> +       size_t len = 0;
> +       bool needs_filter;
> +       int ret;
> +
> +       sel = io_uring_cmd_buffer_select(fcmd->cmd, fcmd->buf_group, &len,
> +                                        data->issue_flags);
> +       if (sel.val < 0)
> +               return sel.val;
> +       if (!sel.addr)
> +               return -ENOBUFS;
> +
> +       /* single reader needn't lock and sizeof(kfifo element) is 2 bytes */
> +       len = min(len, sizeof(tag_buf)) / 2;

sizeof(unsigned short) instead of 2?

> +       len = kfifo_out(&ubq->evts_fifo, tag_buf, len);
> +
> +       needs_filter = ublk_batch_prep_dispatch(ubq, data, tag_buf, len);
> +       /* Filter out unused tags before posting to userspace */
> +       if (unlikely(needs_filter)) {
> +               int new_len = ublk_filter_unused_tags(tag_buf, len);
> +
> +               if (!new_len)
> +                       return len;

Is the purpose of this return value just to make ublk_batch_dispatch()
retry __ublk_batch_dispatch()? Otherwise, it seems like a strange
value to return.

Also, shouldn't this path release the selected buffer to avoid leaking it?

> +               len = new_len;
> +       }
> +
> +       sel.val = ublk_batch_copy_io_tags(fcmd, sel.addr, tag_buf, len * 2);

sizeof(unsigned short)?

> +       ret = ublk_batch_fetch_post_cqe(fcmd, &sel, data->issue_flags);
> +       if (unlikely(ret < 0)) {
> +               int i, res;
> +
> +               /*
> +                * Undo prep state for all IOs since userspace never received them.
> +                * This restores IOs to pre-prepared state so they can be cleanly
> +                * re-prepared when tags are pulled from FIFO again.
> +                */
> +               for (i = 0; i < len; i++) {
> +                       struct ublk_io *io = &ubq->ios[tag_buf[i]];
> +                       int index = -1;
> +
> +                       ublk_io_lock(io);
> +                       if (io->flags & UBLK_IO_FLAG_AUTO_BUF_REG)
> +                               index = io->buf.auto_reg.index;

This is missing the io->buf_ctx_handle == io_uring_cmd_ctx_handle(cmd)
check from ublk_handle_auto_buf_reg().

> +                       io->flags &= ~(UBLK_IO_FLAG_OWNED_BY_SRV | UBLK_IO_FLAG_AUTO_BUF_REG);
> +                       io->flags |= UBLK_IO_FLAG_ACTIVE;
> +                       ublk_io_unlock(io);
> +
> +                       if (index != -1)
> +                               io_buffer_unregister_bvec(data->cmd, index,
> +                                               data->issue_flags);
> +               }
> +
> +               res = kfifo_in_spinlocked_noirqsave(&ubq->evts_fifo,
> +                       tag_buf, len, &ubq->evts_lock);
> +
> +               pr_warn("%s: copy tags or post CQE failure, move back "
> +                               "tags(%d %zu) ret %d\n", __func__, res, len,
> +                               ret);
> +       }
> +       return ret;
> +}
> +
> +static __maybe_unused int

The return value looks completely unused. Just return void instead?

Best,
Caleb

> +ublk_batch_dispatch(struct ublk_queue *ubq,
> +                   const struct ublk_batch_io_data *data,
> +                   struct ublk_batch_fcmd *fcmd)
> +{
> +       int ret = 0;
> +
> +       while (!ublk_io_evts_empty(ubq)) {
> +               ret = __ublk_batch_dispatch(ubq, data, fcmd);
> +               if (ret <= 0)
> +                       break;
> +       }
> +
> +       if (ret < 0)
> +               ublk_batch_deinit_fetch_buf(data, fcmd, ret);
> +
> +       return ret;
> +}
> +
>  static void ublk_cmd_tw_cb(struct io_uring_cmd *cmd,
>                            unsigned int issue_flags)
>  {
> --
> 2.47.0
>
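
On the server side each batch posted by ublk_batch_dispatch() should show up
as one multishot CQE whose result is the number of tag bytes copied into the
selected provided buffer. A speculative sketch of consuming such a CQE (the
fetch command itself only lands later in the series, and the two lookup/start
helpers are assumptions):

#include <liburing.h>
#include <stdint.h>

/* assumed server-side helpers, not defined here */
uint16_t *lookup_provided_buf(unsigned int bid);
void start_io_for_tag(uint16_t tag);

static void handle_fetch_cqe(const struct io_uring_cqe *cqe)
{
        unsigned int i, nr, bid;
        uint16_t *tags;

        if (cqe->res < 0)
                return;                 /* the fetch command has terminated */

        if (!(cqe->flags & IORING_CQE_F_BUFFER))
                return;                 /* no buffer was selected */

        bid = cqe->flags >> IORING_CQE_BUFFER_SHIFT;
        tags = lookup_provided_buf(bid);
        nr = cqe->res / sizeof(uint16_t);

        for (i = 0; i < nr; i++)
                start_io_for_tag(tags[i]);

        /* IORING_CQE_F_MORE set means the multishot fetch stays armed */
}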

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH V4 10/27] ublk: handle UBLK_U_IO_PREP_IO_CMDS
  2025-11-21  1:58 ` [PATCH V4 10/27] ublk: handle UBLK_U_IO_PREP_IO_CMDS Ming Lei
  2025-11-29 19:47   ` Caleb Sander Mateos
@ 2025-11-30 19:25   ` Caleb Sander Mateos
  1 sibling, 0 replies; 66+ messages in thread
From: Caleb Sander Mateos @ 2025-11-30 19:25 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Uday Shankar, Stefani Seibold,
	Andrew Morton, linux-kernel

On Thu, Nov 20, 2025 at 5:59 PM Ming Lei <ming.lei@redhat.com> wrote:
>
> This commit implements the handling of the UBLK_U_IO_PREP_IO_CMDS command,
> which allows userspace to prepare a batch of I/O requests.
>
> The core of this change is the `ublk_walk_cmd_buf` function, which iterates
> over the elements in the uring_cmd fixed buffer. For each element, it parses
> the I/O details, finds the corresponding `ublk_io` structure, and prepares it
> for future dispatch.
>
> Add per-io lock for protecting concurrent delivery and committing.
>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  drivers/block/ublk_drv.c      | 193 +++++++++++++++++++++++++++++++++-
>  include/uapi/linux/ublk_cmd.h |   5 +
>  2 files changed, 197 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> index 21890947ceec..66c77daae955 100644
> --- a/drivers/block/ublk_drv.c
> +++ b/drivers/block/ublk_drv.c
> @@ -117,6 +117,7 @@ struct ublk_batch_io_data {
>         struct ublk_device *ub;
>         struct io_uring_cmd *cmd;
>         struct ublk_batch_io header;
> +       unsigned int issue_flags;
>  };
>
>  /*
> @@ -201,6 +202,7 @@ struct ublk_io {
>         unsigned task_registered_buffers;
>
>         void *buf_ctx_handle;
> +       spinlock_t lock;
>  } ____cacheline_aligned_in_smp;
>
>  struct ublk_queue {
> @@ -270,6 +272,16 @@ static inline bool ublk_dev_support_batch_io(const struct ublk_device *ub)
>         return false;
>  }
>
> +static inline void ublk_io_lock(struct ublk_io *io)
> +{
> +       spin_lock(&io->lock);
> +}
> +
> +static inline void ublk_io_unlock(struct ublk_io *io)
> +{
> +       spin_unlock(&io->lock);
> +}
> +
>  static inline struct ublksrv_io_desc *
>  ublk_get_iod(const struct ublk_queue *ubq, unsigned tag)
>  {
> @@ -2531,6 +2543,171 @@ static int ublk_ch_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
>         return ublk_ch_uring_cmd_local(cmd, issue_flags);
>  }
>
> +static inline __u64 ublk_batch_buf_addr(const struct ublk_batch_io *uc,
> +                                       const struct ublk_elem_header *elem)
> +{
> +       const void *buf = elem;
> +
> +       if (uc->flags & UBLK_BATCH_F_HAS_BUF_ADDR)
> +               return *(__u64 *)(buf + sizeof(*elem));

Sorry, one more minor suggestion: cast to a const pointer?

Best,
Caleb

> +       return 0;
> +}
> +
> +static struct ublk_auto_buf_reg
> +ublk_batch_auto_buf_reg(const struct ublk_batch_io *uc,
> +                       const struct ublk_elem_header *elem)
> +{
> +       struct ublk_auto_buf_reg reg = {
> +               .index = elem->buf_index,
> +               .flags = (uc->flags & UBLK_BATCH_F_AUTO_BUF_REG_FALLBACK) ?
> +                       UBLK_AUTO_BUF_REG_FALLBACK : 0,
> +       };
> +
> +       return reg;
> +}
> +
> +/*
> + * 48 can hold any type of buffer element(8, 16 and 24 bytes) because
> + * it is the least common multiple(LCM) of 8, 16 and 24
> + */
> +#define UBLK_CMD_BATCH_TMP_BUF_SZ  (48 * 10)
> +struct ublk_batch_io_iter {
> +       void __user *uaddr;
> +       unsigned done, total;
> +       unsigned char elem_bytes;
> +       /* copy to this buffer from user space */
> +       unsigned char buf[UBLK_CMD_BATCH_TMP_BUF_SZ];
> +};
> +
> +static inline int
> +__ublk_walk_cmd_buf(struct ublk_queue *ubq,
> +                   struct ublk_batch_io_iter *iter,
> +                   const struct ublk_batch_io_data *data,
> +                   unsigned bytes,
> +                   int (*cb)(struct ublk_queue *q,
> +                           const struct ublk_batch_io_data *data,
> +                           const struct ublk_elem_header *elem))
> +{
> +       unsigned int i;
> +       int ret = 0;
> +
> +       for (i = 0; i < bytes; i += iter->elem_bytes) {
> +               const struct ublk_elem_header *elem =
> +                       (const struct ublk_elem_header *)&iter->buf[i];
> +
> +               if (unlikely(elem->tag >= data->ub->dev_info.queue_depth)) {
> +                       ret = -EINVAL;
> +                       break;
> +               }
> +
> +               ret = cb(ubq, data, elem);
> +               if (unlikely(ret))
> +                       break;
> +       }
> +
> +       iter->done += i;
> +       return ret;
> +}
> +
> +static int ublk_walk_cmd_buf(struct ublk_batch_io_iter *iter,
> +                            const struct ublk_batch_io_data *data,
> +                            int (*cb)(struct ublk_queue *q,
> +                                    const struct ublk_batch_io_data *data,
> +                                    const struct ublk_elem_header *elem))
> +{
> +       struct ublk_queue *ubq = ublk_get_queue(data->ub, data->header.q_id);
> +       int ret = 0;
> +
> +       while (iter->done < iter->total) {
> +               unsigned int len = min(sizeof(iter->buf), iter->total - iter->done);
> +
> +               if (copy_from_user(iter->buf, iter->uaddr + iter->done, len)) {
> +                       pr_warn("ublk%d: read batch cmd buffer failed\n",
> +                                       data->ub->dev_info.dev_id);
> +                       return -EFAULT;
> +               }
> +
> +               ret = __ublk_walk_cmd_buf(ubq, iter, data, len, cb);
> +               if (ret)
> +                       return ret;
> +       }
> +       return 0;
> +}
> +
> +static int ublk_batch_unprep_io(struct ublk_queue *ubq,
> +                               const struct ublk_batch_io_data *data,
> +                               const struct ublk_elem_header *elem)
> +{
> +       struct ublk_io *io = &ubq->ios[elem->tag];
> +
> +       data->ub->nr_io_ready--;
> +       ublk_io_lock(io);
> +       io->flags = 0;
> +       ublk_io_unlock(io);
> +       return 0;
> +}
> +
> +static void ublk_batch_revert_prep_cmd(struct ublk_batch_io_iter *iter,
> +                                      const struct ublk_batch_io_data *data)
> +{
> +       int ret;
> +
> +       /* Re-process only what we've already processed, starting from beginning */
> +       iter->total = iter->done;
> +       iter->done = 0;
> +
> +       ret = ublk_walk_cmd_buf(iter, data, ublk_batch_unprep_io);
> +       WARN_ON_ONCE(ret);
> +}
> +
> +static int ublk_batch_prep_io(struct ublk_queue *ubq,
> +                             const struct ublk_batch_io_data *data,
> +                             const struct ublk_elem_header *elem)
> +{
> +       struct ublk_io *io = &ubq->ios[elem->tag];
> +       const struct ublk_batch_io *uc = &data->header;
> +       union ublk_io_buf buf = { 0 };
> +       int ret;
> +
> +       if (ublk_dev_support_auto_buf_reg(data->ub))
> +               buf.auto_reg = ublk_batch_auto_buf_reg(uc, elem);
> +       else if (ublk_dev_need_map_io(data->ub)) {
> +               buf.addr = ublk_batch_buf_addr(uc, elem);
> +
> +               ret = ublk_check_fetch_buf(data->ub, buf.addr);
> +               if (ret)
> +                       return ret;
> +       }
> +
> +       ublk_io_lock(io);
> +       ret = __ublk_fetch(data->cmd, data->ub, io);
> +       if (!ret)
> +               io->buf = buf;
> +       ublk_io_unlock(io);
> +
> +       return ret;
> +}
> +
> +static int ublk_handle_batch_prep_cmd(const struct ublk_batch_io_data *data)
> +{
> +       const struct ublk_batch_io *uc = &data->header;
> +       struct io_uring_cmd *cmd = data->cmd;
> +       struct ublk_batch_io_iter iter = {
> +               .uaddr = u64_to_user_ptr(READ_ONCE(cmd->sqe->addr)),
> +               .total = uc->nr_elem * uc->elem_bytes,
> +               .elem_bytes = uc->elem_bytes,
> +       };
> +       int ret;
> +
> +       mutex_lock(&data->ub->mutex);
> +       ret = ublk_walk_cmd_buf(&iter, data, ublk_batch_prep_io);
> +
> +       if (ret && iter.done)
> +               ublk_batch_revert_prep_cmd(&iter, data);
> +       mutex_unlock(&data->ub->mutex);
> +       return ret;
> +}
> +
>  static int ublk_check_batch_cmd_flags(const struct ublk_batch_io *uc)
>  {
>         unsigned elem_bytes = sizeof(struct ublk_elem_header);
> @@ -2587,6 +2764,7 @@ static int ublk_ch_batch_io_uring_cmd(struct io_uring_cmd *cmd,
>                         .nr_elem = READ_ONCE(uc->nr_elem),
>                         .elem_bytes = READ_ONCE(uc->elem_bytes),
>                 },
> +               .issue_flags = issue_flags,
>         };
>         u32 cmd_op = cmd->cmd_op;
>         int ret = -EINVAL;
> @@ -2596,6 +2774,11 @@ static int ublk_ch_batch_io_uring_cmd(struct io_uring_cmd *cmd,
>
>         switch (cmd_op) {
>         case UBLK_U_IO_PREP_IO_CMDS:
> +               ret = ublk_check_batch_cmd(&data);
> +               if (ret)
> +                       goto out;
> +               ret = ublk_handle_batch_prep_cmd(&data);
> +               break;
>         case UBLK_U_IO_COMMIT_IO_CMDS:
>                 ret = ublk_check_batch_cmd(&data);
>                 if (ret)
> @@ -2770,7 +2953,7 @@ static int ublk_init_queue(struct ublk_device *ub, int q_id)
>         struct ublk_queue *ubq;
>         struct page *page;
>         int numa_node;
> -       int size;
> +       int size, i;
>
>         /* Determine NUMA node based on queue's CPU affinity */
>         numa_node = ublk_get_queue_numa_node(ub, q_id);
> @@ -2795,6 +2978,9 @@ static int ublk_init_queue(struct ublk_device *ub, int q_id)
>         }
>         ubq->io_cmd_buf = page_address(page);
>
> +       for (i = 0; i < ubq->q_depth; i++)
> +               spin_lock_init(&ubq->ios[i].lock);
> +
>         ub->queues[q_id] = ubq;
>         ubq->dev = ub;
>         return 0;
> @@ -3021,6 +3207,11 @@ static int ublk_ctrl_start_dev(struct ublk_device *ub,
>                 return -EINVAL;
>
>         mutex_lock(&ub->mutex);
> +       /* device may become not ready in case of F_BATCH */
> +       if (!ublk_dev_ready(ub)) {
> +               ret = -EINVAL;
> +               goto out_unlock;
> +       }
>         if (ub->dev_info.state == UBLK_S_DEV_LIVE ||
>             test_bit(UB_STATE_USED, &ub->state)) {
>                 ret = -EEXIST;
> diff --git a/include/uapi/linux/ublk_cmd.h b/include/uapi/linux/ublk_cmd.h
> index 2ce5a496b622..c96c299057c3 100644
> --- a/include/uapi/linux/ublk_cmd.h
> +++ b/include/uapi/linux/ublk_cmd.h
> @@ -102,6 +102,11 @@
>         _IOWR('u', 0x23, struct ublksrv_io_cmd)
>  #define        UBLK_U_IO_UNREGISTER_IO_BUF     \
>         _IOWR('u', 0x24, struct ublksrv_io_cmd)
> +
> +/*
> + * return 0 if the command is run successfully, otherwise failure code
> + * is returned
> + */
>  #define        UBLK_U_IO_PREP_IO_CMDS  \
>         _IOWR('u', 0x25, struct ublk_batch_io)
>  #define        UBLK_U_IO_COMMIT_IO_CMDS        \
> --
> 2.47.0
>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH V4 13/27] ublk: add batch I/O dispatch infrastructure
  2025-11-30 19:24   ` Caleb Sander Mateos
@ 2025-11-30 21:37     ` Caleb Sander Mateos
  2025-12-01  2:32     ` Ming Lei
  1 sibling, 0 replies; 66+ messages in thread
From: Caleb Sander Mateos @ 2025-11-30 21:37 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Uday Shankar, Stefani Seibold,
	Andrew Morton, linux-kernel

On Sun, Nov 30, 2025 at 11:24 AM Caleb Sander Mateos
<csander@purestorage.com> wrote:
>
> On Thu, Nov 20, 2025 at 6:00 PM Ming Lei <ming.lei@redhat.com> wrote:
> >
> > Add infrastructure for delivering I/O commands to ublk server in batches,
> > preparing for the upcoming UBLK_U_IO_FETCH_IO_CMDS feature.
> >
> > Key components:
> >
> > - struct ublk_batch_fcmd: Represents a batch fetch uring_cmd that will
> >   receive multiple I/O tags in a single operation, using io_uring's
> >   multishot command for efficient ublk IO delivery.
> >
> > - ublk_batch_dispatch(): Batch version of ublk_dispatch_req() that:
> >   * Pulls multiple request tags from the events FIFO (lock-free reader)
> >   * Prepares each I/O for delivery (including auto buffer registration)
> >   * Delivers tags to userspace via single uring_cmd notification
> >   * Handles partial failures by restoring undelivered tags to FIFO
> >
> > The batch approach significantly reduces notification overhead by aggregating
> > multiple I/O completions into single uring_cmd, while maintaining the same
> > I/O processing semantics as individual operations.
> >
> > Error handling ensures system consistency: if buffer selection or CQE
> > posting fails, undelivered tags are restored to the FIFO for retry, and
> > the IO state has to be restored as well.
> >
> > This runs in task work context, scheduled via io_uring_cmd_complete_in_task()
> > or called directly from ->uring_cmd(), enabling efficient batch processing
> > without blocking the I/O submission path.
> >
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >  drivers/block/ublk_drv.c | 189 +++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 189 insertions(+)
> >
> > diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> > index 6ff284243630..cc9c92d97349 100644
> > --- a/drivers/block/ublk_drv.c
> > +++ b/drivers/block/ublk_drv.c
> > @@ -91,6 +91,12 @@
> >          UBLK_BATCH_F_HAS_BUF_ADDR | \
> >          UBLK_BATCH_F_AUTO_BUF_REG_FALLBACK)
> >
> > +/* ublk batch fetch uring_cmd */
> > +struct ublk_batch_fcmd {
>
> I would prefer "fetch_cmd" instead of "fcmd" for clarity
>
> > +       struct io_uring_cmd *cmd;
> > +       unsigned short buf_group;
> > +};
> > +
> >  struct ublk_uring_cmd_pdu {
> >         /*
> >          * Store requests in same batch temporarily for queuing them to
> > @@ -168,6 +174,9 @@ struct ublk_batch_io_data {
> >   */
> >  #define UBLK_REFCOUNT_INIT (REFCOUNT_MAX / 2)
> >
> > +/* used for UBLK_F_BATCH_IO only */
> > +#define UBLK_BATCH_IO_UNUSED_TAG       ((unsigned short)-1)
> > +
> >  union ublk_io_buf {
> >         __u64   addr;
> >         struct ublk_auto_buf_reg auto_reg;
> > @@ -616,6 +625,32 @@ static wait_queue_head_t ublk_idr_wq;      /* wait until one idr is freed */
> >  static DEFINE_MUTEX(ublk_ctl_mutex);
> >
> >
> > +static void ublk_batch_deinit_fetch_buf(const struct ublk_batch_io_data *data,
> > +                                       struct ublk_batch_fcmd *fcmd,
> > +                                       int res)
> > +{
> > +       io_uring_cmd_done(fcmd->cmd, res, data->issue_flags);
> > +       fcmd->cmd = NULL;
> > +}
> > +
> > +static int ublk_batch_fetch_post_cqe(struct ublk_batch_fcmd *fcmd,
> > +                                    struct io_br_sel *sel,
> > +                                    unsigned int issue_flags)
> > +{
> > +       if (io_uring_mshot_cmd_post_cqe(fcmd->cmd, sel, issue_flags))
> > +               return -ENOBUFS;
> > +       return 0;
> > +}
> > +
> > +static ssize_t ublk_batch_copy_io_tags(struct ublk_batch_fcmd *fcmd,
> > +                                      void __user *buf, const u16 *tag_buf,
> > +                                      unsigned int len)
> > +{
> > +       if (copy_to_user(buf, tag_buf, len))
> > +               return -EFAULT;
> > +       return len;
> > +}
> > +
> >  #define UBLK_MAX_UBLKS UBLK_MINORS
> >
> >  /*
> > @@ -1378,6 +1413,160 @@ static void ublk_dispatch_req(struct ublk_queue *ubq,
> >         }
> >  }
> >
> > +static bool __ublk_batch_prep_dispatch(struct ublk_queue *ubq,
> > +                                      const struct ublk_batch_io_data *data,
> > +                                      unsigned short tag)
> > +{
> > +       struct ublk_device *ub = data->ub;
> > +       struct ublk_io *io = &ubq->ios[tag];
> > +       struct request *req = blk_mq_tag_to_rq(ub->tag_set.tags[ubq->q_id], tag);
> > +       enum auto_buf_reg_res res = AUTO_BUF_REG_FALLBACK;
> > +       struct io_uring_cmd *cmd = data->cmd;
> > +
> > +       if (!ublk_start_io(ubq, req, io))
>
> This doesn't look correct for UBLK_F_NEED_GET_DATA. If that's not
> supported in batch mode, then it should probably be disallowed when
> creating a batch-mode ublk device. The ublk_need_get_data() check in
> ublk_batch_commit_io_check() could also be dropped.
>
> > +               return false;
> > +
> > +       if (ublk_support_auto_buf_reg(ubq) && ublk_rq_has_data(req))
> > +               res = __ublk_do_auto_buf_reg(ubq, req, io, cmd,
> > +                               data->issue_flags);
>
> __ublk_do_auto_buf_reg() reads io->buf.auto_reg. That seems racy
> without holding the io spinlock.
>
> > +
> > +       if (res == AUTO_BUF_REG_FAIL)
> > +               return false;
>
> Could be moved into the if (ublk_support_auto_buf_reg(ubq) &&
> ublk_rq_has_data(req)) statement since it won't be true otherwise?
>
> > +
> > +       ublk_io_lock(io);
> > +       ublk_prep_auto_buf_reg_io(ubq, req, io, cmd, res);
> > +       ublk_io_unlock(io);
> > +
> > +       return true;
> > +}
> > +
> > +static bool ublk_batch_prep_dispatch(struct ublk_queue *ubq,
> > +                                    const struct ublk_batch_io_data *data,
> > +                                    unsigned short *tag_buf,
> > +                                    unsigned int len)
> > +{
> > +       bool has_unused = false;
> > +       int i;
>
> unsigned?
>
> > +
> > +       for (i = 0; i < len; i += 1) {
>
> i++?
>
> > +               unsigned short tag = tag_buf[i];
> > +
> > +               if (!__ublk_batch_prep_dispatch(ubq, data, tag)) {
> > +                       tag_buf[i] = UBLK_BATCH_IO_UNUSED_TAG;
> > +                       has_unused = true;
> > +               }
> > +       }
> > +
> > +       return has_unused;
> > +}
> > +
> > +/*
> > + * Filter out UBLK_BATCH_IO_UNUSED_TAG entries from tag_buf.
> > + * Returns the new length after filtering.
> > + */
> > +static unsigned int ublk_filter_unused_tags(unsigned short *tag_buf,
> > +                                           unsigned int len)
> > +{
> > +       unsigned int i, j;
> > +
> > +       for (i = 0, j = 0; i < len; i++) {
> > +               if (tag_buf[i] != UBLK_BATCH_IO_UNUSED_TAG) {
> > +                       if (i != j)
> > +                               tag_buf[j] = tag_buf[i];
> > +                       j++;
> > +               }
> > +       }
> > +
> > +       return j;
> > +}
> > +
> > +#define MAX_NR_TAG 128
> > +static int __ublk_batch_dispatch(struct ublk_queue *ubq,
> > +                                const struct ublk_batch_io_data *data,
> > +                                struct ublk_batch_fcmd *fcmd)
> > +{
> > +       unsigned short tag_buf[MAX_NR_TAG];
> > +       struct io_br_sel sel;
> > +       size_t len = 0;
> > +       bool needs_filter;
> > +       int ret;
> > +
> > +       sel = io_uring_cmd_buffer_select(fcmd->cmd, fcmd->buf_group, &len,
> > +                                        data->issue_flags);
> > +       if (sel.val < 0)
> > +               return sel.val;
> > +       if (!sel.addr)
> > +               return -ENOBUFS;
> > +
> > +       /* single reader needn't lock and sizeof(kfifo element) is 2 bytes */
> > +       len = min(len, sizeof(tag_buf)) / 2;
>
> sizeof(unsigned short) instead of 2?
>
> > +       len = kfifo_out(&ubq->evts_fifo, tag_buf, len);
> > +
> > +       needs_filter = ublk_batch_prep_dispatch(ubq, data, tag_buf, len);
> > +       /* Filter out unused tags before posting to userspace */
> > +       if (unlikely(needs_filter)) {
> > +               int new_len = ublk_filter_unused_tags(tag_buf, len);
> > +
> > +               if (!new_len)
> > +                       return len;
>
> Is the purpose of this return value just to make ublk_batch_dispatch()
> retry __ublk_batch_dispatch()? Otherwise, it seems like a strange
> value to return.
>
> Also, shouldn't this path release the selected buffer to avoid leaking it?
>
> > +               len = new_len;
> > +       }
> > +
> > +       sel.val = ublk_batch_copy_io_tags(fcmd, sel.addr, tag_buf, len * 2);
>
> sizeof(unsigned short)?
>
> > +       ret = ublk_batch_fetch_post_cqe(fcmd, &sel, data->issue_flags);
> > +       if (unlikely(ret < 0)) {
> > +               int i, res;
> > +
> > +               /*
> > +                * Undo prep state for all IOs since userspace never received them.
> > +                * This restores IOs to pre-prepared state so they can be cleanly
> > +                * re-prepared when tags are pulled from FIFO again.
> > +                */
> > +               for (i = 0; i < len; i++) {
> > +                       struct ublk_io *io = &ubq->ios[tag_buf[i]];
> > +                       int index = -1;
> > +
> > +                       ublk_io_lock(io);
> > +                       if (io->flags & UBLK_IO_FLAG_AUTO_BUF_REG)
> > +                               index = io->buf.auto_reg.index;
>
> This is missing the io->buf_ctx_handle == io_uring_cmd_ctx_handle(cmd)
> check from ublk_handle_auto_buf_reg().

Never mind, I guess that's okay because both the register and the
unregister are using data->cmd as the io_uring_cmd.

>
> > +                       io->flags &= ~(UBLK_IO_FLAG_OWNED_BY_SRV | UBLK_IO_FLAG_AUTO_BUF_REG);
> > +                       io->flags |= UBLK_IO_FLAG_ACTIVE;
> > +                       ublk_io_unlock(io);
> > +
> > +                       if (index != -1)
> > +                               io_buffer_unregister_bvec(data->cmd, index,
> > +                                               data->issue_flags);
> > +               }
> > +
> > +               res = kfifo_in_spinlocked_noirqsave(&ubq->evts_fifo,
> > +                       tag_buf, len, &ubq->evts_lock);
> > +
> > +               pr_warn("%s: copy tags or post CQE failure, move back "
> > +                               "tags(%d %zu) ret %d\n", __func__, res, len,
> > +                               ret);
> > +       }
> > +       return ret;
> > +}
> > +
> > +static __maybe_unused int
>
> The return value looks completely unused. Just return void instead?
>
> Best,
> Caleb
>
> > +ublk_batch_dispatch(struct ublk_queue *ubq,
> > +                   const struct ublk_batch_io_data *data,
> > +                   struct ublk_batch_fcmd *fcmd)
> > +{
> > +       int ret = 0;
> > +
> > +       while (!ublk_io_evts_empty(ubq)) {
> > +               ret = __ublk_batch_dispatch(ubq, data, fcmd);
> > +               if (ret <= 0)
> > +                       break;
> > +       }
> > +
> > +       if (ret < 0)
> > +               ublk_batch_deinit_fetch_buf(data, fcmd, ret);
> > +
> > +       return ret;
> > +}
> > +
> >  static void ublk_cmd_tw_cb(struct io_uring_cmd *cmd,
> >                            unsigned int issue_flags)
> >  {
> > --
> > 2.47.0
> >

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH V4 01/27] kfifo: add kfifo_alloc_node() helper for NUMA awareness
  2025-11-29 19:12   ` Caleb Sander Mateos
@ 2025-12-01  1:46     ` Ming Lei
  2025-12-01  5:58       ` Caleb Sander Mateos
  0 siblings, 1 reply; 66+ messages in thread
From: Ming Lei @ 2025-12-01  1:46 UTC (permalink / raw)
  To: Caleb Sander Mateos
  Cc: Jens Axboe, linux-block, Uday Shankar, Stefani Seibold,
	Andrew Morton, linux-kernel

On Sat, Nov 29, 2025 at 11:12:43AM -0800, Caleb Sander Mateos wrote:
> On Thu, Nov 20, 2025 at 5:59 PM Ming Lei <ming.lei@redhat.com> wrote:
> >
> > Add __kfifo_alloc_node() by refactoring and reusing __kfifo_alloc(),
> > and define kfifo_alloc_node() macro to support NUMA-aware memory
> > allocation.
> >
> > The new __kfifo_alloc_node() function accepts a NUMA node parameter
> > and uses kmalloc_array_node() instead of kmalloc_array() for
> > node-specific allocation. The existing __kfifo_alloc() now calls
> > __kfifo_alloc_node() with NUMA_NO_NODE to maintain backward
> > compatibility.
> >
> > This enables users to allocate kfifo buffers on specific NUMA nodes,
> > which is important for performance in NUMA systems where the kfifo
> > will be primarily accessed by threads running on specific nodes.
> >
> > Cc: Stefani Seibold <stefani@seibold.net>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: linux-kernel@vger.kernel.org
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >  include/linux/kfifo.h | 34 ++++++++++++++++++++++++++++++++--
> >  lib/kfifo.c           |  8 ++++----
> >  2 files changed, 36 insertions(+), 6 deletions(-)
> >
> > diff --git a/include/linux/kfifo.h b/include/linux/kfifo.h
> > index fd743d4c4b4b..8b81ac74829c 100644
> > --- a/include/linux/kfifo.h
> > +++ b/include/linux/kfifo.h
> > @@ -369,6 +369,30 @@ __kfifo_int_must_check_helper( \
> >  }) \
> >  )
> >
> > +/**
> > + * kfifo_alloc_node - dynamically allocates a new fifo buffer on a NUMA node
> > + * @fifo: pointer to the fifo
> > + * @size: the number of elements in the fifo, this must be a power of 2
> > + * @gfp_mask: get_free_pages mask, passed to kmalloc()
> > + * @node: NUMA node to allocate memory on
> > + *
> > + * This macro dynamically allocates a new fifo buffer with NUMA node awareness.
> > + *
> > + * The number of elements will be rounded-up to a power of 2.
> > + * The fifo will be release with kfifo_free().
> > + * Return 0 if no error, otherwise an error code.
> > + */
> > +#define kfifo_alloc_node(fifo, size, gfp_mask, node) \
> > +__kfifo_int_must_check_helper( \
> > +({ \
> > +       typeof((fifo) + 1) __tmp = (fifo); \
> > +       struct __kfifo *__kfifo = &__tmp->kfifo; \
> > +       __is_kfifo_ptr(__tmp) ? \
> > +       __kfifo_alloc_node(__kfifo, size, sizeof(*__tmp->type), gfp_mask, node) : \
> > +       -EINVAL; \
> > +}) \
> > +)
> 
> Looks like we could avoid some code duplication by defining
> kfifo_alloc(fifo, size, gfp_mask) as kfifo_alloc_node(fifo, size,
> gfp_mask, NUMA_NO_NODE). Otherwise, this looks good to me.

It is just a single-line inline, so it shouldn't introduce any code
duplication. Switching to kfifo_alloc_node() doesn't actually change the
result of `size vmlinux`.



Thanks,
Ming


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH V4 13/27] ublk: add batch I/O dispatch infrastructure
  2025-11-30 19:24   ` Caleb Sander Mateos
  2025-11-30 21:37     ` Caleb Sander Mateos
@ 2025-12-01  2:32     ` Ming Lei
  2025-12-01 17:37       ` Caleb Sander Mateos
  1 sibling, 1 reply; 66+ messages in thread
From: Ming Lei @ 2025-12-01  2:32 UTC (permalink / raw)
  To: Caleb Sander Mateos
  Cc: Jens Axboe, linux-block, Uday Shankar, Stefani Seibold,
	Andrew Morton, linux-kernel

On Sun, Nov 30, 2025 at 11:24:12AM -0800, Caleb Sander Mateos wrote:
> On Thu, Nov 20, 2025 at 6:00 PM Ming Lei <ming.lei@redhat.com> wrote:
> >
> > Add infrastructure for delivering I/O commands to ublk server in batches,
> > preparing for the upcoming UBLK_U_IO_FETCH_IO_CMDS feature.
> >
> > Key components:
> >
> > - struct ublk_batch_fcmd: Represents a batch fetch uring_cmd that will
> >   receive multiple I/O tags in a single operation, using io_uring's
> >   multishot command for efficient ublk IO delivery.
> >
> > - ublk_batch_dispatch(): Batch version of ublk_dispatch_req() that:
> >   * Pulls multiple request tags from the events FIFO (lock-free reader)
> >   * Prepares each I/O for delivery (including auto buffer registration)
> >   * Delivers tags to userspace via single uring_cmd notification
> >   * Handles partial failures by restoring undelivered tags to FIFO
> >
> > The batch approach significantly reduces notification overhead by aggregating
> > multiple I/O completions into single uring_cmd, while maintaining the same
> > I/O processing semantics as individual operations.
> >
> > Error handling ensures system consistency: if buffer selection or CQE
> > posting fails, undelivered tags are restored to the FIFO for retry,
> > meantime IO state has to be restored.
> >
> > This runs in task work context, scheduled via io_uring_cmd_complete_in_task()
> > or called directly from ->uring_cmd(), enabling efficient batch processing
> > without blocking the I/O submission path.
> >
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >  drivers/block/ublk_drv.c | 189 +++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 189 insertions(+)
> >
> > diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> > index 6ff284243630..cc9c92d97349 100644
> > --- a/drivers/block/ublk_drv.c
> > +++ b/drivers/block/ublk_drv.c
> > @@ -91,6 +91,12 @@
> >          UBLK_BATCH_F_HAS_BUF_ADDR | \
> >          UBLK_BATCH_F_AUTO_BUF_REG_FALLBACK)
> >
> > +/* ublk batch fetch uring_cmd */
> > +struct ublk_batch_fcmd {
> 
> I would prefer "fetch_cmd" instead of "fcmd" for clarity
> 
> > +       struct io_uring_cmd *cmd;
> > +       unsigned short buf_group;
> > +};
> > +
> >  struct ublk_uring_cmd_pdu {
> >         /*
> >          * Store requests in same batch temporarily for queuing them to
> > @@ -168,6 +174,9 @@ struct ublk_batch_io_data {
> >   */
> >  #define UBLK_REFCOUNT_INIT (REFCOUNT_MAX / 2)
> >
> > +/* used for UBLK_F_BATCH_IO only */
> > +#define UBLK_BATCH_IO_UNUSED_TAG       ((unsigned short)-1)
> > +
> >  union ublk_io_buf {
> >         __u64   addr;
> >         struct ublk_auto_buf_reg auto_reg;
> > @@ -616,6 +625,32 @@ static wait_queue_head_t ublk_idr_wq;      /* wait until one idr is freed */
> >  static DEFINE_MUTEX(ublk_ctl_mutex);
> >
> >
> > +static void ublk_batch_deinit_fetch_buf(const struct ublk_batch_io_data *data,
> > +                                       struct ublk_batch_fcmd *fcmd,
> > +                                       int res)
> > +{
> > +       io_uring_cmd_done(fcmd->cmd, res, data->issue_flags);
> > +       fcmd->cmd = NULL;
> > +}
> > +
> > +static int ublk_batch_fetch_post_cqe(struct ublk_batch_fcmd *fcmd,
> > +                                    struct io_br_sel *sel,
> > +                                    unsigned int issue_flags)
> > +{
> > +       if (io_uring_mshot_cmd_post_cqe(fcmd->cmd, sel, issue_flags))
> > +               return -ENOBUFS;
> > +       return 0;
> > +}
> > +
> > +static ssize_t ublk_batch_copy_io_tags(struct ublk_batch_fcmd *fcmd,
> > +                                      void __user *buf, const u16 *tag_buf,
> > +                                      unsigned int len)
> > +{
> > +       if (copy_to_user(buf, tag_buf, len))
> > +               return -EFAULT;
> > +       return len;
> > +}
> > +
> >  #define UBLK_MAX_UBLKS UBLK_MINORS
> >
> >  /*
> > @@ -1378,6 +1413,160 @@ static void ublk_dispatch_req(struct ublk_queue *ubq,
> >         }
> >  }
> >
> > +static bool __ublk_batch_prep_dispatch(struct ublk_queue *ubq,
> > +                                      const struct ublk_batch_io_data *data,
> > +                                      unsigned short tag)
> > +{
> > +       struct ublk_device *ub = data->ub;
> > +       struct ublk_io *io = &ubq->ios[tag];
> > +       struct request *req = blk_mq_tag_to_rq(ub->tag_set.tags[ubq->q_id], tag);
> > +       enum auto_buf_reg_res res = AUTO_BUF_REG_FALLBACK;
> > +       struct io_uring_cmd *cmd = data->cmd;
> > +
> > +       if (!ublk_start_io(ubq, req, io))
> 
> This doesn't look correct for UBLK_F_NEED_GET_DATA. If that's not
> supported in batch mode, then it should probably be disallowed when
> creating a batch-mode ublk device. The ublk_need_get_data() check in
> ublk_batch_commit_io_check() could also be dropped.

OK.

BTW UBLK_F_NEED_GET_DATA isn't necessary any more now that user copy is
supported.

It is only for handling the WRITE io command, and the ublk server can copy
the data into a new buffer via user copy instead.
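
For a WRITE the server can just pread() the payload from the char device at
the per-io user-copy offset, roughly like below (userspace sketch, not from
this series; ublk_user_copy_off() is a made-up helper standing in for the
q_id/tag offset encoding, and needs <unistd.h>/<errno.h>):

	/* handle a WRITE without NEED_GET_DATA: pull the payload via user copy */
	static int handle_write(int cdev_fd, unsigned q_id, unsigned tag,
				const struct ublksrv_io_desc *iod, void *buf)
	{
		/* placeholder helper for the user-copy offset of (q_id, tag) */
		off_t off = ublk_user_copy_off(q_id, tag);
		ssize_t r = pread(cdev_fd, buf, iod->nr_sectors << 9, off);

		return r < 0 ? -errno : 0;
	}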

> 
> > +               return false;
> > +
> > +       if (ublk_support_auto_buf_reg(ubq) && ublk_rq_has_data(req))
> > +               res = __ublk_do_auto_buf_reg(ubq, req, io, cmd,
> > +                               data->issue_flags);
> 
> __ublk_do_auto_buf_reg() reads io->buf.auto_reg. That seems racy
> without holding the io spinlock.

The io lock isn't needed.  Now the io state is guaranteed to be ACTIVE,
so UBLK_U_IO_COMMIT_IO_CMDS can't commit anything for this io.

> 
> > +
> > +       if (res == AUTO_BUF_REG_FAIL)
> > +               return false;
> 
> Could be moved into the if (ublk_support_auto_buf_reg(ubq) &&
> ublk_rq_has_data(req)) statement since it won't be true otherwise?

OK.

> 
> > +
> > +       ublk_io_lock(io);
> > +       ublk_prep_auto_buf_reg_io(ubq, req, io, cmd, res);
> > +       ublk_io_unlock(io);
> > +
> > +       return true;
> > +}
> > +
> > +static bool ublk_batch_prep_dispatch(struct ublk_queue *ubq,
> > +                                    const struct ublk_batch_io_data *data,
> > +                                    unsigned short *tag_buf,
> > +                                    unsigned int len)
> > +{
> > +       bool has_unused = false;
> > +       int i;
> 
> unsigned?
> 
> > +
> > +       for (i = 0; i < len; i += 1) {
> 
> i++?
> 
> > +               unsigned short tag = tag_buf[i];
> > +
> > +               if (!__ublk_batch_prep_dispatch(ubq, data, tag)) {
> > +                       tag_buf[i] = UBLK_BATCH_IO_UNUSED_TAG;
> > +                       has_unused = true;
> > +               }
> > +       }
> > +
> > +       return has_unused;
> > +}
> > +
> > +/*
> > + * Filter out UBLK_BATCH_IO_UNUSED_TAG entries from tag_buf.
> > + * Returns the new length after filtering.
> > + */
> > +static unsigned int ublk_filter_unused_tags(unsigned short *tag_buf,
> > +                                           unsigned int len)
> > +{
> > +       unsigned int i, j;
> > +
> > +       for (i = 0, j = 0; i < len; i++) {
> > +               if (tag_buf[i] != UBLK_BATCH_IO_UNUSED_TAG) {
> > +                       if (i != j)
> > +                               tag_buf[j] = tag_buf[i];
> > +                       j++;
> > +               }
> > +       }
> > +
> > +       return j;
> > +}
> > +
> > +#define MAX_NR_TAG 128
> > +static int __ublk_batch_dispatch(struct ublk_queue *ubq,
> > +                                const struct ublk_batch_io_data *data,
> > +                                struct ublk_batch_fcmd *fcmd)
> > +{
> > +       unsigned short tag_buf[MAX_NR_TAG];
> > +       struct io_br_sel sel;
> > +       size_t len = 0;
> > +       bool needs_filter;
> > +       int ret;
> > +
> > +       sel = io_uring_cmd_buffer_select(fcmd->cmd, fcmd->buf_group, &len,
> > +                                        data->issue_flags);
> > +       if (sel.val < 0)
> > +               return sel.val;
> > +       if (!sel.addr)
> > +               return -ENOBUFS;
> > +
> > +       /* single reader needn't lock and sizeof(kfifo element) is 2 bytes */
> > +       len = min(len, sizeof(tag_buf)) / 2;
> 
> sizeof(unsigned short) instead of 2?

OK

> 
> > +       len = kfifo_out(&ubq->evts_fifo, tag_buf, len);
> > +
> > +       needs_filter = ublk_batch_prep_dispatch(ubq, data, tag_buf, len);
> > +       /* Filter out unused tags before posting to userspace */
> > +       if (unlikely(needs_filter)) {
> > +               int new_len = ublk_filter_unused_tags(tag_buf, len);
> > +
> > +               if (!new_len)
> > +                       return len;
> 
> Is the purpose of this return value just to make ublk_batch_dispatch()
> retry __ublk_batch_dispatch()? Otherwise, it seems like a strange
> value to return.

If `new_len` becomes zero, it means all of these requests have been handled
already, either failed or requeued, so return `len` to tell the caller to
move on. I can add a comment documenting this behavior.

> 
> Also, shouldn't this path release the selected buffer to avoid leaking it?

Good catch, but io_kbuf_recycle() isn't exported, so we may have to call
io_uring_mshot_cmd_post_cqe() with sel->val zeroed.
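
Something like the following on top of this patch (rough, untested sketch,
assuming posting with a zeroed sel->val just recycles the selected buffer
without posting a CQE):

	if (unlikely(needs_filter)) {
		unsigned int new_len = ublk_filter_unused_tags(tag_buf, len);

		if (!new_len) {
			/*
			 * All pulled requests were completed or requeued
			 * locally, nothing to post to userspace: recycle the
			 * selected buffer and return the consumed count so
			 * the caller keeps draining the fifo.
			 */
			sel.val = 0;
			io_uring_mshot_cmd_post_cqe(fcmd->cmd, &sel,
						    data->issue_flags);
			return len;
		}
		len = new_len;
	}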

> 
> > +               len = new_len;
> > +       }
> > +
> > +       sel.val = ublk_batch_copy_io_tags(fcmd, sel.addr, tag_buf, len * 2);
> 
> sizeof(unsigned short)?

OK

> 
> > +       ret = ublk_batch_fetch_post_cqe(fcmd, &sel, data->issue_flags);
> > +       if (unlikely(ret < 0)) {
> > +               int i, res;
> > +
> > +               /*
> > +                * Undo prep state for all IOs since userspace never received them.
> > +                * This restores IOs to pre-prepared state so they can be cleanly
> > +                * re-prepared when tags are pulled from FIFO again.
> > +                */
> > +               for (i = 0; i < len; i++) {
> > +                       struct ublk_io *io = &ubq->ios[tag_buf[i]];
> > +                       int index = -1;
> > +
> > +                       ublk_io_lock(io);
> > +                       if (io->flags & UBLK_IO_FLAG_AUTO_BUF_REG)
> > +                               index = io->buf.auto_reg.index;
> 
> This is missing the io->buf_ctx_handle == io_uring_cmd_ctx_handle(cmd)
> check from ublk_handle_auto_buf_reg().

As you replied, it isn't needed because the same multishot command is used
for registering the bvec buffer.

> 
> > +                       io->flags &= ~(UBLK_IO_FLAG_OWNED_BY_SRV | UBLK_IO_FLAG_AUTO_BUF_REG);
> > +                       io->flags |= UBLK_IO_FLAG_ACTIVE;
> > +                       ublk_io_unlock(io);
> > +
> > +                       if (index != -1)
> > +                               io_buffer_unregister_bvec(data->cmd, index,
> > +                                               data->issue_flags);
> > +               }
> > +
> > +               res = kfifo_in_spinlocked_noirqsave(&ubq->evts_fifo,
> > +                       tag_buf, len, &ubq->evts_lock);
> > +
> > +               pr_warn("%s: copy tags or post CQE failure, move back "
> > +                               "tags(%d %zu) ret %d\n", __func__, res, len,
> > +                               ret);
> > +       }
> > +       return ret;
> > +}
> > +
> > +static __maybe_unused int
> 
> The return value looks completely unused. Just return void instead?

Yes, it looks like it is removed in a following patch.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH V4 12/27] ublk: add io events fifo structure
  2025-11-30 16:53   ` Caleb Sander Mateos
@ 2025-12-01  3:04     ` Ming Lei
  0 siblings, 0 replies; 66+ messages in thread
From: Ming Lei @ 2025-12-01  3:04 UTC (permalink / raw)
  To: Caleb Sander Mateos
  Cc: Jens Axboe, linux-block, Uday Shankar, Stefani Seibold,
	Andrew Morton, linux-kernel

On Sun, Nov 30, 2025 at 08:53:03AM -0800, Caleb Sander Mateos wrote:
> On Thu, Nov 20, 2025 at 5:59 PM Ming Lei <ming.lei@redhat.com> wrote:
> >
> > Add ublk io events fifo structure and prepare for supporting command
> > batch, which will use io_uring multishot uring_cmd for fetching one
> > batch of io commands each time.
> >
> > One nice feature of kfifo is to allow multiple producer vs single
> > consumer. We just need lock the producer side, meantime the single
> > consumer can be lockless.
> >
> > The producer is actually from ublk_queue_rq() or ublk_queue_rqs(), so
> > lock contention can be eased by setting proper blk-mq nr_queues.
> >
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >  drivers/block/ublk_drv.c | 65 ++++++++++++++++++++++++++++++++++++----
> >  1 file changed, 60 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> > index ea992366af5b..6ff284243630 100644
> > --- a/drivers/block/ublk_drv.c
> > +++ b/drivers/block/ublk_drv.c
> > @@ -44,6 +44,7 @@
> >  #include <linux/task_work.h>
> >  #include <linux/namei.h>
> >  #include <linux/kref.h>
> > +#include <linux/kfifo.h>
> >  #include <uapi/linux/ublk_cmd.h>
> >
> >  #define UBLK_MINORS            (1U << MINORBITS)
> > @@ -217,6 +218,22 @@ struct ublk_queue {
> >         bool fail_io; /* copy of dev->state == UBLK_S_DEV_FAIL_IO */
> >         spinlock_t              cancel_lock;
> >         struct ublk_device *dev;
> > +
> > +       /*
> > +        * Inflight ublk request tag is saved in this fifo
> > +        *
> > +        * There are multiple writer from ublk_queue_rq() or ublk_queue_rqs(),
> > +        * so lock is required for storing request tag to fifo
> > +        *
> > +        * Make sure just one reader for fetching request from task work
> > +        * function to ublk server, so no need to grab the lock in reader
> > +        * side.
> 
> Can you clarify that this is only used for batch mode?

Yes.
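
e.g. something like (comment-only change):

	/*
	 * Only used when UBLK_F_BATCH_IO is enabled: inflight request tags
	 * are staged in this fifo before being fetched by the ublk server.
	 */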

> 
> > +        */
> > +       struct {
> > +               DECLARE_KFIFO_PTR(evts_fifo, unsigned short);
> > +               spinlock_t evts_lock;
> > +       }____cacheline_aligned_in_smp;
> > +
> >         struct ublk_io ios[] __counted_by(q_depth);
> >  };
> >
> > @@ -282,6 +299,32 @@ static inline void ublk_io_unlock(struct ublk_io *io)
> >         spin_unlock(&io->lock);
> >  }
> >
> > +/* Initialize the queue */
> 
> "queue" -> "events queue"? Otherwise, it sounds like it's referring to
> struct ublk_queue.

OK.

> 
> > +static inline int ublk_io_evts_init(struct ublk_queue *q, unsigned int size,
> > +                                   int numa_node)
> > +{
> > +       spin_lock_init(&q->evts_lock);
> > +       return kfifo_alloc_node(&q->evts_fifo, size, GFP_KERNEL, numa_node);
> > +}
> > +
> > +/* Check if queue is empty */
> > +static inline bool ublk_io_evts_empty(const struct ublk_queue *q)
> > +{
> > +       return kfifo_is_empty(&q->evts_fifo);
> > +}
> > +
> > +/* Check if queue is full */
> > +static inline bool ublk_io_evts_full(const struct ublk_queue *q)
> 
> Function is unused?

Yes, will remove it.

> 
> > +{
> > +       return kfifo_is_full(&q->evts_fifo);
> > +}
> > +
> > +static inline void ublk_io_evts_deinit(struct ublk_queue *q)
> > +{
> > +       WARN_ON_ONCE(!kfifo_is_empty(&q->evts_fifo));
> > +       kfifo_free(&q->evts_fifo);
> > +}
> > +
> >  static inline struct ublksrv_io_desc *
> >  ublk_get_iod(const struct ublk_queue *ubq, unsigned tag)
> >  {
> > @@ -3038,6 +3081,9 @@ static void ublk_deinit_queue(struct ublk_device *ub, int q_id)
> >         if (ubq->io_cmd_buf)
> >                 free_pages((unsigned long)ubq->io_cmd_buf, get_order(size));
> >
> > +       if (ublk_dev_support_batch_io(ub))
> > +               ublk_io_evts_deinit(ubq);
> > +
> >         kvfree(ubq);
> >         ub->queues[q_id] = NULL;
> >  }
> > @@ -3062,7 +3108,7 @@ static int ublk_init_queue(struct ublk_device *ub, int q_id)
> >         struct ublk_queue *ubq;
> >         struct page *page;
> >         int numa_node;
> > -       int size, i;
> > +       int size, i, ret = -ENOMEM;
> >
> >         /* Determine NUMA node based on queue's CPU affinity */
> >         numa_node = ublk_get_queue_numa_node(ub, q_id);
> > @@ -3081,18 +3127,27 @@ static int ublk_init_queue(struct ublk_device *ub, int q_id)
> >
> >         /* Allocate I/O command buffer on local NUMA node */
> >         page = alloc_pages_node(numa_node, gfp_flags, get_order(size));
> > -       if (!page) {
> > -               kvfree(ubq);
> > -               return -ENOMEM;
> > -       }
> > +       if (!page)
> > +               goto fail_nomem;
> >         ubq->io_cmd_buf = page_address(page);
> >
> >         for (i = 0; i < ubq->q_depth; i++)
> >                 spin_lock_init(&ubq->ios[i].lock);
> >
> > +       if (ublk_dev_support_batch_io(ub)) {
> > +               ret = ublk_io_evts_init(ubq, ubq->q_depth, numa_node);
> > +               if (ret)
> > +                       goto fail;
> > +       }
> >         ub->queues[q_id] = ubq;
> >         ubq->dev = ub;
> > +
> >         return 0;
> > +fail:
> > +       ublk_deinit_queue(ub, q_id);
> 
> This is a no-op since ub->queues[q_id] hasn't been assigned yet?

Good catch, a __ublk_deinit_queue(ub, ubq) helper can be added to handle
this failure path.
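
Something like the following (rough, untested sketch; the cmd buffer size
helper name/signature is assumed to match what ublk_deinit_queue() uses
today):

	/* free everything hanging off @ubq without touching ub->queues[] */
	static void __ublk_deinit_queue(struct ublk_device *ub,
					struct ublk_queue *ubq)
	{
		/* same size computation as in today's ublk_deinit_queue() */
		int size = ublk_queue_cmd_buf_size(ub, ubq->q_id);

		if (ubq->io_cmd_buf)
			free_pages((unsigned long)ubq->io_cmd_buf,
				   get_order(size));
		if (ublk_dev_support_batch_io(ub))
			ublk_io_evts_deinit(ubq);
		kvfree(ubq);
	}

Then ublk_deinit_queue() just calls it and clears ub->queues[q_id], and the
ublk_init_queue() failure path can call __ublk_deinit_queue(ub, ubq) directly
before ub->queues[q_id] is assigned.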


Thanks,
Ming


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH V4 14/27] ublk: add UBLK_U_IO_FETCH_IO_CMDS for batch I/O processing
  2025-11-21  1:58 ` [PATCH V4 14/27] ublk: add UBLK_U_IO_FETCH_IO_CMDS for batch I/O processing Ming Lei
@ 2025-12-01  5:55   ` Caleb Sander Mateos
  2025-12-01  9:41     ` Ming Lei
  0 siblings, 1 reply; 66+ messages in thread
From: Caleb Sander Mateos @ 2025-12-01  5:55 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Uday Shankar, Stefani Seibold,
	Andrew Morton, linux-kernel

On Thu, Nov 20, 2025 at 6:00 PM Ming Lei <ming.lei@redhat.com> wrote:
>
> Add UBLK_U_IO_FETCH_IO_CMDS command to enable efficient batch processing
> of I/O requests. This multishot uring_cmd allows the ublk server to fetch
> multiple I/O commands in a single operation, significantly reducing
> submission overhead compared to individual FETCH_REQ* commands.
>
> Key Design Features:
>
> 1. Multishot Operation: One UBLK_U_IO_FETCH_IO_CMDS can fetch many I/O
>    commands, with the batch size limited by the provided buffer length.
>
> 2. Dynamic Load Balancing: Multiple fetch commands can be submitted
>    simultaneously, but only one is active at any time. This enables
>    efficient load distribution across multiple server task contexts.
>
> 3. Implicit State Management: The implementation uses three key variables
>    to track state:
>    - evts_fifo: Queue of request tags awaiting processing
>    - fcmd_head: List of available fetch commands
>    - active_fcmd: Currently active fetch command (NULL = none active)
>
>    States are derived implicitly:
>    - IDLE: No fetch commands available
>    - READY: Fetch commands available, none active
>    - ACTIVE: One fetch command processing events
>
> 4. Lockless Reader Optimization: The active fetch command can read from
>    evts_fifo without locking (single reader guarantee), while writers
>    (ublk_queue_rq/ublk_queue_rqs) use evts_lock protection. The memory
>    barrier pairing plays key role for the single lockless reader
>    optimization.
>
> Implementation Details:
>
> - ublk_queue_rq() and ublk_queue_rqs() save request tags to evts_fifo
> - __ublk_pick_active_fcmd() selects an available fetch command when
>   events arrive and no command is currently active

What is __ublk_pick_active_fcmd()? I don't see a function with that name.

> - ublk_batch_dispatch() moves tags from evts_fifo to the fetch command's
>   buffer and posts completion via io_uring_mshot_cmd_post_cqe()
> - State transitions are coordinated via evts_lock to maintain consistency
>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  drivers/block/ublk_drv.c      | 412 +++++++++++++++++++++++++++++++---
>  include/uapi/linux/ublk_cmd.h |   7 +
>  2 files changed, 388 insertions(+), 31 deletions(-)
>
> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> index cc9c92d97349..2e5e392c939e 100644
> --- a/drivers/block/ublk_drv.c
> +++ b/drivers/block/ublk_drv.c
> @@ -93,6 +93,7 @@
>
>  /* ublk batch fetch uring_cmd */
>  struct ublk_batch_fcmd {
> +       struct list_head node;
>         struct io_uring_cmd *cmd;
>         unsigned short buf_group;
>  };
> @@ -117,7 +118,10 @@ struct ublk_uring_cmd_pdu {
>          */
>         struct ublk_queue *ubq;
>
> -       u16 tag;
> +       union {
> +               u16 tag;
> +               struct ublk_batch_fcmd *fcmd; /* batch io only */
> +       };
>  };
>
>  struct ublk_batch_io_data {
> @@ -229,18 +233,36 @@ struct ublk_queue {
>         struct ublk_device *dev;
>
>         /*
> -        * Inflight ublk request tag is saved in this fifo
> +        * Batch I/O State Management:
> +        *
> +        * The batch I/O system uses implicit state management based on the
> +        * combination of three key variables below.
> +        *
> +        * - IDLE: list_empty(&fcmd_head) && !active_fcmd
> +        *   No fetch commands available, events queue in evts_fifo
> +        *
> +        * - READY: !list_empty(&fcmd_head) && !active_fcmd
> +        *   Fetch commands available but none processing events
>          *
> -        * There are multiple writer from ublk_queue_rq() or ublk_queue_rqs(),
> -        * so lock is required for storing request tag to fifo
> +        * - ACTIVE: active_fcmd
> +        *   One fetch command actively processing events from evts_fifo
>          *
> -        * Make sure just one reader for fetching request from task work
> -        * function to ublk server, so no need to grab the lock in reader
> -        * side.
> +        * Key Invariants:
> +        * - At most one active_fcmd at any time (single reader)
> +        * - active_fcmd is always from fcmd_head list when non-NULL
> +        * - evts_fifo can be read locklessly by the single active reader
> +        * - All state transitions require evts_lock protection
> +        * - Multiple writers to evts_fifo require lock protection
>          */
>         struct {
>                 DECLARE_KFIFO_PTR(evts_fifo, unsigned short);
>                 spinlock_t evts_lock;
> +
> +               /* List of fetch commands available to process events */
> +               struct list_head fcmd_head;
> +
> +               /* Currently active fetch command (NULL = none active) */
> +               struct ublk_batch_fcmd  *active_fcmd;
>         }____cacheline_aligned_in_smp;
>
>         struct ublk_io ios[] __counted_by(q_depth);
> @@ -292,12 +314,20 @@ static void ublk_abort_queue(struct ublk_device *ub, struct ublk_queue *ubq);
>  static inline struct request *__ublk_check_and_get_req(struct ublk_device *ub,
>                 u16 q_id, u16 tag, struct ublk_io *io, size_t offset);
>  static inline unsigned int ublk_req_build_flags(struct request *req);
> +static void ublk_batch_dispatch(struct ublk_queue *ubq,
> +                               struct ublk_batch_io_data *data,
> +                               struct ublk_batch_fcmd *fcmd);
>
>  static inline bool ublk_dev_support_batch_io(const struct ublk_device *ub)
>  {
>         return false;
>  }
>
> +static inline bool ublk_support_batch_io(const struct ublk_queue *ubq)
> +{
> +       return false;
> +}
> +
>  static inline void ublk_io_lock(struct ublk_io *io)
>  {
>         spin_lock(&io->lock);
> @@ -624,13 +654,45 @@ static wait_queue_head_t ublk_idr_wq;     /* wait until one idr is freed */
>
>  static DEFINE_MUTEX(ublk_ctl_mutex);
>
> +static struct ublk_batch_fcmd *
> +ublk_batch_alloc_fcmd(struct io_uring_cmd *cmd)
> +{
> +       struct ublk_batch_fcmd *fcmd = kzalloc(sizeof(*fcmd), GFP_NOIO);

An allocation in the I/O path seems unfortunate. Is there not room to
store the struct ublk_batch_fcmd in the io_uring_cmd pdu?
> +
> +       if (fcmd) {
> +               fcmd->cmd = cmd;
> +               fcmd->buf_group = READ_ONCE(cmd->sqe->buf_index);

Is it necessary to sample and store this here just to pass it back to the
io_uring layer? Wouldn't the io_uring layer already have access to it
in struct io_kiocb's buf_index field?

> +       }
> +       return fcmd;
> +}
> +
> +static void ublk_batch_free_fcmd(struct ublk_batch_fcmd *fcmd)
> +{
> +       kfree(fcmd);
> +}
> +
> +static void __ublk_release_fcmd(struct ublk_queue *ubq)
> +{
> +       WRITE_ONCE(ubq->active_fcmd, NULL);
> +}
>
> -static void ublk_batch_deinit_fetch_buf(const struct ublk_batch_io_data *data,
> +/*
> + * Nothing can move on, so clear ->active_fcmd, and the caller should stop
> + * dispatching
> + */
> +static void ublk_batch_deinit_fetch_buf(struct ublk_queue *ubq,
> +                                       const struct ublk_batch_io_data *data,
>                                         struct ublk_batch_fcmd *fcmd,
>                                         int res)
>  {
> +       spin_lock(&ubq->evts_lock);
> +       list_del(&fcmd->node);
> +       WARN_ON_ONCE(fcmd != ubq->active_fcmd);
> +       __ublk_release_fcmd(ubq);
> +       spin_unlock(&ubq->evts_lock);
> +
>         io_uring_cmd_done(fcmd->cmd, res, data->issue_flags);
> -       fcmd->cmd = NULL;
> +       ublk_batch_free_fcmd(fcmd);
>  }
>
>  static int ublk_batch_fetch_post_cqe(struct ublk_batch_fcmd *fcmd,
> @@ -1491,6 +1553,8 @@ static int __ublk_batch_dispatch(struct ublk_queue *ubq,
>         bool needs_filter;
>         int ret;
>
> +       WARN_ON_ONCE(data->cmd != fcmd->cmd);
> +
>         sel = io_uring_cmd_buffer_select(fcmd->cmd, fcmd->buf_group, &len,
>                                          data->issue_flags);
>         if (sel.val < 0)
> @@ -1548,23 +1612,94 @@ static int __ublk_batch_dispatch(struct ublk_queue *ubq,
>         return ret;
>  }
>
> -static __maybe_unused int
> -ublk_batch_dispatch(struct ublk_queue *ubq,
> -                   const struct ublk_batch_io_data *data,
> -                   struct ublk_batch_fcmd *fcmd)
> +static struct ublk_batch_fcmd *__ublk_acquire_fcmd(
> +               struct ublk_queue *ubq)
> +{
> +       struct ublk_batch_fcmd *fcmd;
> +
> +       lockdep_assert_held(&ubq->evts_lock);
> +
> +       /*
> +        * Ordering updating ubq->evts_fifo and checking ubq->active_fcmd.
> +        *
> +        * The pair is the smp_mb() in ublk_batch_dispatch().
> +        *
> +        * If ubq->active_fcmd is observed as non-NULL, the newly added tags
> +        * can be visible in ublk_batch_dispatch() with the barrier pairing.
> +        */
> +       smp_mb();
> +       if (READ_ONCE(ubq->active_fcmd)) {
> +               fcmd = NULL;
> +       } else {
> +               fcmd = list_first_entry_or_null(&ubq->fcmd_head,
> +                               struct ublk_batch_fcmd, node);
> +               WRITE_ONCE(ubq->active_fcmd, fcmd);
> +       }
> +       return fcmd;
> +}
> +
> +static void ublk_batch_tw_cb(struct io_uring_cmd *cmd,
> +                          unsigned int issue_flags)
> +{
> +       struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
> +       struct ublk_batch_fcmd *fcmd = pdu->fcmd;
> +       struct ublk_batch_io_data data = {
> +               .ub = pdu->ubq->dev,
> +               .cmd = fcmd->cmd,
> +               .issue_flags = issue_flags,
> +       };
> +
> +       WARN_ON_ONCE(pdu->ubq->active_fcmd != fcmd);
> +
> +       ublk_batch_dispatch(pdu->ubq, &data, fcmd);
> +}
> +
> +static void ublk_batch_dispatch(struct ublk_queue *ubq,
> +                               struct ublk_batch_io_data *data,
> +                               struct ublk_batch_fcmd *fcmd)
>  {
> +       struct ublk_batch_fcmd *new_fcmd;

Is the new_fcmd variable necessary? Can fcmd be reused instead?

> +       void *handle;
> +       bool empty;
>         int ret = 0;
>
> +again:
>         while (!ublk_io_evts_empty(ubq)) {
>                 ret = __ublk_batch_dispatch(ubq, data, fcmd);
>                 if (ret <= 0)
>                         break;
>         }
>
> -       if (ret < 0)
> -               ublk_batch_deinit_fetch_buf(data, fcmd, ret);
> +       if (ret < 0) {
> +               ublk_batch_deinit_fetch_buf(ubq, data, fcmd, ret);
> +               return;
> +       }
>
> -       return ret;
> +       handle = io_uring_cmd_ctx_handle(fcmd->cmd);
> +       __ublk_release_fcmd(ubq);
> +       /*
> +        * Order clearing ubq->active_fcmd from __ublk_release_fcmd() and
> +        * checking ubq->evts_fifo.
> +        *
> +        * The pair is the smp_mb() in __ublk_acquire_fcmd().
> +        */
> +       smp_mb();
> +       empty = ublk_io_evts_empty(ubq);
> +       if (likely(empty))

nit: empty variable seems unnecessary

> +               return;
> +
> +       spin_lock(&ubq->evts_lock);
> +       new_fcmd = __ublk_acquire_fcmd(ubq);
> +       spin_unlock(&ubq->evts_lock);
> +
> +       if (!new_fcmd)
> +               return;
> +       if (handle == io_uring_cmd_ctx_handle(new_fcmd->cmd)) {

This check seems to be meant to decide whether the new and old
UBLK_U_IO_FETCH_IO_CMDS commands can execute in the same task work?
But belonging to the same io_uring context doesn't necessarily mean
that the same task issued them. It seems like it would be safer to
always dispatch new_fcmd->cmd to task work.
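i.e. simply (sketch):

	if (!new_fcmd)
		return;
	io_uring_cmd_complete_in_task(new_fcmd->cmd, ublk_batch_tw_cb);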

> +               data->cmd = new_fcmd->cmd;
> +               fcmd = new_fcmd;
> +               goto again;
> +       }
> +       io_uring_cmd_complete_in_task(new_fcmd->cmd, ublk_batch_tw_cb);
>  }
>
>  static void ublk_cmd_tw_cb(struct io_uring_cmd *cmd,
> @@ -1576,13 +1711,27 @@ static void ublk_cmd_tw_cb(struct io_uring_cmd *cmd,
>         ublk_dispatch_req(ubq, pdu->req, issue_flags);
>  }
>
> -static void ublk_queue_cmd(struct ublk_queue *ubq, struct request *rq)
> +static void ublk_queue_cmd(struct ublk_queue *ubq, struct request *rq, bool last)
>  {
> -       struct io_uring_cmd *cmd = ubq->ios[rq->tag].cmd;
> -       struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
> +       if (ublk_support_batch_io(ubq)) {
> +               unsigned short tag = rq->tag;
> +               struct ublk_batch_fcmd *fcmd = NULL;
>
> -       pdu->req = rq;
> -       io_uring_cmd_complete_in_task(cmd, ublk_cmd_tw_cb);
> +               spin_lock(&ubq->evts_lock);
> +               kfifo_put(&ubq->evts_fifo, tag);
> +               if (last)
> +                       fcmd = __ublk_acquire_fcmd(ubq);
> +               spin_unlock(&ubq->evts_lock);
> +
> +               if (fcmd)
> +                       io_uring_cmd_complete_in_task(fcmd->cmd, ublk_batch_tw_cb);
> +       } else {
> +               struct io_uring_cmd *cmd = ubq->ios[rq->tag].cmd;
> +               struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
> +
> +               pdu->req = rq;
> +               io_uring_cmd_complete_in_task(cmd, ublk_cmd_tw_cb);
> +       }
>  }
>
>  static void ublk_cmd_list_tw_cb(struct io_uring_cmd *cmd,
> @@ -1600,14 +1749,44 @@ static void ublk_cmd_list_tw_cb(struct io_uring_cmd *cmd,
>         } while (rq);
>  }
>
> -static void ublk_queue_cmd_list(struct ublk_io *io, struct rq_list *l)
> +static void ublk_batch_queue_cmd_list(struct ublk_queue *ubq, struct rq_list *l)
>  {
> -       struct io_uring_cmd *cmd = io->cmd;
> -       struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
> +       unsigned short tags[MAX_NR_TAG];
> +       struct ublk_batch_fcmd *fcmd;
> +       struct request *rq;
> +       unsigned cnt = 0;
> +
> +       spin_lock(&ubq->evts_lock);
> +       rq_list_for_each(l, rq) {
> +               tags[cnt++] = (unsigned short)rq->tag;
> +               if (cnt >= MAX_NR_TAG) {
> +                       kfifo_in(&ubq->evts_fifo, tags, cnt);
> +                       cnt = 0;
> +               }
> +       }
> +       if (cnt)
> +               kfifo_in(&ubq->evts_fifo, tags, cnt);
> +       fcmd = __ublk_acquire_fcmd(ubq);
> +       spin_unlock(&ubq->evts_lock);
>
> -       pdu->req_list = rq_list_peek(l);
>         rq_list_init(l);
> -       io_uring_cmd_complete_in_task(cmd, ublk_cmd_list_tw_cb);
> +       if (fcmd)
> +               io_uring_cmd_complete_in_task(fcmd->cmd, ublk_batch_tw_cb);
> +}
> +
> +static void ublk_queue_cmd_list(struct ublk_queue *ubq, struct ublk_io *io,
> +                               struct rq_list *l, bool batch)
> +{
> +       if (batch) {
> +               ublk_batch_queue_cmd_list(ubq, l);
> +       } else {
> +               struct io_uring_cmd *cmd = io->cmd;
> +               struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
> +
> +               pdu->req_list = rq_list_peek(l);
> +               rq_list_init(l);
> +               io_uring_cmd_complete_in_task(cmd, ublk_cmd_list_tw_cb);
> +       }
>  }
>
>  static enum blk_eh_timer_return ublk_timeout(struct request *rq)
> @@ -1686,7 +1865,7 @@ static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx,
>                 return BLK_STS_OK;
>         }
>
> -       ublk_queue_cmd(ubq, rq);
> +       ublk_queue_cmd(ubq, rq, bd->last);
>         return BLK_STS_OK;
>  }
>
> @@ -1698,11 +1877,25 @@ static inline bool ublk_belong_to_same_batch(const struct ublk_io *io,
>                 (io->task == io2->task);
>  }
>
> -static void ublk_queue_rqs(struct rq_list *rqlist)
> +static void ublk_commit_rqs(struct blk_mq_hw_ctx *hctx)
> +{
> +       struct ublk_queue *ubq = hctx->driver_data;
> +       struct ublk_batch_fcmd *fcmd;
> +
> +       spin_lock(&ubq->evts_lock);
> +       fcmd = __ublk_acquire_fcmd(ubq);
> +       spin_unlock(&ubq->evts_lock);
> +
> +       if (fcmd)
> +               io_uring_cmd_complete_in_task(fcmd->cmd, ublk_batch_tw_cb);
> +}
> +
> +static void __ublk_queue_rqs(struct rq_list *rqlist, bool batch)
>  {
>         struct rq_list requeue_list = { };
>         struct rq_list submit_list = { };
>         struct ublk_io *io = NULL;
> +       struct ublk_queue *ubq = NULL;
>         struct request *req;
>
>         while ((req = rq_list_pop(rqlist))) {
> @@ -1716,16 +1909,27 @@ static void ublk_queue_rqs(struct rq_list *rqlist)
>
>                 if (io && !ublk_belong_to_same_batch(io, this_io) &&
>                                 !rq_list_empty(&submit_list))
> -                       ublk_queue_cmd_list(io, &submit_list);
> +                       ublk_queue_cmd_list(ubq, io, &submit_list, batch);

This seems to assume that all the requests belong to the same
ublk_queue, which isn't required

>                 io = this_io;
> +               ubq = this_q;
>                 rq_list_add_tail(&submit_list, req);
>         }
>
>         if (!rq_list_empty(&submit_list))
> -               ublk_queue_cmd_list(io, &submit_list);
> +               ublk_queue_cmd_list(ubq, io, &submit_list, batch);

Same here

>         *rqlist = requeue_list;
>  }
>
> +static void ublk_queue_rqs(struct rq_list *rqlist)
> +{
> +       __ublk_queue_rqs(rqlist, false);
> +}
> +
> +static void ublk_batch_queue_rqs(struct rq_list *rqlist)
> +{
> +       __ublk_queue_rqs(rqlist, true);
> +}
> +
>  static int ublk_init_hctx(struct blk_mq_hw_ctx *hctx, void *driver_data,
>                 unsigned int hctx_idx)
>  {
> @@ -1743,6 +1947,14 @@ static const struct blk_mq_ops ublk_mq_ops = {
>         .timeout        = ublk_timeout,
>  };
>
> +static const struct blk_mq_ops ublk_batch_mq_ops = {
> +       .commit_rqs     = ublk_commit_rqs,
> +       .queue_rq       = ublk_queue_rq,
> +       .queue_rqs      = ublk_batch_queue_rqs,
> +       .init_hctx      = ublk_init_hctx,
> +       .timeout        = ublk_timeout,
> +};
> +
>  static void ublk_queue_reinit(struct ublk_device *ub, struct ublk_queue *ubq)
>  {
>         int i;
> @@ -2120,6 +2332,56 @@ static void ublk_cancel_cmd(struct ublk_queue *ubq, unsigned tag,
>                 io_uring_cmd_done(io->cmd, UBLK_IO_RES_ABORT, issue_flags);
>  }
>
> +static void ublk_batch_cancel_cmd(struct ublk_queue *ubq,
> +                                 struct ublk_batch_fcmd *fcmd,
> +                                 unsigned int issue_flags)
> +{
> +       bool done;
> +
> +       spin_lock(&ubq->evts_lock);
> +       done = (ubq->active_fcmd != fcmd);

Needs to use READ_ONCE() since __ublk_release_fcmd() can be called
without holding evts_lock?
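i.e. (sketch):

	done = (READ_ONCE(ubq->active_fcmd) != fcmd);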

> +       if (done)
> +               list_del(&fcmd->node);
> +       spin_unlock(&ubq->evts_lock);
> +
> +       if (done) {
> +               io_uring_cmd_done(fcmd->cmd, UBLK_IO_RES_ABORT, issue_flags);
> +               ublk_batch_free_fcmd(fcmd);
> +       }
> +}
> +
> +static void ublk_batch_cancel_queue(struct ublk_queue *ubq)
> +{
> +       LIST_HEAD(fcmd_list);
> +
> +       spin_lock(&ubq->evts_lock);
> +       ubq->force_abort = true;
> +       list_splice_init(&ubq->fcmd_head, &fcmd_list);
> +       if (ubq->active_fcmd)
> +               list_move(&ubq->active_fcmd->node, &ubq->fcmd_head);

Similarly, needs READ_ONCE()?

> +       spin_unlock(&ubq->evts_lock);
> +
> +       while (!list_empty(&fcmd_list)) {
> +               struct ublk_batch_fcmd *fcmd = list_first_entry(&fcmd_list,
> +                               struct ublk_batch_fcmd, node);
> +
> +               ublk_batch_cancel_cmd(ubq, fcmd, IO_URING_F_UNLOCKED);
> +       }
> +}
> +
> +static void ublk_batch_cancel_fn(struct io_uring_cmd *cmd,
> +                                unsigned int issue_flags)
> +{
> +       struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
> +       struct ublk_batch_fcmd *fcmd = pdu->fcmd;
> +       struct ublk_queue *ubq = pdu->ubq;
> +
> +       if (!ubq->canceling)

Is it not racy to access ubq->canceling without any lock held?

> +               ublk_start_cancel(ubq->dev);
> +
> +       ublk_batch_cancel_cmd(ubq, fcmd, issue_flags);
> +}
> +
>  /*
>   * The ublk char device won't be closed when calling cancel fn, so both
>   * ublk device and queue are guaranteed to be live
> @@ -2171,6 +2433,11 @@ static void ublk_cancel_queue(struct ublk_queue *ubq)
>  {
>         int i;
>
> +       if (ublk_support_batch_io(ubq)) {
> +               ublk_batch_cancel_queue(ubq);
> +               return;
> +       }
> +
>         for (i = 0; i < ubq->q_depth; i++)
>                 ublk_cancel_cmd(ubq, i, IO_URING_F_UNLOCKED);
>  }
> @@ -3091,6 +3358,74 @@ static int ublk_check_batch_cmd(const struct ublk_batch_io_data *data)
>         return ublk_check_batch_cmd_flags(uc);
>  }
>
> +static int ublk_batch_attach(struct ublk_queue *ubq,
> +                            struct ublk_batch_io_data *data,
> +                            struct ublk_batch_fcmd *fcmd)
> +{
> +       struct ublk_batch_fcmd *new_fcmd = NULL;
> +       bool free = false;
> +
> +       spin_lock(&ubq->evts_lock);
> +       if (unlikely(ubq->force_abort || ubq->canceling)) {
> +               free = true;
> +       } else {
> +               list_add_tail(&fcmd->node, &ubq->fcmd_head);
> +               new_fcmd = __ublk_acquire_fcmd(ubq);
> +       }
> +       spin_unlock(&ubq->evts_lock);
> +
> +       /*
> +        * If the two fetch commands are originated from same io_ring_ctx,
> +        * run batch dispatch directly. Otherwise, schedule task work for
> +        * doing it.
> +        */
> +       if (new_fcmd && io_uring_cmd_ctx_handle(new_fcmd->cmd) ==
> +                       io_uring_cmd_ctx_handle(fcmd->cmd)) {
> +               data->cmd = new_fcmd->cmd;
> +               ublk_batch_dispatch(ubq, data, new_fcmd);
> +       } else if (new_fcmd) {
> +               io_uring_cmd_complete_in_task(new_fcmd->cmd,
> +                               ublk_batch_tw_cb);
> +       }

Return early if (!new_fcmd) to reduce indentation?

> +
> +       if (free) {
> +               ublk_batch_free_fcmd(fcmd);
> +               return -ENODEV;
> +       }

Move the if (free) check directly after spin_unlock(&ubq->evts_lock)?
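
i.e. something like (untested sketch, keeping the same semantics):

	spin_lock(&ubq->evts_lock);
	if (unlikely(ubq->force_abort || ubq->canceling)) {
		spin_unlock(&ubq->evts_lock);
		ublk_batch_free_fcmd(fcmd);
		return -ENODEV;
	}
	list_add_tail(&fcmd->node, &ubq->fcmd_head);
	new_fcmd = __ublk_acquire_fcmd(ubq);
	spin_unlock(&ubq->evts_lock);

	if (!new_fcmd)
		return -EIOCBQUEUED;

	/*
	 * If the two fetch commands originate from the same io_ring_ctx,
	 * run batch dispatch directly; otherwise schedule task work.
	 */
	if (io_uring_cmd_ctx_handle(new_fcmd->cmd) ==
	    io_uring_cmd_ctx_handle(fcmd->cmd)) {
		data->cmd = new_fcmd->cmd;
		ublk_batch_dispatch(ubq, data, new_fcmd);
	} else {
		io_uring_cmd_complete_in_task(new_fcmd->cmd, ublk_batch_tw_cb);
	}
	return -EIOCBQUEUED;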

> +       return -EIOCBQUEUED;

> +}
> +
> +static int ublk_handle_batch_fetch_cmd(struct ublk_batch_io_data *data)
> +{
> +       struct ublk_queue *ubq = ublk_get_queue(data->ub, data->header.q_id);
> +       struct ublk_batch_fcmd *fcmd = ublk_batch_alloc_fcmd(data->cmd);
> +       struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(data->cmd);
> +
> +       if (!fcmd)
> +               return -ENOMEM;
> +
> +       pdu->ubq = ubq;
> +       pdu->fcmd = fcmd;
> +       io_uring_cmd_mark_cancelable(data->cmd, data->issue_flags);
> +
> +       return ublk_batch_attach(ubq, data, fcmd);
> +}
> +
> +static int ublk_validate_batch_fetch_cmd(struct ublk_batch_io_data *data,
> +                                        const struct ublk_batch_io *uc)
> +{
> +       if (!(data->cmd->flags & IORING_URING_CMD_MULTISHOT))
> +               return -EINVAL;
> +
> +       if (uc->elem_bytes != sizeof(__u16))
> +               return -EINVAL;
> +
> +       if (uc->flags != 0)
> +               return -E2BIG;
> +
> +       return 0;
> +}
> +
>  static int ublk_ch_batch_io_uring_cmd(struct io_uring_cmd *cmd,
>                                        unsigned int issue_flags)
>  {
> @@ -3113,6 +3448,11 @@ static int ublk_ch_batch_io_uring_cmd(struct io_uring_cmd *cmd,
>         if (data.header.q_id >= ub->dev_info.nr_hw_queues)
>                 goto out;
>
> +       if (unlikely(issue_flags & IO_URING_F_CANCEL)) {
> +               ublk_batch_cancel_fn(cmd, issue_flags);
> +               return 0;
> +       }

Move this to the top of the function before the other logic that's not
necessary in the cancel case?

Best,
Caleb

> +
>         switch (cmd_op) {
>         case UBLK_U_IO_PREP_IO_CMDS:
>                 ret = ublk_check_batch_cmd(&data);
> @@ -3126,6 +3466,12 @@ static int ublk_ch_batch_io_uring_cmd(struct io_uring_cmd *cmd,
>                         goto out;
>                 ret = ublk_handle_batch_commit_cmd(&data);
>                 break;
> +       case UBLK_U_IO_FETCH_IO_CMDS:
> +               ret = ublk_validate_batch_fetch_cmd(&data, uc);
> +               if (ret)
> +                       goto out;
> +               ret = ublk_handle_batch_fetch_cmd(&data);
> +               break;
>         default:
>                 ret = -EOPNOTSUPP;
>         }
> @@ -3327,6 +3673,7 @@ static int ublk_init_queue(struct ublk_device *ub, int q_id)
>                 ret = ublk_io_evts_init(ubq, ubq->q_depth, numa_node);
>                 if (ret)
>                         goto fail;
> +               INIT_LIST_HEAD(&ubq->fcmd_head);
>         }
>         ub->queues[q_id] = ubq;
>         ubq->dev = ub;
> @@ -3451,7 +3798,10 @@ static void ublk_align_max_io_size(struct ublk_device *ub)
>
>  static int ublk_add_tag_set(struct ublk_device *ub)
>  {
> -       ub->tag_set.ops = &ublk_mq_ops;
> +       if (ublk_dev_support_batch_io(ub))
> +               ub->tag_set.ops = &ublk_batch_mq_ops;
> +       else
> +               ub->tag_set.ops = &ublk_mq_ops;
>         ub->tag_set.nr_hw_queues = ub->dev_info.nr_hw_queues;
>         ub->tag_set.queue_depth = ub->dev_info.queue_depth;
>         ub->tag_set.numa_node = NUMA_NO_NODE;
> diff --git a/include/uapi/linux/ublk_cmd.h b/include/uapi/linux/ublk_cmd.h
> index 295ec8f34173..cd894c1d188e 100644
> --- a/include/uapi/linux/ublk_cmd.h
> +++ b/include/uapi/linux/ublk_cmd.h
> @@ -120,6 +120,13 @@
>  #define        UBLK_U_IO_COMMIT_IO_CMDS        \
>         _IOWR('u', 0x26, struct ublk_batch_io)
>
> +/*
> + * Fetch io commands to provided buffer in multishot style,
> + * `IORING_URING_CMD_MULTISHOT` is required for this command.
> + */
> +#define        UBLK_U_IO_FETCH_IO_CMDS         \
> +       _IOWR('u', 0x27, struct ublk_batch_io)
> +
>  /* only ABORT means that no re-fetch */
>  #define UBLK_IO_RES_OK                 0
>  #define UBLK_IO_RES_NEED_GET_DATA      1
> --
> 2.47.0
>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH V4 01/27] kfifo: add kfifo_alloc_node() helper for NUMA awareness
  2025-12-01  1:46     ` Ming Lei
@ 2025-12-01  5:58       ` Caleb Sander Mateos
  0 siblings, 0 replies; 66+ messages in thread
From: Caleb Sander Mateos @ 2025-12-01  5:58 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Uday Shankar, Stefani Seibold,
	Andrew Morton, linux-kernel

On Sun, Nov 30, 2025 at 5:46 PM Ming Lei <ming.lei@redhat.com> wrote:
>
> On Sat, Nov 29, 2025 at 11:12:43AM -0800, Caleb Sander Mateos wrote:
> > On Thu, Nov 20, 2025 at 5:59 PM Ming Lei <ming.lei@redhat.com> wrote:
> > >
> > > Add __kfifo_alloc_node() by refactoring and reusing __kfifo_alloc(),
> > > and define kfifo_alloc_node() macro to support NUMA-aware memory
> > > allocation.
> > >
> > > The new __kfifo_alloc_node() function accepts a NUMA node parameter
> > > and uses kmalloc_array_node() instead of kmalloc_array() for
> > > node-specific allocation. The existing __kfifo_alloc() now calls
> > > __kfifo_alloc_node() with NUMA_NO_NODE to maintain backward
> > > compatibility.
> > >
> > > This enables users to allocate kfifo buffers on specific NUMA nodes,
> > > which is important for performance in NUMA systems where the kfifo
> > > will be primarily accessed by threads running on specific nodes.
> > >
> > > Cc: Stefani Seibold <stefani@seibold.net>
> > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > Cc: linux-kernel@vger.kernel.org
> > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > ---
> > >  include/linux/kfifo.h | 34 ++++++++++++++++++++++++++++++++--
> > >  lib/kfifo.c           |  8 ++++----
> > >  2 files changed, 36 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/include/linux/kfifo.h b/include/linux/kfifo.h
> > > index fd743d4c4b4b..8b81ac74829c 100644
> > > --- a/include/linux/kfifo.h
> > > +++ b/include/linux/kfifo.h
> > > @@ -369,6 +369,30 @@ __kfifo_int_must_check_helper( \
> > >  }) \
> > >  )
> > >
> > > +/**
> > > + * kfifo_alloc_node - dynamically allocates a new fifo buffer on a NUMA node
> > > + * @fifo: pointer to the fifo
> > > + * @size: the number of elements in the fifo, this must be a power of 2
> > > + * @gfp_mask: get_free_pages mask, passed to kmalloc()
> > > + * @node: NUMA node to allocate memory on
> > > + *
> > > + * This macro dynamically allocates a new fifo buffer with NUMA node awareness.
> > > + *
> > > + * The number of elements will be rounded-up to a power of 2.
> > > + * The fifo will be release with kfifo_free().
> > > + * Return 0 if no error, otherwise an error code.
> > > + */
> > > +#define kfifo_alloc_node(fifo, size, gfp_mask, node) \
> > > +__kfifo_int_must_check_helper( \
> > > +({ \
> > > +       typeof((fifo) + 1) __tmp = (fifo); \
> > > +       struct __kfifo *__kfifo = &__tmp->kfifo; \
> > > +       __is_kfifo_ptr(__tmp) ? \
> > > +       __kfifo_alloc_node(__kfifo, size, sizeof(*__tmp->type), gfp_mask, node) : \
> > > +       -EINVAL; \
> > > +}) \
> > > +)
> >
> > Looks like we could avoid some code duplication by defining
> > kfifo_alloc(fifo, size, gfp_mask) as kfifo_alloc_node(fifo, size,
> > gfp_mask, NUMA_NO_NODE). Otherwise, this looks good to me.
>
> It is just a single-line inline, and shouldn't introduce any code
> duplication. Switching to kfifo_alloc_node() doesn't change result of
> `size vmlinux` actually.

Right, I know they expand to the same thing. I'm just saying we can
avoid repeating the nearly identical implementations by writing
kfifo_alloc() in terms of kfifo_alloc_node().
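i.e. something like:

	#define kfifo_alloc(fifo, size, gfp_mask) \
		kfifo_alloc_node(fifo, size, gfp_mask, NUMA_NO_NODE)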

Best,
Caleb

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH V4 14/27] ublk: add UBLK_U_IO_FETCH_IO_CMDS for batch I/O processing
  2025-12-01  5:55   ` Caleb Sander Mateos
@ 2025-12-01  9:41     ` Ming Lei
  2025-12-01 17:51       ` Caleb Sander Mateos
  0 siblings, 1 reply; 66+ messages in thread
From: Ming Lei @ 2025-12-01  9:41 UTC (permalink / raw)
  To: Caleb Sander Mateos
  Cc: Jens Axboe, linux-block, Uday Shankar, Stefani Seibold,
	Andrew Morton, linux-kernel

On Sun, Nov 30, 2025 at 09:55:47PM -0800, Caleb Sander Mateos wrote:
> On Thu, Nov 20, 2025 at 6:00 PM Ming Lei <ming.lei@redhat.com> wrote:
> >
> > Add UBLK_U_IO_FETCH_IO_CMDS command to enable efficient batch processing
> > of I/O requests. This multishot uring_cmd allows the ublk server to fetch
> > multiple I/O commands in a single operation, significantly reducing
> > submission overhead compared to individual FETCH_REQ* commands.
> >
> > Key Design Features:
> >
> > 1. Multishot Operation: One UBLK_U_IO_FETCH_IO_CMDS can fetch many I/O
> >    commands, with the batch size limited by the provided buffer length.
> >
> > 2. Dynamic Load Balancing: Multiple fetch commands can be submitted
> >    simultaneously, but only one is active at any time. This enables
> >    efficient load distribution across multiple server task contexts.
> >
> > 3. Implicit State Management: The implementation uses three key variables
> >    to track state:
> >    - evts_fifo: Queue of request tags awaiting processing
> >    - fcmd_head: List of available fetch commands
> >    - active_fcmd: Currently active fetch command (NULL = none active)
> >
> >    States are derived implicitly:
> >    - IDLE: No fetch commands available
> >    - READY: Fetch commands available, none active
> >    - ACTIVE: One fetch command processing events
> >
> > 4. Lockless Reader Optimization: The active fetch command can read from
> >    evts_fifo without locking (single reader guarantee), while writers
> >    (ublk_queue_rq/ublk_queue_rqs) use evts_lock protection. The memory
> >    barrier pairing plays a key role in the single lockless reader
> >    optimization.
> >
> > Implementation Details:
> >
> > - ublk_queue_rq() and ublk_queue_rqs() save request tags to evts_fifo
> > - __ublk_pick_active_fcmd() selects an available fetch command when
> >   events arrive and no command is currently active
> 
> What is __ublk_pick_active_fcmd()? I don't see a function with that name.

It has been renamed to __ublk_acquire_fcmd(), and its counterpart is
__ublk_release_fcmd().

> 
> > - ublk_batch_dispatch() moves tags from evts_fifo to the fetch command's
> >   buffer and posts completion via io_uring_mshot_cmd_post_cqe()
> > - State transitions are coordinated via evts_lock to maintain consistency
> >
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >  drivers/block/ublk_drv.c      | 412 +++++++++++++++++++++++++++++++---
> >  include/uapi/linux/ublk_cmd.h |   7 +
> >  2 files changed, 388 insertions(+), 31 deletions(-)
> >
> > diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> > index cc9c92d97349..2e5e392c939e 100644
> > --- a/drivers/block/ublk_drv.c
> > +++ b/drivers/block/ublk_drv.c
> > @@ -93,6 +93,7 @@
> >
> >  /* ublk batch fetch uring_cmd */
> >  struct ublk_batch_fcmd {
> > +       struct list_head node;
> >         struct io_uring_cmd *cmd;
> >         unsigned short buf_group;
> >  };
> > @@ -117,7 +118,10 @@ struct ublk_uring_cmd_pdu {
> >          */
> >         struct ublk_queue *ubq;
> >
> > -       u16 tag;
> > +       union {
> > +               u16 tag;
> > +               struct ublk_batch_fcmd *fcmd; /* batch io only */
> > +       };
> >  };
> >
> >  struct ublk_batch_io_data {
> > @@ -229,18 +233,36 @@ struct ublk_queue {
> >         struct ublk_device *dev;
> >
> >         /*
> > -        * Inflight ublk request tag is saved in this fifo
> > +        * Batch I/O State Management:
> > +        *
> > +        * The batch I/O system uses implicit state management based on the
> > +        * combination of three key variables below.
> > +        *
> > +        * - IDLE: list_empty(&fcmd_head) && !active_fcmd
> > +        *   No fetch commands available, events queue in evts_fifo
> > +        *
> > +        * - READY: !list_empty(&fcmd_head) && !active_fcmd
> > +        *   Fetch commands available but none processing events
> >          *
> > -        * There are multiple writer from ublk_queue_rq() or ublk_queue_rqs(),
> > -        * so lock is required for storing request tag to fifo
> > +        * - ACTIVE: active_fcmd
> > +        *   One fetch command actively processing events from evts_fifo
> >          *
> > -        * Make sure just one reader for fetching request from task work
> > -        * function to ublk server, so no need to grab the lock in reader
> > -        * side.
> > +        * Key Invariants:
> > +        * - At most one active_fcmd at any time (single reader)
> > +        * - active_fcmd is always from fcmd_head list when non-NULL
> > +        * - evts_fifo can be read locklessly by the single active reader
> > +        * - All state transitions require evts_lock protection
> > +        * - Multiple writers to evts_fifo require lock protection
> >          */
> >         struct {
> >                 DECLARE_KFIFO_PTR(evts_fifo, unsigned short);
> >                 spinlock_t evts_lock;
> > +
> > +               /* List of fetch commands available to process events */
> > +               struct list_head fcmd_head;
> > +
> > +               /* Currently active fetch command (NULL = none active) */
> > +               struct ublk_batch_fcmd  *active_fcmd;
> >         }____cacheline_aligned_in_smp;
> >
> >         struct ublk_io ios[] __counted_by(q_depth);
> > @@ -292,12 +314,20 @@ static void ublk_abort_queue(struct ublk_device *ub, struct ublk_queue *ubq);
> >  static inline struct request *__ublk_check_and_get_req(struct ublk_device *ub,
> >                 u16 q_id, u16 tag, struct ublk_io *io, size_t offset);
> >  static inline unsigned int ublk_req_build_flags(struct request *req);
> > +static void ublk_batch_dispatch(struct ublk_queue *ubq,
> > +                               struct ublk_batch_io_data *data,
> > +                               struct ublk_batch_fcmd *fcmd);
> >
> >  static inline bool ublk_dev_support_batch_io(const struct ublk_device *ub)
> >  {
> >         return false;
> >  }
> >
> > +static inline bool ublk_support_batch_io(const struct ublk_queue *ubq)
> > +{
> > +       return false;
> > +}
> > +
> >  static inline void ublk_io_lock(struct ublk_io *io)
> >  {
> >         spin_lock(&io->lock);
> > @@ -624,13 +654,45 @@ static wait_queue_head_t ublk_idr_wq;     /* wait until one idr is freed */
> >
> >  static DEFINE_MUTEX(ublk_ctl_mutex);
> >
> > +static struct ublk_batch_fcmd *
> > +ublk_batch_alloc_fcmd(struct io_uring_cmd *cmd)
> > +{
> > +       struct ublk_batch_fcmd *fcmd = kzalloc(sizeof(*fcmd), GFP_NOIO);
> 
> An allocation in the I/O path seems unfortunate. Is there not room to
> store the struct ublk_batch_fcmd in the io_uring_cmd pdu?

It is allocated once per mshot request, which covers many I/Os.

It can't be held in the uring_cmd pdu, but the allocation can be optimized
in the future. Not a big deal at the enablement stage.

> > +
> > +       if (fcmd) {
> > +               fcmd->cmd = cmd;
> > +               fcmd->buf_group = READ_ONCE(cmd->sqe->buf_index);
> 
> Is it necessary to sample this here just to pass it back to the
> io_uring layer? Wouldn't the io_uring layer already have access to it
> in struct io_kiocb's buf_index field?

->buf_group is used by io_uring_cmd_buffer_select(), and this also follows
how ->buf_index is used in both io_uring/net.c and io_uring/rw.c.

More importantly, req->buf_index is used internally in io_uring/kbuf.c, see
io_ring_buffer_select(), so we can't reuse req->buf_index here.

> 
> > +       }
> > +       return fcmd;
> > +}
> > +
> > +static void ublk_batch_free_fcmd(struct ublk_batch_fcmd *fcmd)
> > +{
> > +       kfree(fcmd);
> > +}
> > +
> > +static void __ublk_release_fcmd(struct ublk_queue *ubq)
> > +{
> > +       WRITE_ONCE(ubq->active_fcmd, NULL);
> > +}
> >
> > -static void ublk_batch_deinit_fetch_buf(const struct ublk_batch_io_data *data,
> > +/*
> > + * Nothing can move on, so clear ->active_fcmd, and the caller should stop
> > + * dispatching
> > + */
> > +static void ublk_batch_deinit_fetch_buf(struct ublk_queue *ubq,
> > +                                       const struct ublk_batch_io_data *data,
> >                                         struct ublk_batch_fcmd *fcmd,
> >                                         int res)
> >  {
> > +       spin_lock(&ubq->evts_lock);
> > +       list_del(&fcmd->node);
> > +       WARN_ON_ONCE(fcmd != ubq->active_fcmd);
> > +       __ublk_release_fcmd(ubq);
> > +       spin_unlock(&ubq->evts_lock);
> > +
> >         io_uring_cmd_done(fcmd->cmd, res, data->issue_flags);
> > -       fcmd->cmd = NULL;
> > +       ublk_batch_free_fcmd(fcmd);
> >  }
> >
> >  static int ublk_batch_fetch_post_cqe(struct ublk_batch_fcmd *fcmd,
> > @@ -1491,6 +1553,8 @@ static int __ublk_batch_dispatch(struct ublk_queue *ubq,
> >         bool needs_filter;
> >         int ret;
> >
> > +       WARN_ON_ONCE(data->cmd != fcmd->cmd);
> > +
> >         sel = io_uring_cmd_buffer_select(fcmd->cmd, fcmd->buf_group, &len,
> >                                          data->issue_flags);
> >         if (sel.val < 0)
> > @@ -1548,23 +1612,94 @@ static int __ublk_batch_dispatch(struct ublk_queue *ubq,
> >         return ret;
> >  }
> >
> > -static __maybe_unused int
> > -ublk_batch_dispatch(struct ublk_queue *ubq,
> > -                   const struct ublk_batch_io_data *data,
> > -                   struct ublk_batch_fcmd *fcmd)
> > +static struct ublk_batch_fcmd *__ublk_acquire_fcmd(
> > +               struct ublk_queue *ubq)
> > +{
> > +       struct ublk_batch_fcmd *fcmd;
> > +
> > +       lockdep_assert_held(&ubq->evts_lock);
> > +
> > +       /*
> > +        * Ordering updating ubq->evts_fifo and checking ubq->active_fcmd.
> > +        *
> > +        * The pair is the smp_mb() in ublk_batch_dispatch().
> > +        *
> > +        * If ubq->active_fcmd is observed as non-NULL, the new added tags
> > +        * can be visisible in ublk_batch_dispatch() with the barrier pairing.
> > +        */
> > +       smp_mb();
> > +       if (READ_ONCE(ubq->active_fcmd)) {
> > +               fcmd = NULL;
> > +       } else {
> > +               fcmd = list_first_entry_or_null(&ubq->fcmd_head,
> > +                               struct ublk_batch_fcmd, node);
> > +               WRITE_ONCE(ubq->active_fcmd, fcmd);
> > +       }
> > +       return fcmd;
> > +}
> > +
> > +static void ublk_batch_tw_cb(struct io_uring_cmd *cmd,
> > +                          unsigned int issue_flags)
> > +{
> > +       struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
> > +       struct ublk_batch_fcmd *fcmd = pdu->fcmd;
> > +       struct ublk_batch_io_data data = {
> > +               .ub = pdu->ubq->dev,
> > +               .cmd = fcmd->cmd,
> > +               .issue_flags = issue_flags,
> > +       };
> > +
> > +       WARN_ON_ONCE(pdu->ubq->active_fcmd != fcmd);
> > +
> > +       ublk_batch_dispatch(pdu->ubq, &data, fcmd);
> > +}
> > +
> > +static void ublk_batch_dispatch(struct ublk_queue *ubq,
> > +                               struct ublk_batch_io_data *data,
> > +                               struct ublk_batch_fcmd *fcmd)
> >  {
> > +       struct ublk_batch_fcmd *new_fcmd;
> 
> Is the new_fcmd variable necessary? Can fcmd be reused instead?
> 
> > +       void *handle;
> > +       bool empty;
> >         int ret = 0;
> >
> > +again:
> >         while (!ublk_io_evts_empty(ubq)) {
> >                 ret = __ublk_batch_dispatch(ubq, data, fcmd);
> >                 if (ret <= 0)
> >                         break;
> >         }
> >
> > -       if (ret < 0)
> > -               ublk_batch_deinit_fetch_buf(data, fcmd, ret);
> > +       if (ret < 0) {
> > +               ublk_batch_deinit_fetch_buf(ubq, data, fcmd, ret);
> > +               return;
> > +       }
> >
> > -       return ret;
> > +       handle = io_uring_cmd_ctx_handle(fcmd->cmd);
> > +       __ublk_release_fcmd(ubq);
> > +       /*
> > +        * Order clearing ubq->active_fcmd from __ublk_release_fcmd() and
> > +        * checking ubq->evts_fifo.
> > +        *
> > +        * The pair is the smp_mb() in __ublk_acquire_fcmd().
> > +        */
> > +       smp_mb();
> > +       empty = ublk_io_evts_empty(ubq);
> > +       if (likely(empty))
> 
> nit: empty variable seems unnecessary
> 
> > +               return;
> > +
> > +       spin_lock(&ubq->evts_lock);
> > +       new_fcmd = __ublk_acquire_fcmd(ubq);
> > +       spin_unlock(&ubq->evts_lock);
> > +
> > +       if (!new_fcmd)
> > +               return;
> > +       if (handle == io_uring_cmd_ctx_handle(new_fcmd->cmd)) {
> 
> This check seems to be meant to decide whether the new and old
> UBLK_U_IO_FETCH_IO_CMDS commands can execute in the same task work?

Actually not.

> But belonging to the same io_uring context doesn't necessarily mean
> that the same task issued them. It seems like it would be safer to
> always dispatch new_fcmd->cmd to task work.

What matters is just that ctx->uring_lock and issue_flags match from the
ublk viewpoint, so it is safe to do so.

However, given that this case is only hit in the slow path, always starting
a new dispatch is easier.
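
i.e. the tail of ublk_batch_dispatch() could become something like
(untested):

	spin_lock(&ubq->evts_lock);
	new_fcmd = __ublk_acquire_fcmd(ubq);
	spin_unlock(&ubq->evts_lock);

	/* always punt the new fetch command to task work in this slow path */
	if (new_fcmd)
		io_uring_cmd_complete_in_task(new_fcmd->cmd, ublk_batch_tw_cb);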

> 
> > +               data->cmd = new_fcmd->cmd;
> > +               fcmd = new_fcmd;
> > +               goto again;
> > +       }
> > +       io_uring_cmd_complete_in_task(new_fcmd->cmd, ublk_batch_tw_cb);
> >  }
> >
> >  static void ublk_cmd_tw_cb(struct io_uring_cmd *cmd,
> > @@ -1576,13 +1711,27 @@ static void ublk_cmd_tw_cb(struct io_uring_cmd *cmd,
> >         ublk_dispatch_req(ubq, pdu->req, issue_flags);
> >  }
> >
> > -static void ublk_queue_cmd(struct ublk_queue *ubq, struct request *rq)
> > +static void ublk_queue_cmd(struct ublk_queue *ubq, struct request *rq, bool last)
> >  {
> > -       struct io_uring_cmd *cmd = ubq->ios[rq->tag].cmd;
> > -       struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
> > +       if (ublk_support_batch_io(ubq)) {
> > +               unsigned short tag = rq->tag;
> > +               struct ublk_batch_fcmd *fcmd = NULL;
> >
> > -       pdu->req = rq;
> > -       io_uring_cmd_complete_in_task(cmd, ublk_cmd_tw_cb);
> > +               spin_lock(&ubq->evts_lock);
> > +               kfifo_put(&ubq->evts_fifo, tag);
> > +               if (last)
> > +                       fcmd = __ublk_acquire_fcmd(ubq);
> > +               spin_unlock(&ubq->evts_lock);
> > +
> > +               if (fcmd)
> > +                       io_uring_cmd_complete_in_task(fcmd->cmd, ublk_batch_tw_cb);
> > +       } else {
> > +               struct io_uring_cmd *cmd = ubq->ios[rq->tag].cmd;
> > +               struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
> > +
> > +               pdu->req = rq;
> > +               io_uring_cmd_complete_in_task(cmd, ublk_cmd_tw_cb);
> > +       }
> >  }
> >
> >  static void ublk_cmd_list_tw_cb(struct io_uring_cmd *cmd,
> > @@ -1600,14 +1749,44 @@ static void ublk_cmd_list_tw_cb(struct io_uring_cmd *cmd,
> >         } while (rq);
> >  }
> >
> > -static void ublk_queue_cmd_list(struct ublk_io *io, struct rq_list *l)
> > +static void ublk_batch_queue_cmd_list(struct ublk_queue *ubq, struct rq_list *l)
> >  {
> > -       struct io_uring_cmd *cmd = io->cmd;
> > -       struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
> > +       unsigned short tags[MAX_NR_TAG];
> > +       struct ublk_batch_fcmd *fcmd;
> > +       struct request *rq;
> > +       unsigned cnt = 0;
> > +
> > +       spin_lock(&ubq->evts_lock);
> > +       rq_list_for_each(l, rq) {
> > +               tags[cnt++] = (unsigned short)rq->tag;
> > +               if (cnt >= MAX_NR_TAG) {
> > +                       kfifo_in(&ubq->evts_fifo, tags, cnt);
> > +                       cnt = 0;
> > +               }
> > +       }
> > +       if (cnt)
> > +               kfifo_in(&ubq->evts_fifo, tags, cnt);
> > +       fcmd = __ublk_acquire_fcmd(ubq);
> > +       spin_unlock(&ubq->evts_lock);
> >
> > -       pdu->req_list = rq_list_peek(l);
> >         rq_list_init(l);
> > -       io_uring_cmd_complete_in_task(cmd, ublk_cmd_list_tw_cb);
> > +       if (fcmd)
> > +               io_uring_cmd_complete_in_task(fcmd->cmd, ublk_batch_tw_cb);
> > +}
> > +
> > +static void ublk_queue_cmd_list(struct ublk_queue *ubq, struct ublk_io *io,
> > +                               struct rq_list *l, bool batch)
> > +{
> > +       if (batch) {
> > +               ublk_batch_queue_cmd_list(ubq, l);
> > +       } else {
> > +               struct io_uring_cmd *cmd = io->cmd;
> > +               struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
> > +
> > +               pdu->req_list = rq_list_peek(l);
> > +               rq_list_init(l);
> > +               io_uring_cmd_complete_in_task(cmd, ublk_cmd_list_tw_cb);
> > +       }
> >  }
> >
> >  static enum blk_eh_timer_return ublk_timeout(struct request *rq)
> > @@ -1686,7 +1865,7 @@ static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx,
> >                 return BLK_STS_OK;
> >         }
> >
> > -       ublk_queue_cmd(ubq, rq);
> > +       ublk_queue_cmd(ubq, rq, bd->last);
> >         return BLK_STS_OK;
> >  }
> >
> > @@ -1698,11 +1877,25 @@ static inline bool ublk_belong_to_same_batch(const struct ublk_io *io,
> >                 (io->task == io2->task);
> >  }
> >
> > -static void ublk_queue_rqs(struct rq_list *rqlist)
> > +static void ublk_commit_rqs(struct blk_mq_hw_ctx *hctx)
> > +{
> > +       struct ublk_queue *ubq = hctx->driver_data;
> > +       struct ublk_batch_fcmd *fcmd;
> > +
> > +       spin_lock(&ubq->evts_lock);
> > +       fcmd = __ublk_acquire_fcmd(ubq);
> > +       spin_unlock(&ubq->evts_lock);
> > +
> > +       if (fcmd)
> > +               io_uring_cmd_complete_in_task(fcmd->cmd, ublk_batch_tw_cb);
> > +}
> > +
> > +static void __ublk_queue_rqs(struct rq_list *rqlist, bool batch)
> >  {
> >         struct rq_list requeue_list = { };
> >         struct rq_list submit_list = { };
> >         struct ublk_io *io = NULL;
> > +       struct ublk_queue *ubq = NULL;
> >         struct request *req;
> >
> >         while ((req = rq_list_pop(rqlist))) {
> > @@ -1716,16 +1909,27 @@ static void ublk_queue_rqs(struct rq_list *rqlist)
> >
> >                 if (io && !ublk_belong_to_same_batch(io, this_io) &&
> >                                 !rq_list_empty(&submit_list))
> > -                       ublk_queue_cmd_list(io, &submit_list);
> > +                       ublk_queue_cmd_list(ubq, io, &submit_list, batch);
> 
> This seems to assume that all the requests belong to the same
> ublk_queue, which isn't required

Here it is required for BATCH_IO, so a new __ublk_queue_rqs()
implementation is needed now.

Nice catch!

> 
> >                 io = this_io;
> > +               ubq = this_q;
> >                 rq_list_add_tail(&submit_list, req);
> >         }
> >
> >         if (!rq_list_empty(&submit_list))
> > -               ublk_queue_cmd_list(io, &submit_list);
> > +               ublk_queue_cmd_list(ubq, io, &submit_list, batch);
> 
> Same here
> 
> >         *rqlist = requeue_list;
> >  }
> >
> > +static void ublk_queue_rqs(struct rq_list *rqlist)
> > +{
> > +       __ublk_queue_rqs(rqlist, false);
> > +}
> > +
> > +static void ublk_batch_queue_rqs(struct rq_list *rqlist)
> > +{
> > +       __ublk_queue_rqs(rqlist, true);
> > +}
> > +
> >  static int ublk_init_hctx(struct blk_mq_hw_ctx *hctx, void *driver_data,
> >                 unsigned int hctx_idx)
> >  {
> > @@ -1743,6 +1947,14 @@ static const struct blk_mq_ops ublk_mq_ops = {
> >         .timeout        = ublk_timeout,
> >  };
> >
> > +static const struct blk_mq_ops ublk_batch_mq_ops = {
> > +       .commit_rqs     = ublk_commit_rqs,
> > +       .queue_rq       = ublk_queue_rq,
> > +       .queue_rqs      = ublk_batch_queue_rqs,
> > +       .init_hctx      = ublk_init_hctx,
> > +       .timeout        = ublk_timeout,
> > +};
> > +
> >  static void ublk_queue_reinit(struct ublk_device *ub, struct ublk_queue *ubq)
> >  {
> >         int i;
> > @@ -2120,6 +2332,56 @@ static void ublk_cancel_cmd(struct ublk_queue *ubq, unsigned tag,
> >                 io_uring_cmd_done(io->cmd, UBLK_IO_RES_ABORT, issue_flags);
> >  }
> >
> > +static void ublk_batch_cancel_cmd(struct ublk_queue *ubq,
> > +                                 struct ublk_batch_fcmd *fcmd,
> > +                                 unsigned int issue_flags)
> > +{
> > +       bool done;
> > +
> > +       spin_lock(&ubq->evts_lock);
> > +       done = (ubq->active_fcmd != fcmd);
> 
> Needs to use READ_ONCE() since __ublk_release_fcmd() can be called
> without holding evts_lock?

OK.

> 
> > +       if (done)
> > +               list_del(&fcmd->node);
> > +       spin_unlock(&ubq->evts_lock);
> > +
> > +       if (done) {
> > +               io_uring_cmd_done(fcmd->cmd, UBLK_IO_RES_ABORT, issue_flags);
> > +               ublk_batch_free_fcmd(fcmd);
> > +       }
> > +}
> > +
> > +static void ublk_batch_cancel_queue(struct ublk_queue *ubq)
> > +{
> > +       LIST_HEAD(fcmd_list);
> > +
> > +       spin_lock(&ubq->evts_lock);
> > +       ubq->force_abort = true;
> > +       list_splice_init(&ubq->fcmd_head, &fcmd_list);
> > +       if (ubq->active_fcmd)
> > +               list_move(&ubq->active_fcmd->node, &ubq->fcmd_head);
> 
> Similarly, needs READ_ONCE()?

OK.

But this one may not be necessary, since everything is quiesced at this
point and the lockless code path can't be hit any more.

> 
> > +       spin_unlock(&ubq->evts_lock);
> > +
> > +       while (!list_empty(&fcmd_list)) {
> > +               struct ublk_batch_fcmd *fcmd = list_first_entry(&fcmd_list,
> > +                               struct ublk_batch_fcmd, node);
> > +
> > +               ublk_batch_cancel_cmd(ubq, fcmd, IO_URING_F_UNLOCKED);
> > +       }
> > +}
> > +
> > +static void ublk_batch_cancel_fn(struct io_uring_cmd *cmd,
> > +                                unsigned int issue_flags)
> > +{
> > +       struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
> > +       struct ublk_batch_fcmd *fcmd = pdu->fcmd;
> > +       struct ublk_queue *ubq = pdu->ubq;
> > +
> > +       if (!ubq->canceling)
> 
> Is it not racy to access ubq->canceling without any lock held?

OK, will switch to calling ublk_start_cancel() unconditionally.
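
i.e. something like (untested):

	static void ublk_batch_cancel_fn(struct io_uring_cmd *cmd,
					 unsigned int issue_flags)
	{
		struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
		struct ublk_batch_fcmd *fcmd = pdu->fcmd;
		struct ublk_queue *ubq = pdu->ubq;

		/* no lockless ->canceling check, just start cancel every time */
		ublk_start_cancel(ubq->dev);
		ublk_batch_cancel_cmd(ubq, fcmd, issue_flags);
	}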

> 
> > +               ublk_start_cancel(ubq->dev);
> > +
> > +       ublk_batch_cancel_cmd(ubq, fcmd, issue_flags);
> > +}
> > +
> >  /*
> >   * The ublk char device won't be closed when calling cancel fn, so both
> >   * ublk device and queue are guaranteed to be live
> > @@ -2171,6 +2433,11 @@ static void ublk_cancel_queue(struct ublk_queue *ubq)
> >  {
> >         int i;
> >
> > +       if (ublk_support_batch_io(ubq)) {
> > +               ublk_batch_cancel_queue(ubq);
> > +               return;
> > +       }
> > +
> >         for (i = 0; i < ubq->q_depth; i++)
> >                 ublk_cancel_cmd(ubq, i, IO_URING_F_UNLOCKED);
> >  }
> > @@ -3091,6 +3358,74 @@ static int ublk_check_batch_cmd(const struct ublk_batch_io_data *data)
> >         return ublk_check_batch_cmd_flags(uc);
> >  }
> >
> > +static int ublk_batch_attach(struct ublk_queue *ubq,
> > +                            struct ublk_batch_io_data *data,
> > +                            struct ublk_batch_fcmd *fcmd)
> > +{
> > +       struct ublk_batch_fcmd *new_fcmd = NULL;
> > +       bool free = false;
> > +
> > +       spin_lock(&ubq->evts_lock);
> > +       if (unlikely(ubq->force_abort || ubq->canceling)) {
> > +               free = true;
> > +       } else {
> > +               list_add_tail(&fcmd->node, &ubq->fcmd_head);
> > +               new_fcmd = __ublk_acquire_fcmd(ubq);
> > +       }
> > +       spin_unlock(&ubq->evts_lock);
> > +
> > +       /*
> > +        * If the two fetch commands are originated from same io_ring_ctx,
> > +        * run batch dispatch directly. Otherwise, schedule task work for
> > +        * doing it.
> > +        */
> > +       if (new_fcmd && io_uring_cmd_ctx_handle(new_fcmd->cmd) ==
> > +                       io_uring_cmd_ctx_handle(fcmd->cmd)) {
> > +               data->cmd = new_fcmd->cmd;
> > +               ublk_batch_dispatch(ubq, data, new_fcmd);
> > +       } else if (new_fcmd) {
> > +               io_uring_cmd_complete_in_task(new_fcmd->cmd,
> > +                               ublk_batch_tw_cb);
> > +       }
> 
> Return early if (!new_fcmd) to reduce indentation?
> 
> > +
> > +       if (free) {
> > +               ublk_batch_free_fcmd(fcmd);
> > +               return -ENODEV;
> > +       }
> 
> Move the if (free) check directly after spin_unlock(&ubq->evts_lock)?

Yeah, this is better.
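
Something like this (untested), which also takes the early-return
suggestion above:

	static int ublk_batch_attach(struct ublk_queue *ubq,
				     struct ublk_batch_io_data *data,
				     struct ublk_batch_fcmd *fcmd)
	{
		struct ublk_batch_fcmd *new_fcmd = NULL;
		bool free = false;

		spin_lock(&ubq->evts_lock);
		if (unlikely(ubq->force_abort || ubq->canceling)) {
			free = true;
		} else {
			list_add_tail(&fcmd->node, &ubq->fcmd_head);
			new_fcmd = __ublk_acquire_fcmd(ubq);
		}
		spin_unlock(&ubq->evts_lock);

		/* fcmd was never attached, free it and fail the command */
		if (free) {
			ublk_batch_free_fcmd(fcmd);
			return -ENODEV;
		}

		if (!new_fcmd)
			return -EIOCBQUEUED;

		/*
		 * If the two fetch commands originate from the same
		 * io_ring_ctx, run batch dispatch directly; otherwise
		 * schedule task work for doing it.
		 */
		if (io_uring_cmd_ctx_handle(new_fcmd->cmd) ==
				io_uring_cmd_ctx_handle(fcmd->cmd)) {
			data->cmd = new_fcmd->cmd;
			ublk_batch_dispatch(ubq, data, new_fcmd);
		} else {
			io_uring_cmd_complete_in_task(new_fcmd->cmd,
						      ublk_batch_tw_cb);
		}
		return -EIOCBQUEUED;
	}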

> 
> > +       return -EIOCBQUEUED;
> 
> > +}
> > +
> > +static int ublk_handle_batch_fetch_cmd(struct ublk_batch_io_data *data)
> > +{
> > +       struct ublk_queue *ubq = ublk_get_queue(data->ub, data->header.q_id);
> > +       struct ublk_batch_fcmd *fcmd = ublk_batch_alloc_fcmd(data->cmd);
> > +       struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(data->cmd);
> > +
> > +       if (!fcmd)
> > +               return -ENOMEM;
> > +
> > +       pdu->ubq = ubq;
> > +       pdu->fcmd = fcmd;
> > +       io_uring_cmd_mark_cancelable(data->cmd, data->issue_flags);
> > +
> > +       return ublk_batch_attach(ubq, data, fcmd);
> > +}
> > +
> > +static int ublk_validate_batch_fetch_cmd(struct ublk_batch_io_data *data,
> > +                                        const struct ublk_batch_io *uc)
> > +{
> > +       if (!(data->cmd->flags & IORING_URING_CMD_MULTISHOT))
> > +               return -EINVAL;
> > +
> > +       if (uc->elem_bytes != sizeof(__u16))
> > +               return -EINVAL;
> > +
> > +       if (uc->flags != 0)
> > +               return -E2BIG;
> > +
> > +       return 0;
> > +}
> > +
> >  static int ublk_ch_batch_io_uring_cmd(struct io_uring_cmd *cmd,
> >                                        unsigned int issue_flags)
> >  {
> > @@ -3113,6 +3448,11 @@ static int ublk_ch_batch_io_uring_cmd(struct io_uring_cmd *cmd,
> >         if (data.header.q_id >= ub->dev_info.nr_hw_queues)
> >                 goto out;
> >
> > +       if (unlikely(issue_flags & IO_URING_F_CANCEL)) {
> > +               ublk_batch_cancel_fn(cmd, issue_flags);
> > +               return 0;
> > +       }
> 
> Move this to the top of the function before the other logic that's not
> necessary in the cancel case?

Yeah, looks better.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH V4 11/27] ublk: handle UBLK_U_IO_COMMIT_IO_CMDS
  2025-11-30 16:39   ` Caleb Sander Mateos
@ 2025-12-01 10:25     ` Ming Lei
  2025-12-01 16:43       ` Caleb Sander Mateos
  0 siblings, 1 reply; 66+ messages in thread
From: Ming Lei @ 2025-12-01 10:25 UTC (permalink / raw)
  To: Caleb Sander Mateos
  Cc: Jens Axboe, linux-block, Uday Shankar, Stefani Seibold,
	Andrew Morton, linux-kernel

On Sun, Nov 30, 2025 at 08:39:49AM -0800, Caleb Sander Mateos wrote:
> On Thu, Nov 20, 2025 at 5:59 PM Ming Lei <ming.lei@redhat.com> wrote:
> >
> > Handle UBLK_U_IO_COMMIT_IO_CMDS by walking the uring_cmd fixed buffer:
> >
> > - read each element into one temp buffer in batch style
> >
> > - parse and apply each element for committing io result
> >
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >  drivers/block/ublk_drv.c      | 117 ++++++++++++++++++++++++++++++++--
> >  include/uapi/linux/ublk_cmd.h |   8 +++
> >  2 files changed, 121 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> > index 66c77daae955..ea992366af5b 100644
> > --- a/drivers/block/ublk_drv.c
> > +++ b/drivers/block/ublk_drv.c
> > @@ -2098,9 +2098,9 @@ static inline int ublk_set_auto_buf_reg(struct ublk_io *io, struct io_uring_cmd
> >         return 0;
> >  }
> >
> > -static int ublk_handle_auto_buf_reg(struct ublk_io *io,
> > -                                   struct io_uring_cmd *cmd,
> > -                                   u16 *buf_idx)
> > +static void __ublk_handle_auto_buf_reg(struct ublk_io *io,
> > +                                      struct io_uring_cmd *cmd,
> > +                                      u16 *buf_idx)
> 
> The name could be a bit more descriptive. How about "ublk_clear_auto_buf_reg()"?

Looks fine.

> 
> >  {
> >         if (io->flags & UBLK_IO_FLAG_AUTO_BUF_REG) {
> >                 io->flags &= ~UBLK_IO_FLAG_AUTO_BUF_REG;
> > @@ -2118,7 +2118,13 @@ static int ublk_handle_auto_buf_reg(struct ublk_io *io,
> >                 if (io->buf_ctx_handle == io_uring_cmd_ctx_handle(cmd))
> >                         *buf_idx = io->buf.auto_reg.index;
> >         }
> > +}
> >
> > +static int ublk_handle_auto_buf_reg(struct ublk_io *io,
> > +                                   struct io_uring_cmd *cmd,
> > +                                   u16 *buf_idx)
> > +{
> > +       __ublk_handle_auto_buf_reg(io, cmd, buf_idx);
> >         return ublk_set_auto_buf_reg(io, cmd);
> >  }
> >
> > @@ -2553,6 +2559,17 @@ static inline __u64 ublk_batch_buf_addr(const struct ublk_batch_io *uc,
> >         return 0;
> >  }
> >
> > +static inline __u64 ublk_batch_zone_lba(const struct ublk_batch_io *uc,
> > +                                       const struct ublk_elem_header *elem)
> > +{
> > +       const void *buf = (const void *)elem;
> 
> Unnecessary cast

OK

> 
> > +
> > +       if (uc->flags & UBLK_BATCH_F_HAS_ZONE_LBA)
> > +               return *(__u64 *)(buf + sizeof(*elem) +
> > +                               8 * !!(uc->flags & UBLK_BATCH_F_HAS_BUF_ADDR));
> 
> Cast to a const pointer?

OK, but I feel it isn't necessary.

> 
> 
> > +       return -1;
> > +}
> > +
> >  static struct ublk_auto_buf_reg
> >  ublk_batch_auto_buf_reg(const struct ublk_batch_io *uc,
> >                         const struct ublk_elem_header *elem)
> > @@ -2708,6 +2725,98 @@ static int ublk_handle_batch_prep_cmd(const struct ublk_batch_io_data *data)
> >         return ret;
> >  }
> >
> > +static int ublk_batch_commit_io_check(const struct ublk_queue *ubq,
> > +                                     struct ublk_io *io,
> > +                                     union ublk_io_buf *buf)
> > +{
> > +       struct request *req = io->req;
> > +
> > +       if (!req)
> > +               return -EINVAL;
> 
> This check seems redundant with the UBLK_IO_FLAG_OWNED_BY_SRV check?

I'd keep the check since it has some documentation value, or maybe turn
it into a WARN_ON()?
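
e.g.:

	if (WARN_ON_ONCE(!req))
		return -EINVAL;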

> 
> > +
> > +       if (io->flags & UBLK_IO_FLAG_ACTIVE)
> > +               return -EBUSY;
> 
> Aren't UBLK_IO_FLAG_ACTIVE and UBLK_IO_FLAG_OWNED_BY_SRV mutually
> exclusive? Then this check is also redundant with the
> UBLK_IO_FLAG_OWNED_BY_SRV check.

OK.

> 
> > +
> > +       if (!(io->flags & UBLK_IO_FLAG_OWNED_BY_SRV))
> > +               return -EINVAL;
> > +
> > +       if (ublk_need_map_io(ubq)) {
> > +               /*
> > +                * COMMIT_AND_FETCH_REQ has to provide IO buffer if
> > +                * NEED GET DATA is not enabled or it is Read IO.
> > +                */
> > +               if (!buf->addr && (!ublk_need_get_data(ubq) ||
> > +                                       req_op(req) == REQ_OP_READ))
> > +                       return -EINVAL;
> > +       }
> > +       return 0;
> > +}
> > +
> > +static int ublk_batch_commit_io(struct ublk_queue *ubq,
> > +                               const struct ublk_batch_io_data *data,
> > +                               const struct ublk_elem_header *elem)
> > +{
> > +       struct ublk_io *io = &ubq->ios[elem->tag];
> > +       const struct ublk_batch_io *uc = &data->header;
> > +       u16 buf_idx = UBLK_INVALID_BUF_IDX;
> > +       union ublk_io_buf buf = { 0 };
> > +       struct request *req = NULL;
> > +       bool auto_reg = false;
> > +       bool compl = false;
> > +       int ret;
> > +
> > +       if (ublk_dev_support_auto_buf_reg(data->ub)) {
> > +               buf.auto_reg = ublk_batch_auto_buf_reg(uc, elem);
> > +               auto_reg = true;
> > +       } else if (ublk_dev_need_map_io(data->ub))
> > +               buf.addr = ublk_batch_buf_addr(uc, elem);
> > +
> > +       ublk_io_lock(io);
> > +       ret = ublk_batch_commit_io_check(ubq, io, &buf);
> > +       if (!ret) {
> > +               io->res = elem->result;
> > +               io->buf = buf;
> > +               req = ublk_fill_io_cmd(io, data->cmd);
> > +
> > +               if (auto_reg)
> > +                       __ublk_handle_auto_buf_reg(io, data->cmd, &buf_idx);
> > +               compl = ublk_need_complete_req(data->ub, io);
> > +       }
> > +       ublk_io_unlock(io);
> > +
> > +       if (unlikely(ret)) {
> > +               pr_warn("%s: dev %u queue %u io %u: commit failure %d\n",
> > +                       __func__, data->ub->dev_info.dev_id, ubq->q_id,
> > +                       elem->tag, ret);
> 
> This warning can be triggered by userspace. It should probably be
> rate-limited or changed to pr_devel().

Looks fine.
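
e.g. (untested):

	pr_warn_ratelimited("%s: dev %u queue %u io %u: commit failure %d\n",
			    __func__, data->ub->dev_info.dev_id, ubq->q_id,
			    elem->tag, ret);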



Thanks, 
Ming


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH V4 11/27] ublk: handle UBLK_U_IO_COMMIT_IO_CMDS
  2025-12-01 10:25     ` Ming Lei
@ 2025-12-01 16:43       ` Caleb Sander Mateos
  0 siblings, 0 replies; 66+ messages in thread
From: Caleb Sander Mateos @ 2025-12-01 16:43 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Uday Shankar, Stefani Seibold,
	Andrew Morton, linux-kernel

On Mon, Dec 1, 2025 at 2:26 AM Ming Lei <ming.lei@redhat.com> wrote:
>
> On Sun, Nov 30, 2025 at 08:39:49AM -0800, Caleb Sander Mateos wrote:
> > On Thu, Nov 20, 2025 at 5:59 PM Ming Lei <ming.lei@redhat.com> wrote:
> > >
> > > Handle UBLK_U_IO_COMMIT_IO_CMDS by walking the uring_cmd fixed buffer:
> > >
> > > - read each element into one temp buffer in batch style
> > >
> > > - parse and apply each element for committing io result
> > >
> > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > ---
> > >  drivers/block/ublk_drv.c      | 117 ++++++++++++++++++++++++++++++++--
> > >  include/uapi/linux/ublk_cmd.h |   8 +++
> > >  2 files changed, 121 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> > > index 66c77daae955..ea992366af5b 100644
> > > --- a/drivers/block/ublk_drv.c
> > > +++ b/drivers/block/ublk_drv.c
> > > @@ -2098,9 +2098,9 @@ static inline int ublk_set_auto_buf_reg(struct ublk_io *io, struct io_uring_cmd
> > >         return 0;
> > >  }
> > >
> > > -static int ublk_handle_auto_buf_reg(struct ublk_io *io,
> > > -                                   struct io_uring_cmd *cmd,
> > > -                                   u16 *buf_idx)
> > > +static void __ublk_handle_auto_buf_reg(struct ublk_io *io,
> > > +                                      struct io_uring_cmd *cmd,
> > > +                                      u16 *buf_idx)
> >
> > The name could be a bit more descriptive. How about "ublk_clear_auto_buf_reg()"?
>
> Looks fine.
>
> >
> > >  {
> > >         if (io->flags & UBLK_IO_FLAG_AUTO_BUF_REG) {
> > >                 io->flags &= ~UBLK_IO_FLAG_AUTO_BUF_REG;
> > > @@ -2118,7 +2118,13 @@ static int ublk_handle_auto_buf_reg(struct ublk_io *io,
> > >                 if (io->buf_ctx_handle == io_uring_cmd_ctx_handle(cmd))
> > >                         *buf_idx = io->buf.auto_reg.index;
> > >         }
> > > +}
> > >
> > > +static int ublk_handle_auto_buf_reg(struct ublk_io *io,
> > > +                                   struct io_uring_cmd *cmd,
> > > +                                   u16 *buf_idx)
> > > +{
> > > +       __ublk_handle_auto_buf_reg(io, cmd, buf_idx);
> > >         return ublk_set_auto_buf_reg(io, cmd);
> > >  }
> > >
> > > @@ -2553,6 +2559,17 @@ static inline __u64 ublk_batch_buf_addr(const struct ublk_batch_io *uc,
> > >         return 0;
> > >  }
> > >
> > > +static inline __u64 ublk_batch_zone_lba(const struct ublk_batch_io *uc,
> > > +                                       const struct ublk_elem_header *elem)
> > > +{
> > > +       const void *buf = (const void *)elem;
> >
> > Unnecessary cast
>
> OK
>
> >
> > > +
> > > +       if (uc->flags & UBLK_BATCH_F_HAS_ZONE_LBA)
> > > +               return *(__u64 *)(buf + sizeof(*elem) +
> > > +                               8 * !!(uc->flags & UBLK_BATCH_F_HAS_BUF_ADDR));
> >
> > Cast to a const pointer?
>
> OK, but I feel it isn't necessary.

I don't feel strongly; it just seems like the purpose of the cast is
clearer when it doesn't change the const-ness of the pointer.
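
i.e.:

	return *(const __u64 *)(buf + sizeof(*elem) +
			8 * !!(uc->flags & UBLK_BATCH_F_HAS_BUF_ADDR));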

>
> >
> >
> > > +       return -1;
> > > +}
> > > +
> > >  static struct ublk_auto_buf_reg
> > >  ublk_batch_auto_buf_reg(const struct ublk_batch_io *uc,
> > >                         const struct ublk_elem_header *elem)
> > > @@ -2708,6 +2725,98 @@ static int ublk_handle_batch_prep_cmd(const struct ublk_batch_io_data *data)
> > >         return ret;
> > >  }
> > >
> > > +static int ublk_batch_commit_io_check(const struct ublk_queue *ubq,
> > > +                                     struct ublk_io *io,
> > > +                                     union ublk_io_buf *buf)
> > > +{
> > > +       struct request *req = io->req;
> > > +
> > > +       if (!req)
> > > +               return -EINVAL;
> >
> > This check seems redundant with the UBLK_IO_FLAG_OWNED_BY_SRV check?
>
> I'd keep the check since it has some documentation value, or maybe turn
> it into a WARN_ON()?

WARN_ON() seems okay, though I'm not sure it's necessary. There are
several existing places that assume io->req is set when
UBLK_IO_FLAG_OWNED_BY_SRV is set. (And that's the documented
precondition for using io->req: "valid if UBLK_IO_FLAG_OWNED_BY_SRV is
set".)

Best,
Caleb

>
> >
> > > +
> > > +       if (io->flags & UBLK_IO_FLAG_ACTIVE)
> > > +               return -EBUSY;
> >
> > Aren't UBLK_IO_FLAG_ACTIVE and UBLK_IO_FLAG_OWNED_BY_SRV mutually
> > exclusive? Then this check is also redundant with the
> > UBLK_IO_FLAG_OWNED_BY_SRV check.
>
> OK.
>
> >
> > > +
> > > +       if (!(io->flags & UBLK_IO_FLAG_OWNED_BY_SRV))
> > > +               return -EINVAL;
> > > +
> > > +       if (ublk_need_map_io(ubq)) {
> > > +               /*
> > > +                * COMMIT_AND_FETCH_REQ has to provide IO buffer if
> > > +                * NEED GET DATA is not enabled or it is Read IO.
> > > +                */
> > > +               if (!buf->addr && (!ublk_need_get_data(ubq) ||
> > > +                                       req_op(req) == REQ_OP_READ))
> > > +                       return -EINVAL;
> > > +       }
> > > +       return 0;
> > > +}
> > > +
> > > +static int ublk_batch_commit_io(struct ublk_queue *ubq,
> > > +                               const struct ublk_batch_io_data *data,
> > > +                               const struct ublk_elem_header *elem)
> > > +{
> > > +       struct ublk_io *io = &ubq->ios[elem->tag];
> > > +       const struct ublk_batch_io *uc = &data->header;
> > > +       u16 buf_idx = UBLK_INVALID_BUF_IDX;
> > > +       union ublk_io_buf buf = { 0 };
> > > +       struct request *req = NULL;
> > > +       bool auto_reg = false;
> > > +       bool compl = false;
> > > +       int ret;
> > > +
> > > +       if (ublk_dev_support_auto_buf_reg(data->ub)) {
> > > +               buf.auto_reg = ublk_batch_auto_buf_reg(uc, elem);
> > > +               auto_reg = true;
> > > +       } else if (ublk_dev_need_map_io(data->ub))
> > > +               buf.addr = ublk_batch_buf_addr(uc, elem);
> > > +
> > > +       ublk_io_lock(io);
> > > +       ret = ublk_batch_commit_io_check(ubq, io, &buf);
> > > +       if (!ret) {
> > > +               io->res = elem->result;
> > > +               io->buf = buf;
> > > +               req = ublk_fill_io_cmd(io, data->cmd);
> > > +
> > > +               if (auto_reg)
> > > +                       __ublk_handle_auto_buf_reg(io, data->cmd, &buf_idx);
> > > +               compl = ublk_need_complete_req(data->ub, io);
> > > +       }
> > > +       ublk_io_unlock(io);
> > > +
> > > +       if (unlikely(ret)) {
> > > +               pr_warn("%s: dev %u queue %u io %u: commit failure %d\n",
> > > +                       __func__, data->ub->dev_info.dev_id, ubq->q_id,
> > > +                       elem->tag, ret);
> >
> > This warning can be triggered by userspace. It should probably be
> > rate-limited or changed to pr_devel().
>
> Looks fine.
>
>
>
> Thanks,
> Ming
>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH V4 13/27] ublk: add batch I/O dispatch infrastructure
  2025-12-01  2:32     ` Ming Lei
@ 2025-12-01 17:37       ` Caleb Sander Mateos
  0 siblings, 0 replies; 66+ messages in thread
From: Caleb Sander Mateos @ 2025-12-01 17:37 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Uday Shankar, Stefani Seibold,
	Andrew Morton, linux-kernel

On Sun, Nov 30, 2025 at 6:32 PM Ming Lei <ming.lei@redhat.com> wrote:
>
> On Sun, Nov 30, 2025 at 11:24:12AM -0800, Caleb Sander Mateos wrote:
> > On Thu, Nov 20, 2025 at 6:00 PM Ming Lei <ming.lei@redhat.com> wrote:
> > >
> > > Add infrastructure for delivering I/O commands to ublk server in batches,
> > > preparing for the upcoming UBLK_U_IO_FETCH_IO_CMDS feature.
> > >
> > > Key components:
> > >
> > > - struct ublk_batch_fcmd: Represents a batch fetch uring_cmd that will
> > >   receive multiple I/O tags in a single operation, using io_uring's
> > >   multishot command for efficient ublk IO delivery.
> > >
> > > - ublk_batch_dispatch(): Batch version of ublk_dispatch_req() that:
> > >   * Pulls multiple request tags from the events FIFO (lock-free reader)
> > >   * Prepares each I/O for delivery (including auto buffer registration)
> > >   * Delivers tags to userspace via single uring_cmd notification
> > >   * Handles partial failures by restoring undelivered tags to FIFO
> > >
> > > The batch approach significantly reduces notification overhead by aggregating
> > > multiple I/O completions into a single uring_cmd, while maintaining the same
> > > I/O processing semantics as individual operations.
> > >
> > > Error handling ensures system consistency: if buffer selection or CQE
> > > posting fails, undelivered tags are restored to the FIFO for retry,
> > > and the IO state has to be restored as well.
> > >
> > > This runs in task work context, scheduled via io_uring_cmd_complete_in_task()
> > > or called directly from ->uring_cmd(), enabling efficient batch processing
> > > without blocking the I/O submission path.
> > >
> > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > ---
> > >  drivers/block/ublk_drv.c | 189 +++++++++++++++++++++++++++++++++++++++
> > >  1 file changed, 189 insertions(+)
> > >
> > > diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> > > index 6ff284243630..cc9c92d97349 100644
> > > --- a/drivers/block/ublk_drv.c
> > > +++ b/drivers/block/ublk_drv.c
> > > @@ -91,6 +91,12 @@
> > >          UBLK_BATCH_F_HAS_BUF_ADDR | \
> > >          UBLK_BATCH_F_AUTO_BUF_REG_FALLBACK)
> > >
> > > +/* ublk batch fetch uring_cmd */
> > > +struct ublk_batch_fcmd {
> >
> > I would prefer "fetch_cmd" instead of "fcmd" for clarity
> >
> > > +       struct io_uring_cmd *cmd;
> > > +       unsigned short buf_group;
> > > +};
> > > +
> > >  struct ublk_uring_cmd_pdu {
> > >         /*
> > >          * Store requests in same batch temporarily for queuing them to
> > > @@ -168,6 +174,9 @@ struct ublk_batch_io_data {
> > >   */
> > >  #define UBLK_REFCOUNT_INIT (REFCOUNT_MAX / 2)
> > >
> > > +/* used for UBLK_F_BATCH_IO only */
> > > +#define UBLK_BATCH_IO_UNUSED_TAG       ((unsigned short)-1)
> > > +
> > >  union ublk_io_buf {
> > >         __u64   addr;
> > >         struct ublk_auto_buf_reg auto_reg;
> > > @@ -616,6 +625,32 @@ static wait_queue_head_t ublk_idr_wq;      /* wait until one idr is freed */
> > >  static DEFINE_MUTEX(ublk_ctl_mutex);
> > >
> > >
> > > +static void ublk_batch_deinit_fetch_buf(const struct ublk_batch_io_data *data,
> > > +                                       struct ublk_batch_fcmd *fcmd,
> > > +                                       int res)
> > > +{
> > > +       io_uring_cmd_done(fcmd->cmd, res, data->issue_flags);
> > > +       fcmd->cmd = NULL;
> > > +}
> > > +
> > > +static int ublk_batch_fetch_post_cqe(struct ublk_batch_fcmd *fcmd,
> > > +                                    struct io_br_sel *sel,
> > > +                                    unsigned int issue_flags)
> > > +{
> > > +       if (io_uring_mshot_cmd_post_cqe(fcmd->cmd, sel, issue_flags))
> > > +               return -ENOBUFS;
> > > +       return 0;
> > > +}
> > > +
> > > +static ssize_t ublk_batch_copy_io_tags(struct ublk_batch_fcmd *fcmd,
> > > +                                      void __user *buf, const u16 *tag_buf,
> > > +                                      unsigned int len)
> > > +{
> > > +       if (copy_to_user(buf, tag_buf, len))
> > > +               return -EFAULT;
> > > +       return len;
> > > +}
> > > +
> > >  #define UBLK_MAX_UBLKS UBLK_MINORS
> > >
> > >  /*
> > > @@ -1378,6 +1413,160 @@ static void ublk_dispatch_req(struct ublk_queue *ubq,
> > >         }
> > >  }
> > >
> > > +static bool __ublk_batch_prep_dispatch(struct ublk_queue *ubq,
> > > +                                      const struct ublk_batch_io_data *data,
> > > +                                      unsigned short tag)
> > > +{
> > > +       struct ublk_device *ub = data->ub;
> > > +       struct ublk_io *io = &ubq->ios[tag];
> > > +       struct request *req = blk_mq_tag_to_rq(ub->tag_set.tags[ubq->q_id], tag);
> > > +       enum auto_buf_reg_res res = AUTO_BUF_REG_FALLBACK;
> > > +       struct io_uring_cmd *cmd = data->cmd;
> > > +
> > > +       if (!ublk_start_io(ubq, req, io))
> >
> > This doesn't look correct for UBLK_F_NEED_GET_DATA. If that's not
> > supported in batch mode, then it should probably be disallowed when
> > creating a batch-mode ublk device. The ublk_need_get_data() check in
> > ublk_batch_commit_io_check() could also be dropped.
>
> OK.
>
> BTW, UBLK_F_NEED_GET_DATA isn't necessary any more now that user copy is
> supported.
>
> It is only for handling WRITE I/O commands, and the ublk server can copy
> data to a new buffer via user copy instead.
>
> >
> > > +               return false;
> > > +
> > > +       if (ublk_support_auto_buf_reg(ubq) && ublk_rq_has_data(req))
> > > +               res = __ublk_do_auto_buf_reg(ubq, req, io, cmd,
> > > +                               data->issue_flags);
> >
> > __ublk_do_auto_buf_reg() reads io->buf.auto_reg. That seems racy
> > without holding the io spinlock.
>
> The io lock isn't needed.  Now the io state is guaranteed to be ACTIVE,
> so UBLK_U_IO_COMMIT_IO_CMDS can't commit anything for this io.

Makes sense.

Thanks,
Caleb

>
> >
> > > +
> > > +       if (res == AUTO_BUF_REG_FAIL)
> > > +               return false;
> >
> > Could be moved into the if (ublk_support_auto_buf_reg(ubq) &&
> > ublk_rq_has_data(req)) statement since it won't be true otherwise?
>
> OK.
>
> >
> > > +
> > > +       ublk_io_lock(io);
> > > +       ublk_prep_auto_buf_reg_io(ubq, req, io, cmd, res);
> > > +       ublk_io_unlock(io);
> > > +
> > > +       return true;
> > > +}
> > > +
> > > +static bool ublk_batch_prep_dispatch(struct ublk_queue *ubq,
> > > +                                    const struct ublk_batch_io_data *data,
> > > +                                    unsigned short *tag_buf,
> > > +                                    unsigned int len)
> > > +{
> > > +       bool has_unused = false;
> > > +       int i;
> >
> > unsigned?
> >
> > > +
> > > +       for (i = 0; i < len; i += 1) {
> >
> > i++?
> >
> > > +               unsigned short tag = tag_buf[i];
> > > +
> > > +               if (!__ublk_batch_prep_dispatch(ubq, data, tag)) {
> > > +                       tag_buf[i] = UBLK_BATCH_IO_UNUSED_TAG;
> > > +                       has_unused = true;
> > > +               }
> > > +       }
> > > +
> > > +       return has_unused;
> > > +}
> > > +
> > > +/*
> > > + * Filter out UBLK_BATCH_IO_UNUSED_TAG entries from tag_buf.
> > > + * Returns the new length after filtering.
> > > + */
> > > +static unsigned int ublk_filter_unused_tags(unsigned short *tag_buf,
> > > +                                           unsigned int len)
> > > +{
> > > +       unsigned int i, j;
> > > +
> > > +       for (i = 0, j = 0; i < len; i++) {
> > > +               if (tag_buf[i] != UBLK_BATCH_IO_UNUSED_TAG) {
> > > +                       if (i != j)
> > > +                               tag_buf[j] = tag_buf[i];
> > > +                       j++;
> > > +               }
> > > +       }
> > > +
> > > +       return j;
> > > +}
> > > +
> > > +#define MAX_NR_TAG 128
> > > +static int __ublk_batch_dispatch(struct ublk_queue *ubq,
> > > +                                const struct ublk_batch_io_data *data,
> > > +                                struct ublk_batch_fcmd *fcmd)
> > > +{
> > > +       unsigned short tag_buf[MAX_NR_TAG];
> > > +       struct io_br_sel sel;
> > > +       size_t len = 0;
> > > +       bool needs_filter;
> > > +       int ret;
> > > +
> > > +       sel = io_uring_cmd_buffer_select(fcmd->cmd, fcmd->buf_group, &len,
> > > +                                        data->issue_flags);
> > > +       if (sel.val < 0)
> > > +               return sel.val;
> > > +       if (!sel.addr)
> > > +               return -ENOBUFS;
> > > +
> > > +       /* single reader needn't lock and sizeof(kfifo element) is 2 bytes */
> > > +       len = min(len, sizeof(tag_buf)) / 2;
> >
> > sizeof(unsigned short) instead of 2?
>
> OK
>
> >
> > > +       len = kfifo_out(&ubq->evts_fifo, tag_buf, len);
> > > +
> > > +       needs_filter = ublk_batch_prep_dispatch(ubq, data, tag_buf, len);
> > > +       /* Filter out unused tags before posting to userspace */
> > > +       if (unlikely(needs_filter)) {
> > > +               int new_len = ublk_filter_unused_tags(tag_buf, len);
> > > +
> > > +               if (!new_len)
> > > +                       return len;
> >
> > Is the purpose of this return value just to make ublk_batch_dispatch()
> > retry __ublk_batch_dispatch()? Otherwise, it seems like a strange
> > value to return.
>
> If `new_len` becomes zero, it means all these requests have already been
> handled, either failed or requeued, so return `len` to tell the caller to
> move on. I can add a comment for this behavior.
>
> >
> > Also, shouldn't this path release the selected buffer to avoid leaking it?
>
> Good catch, but io_kbuf_recycle() isn't exported, so we may have to call
> io_uring_mshot_cmd_post_cqe() with sel->val zeroed instead.
>
> >
> > > +               len = new_len;
> > > +       }
> > > +
> > > +       sel.val = ublk_batch_copy_io_tags(fcmd, sel.addr, tag_buf, len * 2);
> >
> > sizeof(unsigned short)?
>
> OK
>
> >
> > > +       ret = ublk_batch_fetch_post_cqe(fcmd, &sel, data->issue_flags);
> > > +       if (unlikely(ret < 0)) {
> > > +               int i, res;
> > > +
> > > +               /*
> > > +                * Undo prep state for all IOs since userspace never received them.
> > > +                * This restores IOs to pre-prepared state so they can be cleanly
> > > +                * re-prepared when tags are pulled from FIFO again.
> > > +                */
> > > +               for (i = 0; i < len; i++) {
> > > +                       struct ublk_io *io = &ubq->ios[tag_buf[i]];
> > > +                       int index = -1;
> > > +
> > > +                       ublk_io_lock(io);
> > > +                       if (io->flags & UBLK_IO_FLAG_AUTO_BUF_REG)
> > > +                               index = io->buf.auto_reg.index;
> >
> > This is missing the io->buf_ctx_handle == io_uring_cmd_ctx_handle(cmd)
> > check from ublk_handle_auto_buf_reg().
>
> As you replied, it isn't needed because it is the same multishot command
> that registers the bvec buffer.
>
> >
> > > +                       io->flags &= ~(UBLK_IO_FLAG_OWNED_BY_SRV | UBLK_IO_FLAG_AUTO_BUF_REG);
> > > +                       io->flags |= UBLK_IO_FLAG_ACTIVE;
> > > +                       ublk_io_unlock(io);
> > > +
> > > +                       if (index != -1)
> > > +                               io_buffer_unregister_bvec(data->cmd, index,
> > > +                                               data->issue_flags);
> > > +               }
> > > +
> > > +               res = kfifo_in_spinlocked_noirqsave(&ubq->evts_fifo,
> > > +                       tag_buf, len, &ubq->evts_lock);
> > > +
> > > +               pr_warn("%s: copy tags or post CQE failure, move back "
> > > +                               "tags(%d %zu) ret %d\n", __func__, res, len,
> > > +                               ret);
> > > +       }
> > > +       return ret;
> > > +}
> > > +
> > > +static __maybe_unused int
> >
> > The return value looks completely unused. Just return void instead?
>
> Yes, it looks like it is removed in a following patch.
>
>
> Thanks,
> Ming
>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH V4 14/27] ublk: add UBLK_U_IO_FETCH_IO_CMDS for batch I/O processing
  2025-12-01  9:41     ` Ming Lei
@ 2025-12-01 17:51       ` Caleb Sander Mateos
  2025-12-02  1:27         ` Ming Lei
  0 siblings, 1 reply; 66+ messages in thread
From: Caleb Sander Mateos @ 2025-12-01 17:51 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Uday Shankar, Stefani Seibold,
	Andrew Morton, linux-kernel

On Mon, Dec 1, 2025 at 1:42 AM Ming Lei <ming.lei@redhat.com> wrote:
>
> On Sun, Nov 30, 2025 at 09:55:47PM -0800, Caleb Sander Mateos wrote:
> > On Thu, Nov 20, 2025 at 6:00 PM Ming Lei <ming.lei@redhat.com> wrote:
> > >
> > > Add UBLK_U_IO_FETCH_IO_CMDS command to enable efficient batch processing
> > > of I/O requests. This multishot uring_cmd allows the ublk server to fetch
> > > multiple I/O commands in a single operation, significantly reducing
> > > submission overhead compared to individual FETCH_REQ* commands.
> > >
> > > Key Design Features:
> > >
> > > 1. Multishot Operation: One UBLK_U_IO_FETCH_IO_CMDS can fetch many I/O
> > >    commands, with the batch size limited by the provided buffer length.
> > >
> > > 2. Dynamic Load Balancing: Multiple fetch commands can be submitted
> > >    simultaneously, but only one is active at any time. This enables
> > >    efficient load distribution across multiple server task contexts.
> > >
> > > 3. Implicit State Management: The implementation uses three key variables
> > >    to track state:
> > >    - evts_fifo: Queue of request tags awaiting processing
> > >    - fcmd_head: List of available fetch commands
> > >    - active_fcmd: Currently active fetch command (NULL = none active)
> > >
> > >    States are derived implicitly:
> > >    - IDLE: No fetch commands available
> > >    - READY: Fetch commands available, none active
> > >    - ACTIVE: One fetch command processing events
> > >
> > > 4. Lockless Reader Optimization: The active fetch command can read from
> > >    evts_fifo without locking (single reader guarantee), while writers
> > >    (ublk_queue_rq/ublk_queue_rqs) use evts_lock protection. The memory
> > >    barrier pairing plays a key role in the single lockless reader
> > >    optimization.
> > >
> > > Implementation Details:
> > >
> > > - ublk_queue_rq() and ublk_queue_rqs() save request tags to evts_fifo
> > > - __ublk_pick_active_fcmd() selects an available fetch command when
> > >   events arrive and no command is currently active
> >
> > What is __ublk_pick_active_fcmd()? I don't see a function with that name.
>
> It has been renamed to __ublk_acquire_fcmd(), and its counterpart is
> __ublk_release_fcmd().

Okay, update the commit message then?

>
> >
> > > - ublk_batch_dispatch() moves tags from evts_fifo to the fetch command's
> > >   buffer and posts completion via io_uring_mshot_cmd_post_cqe()
> > > - State transitions are coordinated via evts_lock to maintain consistency
> > >
> > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > ---
> > >  drivers/block/ublk_drv.c      | 412 +++++++++++++++++++++++++++++++---
> > >  include/uapi/linux/ublk_cmd.h |   7 +
> > >  2 files changed, 388 insertions(+), 31 deletions(-)
> > >
> > > diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> > > index cc9c92d97349..2e5e392c939e 100644
> > > --- a/drivers/block/ublk_drv.c
> > > +++ b/drivers/block/ublk_drv.c
> > > @@ -93,6 +93,7 @@
> > >
> > >  /* ublk batch fetch uring_cmd */
> > >  struct ublk_batch_fcmd {
> > > +       struct list_head node;
> > >         struct io_uring_cmd *cmd;
> > >         unsigned short buf_group;
> > >  };
> > > @@ -117,7 +118,10 @@ struct ublk_uring_cmd_pdu {
> > >          */
> > >         struct ublk_queue *ubq;
> > >
> > > -       u16 tag;
> > > +       union {
> > > +               u16 tag;
> > > +               struct ublk_batch_fcmd *fcmd; /* batch io only */
> > > +       };
> > >  };
> > >
> > >  struct ublk_batch_io_data {
> > > @@ -229,18 +233,36 @@ struct ublk_queue {
> > >         struct ublk_device *dev;
> > >
> > >         /*
> > > -        * Inflight ublk request tag is saved in this fifo
> > > +        * Batch I/O State Management:
> > > +        *
> > > +        * The batch I/O system uses implicit state management based on the
> > > +        * combination of three key variables below.
> > > +        *
> > > +        * - IDLE: list_empty(&fcmd_head) && !active_fcmd
> > > +        *   No fetch commands available, events queue in evts_fifo
> > > +        *
> > > +        * - READY: !list_empty(&fcmd_head) && !active_fcmd
> > > +        *   Fetch commands available but none processing events
> > >          *
> > > -        * There are multiple writer from ublk_queue_rq() or ublk_queue_rqs(),
> > > -        * so lock is required for storing request tag to fifo
> > > +        * - ACTIVE: active_fcmd
> > > +        *   One fetch command actively processing events from evts_fifo
> > >          *
> > > -        * Make sure just one reader for fetching request from task work
> > > -        * function to ublk server, so no need to grab the lock in reader
> > > -        * side.
> > > +        * Key Invariants:
> > > +        * - At most one active_fcmd at any time (single reader)
> > > +        * - active_fcmd is always from fcmd_head list when non-NULL
> > > +        * - evts_fifo can be read locklessly by the single active reader
> > > +        * - All state transitions require evts_lock protection
> > > +        * - Multiple writers to evts_fifo require lock protection
> > >          */
> > >         struct {
> > >                 DECLARE_KFIFO_PTR(evts_fifo, unsigned short);
> > >                 spinlock_t evts_lock;
> > > +
> > > +               /* List of fetch commands available to process events */
> > > +               struct list_head fcmd_head;
> > > +
> > > +               /* Currently active fetch command (NULL = none active) */
> > > +               struct ublk_batch_fcmd  *active_fcmd;
> > >         }____cacheline_aligned_in_smp;
> > >
> > >         struct ublk_io ios[] __counted_by(q_depth);
> > > @@ -292,12 +314,20 @@ static void ublk_abort_queue(struct ublk_device *ub, struct ublk_queue *ubq);
> > >  static inline struct request *__ublk_check_and_get_req(struct ublk_device *ub,
> > >                 u16 q_id, u16 tag, struct ublk_io *io, size_t offset);
> > >  static inline unsigned int ublk_req_build_flags(struct request *req);
> > > +static void ublk_batch_dispatch(struct ublk_queue *ubq,
> > > +                               struct ublk_batch_io_data *data,
> > > +                               struct ublk_batch_fcmd *fcmd);
> > >
> > >  static inline bool ublk_dev_support_batch_io(const struct ublk_device *ub)
> > >  {
> > >         return false;
> > >  }
> > >
> > > +static inline bool ublk_support_batch_io(const struct ublk_queue *ubq)
> > > +{
> > > +       return false;
> > > +}
> > > +
> > >  static inline void ublk_io_lock(struct ublk_io *io)
> > >  {
> > >         spin_lock(&io->lock);
> > > @@ -624,13 +654,45 @@ static wait_queue_head_t ublk_idr_wq;     /* wait until one idr is freed */
> > >
> > >  static DEFINE_MUTEX(ublk_ctl_mutex);
> > >
> > > +static struct ublk_batch_fcmd *
> > > +ublk_batch_alloc_fcmd(struct io_uring_cmd *cmd)
> > > +{
> > > +       struct ublk_batch_fcmd *fcmd = kzalloc(sizeof(*fcmd), GFP_NOIO);
> >
> > An allocation in the I/O path seems unfortunate. Is there not room to
> > store the struct ublk_batch_fcmd in the io_uring_cmd pdu?
>
> It is allocated once for one mshot request, which covers many IOs.
>
> It can't be held in uring_cmd pdu, but the allocation can be optimized in
> future. Not a big deal in enablement stage.

Okay, seems fine to optimize it in the future.

>
> > > +
> > > +       if (fcmd) {
> > > +               fcmd->cmd = cmd;
> > > +               fcmd->buf_group = READ_ONCE(cmd->sqe->buf_index);
> >
> > Is it necessary to store sample this here just to pass it back to the
> > io_uring layer? Wouldn't the io_uring layer already have access to it
> > in struct io_kiocb's buf_index field?
>
> ->buf_group is used by io_uring_cmd_buffer_select(), and this way also
> follows the ->buf_index uses in both io_uring/net.c and io_uring/rw.c.
>
> Also req->buf_index gets overwritten with the selected buffer id in
> io_ring_buffer_select(), so we can't reuse req->buf_index here.

But io_uring/net.c and io_uring/rw.c both retrieve the buf_group value
from req->buf_index instead of the SQE, for example:
if (req->flags & REQ_F_BUFFER_SELECT)
        sr->buf_group = req->buf_index;

Seems like it would make sense to do the same for
UBLK_U_IO_FETCH_IO_CMDS. That also saves one pointer dereference here.

>
> >
> > > +       }
> > > +       return fcmd;
> > > +}
> > > +
> > > +static void ublk_batch_free_fcmd(struct ublk_batch_fcmd *fcmd)
> > > +{
> > > +       kfree(fcmd);
> > > +}
> > > +
> > > +static void __ublk_release_fcmd(struct ublk_queue *ubq)
> > > +{
> > > +       WRITE_ONCE(ubq->active_fcmd, NULL);
> > > +}
> > >
> > > -static void ublk_batch_deinit_fetch_buf(const struct ublk_batch_io_data *data,
> > > +/*
> > > + * Nothing can move on, so clear ->active_fcmd, and the caller should stop
> > > + * dispatching
> > > + */
> > > +static void ublk_batch_deinit_fetch_buf(struct ublk_queue *ubq,
> > > +                                       const struct ublk_batch_io_data *data,
> > >                                         struct ublk_batch_fcmd *fcmd,
> > >                                         int res)
> > >  {
> > > +       spin_lock(&ubq->evts_lock);
> > > +       list_del(&fcmd->node);
> > > +       WARN_ON_ONCE(fcmd != ubq->active_fcmd);
> > > +       __ublk_release_fcmd(ubq);
> > > +       spin_unlock(&ubq->evts_lock);
> > > +
> > >         io_uring_cmd_done(fcmd->cmd, res, data->issue_flags);
> > > -       fcmd->cmd = NULL;
> > > +       ublk_batch_free_fcmd(fcmd);
> > >  }
> > >
> > >  static int ublk_batch_fetch_post_cqe(struct ublk_batch_fcmd *fcmd,
> > > @@ -1491,6 +1553,8 @@ static int __ublk_batch_dispatch(struct ublk_queue *ubq,
> > >         bool needs_filter;
> > >         int ret;
> > >
> > > +       WARN_ON_ONCE(data->cmd != fcmd->cmd);
> > > +
> > >         sel = io_uring_cmd_buffer_select(fcmd->cmd, fcmd->buf_group, &len,
> > >                                          data->issue_flags);
> > >         if (sel.val < 0)
> > > @@ -1548,23 +1612,94 @@ static int __ublk_batch_dispatch(struct ublk_queue *ubq,
> > >         return ret;
> > >  }
> > >
> > > -static __maybe_unused int
> > > -ublk_batch_dispatch(struct ublk_queue *ubq,
> > > -                   const struct ublk_batch_io_data *data,
> > > -                   struct ublk_batch_fcmd *fcmd)
> > > +static struct ublk_batch_fcmd *__ublk_acquire_fcmd(
> > > +               struct ublk_queue *ubq)
> > > +{
> > > +       struct ublk_batch_fcmd *fcmd;
> > > +
> > > +       lockdep_assert_held(&ubq->evts_lock);
> > > +
> > > +       /*
> > > +        * Order updating ubq->evts_fifo against checking ubq->active_fcmd.
> > > +        *
> > > +        * The pair is the smp_mb() in ublk_batch_dispatch().
> > > +        *
> > > +        * If ubq->active_fcmd is observed as non-NULL, the newly added tags
> > > +        * will be visible in ublk_batch_dispatch() thanks to the barrier pairing.
> > > +        */
> > > +       smp_mb();
> > > +       if (READ_ONCE(ubq->active_fcmd)) {
> > > +               fcmd = NULL;
> > > +       } else {
> > > +               fcmd = list_first_entry_or_null(&ubq->fcmd_head,
> > > +                               struct ublk_batch_fcmd, node);
> > > +               WRITE_ONCE(ubq->active_fcmd, fcmd);
> > > +       }
> > > +       return fcmd;
> > > +}
> > > +
> > > +static void ublk_batch_tw_cb(struct io_uring_cmd *cmd,
> > > +                          unsigned int issue_flags)
> > > +{
> > > +       struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
> > > +       struct ublk_batch_fcmd *fcmd = pdu->fcmd;
> > > +       struct ublk_batch_io_data data = {
> > > +               .ub = pdu->ubq->dev,
> > > +               .cmd = fcmd->cmd,
> > > +               .issue_flags = issue_flags,
> > > +       };
> > > +
> > > +       WARN_ON_ONCE(pdu->ubq->active_fcmd != fcmd);
> > > +
> > > +       ublk_batch_dispatch(pdu->ubq, &data, fcmd);
> > > +}
> > > +
> > > +static void ublk_batch_dispatch(struct ublk_queue *ubq,
> > > +                               struct ublk_batch_io_data *data,
> > > +                               struct ublk_batch_fcmd *fcmd)
> > >  {
> > > +       struct ublk_batch_fcmd *new_fcmd;
> >
> > Is the new_fcmd variable necessary? Can fcmd be reused instead?
> >
> > > +       void *handle;
> > > +       bool empty;
> > >         int ret = 0;
> > >
> > > +again:
> > >         while (!ublk_io_evts_empty(ubq)) {
> > >                 ret = __ublk_batch_dispatch(ubq, data, fcmd);
> > >                 if (ret <= 0)
> > >                         break;
> > >         }
> > >
> > > -       if (ret < 0)
> > > -               ublk_batch_deinit_fetch_buf(data, fcmd, ret);
> > > +       if (ret < 0) {
> > > +               ublk_batch_deinit_fetch_buf(ubq, data, fcmd, ret);
> > > +               return;
> > > +       }
> > >
> > > -       return ret;
> > > +       handle = io_uring_cmd_ctx_handle(fcmd->cmd);
> > > +       __ublk_release_fcmd(ubq);
> > > +       /*
> > > +        * Order clearing ubq->active_fcmd in __ublk_release_fcmd() against
> > > +        * checking ubq->evts_fifo.
> > > +        *
> > > +        * The pair is the smp_mb() in __ublk_acquire_fcmd().
> > > +        */
> > > +       smp_mb();
> > > +       empty = ublk_io_evts_empty(ubq);
> > > +       if (likely(empty))
> >
> > nit: empty variable seems unnecessary
> >
> > > +               return;
> > > +
> > > +       spin_lock(&ubq->evts_lock);
> > > +       new_fcmd = __ublk_acquire_fcmd(ubq);
> > > +       spin_unlock(&ubq->evts_lock);
> > > +
> > > +       if (!new_fcmd)
> > > +               return;
> > > +       if (handle == io_uring_cmd_ctx_handle(new_fcmd->cmd)) {
> >
> > This check seems to be meant to decide whether the new and old
> > UBLK_U_IO_FETCH_IO_CMDS commands can execute in the same task work?
>
> Actually not.
>
> > But belonging to the same io_uring context doesn't necessarily mean
> > that the same task issued them. It seems like it would be safer to
> > always dispatch new_fcmd->cmd to task work.
>
> What matters is just that ctx->uring_lock & issue_flags match from the ublk
> viewpoint, so it is safe to do so.

Okay, that makes sense.

>
> However, given this is only hit in the slow path, starting a new dispatch
> is easier.

Yeah, I'd agree it makes sense to keep the unexpected path code
simpler. There may also be fairness concerns from looping indefinitely
here if the evts_fifo continues to be nonempty, so dispatching to task
work seems safer.
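
For reference, the alternative described above would roughly reduce the tail
of ublk_batch_dispatch() to the following (a sketch based only on the helpers
quoted in this patch, not the actual next revision):

        spin_lock(&ubq->evts_lock);
        new_fcmd = __ublk_acquire_fcmd(ubq);
        spin_unlock(&ubq->evts_lock);

        if (!new_fcmd)
                return;
        /*
         * Always bounce the follow-up dispatch to task work; this drops the
         * io_uring_cmd_ctx_handle() comparison and the 'again' loop, and
         * bounds how long this context keeps dispatching.
         */
        io_uring_cmd_complete_in_task(new_fcmd->cmd, ublk_batch_tw_cb);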

>
> >
> > > +               data->cmd = new_fcmd->cmd;
> > > +               fcmd = new_fcmd;
> > > +               goto again;
> > > +       }
> > > +       io_uring_cmd_complete_in_task(new_fcmd->cmd, ublk_batch_tw_cb);
> > >  }
> > >
> > >  static void ublk_cmd_tw_cb(struct io_uring_cmd *cmd,
> > > @@ -1576,13 +1711,27 @@ static void ublk_cmd_tw_cb(struct io_uring_cmd *cmd,
> > >         ublk_dispatch_req(ubq, pdu->req, issue_flags);
> > >  }
> > >
> > > -static void ublk_queue_cmd(struct ublk_queue *ubq, struct request *rq)
> > > +static void ublk_queue_cmd(struct ublk_queue *ubq, struct request *rq, bool last)
> > >  {
> > > -       struct io_uring_cmd *cmd = ubq->ios[rq->tag].cmd;
> > > -       struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
> > > +       if (ublk_support_batch_io(ubq)) {
> > > +               unsigned short tag = rq->tag;
> > > +               struct ublk_batch_fcmd *fcmd = NULL;
> > >
> > > -       pdu->req = rq;
> > > -       io_uring_cmd_complete_in_task(cmd, ublk_cmd_tw_cb);
> > > +               spin_lock(&ubq->evts_lock);
> > > +               kfifo_put(&ubq->evts_fifo, tag);
> > > +               if (last)
> > > +                       fcmd = __ublk_acquire_fcmd(ubq);
> > > +               spin_unlock(&ubq->evts_lock);
> > > +
> > > +               if (fcmd)
> > > +                       io_uring_cmd_complete_in_task(fcmd->cmd, ublk_batch_tw_cb);
> > > +       } else {
> > > +               struct io_uring_cmd *cmd = ubq->ios[rq->tag].cmd;
> > > +               struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
> > > +
> > > +               pdu->req = rq;
> > > +               io_uring_cmd_complete_in_task(cmd, ublk_cmd_tw_cb);
> > > +       }
> > >  }
> > >
> > >  static void ublk_cmd_list_tw_cb(struct io_uring_cmd *cmd,
> > > @@ -1600,14 +1749,44 @@ static void ublk_cmd_list_tw_cb(struct io_uring_cmd *cmd,
> > >         } while (rq);
> > >  }
> > >
> > > -static void ublk_queue_cmd_list(struct ublk_io *io, struct rq_list *l)
> > > +static void ublk_batch_queue_cmd_list(struct ublk_queue *ubq, struct rq_list *l)
> > >  {
> > > -       struct io_uring_cmd *cmd = io->cmd;
> > > -       struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
> > > +       unsigned short tags[MAX_NR_TAG];
> > > +       struct ublk_batch_fcmd *fcmd;
> > > +       struct request *rq;
> > > +       unsigned cnt = 0;
> > > +
> > > +       spin_lock(&ubq->evts_lock);
> > > +       rq_list_for_each(l, rq) {
> > > +               tags[cnt++] = (unsigned short)rq->tag;
> > > +               if (cnt >= MAX_NR_TAG) {
> > > +                       kfifo_in(&ubq->evts_fifo, tags, cnt);
> > > +                       cnt = 0;
> > > +               }
> > > +       }
> > > +       if (cnt)
> > > +               kfifo_in(&ubq->evts_fifo, tags, cnt);
> > > +       fcmd = __ublk_acquire_fcmd(ubq);
> > > +       spin_unlock(&ubq->evts_lock);
> > >
> > > -       pdu->req_list = rq_list_peek(l);
> > >         rq_list_init(l);
> > > -       io_uring_cmd_complete_in_task(cmd, ublk_cmd_list_tw_cb);
> > > +       if (fcmd)
> > > +               io_uring_cmd_complete_in_task(fcmd->cmd, ublk_batch_tw_cb);
> > > +}
> > > +
> > > +static void ublk_queue_cmd_list(struct ublk_queue *ubq, struct ublk_io *io,
> > > +                               struct rq_list *l, bool batch)
> > > +{
> > > +       if (batch) {
> > > +               ublk_batch_queue_cmd_list(ubq, l);
> > > +       } else {
> > > +               struct io_uring_cmd *cmd = io->cmd;
> > > +               struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
> > > +
> > > +               pdu->req_list = rq_list_peek(l);
> > > +               rq_list_init(l);
> > > +               io_uring_cmd_complete_in_task(cmd, ublk_cmd_list_tw_cb);
> > > +       }
> > >  }
> > >
> > >  static enum blk_eh_timer_return ublk_timeout(struct request *rq)
> > > @@ -1686,7 +1865,7 @@ static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx,
> > >                 return BLK_STS_OK;
> > >         }
> > >
> > > -       ublk_queue_cmd(ubq, rq);
> > > +       ublk_queue_cmd(ubq, rq, bd->last);
> > >         return BLK_STS_OK;
> > >  }
> > >
> > > @@ -1698,11 +1877,25 @@ static inline bool ublk_belong_to_same_batch(const struct ublk_io *io,
> > >                 (io->task == io2->task);
> > >  }
> > >
> > > -static void ublk_queue_rqs(struct rq_list *rqlist)
> > > +static void ublk_commit_rqs(struct blk_mq_hw_ctx *hctx)
> > > +{
> > > +       struct ublk_queue *ubq = hctx->driver_data;
> > > +       struct ublk_batch_fcmd *fcmd;
> > > +
> > > +       spin_lock(&ubq->evts_lock);
> > > +       fcmd = __ublk_acquire_fcmd(ubq);
> > > +       spin_unlock(&ubq->evts_lock);
> > > +
> > > +       if (fcmd)
> > > +               io_uring_cmd_complete_in_task(fcmd->cmd, ublk_batch_tw_cb);
> > > +}
> > > +
> > > +static void __ublk_queue_rqs(struct rq_list *rqlist, bool batch)
> > >  {
> > >         struct rq_list requeue_list = { };
> > >         struct rq_list submit_list = { };
> > >         struct ublk_io *io = NULL;
> > > +       struct ublk_queue *ubq = NULL;
> > >         struct request *req;
> > >
> > >         while ((req = rq_list_pop(rqlist))) {
> > > @@ -1716,16 +1909,27 @@ static void ublk_queue_rqs(struct rq_list *rqlist)
> > >
> > >                 if (io && !ublk_belong_to_same_batch(io, this_io) &&
> > >                                 !rq_list_empty(&submit_list))
> > > -                       ublk_queue_cmd_list(io, &submit_list);
> > > +                       ublk_queue_cmd_list(ubq, io, &submit_list, batch);
> >
> > This seems to assume that all the requests belong to the same
> > ublk_queue, which isn't required
>
> Here, it is required for BATCH_IO, which needs new __ublk_queue_rqs()
> implementation now.
>
> Nice catch!
>
> >
> > >                 io = this_io;
> > > +               ubq = this_q;
> > >                 rq_list_add_tail(&submit_list, req);
> > >         }
> > >
> > >         if (!rq_list_empty(&submit_list))
> > > -               ublk_queue_cmd_list(io, &submit_list);
> > > +               ublk_queue_cmd_list(ubq, io, &submit_list, batch);
> >
> > Same here
> >
> > >         *rqlist = requeue_list;
> > >  }
> > >
> > > +static void ublk_queue_rqs(struct rq_list *rqlist)
> > > +{
> > > +       __ublk_queue_rqs(rqlist, false);
> > > +}
> > > +
> > > +static void ublk_batch_queue_rqs(struct rq_list *rqlist)
> > > +{
> > > +       __ublk_queue_rqs(rqlist, true);
> > > +}
> > > +
> > >  static int ublk_init_hctx(struct blk_mq_hw_ctx *hctx, void *driver_data,
> > >                 unsigned int hctx_idx)
> > >  {
> > > @@ -1743,6 +1947,14 @@ static const struct blk_mq_ops ublk_mq_ops = {
> > >         .timeout        = ublk_timeout,
> > >  };
> > >
> > > +static const struct blk_mq_ops ublk_batch_mq_ops = {
> > > +       .commit_rqs     = ublk_commit_rqs,
> > > +       .queue_rq       = ublk_queue_rq,
> > > +       .queue_rqs      = ublk_batch_queue_rqs,
> > > +       .init_hctx      = ublk_init_hctx,
> > > +       .timeout        = ublk_timeout,
> > > +};
> > > +
> > >  static void ublk_queue_reinit(struct ublk_device *ub, struct ublk_queue *ubq)
> > >  {
> > >         int i;
> > > @@ -2120,6 +2332,56 @@ static void ublk_cancel_cmd(struct ublk_queue *ubq, unsigned tag,
> > >                 io_uring_cmd_done(io->cmd, UBLK_IO_RES_ABORT, issue_flags);
> > >  }
> > >
> > > +static void ublk_batch_cancel_cmd(struct ublk_queue *ubq,
> > > +                                 struct ublk_batch_fcmd *fcmd,
> > > +                                 unsigned int issue_flags)
> > > +{
> > > +       bool done;
> > > +
> > > +       spin_lock(&ubq->evts_lock);
> > > +       done = (ubq->active_fcmd != fcmd);
> >
> > Needs to use READ_ONCE() since __ublk_release_fcmd() can be called
> > without holding evts_lock?
>
> OK.
>
> >
> > > +       if (done)
> > > +               list_del(&fcmd->node);
> > > +       spin_unlock(&ubq->evts_lock);
> > > +
> > > +       if (done) {
> > > +               io_uring_cmd_done(fcmd->cmd, UBLK_IO_RES_ABORT, issue_flags);
> > > +               ublk_batch_free_fcmd(fcmd);
> > > +       }
> > > +}
> > > +
> > > +static void ublk_batch_cancel_queue(struct ublk_queue *ubq)
> > > +{
> > > +       LIST_HEAD(fcmd_list);
> > > +
> > > +       spin_lock(&ubq->evts_lock);
> > > +       ubq->force_abort = true;
> > > +       list_splice_init(&ubq->fcmd_head, &fcmd_list);
> > > +       if (ubq->active_fcmd)
> > > +               list_move(&ubq->active_fcmd->node, &ubq->fcmd_head);
> >
> > Similarly, needs READ_ONCE()?
>
> OK.
>
> But this one may not be necessary, since everything is quiesced by now,
> and the lockless code path won't be hit any more.

Good point. I think a comment to that effect would be helpful.
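
Putting the two points together, ublk_batch_cancel_queue() might end up
looking roughly like this (a sketch only; the exact comment wording is up to
the author):

        static void ublk_batch_cancel_queue(struct ublk_queue *ubq)
        {
                struct ublk_batch_fcmd *active;
                LIST_HEAD(fcmd_list);

                spin_lock(&ubq->evts_lock);
                ubq->force_abort = true;
                list_splice_init(&ubq->fcmd_head, &fcmd_list);
                /*
                 * Everything is quiesced at this point, so the lockless
                 * dispatch path can no longer update ->active_fcmd; the
                 * READ_ONCE() only keeps the access consistent with the
                 * other readers.
                 */
                active = READ_ONCE(ubq->active_fcmd);
                if (active)
                        list_move(&active->node, &ubq->fcmd_head);
                spin_unlock(&ubq->evts_lock);

                while (!list_empty(&fcmd_list)) {
                        struct ublk_batch_fcmd *fcmd = list_first_entry(&fcmd_list,
                                        struct ublk_batch_fcmd, node);

                        ublk_batch_cancel_cmd(ubq, fcmd, IO_URING_F_UNLOCKED);
                }
        }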

Best,
Caleb

>
> >
> > > +       spin_unlock(&ubq->evts_lock);
> > > +
> > > +       while (!list_empty(&fcmd_list)) {
> > > +               struct ublk_batch_fcmd *fcmd = list_first_entry(&fcmd_list,
> > > +                               struct ublk_batch_fcmd, node);
> > > +
> > > +               ublk_batch_cancel_cmd(ubq, fcmd, IO_URING_F_UNLOCKED);
> > > +       }
> > > +}
> > > +
> > > +static void ublk_batch_cancel_fn(struct io_uring_cmd *cmd,
> > > +                                unsigned int issue_flags)
> > > +{
> > > +       struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
> > > +       struct ublk_batch_fcmd *fcmd = pdu->fcmd;
> > > +       struct ublk_queue *ubq = pdu->ubq;
> > > +
> > > +       if (!ubq->canceling)
> >
> > Is it not racy to access ubq->canceling without any lock held?
>
> OK, will switch to call ublk_start_cancel() unconditionally.
>
> >
> > > +               ublk_start_cancel(ubq->dev);
> > > +
> > > +       ublk_batch_cancel_cmd(ubq, fcmd, issue_flags);
> > > +}
> > > +
> > >  /*
> > >   * The ublk char device won't be closed when calling cancel fn, so both
> > >   * ublk device and queue are guaranteed to be live
> > > @@ -2171,6 +2433,11 @@ static void ublk_cancel_queue(struct ublk_queue *ubq)
> > >  {
> > >         int i;
> > >
> > > +       if (ublk_support_batch_io(ubq)) {
> > > +               ublk_batch_cancel_queue(ubq);
> > > +               return;
> > > +       }
> > > +
> > >         for (i = 0; i < ubq->q_depth; i++)
> > >                 ublk_cancel_cmd(ubq, i, IO_URING_F_UNLOCKED);
> > >  }
> > > @@ -3091,6 +3358,74 @@ static int ublk_check_batch_cmd(const struct ublk_batch_io_data *data)
> > >         return ublk_check_batch_cmd_flags(uc);
> > >  }
> > >
> > > +static int ublk_batch_attach(struct ublk_queue *ubq,
> > > +                            struct ublk_batch_io_data *data,
> > > +                            struct ublk_batch_fcmd *fcmd)
> > > +{
> > > +       struct ublk_batch_fcmd *new_fcmd = NULL;
> > > +       bool free = false;
> > > +
> > > +       spin_lock(&ubq->evts_lock);
> > > +       if (unlikely(ubq->force_abort || ubq->canceling)) {
> > > +               free = true;
> > > +       } else {
> > > +               list_add_tail(&fcmd->node, &ubq->fcmd_head);
> > > +               new_fcmd = __ublk_acquire_fcmd(ubq);
> > > +       }
> > > +       spin_unlock(&ubq->evts_lock);
> > > +
> > > +       /*
> > > +        * If the two fetch commands originate from the same io_ring_ctx,
> > > +        * run batch dispatch directly. Otherwise, schedule task work for
> > > +        * doing it.
> > > +        */
> > > +       if (new_fcmd && io_uring_cmd_ctx_handle(new_fcmd->cmd) ==
> > > +                       io_uring_cmd_ctx_handle(fcmd->cmd)) {
> > > +               data->cmd = new_fcmd->cmd;
> > > +               ublk_batch_dispatch(ubq, data, new_fcmd);
> > > +       } else if (new_fcmd) {
> > > +               io_uring_cmd_complete_in_task(new_fcmd->cmd,
> > > +                               ublk_batch_tw_cb);
> > > +       }
> >
> > Return early if (!new_fcmd) to reduce indentation?
> >
> > > +
> > > +       if (free) {
> > > +               ublk_batch_free_fcmd(fcmd);
> > > +               return -ENODEV;
> > > +       }
> >
> > Move the if (free) check directly after spin_unlock(&ubq->evts_lock)?
>
> Yeah, this is better.
>
> >
> > > +       return -EIOCBQUEUED;
> >
> > > +}
> > > +
> > > +static int ublk_handle_batch_fetch_cmd(struct ublk_batch_io_data *data)
> > > +{
> > > +       struct ublk_queue *ubq = ublk_get_queue(data->ub, data->header.q_id);
> > > +       struct ublk_batch_fcmd *fcmd = ublk_batch_alloc_fcmd(data->cmd);
> > > +       struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(data->cmd);
> > > +
> > > +       if (!fcmd)
> > > +               return -ENOMEM;
> > > +
> > > +       pdu->ubq = ubq;
> > > +       pdu->fcmd = fcmd;
> > > +       io_uring_cmd_mark_cancelable(data->cmd, data->issue_flags);
> > > +
> > > +       return ublk_batch_attach(ubq, data, fcmd);
> > > +}
> > > +
> > > +static int ublk_validate_batch_fetch_cmd(struct ublk_batch_io_data *data,
> > > +                                        const struct ublk_batch_io *uc)
> > > +{
> > > +       if (!(data->cmd->flags & IORING_URING_CMD_MULTISHOT))
> > > +               return -EINVAL;
> > > +
> > > +       if (uc->elem_bytes != sizeof(__u16))
> > > +               return -EINVAL;
> > > +
> > > +       if (uc->flags != 0)
> > > +               return -E2BIG;
> > > +
> > > +       return 0;
> > > +}
> > > +
> > >  static int ublk_ch_batch_io_uring_cmd(struct io_uring_cmd *cmd,
> > >                                        unsigned int issue_flags)
> > >  {
> > > @@ -3113,6 +3448,11 @@ static int ublk_ch_batch_io_uring_cmd(struct io_uring_cmd *cmd,
> > >         if (data.header.q_id >= ub->dev_info.nr_hw_queues)
> > >                 goto out;
> > >
> > > +       if (unlikely(issue_flags & IO_URING_F_CANCEL)) {
> > > +               ublk_batch_cancel_fn(cmd, issue_flags);
> > > +               return 0;
> > > +       }
> >
> > Move this to the top of the function before the other logic that's not
> > necessary in the cancel case?
>
> Yeah, looks better.
>
>
> Thanks,
> Ming
>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH V4 15/27] ublk: abort requests filled in event kfifo
  2025-11-21  1:58 ` [PATCH V4 15/27] ublk: abort requests filled in event kfifo Ming Lei
@ 2025-12-01 18:52   ` Caleb Sander Mateos
  2025-12-02  1:29     ` Ming Lei
  2025-12-01 19:00   ` Caleb Sander Mateos
  1 sibling, 1 reply; 66+ messages in thread
From: Caleb Sander Mateos @ 2025-12-01 18:52 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Uday Shankar, Stefani Seibold,
	Andrew Morton, linux-kernel

On Thu, Nov 20, 2025 at 6:00 PM Ming Lei <ming.lei@redhat.com> wrote:
>
> In case of BATCH_IO, requests whose tags are filled into the event kfifo
> don't get a chance to be dispatched any more when the ublk char device is
> released, so we have to abort them too.
>
> Add ublk_abort_batch_queue() for aborting this kind of requests.
>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  drivers/block/ublk_drv.c | 26 +++++++++++++++++++++++++-
>  1 file changed, 25 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> index 2e5e392c939e..849199771f86 100644
> --- a/drivers/block/ublk_drv.c
> +++ b/drivers/block/ublk_drv.c
> @@ -2241,7 +2241,8 @@ static int ublk_ch_mmap(struct file *filp, struct vm_area_struct *vma)
>  static void __ublk_fail_req(struct ublk_device *ub, struct ublk_io *io,
>                 struct request *req)
>  {
> -       WARN_ON_ONCE(io->flags & UBLK_IO_FLAG_ACTIVE);
> +       WARN_ON_ONCE(!ublk_dev_support_batch_io(ub) &&
> +                       io->flags & UBLK_IO_FLAG_ACTIVE);
>
>         if (ublk_nosrv_should_reissue_outstanding(ub))
>                 blk_mq_requeue_request(req, false);
> @@ -2251,6 +2252,26 @@ static void __ublk_fail_req(struct ublk_device *ub, struct ublk_io *io,
>         }
>  }
>
> +/*
> + * Request tag may just be filled to event kfifo, not get chance to
> + * dispatch, abort these requests too
> + */
> +static void ublk_abort_batch_queue(struct ublk_device *ub,
> +                                  struct ublk_queue *ubq)
> +{
> +       while (true) {
> +               struct request *req;
> +               short tag;

unsigned short?

> +
> +               if (!kfifo_out(&ubq->evts_fifo, &tag, 1))
> +                       break;
> +
> +               req = blk_mq_tag_to_rq(ub->tag_set.tags[ubq->q_id], tag);
> +               if (req && blk_mq_request_started(req))

If the tag is in the evts_fifo, how would it be possible for the
request not to have been started yet?

Best,
Caleb

> +                       __ublk_fail_req(ub, &ubq->ios[tag], req);
> +       }
> +}
> +
>  /*
>   * Called from ublk char device release handler, when any uring_cmd is
>   * done, meantime request queue is "quiesced" since all inflight requests
> @@ -2269,6 +2290,9 @@ static void ublk_abort_queue(struct ublk_device *ub, struct ublk_queue *ubq)
>                 if (io->flags & UBLK_IO_FLAG_OWNED_BY_SRV)
>                         __ublk_fail_req(ub, io, io->req);
>         }
> +
> +       if (ublk_support_batch_io(ubq))
> +               ublk_abort_batch_queue(ub, ubq);
>  }
>
>  static void ublk_start_cancel(struct ublk_device *ub)
> --
> 2.47.0
>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH V4 15/27] ublk: abort requests filled in event kfifo
  2025-11-21  1:58 ` [PATCH V4 15/27] ublk: abort requests filled in event kfifo Ming Lei
  2025-12-01 18:52   ` Caleb Sander Mateos
@ 2025-12-01 19:00   ` Caleb Sander Mateos
  1 sibling, 0 replies; 66+ messages in thread
From: Caleb Sander Mateos @ 2025-12-01 19:00 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Uday Shankar, Stefani Seibold,
	Andrew Morton, linux-kernel

On Thu, Nov 20, 2025 at 6:00 PM Ming Lei <ming.lei@redhat.com> wrote:
>
> In case of BATCH_IO, requests whose tags are filled into the event kfifo
> don't get a chance to be dispatched any more when the ublk char device is
> released, so we have to abort them too.
>
> Add ublk_abort_batch_queue() for aborting this kind of requests.
>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  drivers/block/ublk_drv.c | 26 +++++++++++++++++++++++++-
>  1 file changed, 25 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> index 2e5e392c939e..849199771f86 100644
> --- a/drivers/block/ublk_drv.c
> +++ b/drivers/block/ublk_drv.c
> @@ -2241,7 +2241,8 @@ static int ublk_ch_mmap(struct file *filp, struct vm_area_struct *vma)
>  static void __ublk_fail_req(struct ublk_device *ub, struct ublk_io *io,
>                 struct request *req)
>  {
> -       WARN_ON_ONCE(io->flags & UBLK_IO_FLAG_ACTIVE);
> +       WARN_ON_ONCE(!ublk_dev_support_batch_io(ub) &&
> +                       io->flags & UBLK_IO_FLAG_ACTIVE);
>
>         if (ublk_nosrv_should_reissue_outstanding(ub))
>                 blk_mq_requeue_request(req, false);
> @@ -2251,6 +2252,26 @@ static void __ublk_fail_req(struct ublk_device *ub, struct ublk_io *io,
>         }
>  }
>
> +/*
> + * Request tag may just be filled to event kfifo, not get chance to
> + * dispatch, abort these requests too
> + */
> +static void ublk_abort_batch_queue(struct ublk_device *ub,
> +                                  struct ublk_queue *ubq)
> +{
> +       while (true) {
> +               struct request *req;
> +               short tag;
> +
> +               if (!kfifo_out(&ubq->evts_fifo, &tag, 1))
> +                       break;

This loop could also be written a bit more simply as while
(kfifo_out(&ubq->evts_fifo, &tag, 1)).
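
Folding in the "unsigned short" comment from the other reply as well, the
whole helper could read like this (an illustrative sketch of the two
suggestions, not the actual next revision):

        static void ublk_abort_batch_queue(struct ublk_device *ub,
                                           struct ublk_queue *ubq)
        {
                unsigned short tag;

                /* drain tags that were queued but never dispatched */
                while (kfifo_out(&ubq->evts_fifo, &tag, 1)) {
                        struct request *req;

                        req = blk_mq_tag_to_rq(ub->tag_set.tags[ubq->q_id], tag);
                        if (req && blk_mq_request_started(req))
                                __ublk_fail_req(ub, &ubq->ios[tag], req);
                }
        }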

Best,
Caleb

> +
> +               req = blk_mq_tag_to_rq(ub->tag_set.tags[ubq->q_id], tag);
> +               if (req && blk_mq_request_started(req))
> +                       __ublk_fail_req(ub, &ubq->ios[tag], req);
> +       }
> +}
> +
>  /*
>   * Called from ublk char device release handler, when any uring_cmd is
>   * done, meantime request queue is "quiesced" since all inflight requests
> @@ -2269,6 +2290,9 @@ static void ublk_abort_queue(struct ublk_device *ub, struct ublk_queue *ubq)
>                 if (io->flags & UBLK_IO_FLAG_OWNED_BY_SRV)
>                         __ublk_fail_req(ub, io, io->req);
>         }
> +
> +       if (ublk_support_batch_io(ubq))
> +               ublk_abort_batch_queue(ub, ubq);
>  }
>
>  static void ublk_start_cancel(struct ublk_device *ub)
> --
> 2.47.0
>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH V4 16/27] ublk: add new feature UBLK_F_BATCH_IO
  2025-11-21  1:58 ` [PATCH V4 16/27] ublk: add new feature UBLK_F_BATCH_IO Ming Lei
@ 2025-12-01 21:16   ` Caleb Sander Mateos
  2025-12-02  1:44     ` Ming Lei
  0 siblings, 1 reply; 66+ messages in thread
From: Caleb Sander Mateos @ 2025-12-01 21:16 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Uday Shankar, Stefani Seibold,
	Andrew Morton, linux-kernel

On Thu, Nov 20, 2025 at 6:00 PM Ming Lei <ming.lei@redhat.com> wrote:
>
> Add new feature UBLK_F_BATCH_IO which replaces the following two
> per-io commands:
>
>         - UBLK_U_IO_FETCH_REQ
>
>         - UBLK_U_IO_COMMIT_AND_FETCH_REQ
>
> with three per-queue batch io uring_cmd:
>
>         - UBLK_U_IO_PREP_IO_CMDS
>
>         - UBLK_U_IO_COMMIT_IO_CMDS
>
>         - UBLK_U_IO_FETCH_IO_CMDS
>
> Then ublk can deliver batch io commands to the ublk server in a single
> multishot uring_cmd, and also allows preparing & committing multiple
> commands in batch style via a single uring_cmd, so communication cost is
> reduced a lot.
>
> This feature also no longer restricts the task context for the supported
> commands, so any allowed uring_cmd can be issued from any task context.
> ublk server implementation becomes much easier.
>
> Meanwhile, load balancing becomes much easier to support with this feature.
> The command `UBLK_U_IO_FETCH_IO_CMDS` can be issued from multiple task
> contexts, so each task can adjust this command's buffer length or number
> of inflight commands to control how much load is handled by the current
> task.
>
> Later, a priority parameter will be added to the `UBLK_U_IO_FETCH_IO_CMDS`
> command to improve load-balancing support.
>
> UBLK_U_IO_GET_DATA isn't supported in batch io yet, but it may be

UBLK_U_IO_NEED_GET_DATA?

> enabled in future via its batch pair.
>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  drivers/block/ublk_drv.c      | 58 ++++++++++++++++++++++++++++++++---
>  include/uapi/linux/ublk_cmd.h | 16 ++++++++++
>  2 files changed, 69 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> index 849199771f86..90cd1863bc83 100644
> --- a/drivers/block/ublk_drv.c
> +++ b/drivers/block/ublk_drv.c
> @@ -74,7 +74,8 @@
>                 | UBLK_F_AUTO_BUF_REG \
>                 | UBLK_F_QUIESCE \
>                 | UBLK_F_PER_IO_DAEMON \
> -               | UBLK_F_BUF_REG_OFF_DAEMON)
> +               | UBLK_F_BUF_REG_OFF_DAEMON \
> +               | UBLK_F_BATCH_IO)
>
>  #define UBLK_F_ALL_RECOVERY_FLAGS (UBLK_F_USER_RECOVERY \
>                 | UBLK_F_USER_RECOVERY_REISSUE \
> @@ -320,12 +321,12 @@ static void ublk_batch_dispatch(struct ublk_queue *ubq,
>
>  static inline bool ublk_dev_support_batch_io(const struct ublk_device *ub)
>  {
> -       return false;
> +       return ub->dev_info.flags & UBLK_F_BATCH_IO;
>  }
>
>  static inline bool ublk_support_batch_io(const struct ublk_queue *ubq)
>  {
> -       return false;
> +       return ubq->flags & UBLK_F_BATCH_IO;
>  }
>
>  static inline void ublk_io_lock(struct ublk_io *io)
> @@ -3450,6 +3451,41 @@ static int ublk_validate_batch_fetch_cmd(struct ublk_batch_io_data *data,
>         return 0;
>  }
>
> +static int ublk_handle_non_batch_cmd(struct io_uring_cmd *cmd,
> +                                    unsigned int issue_flags)
> +{
> +       const struct ublksrv_io_cmd *ub_cmd = io_uring_sqe_cmd(cmd->sqe);
> +       struct ublk_device *ub = cmd->file->private_data;
> +       unsigned tag = READ_ONCE(ub_cmd->tag);
> +       unsigned q_id = READ_ONCE(ub_cmd->q_id);
> +       unsigned index = READ_ONCE(ub_cmd->addr);
> +       struct ublk_queue *ubq;
> +       struct ublk_io *io;
> +       int ret = -EINVAL;

I think it would be clearer to just return -EINVAL instead of adding
this variable, but up to you

> +
> +       if (!ub)
> +               return ret;

How is this case possible?

> +
> +       if (q_id >= ub->dev_info.nr_hw_queues)
> +               return ret;
> +
> +       ubq = ublk_get_queue(ub, q_id);
> +       if (tag >= ubq->q_depth)

Can avoid the likely cache miss here by using ub->dev_info.queue_depth
instead, analogous to ublk_ch_uring_cmd_local()
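
Taken together with the two comments above (dropping the ret variable and, if
the NULL ub case really cannot happen, the !ub check), the helper could shrink
to something like this sketch:

        static int ublk_handle_non_batch_cmd(struct io_uring_cmd *cmd,
                                             unsigned int issue_flags)
        {
                const struct ublksrv_io_cmd *ub_cmd = io_uring_sqe_cmd(cmd->sqe);
                struct ublk_device *ub = cmd->file->private_data;
                unsigned tag = READ_ONCE(ub_cmd->tag);
                unsigned q_id = READ_ONCE(ub_cmd->q_id);
                unsigned index = READ_ONCE(ub_cmd->addr);
                struct ublk_io *io;

                if (q_id >= ub->dev_info.nr_hw_queues)
                        return -EINVAL;
                /* avoid touching struct ublk_queue just for the bounds check */
                if (tag >= ub->dev_info.queue_depth)
                        return -EINVAL;

                io = &ublk_get_queue(ub, q_id)->ios[tag];

                switch (cmd->cmd_op) {
                case UBLK_U_IO_REGISTER_IO_BUF:
                        return ublk_register_io_buf(cmd, ub, q_id, tag, io, index,
                                        issue_flags);
                case UBLK_U_IO_UNREGISTER_IO_BUF:
                        return ublk_unregister_io_buf(cmd, ub, index, issue_flags);
                default:
                        return -EOPNOTSUPP;
                }
        }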

> +               return ret;
> +
> +       io = &ubq->ios[tag];
> +
> +       switch (cmd->cmd_op) {
> +       case UBLK_U_IO_REGISTER_IO_BUF:
> +               return ublk_register_io_buf(cmd, ub, q_id, tag, io, index,
> +                               issue_flags);
> +       case UBLK_U_IO_UNREGISTER_IO_BUF:
> +               return ublk_unregister_io_buf(cmd, ub, index, issue_flags);
> +       default:
> +               return -EOPNOTSUPP;
> +       }
> +}
> +
>  static int ublk_ch_batch_io_uring_cmd(struct io_uring_cmd *cmd,
>                                        unsigned int issue_flags)
>  {
> @@ -3497,7 +3533,8 @@ static int ublk_ch_batch_io_uring_cmd(struct io_uring_cmd *cmd,
>                 ret = ublk_handle_batch_fetch_cmd(&data);
>                 break;
>         default:
> -               ret = -EOPNOTSUPP;
> +               ret = ublk_handle_non_batch_cmd(cmd, issue_flags);

We should probably skip the if (data.header.q_id >=
ub->dev_info.nr_hw_queues) check for a non-batch command?

> +               break;
>         }
>  out:
>         return ret;
> @@ -4163,9 +4200,13 @@ static int ublk_ctrl_add_dev(const struct ublksrv_ctrl_cmd *header)
>
>         ub->dev_info.flags |= UBLK_F_CMD_IOCTL_ENCODE |
>                 UBLK_F_URING_CMD_COMP_IN_TASK |
> -               UBLK_F_PER_IO_DAEMON |
> +               (ublk_dev_support_batch_io(ub) ? 0 : UBLK_F_PER_IO_DAEMON) |

Seems redundant with the logic below to clear UBLK_F_PER_IO_DAEMON if
(ublk_dev_support_batch_io(ub))?
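
In other words, the flag could simply be set only for the non-batch case
instead of being set and then cleared, e.g. (sketch):

        ub->dev_info.flags |= UBLK_F_CMD_IOCTL_ENCODE |
                UBLK_F_URING_CMD_COMP_IN_TASK |
                UBLK_F_BUF_REG_OFF_DAEMON;

        /* So far, UBLK_F_PER_IO_DAEMON isn't exposed for BATCH_IO */
        if (!ublk_dev_support_batch_io(ub))
                ub->dev_info.flags |= UBLK_F_PER_IO_DAEMON;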

>                 UBLK_F_BUF_REG_OFF_DAEMON;
>
> +       /* So far, UBLK_F_PER_IO_DAEMON won't be exposed for BATCH_IO */
> +       if (ublk_dev_support_batch_io(ub))
> +               ub->dev_info.flags &= ~UBLK_F_PER_IO_DAEMON;
> +
>         /* GET_DATA isn't needed any more with USER_COPY or ZERO COPY */
>         if (ub->dev_info.flags & (UBLK_F_USER_COPY | UBLK_F_SUPPORT_ZERO_COPY |
>                                 UBLK_F_AUTO_BUF_REG))
> @@ -4518,6 +4559,13 @@ static int ublk_wait_for_idle_io(struct ublk_device *ub,
>         unsigned int elapsed = 0;
>         int ret;
>
> +       /*
> +        * For UBLK_F_BATCH_IO the ublk server can get notified via an existing
> +        * or a new fetch command, so there is no need to wait any more
> +        */
> +       if (ublk_dev_support_batch_io(ub))
> +               return 0;
> +
>         while (elapsed < timeout_ms && !signal_pending(current)) {
>                 unsigned int queues_cancelable = 0;
>                 int i;
> diff --git a/include/uapi/linux/ublk_cmd.h b/include/uapi/linux/ublk_cmd.h
> index cd894c1d188e..5e8b1211b7f4 100644
> --- a/include/uapi/linux/ublk_cmd.h
> +++ b/include/uapi/linux/ublk_cmd.h
> @@ -335,6 +335,22 @@
>   */
>  #define UBLK_F_BUF_REG_OFF_DAEMON (1ULL << 14)
>
> +
> +/*
> + * Support the following commands for delivering & committing io commands
> + * in batch.
> + *
> + *     - UBLK_U_IO_PREP_IO_CMDS
> + *     - UBLK_U_IO_COMMIT_IO_CMDS
> + *     - UBLK_U_IO_FETCH_IO_CMDS
> + *     - UBLK_U_IO_REGISTER_IO_BUF
> + *     - UBLK_U_IO_UNREGISTER_IO_BUF

Seems like it might make sense to provided batched versions of
UBLK_U_IO_REGISTER_IO_BUF and UBLK_U_IO_UNREGISTER_IO_BUF. That could
be done in the future, I guess, but it might simplify
ublk_ch_batch_io_uring_cmd() to only have to handle struct
ublk_batch_io.

> + *
> + * The existing UBLK_U_IO_FETCH_REQ, UBLK_U_IO_COMMIT_AND_FETCH_REQ and
> + * UBLK_U_IO_GET_DATA uring_cmd are not supported for this feature.

UBLK_U_IO_NEED_GET_DATA?

Best,
Caleb

> + */
> +#define UBLK_F_BATCH_IO                (1ULL << 15)
> +
>  /* device state */
>  #define UBLK_S_DEV_DEAD        0
>  #define UBLK_S_DEV_LIVE        1
> --
> 2.47.0
>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH V4 17/27] ublk: document feature UBLK_F_BATCH_IO
  2025-11-21  1:58 ` [PATCH V4 17/27] ublk: document " Ming Lei
@ 2025-12-01 21:46   ` Caleb Sander Mateos
  2025-12-02  1:55     ` Ming Lei
  2025-12-02  2:03     ` Ming Lei
  0 siblings, 2 replies; 66+ messages in thread
From: Caleb Sander Mateos @ 2025-12-01 21:46 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Uday Shankar, Stefani Seibold,
	Andrew Morton, linux-kernel

On Thu, Nov 20, 2025 at 6:00 PM Ming Lei <ming.lei@redhat.com> wrote:
>
> Document feature UBLK_F_BATCH_IO.
>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  Documentation/block/ublk.rst | 60 +++++++++++++++++++++++++++++++++---
>  1 file changed, 56 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/block/ublk.rst b/Documentation/block/ublk.rst
> index 8c4030bcabb6..09a5604f8e10 100644
> --- a/Documentation/block/ublk.rst
> +++ b/Documentation/block/ublk.rst
> @@ -260,9 +260,12 @@ The following IO commands are communicated via io_uring passthrough command,
>  and each command is only for forwarding the IO and committing the result
>  with specified IO tag in the command data:
>
> -- ``UBLK_IO_FETCH_REQ``
> +Traditional Per-I/O Commands
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> -  Sent from the server IO pthread for fetching future incoming IO requests
> +- ``UBLK_U_IO_FETCH_REQ``
> +
> +  Sent from the server I/O pthread for fetching future incoming I/O requests
>    destined to ``/dev/ublkb*``. This command is sent only once from the server
>    IO pthread for ublk driver to setup IO forward environment.
>
> @@ -278,7 +281,7 @@ with specified IO tag in the command data:
>    supported by the driver, daemons must be per-queue instead - i.e. all I/Os
>    associated to a single qid must be handled by the same task.
>
> -- ``UBLK_IO_COMMIT_AND_FETCH_REQ``
> +- ``UBLK_U_IO_COMMIT_AND_FETCH_REQ``
>
>    When an IO request is destined to ``/dev/ublkb*``, the driver stores
>    the IO's ``ublksrv_io_desc`` to the specified mapped area; then the
> @@ -293,7 +296,7 @@ with specified IO tag in the command data:
>    requests with the same IO tag. That is, ``UBLK_IO_COMMIT_AND_FETCH_REQ``
>    is reused for both fetching request and committing back IO result.
>
> -- ``UBLK_IO_NEED_GET_DATA``
> +- ``UBLK_U_IO_NEED_GET_DATA``
>
>    With ``UBLK_F_NEED_GET_DATA`` enabled, the WRITE request will be firstly
>    issued to ublk server without data copy. Then, IO backend of ublk server
> @@ -322,6 +325,55 @@ with specified IO tag in the command data:
>    ``UBLK_IO_COMMIT_AND_FETCH_REQ`` to the server, ublkdrv needs to copy
>    the server buffer (pages) read to the IO request pages.
>
> +Batch I/O Commands (UBLK_F_BATCH_IO)
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +The ``UBLK_F_BATCH_IO`` feature provides an alternative high-performance
> +I/O handling model that replaces the traditional per-I/O commands with
> +per-queue batch commands. This significantly reduces communication overhead
> +and enables better load balancing across multiple server tasks.
> +
> +Key differences from traditional mode:
> +
> +- **Per-queue vs Per-I/O**: Commands operate on queues rather than individual I/Os
> +- **Batch processing**: Multiple I/Os are handled in single operations
> +- **Multishot commands**: Use io_uring multishot for reduced submission overhead
> +- **Flexible task assignment**: Any task can handle any I/O (no per-I/O daemons)
> +- **Better load balancing**: Tasks can adjust their workload dynamically
> +
> +Batch I/O Commands:
> +
> +- ``UBLK_U_IO_PREP_IO_CMDS``
> +
> +  Prepares multiple I/O commands in batch. The server provides a buffer
> +  containing multiple I/O descriptors that will be processed together.
> +  This reduces the number of individual command submissions required.
> +
> +- ``UBLK_U_IO_COMMIT_IO_CMDS``
> +
> +  Commits results for multiple I/O operations in batch. The server provides

And prepares the I/O descriptors to accept new requests?

> +  a buffer containing the results of multiple completed I/Os, allowing
> +  efficient bulk completion of requests.
> +
> +- ``UBLK_U_IO_FETCH_IO_CMDS``
> +
> +  **Multishot command** for fetching I/O commands in batch. This is the key
> +  command that enables high-performance batch processing:
> +
> +  * Uses io_uring multishot capability for reduced submission overhead
> +  * Single command can fetch multiple I/O requests over time
> +  * Buffer size determines maximum batch size per operation
> +  * Multiple fetch commands can be submitted for load balancing
> +  * Only one fetch command is active at any time per queue

Can you clarify what the lifetime of the fetch command is? It looks
like as long as the buffer selection and posting of the multishot CQE
succeeds, the same UBLK_U_IO_FETCH_IO_CMDS will continue to be used.
If additional UBLK_U_IO_FETCH_IO_CMDS commands are issued to the queue
(e.g. by other threads), they won't be used until the first one fails
to select a buffer or post the CQE? Seems like this would make it
difficult to load-balance incoming requests on a single ublk queue
between multiple threads.

Best,
Caleb

> +  * Supports dynamic load balancing across multiple server tasks
> +
> +  Each task can submit ``UBLK_U_IO_FETCH_IO_CMDS`` with different buffer
> +  sizes to control how much work it handles. This enables sophisticated
> +  load balancing strategies in multi-threaded servers.
> +
> +Migration: Applications using traditional commands (``UBLK_U_IO_FETCH_REQ``,
> +``UBLK_U_IO_COMMIT_AND_FETCH_REQ``) cannot use batch mode simultaneously.
> +
>  Zero copy
>  ---------
>
> --
> 2.47.0
>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH V4 18/27] ublk: implement batch request completion via blk_mq_end_request_batch()
  2025-11-21  1:58 ` [PATCH V4 18/27] ublk: implement batch request completion via blk_mq_end_request_batch() Ming Lei
@ 2025-12-01 21:55   ` Caleb Sander Mateos
  0 siblings, 0 replies; 66+ messages in thread
From: Caleb Sander Mateos @ 2025-12-01 21:55 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Uday Shankar, Stefani Seibold,
	Andrew Morton, linux-kernel

On Thu, Nov 20, 2025 at 6:00 PM Ming Lei <ming.lei@redhat.com> wrote:
>
> Reduce overhead when completing multiple requests in batch I/O mode by
> accumulating them in an io_comp_batch structure and completing them
> together via blk_mq_end_request_batch(). This minimizes per-request
> completion overhead and improves performance for high IOPS workloads.
>
> The implementation adds an io_comp_batch pointer to struct ublk_io and
> initializes it in __ublk_fetch(). For batch I/O, the pointer is set to
> the batch structure in ublk_batch_commit_io(). The __ublk_complete_rq()
> function uses io->iob to call blk_mq_add_to_batch() for batch mode.
> After processing all batch I/Os, the completion callback is invoked in
> ublk_handle_batch_commit_cmd() to complete all accumulated requests
> efficiently.
>
> So far this just covers direct completion. For deferred completion (zero
> copy, auto buffer reg), ublk_io_release() is often delayed until the code
> path that frees the buffer-consumer io_uring request, so this patch often
> doesn't help there; it is also hard to pass the per-task 'struct
> io_comp_batch' for deferred completion.
>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  drivers/block/ublk_drv.c | 30 ++++++++++++++++++++++--------
>  1 file changed, 22 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> index 90cd1863bc83..a5606c7111a4 100644
> --- a/drivers/block/ublk_drv.c
> +++ b/drivers/block/ublk_drv.c
> @@ -130,6 +130,7 @@ struct ublk_batch_io_data {
>         struct io_uring_cmd *cmd;
>         struct ublk_batch_io header;
>         unsigned int issue_flags;
> +       struct io_comp_batch *iob;
>  };
>
>  /*
> @@ -642,7 +643,12 @@ static blk_status_t ublk_setup_iod_zoned(struct ublk_queue *ubq,
>  #endif
>
>  static inline void __ublk_complete_rq(struct request *req, struct ublk_io *io,
> -                                     bool need_map);
> +                                     bool need_map, struct io_comp_batch *iob);
> +
> +static void ublk_complete_batch(struct io_comp_batch *iob)
> +{
> +       blk_mq_end_request_batch(iob);
> +}

Don't see the need for this function, just use blk_mq_end_request_batch instead?

Otherwise,
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
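
Since blk_mq_end_request_batch() already has the io_comp_batch completion
callback signature, the wrapper could indeed be dropped, e.g. (sketch):

        else if (likely(!blk_should_fake_timeout(req->q))) {
                /* batch when possible; fall back to per-request completion */
                if (blk_mq_add_to_batch(req, iob, false,
                                        blk_mq_end_request_batch))
                        return;
                __blk_mq_end_request(req, BLK_STS_OK);
        }

The final "if (iob.complete) iob.complete(&iob);" call in
ublk_handle_batch_commit_cmd() then ends up invoking
blk_mq_end_request_batch() directly.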

>
>  static dev_t ublk_chr_devt;
>  static const struct class ublk_chr_class = {
> @@ -912,7 +918,7 @@ static inline void ublk_put_req_ref(struct ublk_io *io, struct request *req)
>                 return;
>
>         /* ublk_need_map_io() and ublk_need_req_ref() are mutually exclusive */
> -       __ublk_complete_rq(req, io, false);
> +       __ublk_complete_rq(req, io, false, NULL);
>  }
>
>  static inline bool ublk_sub_req_ref(struct ublk_io *io)
> @@ -1251,7 +1257,7 @@ static inline struct ublk_uring_cmd_pdu *ublk_get_uring_cmd_pdu(
>
>  /* todo: handle partial completion */
>  static inline void __ublk_complete_rq(struct request *req, struct ublk_io *io,
> -                                     bool need_map)
> +                                     bool need_map, struct io_comp_batch *iob)
>  {
>         unsigned int unmapped_bytes;
>         blk_status_t res = BLK_STS_OK;
> @@ -1288,8 +1294,11 @@ static inline void __ublk_complete_rq(struct request *req, struct ublk_io *io,
>
>         if (blk_update_request(req, BLK_STS_OK, io->res))
>                 blk_mq_requeue_request(req, true);
> -       else if (likely(!blk_should_fake_timeout(req->q)))
> +       else if (likely(!blk_should_fake_timeout(req->q))) {
> +               if (blk_mq_add_to_batch(req, iob, false, ublk_complete_batch))
> +                       return;
>                 __blk_mq_end_request(req, BLK_STS_OK);
> +       }
>
>         return;
>  exit:
> @@ -2249,7 +2258,7 @@ static void __ublk_fail_req(struct ublk_device *ub, struct ublk_io *io,
>                 blk_mq_requeue_request(req, false);
>         else {
>                 io->res = -EIO;
> -               __ublk_complete_rq(req, io, ublk_dev_need_map_io(ub));
> +               __ublk_complete_rq(req, io, ublk_dev_need_map_io(ub), NULL);
>         }
>  }
>
> @@ -2986,7 +2995,7 @@ static int ublk_ch_uring_cmd_local(struct io_uring_cmd *cmd,
>                 if (req_op(req) == REQ_OP_ZONE_APPEND)
>                         req->__sector = addr;
>                 if (compl)
> -                       __ublk_complete_rq(req, io, ublk_dev_need_map_io(ub));
> +                       __ublk_complete_rq(req, io, ublk_dev_need_map_io(ub), NULL);
>
>                 if (ret)
>                         goto out;
> @@ -3321,11 +3330,11 @@ static int ublk_batch_commit_io(struct ublk_queue *ubq,
>         if (req_op(req) == REQ_OP_ZONE_APPEND)
>                 req->__sector = ublk_batch_zone_lba(uc, elem);
>         if (compl)
> -               __ublk_complete_rq(req, io, ublk_dev_need_map_io(data->ub));
> +               __ublk_complete_rq(req, io, ublk_dev_need_map_io(data->ub), data->iob);
>         return 0;
>  }
>
> -static int ublk_handle_batch_commit_cmd(const struct ublk_batch_io_data *data)
> +static int ublk_handle_batch_commit_cmd(struct ublk_batch_io_data *data)
>  {
>         const struct ublk_batch_io *uc = &data->header;
>         struct io_uring_cmd *cmd = data->cmd;
> @@ -3334,10 +3343,15 @@ static int ublk_handle_batch_commit_cmd(const struct ublk_batch_io_data *data)
>                 .total = uc->nr_elem * uc->elem_bytes,
>                 .elem_bytes = uc->elem_bytes,
>         };
> +       DEFINE_IO_COMP_BATCH(iob);
>         int ret;
>
> +       data->iob = &iob;
>         ret = ublk_walk_cmd_buf(&iter, data, ublk_batch_commit_io);
>
> +       if (iob.complete)
> +               iob.complete(&iob);
> +
>         return iter.done == 0 ? ret : iter.done;
>  }
>
> --
> 2.47.0
>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH V4 14/27] ublk: add UBLK_U_IO_FETCH_IO_CMDS for batch I/O processing
  2025-12-01 17:51       ` Caleb Sander Mateos
@ 2025-12-02  1:27         ` Ming Lei
  2025-12-02  1:39           ` Caleb Sander Mateos
  0 siblings, 1 reply; 66+ messages in thread
From: Ming Lei @ 2025-12-02  1:27 UTC (permalink / raw)
  To: Caleb Sander Mateos
  Cc: Jens Axboe, linux-block, Uday Shankar, Stefani Seibold,
	Andrew Morton, linux-kernel

On Mon, Dec 01, 2025 at 09:51:59AM -0800, Caleb Sander Mateos wrote:
> On Mon, Dec 1, 2025 at 1:42 AM Ming Lei <ming.lei@redhat.com> wrote:
> >
> > On Sun, Nov 30, 2025 at 09:55:47PM -0800, Caleb Sander Mateos wrote:
> > > On Thu, Nov 20, 2025 at 6:00 PM Ming Lei <ming.lei@redhat.com> wrote:
> > > >
> > > > Add UBLK_U_IO_FETCH_IO_CMDS command to enable efficient batch processing
> > > > of I/O requests. This multishot uring_cmd allows the ublk server to fetch
> > > > multiple I/O commands in a single operation, significantly reducing
> > > > submission overhead compared to individual FETCH_REQ* commands.
> > > >
> > > > Key Design Features:
> > > >
> > > > 1. Multishot Operation: One UBLK_U_IO_FETCH_IO_CMDS can fetch many I/O
> > > >    commands, with the batch size limited by the provided buffer length.
> > > >
> > > > 2. Dynamic Load Balancing: Multiple fetch commands can be submitted
> > > >    simultaneously, but only one is active at any time. This enables
> > > >    efficient load distribution across multiple server task contexts.
> > > >
> > > > 3. Implicit State Management: The implementation uses three key variables
> > > >    to track state:
> > > >    - evts_fifo: Queue of request tags awaiting processing
> > > >    - fcmd_head: List of available fetch commands
> > > >    - active_fcmd: Currently active fetch command (NULL = none active)
> > > >
> > > >    States are derived implicitly:
> > > >    - IDLE: No fetch commands available
> > > >    - READY: Fetch commands available, none active
> > > >    - ACTIVE: One fetch command processing events
> > > >
> > > > 4. Lockless Reader Optimization: The active fetch command can read from
> > > >    evts_fifo without locking (single reader guarantee), while writers
> > > >    (ublk_queue_rq/ublk_queue_rqs) use evts_lock protection. The memory
> > > >    barrier pairing plays key role for the single lockless reader
> > > >    optimization.
> > > >
> > > > Implementation Details:
> > > >
> > > > - ublk_queue_rq() and ublk_queue_rqs() save request tags to evts_fifo
> > > > - __ublk_pick_active_fcmd() selects an available fetch command when
> > > >   events arrive and no command is currently active
> > >
> > > What is __ublk_pick_active_fcmd()? I don't see a function with that name.
> >
> > It has been renamed to __ublk_acquire_fcmd(), and its counterpart is
> > __ublk_release_fcmd().
> 
> Okay, update the commit message then?
> 
> >
> > >
> > > > - ublk_batch_dispatch() moves tags from evts_fifo to the fetch command's
> > > >   buffer and posts completion via io_uring_mshot_cmd_post_cqe()
> > > > - State transitions are coordinated via evts_lock to maintain consistency
> > > >
> > > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > > ---
> > > >  drivers/block/ublk_drv.c      | 412 +++++++++++++++++++++++++++++++---
> > > >  include/uapi/linux/ublk_cmd.h |   7 +
> > > >  2 files changed, 388 insertions(+), 31 deletions(-)
> > > >
> > > > diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> > > > index cc9c92d97349..2e5e392c939e 100644
> > > > --- a/drivers/block/ublk_drv.c
> > > > +++ b/drivers/block/ublk_drv.c
> > > > @@ -93,6 +93,7 @@
> > > >
> > > >  /* ublk batch fetch uring_cmd */
> > > >  struct ublk_batch_fcmd {
> > > > +       struct list_head node;
> > > >         struct io_uring_cmd *cmd;
> > > >         unsigned short buf_group;
> > > >  };
> > > > @@ -117,7 +118,10 @@ struct ublk_uring_cmd_pdu {
> > > >          */
> > > >         struct ublk_queue *ubq;
> > > >
> > > > -       u16 tag;
> > > > +       union {
> > > > +               u16 tag;
> > > > +               struct ublk_batch_fcmd *fcmd; /* batch io only */
> > > > +       };
> > > >  };
> > > >
> > > >  struct ublk_batch_io_data {
> > > > @@ -229,18 +233,36 @@ struct ublk_queue {
> > > >         struct ublk_device *dev;
> > > >
> > > >         /*
> > > > -        * Inflight ublk request tag is saved in this fifo
> > > > +        * Batch I/O State Management:
> > > > +        *
> > > > +        * The batch I/O system uses implicit state management based on the
> > > > +        * combination of three key variables below.
> > > > +        *
> > > > +        * - IDLE: list_empty(&fcmd_head) && !active_fcmd
> > > > +        *   No fetch commands available, events queue in evts_fifo
> > > > +        *
> > > > +        * - READY: !list_empty(&fcmd_head) && !active_fcmd
> > > > +        *   Fetch commands available but none processing events
> > > >          *
> > > > -        * There are multiple writer from ublk_queue_rq() or ublk_queue_rqs(),
> > > > -        * so lock is required for storing request tag to fifo
> > > > +        * - ACTIVE: active_fcmd
> > > > +        *   One fetch command actively processing events from evts_fifo
> > > >          *
> > > > -        * Make sure just one reader for fetching request from task work
> > > > -        * function to ublk server, so no need to grab the lock in reader
> > > > -        * side.
> > > > +        * Key Invariants:
> > > > +        * - At most one active_fcmd at any time (single reader)
> > > > +        * - active_fcmd is always from fcmd_head list when non-NULL
> > > > +        * - evts_fifo can be read locklessly by the single active reader
> > > > +        * - All state transitions require evts_lock protection
> > > > +        * - Multiple writers to evts_fifo require lock protection
> > > >          */
> > > >         struct {
> > > >                 DECLARE_KFIFO_PTR(evts_fifo, unsigned short);
> > > >                 spinlock_t evts_lock;
> > > > +
> > > > +               /* List of fetch commands available to process events */
> > > > +               struct list_head fcmd_head;
> > > > +
> > > > +               /* Currently active fetch command (NULL = none active) */
> > > > +               struct ublk_batch_fcmd  *active_fcmd;
> > > >         }____cacheline_aligned_in_smp;
> > > >
> > > >         struct ublk_io ios[] __counted_by(q_depth);
> > > > @@ -292,12 +314,20 @@ static void ublk_abort_queue(struct ublk_device *ub, struct ublk_queue *ubq);
> > > >  static inline struct request *__ublk_check_and_get_req(struct ublk_device *ub,
> > > >                 u16 q_id, u16 tag, struct ublk_io *io, size_t offset);
> > > >  static inline unsigned int ublk_req_build_flags(struct request *req);
> > > > +static void ublk_batch_dispatch(struct ublk_queue *ubq,
> > > > +                               struct ublk_batch_io_data *data,
> > > > +                               struct ublk_batch_fcmd *fcmd);
> > > >
> > > >  static inline bool ublk_dev_support_batch_io(const struct ublk_device *ub)
> > > >  {
> > > >         return false;
> > > >  }
> > > >
> > > > +static inline bool ublk_support_batch_io(const struct ublk_queue *ubq)
> > > > +{
> > > > +       return false;
> > > > +}
> > > > +
> > > >  static inline void ublk_io_lock(struct ublk_io *io)
> > > >  {
> > > >         spin_lock(&io->lock);
> > > > @@ -624,13 +654,45 @@ static wait_queue_head_t ublk_idr_wq;     /* wait until one idr is freed */
> > > >
> > > >  static DEFINE_MUTEX(ublk_ctl_mutex);
> > > >
> > > > +static struct ublk_batch_fcmd *
> > > > +ublk_batch_alloc_fcmd(struct io_uring_cmd *cmd)
> > > > +{
> > > > +       struct ublk_batch_fcmd *fcmd = kzalloc(sizeof(*fcmd), GFP_NOIO);
> > >
> > > An allocation in the I/O path seems unfortunate. Is there not room to
> > > store the struct ublk_batch_fcmd in the io_uring_cmd pdu?
> >
> > It is allocated once for one mshot request, which covers many IOs.
> >
> > It can't be held in uring_cmd pdu, but the allocation can be optimized in
> > future. Not a big deal in enablement stage.
> 
> Okay, seems fine to optimize it in the future.
> 
> >
> > > > +
> > > > +       if (fcmd) {
> > > > +               fcmd->cmd = cmd;
> > > > +               fcmd->buf_group = READ_ONCE(cmd->sqe->buf_index);
> > >
> > > Is it necessary to store sample this here just to pass it back to the
> > > io_uring layer? Wouldn't the io_uring layer already have access to it
> > > in struct io_kiocb's buf_index field?
> >
> > ->buf_group is used by io_uring_cmd_buffer_select(), and this way also
> > follows ->buf_index uses in both io_uring/net.c and io_uring/rw.c.
> >
> >
> > io_ring_buffer_select(), so we can't reuse req->buf_index here.
> 
> But io_uring/net.c and io_uring/rw.c both retrieve the buf_group value
> from req->buf_index instead of the SQE, for example:
> if (req->flags & REQ_F_BUFFER_SELECT)
>         sr->buf_group = req->buf_index;
> 
> Seems like it would make sense to do the same for
> UBLK_U_IO_FETCH_IO_CMDS. That also saves one pointer dereference here.

IMO we shouldn't encourage drivers to access `io_kiocb`; however, cmd->sqe
is exposed to drivers explicitly.

> 
> >
> > >
> > > > +       }
> > > > +       return fcmd;
> > > > +}
> > > > +
> > > > +static void ublk_batch_free_fcmd(struct ublk_batch_fcmd *fcmd)
> > > > +{
> > > > +       kfree(fcmd);
> > > > +}
> > > > +
> > > > +static void __ublk_release_fcmd(struct ublk_queue *ubq)
> > > > +{
> > > > +       WRITE_ONCE(ubq->active_fcmd, NULL);
> > > > +}
> > > >
> > > > -static void ublk_batch_deinit_fetch_buf(const struct ublk_batch_io_data *data,
> > > > +/*
> > > > + * Nothing can move on, so clear ->active_fcmd, and the caller should stop
> > > > + * dispatching
> > > > + */
> > > > +static void ublk_batch_deinit_fetch_buf(struct ublk_queue *ubq,
> > > > +                                       const struct ublk_batch_io_data *data,
> > > >                                         struct ublk_batch_fcmd *fcmd,
> > > >                                         int res)
> > > >  {
> > > > +       spin_lock(&ubq->evts_lock);
> > > > +       list_del(&fcmd->node);
> > > > +       WARN_ON_ONCE(fcmd != ubq->active_fcmd);
> > > > +       __ublk_release_fcmd(ubq);
> > > > +       spin_unlock(&ubq->evts_lock);
> > > > +
> > > >         io_uring_cmd_done(fcmd->cmd, res, data->issue_flags);
> > > > -       fcmd->cmd = NULL;
> > > > +       ublk_batch_free_fcmd(fcmd);
> > > >  }
> > > >
> > > >  static int ublk_batch_fetch_post_cqe(struct ublk_batch_fcmd *fcmd,
> > > > @@ -1491,6 +1553,8 @@ static int __ublk_batch_dispatch(struct ublk_queue *ubq,
> > > >         bool needs_filter;
> > > >         int ret;
> > > >
> > > > +       WARN_ON_ONCE(data->cmd != fcmd->cmd);
> > > > +
> > > >         sel = io_uring_cmd_buffer_select(fcmd->cmd, fcmd->buf_group, &len,
> > > >                                          data->issue_flags);
> > > >         if (sel.val < 0)
> > > > @@ -1548,23 +1612,94 @@ static int __ublk_batch_dispatch(struct ublk_queue *ubq,
> > > >         return ret;
> > > >  }
> > > >
> > > > -static __maybe_unused int
> > > > -ublk_batch_dispatch(struct ublk_queue *ubq,
> > > > -                   const struct ublk_batch_io_data *data,
> > > > -                   struct ublk_batch_fcmd *fcmd)
> > > > +static struct ublk_batch_fcmd *__ublk_acquire_fcmd(
> > > > +               struct ublk_queue *ubq)
> > > > +{
> > > > +       struct ublk_batch_fcmd *fcmd;
> > > > +
> > > > +       lockdep_assert_held(&ubq->evts_lock);
> > > > +
> > > > +       /*
> > > > +        * Ordering updating ubq->evts_fifo and checking ubq->active_fcmd.
> > > > +        *
> > > > +        * The pair is the smp_mb() in ublk_batch_dispatch().
> > > > +        *
> > > > +        * If ubq->active_fcmd is observed as non-NULL, the new added tags
> > > > +        * can be visisible in ublk_batch_dispatch() with the barrier pairing.
> > > > +        */
> > > > +       smp_mb();
> > > > +       if (READ_ONCE(ubq->active_fcmd)) {
> > > > +               fcmd = NULL;
> > > > +       } else {
> > > > +               fcmd = list_first_entry_or_null(&ubq->fcmd_head,
> > > > +                               struct ublk_batch_fcmd, node);
> > > > +               WRITE_ONCE(ubq->active_fcmd, fcmd);
> > > > +       }
> > > > +       return fcmd;
> > > > +}
> > > > +
> > > > +static void ublk_batch_tw_cb(struct io_uring_cmd *cmd,
> > > > +                          unsigned int issue_flags)
> > > > +{
> > > > +       struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
> > > > +       struct ublk_batch_fcmd *fcmd = pdu->fcmd;
> > > > +       struct ublk_batch_io_data data = {
> > > > +               .ub = pdu->ubq->dev,
> > > > +               .cmd = fcmd->cmd,
> > > > +               .issue_flags = issue_flags,
> > > > +       };
> > > > +
> > > > +       WARN_ON_ONCE(pdu->ubq->active_fcmd != fcmd);
> > > > +
> > > > +       ublk_batch_dispatch(pdu->ubq, &data, fcmd);
> > > > +}
> > > > +
> > > > +static void ublk_batch_dispatch(struct ublk_queue *ubq,
> > > > +                               struct ublk_batch_io_data *data,
> > > > +                               struct ublk_batch_fcmd *fcmd)
> > > >  {
> > > > +       struct ublk_batch_fcmd *new_fcmd;
> > >
> > > Is the new_fcmd variable necessary? Can fcmd be reused instead?
> > >
> > > > +       void *handle;
> > > > +       bool empty;
> > > >         int ret = 0;
> > > >
> > > > +again:
> > > >         while (!ublk_io_evts_empty(ubq)) {
> > > >                 ret = __ublk_batch_dispatch(ubq, data, fcmd);
> > > >                 if (ret <= 0)
> > > >                         break;
> > > >         }
> > > >
> > > > -       if (ret < 0)
> > > > -               ublk_batch_deinit_fetch_buf(data, fcmd, ret);
> > > > +       if (ret < 0) {
> > > > +               ublk_batch_deinit_fetch_buf(ubq, data, fcmd, ret);
> > > > +               return;
> > > > +       }
> > > >
> > > > -       return ret;
> > > > +       handle = io_uring_cmd_ctx_handle(fcmd->cmd);
> > > > +       __ublk_release_fcmd(ubq);
> > > > +       /*
> > > > +        * Order clearing ubq->active_fcmd from __ublk_release_fcmd() and
> > > > +        * checking ubq->evts_fifo.
> > > > +        *
> > > > +        * The pair is the smp_mb() in __ublk_acquire_fcmd().
> > > > +        */
> > > > +       smp_mb();
> > > > +       empty = ublk_io_evts_empty(ubq);
> > > > +       if (likely(empty))
> > >
> > > nit: empty variable seems unnecessary
> > >
> > > > +               return;
> > > > +
> > > > +       spin_lock(&ubq->evts_lock);
> > > > +       new_fcmd = __ublk_acquire_fcmd(ubq);
> > > > +       spin_unlock(&ubq->evts_lock);
> > > > +
> > > > +       if (!new_fcmd)
> > > > +               return;
> > > > +       if (handle == io_uring_cmd_ctx_handle(new_fcmd->cmd)) {
> > >
> > > This check seems to be meant to decide whether the new and old
> > > UBLK_U_IO_FETCH_IO_CMDS commands can execute in the same task work?
> >
> > Actually not.
> >
> > > But belonging to the same io_uring context doesn't necessarily mean
> > > that the same task issued them. It seems like it would be safer to
> > > always dispatch new_fcmd->cmd to task work.
> >
> > What matters is just that ctx->uring_lock & issue_flag matches from ublk
> > viewpoint, so it is safe to do so.
> 
> Okay, that makes sense.
> 
> >
> > However, given it is hit in slow path, so starting new dispatch
> > is easier.
> 
> Yeah, I'd agree it makes sense to keep the unexpected path code
> simpler. There may also be fairness concerns from looping indefinitely
> here if the evts_fifo continues to be nonempty, so dispatching to task
> work seems safer.

Fair enough.

Thanks,
Ming


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH V4 15/27] ublk: abort requests filled in event kfifo
  2025-12-01 18:52   ` Caleb Sander Mateos
@ 2025-12-02  1:29     ` Ming Lei
  0 siblings, 0 replies; 66+ messages in thread
From: Ming Lei @ 2025-12-02  1:29 UTC (permalink / raw)
  To: Caleb Sander Mateos
  Cc: Jens Axboe, linux-block, Uday Shankar, Stefani Seibold,
	Andrew Morton, linux-kernel

On Mon, Dec 01, 2025 at 10:52:22AM -0800, Caleb Sander Mateos wrote:
> On Thu, Nov 20, 2025 at 6:00 PM Ming Lei <ming.lei@redhat.com> wrote:
> >
> > In case of BATCH_IO, any request filled in event kfifo, they don't get
> > chance to be dispatched any more when releasing ublk char device, so
> > we have to abort them too.
> >
> > Add ublk_abort_batch_queue() for aborting this kind of requests.
> >
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >  drivers/block/ublk_drv.c | 26 +++++++++++++++++++++++++-
> >  1 file changed, 25 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> > index 2e5e392c939e..849199771f86 100644
> > --- a/drivers/block/ublk_drv.c
> > +++ b/drivers/block/ublk_drv.c
> > @@ -2241,7 +2241,8 @@ static int ublk_ch_mmap(struct file *filp, struct vm_area_struct *vma)
> >  static void __ublk_fail_req(struct ublk_device *ub, struct ublk_io *io,
> >                 struct request *req)
> >  {
> > -       WARN_ON_ONCE(io->flags & UBLK_IO_FLAG_ACTIVE);
> > +       WARN_ON_ONCE(!ublk_dev_support_batch_io(ub) &&
> > +                       io->flags & UBLK_IO_FLAG_ACTIVE);
> >
> >         if (ublk_nosrv_should_reissue_outstanding(ub))
> >                 blk_mq_requeue_request(req, false);
> > @@ -2251,6 +2252,26 @@ static void __ublk_fail_req(struct ublk_device *ub, struct ublk_io *io,
> >         }
> >  }
> >
> > +/*
> > + * Request tag may just be filled to event kfifo, not get chance to
> > + * dispatch, abort these requests too
> > + */
> > +static void ublk_abort_batch_queue(struct ublk_device *ub,
> > +                                  struct ublk_queue *ubq)
> > +{
> > +       while (true) {
> > +               struct request *req;
> > +               short tag;
> 
> unsigned short?

OK.

> 
> > +
> > +               if (!kfifo_out(&ubq->evts_fifo, &tag, 1))
> > +                       break;
> > +
> > +               req = blk_mq_tag_to_rq(ub->tag_set.tags[ubq->q_id], tag);
> > +               if (req && blk_mq_request_started(req))
> 
> If the tag is in the evts_fifo, how would it be possible for the
> request not to have been started yet?

Good point, the above check can be replaced with WARN_ON_ONCE().
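
Something like the below (just a sketch; it assumes the rest of the loop
fails the request via __ublk_fail_req(), which isn't visible in the quoted
hunk):

```
		if (!kfifo_out(&ubq->evts_fifo, &tag, 1))
			break;

		req = blk_mq_tag_to_rq(ub->tag_set.tags[ubq->q_id], tag);
		/* a tag is only queued in evts_fifo after its request started */
		if (WARN_ON_ONCE(!req || !blk_mq_request_started(req)))
			continue;
		__ublk_fail_req(ub, &ubq->ios[tag], req);
```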


Thanks,
Ming


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH V4 14/27] ublk: add UBLK_U_IO_FETCH_IO_CMDS for batch I/O processing
  2025-12-02  1:27         ` Ming Lei
@ 2025-12-02  1:39           ` Caleb Sander Mateos
  2025-12-02  8:14             ` Ming Lei
  0 siblings, 1 reply; 66+ messages in thread
From: Caleb Sander Mateos @ 2025-12-02  1:39 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Uday Shankar, Stefani Seibold,
	Andrew Morton, linux-kernel

On Mon, Dec 1, 2025 at 5:27 PM Ming Lei <ming.lei@redhat.com> wrote:
>
> On Mon, Dec 01, 2025 at 09:51:59AM -0800, Caleb Sander Mateos wrote:
> > On Mon, Dec 1, 2025 at 1:42 AM Ming Lei <ming.lei@redhat.com> wrote:
> > >
> > > On Sun, Nov 30, 2025 at 09:55:47PM -0800, Caleb Sander Mateos wrote:
> > > > On Thu, Nov 20, 2025 at 6:00 PM Ming Lei <ming.lei@redhat.com> wrote:
> > > > >
> > > > > Add UBLK_U_IO_FETCH_IO_CMDS command to enable efficient batch processing
> > > > > of I/O requests. This multishot uring_cmd allows the ublk server to fetch
> > > > > multiple I/O commands in a single operation, significantly reducing
> > > > > submission overhead compared to individual FETCH_REQ* commands.
> > > > >
> > > > > Key Design Features:
> > > > >
> > > > > 1. Multishot Operation: One UBLK_U_IO_FETCH_IO_CMDS can fetch many I/O
> > > > >    commands, with the batch size limited by the provided buffer length.
> > > > >
> > > > > 2. Dynamic Load Balancing: Multiple fetch commands can be submitted
> > > > >    simultaneously, but only one is active at any time. This enables
> > > > >    efficient load distribution across multiple server task contexts.
> > > > >
> > > > > 3. Implicit State Management: The implementation uses three key variables
> > > > >    to track state:
> > > > >    - evts_fifo: Queue of request tags awaiting processing
> > > > >    - fcmd_head: List of available fetch commands
> > > > >    - active_fcmd: Currently active fetch command (NULL = none active)
> > > > >
> > > > >    States are derived implicitly:
> > > > >    - IDLE: No fetch commands available
> > > > >    - READY: Fetch commands available, none active
> > > > >    - ACTIVE: One fetch command processing events
> > > > >
> > > > > 4. Lockless Reader Optimization: The active fetch command can read from
> > > > >    evts_fifo without locking (single reader guarantee), while writers
> > > > >    (ublk_queue_rq/ublk_queue_rqs) use evts_lock protection. The memory
> > > > >    barrier pairing plays key role for the single lockless reader
> > > > >    optimization.
> > > > >
> > > > > Implementation Details:
> > > > >
> > > > > - ublk_queue_rq() and ublk_queue_rqs() save request tags to evts_fifo
> > > > > - __ublk_pick_active_fcmd() selects an available fetch command when
> > > > >   events arrive and no command is currently active
> > > >
> > > > What is __ublk_pick_active_fcmd()? I don't see a function with that name.
> > >
> > > It is renamed as __ublk_acquire_fcmd(), and its counter pair is
> > > __ublk_release_fcmd().
> >
> > Okay, update the commit message then?
> >
> > >
> > > >
> > > > > - ublk_batch_dispatch() moves tags from evts_fifo to the fetch command's
> > > > >   buffer and posts completion via io_uring_mshot_cmd_post_cqe()
> > > > > - State transitions are coordinated via evts_lock to maintain consistency
> > > > >
> > > > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > > > ---
> > > > >  drivers/block/ublk_drv.c      | 412 +++++++++++++++++++++++++++++++---
> > > > >  include/uapi/linux/ublk_cmd.h |   7 +
> > > > >  2 files changed, 388 insertions(+), 31 deletions(-)
> > > > >
> > > > > diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> > > > > index cc9c92d97349..2e5e392c939e 100644
> > > > > --- a/drivers/block/ublk_drv.c
> > > > > +++ b/drivers/block/ublk_drv.c
> > > > > @@ -93,6 +93,7 @@
> > > > >
> > > > >  /* ublk batch fetch uring_cmd */
> > > > >  struct ublk_batch_fcmd {
> > > > > +       struct list_head node;
> > > > >         struct io_uring_cmd *cmd;
> > > > >         unsigned short buf_group;
> > > > >  };
> > > > > @@ -117,7 +118,10 @@ struct ublk_uring_cmd_pdu {
> > > > >          */
> > > > >         struct ublk_queue *ubq;
> > > > >
> > > > > -       u16 tag;
> > > > > +       union {
> > > > > +               u16 tag;
> > > > > +               struct ublk_batch_fcmd *fcmd; /* batch io only */
> > > > > +       };
> > > > >  };
> > > > >
> > > > >  struct ublk_batch_io_data {
> > > > > @@ -229,18 +233,36 @@ struct ublk_queue {
> > > > >         struct ublk_device *dev;
> > > > >
> > > > >         /*
> > > > > -        * Inflight ublk request tag is saved in this fifo
> > > > > +        * Batch I/O State Management:
> > > > > +        *
> > > > > +        * The batch I/O system uses implicit state management based on the
> > > > > +        * combination of three key variables below.
> > > > > +        *
> > > > > +        * - IDLE: list_empty(&fcmd_head) && !active_fcmd
> > > > > +        *   No fetch commands available, events queue in evts_fifo
> > > > > +        *
> > > > > +        * - READY: !list_empty(&fcmd_head) && !active_fcmd
> > > > > +        *   Fetch commands available but none processing events
> > > > >          *
> > > > > -        * There are multiple writer from ublk_queue_rq() or ublk_queue_rqs(),
> > > > > -        * so lock is required for storing request tag to fifo
> > > > > +        * - ACTIVE: active_fcmd
> > > > > +        *   One fetch command actively processing events from evts_fifo
> > > > >          *
> > > > > -        * Make sure just one reader for fetching request from task work
> > > > > -        * function to ublk server, so no need to grab the lock in reader
> > > > > -        * side.
> > > > > +        * Key Invariants:
> > > > > +        * - At most one active_fcmd at any time (single reader)
> > > > > +        * - active_fcmd is always from fcmd_head list when non-NULL
> > > > > +        * - evts_fifo can be read locklessly by the single active reader
> > > > > +        * - All state transitions require evts_lock protection
> > > > > +        * - Multiple writers to evts_fifo require lock protection
> > > > >          */
> > > > >         struct {
> > > > >                 DECLARE_KFIFO_PTR(evts_fifo, unsigned short);
> > > > >                 spinlock_t evts_lock;
> > > > > +
> > > > > +               /* List of fetch commands available to process events */
> > > > > +               struct list_head fcmd_head;
> > > > > +
> > > > > +               /* Currently active fetch command (NULL = none active) */
> > > > > +               struct ublk_batch_fcmd  *active_fcmd;
> > > > >         }____cacheline_aligned_in_smp;
> > > > >
> > > > >         struct ublk_io ios[] __counted_by(q_depth);
> > > > > @@ -292,12 +314,20 @@ static void ublk_abort_queue(struct ublk_device *ub, struct ublk_queue *ubq);
> > > > >  static inline struct request *__ublk_check_and_get_req(struct ublk_device *ub,
> > > > >                 u16 q_id, u16 tag, struct ublk_io *io, size_t offset);
> > > > >  static inline unsigned int ublk_req_build_flags(struct request *req);
> > > > > +static void ublk_batch_dispatch(struct ublk_queue *ubq,
> > > > > +                               struct ublk_batch_io_data *data,
> > > > > +                               struct ublk_batch_fcmd *fcmd);
> > > > >
> > > > >  static inline bool ublk_dev_support_batch_io(const struct ublk_device *ub)
> > > > >  {
> > > > >         return false;
> > > > >  }
> > > > >
> > > > > +static inline bool ublk_support_batch_io(const struct ublk_queue *ubq)
> > > > > +{
> > > > > +       return false;
> > > > > +}
> > > > > +
> > > > >  static inline void ublk_io_lock(struct ublk_io *io)
> > > > >  {
> > > > >         spin_lock(&io->lock);
> > > > > @@ -624,13 +654,45 @@ static wait_queue_head_t ublk_idr_wq;     /* wait until one idr is freed */
> > > > >
> > > > >  static DEFINE_MUTEX(ublk_ctl_mutex);
> > > > >
> > > > > +static struct ublk_batch_fcmd *
> > > > > +ublk_batch_alloc_fcmd(struct io_uring_cmd *cmd)
> > > > > +{
> > > > > +       struct ublk_batch_fcmd *fcmd = kzalloc(sizeof(*fcmd), GFP_NOIO);
> > > >
> > > > An allocation in the I/O path seems unfortunate. Is there not room to
> > > > store the struct ublk_batch_fcmd in the io_uring_cmd pdu?
> > >
> > > It is allocated once for one mshot request, which covers many IOs.
> > >
> > > It can't be held in uring_cmd pdu, but the allocation can be optimized in
> > > future. Not a big deal in enablement stage.
> >
> > Okay, seems fine to optimize it in the future.
> >
> > >
> > > > > +
> > > > > +       if (fcmd) {
> > > > > +               fcmd->cmd = cmd;
> > > > > +               fcmd->buf_group = READ_ONCE(cmd->sqe->buf_index);
> > > >
> > > > Is it necessary to store sample this here just to pass it back to the
> > > > io_uring layer? Wouldn't the io_uring layer already have access to it
> > > > in struct io_kiocb's buf_index field?
> > >
> > > ->buf_group is used by io_uring_cmd_buffer_select(), and this way also
> > > follows ->buf_index uses in both io_uring/net.c and io_uring/rw.c.
> > >
> > >
> > > io_ring_buffer_select(), so we can't reuse req->buf_index here.
> >
> > But io_uring/net.c and io_uring/rw.c both retrieve the buf_group value
> > from req->buf_index instead of the SQE, for example:
> > if (req->flags & REQ_F_BUFFER_SELECT)
> >         sr->buf_group = req->buf_index;
> >
> > Seems like it would make sense to do the same for
> > UBLK_U_IO_FETCH_IO_CMDS. That also saves one pointer dereference here.
>
> IMO we shouldn't encourage drivers to access `io_kiocb`; however, cmd->sqe
> is exposed to drivers explicitly.

Right, but we can add a helper in include/linux/io_uring/cmd.h to
encapsulate accessing the io_kiocb field.
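
For example, something along these lines (the helper name and placement are
only illustrative, not an existing API; it assumes the cmd_to_io_kiocb()
conversion and io_kiocb's buf_index field):

```
/* illustrative helper only, not an existing API */
static inline u16 io_uring_cmd_buf_group(struct io_uring_cmd *cmd)
{
	return cmd_to_io_kiocb(cmd)->buf_index;
}
```

Then ublk could fill fcmd->buf_group from the helper (or drop the field)
instead of reading cmd->sqe.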

Best,
Caleb

>
> >
> > >
> > > >
> > > > > +       }
> > > > > +       return fcmd;
> > > > > +}
> > > > > +
> > > > > +static void ublk_batch_free_fcmd(struct ublk_batch_fcmd *fcmd)
> > > > > +{
> > > > > +       kfree(fcmd);
> > > > > +}
> > > > > +
> > > > > +static void __ublk_release_fcmd(struct ublk_queue *ubq)
> > > > > +{
> > > > > +       WRITE_ONCE(ubq->active_fcmd, NULL);
> > > > > +}
> > > > >
> > > > > -static void ublk_batch_deinit_fetch_buf(const struct ublk_batch_io_data *data,
> > > > > +/*
> > > > > + * Nothing can move on, so clear ->active_fcmd, and the caller should stop
> > > > > + * dispatching
> > > > > + */
> > > > > +static void ublk_batch_deinit_fetch_buf(struct ublk_queue *ubq,
> > > > > +                                       const struct ublk_batch_io_data *data,
> > > > >                                         struct ublk_batch_fcmd *fcmd,
> > > > >                                         int res)
> > > > >  {
> > > > > +       spin_lock(&ubq->evts_lock);
> > > > > +       list_del(&fcmd->node);
> > > > > +       WARN_ON_ONCE(fcmd != ubq->active_fcmd);
> > > > > +       __ublk_release_fcmd(ubq);
> > > > > +       spin_unlock(&ubq->evts_lock);
> > > > > +
> > > > >         io_uring_cmd_done(fcmd->cmd, res, data->issue_flags);
> > > > > -       fcmd->cmd = NULL;
> > > > > +       ublk_batch_free_fcmd(fcmd);
> > > > >  }
> > > > >
> > > > >  static int ublk_batch_fetch_post_cqe(struct ublk_batch_fcmd *fcmd,
> > > > > @@ -1491,6 +1553,8 @@ static int __ublk_batch_dispatch(struct ublk_queue *ubq,
> > > > >         bool needs_filter;
> > > > >         int ret;
> > > > >
> > > > > +       WARN_ON_ONCE(data->cmd != fcmd->cmd);
> > > > > +
> > > > >         sel = io_uring_cmd_buffer_select(fcmd->cmd, fcmd->buf_group, &len,
> > > > >                                          data->issue_flags);
> > > > >         if (sel.val < 0)
> > > > > @@ -1548,23 +1612,94 @@ static int __ublk_batch_dispatch(struct ublk_queue *ubq,
> > > > >         return ret;
> > > > >  }
> > > > >
> > > > > -static __maybe_unused int
> > > > > -ublk_batch_dispatch(struct ublk_queue *ubq,
> > > > > -                   const struct ublk_batch_io_data *data,
> > > > > -                   struct ublk_batch_fcmd *fcmd)
> > > > > +static struct ublk_batch_fcmd *__ublk_acquire_fcmd(
> > > > > +               struct ublk_queue *ubq)
> > > > > +{
> > > > > +       struct ublk_batch_fcmd *fcmd;
> > > > > +
> > > > > +       lockdep_assert_held(&ubq->evts_lock);
> > > > > +
> > > > > +       /*
> > > > > +        * Ordering updating ubq->evts_fifo and checking ubq->active_fcmd.
> > > > > +        *
> > > > > +        * The pair is the smp_mb() in ublk_batch_dispatch().
> > > > > +        *
> > > > > +        * If ubq->active_fcmd is observed as non-NULL, the new added tags
> > > > > +        * can be visisible in ublk_batch_dispatch() with the barrier pairing.
> > > > > +        */
> > > > > +       smp_mb();
> > > > > +       if (READ_ONCE(ubq->active_fcmd)) {
> > > > > +               fcmd = NULL;
> > > > > +       } else {
> > > > > +               fcmd = list_first_entry_or_null(&ubq->fcmd_head,
> > > > > +                               struct ublk_batch_fcmd, node);
> > > > > +               WRITE_ONCE(ubq->active_fcmd, fcmd);
> > > > > +       }
> > > > > +       return fcmd;
> > > > > +}
> > > > > +
> > > > > +static void ublk_batch_tw_cb(struct io_uring_cmd *cmd,
> > > > > +                          unsigned int issue_flags)
> > > > > +{
> > > > > +       struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
> > > > > +       struct ublk_batch_fcmd *fcmd = pdu->fcmd;
> > > > > +       struct ublk_batch_io_data data = {
> > > > > +               .ub = pdu->ubq->dev,
> > > > > +               .cmd = fcmd->cmd,
> > > > > +               .issue_flags = issue_flags,
> > > > > +       };
> > > > > +
> > > > > +       WARN_ON_ONCE(pdu->ubq->active_fcmd != fcmd);
> > > > > +
> > > > > +       ublk_batch_dispatch(pdu->ubq, &data, fcmd);
> > > > > +}
> > > > > +
> > > > > +static void ublk_batch_dispatch(struct ublk_queue *ubq,
> > > > > +                               struct ublk_batch_io_data *data,
> > > > > +                               struct ublk_batch_fcmd *fcmd)
> > > > >  {
> > > > > +       struct ublk_batch_fcmd *new_fcmd;
> > > >
> > > > Is the new_fcmd variable necessary? Can fcmd be reused instead?
> > > >
> > > > > +       void *handle;
> > > > > +       bool empty;
> > > > >         int ret = 0;
> > > > >
> > > > > +again:
> > > > >         while (!ublk_io_evts_empty(ubq)) {
> > > > >                 ret = __ublk_batch_dispatch(ubq, data, fcmd);
> > > > >                 if (ret <= 0)
> > > > >                         break;
> > > > >         }
> > > > >
> > > > > -       if (ret < 0)
> > > > > -               ublk_batch_deinit_fetch_buf(data, fcmd, ret);
> > > > > +       if (ret < 0) {
> > > > > +               ublk_batch_deinit_fetch_buf(ubq, data, fcmd, ret);
> > > > > +               return;
> > > > > +       }
> > > > >
> > > > > -       return ret;
> > > > > +       handle = io_uring_cmd_ctx_handle(fcmd->cmd);
> > > > > +       __ublk_release_fcmd(ubq);
> > > > > +       /*
> > > > > +        * Order clearing ubq->active_fcmd from __ublk_release_fcmd() and
> > > > > +        * checking ubq->evts_fifo.
> > > > > +        *
> > > > > +        * The pair is the smp_mb() in __ublk_acquire_fcmd().
> > > > > +        */
> > > > > +       smp_mb();
> > > > > +       empty = ublk_io_evts_empty(ubq);
> > > > > +       if (likely(empty))
> > > >
> > > > nit: empty variable seems unnecessary
> > > >
> > > > > +               return;
> > > > > +
> > > > > +       spin_lock(&ubq->evts_lock);
> > > > > +       new_fcmd = __ublk_acquire_fcmd(ubq);
> > > > > +       spin_unlock(&ubq->evts_lock);
> > > > > +
> > > > > +       if (!new_fcmd)
> > > > > +               return;
> > > > > +       if (handle == io_uring_cmd_ctx_handle(new_fcmd->cmd)) {
> > > >
> > > > This check seems to be meant to decide whether the new and old
> > > > UBLK_U_IO_FETCH_IO_CMDS commands can execute in the same task work?
> > >
> > > Actually not.
> > >
> > > > But belonging to the same io_uring context doesn't necessarily mean
> > > > that the same task issued them. It seems like it would be safer to
> > > > always dispatch new_fcmd->cmd to task work.
> > >
> > > What matters is just that ctx->uring_lock & issue_flag matches from ublk
> > > viewpoint, so it is safe to do so.
> >
> > Okay, that makes sense.
> >
> > >
> > > However, given it is hit in slow path, so starting new dispatch
> > > is easier.
> >
> > Yeah, I'd agree it makes sense to keep the unexpected path code
> > simpler. There may also be fairness concerns from looping indefinitely
> > here if the evts_fifo continues to be nonempty, so dispatching to task
> > work seems safer.
>
> Fair enough.
>
> Thanks,
> Ming
>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH V4 16/27] ublk: add new feature UBLK_F_BATCH_IO
  2025-12-01 21:16   ` Caleb Sander Mateos
@ 2025-12-02  1:44     ` Ming Lei
  2025-12-02 16:05       ` Caleb Sander Mateos
  0 siblings, 1 reply; 66+ messages in thread
From: Ming Lei @ 2025-12-02  1:44 UTC (permalink / raw)
  To: Caleb Sander Mateos
  Cc: Jens Axboe, linux-block, Uday Shankar, Stefani Seibold,
	Andrew Morton, linux-kernel

On Mon, Dec 01, 2025 at 01:16:04PM -0800, Caleb Sander Mateos wrote:
> On Thu, Nov 20, 2025 at 6:00 PM Ming Lei <ming.lei@redhat.com> wrote:
> >
> > Add new feature UBLK_F_BATCH_IO which replaces the following two
> > per-io commands:
> >
> >         - UBLK_U_IO_FETCH_REQ
> >
> >         - UBLK_U_IO_COMMIT_AND_FETCH_REQ
> >
> > with three per-queue batch io uring_cmd:
> >
> >         - UBLK_U_IO_PREP_IO_CMDS
> >
> >         - UBLK_U_IO_COMMIT_IO_CMDS
> >
> >         - UBLK_U_IO_FETCH_IO_CMDS
> >
> > Then ublk can deliver batch io commands to ublk server in single
> > multishort uring_cmd, also allows to prepare & commit multiple
> > commands in batch style via single uring_cmd, communication cost is
> > reduced a lot.
> >
> > This feature also doesn't limit task context any more for all supported
> > commands, so any allowed uring_cmd can be issued in any task context.
> > ublk server implementation becomes much easier.
> >
> > Meantime load balance becomes much easier to support with this feature.
> > The command `UBLK_U_IO_FETCH_IO_CMDS` can be issued from multiple task
> > contexts, so each task can adjust this command's buffer length or number
> > of inflight commands for controlling how much load is handled by current
> > task.
> >
> > Later, priority parameter will be added to command `UBLK_U_IO_FETCH_IO_CMDS`
> > for improving load balance support.
> >
> > UBLK_U_IO_GET_DATA isn't supported in batch io yet, but it may be
> 
> UBLK_U_IO_NEED_GET_DATA?

Yeah.

> 
> > enabled in future via its batch pair.
> >
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >  drivers/block/ublk_drv.c      | 58 ++++++++++++++++++++++++++++++++---
> >  include/uapi/linux/ublk_cmd.h | 16 ++++++++++
> >  2 files changed, 69 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> > index 849199771f86..90cd1863bc83 100644
> > --- a/drivers/block/ublk_drv.c
> > +++ b/drivers/block/ublk_drv.c
> > @@ -74,7 +74,8 @@
> >                 | UBLK_F_AUTO_BUF_REG \
> >                 | UBLK_F_QUIESCE \
> >                 | UBLK_F_PER_IO_DAEMON \
> > -               | UBLK_F_BUF_REG_OFF_DAEMON)
> > +               | UBLK_F_BUF_REG_OFF_DAEMON \
> > +               | UBLK_F_BATCH_IO)
> >
> >  #define UBLK_F_ALL_RECOVERY_FLAGS (UBLK_F_USER_RECOVERY \
> >                 | UBLK_F_USER_RECOVERY_REISSUE \
> > @@ -320,12 +321,12 @@ static void ublk_batch_dispatch(struct ublk_queue *ubq,
> >
> >  static inline bool ublk_dev_support_batch_io(const struct ublk_device *ub)
> >  {
> > -       return false;
> > +       return ub->dev_info.flags & UBLK_F_BATCH_IO;
> >  }
> >
> >  static inline bool ublk_support_batch_io(const struct ublk_queue *ubq)
> >  {
> > -       return false;
> > +       return ubq->flags & UBLK_F_BATCH_IO;
> >  }
> >
> >  static inline void ublk_io_lock(struct ublk_io *io)
> > @@ -3450,6 +3451,41 @@ static int ublk_validate_batch_fetch_cmd(struct ublk_batch_io_data *data,
> >         return 0;
> >  }
> >
> > +static int ublk_handle_non_batch_cmd(struct io_uring_cmd *cmd,
> > +                                    unsigned int issue_flags)
> > +{
> > +       const struct ublksrv_io_cmd *ub_cmd = io_uring_sqe_cmd(cmd->sqe);
> > +       struct ublk_device *ub = cmd->file->private_data;
> > +       unsigned tag = READ_ONCE(ub_cmd->tag);
> > +       unsigned q_id = READ_ONCE(ub_cmd->q_id);
> > +       unsigned index = READ_ONCE(ub_cmd->addr);
> > +       struct ublk_queue *ubq;
> > +       struct ublk_io *io;
> > +       int ret = -EINVAL;
> 
> I think it would be clearer to just return -EINVAL instead of adding
> this variable, but up to you
> 
> > +
> > +       if (!ub)
> > +               return ret;
> 
> How is this case possible?

Will remove the check.

> 
> > +
> > +       if (q_id >= ub->dev_info.nr_hw_queues)
> > +               return ret;
> > +
> > +       ubq = ublk_get_queue(ub, q_id);
> > +       if (tag >= ubq->q_depth)
> 
> Can avoid the likely cache miss here by using ub->dev_info.queue_depth
> instead, analogous to ublk_ch_uring_cmd_local()

OK.

> 
> > +               return ret;
> > +
> > +       io = &ubq->ios[tag];
> > +
> > +       switch (cmd->cmd_op) {
> > +       case UBLK_U_IO_REGISTER_IO_BUF:
> > +               return ublk_register_io_buf(cmd, ub, q_id, tag, io, index,
> > +                               issue_flags);
> > +       case UBLK_U_IO_UNREGISTER_IO_BUF:
> > +               return ublk_unregister_io_buf(cmd, ub, index, issue_flags);
> > +       default:
> > +               return -EOPNOTSUPP;
> > +       }
> > +}
> > +
> >  static int ublk_ch_batch_io_uring_cmd(struct io_uring_cmd *cmd,
> >                                        unsigned int issue_flags)
> >  {
> > @@ -3497,7 +3533,8 @@ static int ublk_ch_batch_io_uring_cmd(struct io_uring_cmd *cmd,
> >                 ret = ublk_handle_batch_fetch_cmd(&data);
> >                 break;
> >         default:
> > -               ret = -EOPNOTSUPP;
> > +               ret = ublk_handle_non_batch_cmd(cmd, issue_flags);
> 
> We should probably skip the if (data.header.q_id >=
> ub->dev_info.nr_hw_queues) check for a non-batch command?

It is true only for UBLK_IO_UNREGISTER_IO_BUF; UBLK_IO_REGISTER_IO_BUF still
needs a valid q_id/tag.

> 
> > +               break;
> >         }
> >  out:
> >         return ret;
> > @@ -4163,9 +4200,13 @@ static int ublk_ctrl_add_dev(const struct ublksrv_ctrl_cmd *header)
> >
> >         ub->dev_info.flags |= UBLK_F_CMD_IOCTL_ENCODE |
> >                 UBLK_F_URING_CMD_COMP_IN_TASK |
> > -               UBLK_F_PER_IO_DAEMON |
> > +               (ublk_dev_support_batch_io(ub) ? 0 : UBLK_F_PER_IO_DAEMON) |
> 
> Seems redundant with the logic below to clear UBLK_F_PER_IO_DAEMON if
> (ublk_dev_support_batch_io(ub))?

Good catch.

> 
> >                 UBLK_F_BUF_REG_OFF_DAEMON;
> >
> > +       /* So far, UBLK_F_PER_IO_DAEMON won't be exposed for BATCH_IO */
> > +       if (ublk_dev_support_batch_io(ub))
> > +               ub->dev_info.flags &= ~UBLK_F_PER_IO_DAEMON;
> > +
> >         /* GET_DATA isn't needed any more with USER_COPY or ZERO COPY */
> >         if (ub->dev_info.flags & (UBLK_F_USER_COPY | UBLK_F_SUPPORT_ZERO_COPY |
> >                                 UBLK_F_AUTO_BUF_REG))
> > @@ -4518,6 +4559,13 @@ static int ublk_wait_for_idle_io(struct ublk_device *ub,
> >         unsigned int elapsed = 0;
> >         int ret;
> >
> > +       /*
> > +        * For UBLK_F_BATCH_IO ublk server can get notified with existing
> > +        * or new fetch command, so needn't wait any more
> > +        */
> > +       if (ublk_dev_support_batch_io(ub))
> > +               return 0;
> > +
> >         while (elapsed < timeout_ms && !signal_pending(current)) {
> >                 unsigned int queues_cancelable = 0;
> >                 int i;
> > diff --git a/include/uapi/linux/ublk_cmd.h b/include/uapi/linux/ublk_cmd.h
> > index cd894c1d188e..5e8b1211b7f4 100644
> > --- a/include/uapi/linux/ublk_cmd.h
> > +++ b/include/uapi/linux/ublk_cmd.h
> > @@ -335,6 +335,22 @@
> >   */
> >  #define UBLK_F_BUF_REG_OFF_DAEMON (1ULL << 14)
> >
> > +
> > +/*
> > + * Support the following commands for delivering & committing io command
> > + * in batch.
> > + *
> > + *     - UBLK_U_IO_PREP_IO_CMDS
> > + *     - UBLK_U_IO_COMMIT_IO_CMDS
> > + *     - UBLK_U_IO_FETCH_IO_CMDS
> > + *     - UBLK_U_IO_REGISTER_IO_BUF
> > + *     - UBLK_U_IO_UNREGISTER_IO_BUF
> 
> Seems like it might make sense to provided batched versions of
> UBLK_U_IO_REGISTER_IO_BUF and UBLK_U_IO_UNREGISTER_IO_BUF. That could
> be done in the future, I guess, but it might simplify
> ublk_ch_batch_io_uring_cmd() to only have to handle struct
> ublk_batch_io.

Agree, and it can be added in the future.




Thanks,
Ming


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH V4 17/27] ublk: document feature UBLK_F_BATCH_IO
  2025-12-01 21:46   ` Caleb Sander Mateos
@ 2025-12-02  1:55     ` Ming Lei
  2025-12-02  2:03     ` Ming Lei
  1 sibling, 0 replies; 66+ messages in thread
From: Ming Lei @ 2025-12-02  1:55 UTC (permalink / raw)
  To: Caleb Sander Mateos
  Cc: Jens Axboe, linux-block, Uday Shankar, Stefani Seibold,
	Andrew Morton, linux-kernel

On Mon, Dec 01, 2025 at 01:46:19PM -0800, Caleb Sander Mateos wrote:
> On Thu, Nov 20, 2025 at 6:00 PM Ming Lei <ming.lei@redhat.com> wrote:
> >
> > Document feature UBLK_F_BATCH_IO.
> >
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >  Documentation/block/ublk.rst | 60 +++++++++++++++++++++++++++++++++---
> >  1 file changed, 56 insertions(+), 4 deletions(-)
> >
> > diff --git a/Documentation/block/ublk.rst b/Documentation/block/ublk.rst
> > index 8c4030bcabb6..09a5604f8e10 100644
> > --- a/Documentation/block/ublk.rst
> > +++ b/Documentation/block/ublk.rst
> > @@ -260,9 +260,12 @@ The following IO commands are communicated via io_uring passthrough command,
> >  and each command is only for forwarding the IO and committing the result
> >  with specified IO tag in the command data:
> >
> > -- ``UBLK_IO_FETCH_REQ``
> > +Traditional Per-I/O Commands
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >
> > -  Sent from the server IO pthread for fetching future incoming IO requests
> > +- ``UBLK_U_IO_FETCH_REQ``
> > +
> > +  Sent from the server I/O pthread for fetching future incoming I/O requests
> >    destined to ``/dev/ublkb*``. This command is sent only once from the server
> >    IO pthread for ublk driver to setup IO forward environment.
> >
> > @@ -278,7 +281,7 @@ with specified IO tag in the command data:
> >    supported by the driver, daemons must be per-queue instead - i.e. all I/Os
> >    associated to a single qid must be handled by the same task.
> >
> > -- ``UBLK_IO_COMMIT_AND_FETCH_REQ``
> > +- ``UBLK_U_IO_COMMIT_AND_FETCH_REQ``
> >
> >    When an IO request is destined to ``/dev/ublkb*``, the driver stores
> >    the IO's ``ublksrv_io_desc`` to the specified mapped area; then the
> > @@ -293,7 +296,7 @@ with specified IO tag in the command data:
> >    requests with the same IO tag. That is, ``UBLK_IO_COMMIT_AND_FETCH_REQ``
> >    is reused for both fetching request and committing back IO result.
> >
> > -- ``UBLK_IO_NEED_GET_DATA``
> > +- ``UBLK_U_IO_NEED_GET_DATA``
> >
> >    With ``UBLK_F_NEED_GET_DATA`` enabled, the WRITE request will be firstly
> >    issued to ublk server without data copy. Then, IO backend of ublk server
> > @@ -322,6 +325,55 @@ with specified IO tag in the command data:
> >    ``UBLK_IO_COMMIT_AND_FETCH_REQ`` to the server, ublkdrv needs to copy
> >    the server buffer (pages) read to the IO request pages.
> >
> > +Batch I/O Commands (UBLK_F_BATCH_IO)
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +The ``UBLK_F_BATCH_IO`` feature provides an alternative high-performance
> > +I/O handling model that replaces the traditional per-I/O commands with
> > +per-queue batch commands. This significantly reduces communication overhead
> > +and enables better load balancing across multiple server tasks.
> > +
> > +Key differences from traditional mode:
> > +
> > +- **Per-queue vs Per-I/O**: Commands operate on queues rather than individual I/Os
> > +- **Batch processing**: Multiple I/Os are handled in single operations
> > +- **Multishot commands**: Use io_uring multishot for reduced submission overhead
> > +- **Flexible task assignment**: Any task can handle any I/O (no per-I/O daemons)
> > +- **Better load balancing**: Tasks can adjust their workload dynamically
> > +
> > +Batch I/O Commands:
> > +
> > +- ``UBLK_U_IO_PREP_IO_CMDS``
> > +
> > +  Prepares multiple I/O commands in batch. The server provides a buffer
> > +  containing multiple I/O descriptors that will be processed together.
> > +  This reduces the number of individual command submissions required.
> > +
> > +- ``UBLK_U_IO_COMMIT_IO_CMDS``
> > +
> > +  Commits results for multiple I/O operations in batch. The server provides
> 
> And prepares the I/O descriptors to accept new requests?

Yeah, will add it in the next version.

> 
> > +  a buffer containing the results of multiple completed I/Os, allowing
> > +  efficient bulk completion of requests.
> > +
> > +- ``UBLK_U_IO_FETCH_IO_CMDS``
> > +
> > +  **Multishot command** for fetching I/O commands in batch. This is the key
> > +  command that enables high-performance batch processing:
> > +
> > +  * Uses io_uring multishot capability for reduced submission overhead
> > +  * Single command can fetch multiple I/O requests over time
> > +  * Buffer size determines maximum batch size per operation
> > +  * Multiple fetch commands can be submitted for load balancing
> > +  * Only one fetch command is active at any time per queue
> 
> Can you clarify what the lifetime of the fetch command is? It looks
> like as long as the buffer selection and posting of the multishot CQE
> succeeds, the same UBLK_U_IO_FETCH_IO_CMDS will continue to be used.

Yeah, it means the provided buffer isn't full yet.

> If additional UBLK_U_IO_FETCH_IO_CMDS commands are issued to the queue
> (e.g. by other threads), they won't be used until the first one fails
> to select a buffer or post the CQE? Seems like this would make it
> difficult to load-balance incoming requests on a single ublk queue
> between multiple threads.

So far, fetch commands are added in FIFO order, so a new fetch command can
only be handled after the older commands are done. The ublk server can still
support simple load balancing by adjusting the fetch buffer size dynamically,
as sketched below:

- if one pthread is close to saturation, its fetch buffer size can be reduced

- if one pthread has spare capacity, its fetch buffer size can be increased

In the future, it should be easy to introduce a fetch command priority, so the
ublk server can balance load by controlling either the fetch buffer size or
the command priority.
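
For instance, each server thread could do something like this when (re)arming
its fetch command (pure illustration; the thresholds and the MIN/MAX/DEF
constants are made up):

```
/* illustration only: size the provided-buffer ring backing this thread's
 * UBLK_U_IO_FETCH_IO_CMDS according to how loaded the thread currently is
 */
static unsigned int pick_fetch_buf_size(unsigned int inflight,
					unsigned int capacity)
{
	if (inflight > capacity * 3 / 4)
		return MIN_FETCH_BUF_SIZE;	/* close to saturation: fetch less */
	if (inflight < capacity / 4)
		return MAX_FETCH_BUF_SIZE;	/* lots of headroom: fetch more */
	return DEF_FETCH_BUF_SIZE;
}
```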


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH V4 17/27] ublk: document feature UBLK_F_BATCH_IO
  2025-12-01 21:46   ` Caleb Sander Mateos
  2025-12-02  1:55     ` Ming Lei
@ 2025-12-02  2:03     ` Ming Lei
  1 sibling, 0 replies; 66+ messages in thread
From: Ming Lei @ 2025-12-02  2:03 UTC (permalink / raw)
  To: Caleb Sander Mateos
  Cc: Jens Axboe, linux-block, Uday Shankar, Stefani Seibold,
	Andrew Morton, linux-kernel

On Mon, Dec 01, 2025 at 01:46:19PM -0800, Caleb Sander Mateos wrote:
> On Thu, Nov 20, 2025 at 6:00 PM Ming Lei <ming.lei@redhat.com> wrote:
> >
> > Document feature UBLK_F_BATCH_IO.
> >
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >  Documentation/block/ublk.rst | 60 +++++++++++++++++++++++++++++++++---
> >  1 file changed, 56 insertions(+), 4 deletions(-)
> >
> > diff --git a/Documentation/block/ublk.rst b/Documentation/block/ublk.rst
> > index 8c4030bcabb6..09a5604f8e10 100644
> > --- a/Documentation/block/ublk.rst
> > +++ b/Documentation/block/ublk.rst
> > @@ -260,9 +260,12 @@ The following IO commands are communicated via io_uring passthrough command,
> >  and each command is only for forwarding the IO and committing the result
> >  with specified IO tag in the command data:
> >
> > -- ``UBLK_IO_FETCH_REQ``
> > +Traditional Per-I/O Commands
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >
> > -  Sent from the server IO pthread for fetching future incoming IO requests
> > +- ``UBLK_U_IO_FETCH_REQ``
> > +
> > +  Sent from the server I/O pthread for fetching future incoming I/O requests
> >    destined to ``/dev/ublkb*``. This command is sent only once from the server
> >    IO pthread for ublk driver to setup IO forward environment.
> >
> > @@ -278,7 +281,7 @@ with specified IO tag in the command data:
> >    supported by the driver, daemons must be per-queue instead - i.e. all I/Os
> >    associated to a single qid must be handled by the same task.
> >
> > -- ``UBLK_IO_COMMIT_AND_FETCH_REQ``
> > +- ``UBLK_U_IO_COMMIT_AND_FETCH_REQ``
> >
> >    When an IO request is destined to ``/dev/ublkb*``, the driver stores
> >    the IO's ``ublksrv_io_desc`` to the specified mapped area; then the
> > @@ -293,7 +296,7 @@ with specified IO tag in the command data:
> >    requests with the same IO tag. That is, ``UBLK_IO_COMMIT_AND_FETCH_REQ``
> >    is reused for both fetching request and committing back IO result.
> >
> > -- ``UBLK_IO_NEED_GET_DATA``
> > +- ``UBLK_U_IO_NEED_GET_DATA``
> >
> >    With ``UBLK_F_NEED_GET_DATA`` enabled, the WRITE request will be firstly
> >    issued to ublk server without data copy. Then, IO backend of ublk server
> > @@ -322,6 +325,55 @@ with specified IO tag in the command data:
> >    ``UBLK_IO_COMMIT_AND_FETCH_REQ`` to the server, ublkdrv needs to copy
> >    the server buffer (pages) read to the IO request pages.
> >
> > +Batch I/O Commands (UBLK_F_BATCH_IO)
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +The ``UBLK_F_BATCH_IO`` feature provides an alternative high-performance
> > +I/O handling model that replaces the traditional per-I/O commands with
> > +per-queue batch commands. This significantly reduces communication overhead
> > +and enables better load balancing across multiple server tasks.
> > +
> > +Key differences from traditional mode:
> > +
> > +- **Per-queue vs Per-I/O**: Commands operate on queues rather than individual I/Os
> > +- **Batch processing**: Multiple I/Os are handled in single operations
> > +- **Multishot commands**: Use io_uring multishot for reduced submission overhead
> > +- **Flexible task assignment**: Any task can handle any I/O (no per-I/O daemons)
> > +- **Better load balancing**: Tasks can adjust their workload dynamically
> > +
> > +Batch I/O Commands:
> > +
> > +- ``UBLK_U_IO_PREP_IO_CMDS``
> > +
> > +  Prepares multiple I/O commands in batch. The server provides a buffer
> > +  containing multiple I/O descriptors that will be processed together.
> > +  This reduces the number of individual command submissions required.
> > +
> > +- ``UBLK_U_IO_COMMIT_IO_CMDS``
> > +
> > +  Commits results for multiple I/O operations in batch. The server provides
> 
> And prepares the I/O descriptors to accept new requests?
> 
> > +  a buffer containing the results of multiple completed I/Os, allowing
> > +  efficient bulk completion of requests.
> > +
> > +- ``UBLK_U_IO_FETCH_IO_CMDS``
> > +
> > +  **Multishot command** for fetching I/O commands in batch. This is the key
> > +  command that enables high-performance batch processing:
> > +
> > +  * Uses io_uring multishot capability for reduced submission overhead
> > +  * Single command can fetch multiple I/O requests over time
> > +  * Buffer size determines maximum batch size per operation
> > +  * Multiple fetch commands can be submitted for load balancing
> > +  * Only one fetch command is active at any time per queue
> 
> Can you clarify what the lifetime of the fetch command is? It looks

The fetch command stays live as long as the provided buffer isn't full, which
aligns with the typical io_uring multishot request & provided buffer use case,
such as IORING_OP_READ_MULTISHOT.

The fetch command is also completed in case of a fetch failure.

```
A multishot request will persist as long as no errors are encountered doing
handling of the request. For each CQE posted on behalf of this request, the
CQE flags will have IORING_CQE_F_MORE set if the application should expect
more completions from this request. If this flag isn’t set, then that signifies
termination of the multishot read request.
```
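
From the server side it is the usual multishot pattern, e.g. (sketch only;
handle_fetched_tags() and resubmit_fetch_cmd() are hypothetical helpers):

```
struct io_uring_cqe *cqe;

io_uring_wait_cqe(&ring, &cqe);
handle_fetched_tags(cqe);		/* consume tags from the selected buffer */
if (!(cqe->flags & IORING_CQE_F_MORE))
	/* multishot terminated: re-issue UBLK_U_IO_FETCH_IO_CMDS */
	resubmit_fetch_cmd(&ring);
io_uring_cqe_seen(&ring, cqe);
```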

> like as long as the buffer selection and posting of the multishot CQE
> succeeds, the same UBLK_U_IO_FETCH_IO_CMDS will continue to be used.
> If additional UBLK_U_IO_FETCH_IO_CMDS commands are issued to the queue
> (e.g. by other threads), they won't be used until the first one fails
> to select a buffer or post the CQE? Seems like this would make it
> difficult to load-balance incoming requests on a single ublk queue
> between multiple threads.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH V4 14/27] ublk: add UBLK_U_IO_FETCH_IO_CMDS for batch I/O processing
  2025-12-02  1:39           ` Caleb Sander Mateos
@ 2025-12-02  8:14             ` Ming Lei
  2025-12-02 15:20               ` Caleb Sander Mateos
  0 siblings, 1 reply; 66+ messages in thread
From: Ming Lei @ 2025-12-02  8:14 UTC (permalink / raw)
  To: Caleb Sander Mateos
  Cc: Jens Axboe, linux-block, Uday Shankar, Stefani Seibold,
	Andrew Morton, linux-kernel

On Mon, Dec 01, 2025 at 05:39:29PM -0800, Caleb Sander Mateos wrote:
> On Mon, Dec 1, 2025 at 5:27 PM Ming Lei <ming.lei@redhat.com> wrote:
> >
> > On Mon, Dec 01, 2025 at 09:51:59AM -0800, Caleb Sander Mateos wrote:
> > > On Mon, Dec 1, 2025 at 1:42 AM Ming Lei <ming.lei@redhat.com> wrote:
> > > >
> > > > On Sun, Nov 30, 2025 at 09:55:47PM -0800, Caleb Sander Mateos wrote:
> > > > > On Thu, Nov 20, 2025 at 6:00 PM Ming Lei <ming.lei@redhat.com> wrote:
> > > > > >
> > > > > > Add UBLK_U_IO_FETCH_IO_CMDS command to enable efficient batch processing
> > > > > > of I/O requests. This multishot uring_cmd allows the ublk server to fetch
> > > > > > multiple I/O commands in a single operation, significantly reducing
> > > > > > submission overhead compared to individual FETCH_REQ* commands.
> > > > > >
> > > > > > Key Design Features:
> > > > > >
> > > > > > 1. Multishot Operation: One UBLK_U_IO_FETCH_IO_CMDS can fetch many I/O
> > > > > >    commands, with the batch size limited by the provided buffer length.
> > > > > >
> > > > > > 2. Dynamic Load Balancing: Multiple fetch commands can be submitted
> > > > > >    simultaneously, but only one is active at any time. This enables
> > > > > >    efficient load distribution across multiple server task contexts.
> > > > > >
> > > > > > 3. Implicit State Management: The implementation uses three key variables
> > > > > >    to track state:
> > > > > >    - evts_fifo: Queue of request tags awaiting processing
> > > > > >    - fcmd_head: List of available fetch commands
> > > > > >    - active_fcmd: Currently active fetch command (NULL = none active)
> > > > > >
> > > > > >    States are derived implicitly:
> > > > > >    - IDLE: No fetch commands available
> > > > > >    - READY: Fetch commands available, none active
> > > > > >    - ACTIVE: One fetch command processing events
> > > > > >
> > > > > > 4. Lockless Reader Optimization: The active fetch command can read from
> > > > > >    evts_fifo without locking (single reader guarantee), while writers
> > > > > >    (ublk_queue_rq/ublk_queue_rqs) use evts_lock protection. The memory
> > > > > >    barrier pairing plays key role for the single lockless reader
> > > > > >    optimization.
> > > > > >
> > > > > > Implementation Details:
> > > > > >
> > > > > > - ublk_queue_rq() and ublk_queue_rqs() save request tags to evts_fifo
> > > > > > - __ublk_pick_active_fcmd() selects an available fetch command when
> > > > > >   events arrive and no command is currently active
> > > > >
> > > > > What is __ublk_pick_active_fcmd()? I don't see a function with that name.
> > > >
> > > > It is renamed as __ublk_acquire_fcmd(), and its counter pair is
> > > > __ublk_release_fcmd().
> > >
> > > Okay, update the commit message then?
> > >
> > > >
> > > > >
> > > > > > - ublk_batch_dispatch() moves tags from evts_fifo to the fetch command's
> > > > > >   buffer and posts completion via io_uring_mshot_cmd_post_cqe()
> > > > > > - State transitions are coordinated via evts_lock to maintain consistency
> > > > > >
> > > > > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > > > > ---
> > > > > >  drivers/block/ublk_drv.c      | 412 +++++++++++++++++++++++++++++++---
> > > > > >  include/uapi/linux/ublk_cmd.h |   7 +
> > > > > >  2 files changed, 388 insertions(+), 31 deletions(-)
> > > > > >
> > > > > > diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> > > > > > index cc9c92d97349..2e5e392c939e 100644
> > > > > > --- a/drivers/block/ublk_drv.c
> > > > > > +++ b/drivers/block/ublk_drv.c
> > > > > > @@ -93,6 +93,7 @@
> > > > > >
> > > > > >  /* ublk batch fetch uring_cmd */
> > > > > >  struct ublk_batch_fcmd {
> > > > > > +       struct list_head node;
> > > > > >         struct io_uring_cmd *cmd;
> > > > > >         unsigned short buf_group;
> > > > > >  };
> > > > > > @@ -117,7 +118,10 @@ struct ublk_uring_cmd_pdu {
> > > > > >          */
> > > > > >         struct ublk_queue *ubq;
> > > > > >
> > > > > > -       u16 tag;
> > > > > > +       union {
> > > > > > +               u16 tag;
> > > > > > +               struct ublk_batch_fcmd *fcmd; /* batch io only */
> > > > > > +       };
> > > > > >  };
> > > > > >
> > > > > >  struct ublk_batch_io_data {
> > > > > > @@ -229,18 +233,36 @@ struct ublk_queue {
> > > > > >         struct ublk_device *dev;
> > > > > >
> > > > > >         /*
> > > > > > -        * Inflight ublk request tag is saved in this fifo
> > > > > > +        * Batch I/O State Management:
> > > > > > +        *
> > > > > > +        * The batch I/O system uses implicit state management based on the
> > > > > > +        * combination of three key variables below.
> > > > > > +        *
> > > > > > +        * - IDLE: list_empty(&fcmd_head) && !active_fcmd
> > > > > > +        *   No fetch commands available, events queue in evts_fifo
> > > > > > +        *
> > > > > > +        * - READY: !list_empty(&fcmd_head) && !active_fcmd
> > > > > > +        *   Fetch commands available but none processing events
> > > > > >          *
> > > > > > -        * There are multiple writer from ublk_queue_rq() or ublk_queue_rqs(),
> > > > > > -        * so lock is required for storing request tag to fifo
> > > > > > +        * - ACTIVE: active_fcmd
> > > > > > +        *   One fetch command actively processing events from evts_fifo
> > > > > >          *
> > > > > > -        * Make sure just one reader for fetching request from task work
> > > > > > -        * function to ublk server, so no need to grab the lock in reader
> > > > > > -        * side.
> > > > > > +        * Key Invariants:
> > > > > > +        * - At most one active_fcmd at any time (single reader)
> > > > > > +        * - active_fcmd is always from fcmd_head list when non-NULL
> > > > > > +        * - evts_fifo can be read locklessly by the single active reader
> > > > > > +        * - All state transitions require evts_lock protection
> > > > > > +        * - Multiple writers to evts_fifo require lock protection
> > > > > >          */
> > > > > >         struct {
> > > > > >                 DECLARE_KFIFO_PTR(evts_fifo, unsigned short);
> > > > > >                 spinlock_t evts_lock;
> > > > > > +
> > > > > > +               /* List of fetch commands available to process events */
> > > > > > +               struct list_head fcmd_head;
> > > > > > +
> > > > > > +               /* Currently active fetch command (NULL = none active) */
> > > > > > +               struct ublk_batch_fcmd  *active_fcmd;
> > > > > >         }____cacheline_aligned_in_smp;
> > > > > >
> > > > > >         struct ublk_io ios[] __counted_by(q_depth);
> > > > > > @@ -292,12 +314,20 @@ static void ublk_abort_queue(struct ublk_device *ub, struct ublk_queue *ubq);
> > > > > >  static inline struct request *__ublk_check_and_get_req(struct ublk_device *ub,
> > > > > >                 u16 q_id, u16 tag, struct ublk_io *io, size_t offset);
> > > > > >  static inline unsigned int ublk_req_build_flags(struct request *req);
> > > > > > +static void ublk_batch_dispatch(struct ublk_queue *ubq,
> > > > > > +                               struct ublk_batch_io_data *data,
> > > > > > +                               struct ublk_batch_fcmd *fcmd);
> > > > > >
> > > > > >  static inline bool ublk_dev_support_batch_io(const struct ublk_device *ub)
> > > > > >  {
> > > > > >         return false;
> > > > > >  }
> > > > > >
> > > > > > +static inline bool ublk_support_batch_io(const struct ublk_queue *ubq)
> > > > > > +{
> > > > > > +       return false;
> > > > > > +}
> > > > > > +
> > > > > >  static inline void ublk_io_lock(struct ublk_io *io)
> > > > > >  {
> > > > > >         spin_lock(&io->lock);
> > > > > > @@ -624,13 +654,45 @@ static wait_queue_head_t ublk_idr_wq;     /* wait until one idr is freed */
> > > > > >
> > > > > >  static DEFINE_MUTEX(ublk_ctl_mutex);
> > > > > >
> > > > > > +static struct ublk_batch_fcmd *
> > > > > > +ublk_batch_alloc_fcmd(struct io_uring_cmd *cmd)
> > > > > > +{
> > > > > > +       struct ublk_batch_fcmd *fcmd = kzalloc(sizeof(*fcmd), GFP_NOIO);
> > > > >
> > > > > An allocation in the I/O path seems unfortunate. Is there not room to
> > > > > store the struct ublk_batch_fcmd in the io_uring_cmd pdu?
> > > >
> > > > It is allocated once for one mshot request, which covers many IOs.
> > > >
> > > > It can't be held in uring_cmd pdu, but the allocation can be optimized in
> > > > future. Not a big deal in enablement stage.
> > >
> > > Okay, seems fine to optimize it in the future.
> > >
> > > >
> > > > > > +
> > > > > > +       if (fcmd) {
> > > > > > +               fcmd->cmd = cmd;
> > > > > > +               fcmd->buf_group = READ_ONCE(cmd->sqe->buf_index);
> > > > >
> > > > > Is it necessary to store sample this here just to pass it back to the
> > > > > io_uring layer? Wouldn't the io_uring layer already have access to it
> > > > > in struct io_kiocb's buf_index field?
> > > >
> > > > ->buf_group is used by io_uring_cmd_buffer_select(), and this way also
> > > > follows ->buf_index uses in both io_uring/net.c and io_uring/rw.c.
> > > >
> > > > Also req->buf_index is overwritten with the selected buffer ID in
> > > > io_ring_buffer_select(), so we can't reuse req->buf_index here.
> > >
> > > But io_uring/net.c and io_uring/rw.c both retrieve the buf_group value
> > > from req->buf_index instead of the SQE, for example:
> > > if (req->flags & REQ_F_BUFFER_SELECT)
> > >         sr->buf_group = req->buf_index;
> > >
> > > Seems like it would make sense to do the same for
> > > UBLK_U_IO_FETCH_IO_CMDS. That also saves one pointer dereference here.
> >
> > IMO we shouldn't encourage drivers to access `io_kiocb`; however, cmd->sqe
> > is exposed to drivers explicitly.
> 
> Right, but we can add a helper in include/linux/io_uring/cmd.h to
> encapsulate accessing the io_kiocb field.

OK, however I'd suggest doing it as a followup optimization, to avoid a
cross-tree change.
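
For reference, a minimal sketch of what such a helper might look like; this
is an illustration only, not part of this series. The helper name
io_uring_cmd_buf_index() is an assumption, and it would need to live on the
io_uring side since struct io_kiocb isn't visible to drivers:

/* io_uring/uring_cmd.c (sketch) */
u16 io_uring_cmd_buf_index(struct io_uring_cmd *cmd)
{
	return cmd_to_io_kiocb(cmd)->buf_index;
}
EXPORT_SYMBOL_GPL(io_uring_cmd_buf_index);

/* include/linux/io_uring/cmd.h (sketch) */
u16 io_uring_cmd_buf_index(struct io_uring_cmd *cmd);

ublk could then set fcmd->buf_group = io_uring_cmd_buf_index(cmd) and drop
the READ_ONCE(cmd->sqe->buf_index) dereference.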


Thanks,
Ming


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH V4 14/27] ublk: add UBLK_U_IO_FETCH_IO_CMDS for batch I/O processing
  2025-12-02  8:14             ` Ming Lei
@ 2025-12-02 15:20               ` Caleb Sander Mateos
  0 siblings, 0 replies; 66+ messages in thread
From: Caleb Sander Mateos @ 2025-12-02 15:20 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Uday Shankar, Stefani Seibold,
	Andrew Morton, linux-kernel

On Tue, Dec 2, 2025 at 12:14 AM Ming Lei <ming.lei@redhat.com> wrote:
>
> On Mon, Dec 01, 2025 at 05:39:29PM -0800, Caleb Sander Mateos wrote:
> > On Mon, Dec 1, 2025 at 5:27 PM Ming Lei <ming.lei@redhat.com> wrote:
> > >
> > > On Mon, Dec 01, 2025 at 09:51:59AM -0800, Caleb Sander Mateos wrote:
> > > > On Mon, Dec 1, 2025 at 1:42 AM Ming Lei <ming.lei@redhat.com> wrote:
> > > > >
> > > > > On Sun, Nov 30, 2025 at 09:55:47PM -0800, Caleb Sander Mateos wrote:
> > > > > > On Thu, Nov 20, 2025 at 6:00 PM Ming Lei <ming.lei@redhat.com> wrote:
> > > > > > >
> > > > > > > Add UBLK_U_IO_FETCH_IO_CMDS command to enable efficient batch processing
> > > > > > > of I/O requests. This multishot uring_cmd allows the ublk server to fetch
> > > > > > > multiple I/O commands in a single operation, significantly reducing
> > > > > > > submission overhead compared to individual FETCH_REQ* commands.
> > > > > > >
> > > > > > > Key Design Features:
> > > > > > >
> > > > > > > 1. Multishot Operation: One UBLK_U_IO_FETCH_IO_CMDS can fetch many I/O
> > > > > > >    commands, with the batch size limited by the provided buffer length.
> > > > > > >
> > > > > > > 2. Dynamic Load Balancing: Multiple fetch commands can be submitted
> > > > > > >    simultaneously, but only one is active at any time. This enables
> > > > > > >    efficient load distribution across multiple server task contexts.
> > > > > > >
> > > > > > > 3. Implicit State Management: The implementation uses three key variables
> > > > > > >    to track state:
> > > > > > >    - evts_fifo: Queue of request tags awaiting processing
> > > > > > >    - fcmd_head: List of available fetch commands
> > > > > > >    - active_fcmd: Currently active fetch command (NULL = none active)
> > > > > > >
> > > > > > >    States are derived implicitly:
> > > > > > >    - IDLE: No fetch commands available
> > > > > > >    - READY: Fetch commands available, none active
> > > > > > >    - ACTIVE: One fetch command processing events
> > > > > > >
> > > > > > > 4. Lockless Reader Optimization: The active fetch command can read from
> > > > > > >    evts_fifo without locking (single reader guarantee), while writers
> > > > > > >    (ublk_queue_rq/ublk_queue_rqs) use evts_lock protection. The memory
> > > > > > >    barrier pairing plays a key role in the single lockless reader
> > > > > > >    optimization.
> > > > > > >
> > > > > > > Implementation Details:
> > > > > > >
> > > > > > > - ublk_queue_rq() and ublk_queue_rqs() save request tags to evts_fifo
> > > > > > > - __ublk_pick_active_fcmd() selects an available fetch command when
> > > > > > >   events arrive and no command is currently active
> > > > > >
> > > > > > What is __ublk_pick_active_fcmd()? I don't see a function with that name.
> > > > >
> > > > > It is renamed as __ublk_acquire_fcmd(), and its counter pair is
> > > > > __ublk_release_fcmd().
> > > >
> > > > Okay, update the commit message then?
> > > >
> > > > >
> > > > > >
> > > > > > > - ublk_batch_dispatch() moves tags from evts_fifo to the fetch command's
> > > > > > >   buffer and posts completion via io_uring_mshot_cmd_post_cqe()
> > > > > > > - State transitions are coordinated via evts_lock to maintain consistency
> > > > > > >
> > > > > > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > > > > > ---
> > > > > > >  drivers/block/ublk_drv.c      | 412 +++++++++++++++++++++++++++++++---
> > > > > > >  include/uapi/linux/ublk_cmd.h |   7 +
> > > > > > >  2 files changed, 388 insertions(+), 31 deletions(-)
> > > > > > >
> > > > > > > diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> > > > > > > index cc9c92d97349..2e5e392c939e 100644
> > > > > > > --- a/drivers/block/ublk_drv.c
> > > > > > > +++ b/drivers/block/ublk_drv.c
> > > > > > > @@ -93,6 +93,7 @@
> > > > > > >
> > > > > > >  /* ublk batch fetch uring_cmd */
> > > > > > >  struct ublk_batch_fcmd {
> > > > > > > +       struct list_head node;
> > > > > > >         struct io_uring_cmd *cmd;
> > > > > > >         unsigned short buf_group;
> > > > > > >  };
> > > > > > > @@ -117,7 +118,10 @@ struct ublk_uring_cmd_pdu {
> > > > > > >          */
> > > > > > >         struct ublk_queue *ubq;
> > > > > > >
> > > > > > > -       u16 tag;
> > > > > > > +       union {
> > > > > > > +               u16 tag;
> > > > > > > +               struct ublk_batch_fcmd *fcmd; /* batch io only */
> > > > > > > +       };
> > > > > > >  };
> > > > > > >
> > > > > > >  struct ublk_batch_io_data {
> > > > > > > @@ -229,18 +233,36 @@ struct ublk_queue {
> > > > > > >         struct ublk_device *dev;
> > > > > > >
> > > > > > >         /*
> > > > > > > -        * Inflight ublk request tag is saved in this fifo
> > > > > > > +        * Batch I/O State Management:
> > > > > > > +        *
> > > > > > > +        * The batch I/O system uses implicit state management based on the
> > > > > > > +        * combination of three key variables below.
> > > > > > > +        *
> > > > > > > +        * - IDLE: list_empty(&fcmd_head) && !active_fcmd
> > > > > > > +        *   No fetch commands available, events queue in evts_fifo
> > > > > > > +        *
> > > > > > > +        * - READY: !list_empty(&fcmd_head) && !active_fcmd
> > > > > > > +        *   Fetch commands available but none processing events
> > > > > > >          *
> > > > > > > -        * There are multiple writer from ublk_queue_rq() or ublk_queue_rqs(),
> > > > > > > -        * so lock is required for storing request tag to fifo
> > > > > > > +        * - ACTIVE: active_fcmd
> > > > > > > +        *   One fetch command actively processing events from evts_fifo
> > > > > > >          *
> > > > > > > -        * Make sure just one reader for fetching request from task work
> > > > > > > -        * function to ublk server, so no need to grab the lock in reader
> > > > > > > -        * side.
> > > > > > > +        * Key Invariants:
> > > > > > > +        * - At most one active_fcmd at any time (single reader)
> > > > > > > +        * - active_fcmd is always from fcmd_head list when non-NULL
> > > > > > > +        * - evts_fifo can be read locklessly by the single active reader
> > > > > > > +        * - All state transitions require evts_lock protection
> > > > > > > +        * - Multiple writers to evts_fifo require lock protection
> > > > > > >          */
> > > > > > >         struct {
> > > > > > >                 DECLARE_KFIFO_PTR(evts_fifo, unsigned short);
> > > > > > >                 spinlock_t evts_lock;
> > > > > > > +
> > > > > > > +               /* List of fetch commands available to process events */
> > > > > > > +               struct list_head fcmd_head;
> > > > > > > +
> > > > > > > +               /* Currently active fetch command (NULL = none active) */
> > > > > > > +               struct ublk_batch_fcmd  *active_fcmd;
> > > > > > >         }____cacheline_aligned_in_smp;
> > > > > > >
> > > > > > >         struct ublk_io ios[] __counted_by(q_depth);
> > > > > > > @@ -292,12 +314,20 @@ static void ublk_abort_queue(struct ublk_device *ub, struct ublk_queue *ubq);
> > > > > > >  static inline struct request *__ublk_check_and_get_req(struct ublk_device *ub,
> > > > > > >                 u16 q_id, u16 tag, struct ublk_io *io, size_t offset);
> > > > > > >  static inline unsigned int ublk_req_build_flags(struct request *req);
> > > > > > > +static void ublk_batch_dispatch(struct ublk_queue *ubq,
> > > > > > > +                               struct ublk_batch_io_data *data,
> > > > > > > +                               struct ublk_batch_fcmd *fcmd);
> > > > > > >
> > > > > > >  static inline bool ublk_dev_support_batch_io(const struct ublk_device *ub)
> > > > > > >  {
> > > > > > >         return false;
> > > > > > >  }
> > > > > > >
> > > > > > > +static inline bool ublk_support_batch_io(const struct ublk_queue *ubq)
> > > > > > > +{
> > > > > > > +       return false;
> > > > > > > +}
> > > > > > > +
> > > > > > >  static inline void ublk_io_lock(struct ublk_io *io)
> > > > > > >  {
> > > > > > >         spin_lock(&io->lock);
> > > > > > > @@ -624,13 +654,45 @@ static wait_queue_head_t ublk_idr_wq;     /* wait until one idr is freed */
> > > > > > >
> > > > > > >  static DEFINE_MUTEX(ublk_ctl_mutex);
> > > > > > >
> > > > > > > +static struct ublk_batch_fcmd *
> > > > > > > +ublk_batch_alloc_fcmd(struct io_uring_cmd *cmd)
> > > > > > > +{
> > > > > > > +       struct ublk_batch_fcmd *fcmd = kzalloc(sizeof(*fcmd), GFP_NOIO);
> > > > > >
> > > > > > An allocation in the I/O path seems unfortunate. Is there not room to
> > > > > > store the struct ublk_batch_fcmd in the io_uring_cmd pdu?
> > > > >
> > > > > It is allocated once for one mshot request, which covers many IOs.
> > > > >
> > > > > It can't be held in the uring_cmd pdu, but the allocation can be
> > > > > optimized in the future. Not a big deal at the enablement stage.
> > > >
> > > > Okay, seems fine to optimize it in the future.
> > > >
> > > > >
> > > > > > > +
> > > > > > > +       if (fcmd) {
> > > > > > > +               fcmd->cmd = cmd;
> > > > > > > +               fcmd->buf_group = READ_ONCE(cmd->sqe->buf_index);
> > > > > >
> > > > > > Is it necessary to store this here just to pass it back to the
> > > > > > io_uring layer? Wouldn't the io_uring layer already have access to it
> > > > > > in struct io_kiocb's buf_index field?
> > > > >
> > > > > ->buf_group is used by io_uring_cmd_buffer_select(), and this way also
> > > > > follows ->buf_index uses in both io_uring/net.c and io_uring/rw.c.
> > > > >
> > > > > Also req->buf_index is overwritten with the selected buffer ID in
> > > > > io_ring_buffer_select(), so we can't reuse req->buf_index here.
> > > >
> > > > But io_uring/net.c and io_uring/rw.c both retrieve the buf_group value
> > > > from req->buf_index instead of the SQE, for example:
> > > > if (req->flags & REQ_F_BUFFER_SELECT)
> > > >         sr->buf_group = req->buf_index;
> > > >
> > > > Seems like it would make sense to do the same for
> > > > UBLK_U_IO_FETCH_IO_CMDS. That also saves one pointer dereference here.
> > >
> > > IMO we shouldn't encourage drivers to access `io_kiocb`; however, cmd->sqe
> > > is exposed to drivers explicitly.
> >
> > Right, but we can add a helper in include/linux/io_uring/cmd.h to
> > encapsulate accessing the io_kiocb field.
>
> OK, however I'd suggest doing it as a followup optimization, to avoid a
> cross-tree change.

Fair enough

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH V4 16/27] ublk: add new feature UBLK_F_BATCH_IO
  2025-12-02  1:44     ` Ming Lei
@ 2025-12-02 16:05       ` Caleb Sander Mateos
  2025-12-03  2:21         ` Ming Lei
  0 siblings, 1 reply; 66+ messages in thread
From: Caleb Sander Mateos @ 2025-12-02 16:05 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Uday Shankar, Stefani Seibold,
	Andrew Morton, linux-kernel

On Mon, Dec 1, 2025 at 5:44 PM Ming Lei <ming.lei@redhat.com> wrote:
>
> On Mon, Dec 01, 2025 at 01:16:04PM -0800, Caleb Sander Mateos wrote:
> > On Thu, Nov 20, 2025 at 6:00 PM Ming Lei <ming.lei@redhat.com> wrote:
> > >
> > > Add new feature UBLK_F_BATCH_IO which replaces the following two
> > > per-io commands:
> > >
> > >         - UBLK_U_IO_FETCH_REQ
> > >
> > >         - UBLK_U_IO_COMMIT_AND_FETCH_REQ
> > >
> > > with three per-queue batch io uring_cmd:
> > >
> > >         - UBLK_U_IO_PREP_IO_CMDS
> > >
> > >         - UBLK_U_IO_COMMIT_IO_CMDS
> > >
> > >         - UBLK_U_IO_FETCH_IO_CMDS
> > >
> > > Then ublk can deliver batch io commands to the ublk server in a single
> > > multishot uring_cmd, and also allows preparing & committing multiple
> > > commands in batch style via a single uring_cmd, so communication cost is
> > > reduced a lot.
> > >
> > > This feature also no longer limits the task context for any of the
> > > supported commands, so any allowed uring_cmd can be issued from any task
> > > context. The ublk server implementation becomes much easier.
> > >
> > > Meantime load balance becomes much easier to support with this feature.
> > > The command `UBLK_U_IO_FETCH_IO_CMDS` can be issued from multiple task
> > > contexts, so each task can adjust this command's buffer length or number
> > > of inflight commands for controlling how much load is handled by current
> > > task.
> > >
> > > Later, priority parameter will be added to command `UBLK_U_IO_FETCH_IO_CMDS`
> > > for improving load balance support.
> > >
> > > UBLK_U_IO_GET_DATA isn't supported in batch io yet, but it may be
> >
> > UBLK_U_IO_NEED_GET_DATA?
>
> Yeah.
>
> >
> > > enabled in future via its batch pair.
> > >
> > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > ---
> > >  drivers/block/ublk_drv.c      | 58 ++++++++++++++++++++++++++++++++---
> > >  include/uapi/linux/ublk_cmd.h | 16 ++++++++++
> > >  2 files changed, 69 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> > > index 849199771f86..90cd1863bc83 100644
> > > --- a/drivers/block/ublk_drv.c
> > > +++ b/drivers/block/ublk_drv.c
> > > @@ -74,7 +74,8 @@
> > >                 | UBLK_F_AUTO_BUF_REG \
> > >                 | UBLK_F_QUIESCE \
> > >                 | UBLK_F_PER_IO_DAEMON \
> > > -               | UBLK_F_BUF_REG_OFF_DAEMON)
> > > +               | UBLK_F_BUF_REG_OFF_DAEMON \
> > > +               | UBLK_F_BATCH_IO)
> > >
> > >  #define UBLK_F_ALL_RECOVERY_FLAGS (UBLK_F_USER_RECOVERY \
> > >                 | UBLK_F_USER_RECOVERY_REISSUE \
> > > @@ -320,12 +321,12 @@ static void ublk_batch_dispatch(struct ublk_queue *ubq,
> > >
> > >  static inline bool ublk_dev_support_batch_io(const struct ublk_device *ub)
> > >  {
> > > -       return false;
> > > +       return ub->dev_info.flags & UBLK_F_BATCH_IO;
> > >  }
> > >
> > >  static inline bool ublk_support_batch_io(const struct ublk_queue *ubq)
> > >  {
> > > -       return false;
> > > +       return ubq->flags & UBLK_F_BATCH_IO;
> > >  }
> > >
> > >  static inline void ublk_io_lock(struct ublk_io *io)
> > > @@ -3450,6 +3451,41 @@ static int ublk_validate_batch_fetch_cmd(struct ublk_batch_io_data *data,
> > >         return 0;
> > >  }
> > >
> > > +static int ublk_handle_non_batch_cmd(struct io_uring_cmd *cmd,
> > > +                                    unsigned int issue_flags)
> > > +{
> > > +       const struct ublksrv_io_cmd *ub_cmd = io_uring_sqe_cmd(cmd->sqe);
> > > +       struct ublk_device *ub = cmd->file->private_data;
> > > +       unsigned tag = READ_ONCE(ub_cmd->tag);
> > > +       unsigned q_id = READ_ONCE(ub_cmd->q_id);
> > > +       unsigned index = READ_ONCE(ub_cmd->addr);
> > > +       struct ublk_queue *ubq;
> > > +       struct ublk_io *io;
> > > +       int ret = -EINVAL;
> >
> > I think it would be clearer to just return -EINVAL instead of adding
> > this variable, but up to you
> >
> > > +
> > > +       if (!ub)
> > > +               return ret;
> >
> > How is this case possible?
>
> Will remove the check.
>
> >
> > > +
> > > +       if (q_id >= ub->dev_info.nr_hw_queues)
> > > +               return ret;
> > > +
> > > +       ubq = ublk_get_queue(ub, q_id);
> > > +       if (tag >= ubq->q_depth)
> >
> > Can avoid the likely cache miss here by using ub->dev_info.queue_depth
> > instead, analogous to ublk_ch_uring_cmd_local()
>
> OK.
>
> >
> > > +               return ret;
> > > +
> > > +       io = &ubq->ios[tag];
> > > +
> > > +       switch (cmd->cmd_op) {
> > > +       case UBLK_U_IO_REGISTER_IO_BUF:
> > > +               return ublk_register_io_buf(cmd, ub, q_id, tag, io, index,
> > > +                               issue_flags);
> > > +       case UBLK_U_IO_UNREGISTER_IO_BUF:
> > > +               return ublk_unregister_io_buf(cmd, ub, index, issue_flags);
> > > +       default:
> > > +               return -EOPNOTSUPP;
> > > +       }
> > > +}
> > > +
> > >  static int ublk_ch_batch_io_uring_cmd(struct io_uring_cmd *cmd,
> > >                                        unsigned int issue_flags)
> > >  {
> > > @@ -3497,7 +3533,8 @@ static int ublk_ch_batch_io_uring_cmd(struct io_uring_cmd *cmd,
> > >                 ret = ublk_handle_batch_fetch_cmd(&data);
> > >                 break;
> > >         default:
> > > -               ret = -EOPNOTSUPP;
> > > +               ret = ublk_handle_non_batch_cmd(cmd, issue_flags);
> >
> > We should probably skip the if (data.header.q_id >=
> > ub->dev_info.nr_hw_queues) check for a non-batch command?
>
> It is true only for UBLK_IO_UNREGISTER_IO_BUF.

My point was that this relies on the q_id field being located at the
same offset in struct ublksrv_io_cmd and struct ublk_batch_io, which
seems quite subtle. I think it would make more sense not to read the
SQE as a struct ublk_batch_io for the non-batch commands.
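
As a minimal illustration of the coincidence being relied on (both layouts
happen to start with a __u16 q_id), a compile-time check along these lines
would make the assumption explicit if the fallback were kept; this is only a
sketch, not part of the series, with struct ublk_batch_io being the batch
header added earlier in this patchset:

#include <linux/build_bug.h>
#include <linux/stddef.h>
#include <uapi/linux/ublk_cmd.h>

/* the non-batch fallback silently depends on this layout property */
static_assert(offsetof(struct ublksrv_io_cmd, q_id) ==
	      offsetof(struct ublk_batch_io, q_id));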

Best,
Caleb

>
> >
> > > +               break;
> > >         }
> > >  out:
> > >         return ret;
> > > @@ -4163,9 +4200,13 @@ static int ublk_ctrl_add_dev(const struct ublksrv_ctrl_cmd *header)
> > >
> > >         ub->dev_info.flags |= UBLK_F_CMD_IOCTL_ENCODE |
> > >                 UBLK_F_URING_CMD_COMP_IN_TASK |
> > > -               UBLK_F_PER_IO_DAEMON |
> > > +               (ublk_dev_support_batch_io(ub) ? 0 : UBLK_F_PER_IO_DAEMON) |
> >
> > Seems redundant with the logic below to clear UBLK_F_PER_IO_DAEMON if
> > (ublk_dev_support_batch_io(ub))?
>
> Good catch.
>
> >
> > >                 UBLK_F_BUF_REG_OFF_DAEMON;
> > >
> > > +       /* So far, UBLK_F_PER_IO_DAEMON won't be exposed for BATCH_IO */
> > > +       if (ublk_dev_support_batch_io(ub))
> > > +               ub->dev_info.flags &= ~UBLK_F_PER_IO_DAEMON;
> > > +
> > >         /* GET_DATA isn't needed any more with USER_COPY or ZERO COPY */
> > >         if (ub->dev_info.flags & (UBLK_F_USER_COPY | UBLK_F_SUPPORT_ZERO_COPY |
> > >                                 UBLK_F_AUTO_BUF_REG))
> > > @@ -4518,6 +4559,13 @@ static int ublk_wait_for_idle_io(struct ublk_device *ub,
> > >         unsigned int elapsed = 0;
> > >         int ret;
> > >
> > > +       /*
> > > +        * For UBLK_F_BATCH_IO ublk server can get notified with existing
> > > +        * or new fetch command, so needn't wait any more
> > > +        */
> > > +       if (ublk_dev_support_batch_io(ub))
> > > +               return 0;
> > > +
> > >         while (elapsed < timeout_ms && !signal_pending(current)) {
> > >                 unsigned int queues_cancelable = 0;
> > >                 int i;
> > > diff --git a/include/uapi/linux/ublk_cmd.h b/include/uapi/linux/ublk_cmd.h
> > > index cd894c1d188e..5e8b1211b7f4 100644
> > > --- a/include/uapi/linux/ublk_cmd.h
> > > +++ b/include/uapi/linux/ublk_cmd.h
> > > @@ -335,6 +335,22 @@
> > >   */
> > >  #define UBLK_F_BUF_REG_OFF_DAEMON (1ULL << 14)
> > >
> > > +
> > > +/*
> > > + * Support the following commands for delivering & committing io command
> > > + * in batch.
> > > + *
> > > + *     - UBLK_U_IO_PREP_IO_CMDS
> > > + *     - UBLK_U_IO_COMMIT_IO_CMDS
> > > + *     - UBLK_U_IO_FETCH_IO_CMDS
> > > + *     - UBLK_U_IO_REGISTER_IO_BUF
> > > + *     - UBLK_U_IO_UNREGISTER_IO_BUF
> >
> > Seems like it might make sense to provide batched versions of
> > UBLK_U_IO_REGISTER_IO_BUF and UBLK_U_IO_UNREGISTER_IO_BUF. That could
> > be done in the future, I guess, but it might simplify
> > ublk_ch_batch_io_uring_cmd() to only have to handle struct
> > ublk_batch_io.
>
> Agree, and it can be added in future.
>
>
>
>
> Thanks,
> Ming
>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH V4 16/27] ublk: add new feature UBLK_F_BATCH_IO
  2025-12-02 16:05       ` Caleb Sander Mateos
@ 2025-12-03  2:21         ` Ming Lei
  0 siblings, 0 replies; 66+ messages in thread
From: Ming Lei @ 2025-12-03  2:21 UTC (permalink / raw)
  To: Caleb Sander Mateos
  Cc: Jens Axboe, linux-block, Uday Shankar, Stefani Seibold,
	Andrew Morton, linux-kernel

On Tue, Dec 02, 2025 at 08:05:17AM -0800, Caleb Sander Mateos wrote:
> On Mon, Dec 1, 2025 at 5:44 PM Ming Lei <ming.lei@redhat.com> wrote:
> >
> > On Mon, Dec 01, 2025 at 01:16:04PM -0800, Caleb Sander Mateos wrote:
> > > On Thu, Nov 20, 2025 at 6:00 PM Ming Lei <ming.lei@redhat.com> wrote:
> > > >
> > > > Add new feature UBLK_F_BATCH_IO which replaces the following two
> > > > per-io commands:
> > > >
> > > >         - UBLK_U_IO_FETCH_REQ
> > > >
> > > >         - UBLK_U_IO_COMMIT_AND_FETCH_REQ
> > > >
> > > > with three per-queue batch io uring_cmd:
> > > >
> > > >         - UBLK_U_IO_PREP_IO_CMDS
> > > >
> > > >         - UBLK_U_IO_COMMIT_IO_CMDS
> > > >
> > > >         - UBLK_U_IO_FETCH_IO_CMDS
> > > >
> > > > Then ublk can deliver batch io commands to the ublk server in a single
> > > > multishot uring_cmd, and also allows preparing & committing multiple
> > > > commands in batch style via a single uring_cmd, so communication cost is
> > > > reduced a lot.
> > > >
> > > > This feature also no longer limits the task context for any of the
> > > > supported commands, so any allowed uring_cmd can be issued from any task
> > > > context. The ublk server implementation becomes much easier.
> > > >
> > > > Meantime load balance becomes much easier to support with this feature.
> > > > The command `UBLK_U_IO_FETCH_IO_CMDS` can be issued from multiple task
> > > > contexts, so each task can adjust this command's buffer length or number
> > > > of inflight commands for controlling how much load is handled by current
> > > > task.
> > > >
> > > > Later, priority parameter will be added to command `UBLK_U_IO_FETCH_IO_CMDS`
> > > > for improving load balance support.
> > > >
> > > > UBLK_U_IO_GET_DATA isn't supported in batch io yet, but it may be
> > >
> > > UBLK_U_IO_NEED_GET_DATA?
> >
> > Yeah.
> >
> > >
> > > > enabled in future via its batch pair.
> > > >
> > > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > > ---
> > > >  drivers/block/ublk_drv.c      | 58 ++++++++++++++++++++++++++++++++---
> > > >  include/uapi/linux/ublk_cmd.h | 16 ++++++++++
> > > >  2 files changed, 69 insertions(+), 5 deletions(-)
> > > >
> > > > diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> > > > index 849199771f86..90cd1863bc83 100644
> > > > --- a/drivers/block/ublk_drv.c
> > > > +++ b/drivers/block/ublk_drv.c
> > > > @@ -74,7 +74,8 @@
> > > >                 | UBLK_F_AUTO_BUF_REG \
> > > >                 | UBLK_F_QUIESCE \
> > > >                 | UBLK_F_PER_IO_DAEMON \
> > > > -               | UBLK_F_BUF_REG_OFF_DAEMON)
> > > > +               | UBLK_F_BUF_REG_OFF_DAEMON \
> > > > +               | UBLK_F_BATCH_IO)
> > > >
> > > >  #define UBLK_F_ALL_RECOVERY_FLAGS (UBLK_F_USER_RECOVERY \
> > > >                 | UBLK_F_USER_RECOVERY_REISSUE \
> > > > @@ -320,12 +321,12 @@ static void ublk_batch_dispatch(struct ublk_queue *ubq,
> > > >
> > > >  static inline bool ublk_dev_support_batch_io(const struct ublk_device *ub)
> > > >  {
> > > > -       return false;
> > > > +       return ub->dev_info.flags & UBLK_F_BATCH_IO;
> > > >  }
> > > >
> > > >  static inline bool ublk_support_batch_io(const struct ublk_queue *ubq)
> > > >  {
> > > > -       return false;
> > > > +       return ubq->flags & UBLK_F_BATCH_IO;
> > > >  }
> > > >
> > > >  static inline void ublk_io_lock(struct ublk_io *io)
> > > > @@ -3450,6 +3451,41 @@ static int ublk_validate_batch_fetch_cmd(struct ublk_batch_io_data *data,
> > > >         return 0;
> > > >  }
> > > >
> > > > +static int ublk_handle_non_batch_cmd(struct io_uring_cmd *cmd,
> > > > +                                    unsigned int issue_flags)
> > > > +{
> > > > +       const struct ublksrv_io_cmd *ub_cmd = io_uring_sqe_cmd(cmd->sqe);
> > > > +       struct ublk_device *ub = cmd->file->private_data;
> > > > +       unsigned tag = READ_ONCE(ub_cmd->tag);
> > > > +       unsigned q_id = READ_ONCE(ub_cmd->q_id);
> > > > +       unsigned index = READ_ONCE(ub_cmd->addr);
> > > > +       struct ublk_queue *ubq;
> > > > +       struct ublk_io *io;
> > > > +       int ret = -EINVAL;
> > >
> > > I think it would be clearer to just return -EINVAL instead of adding
> > > this variable, but up to you
> > >
> > > > +
> > > > +       if (!ub)
> > > > +               return ret;
> > >
> > > How is this case possible?
> >
> > Will remove the check.
> >
> > >
> > > > +
> > > > +       if (q_id >= ub->dev_info.nr_hw_queues)
> > > > +               return ret;
> > > > +
> > > > +       ubq = ublk_get_queue(ub, q_id);
> > > > +       if (tag >= ubq->q_depth)
> > >
> > > Can avoid the likely cache miss here by using ub->dev_info.queue_depth
> > > instead, analogous to ublk_ch_uring_cmd_local()
> >
> > OK.
> >
> > >
> > > > +               return ret;
> > > > +
> > > > +       io = &ubq->ios[tag];
> > > > +
> > > > +       switch (cmd->cmd_op) {
> > > > +       case UBLK_U_IO_REGISTER_IO_BUF:
> > > > +               return ublk_register_io_buf(cmd, ub, q_id, tag, io, index,
> > > > +                               issue_flags);
> > > > +       case UBLK_U_IO_UNREGISTER_IO_BUF:
> > > > +               return ublk_unregister_io_buf(cmd, ub, index, issue_flags);
> > > > +       default:
> > > > +               return -EOPNOTSUPP;
> > > > +       }
> > > > +}
> > > > +
> > > >  static int ublk_ch_batch_io_uring_cmd(struct io_uring_cmd *cmd,
> > > >                                        unsigned int issue_flags)
> > > >  {
> > > > @@ -3497,7 +3533,8 @@ static int ublk_ch_batch_io_uring_cmd(struct io_uring_cmd *cmd,
> > > >                 ret = ublk_handle_batch_fetch_cmd(&data);
> > > >                 break;
> > > >         default:
> > > > -               ret = -EOPNOTSUPP;
> > > > +               ret = ublk_handle_non_batch_cmd(cmd, issue_flags);
> > >
> > > We should probably skip the if (data.header.q_id >=
> > > ub->dev_info.nr_hw_queues) check for a non-batch command?
> >
> > It is true only for UBLK_IO_UNREGISTER_IO_BUF.
> 
> My point was that this relies on the q_id field being located at the
> same offset in struct ublksrv_io_cmd and struct ublk_batch_io, which
> seems quite subtle. I think it would make more sense not to read the
> SQE as a struct ublk_batch_io for the non-batch commands.

OK, got it. Then the check can be moved into ublk_check_batch_cmd() and
ublk_validate_batch_fetch_cmd(); that can be a delta fix for V5.
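
A rough sketch of that delta, just to show the shape; the function signature
and surrounding validation are assumptions, only the moved range check is
taken from the hunk quoted above:

/* sketch: do the queue id range check only for batch commands */
static int ublk_check_batch_cmd(struct ublk_device *ub,
				struct ublk_batch_io_data *data)
{
	if (data->header.q_id >= ub->dev_info.nr_hw_queues)
		return -EINVAL;

	/* existing validation of the batch header continues here */
	return 0;
}

ublk_validate_batch_fetch_cmd() would grow the same check, while
ublk_handle_non_batch_cmd() keeps validating q_id against nr_hw_queues on
its own, as it already does.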


Thanks,
Ming


^ permalink raw reply	[flat|nested] 66+ messages in thread

end of thread, other threads:[~2025-12-03  2:21 UTC | newest]

Thread overview: 66+ messages
2025-11-21  1:58 [PATCH V4 00/27] ublk: add UBLK_F_BATCH_IO Ming Lei
2025-11-21  1:58 ` [PATCH V4 01/27] kfifo: add kfifo_alloc_node() helper for NUMA awareness Ming Lei
2025-11-29 19:12   ` Caleb Sander Mateos
2025-12-01  1:46     ` Ming Lei
2025-12-01  5:58       ` Caleb Sander Mateos
2025-11-21  1:58 ` [PATCH V4 02/27] ublk: add parameter `struct io_uring_cmd *` to ublk_prep_auto_buf_reg() Ming Lei
2025-11-21  1:58 ` [PATCH V4 03/27] ublk: add `union ublk_io_buf` with improved naming Ming Lei
2025-11-21  1:58 ` [PATCH V4 04/27] ublk: refactor auto buffer register in ublk_dispatch_req() Ming Lei
2025-11-21  1:58 ` [PATCH V4 05/27] ublk: pass const pointer to ublk_queue_is_zoned() Ming Lei
2025-11-21  1:58 ` [PATCH V4 06/27] ublk: add helper of __ublk_fetch() Ming Lei
2025-11-21  1:58 ` [PATCH V4 07/27] ublk: define ublk_ch_batch_io_fops for the coming feature F_BATCH_IO Ming Lei
2025-11-21  1:58 ` [PATCH V4 08/27] ublk: prepare for not tracking task context for command batch Ming Lei
2025-11-21  1:58 ` [PATCH V4 09/27] ublk: add new batch command UBLK_U_IO_PREP_IO_CMDS & UBLK_U_IO_COMMIT_IO_CMDS Ming Lei
2025-11-29 19:19   ` Caleb Sander Mateos
2025-11-21  1:58 ` [PATCH V4 10/27] ublk: handle UBLK_U_IO_PREP_IO_CMDS Ming Lei
2025-11-29 19:47   ` Caleb Sander Mateos
2025-11-30 19:25   ` Caleb Sander Mateos
2025-11-21  1:58 ` [PATCH V4 11/27] ublk: handle UBLK_U_IO_COMMIT_IO_CMDS Ming Lei
2025-11-30 16:39   ` Caleb Sander Mateos
2025-12-01 10:25     ` Ming Lei
2025-12-01 16:43       ` Caleb Sander Mateos
2025-11-21  1:58 ` [PATCH V4 12/27] ublk: add io events fifo structure Ming Lei
2025-11-30 16:53   ` Caleb Sander Mateos
2025-12-01  3:04     ` Ming Lei
2025-11-21  1:58 ` [PATCH V4 13/27] ublk: add batch I/O dispatch infrastructure Ming Lei
2025-11-30 19:24   ` Caleb Sander Mateos
2025-11-30 21:37     ` Caleb Sander Mateos
2025-12-01  2:32     ` Ming Lei
2025-12-01 17:37       ` Caleb Sander Mateos
2025-11-21  1:58 ` [PATCH V4 14/27] ublk: add UBLK_U_IO_FETCH_IO_CMDS for batch I/O processing Ming Lei
2025-12-01  5:55   ` Caleb Sander Mateos
2025-12-01  9:41     ` Ming Lei
2025-12-01 17:51       ` Caleb Sander Mateos
2025-12-02  1:27         ` Ming Lei
2025-12-02  1:39           ` Caleb Sander Mateos
2025-12-02  8:14             ` Ming Lei
2025-12-02 15:20               ` Caleb Sander Mateos
2025-11-21  1:58 ` [PATCH V4 15/27] ublk: abort requests filled in event kfifo Ming Lei
2025-12-01 18:52   ` Caleb Sander Mateos
2025-12-02  1:29     ` Ming Lei
2025-12-01 19:00   ` Caleb Sander Mateos
2025-11-21  1:58 ` [PATCH V4 16/27] ublk: add new feature UBLK_F_BATCH_IO Ming Lei
2025-12-01 21:16   ` Caleb Sander Mateos
2025-12-02  1:44     ` Ming Lei
2025-12-02 16:05       ` Caleb Sander Mateos
2025-12-03  2:21         ` Ming Lei
2025-11-21  1:58 ` [PATCH V4 17/27] ublk: document " Ming Lei
2025-12-01 21:46   ` Caleb Sander Mateos
2025-12-02  1:55     ` Ming Lei
2025-12-02  2:03     ` Ming Lei
2025-11-21  1:58 ` [PATCH V4 18/27] ublk: implement batch request completion via blk_mq_end_request_batch() Ming Lei
2025-12-01 21:55   ` Caleb Sander Mateos
2025-11-21  1:58 ` [PATCH V4 19/27] selftests: ublk: fix user_data truncation for tgt_data >= 256 Ming Lei
2025-11-21  1:58 ` [PATCH V4 20/27] selftests: ublk: replace assert() with ublk_assert() Ming Lei
2025-11-21  1:58 ` [PATCH V4 21/27] selftests: ublk: add ublk_io_buf_idx() for returning io buffer index Ming Lei
2025-11-21  1:58 ` [PATCH V4 22/27] selftests: ublk: add batch buffer management infrastructure Ming Lei
2025-11-21  1:58 ` [PATCH V4 23/27] selftests: ublk: handle UBLK_U_IO_PREP_IO_CMDS Ming Lei
2025-11-21  1:58 ` [PATCH V4 24/27] selftests: ublk: handle UBLK_U_IO_COMMIT_IO_CMDS Ming Lei
2025-11-21  1:58 ` [PATCH V4 25/27] selftests: ublk: handle UBLK_U_IO_FETCH_IO_CMDS Ming Lei
2025-11-21  1:58 ` [PATCH V4 26/27] selftests: ublk: add --batch/-b for enabling F_BATCH_IO Ming Lei
2025-11-21  1:58 ` [PATCH V4 27/27] selftests: ublk: support arbitrary threads/queues combination Ming Lei
2025-11-28 11:59 ` [PATCH V4 00/27] ublk: add UBLK_F_BATCH_IO Ming Lei
2025-11-28 16:19   ` Jens Axboe
2025-11-28 19:07     ` Caleb Sander Mateos
2025-11-29  1:24       ` Ming Lei
2025-11-28 16:22 ` (subset) " Jens Axboe
