public inbox for linux-block@vger.kernel.org
* [PATCH v2 00/10] ublk: add shared memory zero-copy support
@ 2026-03-31 15:31 Ming Lei
  2026-03-31 15:31 ` [PATCH v2 01/10] ublk: add UBLK_U_CMD_REG_BUF/UNREG_BUF control commands Ming Lei
                   ` (11 more replies)
  0 siblings, 12 replies; 19+ messages in thread
From: Ming Lei @ 2026-03-31 15:31 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: Caleb Sander Mateos, Ming Lei

Hello,

Add shared memory based zero-copy (UBLK_F_SHMEM_ZC) support for ublk.

The ublk server and its client share a memory region (e.g. memfd or
hugetlbfs file) via MAP_SHARED mmap. The server registers this region
with the kernel via UBLK_U_CMD_REG_BUF, which pins the pages and
builds a PFN maple tree. When I/O arrives, the driver looks up bio
pages in the maple tree — if they match registered buffer pages, the
data is used directly without copying.

Please see the documentation added in patch 3 for details.

Patches 1-4 implement the kernel side:
 - buffer register/unregister control commands with PFN coalescing,
   including read-only buffer support (UBLK_SHMEM_BUF_READ_ONLY)
 - PFN-based matching in the I/O path, with enforcement that read-only
   buffers reject non-WRITE requests
 - UBLK_F_SHMEM_ZC feature flag
 - eliminate permanent pages[] array from struct ublk_buf; the maple
   tree already stores PFN ranges, so pages[] becomes temporary

Patches 5-10 add kublk (selftest server) support and tests:
 - hugetlbfs buffer sharing (both kublk and fio mmap the same file)
 - null target and loop target tests with fio verify
 - filesystem-level test (ext4 on ublk, fio verify on a file)
 - read-only buffer registration test (--rdonly_shmem_buf)

Changes since V1:
 - rename struct ublk_buf_reg to struct ublk_shmem_buf_reg, add __u32
   flags field for extensibility, narrow __u64 len to __u32 (max 4GB
   per UBLK_SHMEM_ZC_OFF_MASK), remove __u32 reserved (patch 1)
 - add UBLK_SHMEM_BUF_READ_ONLY flag: pin pages without FOLL_WRITE,
   enabling registration of write-sealed memfd buffers (patch 1)
 - use backward-compatible struct reading: memset zero + copy
   min(header->len, sizeof(struct)) (patch 1)
 - reorder struct ublk_buf_range fields for better packing (16 bytes
   vs 24 bytes), change buf_index to unsigned short, add unsigned short
   flags to store per-range read-only state (patch 1)
 - enforce read-only buffer semantics in ublk_try_buf_match(): reject
   non-WRITE requests on read-only buffers since READ I/O needs to
   write data into the buffer (patch 2)
 - narrow struct ublk_buf::nr_pages to unsigned int, narrow struct
   ublk_buf_range::base_offset to unsigned int (patch 1)
 - add new patch 4: eliminate permanent pages[] array from struct
   ublk_buf — recover struct page pointers via pfn_to_page() from the
   maple tree during unregistration, saving 2MB per 1GB buffer
 - add UBLK_F_SHMEM_ZC to feat_map in kublk (patch 5)
 - add new patch 10: read-only buffer registration selftest with
   --rdonly_shmem_buf option on null target + hugetlbfs

Ming Lei (10):
  ublk: add UBLK_U_CMD_REG_BUF/UNREG_BUF control commands
  ublk: add PFN-based buffer matching in I/O path
  ublk: enable UBLK_F_SHMEM_ZC feature flag
  ublk: eliminate permanent pages[] array from struct ublk_buf
  selftests/ublk: add shared memory zero-copy support in kublk
  selftests/ublk: add UBLK_F_SHMEM_ZC support for loop target
  selftests/ublk: add shared memory zero-copy test
  selftests/ublk: add hugetlbfs shmem_zc test for loop target
  selftests/ublk: add filesystem fio verify test for shmem_zc
  selftests/ublk: add read-only buffer registration test

 Documentation/block/ublk.rst                  | 117 +++++
 drivers/block/ublk_drv.c                      | 403 +++++++++++++++++-
 include/uapi/linux/ublk_cmd.h                 |  79 ++++
 tools/testing/selftests/ublk/Makefile         |   5 +
 tools/testing/selftests/ublk/file_backed.c    |  38 ++
 tools/testing/selftests/ublk/kublk.c          | 347 ++++++++++++++-
 tools/testing/selftests/ublk/kublk.h          |  15 +
 tools/testing/selftests/ublk/test_common.sh   |  15 +-
 .../testing/selftests/ublk/test_shmemzc_01.sh |  72 ++++
 .../testing/selftests/ublk/test_shmemzc_02.sh |  68 +++
 .../testing/selftests/ublk/test_shmemzc_03.sh |  69 +++
 .../testing/selftests/ublk/test_shmemzc_04.sh |  72 ++++
 12 files changed, 1292 insertions(+), 8 deletions(-)
 create mode 100755 tools/testing/selftests/ublk/test_shmemzc_01.sh
 create mode 100755 tools/testing/selftests/ublk/test_shmemzc_02.sh
 create mode 100755 tools/testing/selftests/ublk/test_shmemzc_03.sh
 create mode 100755 tools/testing/selftests/ublk/test_shmemzc_04.sh

--
2.53.0


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v2 01/10] ublk: add UBLK_U_CMD_REG_BUF/UNREG_BUF control commands
  2026-03-31 15:31 [PATCH v2 00/10] ublk: add shared memory zero-copy support Ming Lei
@ 2026-03-31 15:31 ` Ming Lei
  2026-04-07 19:35   ` Caleb Sander Mateos
  2026-03-31 15:31 ` [PATCH v2 02/10] ublk: add PFN-based buffer matching in I/O path Ming Lei
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 19+ messages in thread
From: Ming Lei @ 2026-03-31 15:31 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: Caleb Sander Mateos, Ming Lei

Add control commands for registering and unregistering shared memory
buffers for zero-copy I/O:

- UBLK_U_CMD_REG_BUF (0x18): pins pages from userspace, inserts PFN
  ranges into a per-device maple tree for O(log n) lookup during I/O.
  Buffer pointers are tracked in a per-device xarray. Returns the
  assigned buffer index.

- UBLK_U_CMD_UNREG_BUF (0x19): removes PFN entries and unpins pages.

Queue freeze/unfreeze is handled internally so userspace need not
quiesce the device during registration.

Also adds:
- UBLK_IO_F_SHMEM_ZC flag and addr encoding helpers in UAPI header
  (16-bit buffer index supporting up to 65536 buffers)
- Data structures (ublk_buf, ublk_buf_range) and xarray/maple tree
- __ublk_ctrl_reg_buf() helper for PFN insertion with error unwinding
- __ublk_ctrl_unreg_buf() helper for cleanup reuse
- ublk_support_shmem_zc() / ublk_dev_support_shmem_zc() stubs
  (returning false — feature not enabled yet)

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/block/ublk_drv.c      | 300 ++++++++++++++++++++++++++++++++++
 include/uapi/linux/ublk_cmd.h |  72 ++++++++
 2 files changed, 372 insertions(+)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 71c7c56b38ca..ac6ccc174d44 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -46,6 +46,8 @@
 #include <linux/kref.h>
 #include <linux/kfifo.h>
 #include <linux/blk-integrity.h>
+#include <linux/maple_tree.h>
+#include <linux/xarray.h>
 #include <uapi/linux/fs.h>
 #include <uapi/linux/ublk_cmd.h>
 
@@ -58,6 +60,8 @@
 #define UBLK_CMD_UPDATE_SIZE	_IOC_NR(UBLK_U_CMD_UPDATE_SIZE)
 #define UBLK_CMD_QUIESCE_DEV	_IOC_NR(UBLK_U_CMD_QUIESCE_DEV)
 #define UBLK_CMD_TRY_STOP_DEV	_IOC_NR(UBLK_U_CMD_TRY_STOP_DEV)
+#define UBLK_CMD_REG_BUF	_IOC_NR(UBLK_U_CMD_REG_BUF)
+#define UBLK_CMD_UNREG_BUF	_IOC_NR(UBLK_U_CMD_UNREG_BUF)
 
 #define UBLK_IO_REGISTER_IO_BUF		_IOC_NR(UBLK_U_IO_REGISTER_IO_BUF)
 #define UBLK_IO_UNREGISTER_IO_BUF	_IOC_NR(UBLK_U_IO_UNREGISTER_IO_BUF)
@@ -289,6 +293,20 @@ struct ublk_queue {
 	struct ublk_io ios[] __counted_by(q_depth);
 };
 
+/* Per-registered shared memory buffer */
+struct ublk_buf {
+	struct page **pages;
+	unsigned int nr_pages;
+};
+
+/* Maple tree value: maps a PFN range to buffer location */
+struct ublk_buf_range {
+	unsigned long base_pfn;
+	unsigned short buf_index;
+	unsigned short flags;
+	unsigned int base_offset;	/* byte offset within buffer */
+};
+
 struct ublk_device {
 	struct gendisk		*ub_disk;
 
@@ -323,6 +341,10 @@ struct ublk_device {
 
 	bool			block_open; /* protected by open_mutex */
 
+	/* shared memory zero copy */
+	struct maple_tree	buf_tree;
+	struct xarray		bufs_xa;
+
 	struct ublk_queue       *queues[];
 };
 
@@ -334,6 +356,7 @@ struct ublk_params_header {
 
 static void ublk_io_release(void *priv);
 static void ublk_stop_dev_unlocked(struct ublk_device *ub);
+static void ublk_buf_cleanup(struct ublk_device *ub);
 static void ublk_abort_queue(struct ublk_device *ub, struct ublk_queue *ubq);
 static inline struct request *__ublk_check_and_get_req(struct ublk_device *ub,
 		u16 q_id, u16 tag, struct ublk_io *io);
@@ -398,6 +421,16 @@ static inline bool ublk_dev_support_zero_copy(const struct ublk_device *ub)
 	return ub->dev_info.flags & UBLK_F_SUPPORT_ZERO_COPY;
 }
 
+static inline bool ublk_support_shmem_zc(const struct ublk_queue *ubq)
+{
+	return false;
+}
+
+static inline bool ublk_dev_support_shmem_zc(const struct ublk_device *ub)
+{
+	return false;
+}
+
 static inline bool ublk_support_auto_buf_reg(const struct ublk_queue *ubq)
 {
 	return ubq->flags & UBLK_F_AUTO_BUF_REG;
@@ -1460,6 +1493,7 @@ static blk_status_t ublk_setup_iod(struct ublk_queue *ubq, struct request *req)
 	iod->op_flags = ublk_op | ublk_req_build_flags(req);
 	iod->nr_sectors = blk_rq_sectors(req);
 	iod->start_sector = blk_rq_pos(req);
+
 	iod->addr = io->buf.addr;
 
 	return BLK_STS_OK;
@@ -1665,6 +1699,7 @@ static bool ublk_start_io(const struct ublk_queue *ubq, struct request *req,
 {
 	unsigned mapped_bytes = ublk_map_io(ubq, req, io);
 
+
 	/* partially mapped, update io descriptor */
 	if (unlikely(mapped_bytes != blk_rq_bytes(req))) {
 		/*
@@ -4206,6 +4241,7 @@ static void ublk_cdev_rel(struct device *dev)
 {
 	struct ublk_device *ub = container_of(dev, struct ublk_device, cdev_dev);
 
+	ublk_buf_cleanup(ub);
 	blk_mq_free_tag_set(&ub->tag_set);
 	ublk_deinit_queues(ub);
 	ublk_free_dev_number(ub);
@@ -4625,6 +4661,8 @@ static int ublk_ctrl_add_dev(const struct ublksrv_ctrl_cmd *header)
 	mutex_init(&ub->mutex);
 	spin_lock_init(&ub->lock);
 	mutex_init(&ub->cancel_mutex);
+	mt_init(&ub->buf_tree);
+	xa_init_flags(&ub->bufs_xa, XA_FLAGS_ALLOC);
 	INIT_WORK(&ub->partition_scan_work, ublk_partition_scan_work);
 
 	ret = ublk_alloc_dev_number(ub, header->dev_id);
@@ -5168,6 +5206,260 @@ static int ublk_char_dev_permission(struct ublk_device *ub,
 	return err;
 }
 
+/*
+ * Drain inflight I/O and quiesce the queue. Freeze drains all inflight
+ * requests, quiesce_nowait marks the queue so no new requests dispatch,
+ * then unfreeze allows new submissions (which won't dispatch due to
+ * quiesce). This keeps freeze and ub->mutex non-nested.
+ */
+static void ublk_quiesce_and_release(struct gendisk *disk)
+{
+	unsigned int memflags;
+
+	memflags = blk_mq_freeze_queue(disk->queue);
+	blk_mq_quiesce_queue_nowait(disk->queue);
+	blk_mq_unfreeze_queue(disk->queue, memflags);
+}
+
+static void ublk_unquiesce_and_resume(struct gendisk *disk)
+{
+	blk_mq_unquiesce_queue(disk->queue);
+}
+
+/*
+ * Insert PFN ranges of a registered buffer into the maple tree,
+ * coalescing consecutive PFNs into single range entries.
+ * Returns 0 on success, negative error with partial insertions unwound.
+ */
+/* Erase coalesced PFN ranges from the maple tree for pages [0, nr_pages) */
+static void ublk_buf_erase_ranges(struct ublk_device *ub,
+				  struct ublk_buf *ubuf,
+				  unsigned long nr_pages)
+{
+	unsigned long i;
+
+	for (i = 0; i < nr_pages; ) {
+		unsigned long pfn = page_to_pfn(ubuf->pages[i]);
+		unsigned long start = i;
+
+		while (i + 1 < nr_pages &&
+		       page_to_pfn(ubuf->pages[i + 1]) == pfn + (i - start) + 1)
+			i++;
+		i++;
+		kfree(mtree_erase(&ub->buf_tree, pfn));
+	}
+}
+
+static int __ublk_ctrl_reg_buf(struct ublk_device *ub,
+			       struct ublk_buf *ubuf, int index,
+			       unsigned short flags)
+{
+	unsigned long nr_pages = ubuf->nr_pages;
+	unsigned long i;
+	int ret;
+
+	for (i = 0; i < nr_pages; ) {
+		unsigned long pfn = page_to_pfn(ubuf->pages[i]);
+		unsigned long start = i;
+		struct ublk_buf_range *range;
+
+		/* Find run of consecutive PFNs */
+		while (i + 1 < nr_pages &&
+		       page_to_pfn(ubuf->pages[i + 1]) == pfn + (i - start) + 1)
+			i++;
+		i++;	/* past the last page in this run */
+
+		range = kzalloc(sizeof(*range), GFP_KERNEL);
+		if (!range) {
+			ret = -ENOMEM;
+			goto unwind;
+		}
+		range->buf_index = index;
+		range->flags = flags;
+		range->base_pfn = pfn;
+		range->base_offset = start << PAGE_SHIFT;
+
+		ret = mtree_insert_range(&ub->buf_tree, pfn,
+					 pfn + (i - start) - 1,
+					 range, GFP_KERNEL);
+		if (ret) {
+			kfree(range);
+			goto unwind;
+		}
+	}
+	return 0;
+
+unwind:
+	ublk_buf_erase_ranges(ub, ubuf, i);
+	return ret;
+}
+
+/*
+ * Register a shared memory buffer for zero-copy I/O.
+ * Pins pages, builds PFN maple tree, freezes/unfreezes the queue
+ * internally. Returns buffer index (>= 0) on success.
+ */
+static int ublk_ctrl_reg_buf(struct ublk_device *ub,
+			     struct ublksrv_ctrl_cmd *header)
+{
+	void __user *argp = (void __user *)(unsigned long)header->addr;
+	struct ublk_shmem_buf_reg buf_reg;
+	unsigned long addr, size, nr_pages;
+	unsigned int gup_flags;
+	struct gendisk *disk;
+	struct ublk_buf *ubuf;
+	long pinned;
+	u32 index;
+	int ret;
+
+	if (!ublk_dev_support_shmem_zc(ub))
+		return -EOPNOTSUPP;
+
+	memset(&buf_reg, 0, sizeof(buf_reg));
+	if (copy_from_user(&buf_reg, argp,
+			   min_t(size_t, header->len, sizeof(buf_reg))))
+		return -EFAULT;
+
+	if (buf_reg.flags & ~UBLK_SHMEM_BUF_READ_ONLY)
+		return -EINVAL;
+
+	addr = buf_reg.addr;
+	size = buf_reg.len;
+	nr_pages = size >> PAGE_SHIFT;
+
+	if (!size || !PAGE_ALIGNED(size) || !PAGE_ALIGNED(addr))
+		return -EINVAL;
+
+	disk = ublk_get_disk(ub);
+	if (!disk)
+		return -ENODEV;
+
+	/* Pin pages before quiescing (may sleep) */
+	ubuf = kzalloc(sizeof(*ubuf), GFP_KERNEL);
+	if (!ubuf) {
+		ret = -ENOMEM;
+		goto put_disk;
+	}
+
+	ubuf->pages = kvmalloc_array(nr_pages, sizeof(*ubuf->pages),
+				     GFP_KERNEL);
+	if (!ubuf->pages) {
+		ret = -ENOMEM;
+		goto err_free;
+	}
+
+	gup_flags = FOLL_LONGTERM;
+	if (!(buf_reg.flags & UBLK_SHMEM_BUF_READ_ONLY))
+		gup_flags |= FOLL_WRITE;
+
+	pinned = pin_user_pages_fast(addr, nr_pages, gup_flags, ubuf->pages);
+	if (pinned < 0) {
+		ret = pinned;
+		goto err_free_pages;
+	}
+	if (pinned != nr_pages) {
+		ret = -EFAULT;
+		goto err_unpin;
+	}
+	ubuf->nr_pages = nr_pages;
+
+	/*
+	 * Drain inflight I/O and quiesce the queue so no new requests
+	 * are dispatched while we modify the maple tree. Keep freeze
+	 * and mutex non-nested to avoid lock dependency.
+	 */
+	ublk_quiesce_and_release(disk);
+
+	mutex_lock(&ub->mutex);
+
+	ret = xa_alloc(&ub->bufs_xa, &index, ubuf, xa_limit_16b, GFP_KERNEL);
+	if (ret)
+		goto err_unlock;
+
+	ret = __ublk_ctrl_reg_buf(ub, ubuf, index, buf_reg.flags);
+	if (ret) {
+		xa_erase(&ub->bufs_xa, index);
+		goto err_unlock;
+	}
+
+	mutex_unlock(&ub->mutex);
+
+	ublk_unquiesce_and_resume(disk);
+	ublk_put_disk(disk);
+	return index;
+
+err_unlock:
+	mutex_unlock(&ub->mutex);
+	ublk_unquiesce_and_resume(disk);
+err_unpin:
+	unpin_user_pages(ubuf->pages, pinned);
+err_free_pages:
+	kvfree(ubuf->pages);
+err_free:
+	kfree(ubuf);
+put_disk:
+	ublk_put_disk(disk);
+	return ret;
+}
+
+static void __ublk_ctrl_unreg_buf(struct ublk_device *ub,
+				  struct ublk_buf *ubuf)
+{
+	ublk_buf_erase_ranges(ub, ubuf, ubuf->nr_pages);
+	unpin_user_pages(ubuf->pages, ubuf->nr_pages);
+	kvfree(ubuf->pages);
+	kfree(ubuf);
+}
+
+static int ublk_ctrl_unreg_buf(struct ublk_device *ub,
+			       struct ublksrv_ctrl_cmd *header)
+{
+	int index = (int)header->data[0];
+	struct gendisk *disk;
+	struct ublk_buf *ubuf;
+
+	if (!ublk_dev_support_shmem_zc(ub))
+		return -EOPNOTSUPP;
+
+	disk = ublk_get_disk(ub);
+	if (!disk)
+		return -ENODEV;
+
+	/* Drain inflight I/O before modifying the maple tree */
+	ublk_quiesce_and_release(disk);
+
+	mutex_lock(&ub->mutex);
+
+	ubuf = xa_erase(&ub->bufs_xa, index);
+	if (!ubuf) {
+		mutex_unlock(&ub->mutex);
+		ublk_unquiesce_and_resume(disk);
+		ublk_put_disk(disk);
+		return -ENOENT;
+	}
+
+	__ublk_ctrl_unreg_buf(ub, ubuf);
+
+	mutex_unlock(&ub->mutex);
+
+	ublk_unquiesce_and_resume(disk);
+	ublk_put_disk(disk);
+	return 0;
+}
+
+static void ublk_buf_cleanup(struct ublk_device *ub)
+{
+	struct ublk_buf *ubuf;
+	unsigned long index;
+
+	xa_for_each(&ub->bufs_xa, index, ubuf)
+		__ublk_ctrl_unreg_buf(ub, ubuf);
+	xa_destroy(&ub->bufs_xa);
+	mtree_destroy(&ub->buf_tree);
+}
+
+
+
 static int ublk_ctrl_uring_cmd_permission(struct ublk_device *ub,
 		u32 cmd_op, struct ublksrv_ctrl_cmd *header)
 {
@@ -5225,6 +5517,8 @@ static int ublk_ctrl_uring_cmd_permission(struct ublk_device *ub,
 	case UBLK_CMD_UPDATE_SIZE:
 	case UBLK_CMD_QUIESCE_DEV:
 	case UBLK_CMD_TRY_STOP_DEV:
+	case UBLK_CMD_REG_BUF:
+	case UBLK_CMD_UNREG_BUF:
 		mask = MAY_READ | MAY_WRITE;
 		break;
 	default:
@@ -5350,6 +5644,12 @@ static int ublk_ctrl_uring_cmd(struct io_uring_cmd *cmd,
 	case UBLK_CMD_TRY_STOP_DEV:
 		ret = ublk_ctrl_try_stop_dev(ub);
 		break;
+	case UBLK_CMD_REG_BUF:
+		ret = ublk_ctrl_reg_buf(ub, &header);
+		break;
+	case UBLK_CMD_UNREG_BUF:
+		ret = ublk_ctrl_unreg_buf(ub, &header);
+		break;
 	default:
 		ret = -EOPNOTSUPP;
 		break;
diff --git a/include/uapi/linux/ublk_cmd.h b/include/uapi/linux/ublk_cmd.h
index a88876756805..52bb9b843d73 100644
--- a/include/uapi/linux/ublk_cmd.h
+++ b/include/uapi/linux/ublk_cmd.h
@@ -57,6 +57,44 @@
 	_IOWR('u', 0x16, struct ublksrv_ctrl_cmd)
 #define UBLK_U_CMD_TRY_STOP_DEV		\
 	_IOWR('u', 0x17, struct ublksrv_ctrl_cmd)
+/*
+ * Register a shared memory buffer for zero-copy I/O.
+ * Input:  ctrl_cmd.addr points to struct ublk_shmem_buf_reg (buffer VA + size)
+ *         ctrl_cmd.len  = sizeof(struct ublk_shmem_buf_reg)
+ * Result: >= 0 is the assigned buffer index, < 0 is error
+ *
+ * The kernel pins pages from the calling process's address space
+ * and inserts PFN ranges into a per-device maple tree. When a block
+ * request's pages match registered pages, the driver sets
+ * UBLK_IO_F_SHMEM_ZC and encodes the buffer index + offset in addr,
+ * allowing the server to access the data via its own mapping of the
+ * same shared memory — true zero copy.
+ *
+ * The memory can be backed by memfd, hugetlbfs, or any GUP-compatible
+ * shared mapping. Queue freeze is handled internally.
+ *
+ * The buffer VA and size are passed via a user buffer (not inline in
+ * ctrl_cmd) so that unprivileged devices can prepend the device path
+ * to ctrl_cmd.addr without corrupting the VA.
+ */
+#define UBLK_U_CMD_REG_BUF		\
+	_IOWR('u', 0x18, struct ublksrv_ctrl_cmd)
+/*
+ * Unregister a shared memory buffer.
+ * Input:  ctrl_cmd.data[0] = buffer index
+ */
+#define UBLK_U_CMD_UNREG_BUF		\
+	_IOWR('u', 0x19, struct ublksrv_ctrl_cmd)
+
+/* Parameter buffer for UBLK_U_CMD_REG_BUF, pointed to by ctrl_cmd.addr */
+struct ublk_shmem_buf_reg {
+	__u64	addr;	/* userspace virtual address of shared memory */
+	__u32	len;	/* buffer size in bytes (page-aligned, max 4GB) */
+	__u32	flags;
+};
+
+/* Pin pages without FOLL_WRITE; usable with write-sealed memfd */
+#define UBLK_SHMEM_BUF_READ_ONLY	(1U << 0)
 /*
  * 64bits are enough now, and it should be easy to extend in case of
  * running out of feature flags
@@ -370,6 +408,7 @@
 /* Disable automatic partition scanning when device is started */
 #define UBLK_F_NO_AUTO_PART_SCAN (1ULL << 18)
 
+
 /* device state */
 #define UBLK_S_DEV_DEAD	0
 #define UBLK_S_DEV_LIVE	1
@@ -469,6 +508,12 @@ struct ublksrv_ctrl_dev_info {
 #define		UBLK_IO_F_NEED_REG_BUF		(1U << 17)
 /* Request has an integrity data buffer */
 #define		UBLK_IO_F_INTEGRITY		(1UL << 18)
+/*
+ * I/O buffer is in a registered shared memory buffer. When set, the addr
+ * field in ublksrv_io_desc encodes buffer index and byte offset instead
+ * of a userspace virtual address.
+ */
+#define		UBLK_IO_F_SHMEM_ZC		(1U << 19)
 
 /*
  * io cmd is described by this structure, and stored in share memory, indexed
@@ -743,4 +788,31 @@ struct ublk_params {
 	struct ublk_param_integrity	integrity;
 };
 
+/*
+ * Shared memory zero-copy addr encoding for UBLK_IO_F_SHMEM_ZC.
+ *
+ * When UBLK_IO_F_SHMEM_ZC is set, ublksrv_io_desc.addr is encoded as:
+ *   bits [0:31]  = byte offset within the buffer (up to 4GB)
+ *   bits [32:47] = buffer index (up to 65536)
+ *   bits [48:63] = reserved (must be zero)
+ */
+#define UBLK_SHMEM_ZC_OFF_MASK		0xffffffffULL
+#define UBLK_SHMEM_ZC_IDX_OFF		32
+#define UBLK_SHMEM_ZC_IDX_MASK		0xffffULL
+
+static inline __u64 ublk_shmem_zc_addr(__u16 index, __u32 offset)
+{
+	return ((__u64)index << UBLK_SHMEM_ZC_IDX_OFF) | offset;
+}
+
+static inline __u16 ublk_shmem_zc_index(__u64 addr)
+{
+	return (addr >> UBLK_SHMEM_ZC_IDX_OFF) & UBLK_SHMEM_ZC_IDX_MASK;
+}
+
+static inline __u32 ublk_shmem_zc_offset(__u64 addr)
+{
+	return (__u32)(addr & UBLK_SHMEM_ZC_OFF_MASK);
+}
+
 #endif
-- 
2.53.0



* [PATCH v2 02/10] ublk: add PFN-based buffer matching in I/O path
  2026-03-31 15:31 [PATCH v2 00/10] ublk: add shared memory zero-copy support Ming Lei
  2026-03-31 15:31 ` [PATCH v2 01/10] ublk: add UBLK_U_CMD_REG_BUF/UNREG_BUF control commands Ming Lei
@ 2026-03-31 15:31 ` Ming Lei
  2026-04-07 19:47   ` Caleb Sander Mateos
  2026-03-31 15:31 ` [PATCH v2 03/10] ublk: enable UBLK_F_SHMEM_ZC feature flag Ming Lei
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 19+ messages in thread
From: Ming Lei @ 2026-03-31 15:31 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: Caleb Sander Mateos, Ming Lei

Add ublk_try_buf_match() which walks a request's bio_vecs, looks up
each page's PFN in the per-device maple tree, and verifies all pages
belong to the same registered buffer at contiguous offsets.

Add ublk_iod_is_shmem_zc() inline helper for checking whether a
request uses the shmem zero-copy path.

Integrate into the I/O path:
- ublk_setup_iod(): if pages match a registered buffer, set
  UBLK_IO_F_SHMEM_ZC and encode buffer index + offset in addr
- ublk_start_io(): skip ublk_map_io() for zero-copy requests
- __ublk_complete_rq(): skip ublk_unmap_io() for zero-copy requests

The feature remains disabled (ublk_support_shmem_zc() returns false)
until the UBLK_F_SHMEM_ZC flag is enabled in the next patch.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/block/ublk_drv.c | 77 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 76 insertions(+), 1 deletion(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index ac6ccc174d44..d53865437600 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -356,6 +356,8 @@ struct ublk_params_header {
 
 static void ublk_io_release(void *priv);
 static void ublk_stop_dev_unlocked(struct ublk_device *ub);
+static bool ublk_try_buf_match(struct ublk_device *ub, struct request *rq,
+				  u32 *buf_idx, u32 *buf_off);
 static void ublk_buf_cleanup(struct ublk_device *ub);
 static void ublk_abort_queue(struct ublk_device *ub, struct ublk_queue *ubq);
 static inline struct request *__ublk_check_and_get_req(struct ublk_device *ub,
@@ -426,6 +428,12 @@ static inline bool ublk_support_shmem_zc(const struct ublk_queue *ubq)
 	return false;
 }
 
+static inline bool ublk_iod_is_shmem_zc(const struct ublk_queue *ubq,
+					unsigned int tag)
+{
+	return ublk_get_iod(ubq, tag)->op_flags & UBLK_IO_F_SHMEM_ZC;
+}
+
 static inline bool ublk_dev_support_shmem_zc(const struct ublk_device *ub)
 {
 	return false;
@@ -1494,6 +1502,18 @@ static blk_status_t ublk_setup_iod(struct ublk_queue *ubq, struct request *req)
 	iod->nr_sectors = blk_rq_sectors(req);
 	iod->start_sector = blk_rq_pos(req);
 
+	/* Try shmem zero-copy match before setting addr */
+	if (ublk_support_shmem_zc(ubq) && ublk_rq_has_data(req)) {
+		u32 buf_idx, buf_off;
+
+		if (ublk_try_buf_match(ubq->dev, req,
+					  &buf_idx, &buf_off)) {
+			iod->op_flags |= UBLK_IO_F_SHMEM_ZC;
+			iod->addr = ublk_shmem_zc_addr(buf_idx, buf_off);
+			return BLK_STS_OK;
+		}
+	}
+
 	iod->addr = io->buf.addr;
 
 	return BLK_STS_OK;
@@ -1539,6 +1559,10 @@ static inline void __ublk_complete_rq(struct request *req, struct ublk_io *io,
 	    req_op(req) != REQ_OP_DRV_IN)
 		goto exit;
 
+	/* shmem zero copy: no data to unmap, pages already shared */
+	if (ublk_iod_is_shmem_zc(req->mq_hctx->driver_data, req->tag))
+		goto exit;
+
 	/* for READ request, writing data in iod->addr to rq buffers */
 	unmapped_bytes = ublk_unmap_io(need_map, req, io);
 
@@ -1697,8 +1721,13 @@ static void ublk_auto_buf_dispatch(const struct ublk_queue *ubq,
 static bool ublk_start_io(const struct ublk_queue *ubq, struct request *req,
 			  struct ublk_io *io)
 {
-	unsigned mapped_bytes = ublk_map_io(ubq, req, io);
+	unsigned mapped_bytes;
 
+	/* shmem zero copy: skip data copy, pages already shared */
+	if (ublk_iod_is_shmem_zc(ubq, req->tag))
+		return true;
+
+	mapped_bytes = ublk_map_io(ubq, req, io);
 
 	/* partially mapped, update io descriptor */
 	if (unlikely(mapped_bytes != blk_rq_bytes(req))) {
@@ -5458,7 +5487,53 @@ static void ublk_buf_cleanup(struct ublk_device *ub)
 	mtree_destroy(&ub->buf_tree);
 }
 
+/* Check if request pages match a registered shared memory buffer */
+static bool ublk_try_buf_match(struct ublk_device *ub,
+				   struct request *rq,
+				   u32 *buf_idx, u32 *buf_off)
+{
+	struct req_iterator iter;
+	struct bio_vec bv;
+	int index = -1;
+	unsigned long expected_offset = 0;
+	bool first = true;
+
+	rq_for_each_bvec(bv, rq, iter) {
+		unsigned long pfn = page_to_pfn(bv.bv_page);
+		struct ublk_buf_range *range;
+		unsigned long off;
 
+		range = mtree_load(&ub->buf_tree, pfn);
+		if (!range)
+			return false;
+
+		off = range->base_offset +
+			(pfn - range->base_pfn) * PAGE_SIZE + bv.bv_offset;
+
+		if (first) {
+			/* Read-only buffer can't serve READ (kernel writes) */
+			if ((range->flags & UBLK_SHMEM_BUF_READ_ONLY) &&
+			    req_op(rq) != REQ_OP_WRITE)
+				return false;
+			index = range->buf_index;
+			expected_offset = off;
+			*buf_off = off;
+			first = false;
+		} else {
+			if (range->buf_index != index)
+				return false;
+			if (off != expected_offset)
+				return false;
+		}
+		expected_offset += bv.bv_len;
+	}
+
+	if (first)
+		return false;
+
+	*buf_idx = index;
+	return true;
+}
 
 static int ublk_ctrl_uring_cmd_permission(struct ublk_device *ub,
 		u32 cmd_op, struct ublksrv_ctrl_cmd *header)
-- 
2.53.0



* [PATCH v2 03/10] ublk: enable UBLK_F_SHMEM_ZC feature flag
  2026-03-31 15:31 [PATCH v2 00/10] ublk: add shared memory zero-copy support Ming Lei
  2026-03-31 15:31 ` [PATCH v2 01/10] ublk: add UBLK_U_CMD_REG_BUF/UNREG_BUF control commands Ming Lei
  2026-03-31 15:31 ` [PATCH v2 02/10] ublk: add PFN-based buffer matching in I/O path Ming Lei
@ 2026-03-31 15:31 ` Ming Lei
  2026-04-07 19:47   ` Caleb Sander Mateos
  2026-03-31 15:31 ` [PATCH v2 04/10] ublk: eliminate permanent pages[] array from struct ublk_buf Ming Lei
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 19+ messages in thread
From: Ming Lei @ 2026-03-31 15:31 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: Caleb Sander Mateos, Ming Lei

Add UBLK_F_SHMEM_ZC (1ULL << 19) to the UAPI header and UBLK_F_ALL.
Switch ublk_support_shmem_zc() and ublk_dev_support_shmem_zc() from
returning false to checking the actual flag, enabling the shared
memory zero-copy feature for devices that request it.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 Documentation/block/ublk.rst  | 117 ++++++++++++++++++++++++++++++++++
 drivers/block/ublk_drv.c      |   7 +-
 include/uapi/linux/ublk_cmd.h |   7 ++
 3 files changed, 128 insertions(+), 3 deletions(-)

diff --git a/Documentation/block/ublk.rst b/Documentation/block/ublk.rst
index 6ad28039663d..a818e09a4b66 100644
--- a/Documentation/block/ublk.rst
+++ b/Documentation/block/ublk.rst
@@ -485,6 +485,123 @@ Limitations
   in case that too many ublk devices are handled by this single io_ring_ctx
   and each one has very large queue depth
 
+Shared Memory Zero Copy (UBLK_F_SHMEM_ZC)
+------------------------------------------
+
+The ``UBLK_F_SHMEM_ZC`` feature provides an alternative zero-copy path
+that works by sharing physical memory pages between the client application
+and the ublk server. Unlike the io_uring fixed buffer approach above,
+shared memory zero copy does not require io_uring buffer registration
+per I/O — instead, it relies on the kernel matching page frame numbers
+(PFNs) at I/O time. This lets the ublk server access the shared
+buffer directly through its own mapping, which the io_uring fixed
+buffer approach does not allow.
+
+Motivation
+~~~~~~~~~~
+
+Shared memory zero copy takes a different approach: if the client
+application and the ublk server both map the same physical memory, there is
+nothing to copy. The kernel detects the shared pages automatically and
+tells the server where the data already lives.
+
+``UBLK_F_SHMEM_ZC`` is best viewed as an optimization for cooperating
+client applications — when the client is willing to allocate its I/O
+buffers from shared memory, the entire data path becomes zero-copy with
+no per-I/O overhead.
+
+Use Cases
+~~~~~~~~~
+
+This feature is useful when the client application can be configured to
+use a specific shared memory region for its I/O buffers:
+
+- **Custom storage clients** that allocate I/O buffers from shared memory
+  (memfd, hugetlbfs) and issue direct I/O to the ublk device
+- **Database engines** that use pre-allocated buffer pools with O_DIRECT
+
+How It Works
+~~~~~~~~~~~~
+
+1. The ublk server and client both ``mmap()`` the same file (memfd or
+   hugetlbfs) with ``MAP_SHARED``. This gives both processes access to the
+   same physical pages.
+
+2. The ublk server registers its mapping with the kernel::
+
+     struct ublk_shmem_buf_reg buf = { .addr = mmap_va, .len = size };
+     ublk_ctrl_cmd(UBLK_U_CMD_REG_BUF, .addr = &buf);
+
+   The kernel pins the pages and builds a PFN lookup tree.
+
+3. When the client issues direct I/O (``O_DIRECT``) to ``/dev/ublkb*``,
+   the kernel checks whether the I/O buffer pages match any registered
+   pages by comparing PFNs.
+
+4. On a match, the kernel sets ``UBLK_IO_F_SHMEM_ZC`` in the I/O
+   descriptor and encodes the buffer index and offset in ``addr``::
+
+     if (iod->op_flags & UBLK_IO_F_SHMEM_ZC) {
+         /* Data is already in our shared mapping — zero copy */
+         index  = ublk_shmem_zc_index(iod->addr);
+         offset = ublk_shmem_zc_offset(iod->addr);
+         buf = shmem_table[index].mmap_base + offset;
+     }
+
+5. If pages do not match (e.g., the client used a non-shared buffer),
+   the I/O falls back to the normal copy path silently.
+
+The shared memory can be set up via two methods:
+
+- **Socket-based**: the client sends a memfd to the ublk server via
+  ``SCM_RIGHTS`` on a unix socket. The server mmaps and registers it.
+- **Hugetlbfs-based**: both processes ``mmap(MAP_SHARED)`` the same
+  hugetlbfs file. No IPC needed — same file gives same physical pages.
+
+Advantages
+~~~~~~~~~~
+
+- **Simple**: no per-I/O buffer registration or unregistration commands.
+  Once the shared buffer is registered, all matching I/O is zero-copy
+  automatically.
+- **Direct buffer access**: the ublk server can read and write the shared
+  buffer directly via its own mmap, without going through io_uring fixed
+  buffer operations. This is more friendly for server implementations.
+- **Fast**: PFN matching is a single maple tree lookup per bvec. No
+  io_uring command round-trips for buffer management.
+- **Compatible**: non-matching I/O silently falls back to the copy path.
+  The device works normally for any client, with zero-copy as an
+  optimization when shared memory is available.
+
+Limitations
+~~~~~~~~~~~
+
+- **Requires client cooperation**: the client must allocate its I/O
+  buffers from the shared memory region. This requires a custom or
+  configured client — standard applications using their own buffers
+  will not benefit.
+- **Direct I/O only**: buffered I/O (without ``O_DIRECT``) goes through
+  the page cache, which allocates its own pages. These kernel-allocated
+  pages will never match the registered shared buffer. Only ``O_DIRECT``
+  puts the client's buffer pages directly into the block I/O.
+
+Control Commands
+~~~~~~~~~~~~~~~~
+
+- ``UBLK_U_CMD_REG_BUF``
+
+  Register a shared memory buffer. ``ctrl_cmd.addr`` points to a
+  ``struct ublk_shmem_buf_reg`` carrying the buffer virtual address,
+  length and flags (e.g. ``UBLK_SHMEM_BUF_READ_ONLY``). Returns the
+  assigned buffer index (>= 0) on success. The kernel pins the pages
+  and builds the PFN lookup tree. Queue freeze is handled internally.
+
+- ``UBLK_U_CMD_UNREG_BUF``
+
+  Unregister a previously registered buffer. ``ctrl_cmd.data[0]`` is the
+  buffer index. Unpins pages and removes PFN entries from the lookup
+  tree.
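Filling the registration payload can be sketched as follows. The field layout mirrors this series' description of ``struct ublk_shmem_buf_reg`` (64-bit address, 32-bit length, 32-bit flags); the struct and flag value here are local stand-ins, and the authoritative definitions live in ``<linux/ublk_cmd.h>``.

```c
#include <stdint.h>

/*
 * Local mirror of the UBLK_U_CMD_REG_BUF payload as described in this
 * series: 64-bit user VA, 32-bit length (max 4GB per buffer), 32-bit
 * flags.  The real definition is struct ublk_shmem_buf_reg; the flag
 * value below is an assumed stand-in.
 */
struct shmem_buf_reg {
	uint64_t addr;
	uint32_t len;
	uint32_t flags;
};

#define SHMEM_BUF_READ_ONLY (1U << 0) /* assumed flag bit */

/* Prepare the payload that ctrl_cmd.addr will point to. */
static void shmem_buf_reg_init(struct shmem_buf_reg *reg, void *base,
			       uint32_t len, int read_only)
{
	reg->addr = (uintptr_t)base;
	reg->len = len;
	reg->flags = read_only ? SHMEM_BUF_READ_ONLY : 0;
}
```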
+
 References
 ==========
 
diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index d53865437600..c2b9992503a4 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -85,7 +85,8 @@
 		| (IS_ENABLED(CONFIG_BLK_DEV_INTEGRITY) ? UBLK_F_INTEGRITY : 0) \
 		| UBLK_F_SAFE_STOP_DEV \
 		| UBLK_F_BATCH_IO \
-		| UBLK_F_NO_AUTO_PART_SCAN)
+		| UBLK_F_NO_AUTO_PART_SCAN \
+		| UBLK_F_SHMEM_ZC)
 
 #define UBLK_F_ALL_RECOVERY_FLAGS (UBLK_F_USER_RECOVERY \
 		| UBLK_F_USER_RECOVERY_REISSUE \
@@ -425,7 +426,7 @@ static inline bool ublk_dev_support_zero_copy(const struct ublk_device *ub)
 
 static inline bool ublk_support_shmem_zc(const struct ublk_queue *ubq)
 {
-	return false;
+	return ubq->flags & UBLK_F_SHMEM_ZC;
 }
 
 static inline bool ublk_iod_is_shmem_zc(const struct ublk_queue *ubq,
@@ -436,7 +437,7 @@ static inline bool ublk_iod_is_shmem_zc(const struct ublk_queue *ubq,
 
 static inline bool ublk_dev_support_shmem_zc(const struct ublk_device *ub)
 {
-	return false;
+	return ub->dev_info.flags & UBLK_F_SHMEM_ZC;
 }
 
 static inline bool ublk_support_auto_buf_reg(const struct ublk_queue *ubq)
diff --git a/include/uapi/linux/ublk_cmd.h b/include/uapi/linux/ublk_cmd.h
index 52bb9b843d73..ecd258847d3d 100644
--- a/include/uapi/linux/ublk_cmd.h
+++ b/include/uapi/linux/ublk_cmd.h
@@ -408,6 +408,13 @@ struct ublk_shmem_buf_reg {
 /* Disable automatic partition scanning when device is started */
 #define UBLK_F_NO_AUTO_PART_SCAN (1ULL << 18)
 
+/*
+ * Enable shared memory zero copy. When enabled, the server can register
+ * shared memory buffers via UBLK_U_CMD_REG_BUF. If a block request's
+ * pages match a registered buffer, UBLK_IO_F_SHMEM_ZC is set and addr
+ * encodes the buffer index + offset instead of a userspace buffer address.
+ */
+#define UBLK_F_SHMEM_ZC	(1ULL << 19)
 
 /* device state */
 #define UBLK_S_DEV_DEAD	0
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v2 04/10] ublk: eliminate permanent pages[] array from struct ublk_buf
  2026-03-31 15:31 [PATCH v2 00/10] ublk: add shared memory zero-copy support Ming Lei
                   ` (2 preceding siblings ...)
  2026-03-31 15:31 ` [PATCH v2 03/10] ublk: enable UBLK_F_SHMEM_ZC feature flag Ming Lei
@ 2026-03-31 15:31 ` Ming Lei
  2026-04-07 19:50   ` Caleb Sander Mateos
  2026-03-31 15:31 ` [PATCH v2 05/10] selftests/ublk: add shared memory zero-copy support in kublk Ming Lei
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 19+ messages in thread
From: Ming Lei @ 2026-03-31 15:31 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: Caleb Sander Mateos, Ming Lei

The pages[] array (kvmalloc'd, 8 bytes per page = 2MB for a 1GB buffer)
was stored permanently in struct ublk_buf but only needed during
pin_user_pages_fast() and maple tree construction. Since the maple tree
already stores PFN ranges via ublk_buf_range, struct page pointers can
be recovered via pfn_to_page() during unregistration.

Make pages[] a temporary allocation in ublk_ctrl_reg_buf(), freed
immediately after the maple tree is built. Rewrite __ublk_ctrl_unreg_buf()
to iterate the maple tree for matching buf_index entries, recovering
struct page pointers via pfn_to_page() and unpinning in batches of 32.
Simplify ublk_buf_erase_ranges() to iterate the maple tree by buf_index
instead of walking the now-removed pages[] array.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/block/ublk_drv.c | 87 +++++++++++++++++++++++++---------------
 1 file changed, 55 insertions(+), 32 deletions(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index c2b9992503a4..2e475bdc54dd 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -296,7 +296,6 @@ struct ublk_queue {
 
 /* Per-registered shared memory buffer */
 struct ublk_buf {
-	struct page **pages;
 	unsigned int nr_pages;
 };
 
@@ -5261,27 +5260,25 @@ static void ublk_unquiesce_and_resume(struct gendisk *disk)
  * coalescing consecutive PFNs into single range entries.
  * Returns 0 on success, negative error with partial insertions unwound.
  */
-/* Erase coalesced PFN ranges from the maple tree for pages [0, nr_pages) */
-static void ublk_buf_erase_ranges(struct ublk_device *ub,
-				  struct ublk_buf *ubuf,
-				  unsigned long nr_pages)
+/* Erase coalesced PFN ranges from the maple tree matching buf_index */
+static void ublk_buf_erase_ranges(struct ublk_device *ub, int buf_index)
 {
-	unsigned long i;
-
-	for (i = 0; i < nr_pages; ) {
-		unsigned long pfn = page_to_pfn(ubuf->pages[i]);
-		unsigned long start = i;
+	MA_STATE(mas, &ub->buf_tree, 0, ULONG_MAX);
+	struct ublk_buf_range *range;
 
-		while (i + 1 < nr_pages &&
-		       page_to_pfn(ubuf->pages[i + 1]) == pfn + (i - start) + 1)
-			i++;
-		i++;
-		kfree(mtree_erase(&ub->buf_tree, pfn));
+	mas_lock(&mas);
+	mas_for_each(&mas, range, ULONG_MAX) {
+		if (range->buf_index == buf_index) {
+			mas_erase(&mas);
+			kfree(range);
+		}
 	}
+	mas_unlock(&mas);
 }
 
 static int __ublk_ctrl_reg_buf(struct ublk_device *ub,
-			       struct ublk_buf *ubuf, int index,
+			       struct ublk_buf *ubuf,
+			       struct page **pages, int index,
 			       unsigned short flags)
 {
 	unsigned long nr_pages = ubuf->nr_pages;
@@ -5289,13 +5286,13 @@ static int __ublk_ctrl_reg_buf(struct ublk_device *ub,
 	int ret;
 
 	for (i = 0; i < nr_pages; ) {
-		unsigned long pfn = page_to_pfn(ubuf->pages[i]);
+		unsigned long pfn = page_to_pfn(pages[i]);
 		unsigned long start = i;
 		struct ublk_buf_range *range;
 
 		/* Find run of consecutive PFNs */
 		while (i + 1 < nr_pages &&
-		       page_to_pfn(ubuf->pages[i + 1]) == pfn + (i - start) + 1)
+		       page_to_pfn(pages[i + 1]) == pfn + (i - start) + 1)
 			i++;
 		i++;	/* past the last page in this run */
 
@@ -5320,7 +5317,7 @@ static int __ublk_ctrl_reg_buf(struct ublk_device *ub,
 	return 0;
 
 unwind:
-	ublk_buf_erase_ranges(ub, ubuf, i);
+	ublk_buf_erase_ranges(ub, index);
 	return ret;
 }
 
@@ -5335,6 +5332,7 @@ static int ublk_ctrl_reg_buf(struct ublk_device *ub,
 	void __user *argp = (void __user *)(unsigned long)header->addr;
 	struct ublk_shmem_buf_reg buf_reg;
 	unsigned long addr, size, nr_pages;
+	struct page **pages = NULL;
 	unsigned int gup_flags;
 	struct gendisk *disk;
 	struct ublk_buf *ubuf;
@@ -5371,9 +5369,8 @@ static int ublk_ctrl_reg_buf(struct ublk_device *ub,
 		goto put_disk;
 	}
 
-	ubuf->pages = kvmalloc_array(nr_pages, sizeof(*ubuf->pages),
-				     GFP_KERNEL);
-	if (!ubuf->pages) {
+	pages = kvmalloc_array(nr_pages, sizeof(*pages), GFP_KERNEL);
+	if (!pages) {
 		ret = -ENOMEM;
 		goto err_free;
 	}
@@ -5382,7 +5379,7 @@ static int ublk_ctrl_reg_buf(struct ublk_device *ub,
 	if (!(buf_reg.flags & UBLK_SHMEM_BUF_READ_ONLY))
 		gup_flags |= FOLL_WRITE;
 
-	pinned = pin_user_pages_fast(addr, nr_pages, gup_flags, ubuf->pages);
+	pinned = pin_user_pages_fast(addr, nr_pages, gup_flags, pages);
 	if (pinned < 0) {
 		ret = pinned;
 		goto err_free_pages;
@@ -5406,7 +5403,7 @@ static int ublk_ctrl_reg_buf(struct ublk_device *ub,
 	if (ret)
 		goto err_unlock;
 
-	ret = __ublk_ctrl_reg_buf(ub, ubuf, index, buf_reg.flags);
+	ret = __ublk_ctrl_reg_buf(ub, ubuf, pages, index, buf_reg.flags);
 	if (ret) {
 		xa_erase(&ub->bufs_xa, index);
 		goto err_unlock;
@@ -5414,6 +5411,7 @@ static int ublk_ctrl_reg_buf(struct ublk_device *ub,
 
 	mutex_unlock(&ub->mutex);
 
+	kvfree(pages);
 	ublk_unquiesce_and_resume(disk);
 	ublk_put_disk(disk);
 	return index;
@@ -5422,9 +5420,9 @@ static int ublk_ctrl_reg_buf(struct ublk_device *ub,
 	mutex_unlock(&ub->mutex);
 	ublk_unquiesce_and_resume(disk);
 err_unpin:
-	unpin_user_pages(ubuf->pages, pinned);
+	unpin_user_pages(pages, pinned);
 err_free_pages:
-	kvfree(ubuf->pages);
+	kvfree(pages);
 err_free:
 	kfree(ubuf);
 put_disk:
@@ -5433,11 +5431,36 @@ static int ublk_ctrl_reg_buf(struct ublk_device *ub,
 }
 
 static void __ublk_ctrl_unreg_buf(struct ublk_device *ub,
-				  struct ublk_buf *ubuf)
+				  struct ublk_buf *ubuf, int buf_index)
 {
-	ublk_buf_erase_ranges(ub, ubuf, ubuf->nr_pages);
-	unpin_user_pages(ubuf->pages, ubuf->nr_pages);
-	kvfree(ubuf->pages);
+	MA_STATE(mas, &ub->buf_tree, 0, ULONG_MAX);
+	struct ublk_buf_range *range;
+	struct page *pages[32];
+
+	mas_lock(&mas);
+	mas_for_each(&mas, range, ULONG_MAX) {
+		unsigned long base, nr, off;
+
+		if (range->buf_index != buf_index)
+			continue;
+
+		base = range->base_pfn;
+		nr = mas.last - mas.index + 1;
+		mas_erase(&mas);
+
+		for (off = 0; off < nr; ) {
+			unsigned int batch = min_t(unsigned long,
+						   nr - off, 32);
+			unsigned int j;
+
+			for (j = 0; j < batch; j++)
+				pages[j] = pfn_to_page(base + off + j);
+			unpin_user_pages(pages, batch);
+			off += batch;
+		}
+		kfree(range);
+	}
+	mas_unlock(&mas);
 	kfree(ubuf);
 }
 
@@ -5468,7 +5491,7 @@ static int ublk_ctrl_unreg_buf(struct ublk_device *ub,
 		return -ENOENT;
 	}
 
-	__ublk_ctrl_unreg_buf(ub, ubuf);
+	__ublk_ctrl_unreg_buf(ub, ubuf, index);
 
 	mutex_unlock(&ub->mutex);
 
@@ -5483,7 +5506,7 @@ static void ublk_buf_cleanup(struct ublk_device *ub)
 	unsigned long index;
 
 	xa_for_each(&ub->bufs_xa, index, ubuf)
-		__ublk_ctrl_unreg_buf(ub, ubuf);
+		__ublk_ctrl_unreg_buf(ub, ubuf, index);
 	xa_destroy(&ub->bufs_xa);
 	mtree_destroy(&ub->buf_tree);
 }
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v2 05/10] selftests/ublk: add shared memory zero-copy support in kublk
  2026-03-31 15:31 [PATCH v2 00/10] ublk: add shared memory zero-copy support Ming Lei
                   ` (3 preceding siblings ...)
  2026-03-31 15:31 ` [PATCH v2 04/10] ublk: eliminate permanent pages[] array from struct ublk_buf Ming Lei
@ 2026-03-31 15:31 ` Ming Lei
  2026-03-31 15:31 ` [PATCH v2 06/10] selftests/ublk: add UBLK_F_SHMEM_ZC support for loop target Ming Lei
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 19+ messages in thread
From: Ming Lei @ 2026-03-31 15:31 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: Caleb Sander Mateos, Ming Lei

Add infrastructure for UBLK_F_SHMEM_ZC shared memory zero-copy:

- kublk.h: struct ublk_shmem_entry and table for tracking registered
  shared memory buffers
- kublk.c: per-device unix socket listener that accepts memfd
  registrations from clients via SCM_RIGHTS fd passing. The listener
  mmaps the memfd and registers the VA range with the kernel for PFN
  matching. Also adds --shmem_zc command line option.
- kublk.c: --htlb <path> option to open a pre-allocated hugetlbfs
  file, mmap it with MAP_SHARED|MAP_POPULATE, and register it with
  the kernel via ublk_ctrl_reg_buf(). Any process that mmaps the same
  hugetlbfs file shares the same physical pages, enabling zero-copy
  without socket-based fd passing.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 tools/testing/selftests/ublk/kublk.c | 340 ++++++++++++++++++++++++++-
 tools/testing/selftests/ublk/kublk.h |  14 ++
 2 files changed, 352 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/ublk/kublk.c b/tools/testing/selftests/ublk/kublk.c
index e1c3b3c55e56..bd97e34f131b 100644
--- a/tools/testing/selftests/ublk/kublk.c
+++ b/tools/testing/selftests/ublk/kublk.c
@@ -4,6 +4,7 @@
  */
 
 #include <linux/fs.h>
+#include <sys/un.h>
 #include "kublk.h"
 
 #define MAX_NR_TGT_ARG 	64
@@ -1085,13 +1086,312 @@ static int ublk_send_dev_event(const struct dev_ctx *ctx, struct ublk_dev *dev,
 }
 
 
+/*
+ * Shared memory registration socket listener.
+ *
+ * The parent daemon context listens on a per-device unix socket at
+ * /run/ublk/ublkb<dev_id>.sock for shared memory registration requests
+ * from clients. Clients send a memfd via SCM_RIGHTS; the server
+ * registers it with the kernel, mmaps it, and returns the assigned index.
+ */
+#define UBLK_SHMEM_SOCK_DIR	"/run/ublk"
+
+/* defined in kublk.h, shared with file_backed.c (loop target) */
+struct ublk_shmem_entry shmem_table[UBLK_BUF_MAX];
+int shmem_count;
+
+static void ublk_shmem_sock_path(int dev_id, char *buf, size_t len)
+{
+	snprintf(buf, len, "%s/ublkb%d.sock", UBLK_SHMEM_SOCK_DIR, dev_id);
+}
+
+static int ublk_shmem_sock_create(int dev_id)
+{
+	struct sockaddr_un addr = { .sun_family = AF_UNIX };
+	char path[108];
+	int fd;
+
+	mkdir(UBLK_SHMEM_SOCK_DIR, 0755);
+	ublk_shmem_sock_path(dev_id, path, sizeof(path));
+	unlink(path);
+
+	fd = socket(AF_UNIX, SOCK_STREAM | SOCK_NONBLOCK, 0);
+	if (fd < 0)
+		return -1;
+
+	snprintf(addr.sun_path, sizeof(addr.sun_path), "%s", path);
+	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
+		close(fd);
+		return -1;
+	}
+
+	listen(fd, 4);
+	ublk_dbg(UBLK_DBG_DEV, "shmem socket created: %s\n", path);
+	return fd;
+}
+
+static void ublk_shmem_sock_destroy(int dev_id, int sock_fd)
+{
+	char path[108];
+
+	if (sock_fd >= 0)
+		close(sock_fd);
+	ublk_shmem_sock_path(dev_id, path, sizeof(path));
+	unlink(path);
+}
+
+/* Receive a memfd from a client via SCM_RIGHTS */
+static int ublk_shmem_recv_fd(int client_fd)
+{
+	char buf[1];
+	struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
+	union {
+		char cmsg_buf[CMSG_SPACE(sizeof(int))];
+		struct cmsghdr align;
+	} u;
+	struct msghdr msg = {
+		.msg_iov = &iov,
+		.msg_iovlen = 1,
+		.msg_control = u.cmsg_buf,
+		.msg_controllen = sizeof(u.cmsg_buf),
+	};
+	struct cmsghdr *cmsg;
+
+	if (recvmsg(client_fd, &msg, 0) <= 0)
+		return -1;
+
+	cmsg = CMSG_FIRSTHDR(&msg);
+	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
+	    cmsg->cmsg_type != SCM_RIGHTS)
+		return -1;
+
+	return *(int *)CMSG_DATA(cmsg);
+}
+
+/* Register a shared memory buffer: store fd, mmap it, return index */
+static int ublk_shmem_register(int shmem_fd)
+{
+	off_t size;
+	void *base;
+	int idx;
+
+	if (shmem_count >= UBLK_BUF_MAX)
+		return -1;
+
+	size = lseek(shmem_fd, 0, SEEK_END);
+	if (size <= 0)
+		return -1;
+
+	base = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
+		    shmem_fd, 0);
+	if (base == MAP_FAILED)
+		return -1;
+
+	idx = shmem_count++;
+	shmem_table[idx].fd = shmem_fd;
+	shmem_table[idx].mmap_base = base;
+	shmem_table[idx].size = size;
+
+	ublk_dbg(UBLK_DBG_DEV, "shmem registered: index=%d fd=%d size=%zu\n",
+		 idx, shmem_fd, (size_t)size);
+	return idx;
+}
+
+static void ublk_shmem_unregister_all(void)
+{
+	int i;
+
+	for (i = 0; i < shmem_count; i++) {
+		if (shmem_table[i].mmap_base) {
+			munmap(shmem_table[i].mmap_base,
+			       shmem_table[i].size);
+			close(shmem_table[i].fd);
+			shmem_table[i].mmap_base = NULL;
+		}
+	}
+	shmem_count = 0;
+}
+
+static int ublk_ctrl_reg_buf(struct ublk_dev *dev, void *addr, size_t size)
+{
+	struct ublk_shmem_buf_reg buf_reg = {
+		.addr = (unsigned long)addr,
+		.len = size,
+	};
+	struct ublk_ctrl_cmd_data data = {
+		.cmd_op = UBLK_U_CMD_REG_BUF,
+		.flags = CTRL_CMD_HAS_BUF,
+		.addr = (unsigned long)&buf_reg,
+		.len = sizeof(buf_reg),
+	};
+
+	return __ublk_ctrl_cmd(dev, &data);
+}
+
+/*
+ * Handle one client connection: receive memfd, mmap it, register
+ * the VA range with kernel, send back the assigned index.
+ */
+static void ublk_shmem_handle_client(int sock_fd, struct ublk_dev *dev)
+{
+	int client_fd, memfd, idx, ret;
+	int32_t reply;
+	off_t size;
+	void *base;
+
+	client_fd = accept(sock_fd, NULL, NULL);
+	if (client_fd < 0)
+		return;
+
+	memfd = ublk_shmem_recv_fd(client_fd);
+	if (memfd < 0) {
+		reply = -1;
+		goto out;
+	}
+
+	/* mmap the memfd in server address space */
+	size = lseek(memfd, 0, SEEK_END);
+	if (size <= 0) {
+		reply = -1;
+		close(memfd);
+		goto out;
+	}
+	base = mmap(NULL, size, PROT_READ | PROT_WRITE,
+		    MAP_SHARED | MAP_POPULATE, memfd, 0);
+	if (base == MAP_FAILED) {
+		reply = -1;
+		close(memfd);
+		goto out;
+	}
+
+	/* Register server's VA range with kernel for PFN matching */
+	ret = ublk_ctrl_reg_buf(dev, base, size);
+	if (ret < 0) {
+		ublk_dbg(UBLK_DBG_DEV,
+			 "shmem_zc: kernel reg failed %d\n", ret);
+		munmap(base, size);
+		close(memfd);
+		reply = ret;
+		goto out;
+	}
+
+	/* Store in table; drop the duplicate mapping made by register */
+	idx = ublk_shmem_register(memfd);
+	if (idx >= 0) {
+		munmap(shmem_table[idx].mmap_base, size);
+		shmem_table[idx].mmap_base = base;
+	}
+	reply = idx;
+out:
+	send(client_fd, &reply, sizeof(reply), 0);
+	close(client_fd);
+}
+
+struct shmem_listener_info {
+	int dev_id;
+	int stop_efd;		/* eventfd to signal listener to stop */
+	int sock_fd;		/* listener socket fd (output) */
+	struct ublk_dev *dev;
+};
+
+/*
+ * Socket listener thread: runs in the parent daemon context alongside
+ * the I/O threads. Accepts shared memory registration requests from
+ * clients via SCM_RIGHTS. Exits when stop_efd is signaled.
+ */
+static void *ublk_shmem_listener_fn(void *data)
+{
+	struct shmem_listener_info *info = data;
+	struct pollfd pfds[2];
+
+	info->sock_fd = ublk_shmem_sock_create(info->dev_id);
+	if (info->sock_fd < 0)
+		return NULL;
+
+	pfds[0].fd = info->sock_fd;
+	pfds[0].events = POLLIN;
+	pfds[1].fd = info->stop_efd;
+	pfds[1].events = POLLIN;
+
+	while (1) {
+		int ret = poll(pfds, 2, -1);
+
+		if (ret < 0)
+			break;
+
+		/* Stop signal from parent */
+		if (pfds[1].revents & POLLIN)
+			break;
+
+		/* Client connection */
+		if (pfds[0].revents & POLLIN)
+			ublk_shmem_handle_client(info->sock_fd, info->dev);
+	}
+
+	return NULL;
+}
+
+static int ublk_shmem_htlb_setup(const struct dev_ctx *ctx,
+				 struct ublk_dev *dev)
+{
+	int fd, idx, ret;
+	struct stat st;
+	void *base;
+
+	fd = open(ctx->htlb_path, O_RDWR);
+	if (fd < 0) {
+		ublk_err("htlb: can't open %s\n", ctx->htlb_path);
+		return -errno;
+	}
+
+	if (fstat(fd, &st) < 0 || st.st_size <= 0) {
+		ublk_err("htlb: invalid file size\n");
+		close(fd);
+		return -EINVAL;
+	}
+
+	base = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
+		    MAP_SHARED | MAP_POPULATE, fd, 0);
+	if (base == MAP_FAILED) {
+		ublk_err("htlb: mmap failed\n");
+		close(fd);
+		return -ENOMEM;
+	}
+
+	ret = ublk_ctrl_reg_buf(dev, base, st.st_size);
+	if (ret < 0) {
+		ublk_err("htlb: reg_buf failed: %d\n", ret);
+		munmap(base, st.st_size);
+		close(fd);
+		return ret;
+	}
+
+	if (shmem_count >= UBLK_BUF_MAX) {
+		munmap(base, st.st_size);
+		close(fd);
+		return -ENOMEM;
+	}
+
+	idx = shmem_count++;
+	shmem_table[idx].fd = fd;
+	shmem_table[idx].mmap_base = base;
+	shmem_table[idx].size = st.st_size;
+
+	ublk_dbg(UBLK_DBG_DEV, "htlb registered: index=%d size=%zu\n",
+		 idx, (size_t)st.st_size);
+	return 0;
+}
+
 static int ublk_start_daemon(const struct dev_ctx *ctx, struct ublk_dev *dev)
 {
 	const struct ublksrv_ctrl_dev_info *dinfo = &dev->dev_info;
+	struct shmem_listener_info linfo = {};
 	struct ublk_thread_info *tinfo;
 	unsigned long long extra_flags = 0;
 	cpu_set_t *affinity_buf;
 	unsigned char (*q_thread_map)[UBLK_MAX_QUEUES] = NULL;
+	uint64_t stop_val = 1;
+	pthread_t listener;
 	void *thread_ret;
 	sem_t ready;
 	int ret, i;
@@ -1180,15 +1480,44 @@ static int ublk_start_daemon(const struct dev_ctx *ctx, struct ublk_dev *dev)
 		goto fail_start;
 	}
 
+	if (ctx->htlb_path) {
+		ret = ublk_shmem_htlb_setup(ctx, dev);
+		if (ret < 0) {
+			ublk_err("htlb setup failed: %d\n", ret);
+			ublk_ctrl_stop_dev(dev);
+			goto fail_start;
+		}
+	}
+
 	ublk_ctrl_get_info(dev);
 	if (ctx->fg)
 		ublk_ctrl_dump(dev);
 	else
 		ublk_send_dev_event(ctx, dev, dev->dev_info.dev_id);
 fail_start:
-	/* wait until we are terminated */
-	for (i = 0; i < dev->nthreads; i++)
+	/*
+	 * Wait for I/O threads to exit. While waiting, a listener
+	 * thread accepts shared memory registration requests from
+	 * clients via a per-device unix socket (SCM_RIGHTS fd passing).
+	 */
+	linfo.dev_id = dinfo->dev_id;
+	linfo.dev = dev;
+	linfo.stop_efd = eventfd(0, 0);
+	if (linfo.stop_efd >= 0)
+		pthread_create(&listener, NULL,
+			       ublk_shmem_listener_fn, &linfo);
+
+	for (i = 0; i < (int)dev->nthreads; i++)
 		pthread_join(tinfo[i].thread, &thread_ret);
+
+	/* Signal listener thread to stop and wait for it */
+	if (linfo.stop_efd >= 0) {
+		write(linfo.stop_efd, &stop_val, sizeof(stop_val));
+		pthread_join(listener, NULL);
+		close(linfo.stop_efd);
+		ublk_shmem_sock_destroy(dinfo->dev_id, linfo.sock_fd);
+	}
+	ublk_shmem_unregister_all();
 	free(tinfo);
  fail:
 	for (i = 0; i < dinfo->nr_hw_queues; i++)
@@ -1618,6 +1947,7 @@ static int cmd_dev_get_features(void)
 		FEAT_NAME(UBLK_F_SAFE_STOP_DEV),
 		FEAT_NAME(UBLK_F_BATCH_IO),
 		FEAT_NAME(UBLK_F_NO_AUTO_PART_SCAN),
+		FEAT_NAME(UBLK_F_SHMEM_ZC),
 	};
 	struct ublk_dev *dev;
 	__u64 features = 0;
@@ -1790,6 +2120,8 @@ int main(int argc, char *argv[])
 		{ "safe",		0,	NULL,  0 },
 		{ "batch",              0,      NULL, 'b'},
 		{ "no_auto_part_scan",	0,	NULL,  0 },
+		{ "shmem_zc",		0,	NULL,  0  },
+		{ "htlb",		1,	NULL,  0  },
 		{ 0, 0, 0, 0 }
 	};
 	const struct ublk_tgt_ops *ops = NULL;
@@ -1905,6 +2237,10 @@ int main(int argc, char *argv[])
 				ctx.safe_stop = 1;
 			if (!strcmp(longopts[option_idx].name, "no_auto_part_scan"))
 				ctx.flags |= UBLK_F_NO_AUTO_PART_SCAN;
+			if (!strcmp(longopts[option_idx].name, "shmem_zc"))
+				ctx.flags |= UBLK_F_SHMEM_ZC;
+			if (!strcmp(longopts[option_idx].name, "htlb"))
+				ctx.htlb_path = strdup(optarg);
 			break;
 		case '?':
 			/*
diff --git a/tools/testing/selftests/ublk/kublk.h b/tools/testing/selftests/ublk/kublk.h
index 02f0c55d006b..20d0a1eab41f 100644
--- a/tools/testing/selftests/ublk/kublk.h
+++ b/tools/testing/selftests/ublk/kublk.h
@@ -95,6 +95,8 @@ struct dev_ctx {
 	/* for 'update_size' command */
 	unsigned long long size;
 
+	char *htlb_path;
+
 	union {
 		struct stripe_ctx 	stripe;
 		struct fault_inject_ctx fault_inject;
@@ -599,6 +601,18 @@ static inline void ublk_queued_tgt_io(struct ublk_thread *t, struct ublk_queue *
 	}
 }
 
+/* shared memory zero-copy support */
+#define UBLK_BUF_MAX		256
+
+struct ublk_shmem_entry {
+	int fd;
+	void *mmap_base;
+	size_t size;
+};
+
+extern struct ublk_shmem_entry shmem_table[UBLK_BUF_MAX];
+extern int shmem_count;
+
 extern const struct ublk_tgt_ops null_tgt_ops;
 extern const struct ublk_tgt_ops loop_tgt_ops;
 extern const struct ublk_tgt_ops stripe_tgt_ops;
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v2 06/10] selftests/ublk: add UBLK_F_SHMEM_ZC support for loop target
  2026-03-31 15:31 [PATCH v2 00/10] ublk: add shared memory zero-copy support Ming Lei
                   ` (4 preceding siblings ...)
  2026-03-31 15:31 ` [PATCH v2 05/10] selftests/ublk: add shared memory zero-copy support in kublk Ming Lei
@ 2026-03-31 15:31 ` Ming Lei
  2026-03-31 15:31 ` [PATCH v2 07/10] selftests/ublk: add shared memory zero-copy test Ming Lei
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 19+ messages in thread
From: Ming Lei @ 2026-03-31 15:31 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: Caleb Sander Mateos, Ming Lei

Add loop_queue_shmem_zc_io() which handles I/O requests marked with
UBLK_IO_F_SHMEM_ZC. When the kernel sets this flag, the request data
lives in a registered shared memory buffer — decode index + offset
from iod->addr and use the server's mmap as the I/O buffer.

The dispatch check in loop_queue_tgt_rw_io() routes SHMEM_ZC requests
to this new function, bypassing the normal buffer registration path.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 tools/testing/selftests/ublk/file_backed.c | 38 ++++++++++++++++++++++
 1 file changed, 38 insertions(+)

diff --git a/tools/testing/selftests/ublk/file_backed.c b/tools/testing/selftests/ublk/file_backed.c
index 228af2580ac6..d28da98f917a 100644
--- a/tools/testing/selftests/ublk/file_backed.c
+++ b/tools/testing/selftests/ublk/file_backed.c
@@ -27,6 +27,40 @@ static int loop_queue_flush_io(struct ublk_thread *t, struct ublk_queue *q,
 	return 1;
 }
 
+/*
+ * Shared memory zero-copy I/O: when UBLK_IO_F_SHMEM_ZC is set, the
+ * request's data lives in a registered shared memory buffer. Decode
+ * index + offset from iod->addr and use the server's mmap of that
+ * buffer as the I/O buffer for the backing file.
+ */
+static int loop_queue_shmem_zc_io(struct ublk_thread *t, struct ublk_queue *q,
+				  const struct ublksrv_io_desc *iod, int tag)
+{
+	unsigned ublk_op = ublksrv_get_op(iod);
+	enum io_uring_op op = ublk_to_uring_op(iod, 0);
+	__u64 file_offset = iod->start_sector << 9;
+	__u32 len = iod->nr_sectors << 9;
+	__u32 shmem_idx = ublk_shmem_zc_index(iod->addr);
+	__u32 shmem_off = ublk_shmem_zc_offset(iod->addr);
+	struct io_uring_sqe *sqe[1];
+	void *addr;
+
+	if (shmem_idx >= UBLK_BUF_MAX || !shmem_table[shmem_idx].mmap_base)
+		return -EINVAL;
+
+	addr = shmem_table[shmem_idx].mmap_base + shmem_off;
+
+	ublk_io_alloc_sqes(t, sqe, 1);
+	if (!sqe[0])
+		return -ENOMEM;
+
+	io_uring_prep_rw(op, sqe[0], ublk_get_registered_fd(q, 1),
+			 addr, len, file_offset);
+	io_uring_sqe_set_flags(sqe[0], IOSQE_FIXED_FILE);
+	sqe[0]->user_data = build_user_data(tag, ublk_op, 0, q->q_id, 1);
+	return 1;
+}
+
 static int loop_queue_tgt_rw_io(struct ublk_thread *t, struct ublk_queue *q,
 				const struct ublksrv_io_desc *iod, int tag)
 {
@@ -41,6 +75,10 @@ static int loop_queue_tgt_rw_io(struct ublk_thread *t, struct ublk_queue *q,
 	void *addr = io->buf_addr;
 	unsigned short buf_index = ublk_io_buf_idx(t, q, tag);
 
+	/* shared memory zero-copy path */
+	if (iod->op_flags & UBLK_IO_F_SHMEM_ZC)
+		return loop_queue_shmem_zc_io(t, q, iod, tag);
+
 	if (iod->op_flags & UBLK_IO_F_INTEGRITY) {
 		ublk_io_alloc_sqes(t, sqe, 1);
 		/* Use second backing file for integrity data */
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v2 07/10] selftests/ublk: add shared memory zero-copy test
  2026-03-31 15:31 [PATCH v2 00/10] ublk: add shared memory zero-copy support Ming Lei
                   ` (5 preceding siblings ...)
  2026-03-31 15:31 ` [PATCH v2 06/10] selftests/ublk: add UBLK_F_SHMEM_ZC support for loop target Ming Lei
@ 2026-03-31 15:31 ` Ming Lei
  2026-03-31 15:31 ` [PATCH v2 08/10] selftests/ublk: add hugetlbfs shmem_zc test for loop target Ming Lei
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 19+ messages in thread
From: Ming Lei @ 2026-03-31 15:31 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: Caleb Sander Mateos, Ming Lei

Add test_shmemzc_01.sh which tests UBLK_IO_F_SHMEM_ZC on the null
target using a hugetlbfs shared buffer. Both kublk (--htlb) and fio
(--mem=mmaphuge:<path>) mmap the same hugetlbfs file with MAP_SHARED,
sharing physical pages. The kernel PFN match enables zero-copy I/O.

Uses standard fio --mem=mmaphuge:<path> (supported since fio 1.10),
no patched fio required.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 tools/testing/selftests/ublk/Makefile         |  2 +
 .../testing/selftests/ublk/test_shmemzc_01.sh | 72 +++++++++++++++++++
 2 files changed, 74 insertions(+)
 create mode 100755 tools/testing/selftests/ublk/test_shmemzc_01.sh

diff --git a/tools/testing/selftests/ublk/Makefile b/tools/testing/selftests/ublk/Makefile
index 8ac2d4a682a1..001b7dccf3c6 100644
--- a/tools/testing/selftests/ublk/Makefile
+++ b/tools/testing/selftests/ublk/Makefile
@@ -51,6 +51,8 @@ TEST_PROGS += test_stripe_06.sh
 TEST_PROGS += test_part_01.sh
 TEST_PROGS += test_part_02.sh
 
+TEST_PROGS += test_shmemzc_01.sh
+
 TEST_PROGS += test_stress_01.sh
 TEST_PROGS += test_stress_02.sh
 TEST_PROGS += test_stress_03.sh
diff --git a/tools/testing/selftests/ublk/test_shmemzc_01.sh b/tools/testing/selftests/ublk/test_shmemzc_01.sh
new file mode 100755
index 000000000000..47210af2aa20
--- /dev/null
+++ b/tools/testing/selftests/ublk/test_shmemzc_01.sh
@@ -0,0 +1,72 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Test: shmem_zc with hugetlbfs buffer on null target
+#
+# kublk and fio both mmap the same hugetlbfs file (MAP_SHARED),
+# so they share physical pages.  The kernel PFN match enables
+# zero-copy I/O without socket-based fd passing.
+
+. "$(cd "$(dirname "$0")" && pwd)"/test_common.sh
+
+ERR_CODE=0
+
+_prep_test "shmem_zc" "null target hugetlbfs shmem zero-copy test"
+
+if ! _have_program fio; then
+	echo "SKIP: fio not available"
+	exit "$UBLK_SKIP_CODE"
+fi
+
+if ! grep -q hugetlbfs /proc/filesystems; then
+	echo "SKIP: hugetlbfs not supported"
+	exit "$UBLK_SKIP_CODE"
+fi
+
+# Allocate hugepages
+OLD_NR_HP=$(cat /proc/sys/vm/nr_hugepages)
+echo 10 > /proc/sys/vm/nr_hugepages
+NR_HP=$(cat /proc/sys/vm/nr_hugepages)
+if [ "$NR_HP" -lt 2 ]; then
+	echo "SKIP: cannot allocate hugepages"
+	echo "$OLD_NR_HP" > /proc/sys/vm/nr_hugepages
+	exit "$UBLK_SKIP_CODE"
+fi
+
+# Mount hugetlbfs
+HTLB_MNT=$(mktemp -d "${UBLK_TEST_DIR}/htlb_mnt_XXXXXX")
+if ! mount -t hugetlbfs none "$HTLB_MNT"; then
+	echo "SKIP: cannot mount hugetlbfs"
+	rmdir "$HTLB_MNT"
+	echo "$OLD_NR_HP" > /proc/sys/vm/nr_hugepages
+	exit "$UBLK_SKIP_CODE"
+fi
+
+HTLB_FILE="$HTLB_MNT/ublk_buf"
+fallocate -l 4M "$HTLB_FILE"
+
+dev_id=$(_add_ublk_dev -t null --shmem_zc --htlb "$HTLB_FILE")
+_check_add_dev $TID $?
+
+fio --name=htlb_zc \
+	--filename=/dev/ublkb"${dev_id}" \
+	--ioengine=io_uring \
+	--rw=randwrite \
+	--direct=1 \
+	--bs=4k \
+	--size=4M \
+	--iodepth=32 \
+	--mem=mmaphuge:"$HTLB_FILE" \
+	> /dev/null 2>&1
+ERR_CODE=$?
+
+# Delete device first so daemon releases the htlb mmap
+_ublk_del_dev "${dev_id}"
+
+rm -f "$HTLB_FILE"
+umount "$HTLB_MNT"
+rmdir "$HTLB_MNT"
+echo "$OLD_NR_HP" > /proc/sys/vm/nr_hugepages
+
+_cleanup_test "shmem_zc"
+
+_show_result $TID $ERR_CODE
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v2 08/10] selftests/ublk: add hugetlbfs shmem_zc test for loop target
  2026-03-31 15:31 [PATCH v2 00/10] ublk: add shared memory zero-copy support Ming Lei
                   ` (6 preceding siblings ...)
  2026-03-31 15:31 ` [PATCH v2 07/10] selftests/ublk: add shared memory zero-copy test Ming Lei
@ 2026-03-31 15:31 ` Ming Lei
  2026-03-31 15:32 ` [PATCH v2 09/10] selftests/ublk: add filesystem fio verify test for shmem_zc Ming Lei
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 19+ messages in thread
From: Ming Lei @ 2026-03-31 15:31 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: Caleb Sander Mateos, Ming Lei

Add test_shmemzc_02.sh which tests the UBLK_IO_F_SHMEM_ZC zero-copy
path on the loop target using a hugetlbfs shared buffer. Both kublk and
fio mmap the same hugetlbfs file with MAP_SHARED, sharing physical
pages. The kernel's PFN matching enables zero-copy — the loop target
reads/writes directly from the shared buffer to the backing file.

Uses standard fio --mem=mmaphuge:<path> (supported since fio 1.10),
no patched fio required.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 tools/testing/selftests/ublk/Makefile         |  1 +
 .../testing/selftests/ublk/test_shmemzc_02.sh | 68 +++++++++++++++++++
 2 files changed, 69 insertions(+)
 create mode 100755 tools/testing/selftests/ublk/test_shmemzc_02.sh

diff --git a/tools/testing/selftests/ublk/Makefile b/tools/testing/selftests/ublk/Makefile
index 001b7dccf3c6..271fe11d8d0f 100644
--- a/tools/testing/selftests/ublk/Makefile
+++ b/tools/testing/selftests/ublk/Makefile
@@ -52,6 +52,7 @@ TEST_PROGS += test_part_01.sh
 TEST_PROGS += test_part_02.sh
 
 TEST_PROGS += test_shmemzc_01.sh
+TEST_PROGS += test_shmemzc_02.sh
 
 TEST_PROGS += test_stress_01.sh
 TEST_PROGS += test_stress_02.sh
diff --git a/tools/testing/selftests/ublk/test_shmemzc_02.sh b/tools/testing/selftests/ublk/test_shmemzc_02.sh
new file mode 100755
index 000000000000..aed9262494e9
--- /dev/null
+++ b/tools/testing/selftests/ublk/test_shmemzc_02.sh
@@ -0,0 +1,68 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Test: shmem_zc with hugetlbfs buffer on loop target
+#
+# kublk and fio both mmap the same hugetlbfs file (MAP_SHARED),
+# so they share physical pages.  The kernel PFN match enables
+# zero-copy I/O without socket-based fd passing.
+
+. "$(cd "$(dirname "$0")" && pwd)"/test_common.sh
+
+ERR_CODE=0
+
+_prep_test "shmem_zc" "loop target hugetlbfs shmem zero-copy test"
+
+if ! _have_program fio; then
+	echo "SKIP: fio not available"
+	exit "$UBLK_SKIP_CODE"
+fi
+
+if ! grep -q hugetlbfs /proc/filesystems; then
+	echo "SKIP: hugetlbfs not supported"
+	exit "$UBLK_SKIP_CODE"
+fi
+
+# Allocate hugepages
+OLD_NR_HP=$(cat /proc/sys/vm/nr_hugepages)
+echo 10 > /proc/sys/vm/nr_hugepages
+NR_HP=$(cat /proc/sys/vm/nr_hugepages)
+if [ "$NR_HP" -lt 2 ]; then
+	echo "SKIP: cannot allocate hugepages"
+	echo "$OLD_NR_HP" > /proc/sys/vm/nr_hugepages
+	exit "$UBLK_SKIP_CODE"
+fi
+
+# Mount hugetlbfs
+HTLB_MNT=$(mktemp -d "${UBLK_TEST_DIR}/htlb_mnt_XXXXXX")
+if ! mount -t hugetlbfs none "$HTLB_MNT"; then
+	echo "SKIP: cannot mount hugetlbfs"
+	rmdir "$HTLB_MNT"
+	echo "$OLD_NR_HP" > /proc/sys/vm/nr_hugepages
+	exit "$UBLK_SKIP_CODE"
+fi
+
+HTLB_FILE="$HTLB_MNT/ublk_buf"
+fallocate -l 4M "$HTLB_FILE"
+
+_create_backfile 0 128M
+BACKFILE="${UBLK_BACKFILES[0]}"
+
+dev_id=$(_add_ublk_dev -t loop --shmem_zc --htlb "$HTLB_FILE" "$BACKFILE")
+_check_add_dev $TID $?
+
+_run_fio_verify_io --filename=/dev/ublkb"${dev_id}" \
+	--size=128M \
+	--mem=mmaphuge:"$HTLB_FILE"
+ERR_CODE=$?
+
+# Delete device first so daemon releases the htlb mmap
+_ublk_del_dev "${dev_id}"
+
+rm -f "$HTLB_FILE"
+umount "$HTLB_MNT"
+rmdir "$HTLB_MNT"
+echo "$OLD_NR_HP" > /proc/sys/vm/nr_hugepages
+
+_cleanup_test "shmem_zc"
+
+_show_result $TID $ERR_CODE
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v2 09/10] selftests/ublk: add filesystem fio verify test for shmem_zc
  2026-03-31 15:31 [PATCH v2 00/10] ublk: add shared memory zero-copy support Ming Lei
                   ` (7 preceding siblings ...)
  2026-03-31 15:31 ` [PATCH v2 08/10] selftests/ublk: add hugetlbfs shmem_zc test for loop target Ming Lei
@ 2026-03-31 15:32 ` Ming Lei
  2026-03-31 15:32 ` [PATCH v2 10/10] selftests/ublk: add read-only buffer registration test Ming Lei
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 19+ messages in thread
From: Ming Lei @ 2026-03-31 15:32 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: Caleb Sander Mateos, Ming Lei

Add test_shmemzc_03.sh which exercises shmem_zc through the full
filesystem stack: mkfs ext4 on the ublk device, mount it, then run
fio verify on a file inside the filesystem with --mem=mmaphuge.

Extend _mkfs_mount_test() to accept an optional command that runs
between mount and umount. The function cd's into the mount directory
so the command can use relative file paths. Existing callers that
pass only the device are unaffected.
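The calling convention this adds can be sketched with a stand-in
helper; it uses a plain temp directory instead of a real mkfs/mount so
the sketch runs anywhere, and run_in_dir is an illustrative name, not
the selftest function:

```shell
#!/bin/bash
# Stand-in for the extended helper: run an optional command inside a
# directory, then restore the previous cwd and propagate the command's
# exit status.
run_in_dir() {
	local dir=$1
	shift
	local err_code=0
	if [ $# -gt 0 ]; then
		# cd in first so the command can use relative file paths
		cd "$dir" && "$@"
		err_code=$?
		cd - > /dev/null
	fi
	return $err_code
}

dir=$(mktemp -d)
run_in_dir "$dir" sh -c 'echo data > testfile'
cat "$dir/testfile"
rm -rf "$dir"
```

Callers that pass only the directory skip the inner block entirely,
which is why existing users of the helper are unaffected.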

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 tools/testing/selftests/ublk/Makefile         |  1 +
 tools/testing/selftests/ublk/test_common.sh   | 15 ++--
 .../testing/selftests/ublk/test_shmemzc_03.sh | 69 +++++++++++++++++++
 3 files changed, 81 insertions(+), 4 deletions(-)
 create mode 100755 tools/testing/selftests/ublk/test_shmemzc_03.sh

diff --git a/tools/testing/selftests/ublk/Makefile b/tools/testing/selftests/ublk/Makefile
index 271fe11d8d0f..3d1ad1730d93 100644
--- a/tools/testing/selftests/ublk/Makefile
+++ b/tools/testing/selftests/ublk/Makefile
@@ -53,6 +53,7 @@ TEST_PROGS += test_part_02.sh
 
 TEST_PROGS += test_shmemzc_01.sh
 TEST_PROGS += test_shmemzc_02.sh
+TEST_PROGS += test_shmemzc_03.sh
 
 TEST_PROGS += test_stress_01.sh
 TEST_PROGS += test_stress_02.sh
diff --git a/tools/testing/selftests/ublk/test_common.sh b/tools/testing/selftests/ublk/test_common.sh
index 163a40007910..af2ea4fa1111 100755
--- a/tools/testing/selftests/ublk/test_common.sh
+++ b/tools/testing/selftests/ublk/test_common.sh
@@ -88,6 +88,7 @@ _remove_tmp_dir() {
 _mkfs_mount_test()
 {
 	local dev=$1
+	shift
 	local err_code=0
 	local mnt_dir;
 
@@ -99,12 +100,17 @@ _mkfs_mount_test()
 	fi
 
 	mount -t ext4 "$dev" "$mnt_dir" > /dev/null 2>&1
+	if [ $# -gt 0 ]; then
+		cd "$mnt_dir" && "$@"
+		err_code=$?
+		cd - > /dev/null
+	fi
 	umount "$dev"
-	err_code=$?
-	_remove_tmp_dir "$mnt_dir"
-	if [ $err_code -ne 0 ]; then
-		return $err_code
+	if [ $? -ne 0 ] && [ $err_code -eq 0 ]; then
+		err_code=1
 	fi
+	_remove_tmp_dir "$mnt_dir"
+	return $err_code
 }
 
 _check_root() {
@@ -132,6 +138,7 @@ _prep_test() {
 	local base_dir=${TMPDIR:-./ublktest-dir}
 	mkdir -p "$base_dir"
 	UBLK_TEST_DIR=$(mktemp -d ${base_dir}/${TID}.XXXXXX)
+	UBLK_TEST_DIR=$(realpath ${UBLK_TEST_DIR})
 	UBLK_TMP=$(mktemp ${UBLK_TEST_DIR}/ublk_test_XXXXX)
 	[ "$UBLK_TEST_QUIET" -eq 0 ] && echo "ublk $type: $*"
 	echo "ublk selftest: $TID starting at $(date '+%F %T')" | tee /dev/kmsg
diff --git a/tools/testing/selftests/ublk/test_shmemzc_03.sh b/tools/testing/selftests/ublk/test_shmemzc_03.sh
new file mode 100755
index 000000000000..db967a9ffe81
--- /dev/null
+++ b/tools/testing/selftests/ublk/test_shmemzc_03.sh
@@ -0,0 +1,69 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Test: shmem_zc with fio verify over filesystem on loop target
+#
+# mkfs + mount ext4 on the ublk device, then run fio verify on a
+# file inside that filesystem.  Exercises the full stack:
+# filesystem -> block layer -> ublk shmem_zc -> loop target backing file.
+
+. "$(cd "$(dirname "$0")" && pwd)"/test_common.sh
+
+ERR_CODE=0
+
+_prep_test "shmem_zc" "loop target hugetlbfs shmem zero-copy fs verify test"
+
+if ! _have_program fio; then
+	echo "SKIP: fio not available"
+	exit "$UBLK_SKIP_CODE"
+fi
+
+if ! grep -q hugetlbfs /proc/filesystems; then
+	echo "SKIP: hugetlbfs not supported"
+	exit "$UBLK_SKIP_CODE"
+fi
+
+# Allocate hugepages
+OLD_NR_HP=$(cat /proc/sys/vm/nr_hugepages)
+echo 10 > /proc/sys/vm/nr_hugepages
+NR_HP=$(cat /proc/sys/vm/nr_hugepages)
+if [ "$NR_HP" -lt 2 ]; then
+	echo "SKIP: cannot allocate hugepages"
+	echo "$OLD_NR_HP" > /proc/sys/vm/nr_hugepages
+	exit "$UBLK_SKIP_CODE"
+fi
+
+# Mount hugetlbfs
+HTLB_MNT=$(mktemp -d "${UBLK_TEST_DIR}/htlb_mnt_XXXXXX")
+if ! mount -t hugetlbfs none "$HTLB_MNT"; then
+	echo "SKIP: cannot mount hugetlbfs"
+	rmdir "$HTLB_MNT"
+	echo "$OLD_NR_HP" > /proc/sys/vm/nr_hugepages
+	exit "$UBLK_SKIP_CODE"
+fi
+
+HTLB_FILE="$HTLB_MNT/ublk_buf"
+fallocate -l 4M "$HTLB_FILE"
+
+_create_backfile 0 256M
+BACKFILE="${UBLK_BACKFILES[0]}"
+
+dev_id=$(_add_ublk_dev -t loop --shmem_zc --htlb "$HTLB_FILE" "$BACKFILE")
+_check_add_dev $TID $?
+
+_mkfs_mount_test /dev/ublkb"${dev_id}" \
+	_run_fio_verify_io --filename=testfile \
+		--size=128M \
+		--mem=mmaphuge:"$HTLB_FILE"
+ERR_CODE=$?
+
+# Delete device first so daemon releases the htlb mmap
+_ublk_del_dev "${dev_id}"
+
+rm -f "$HTLB_FILE"
+umount "$HTLB_MNT"
+rmdir "$HTLB_MNT"
+echo "$OLD_NR_HP" > /proc/sys/vm/nr_hugepages
+
+_cleanup_test "shmem_zc"
+
+_show_result $TID $ERR_CODE
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v2 10/10] selftests/ublk: add read-only buffer registration test
  2026-03-31 15:31 [PATCH v2 00/10] ublk: add shared memory zero-copy support Ming Lei
                   ` (8 preceding siblings ...)
  2026-03-31 15:32 ` [PATCH v2 09/10] selftests/ublk: add filesystem fio verify test for shmem_zc Ming Lei
@ 2026-03-31 15:32 ` Ming Lei
  2026-04-07  2:38 ` [PATCH v2 00/10] ublk: add shared memory zero-copy support Ming Lei
  2026-04-07 13:44 ` Jens Axboe
  11 siblings, 0 replies; 19+ messages in thread
From: Ming Lei @ 2026-03-31 15:32 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: Caleb Sander Mateos, Ming Lei

Add --rdonly_shmem_buf option to kublk that registers shared memory
buffers with UBLK_SHMEM_BUF_READ_ONLY (read-only pinning without
FOLL_WRITE) and mmaps with PROT_READ only.

Add test_shmemzc_04.sh which exercises the new flag with a null target,
hugetlbfs buffer, and write workload. Write I/O works because the
server only reads from the shared buffer — the data flows from client
to kernel to the shared pages, and the server reads them out.
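The direction-of-access rule being exercised here can be modeled as a
small check; this is an illustrative sketch of the enforcement the
cover letter attributes to ublk_try_buf_match(), not the kernel code
itself, and buf_match_allowed is a hypothetical name:

```shell
#!/bin/bash
# Model of read-only enforcement: a buffer registered with
# UBLK_SHMEM_BUF_READ_ONLY may only match WRITE requests, because a
# READ request needs the driver to write payload data into the buffer.
# A failed match presumably falls back to ublk's normal copying path.
buf_match_allowed() {
	local op=$1 buf_mode=$2
	if [ "$buf_mode" = "ro" ] && [ "$op" != "WRITE" ]; then
		return 1	# no zero-copy match
	fi
	return 0
}

buf_match_allowed WRITE ro && echo "WRITE vs ro buffer: zero-copy match"
buf_match_allowed READ  ro || echo "READ vs ro buffer: no match"
```

This is why the test below can run a pure write workload against a
PROT_READ mapping in the server: the server side never writes to the
shared pages.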

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 tools/testing/selftests/ublk/Makefile         |  1 +
 tools/testing/selftests/ublk/kublk.c          | 17 +++--
 tools/testing/selftests/ublk/kublk.h          |  1 +
 .../testing/selftests/ublk/test_shmemzc_04.sh | 72 +++++++++++++++++++
 4 files changed, 86 insertions(+), 5 deletions(-)
 create mode 100755 tools/testing/selftests/ublk/test_shmemzc_04.sh

diff --git a/tools/testing/selftests/ublk/Makefile b/tools/testing/selftests/ublk/Makefile
index 3d1ad1730d93..d07f90fdd5b8 100644
--- a/tools/testing/selftests/ublk/Makefile
+++ b/tools/testing/selftests/ublk/Makefile
@@ -54,6 +54,7 @@ TEST_PROGS += test_part_02.sh
 TEST_PROGS += test_shmemzc_01.sh
 TEST_PROGS += test_shmemzc_02.sh
 TEST_PROGS += test_shmemzc_03.sh
+TEST_PROGS += test_shmemzc_04.sh
 
 TEST_PROGS += test_stress_01.sh
 TEST_PROGS += test_stress_02.sh
diff --git a/tools/testing/selftests/ublk/kublk.c b/tools/testing/selftests/ublk/kublk.c
index bd97e34f131b..7ed2fd5d6a0e 100644
--- a/tools/testing/selftests/ublk/kublk.c
+++ b/tools/testing/selftests/ublk/kublk.c
@@ -1212,11 +1212,13 @@ static void ublk_shmem_unregister_all(void)
 	shmem_count = 0;
 }
 
-static int ublk_ctrl_reg_buf(struct ublk_dev *dev, void *addr, size_t size)
+static int ublk_ctrl_reg_buf(struct ublk_dev *dev, void *addr, size_t size,
+			     __u32 flags)
 {
-	struct ublk_buf_reg buf_reg = {
+	struct ublk_shmem_buf_reg buf_reg = {
 		.addr = (unsigned long)addr,
 		.len = size,
+		.flags = flags,
 	};
 	struct ublk_ctrl_cmd_data data = {
 		.cmd_op = UBLK_U_CMD_REG_BUF,
@@ -1265,7 +1267,7 @@ static void ublk_shmem_handle_client(int sock_fd, struct ublk_dev *dev)
 	}
 
 	/* Register server's VA range with kernel for PFN matching */
-	ret = ublk_ctrl_reg_buf(dev, base, size);
+	ret = ublk_ctrl_reg_buf(dev, base, size, 0);
 	if (ret < 0) {
 		ublk_dbg(UBLK_DBG_DEV,
 			 "shmem_zc: kernel reg failed %d\n", ret);
@@ -1350,7 +1352,8 @@ static int ublk_shmem_htlb_setup(const struct dev_ctx *ctx,
 		return -EINVAL;
 	}
 
-	base = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
+	base = mmap(NULL, st.st_size,
+		    ctx->rdonly_shmem_buf ? PROT_READ : PROT_READ | PROT_WRITE,
 		    MAP_SHARED | MAP_POPULATE, fd, 0);
 	if (base == MAP_FAILED) {
 		ublk_err("htlb: mmap failed\n");
@@ -1358,7 +1361,8 @@ static int ublk_shmem_htlb_setup(const struct dev_ctx *ctx,
 		return -ENOMEM;
 	}
 
-	ret = ublk_ctrl_reg_buf(dev, base, st.st_size);
+	ret = ublk_ctrl_reg_buf(dev, base, st.st_size,
+			       ctx->rdonly_shmem_buf ? UBLK_SHMEM_BUF_READ_ONLY : 0);
 	if (ret < 0) {
 		ublk_err("htlb: reg_buf failed: %d\n", ret);
 		munmap(base, st.st_size);
@@ -2122,6 +2126,7 @@ int main(int argc, char *argv[])
 		{ "no_auto_part_scan",	0,	NULL,  0 },
 		{ "shmem_zc",		0,	NULL,  0  },
 		{ "htlb",		1,	NULL,  0  },
+		{ "rdonly_shmem_buf",	0,	NULL,  0  },
 		{ 0, 0, 0, 0 }
 	};
 	const struct ublk_tgt_ops *ops = NULL;
@@ -2241,6 +2246,8 @@ int main(int argc, char *argv[])
 				ctx.flags |= UBLK_F_SHMEM_ZC;
 			if (!strcmp(longopts[option_idx].name, "htlb"))
 				ctx.htlb_path = strdup(optarg);
+			if (!strcmp(longopts[option_idx].name, "rdonly_shmem_buf"))
+				ctx.rdonly_shmem_buf = 1;
 			break;
 		case '?':
 			/*
diff --git a/tools/testing/selftests/ublk/kublk.h b/tools/testing/selftests/ublk/kublk.h
index 20d0a1eab41f..467af9f487e9 100644
--- a/tools/testing/selftests/ublk/kublk.h
+++ b/tools/testing/selftests/ublk/kublk.h
@@ -80,6 +80,7 @@ struct dev_ctx {
 	unsigned int	no_ublk_fixed_fd:1;
 	unsigned int	safe_stop:1;
 	unsigned int	no_auto_part_scan:1;
+	unsigned int	rdonly_shmem_buf:1;
 	__u32 integrity_flags;
 	__u8 metadata_size;
 	__u8 pi_offset;
diff --git a/tools/testing/selftests/ublk/test_shmemzc_04.sh b/tools/testing/selftests/ublk/test_shmemzc_04.sh
new file mode 100755
index 000000000000..899de088ece4
--- /dev/null
+++ b/tools/testing/selftests/ublk/test_shmemzc_04.sh
@@ -0,0 +1,72 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Test: shmem_zc with read-only buffer registration on null target
+#
+# Same as test_shmemzc_01 but with --rdonly_shmem_buf: pages are pinned
+# without FOLL_WRITE (UBLK_SHMEM_BUF_READ_ONLY).  Write I/O works because
+# the server only reads from the shared buffer.
+
+. "$(cd "$(dirname "$0")" && pwd)"/test_common.sh
+
+ERR_CODE=0
+
+_prep_test "shmem_zc" "null target hugetlbfs shmem zero-copy rdonly_buf test"
+
+if ! _have_program fio; then
+	echo "SKIP: fio not available"
+	exit "$UBLK_SKIP_CODE"
+fi
+
+if ! grep -q hugetlbfs /proc/filesystems; then
+	echo "SKIP: hugetlbfs not supported"
+	exit "$UBLK_SKIP_CODE"
+fi
+
+# Allocate hugepages
+OLD_NR_HP=$(cat /proc/sys/vm/nr_hugepages)
+echo 10 > /proc/sys/vm/nr_hugepages
+NR_HP=$(cat /proc/sys/vm/nr_hugepages)
+if [ "$NR_HP" -lt 2 ]; then
+	echo "SKIP: cannot allocate hugepages"
+	echo "$OLD_NR_HP" > /proc/sys/vm/nr_hugepages
+	exit "$UBLK_SKIP_CODE"
+fi
+
+# Mount hugetlbfs
+HTLB_MNT=$(mktemp -d "${UBLK_TEST_DIR}/htlb_mnt_XXXXXX")
+if ! mount -t hugetlbfs none "$HTLB_MNT"; then
+	echo "SKIP: cannot mount hugetlbfs"
+	rmdir "$HTLB_MNT"
+	echo "$OLD_NR_HP" > /proc/sys/vm/nr_hugepages
+	exit "$UBLK_SKIP_CODE"
+fi
+
+HTLB_FILE="$HTLB_MNT/ublk_buf"
+fallocate -l 4M "$HTLB_FILE"
+
+dev_id=$(_add_ublk_dev -t null --shmem_zc --htlb "$HTLB_FILE" --rdonly_shmem_buf)
+_check_add_dev $TID $?
+
+fio --name=htlb_zc_rdonly \
+	--filename=/dev/ublkb"${dev_id}" \
+	--ioengine=io_uring \
+	--rw=randwrite \
+	--direct=1 \
+	--bs=4k \
+	--size=4M \
+	--iodepth=32 \
+	--mem=mmaphuge:"$HTLB_FILE" \
+	> /dev/null 2>&1
+ERR_CODE=$?
+
+# Delete device first so daemon releases the htlb mmap
+_ublk_del_dev "${dev_id}"
+
+rm -f "$HTLB_FILE"
+umount "$HTLB_MNT"
+rmdir "$HTLB_MNT"
+echo "$OLD_NR_HP" > /proc/sys/vm/nr_hugepages
+
+_cleanup_test "shmem_zc"
+
+_show_result $TID $ERR_CODE
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 00/10] ublk: add shared memory zero-copy support
  2026-03-31 15:31 [PATCH v2 00/10] ublk: add shared memory zero-copy support Ming Lei
                   ` (9 preceding siblings ...)
  2026-03-31 15:32 ` [PATCH v2 10/10] selftests/ublk: add read-only buffer registration test Ming Lei
@ 2026-04-07  2:38 ` Ming Lei
  2026-04-07 13:34   ` Jens Axboe
  2026-04-07 19:29   ` Caleb Sander Mateos
  2026-04-07 13:44 ` Jens Axboe
  11 siblings, 2 replies; 19+ messages in thread
From: Ming Lei @ 2026-04-07  2:38 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: Caleb Sander Mateos

On Tue, Mar 31, 2026 at 11:31:51PM +0800, Ming Lei wrote:
> Hello,
> 
> Add shared memory based zero-copy (UBLK_F_SHMEM_ZC) support for ublk.
> 
> The ublk server and its client share a memory region (e.g. memfd or
> hugetlbfs file) via MAP_SHARED mmap. The server registers this region
> with the kernel via UBLK_U_CMD_REG_BUF, which pins the pages and
> builds a PFN maple tree. When I/O arrives, the driver looks up bio
> pages in the maple tree — if they match registered buffer pages, the
> data is used directly without copying.
> 
> Please see details on document added in patch 3.
> 
> Patches 1-4 implement the kernel side:
>  - buffer register/unregister control commands with PFN coalescing,
>    including read-only buffer support (UBLK_SHMEM_BUF_READ_ONLY)
>  - PFN-based matching in the I/O path, with enforcement that read-only
>    buffers reject non-WRITE requests
>  - UBLK_F_SHMEM_ZC feature flag
>  - eliminate permanent pages[] array from struct ublk_buf; the maple
>    tree already stores PFN ranges, so pages[] becomes temporary
> 
> Patches 5-10 add kublk (selftest server) support and tests:
>  - hugetlbfs buffer sharing (both kublk and fio mmap the same file)
>  - null target and loop target tests with fio verify
>  - filesystem-level test (ext4 on ublk, fio verify on a file)
>  - read-only buffer registration test (--rdonly_shmem_buf)
> 
> Changes since V1:
>  - rename struct ublk_buf_reg to struct ublk_shmem_buf_reg, add __u32
>    flags field for extensibility, narrow __u64 len to __u32 (max 4GB
>    per UBLK_SHMEM_ZC_OFF_MASK), remove __u32 reserved (patch 1)
>  - add UBLK_SHMEM_BUF_READ_ONLY flag: pin pages without FOLL_WRITE,
>    enabling registration of write-sealed memfd buffers (patch 1)
>  - use backward-compatible struct reading: memset zero + copy
>    min(header->len, sizeof(struct)) (patch 1)
>  - reorder struct ublk_buf_range fields for better packing (16 bytes
>    vs 24 bytes), change buf_index to unsigned short, add unsigned short
>    flags to store per-range read-only state (patch 1)
>  - enforce read-only buffer semantics in ublk_try_buf_match(): reject
>    non-WRITE requests on read-only buffers since READ I/O needs to
>    write data into the buffer (patch 2)
>  - narrow struct ublk_buf::nr_pages to unsigned int, narrow struct
>    ublk_buf_range::base_offset to unsigned int (patch 1)
>  - add new patch 4: eliminate permanent pages[] array from struct
>    ublk_buf — recover struct page pointers via pfn_to_page() from the
>    maple tree during unregistration, saving 2MB per 1GB buffer
>  - add UBLK_F_SHMEM_ZC to feat_map in kublk (patch 5)
>  - add new patch 10: read-only buffer registration selftest with
>    --rdonly_shmem_buf option on null target + hugetlbfs

Hello,

Ping...


thanks,
Ming


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 00/10] ublk: add shared memory zero-copy support
  2026-04-07  2:38 ` [PATCH v2 00/10] ublk: add shared memory zero-copy support Ming Lei
@ 2026-04-07 13:34   ` Jens Axboe
  2026-04-07 19:29   ` Caleb Sander Mateos
  1 sibling, 0 replies; 19+ messages in thread
From: Jens Axboe @ 2026-04-07 13:34 UTC (permalink / raw)
  To: Ming Lei, linux-block; +Cc: Caleb Sander Mateos

On 4/6/26 8:38 PM, Ming Lei wrote:
> On Tue, Mar 31, 2026 at 11:31:51PM +0800, Ming Lei wrote:
>> Hello,
>>
>> Add shared memory based zero-copy (UBLK_F_SHMEM_ZC) support for ublk.
>>
>> The ublk server and its client share a memory region (e.g. memfd or
>> hugetlbfs file) via MAP_SHARED mmap. The server registers this region
>> with the kernel via UBLK_U_CMD_REG_BUF, which pins the pages and
>> builds a PFN maple tree. When I/O arrives, the driver looks up bio
>> pages in the maple tree — if they match registered buffer pages, the
>> data is used directly without copying.
>>
>> Please see details on document added in patch 3.
>>
>> Patches 1-4 implement the kernel side:
>>  - buffer register/unregister control commands with PFN coalescing,
>>    including read-only buffer support (UBLK_SHMEM_BUF_READ_ONLY)
>>  - PFN-based matching in the I/O path, with enforcement that read-only
>>    buffers reject non-WRITE requests
>>  - UBLK_F_SHMEM_ZC feature flag
>>  - eliminate permanent pages[] array from struct ublk_buf; the maple
>>    tree already stores PFN ranges, so pages[] becomes temporary
>>
>> Patches 5-10 add kublk (selftest server) support and tests:
>>  - hugetlbfs buffer sharing (both kublk and fio mmap the same file)
>>  - null target and loop target tests with fio verify
>>  - filesystem-level test (ext4 on ublk, fio verify on a file)
>>  - read-only buffer registration test (--rdonly_shmem_buf)
>>
>> Changes since V1:
>>  - rename struct ublk_buf_reg to struct ublk_shmem_buf_reg, add __u32
>>    flags field for extensibility, narrow __u64 len to __u32 (max 4GB
>>    per UBLK_SHMEM_ZC_OFF_MASK), remove __u32 reserved (patch 1)
>>  - add UBLK_SHMEM_BUF_READ_ONLY flag: pin pages without FOLL_WRITE,
>>    enabling registration of write-sealed memfd buffers (patch 1)
>>  - use backward-compatible struct reading: memset zero + copy
>>    min(header->len, sizeof(struct)) (patch 1)
>>  - reorder struct ublk_buf_range fields for better packing (16 bytes
>>    vs 24 bytes), change buf_index to unsigned short, add unsigned short
>>    flags to store per-range read-only state (patch 1)
>>  - enforce read-only buffer semantics in ublk_try_buf_match(): reject
>>    non-WRITE requests on read-only buffers since READ I/O needs to
>>    write data into the buffer (patch 2)
>>  - narrow struct ublk_buf::nr_pages to unsigned int, narrow struct
>>    ublk_buf_range::base_offset to unsigned int (patch 1)
>>  - add new patch 4: eliminate permanent pages[] array from struct
>>    ublk_buf — recover struct page pointers via pfn_to_page() from the
>>    maple tree during unregistration, saving 2MB per 1GB buffer
>>  - add UBLK_F_SHMEM_ZC to feat_map in kublk (patch 5)
>>  - add new patch 10: read-only buffer registration selftest with
>>    --rdonly_shmem_buf option on null target + hugetlbfs
> 
> Hello,
> 
> Ping...

It generally looks good to me. You have a few mixups of struct
ublk_buf_reg, which is the old name and should be struct
ublk_shmem_buf_reg; similarly, a function name changed and the doc was
not updated. I'll sort these out while applying.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 00/10] ublk: add shared memory zero-copy support
  2026-03-31 15:31 [PATCH v2 00/10] ublk: add shared memory zero-copy support Ming Lei
                   ` (10 preceding siblings ...)
  2026-04-07  2:38 ` [PATCH v2 00/10] ublk: add shared memory zero-copy support Ming Lei
@ 2026-04-07 13:44 ` Jens Axboe
  11 siblings, 0 replies; 19+ messages in thread
From: Jens Axboe @ 2026-04-07 13:44 UTC (permalink / raw)
  To: linux-block, Ming Lei; +Cc: Caleb Sander Mateos


On Tue, 31 Mar 2026 23:31:51 +0800, Ming Lei wrote:
> Add shared memory based zero-copy (UBLK_F_SHMEM_ZC) support for ublk.
> 
> The ublk server and its client share a memory region (e.g. memfd or
> hugetlbfs file) via MAP_SHARED mmap. The server registers this region
> with the kernel via UBLK_U_CMD_REG_BUF, which pins the pages and
> builds a PFN maple tree. When I/O arrives, the driver looks up bio
> pages in the maple tree — if they match registered buffer pages, the
> data is used directly without copying.
> 
> [...]

Applied, thanks!

[01/10] ublk: add UBLK_U_CMD_REG_BUF/UNREG_BUF control commands
        commit: 2fb0ded237bb55dae45bc076666b348fc948ac9e
[02/10] ublk: add PFN-based buffer matching in I/O path
        commit: 4d4a512a1f87b156f694d25c800e3d525aa56e8a
[03/10] ublk: enable UBLK_F_SHMEM_ZC feature flag
        commit: 08677040a91199175149d1fd465c02e3b3fc768a
[04/10] ublk: eliminate permanent pages[] array from struct ublk_buf
        commit: 8a34e88769f617dc980edb5a0079e347bd1b9a89
[05/10] selftests/ublk: add shared memory zero-copy support in kublk
        commit: 166b476b8dee61dc6501f6eb91619d28c3430f75
[06/10] selftests/ublk: add UBLK_F_SHMEM_ZC support for loop target
        commit: ec20aa44ac2629943c9b2b5524bcb55d778f746c
[07/10] selftests/ublk: add shared memory zero-copy test
        commit: 2f1e9468bdcba7e7572e16defd3c516f24281f14
[08/10] selftests/ublk: add hugetlbfs shmem_zc test for loop target
        commit: d4866503324c062f70dddfdd2e59957d335fc230
[09/10] selftests/ublk: add filesystem fio verify test for shmem_zc
        commit: 12075992c62ee330b2c531fa066b19be21698115
[10/10] selftests/ublk: add read-only buffer registration test
        commit: affb5f67d73c1e0bd412e7807a55691502b5679e

Best regards,
-- 
Jens Axboe




^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 00/10] ublk: add shared memory zero-copy support
  2026-04-07  2:38 ` [PATCH v2 00/10] ublk: add shared memory zero-copy support Ming Lei
  2026-04-07 13:34   ` Jens Axboe
@ 2026-04-07 19:29   ` Caleb Sander Mateos
  1 sibling, 0 replies; 19+ messages in thread
From: Caleb Sander Mateos @ 2026-04-07 19:29 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, linux-block

On Mon, Apr 6, 2026 at 7:39 PM Ming Lei <ming.lei@redhat.com> wrote:
>
> On Tue, Mar 31, 2026 at 11:31:51PM +0800, Ming Lei wrote:
> > Hello,
> >
> > Add shared memory based zero-copy (UBLK_F_SHMEM_ZC) support for ublk.
> >
> > The ublk server and its client share a memory region (e.g. memfd or
> > hugetlbfs file) via MAP_SHARED mmap. The server registers this region
> > with the kernel via UBLK_U_CMD_REG_BUF, which pins the pages and
> > builds a PFN maple tree. When I/O arrives, the driver looks up bio
> > pages in the maple tree — if they match registered buffer pages, the
> > data is used directly without copying.
> >
> > Please see details on document added in patch 3.
> >
> > Patches 1-4 implement the kernel side:
> >  - buffer register/unregister control commands with PFN coalescing,
> >    including read-only buffer support (UBLK_SHMEM_BUF_READ_ONLY)
> >  - PFN-based matching in the I/O path, with enforcement that read-only
> >    buffers reject non-WRITE requests
> >  - UBLK_F_SHMEM_ZC feature flag
> >  - eliminate permanent pages[] array from struct ublk_buf; the maple
> >    tree already stores PFN ranges, so pages[] becomes temporary
> >
> > Patches 5-10 add kublk (selftest server) support and tests:
> >  - hugetlbfs buffer sharing (both kublk and fio mmap the same file)
> >  - null target and loop target tests with fio verify
> >  - filesystem-level test (ext4 on ublk, fio verify on a file)
> >  - read-only buffer registration test (--rdonly_shmem_buf)
> >
> > Changes since V1:
> >  - rename struct ublk_buf_reg to struct ublk_shmem_buf_reg, add __u32
> >    flags field for extensibility, narrow __u64 len to __u32 (max 4GB
> >    per UBLK_SHMEM_ZC_OFF_MASK), remove __u32 reserved (patch 1)
> >  - add UBLK_SHMEM_BUF_READ_ONLY flag: pin pages without FOLL_WRITE,
> >    enabling registration of write-sealed memfd buffers (patch 1)
> >  - use backward-compatible struct reading: memset zero + copy
> >    min(header->len, sizeof(struct)) (patch 1)
> >  - reorder struct ublk_buf_range fields for better packing (16 bytes
> >    vs 24 bytes), change buf_index to unsigned short, add unsigned short
> >    flags to store per-range read-only state (patch 1)
> >  - enforce read-only buffer semantics in ublk_try_buf_match(): reject
> >    non-WRITE requests on read-only buffers since READ I/O needs to
> >    write data into the buffer (patch 2)
> >  - narrow struct ublk_buf::nr_pages to unsigned int, narrow struct
> >    ublk_buf_range::base_offset to unsigned int (patch 1)
> >  - add new patch 4: eliminate permanent pages[] array from struct
> >    ublk_buf — recover struct page pointers via pfn_to_page() from the
> >    maple tree during unregistration, saving 2MB per 1GB buffer
> >  - add UBLK_F_SHMEM_ZC to feat_map in kublk (patch 5)
> >  - add new patch 10: read-only buffer registration selftest with
> >    --rdonly_shmem_buf option on null target + hugetlbfs
>
> Hello,
>
> Ping...

Sorry I didn't get a chance to look at this earlier. It looks really
nice, thank you for implementing it! I have just a few comments. One
is about the UAPI (can we just use virtual addresses instead of buffer
index + offset), so I might wait on landing this patchset until we've
finalized that.

Thanks,
Caleb

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 01/10] ublk: add UBLK_U_CMD_REG_BUF/UNREG_BUF control commands
  2026-03-31 15:31 ` [PATCH v2 01/10] ublk: add UBLK_U_CMD_REG_BUF/UNREG_BUF control commands Ming Lei
@ 2026-04-07 19:35   ` Caleb Sander Mateos
  0 siblings, 0 replies; 19+ messages in thread
From: Caleb Sander Mateos @ 2026-04-07 19:35 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, linux-block

On Tue, Mar 31, 2026 at 8:32 AM Ming Lei <ming.lei@redhat.com> wrote:
>
> Add control commands for registering and unregistering shared memory
> buffers for zero-copy I/O:
>
> - UBLK_U_CMD_REG_BUF (0x18): pins pages from userspace, inserts PFN
>   ranges into a per-device maple tree for O(log n) lookup during I/O.
>   Buffer pointers are tracked in a per-device xarray. Returns the
>   assigned buffer index.
>
> - UBLK_U_CMD_UNREG_BUF (0x19): removes PFN entries and unpins pages.
>
> Queue freeze/unfreeze is handled internally so userspace need not
> quiesce the device during registration.
>
> Also adds:
> - UBLK_IO_F_SHMEM_ZC flag and addr encoding helpers in UAPI header
>   (16-bit buffer index supporting up to 65536 buffers)
> - Data structures (ublk_buf, ublk_buf_range) and xarray/maple tree
> - __ublk_ctrl_reg_buf() helper for PFN insertion with error unwinding
> - __ublk_ctrl_unreg_buf() helper for cleanup reuse
> - ublk_support_shmem_zc() / ublk_dev_support_shmem_zc() stubs
>   (returning false — feature not enabled yet)
>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  drivers/block/ublk_drv.c      | 300 ++++++++++++++++++++++++++++++++++
>  include/uapi/linux/ublk_cmd.h |  72 ++++++++
>  2 files changed, 372 insertions(+)
>
> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> index 71c7c56b38ca..ac6ccc174d44 100644
> --- a/drivers/block/ublk_drv.c
> +++ b/drivers/block/ublk_drv.c
> @@ -46,6 +46,8 @@
>  #include <linux/kref.h>
>  #include <linux/kfifo.h>
>  #include <linux/blk-integrity.h>
> +#include <linux/maple_tree.h>
> +#include <linux/xarray.h>
>  #include <uapi/linux/fs.h>
>  #include <uapi/linux/ublk_cmd.h>
>
> @@ -58,6 +60,8 @@
>  #define UBLK_CMD_UPDATE_SIZE   _IOC_NR(UBLK_U_CMD_UPDATE_SIZE)
>  #define UBLK_CMD_QUIESCE_DEV   _IOC_NR(UBLK_U_CMD_QUIESCE_DEV)
>  #define UBLK_CMD_TRY_STOP_DEV  _IOC_NR(UBLK_U_CMD_TRY_STOP_DEV)
> +#define UBLK_CMD_REG_BUF       _IOC_NR(UBLK_U_CMD_REG_BUF)
> +#define UBLK_CMD_UNREG_BUF     _IOC_NR(UBLK_U_CMD_UNREG_BUF)
>
>  #define UBLK_IO_REGISTER_IO_BUF                _IOC_NR(UBLK_U_IO_REGISTER_IO_BUF)
>  #define UBLK_IO_UNREGISTER_IO_BUF      _IOC_NR(UBLK_U_IO_UNREGISTER_IO_BUF)
> @@ -289,6 +293,20 @@ struct ublk_queue {
>         struct ublk_io ios[] __counted_by(q_depth);
>  };
>
> +/* Per-registered shared memory buffer */
> +struct ublk_buf {
> +       struct page **pages;
> +       unsigned int nr_pages;
> +};
> +
> +/* Maple tree value: maps a PFN range to buffer location */
> +struct ublk_buf_range {
> +       unsigned long base_pfn;
> +       unsigned short buf_index;
> +       unsigned short flags;
> +       unsigned int base_offset;       /* byte offset within buffer */
> +};
> +
>  struct ublk_device {
>         struct gendisk          *ub_disk;
>
> @@ -323,6 +341,10 @@ struct ublk_device {
>
>         bool                    block_open; /* protected by open_mutex */
>
> +       /* shared memory zero copy */
> +       struct maple_tree       buf_tree;
> +       struct xarray           bufs_xa;
> +
>         struct ublk_queue       *queues[];
>  };
>
> @@ -334,6 +356,7 @@ struct ublk_params_header {
>
>  static void ublk_io_release(void *priv);
>  static void ublk_stop_dev_unlocked(struct ublk_device *ub);
> +static void ublk_buf_cleanup(struct ublk_device *ub);
>  static void ublk_abort_queue(struct ublk_device *ub, struct ublk_queue *ubq);
>  static inline struct request *__ublk_check_and_get_req(struct ublk_device *ub,
>                 u16 q_id, u16 tag, struct ublk_io *io);
> @@ -398,6 +421,16 @@ static inline bool ublk_dev_support_zero_copy(const struct ublk_device *ub)
>         return ub->dev_info.flags & UBLK_F_SUPPORT_ZERO_COPY;
>  }
>
> +static inline bool ublk_support_shmem_zc(const struct ublk_queue *ubq)
> +{
> +       return false;
> +}
> +
> +static inline bool ublk_dev_support_shmem_zc(const struct ublk_device *ub)
> +{
> +       return false;
> +}
> +
>  static inline bool ublk_support_auto_buf_reg(const struct ublk_queue *ubq)
>  {
>         return ubq->flags & UBLK_F_AUTO_BUF_REG;
> @@ -1460,6 +1493,7 @@ static blk_status_t ublk_setup_iod(struct ublk_queue *ubq, struct request *req)
>         iod->op_flags = ublk_op | ublk_req_build_flags(req);
>         iod->nr_sectors = blk_rq_sectors(req);
>         iod->start_sector = blk_rq_pos(req);
> +

nit: unrelated whitespace change?

>         iod->addr = io->buf.addr;
>
>         return BLK_STS_OK;
> @@ -1665,6 +1699,7 @@ static bool ublk_start_io(const struct ublk_queue *ubq, struct request *req,
>  {
>         unsigned mapped_bytes = ublk_map_io(ubq, req, io);
>
> +

nit: unrelated whitespace change?

>         /* partially mapped, update io descriptor */
>         if (unlikely(mapped_bytes != blk_rq_bytes(req))) {
>                 /*
> @@ -4206,6 +4241,7 @@ static void ublk_cdev_rel(struct device *dev)
>  {
>         struct ublk_device *ub = container_of(dev, struct ublk_device, cdev_dev);
>
> +       ublk_buf_cleanup(ub);
>         blk_mq_free_tag_set(&ub->tag_set);
>         ublk_deinit_queues(ub);
>         ublk_free_dev_number(ub);
> @@ -4625,6 +4661,8 @@ static int ublk_ctrl_add_dev(const struct ublksrv_ctrl_cmd *header)
>         mutex_init(&ub->mutex);
>         spin_lock_init(&ub->lock);
>         mutex_init(&ub->cancel_mutex);
> +       mt_init(&ub->buf_tree);
> +       xa_init_flags(&ub->bufs_xa, XA_FLAGS_ALLOC);
>         INIT_WORK(&ub->partition_scan_work, ublk_partition_scan_work);
>
>         ret = ublk_alloc_dev_number(ub, header->dev_id);
> @@ -5168,6 +5206,260 @@ static int ublk_char_dev_permission(struct ublk_device *ub,
>         return err;
>  }
>
> +/*
> + * Drain inflight I/O and quiesce the queue. Freeze drains all inflight
> + * requests, quiesce_nowait marks the queue so no new requests dispatch,
> + * then unfreeze allows new submissions (which won't dispatch due to
> + * quiesce). This keeps freeze and ub->mutex non-nested.
> + */
> +static void ublk_quiesce_and_release(struct gendisk *disk)
> +{
> +       unsigned int memflags;
> +
> +       memflags = blk_mq_freeze_queue(disk->queue);
> +       blk_mq_quiesce_queue_nowait(disk->queue);
> +       blk_mq_unfreeze_queue(disk->queue, memflags);
> +}
> +
> +static void ublk_unquiesce_and_resume(struct gendisk *disk)
> +{
> +       blk_mq_unquiesce_queue(disk->queue);
> +}
> +
> +/* Erase coalesced PFN ranges from the maple tree for pages [0, nr_pages) */
> +static void ublk_buf_erase_ranges(struct ublk_device *ub,
> +                                 struct ublk_buf *ubuf,
> +                                 unsigned long nr_pages)
> +{
> +       unsigned long i;
> +
> +       for (i = 0; i < nr_pages; ) {
> +               unsigned long pfn = page_to_pfn(ubuf->pages[i]);
> +               unsigned long start = i;
> +
> +               while (i + 1 < nr_pages &&
> +                      page_to_pfn(ubuf->pages[i + 1]) == pfn + (i - start) + 1)
> +                       i++;
> +               i++;
> +               kfree(mtree_erase(&ub->buf_tree, pfn));
> +       }
> +}
> +
> +/*
> + * Insert PFN ranges of a registered buffer into the maple tree,
> + * coalescing consecutive PFNs into single range entries.
> + * Returns 0 on success, negative error with partial insertions unwound.
> + */
> +static int __ublk_ctrl_reg_buf(struct ublk_device *ub,
> +                              struct ublk_buf *ubuf, int index,
> +                              unsigned short flags)
> +{
> +       unsigned long nr_pages = ubuf->nr_pages;
> +       unsigned long i;
> +       int ret;
> +
> +       for (i = 0; i < nr_pages; ) {
> +               unsigned long pfn = page_to_pfn(ubuf->pages[i]);
> +               unsigned long start = i;
> +               struct ublk_buf_range *range;
> +
> +               /* Find run of consecutive PFNs */
> +               while (i + 1 < nr_pages &&
> +                      page_to_pfn(ubuf->pages[i + 1]) == pfn + (i - start) + 1)
> +                       i++;
> +               i++;    /* past the last page in this run */

Move this increment to the for loop so you don't need the "- 1" in the
mtree_insert_range() call?

> +
> +               range = kzalloc(sizeof(*range), GFP_KERNEL);

Not sure kzalloc() is necessary; all the fields are initialized below

> +               if (!range) {
> +                       ret = -ENOMEM;
> +                       goto unwind;
> +               }
> +               range->buf_index = index;
> +               range->flags = flags;
> +               range->base_pfn = pfn;
> +               range->base_offset = start << PAGE_SHIFT;
> +
> +               ret = mtree_insert_range(&ub->buf_tree, pfn,
> +                                        pfn + (i - start) - 1,
> +                                        range, GFP_KERNEL);
> +               if (ret) {
> +                       kfree(range);
> +                       goto unwind;
> +               }
> +       }
> +       return 0;
> +
> +unwind:
> +       ublk_buf_erase_ranges(ub, ubuf, i);
> +       return ret;
> +}
> +
> +/*
> + * Register a shared memory buffer for zero-copy I/O.
> + * Pins pages, builds PFN maple tree, freezes/unfreezes the queue
> + * internally. Returns buffer index (>= 0) on success.
> + */
> +static int ublk_ctrl_reg_buf(struct ublk_device *ub,
> +                            struct ublksrv_ctrl_cmd *header)
> +{
> +       void __user *argp = (void __user *)(unsigned long)header->addr;
> +       struct ublk_shmem_buf_reg buf_reg;
> +       unsigned long addr, size, nr_pages;

size and nr_pages could be u32

> +       unsigned int gup_flags;
> +       struct gendisk *disk;
> +       struct ublk_buf *ubuf;
> +       long pinned;

pinned could be int to match the return type of pin_user_pages_fast()

> +       u32 index;
> +       int ret;
> +
> +       if (!ublk_dev_support_shmem_zc(ub))
> +               return -EOPNOTSUPP;
> +
> +       memset(&buf_reg, 0, sizeof(buf_reg));
> +       if (copy_from_user(&buf_reg, argp,
> +                          min_t(size_t, header->len, sizeof(buf_reg))))
> +               return -EFAULT;
> +
> +       if (buf_reg.flags & ~UBLK_SHMEM_BUF_READ_ONLY)
> +               return -EINVAL;
> +
> +       addr = buf_reg.addr;
> +       size = buf_reg.len;

nit: don't see much value in these additional variables that are just
copies of buf_reg fields

> +       nr_pages = size >> PAGE_SHIFT;
> +
> +       if (!size || !PAGE_ALIGNED(size) || !PAGE_ALIGNED(addr))
> +               return -EINVAL;
> +
> +       disk = ublk_get_disk(ub);
> +       if (!disk)
> +               return -ENODEV;

So buffers can't be registered before the ublk device is started? Is
there a reason why that's not possible? Could we just make the
ublk_quiesce_and_release() and ublk_unquiesce_and_resume() conditional
on disk being non-NULL? I guess we'd have to hold the ublk_device
mutex before calling ublk_get_disk() to prevent it from being assigned
concurrently.

> +
> +       /* Pin pages before quiescing (may sleep) */
> +       ubuf = kzalloc(sizeof(*ubuf), GFP_KERNEL);
> +       if (!ubuf) {
> +               ret = -ENOMEM;
> +               goto put_disk;
> +       }
> +
> +       ubuf->pages = kvmalloc_array(nr_pages, sizeof(*ubuf->pages),
> +                                    GFP_KERNEL);
> +       if (!ubuf->pages) {
> +               ret = -ENOMEM;
> +               goto err_free;
> +       }
> +
> +       gup_flags = FOLL_LONGTERM;
> +       if (!(buf_reg.flags & UBLK_SHMEM_BUF_READ_ONLY))
> +               gup_flags |= FOLL_WRITE;
> +
> +       pinned = pin_user_pages_fast(addr, nr_pages, gup_flags, ubuf->pages);
> +       if (pinned < 0) {
> +               ret = pinned;
> +               goto err_free_pages;
> +       }
> +       if (pinned != nr_pages) {
> +               ret = -EFAULT;
> +               goto err_unpin;
> +       }
> +       ubuf->nr_pages = nr_pages;
> +
> +       /*
> +        * Drain inflight I/O and quiesce the queue so no new requests
> +        * are dispatched while we modify the maple tree. Keep freeze
> +        * and mutex non-nested to avoid lock dependency.
> +        */
> +       ublk_quiesce_and_release(disk);
> +
> +       mutex_lock(&ub->mutex);

Looks like the xarray and maple tree do their own spinlocking; is this needed?

> +
> +       ret = xa_alloc(&ub->bufs_xa, &index, ubuf, xa_limit_16b, GFP_KERNEL);
> +       if (ret)
> +               goto err_unlock;
> +
> +       ret = __ublk_ctrl_reg_buf(ub, ubuf, index, buf_reg.flags);
> +       if (ret) {
> +               xa_erase(&ub->bufs_xa, index);
> +               goto err_unlock;
> +       }
> +
> +       mutex_unlock(&ub->mutex);
> +
> +       ublk_unquiesce_and_resume(disk);
> +       ublk_put_disk(disk);
> +       return index;
> +
> +err_unlock:
> +       mutex_unlock(&ub->mutex);
> +       ublk_unquiesce_and_resume(disk);
> +err_unpin:
> +       unpin_user_pages(ubuf->pages, pinned);
> +err_free_pages:
> +       kvfree(ubuf->pages);
> +err_free:
> +       kfree(ubuf);
> +put_disk:
> +       ublk_put_disk(disk);
> +       return ret;
> +}
> +
> +static void __ublk_ctrl_unreg_buf(struct ublk_device *ub,
> +                                 struct ublk_buf *ubuf)
> +{
> +       ublk_buf_erase_ranges(ub, ubuf, ubuf->nr_pages);
> +       unpin_user_pages(ubuf->pages, ubuf->nr_pages);
> +       kvfree(ubuf->pages);
> +       kfree(ubuf);
> +}
> +
> +static int ublk_ctrl_unreg_buf(struct ublk_device *ub,
> +                              struct ublksrv_ctrl_cmd *header)
> +{
> +       int index = (int)header->data[0];
> +       struct gendisk *disk;
> +       struct ublk_buf *ubuf;
> +
> +       if (!ublk_dev_support_shmem_zc(ub))
> +               return -EOPNOTSUPP;
> +
> +       disk = ublk_get_disk(ub);
> +       if (!disk)
> +               return -ENODEV;
> +
> +       /* Drain inflight I/O before modifying the maple tree */
> +       ublk_quiesce_and_release(disk);
> +
> +       mutex_lock(&ub->mutex);
> +
> +       ubuf = xa_erase(&ub->bufs_xa, index);
> +       if (!ubuf) {
> +               mutex_unlock(&ub->mutex);
> +               ublk_unquiesce_and_resume(disk);
> +               ublk_put_disk(disk);
> +               return -ENOENT;
> +       }
> +
> +       __ublk_ctrl_unreg_buf(ub, ubuf);
> +
> +       mutex_unlock(&ub->mutex);
> +
> +       ublk_unquiesce_and_resume(disk);
> +       ublk_put_disk(disk);
> +       return 0;
> +}
> +
> +static void ublk_buf_cleanup(struct ublk_device *ub)
> +{
> +       struct ublk_buf *ubuf;
> +       unsigned long index;
> +
> +       xa_for_each(&ub->bufs_xa, index, ubuf)
> +               __ublk_ctrl_unreg_buf(ub, ubuf);

This looks quadratic in the number of registered buffers. Can we do a
single pass over the xarray and the maple tree?

> +       xa_destroy(&ub->bufs_xa);
> +       mtree_destroy(&ub->buf_tree);
> +}
> +
> +
> +
>  static int ublk_ctrl_uring_cmd_permission(struct ublk_device *ub,
>                 u32 cmd_op, struct ublksrv_ctrl_cmd *header)
>  {
> @@ -5225,6 +5517,8 @@ static int ublk_ctrl_uring_cmd_permission(struct ublk_device *ub,
>         case UBLK_CMD_UPDATE_SIZE:
>         case UBLK_CMD_QUIESCE_DEV:
>         case UBLK_CMD_TRY_STOP_DEV:
> +       case UBLK_CMD_REG_BUF:
> +       case UBLK_CMD_UNREG_BUF:
>                 mask = MAY_READ | MAY_WRITE;
>                 break;
>         default:
> @@ -5350,6 +5644,12 @@ static int ublk_ctrl_uring_cmd(struct io_uring_cmd *cmd,
>         case UBLK_CMD_TRY_STOP_DEV:
>                 ret = ublk_ctrl_try_stop_dev(ub);
>                 break;
> +       case UBLK_CMD_REG_BUF:
> +               ret = ublk_ctrl_reg_buf(ub, &header);
> +               break;
> +       case UBLK_CMD_UNREG_BUF:
> +               ret = ublk_ctrl_unreg_buf(ub, &header);
> +               break;
>         default:
>                 ret = -EOPNOTSUPP;
>                 break;
> diff --git a/include/uapi/linux/ublk_cmd.h b/include/uapi/linux/ublk_cmd.h
> index a88876756805..52bb9b843d73 100644
> --- a/include/uapi/linux/ublk_cmd.h
> +++ b/include/uapi/linux/ublk_cmd.h
> @@ -57,6 +57,44 @@
>         _IOWR('u', 0x16, struct ublksrv_ctrl_cmd)
>  #define UBLK_U_CMD_TRY_STOP_DEV                \
>         _IOWR('u', 0x17, struct ublksrv_ctrl_cmd)
> +/*
> + * Register a shared memory buffer for zero-copy I/O.
> + * Input:  ctrl_cmd.addr points to struct ublk_shmem_buf_reg (buffer VA + size)
> + *         ctrl_cmd.len  = sizeof(struct ublk_shmem_buf_reg)
> + * Result: >= 0 is the assigned buffer index, < 0 is error
> + *
> + * The kernel pins pages from the calling process's address space
> + * and inserts PFN ranges into a per-device maple tree. When a block
> + * request's pages match registered pages, the driver sets
> + * UBLK_IO_F_SHMEM_ZC and encodes the buffer index + offset in addr,
> + * allowing the server to access the data via its own mapping of the
> + * same shared memory — true zero copy.
> + *
> + * The memory can be backed by memfd, hugetlbfs, or any GUP-compatible
> + * shared mapping. Queue freeze is handled internally.
> + *
> + * The buffer VA and size are passed via a user buffer (not inline in
> + * ctrl_cmd) so that unprivileged devices can prepend the device path
> + * to ctrl_cmd.addr without corrupting the VA.
> + */
> +#define UBLK_U_CMD_REG_BUF             \
> +       _IOWR('u', 0x18, struct ublksrv_ctrl_cmd)
> +/*
> + * Unregister a shared memory buffer.
> + * Input:  ctrl_cmd.data[0] = buffer index
> + */
> +#define UBLK_U_CMD_UNREG_BUF           \
> +       _IOWR('u', 0x19, struct ublksrv_ctrl_cmd)
> +
> +/* Parameter buffer for UBLK_U_CMD_REG_BUF, pointed to by ctrl_cmd.addr */
> +struct ublk_shmem_buf_reg {
> +       __u64   addr;   /* userspace virtual address of shared memory */
> +       __u32   len;    /* buffer size in bytes (page-aligned, max 4GB) */
> +       __u32   flags;
> +};
> +
> +/* Pin pages without FOLL_WRITE; usable with write-sealed memfd */
> +#define UBLK_SHMEM_BUF_READ_ONLY       (1U << 0)
>  /*
>   * 64bits are enough now, and it should be easy to extend in case of
>   * running out of feature flags
> @@ -370,6 +408,7 @@
>  /* Disable automatic partition scanning when device is started */
>  #define UBLK_F_NO_AUTO_PART_SCAN (1ULL << 18)
>
> +
>  /* device state */
>  #define UBLK_S_DEV_DEAD        0
>  #define UBLK_S_DEV_LIVE        1
> @@ -469,6 +508,12 @@ struct ublksrv_ctrl_dev_info {
>  #define                UBLK_IO_F_NEED_REG_BUF          (1U << 17)
>  /* Request has an integrity data buffer */
>  #define                UBLK_IO_F_INTEGRITY             (1UL << 18)
> +/*
> + * I/O buffer is in a registered shared memory buffer. When set, the addr
> + * field in ublksrv_io_desc encodes buffer index and byte offset instead
> + * of a userspace virtual address.
> + */
> +#define                UBLK_IO_F_SHMEM_ZC              (1U << 19)
>
>  /*
>   * io cmd is described by this structure, and stored in share memory, indexed
> @@ -743,4 +788,31 @@ struct ublk_params {
>         struct ublk_param_integrity     integrity;
>  };
>
> +/*
> + * Shared memory zero-copy addr encoding for UBLK_IO_F_SHMEM_ZC.
> + *
> + * When UBLK_IO_F_SHMEM_ZC is set, ublksrv_io_desc.addr is encoded as:
> + *   bits [0:31]  = byte offset within the buffer (up to 4GB)
> + *   bits [32:47] = buffer index (up to 65536)
> + *   bits [48:63] = reserved (must be zero)

I wonder whether the "buffer index" is necessary. Can iod->addr and
UBLK_U_CMD_UNREG_BUF refer to the buffer by the virtual address used
with UBLK_U_CMD_REG_BUF? Then struct ublksrv_io_desc's addr field
would retain its meaning. We would also avoid needing to compare the
range buf_index values in ublk_try_buf_match(). And the xarray
wouldn't be necessary to allocate buffer indices.

> + */
> +#define UBLK_SHMEM_ZC_OFF_MASK         0xffffffffULL
> +#define UBLK_SHMEM_ZC_IDX_OFF          32
> +#define UBLK_SHMEM_ZC_IDX_MASK         0xffffULL
> +
> +static inline __u64 ublk_shmem_zc_addr(__u16 index, __u32 offset)
> +{
> +       return ((__u64)index << UBLK_SHMEM_ZC_IDX_OFF) | offset;
> +}
> +
> +static inline __u16 ublk_shmem_zc_index(__u64 addr)
> +{
> +       return (addr >> UBLK_SHMEM_ZC_IDX_OFF) & UBLK_SHMEM_ZC_IDX_MASK;

nit: the mask looks redundant with the u16 cast

> +}
> +
> +static inline __u32 ublk_shmem_zc_offset(__u64 addr)
> +{
> +       return (__u32)(addr & UBLK_SHMEM_ZC_OFF_MASK);

Same here

Best,
Caleb

> +}
> +
>  #endif
> --
> 2.53.0
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 02/10] ublk: add PFN-based buffer matching in I/O path
  2026-03-31 15:31 ` [PATCH v2 02/10] ublk: add PFN-based buffer matching in I/O path Ming Lei
@ 2026-04-07 19:47   ` Caleb Sander Mateos
  0 siblings, 0 replies; 19+ messages in thread
From: Caleb Sander Mateos @ 2026-04-07 19:47 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, linux-block

On Tue, Mar 31, 2026 at 8:32 AM Ming Lei <ming.lei@redhat.com> wrote:
>
> Add ublk_try_buf_match() which walks a request's bio_vecs, looks up
> each page's PFN in the per-device maple tree, and verifies all pages
> belong to the same registered buffer at contiguous offsets.
>
> Add ublk_iod_is_shmem_zc() inline helper for checking whether a
> request uses the shmem zero-copy path.
>
> Integrate into the I/O path:
> - ublk_setup_iod(): if pages match a registered buffer, set
>   UBLK_IO_F_SHMEM_ZC and encode buffer index + offset in addr
> - ublk_start_io(): skip ublk_map_io() for zero-copy requests
> - __ublk_complete_rq(): skip ublk_unmap_io() for zero-copy requests
>
> The feature remains disabled (ublk_support_shmem_zc() returns false)
> until the UBLK_F_SHMEM_ZC flag is enabled in the next patch.
>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  drivers/block/ublk_drv.c | 77 +++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 76 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> index ac6ccc174d44..d53865437600 100644
> --- a/drivers/block/ublk_drv.c
> +++ b/drivers/block/ublk_drv.c
> @@ -356,6 +356,8 @@ struct ublk_params_header {
>
>  static void ublk_io_release(void *priv);
>  static void ublk_stop_dev_unlocked(struct ublk_device *ub);
> +static bool ublk_try_buf_match(struct ublk_device *ub, struct request *rq,
> +                                 u32 *buf_idx, u32 *buf_off);

buf_idx could be a u16 * for consistency?

>  static void ublk_buf_cleanup(struct ublk_device *ub);
>  static void ublk_abort_queue(struct ublk_device *ub, struct ublk_queue *ubq);
>  static inline struct request *__ublk_check_and_get_req(struct ublk_device *ub,
> @@ -426,6 +428,12 @@ static inline bool ublk_support_shmem_zc(const struct ublk_queue *ubq)
>         return false;
>  }
>
> +static inline bool ublk_iod_is_shmem_zc(const struct ublk_queue *ubq,
> +                                       unsigned int tag)
> +{
> +       return ublk_get_iod(ubq, tag)->op_flags & UBLK_IO_F_SHMEM_ZC;
> +}
> +
>  static inline bool ublk_dev_support_shmem_zc(const struct ublk_device *ub)
>  {
>         return false;
> @@ -1494,6 +1502,18 @@ static blk_status_t ublk_setup_iod(struct ublk_queue *ubq, struct request *req)
>         iod->nr_sectors = blk_rq_sectors(req);
>         iod->start_sector = blk_rq_pos(req);
>
> +       /* Try shmem zero-copy match before setting addr */
> +       if (ublk_support_shmem_zc(ubq) && ublk_rq_has_data(req)) {
> +               u32 buf_idx, buf_off;
> +
> +               if (ublk_try_buf_match(ubq->dev, req,
> +                                         &buf_idx, &buf_off)) {
> +                       iod->op_flags |= UBLK_IO_F_SHMEM_ZC;
> +                       iod->addr = ublk_shmem_zc_addr(buf_idx, buf_off);
> +                       return BLK_STS_OK;
> +               }
> +       }
> +
>         iod->addr = io->buf.addr;
>
>         return BLK_STS_OK;
> @@ -1539,6 +1559,10 @@ static inline void __ublk_complete_rq(struct request *req, struct ublk_io *io,
>             req_op(req) != REQ_OP_DRV_IN)
>                 goto exit;
>
> +       /* shmem zero copy: no data to unmap, pages already shared */
> +       if (ublk_iod_is_shmem_zc(req->mq_hctx->driver_data, req->tag))

This is a lot of pointer chasing. Could we track this with a flag on
struct ublk_io instead?

> +               goto exit;
> +
>         /* for READ request, writing data in iod->addr to rq buffers */
>         unmapped_bytes = ublk_unmap_io(need_map, req, io);
>
> @@ -1697,8 +1721,13 @@ static void ublk_auto_buf_dispatch(const struct ublk_queue *ubq,
>  static bool ublk_start_io(const struct ublk_queue *ubq, struct request *req,
>                           struct ublk_io *io)
>  {
> -       unsigned mapped_bytes = ublk_map_io(ubq, req, io);
> +       unsigned mapped_bytes;
>
> +       /* shmem zero copy: skip data copy, pages already shared */
> +       if (ublk_iod_is_shmem_zc(ubq, req->tag))
> +               return true;
> +
> +       mapped_bytes = ublk_map_io(ubq, req, io);
>
>         /* partially mapped, update io descriptor */
>         if (unlikely(mapped_bytes != blk_rq_bytes(req))) {
> @@ -5458,7 +5487,53 @@ static void ublk_buf_cleanup(struct ublk_device *ub)
>         mtree_destroy(&ub->buf_tree);
>  }
>
> +/* Check if request pages match a registered shared memory buffer */
> +static bool ublk_try_buf_match(struct ublk_device *ub,
> +                                  struct request *rq,
> +                                  u32 *buf_idx, u32 *buf_off)
> +{
> +       struct req_iterator iter;
> +       struct bio_vec bv;
> +       int index = -1;
> +       unsigned long expected_offset = 0;
> +       bool first = true;

Could check index < 0 in place of first?

> +
> +       rq_for_each_bvec(bv, rq, iter) {
> +               unsigned long pfn = page_to_pfn(bv.bv_page);
> +               struct ublk_buf_range *range;
> +               unsigned long off;
> +
> +               range = mtree_load(&ub->buf_tree, pfn);
> +               if (!range)
> +                       return false;
> +
> +               off = range->base_offset +
> +                       (pfn - range->base_pfn) * PAGE_SIZE + bv.bv_offset;

Doesn't this need to check that the end of the bvec is less than the
end of the range? Otherwise, the bvec could extend into physically
contiguous pages that aren't part of the registered range.

Also, the range could precompute base_pfn - base_offset / PAGE_SIZE
instead of base_offset to make this a bit cheaper.

> +
> +               if (first) {
> +                       /* Read-only buffer can't serve READ (kernel writes) */
> +                       if ((range->flags & UBLK_SHMEM_BUF_READ_ONLY) &&
> +                           req_op(rq) != REQ_OP_WRITE)
> +                               return false;
> +                       index = range->buf_index;
> +                       expected_offset = off;
> +                       *buf_off = off;
> +                       first = false;
> +               } else {
> +                       if (range->buf_index != index)
> +                               return false;
> +                       if (off != expected_offset)
> +                               return false;
> +               }
> +               expected_offset += bv.bv_len;
> +       }
> +
> +       if (first)
> +               return false;

How is this case possible? That would mean the request has no bvecs,
but ublk_try_buf_match() is only called for requests with data, right?

Best,
Caleb

> +
> +       *buf_idx = index;
> +       return true;
> +}
>
>  static int ublk_ctrl_uring_cmd_permission(struct ublk_device *ub,
>                 u32 cmd_op, struct ublksrv_ctrl_cmd *header)
> --
> 2.53.0
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 03/10] ublk: enable UBLK_F_SHMEM_ZC feature flag
  2026-03-31 15:31 ` [PATCH v2 03/10] ublk: enable UBLK_F_SHMEM_ZC feature flag Ming Lei
@ 2026-04-07 19:47   ` Caleb Sander Mateos
  0 siblings, 0 replies; 19+ messages in thread
From: Caleb Sander Mateos @ 2026-04-07 19:47 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, linux-block

On Tue, Mar 31, 2026 at 8:32 AM Ming Lei <ming.lei@redhat.com> wrote:
>
> Add UBLK_F_SHMEM_ZC (1ULL << 19) to the UAPI header and UBLK_F_ALL.
> Switch ublk_support_shmem_zc() and ublk_dev_support_shmem_zc() from
> returning false to checking the actual flag, enabling the shared
> memory zero-copy feature for devices that request it.
>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  Documentation/block/ublk.rst  | 117 ++++++++++++++++++++++++++++++++++
>  drivers/block/ublk_drv.c      |   7 +-
>  include/uapi/linux/ublk_cmd.h |   7 ++
>  3 files changed, 128 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/block/ublk.rst b/Documentation/block/ublk.rst
> index 6ad28039663d..a818e09a4b66 100644
> --- a/Documentation/block/ublk.rst
> +++ b/Documentation/block/ublk.rst
> @@ -485,6 +485,123 @@ Limitations
>    in case that too many ublk devices are handled by this single io_ring_ctx
>    and each one has very large queue depth
>
> +Shared Memory Zero Copy (UBLK_F_SHMEM_ZC)
> +------------------------------------------
> +
> +The ``UBLK_F_SHMEM_ZC`` feature provides an alternative zero-copy path
> +that works by sharing physical memory pages between the client application
> +and the ublk server. Unlike the io_uring fixed buffer approach above,
> +shared memory zero copy does not require io_uring buffer registration
> +per I/O — instead, it relies on the kernel matching page frame numbers
> +(PFNs) at I/O time. This allows the ublk server to access the shared

Maybe "physical pages" would be clearer than the kernel-internal
concept of "page frame numbers"?

> +buffer directly, something the io_uring fixed buffer approach does
> +not allow.
> +
> +Motivation
> +~~~~~~~~~~
> +
> +Shared memory zero copy takes a different approach: if the client
> +application and the ublk server both map the same physical memory, there is
> +nothing to copy. The kernel detects the shared pages automatically and
> +tells the server where the data already lives.
> +
> +``UBLK_F_SHMEM_ZC`` can be thought of as a supplement for optimized client
> +applications — when the client is willing to allocate I/O buffers from
> +shared memory, the entire data path becomes zero-copy without any per-I/O
> +overhead.

nit: The shmem buffer lookup still has some overhead. I think just
"becomes zero-copy" would be fine.

> +
> +Use Cases
> +~~~~~~~~~
> +
> +This feature is useful when the client application can be configured to
> +use a specific shared memory region for its I/O buffers:
> +
> +- **Custom storage clients** that allocate I/O buffers from shared memory
> +  (memfd, hugetlbfs) and issue direct I/O to the ublk device
> +- **Database engines** that use pre-allocated buffer pools with O_DIRECT
> +
> +How It Works
> +~~~~~~~~~~~~
> +
> +1. The ublk server and client both ``mmap()`` the same file (memfd or
> +   hugetlbfs) with ``MAP_SHARED``. This gives both processes access to the
> +   same physical pages.
> +
> +2. The ublk server registers its mapping with the kernel::
> +
> +     struct ublk_shmem_buf_reg buf = { .addr = mmap_va, .len = size };
> +     ublk_ctrl_cmd(UBLK_U_CMD_REG_BUF, .addr = &buf);

This doesn't look like valid C syntax. Maybe it could say something like:
struct ublksrv_ctrl_cmd cmd = {.dev_id = dev_id, .addr = &buf, .len =
sizeof(buf)};
io_uring_prep_uring_cmd(sqe, UBLK_U_CMD_REG_BUF, ublk_control_fd);
memcpy(sqe->cmd, &cmd, sizeof(cmd));

> +
> +   The kernel pins the pages and builds a PFN lookup tree.
> +
> +3. When the client issues direct I/O (``O_DIRECT``) to ``/dev/ublkb*``,
> +   the kernel checks whether the I/O buffer pages match any registered
> +   pages by comparing PFNs.
> +
> +4. On a match, the kernel sets ``UBLK_IO_F_SHMEM_ZC`` in the I/O
> +   descriptor and encodes the buffer index and offset in ``addr``::
> +
> +     if (iod->op_flags & UBLK_IO_F_SHMEM_ZC) {
> +         /* Data is already in our shared mapping — zero copy */
> +         index  = ublk_shmem_zc_index(iod->addr);
> +         offset = ublk_shmem_zc_offset(iod->addr);
> +         buf = shmem_table[index].mmap_base + offset;
> +     }
> +
> +5. If pages do not match (e.g., the client used a non-shared buffer),
> +   the I/O falls back to the normal copy path silently.
> +
> +The shared memory can be set up via two methods:
> +
> +- **Socket-based**: the client sends a memfd to the ublk server via
> +  ``SCM_RIGHTS`` on a unix socket. The server mmaps and registers it.
> +- **Hugetlbfs-based**: both processes ``mmap(MAP_SHARED)`` the same
> +  hugetlbfs file. No IPC needed — same file gives same physical pages.
> +
> +Advantages
> +~~~~~~~~~~
> +
> +- **Simple**: no per-I/O buffer registration or unregistration commands.
> +  Once the shared buffer is registered, all matching I/O is zero-copy
> +  automatically.
> +- **Direct buffer access**: the ublk server can read and write the shared
> +  buffer directly via its own mmap, without going through io_uring fixed
> +  buffer operations. This is more friendly for server implementations.
> +- **Fast**: PFN matching is a single maple tree lookup per bvec. No
> +  io_uring command round-trips for buffer management.
> +- **Compatible**: non-matching I/O silently falls back to the copy path.
> +  The device works normally for any client, with zero-copy as an
> +  optimization when shared memory is available.
> +
> +Limitations
> +~~~~~~~~~~~
> +
> +- **Requires client cooperation**: the client must allocate its I/O
> +  buffers from the shared memory region. This requires a custom or
> +  configured client — standard applications using their own buffers
> +  will not benefit.
> +- **Direct I/O only**: buffered I/O (without ``O_DIRECT``) goes through
> +  the page cache, which allocates its own pages. These kernel-allocated
> +  pages will never match the registered shared buffer. Only ``O_DIRECT``
> +  puts the client's buffer pages directly into the block I/O.

One other limitation that might be worth mentioning is that
scatter/gather I/O can't use the SHMEM_ZC optimization, as the
request's data must be contiguous in the registered virtual address
range.

Best,
Caleb

> +
> +Control Commands
> +~~~~~~~~~~~~~~~~
> +
> +- ``UBLK_U_CMD_REG_BUF``
> +
> +  Register a shared memory buffer. ``ctrl_cmd.addr`` points to a
> +  ``struct ublk_shmem_buf_reg`` containing the buffer virtual address and size.
> +  Returns the assigned buffer index (>= 0) on success. The kernel pins
> +  pages and builds the PFN lookup tree. Queue freeze is handled
> +  internally.
> +
> +- ``UBLK_U_CMD_UNREG_BUF``
> +
> +  Unregister a previously registered buffer. ``ctrl_cmd.data[0]`` is the
> +  buffer index. Unpins pages and removes PFN entries from the lookup
> +  tree.
> +
>  References
>  ==========
>
> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> index d53865437600..c2b9992503a4 100644
> --- a/drivers/block/ublk_drv.c
> +++ b/drivers/block/ublk_drv.c
> @@ -85,7 +85,8 @@
>                 | (IS_ENABLED(CONFIG_BLK_DEV_INTEGRITY) ? UBLK_F_INTEGRITY : 0) \
>                 | UBLK_F_SAFE_STOP_DEV \
>                 | UBLK_F_BATCH_IO \
> -               | UBLK_F_NO_AUTO_PART_SCAN)
> +               | UBLK_F_NO_AUTO_PART_SCAN \
> +               | UBLK_F_SHMEM_ZC)
>
>  #define UBLK_F_ALL_RECOVERY_FLAGS (UBLK_F_USER_RECOVERY \
>                 | UBLK_F_USER_RECOVERY_REISSUE \
> @@ -425,7 +426,7 @@ static inline bool ublk_dev_support_zero_copy(const struct ublk_device *ub)
>
>  static inline bool ublk_support_shmem_zc(const struct ublk_queue *ubq)
>  {
> -       return false;
> +       return ubq->flags & UBLK_F_SHMEM_ZC;
>  }
>
>  static inline bool ublk_iod_is_shmem_zc(const struct ublk_queue *ubq,
> @@ -436,7 +437,7 @@ static inline bool ublk_iod_is_shmem_zc(const struct ublk_queue *ubq,
>
>  static inline bool ublk_dev_support_shmem_zc(const struct ublk_device *ub)
>  {
> -       return false;
> +       return ub->dev_info.flags & UBLK_F_SHMEM_ZC;
>  }
>
>  static inline bool ublk_support_auto_buf_reg(const struct ublk_queue *ubq)
> diff --git a/include/uapi/linux/ublk_cmd.h b/include/uapi/linux/ublk_cmd.h
> index 52bb9b843d73..ecd258847d3d 100644
> --- a/include/uapi/linux/ublk_cmd.h
> +++ b/include/uapi/linux/ublk_cmd.h
> @@ -408,6 +408,13 @@ struct ublk_shmem_buf_reg {
>  /* Disable automatic partition scanning when device is started */
>  #define UBLK_F_NO_AUTO_PART_SCAN (1ULL << 18)
>
> +/*
> + * Enable shared memory zero copy. When enabled, the server can register
> + * shared memory buffers via UBLK_U_CMD_REG_BUF. If a block request's
> + * pages match a registered buffer, UBLK_IO_F_SHMEM_ZC is set and addr
> + * encodes the buffer index + offset instead of a userspace buffer address.
> + */
> +#define UBLK_F_SHMEM_ZC        (1ULL << 19)
>
>  /* device state */
>  #define UBLK_S_DEV_DEAD        0
> --
> 2.53.0
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 04/10] ublk: eliminate permanent pages[] array from struct ublk_buf
  2026-03-31 15:31 ` [PATCH v2 04/10] ublk: eliminate permanent pages[] array from struct ublk_buf Ming Lei
@ 2026-04-07 19:50   ` Caleb Sander Mateos
  0 siblings, 0 replies; 19+ messages in thread
From: Caleb Sander Mateos @ 2026-04-07 19:50 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, linux-block

On Tue, Mar 31, 2026 at 8:32 AM Ming Lei <ming.lei@redhat.com> wrote:
>
> The pages[] array (kvmalloc'd, 8 bytes per page = 2MB for a 1GB buffer)
> was stored permanently in struct ublk_buf but only needed during
> pin_user_pages_fast() and maple tree construction. Since the maple tree
> already stores PFN ranges via ublk_buf_range, struct page pointers can
> be recovered via pfn_to_page() during unregistration.
>
> Make pages[] a temporary allocation in ublk_ctrl_reg_buf(), freed
> immediately after the maple tree is built. Rewrite __ublk_ctrl_unreg_buf()
> to iterate the maple tree for matching buf_index entries, recovering
> struct page pointers via pfn_to_page() and unpinning in batches of 32.
> Simplify ublk_buf_erase_ranges() to iterate the maple tree by buf_index
> instead of walking the now-removed pages[] array.
>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  drivers/block/ublk_drv.c | 87 +++++++++++++++++++++++++---------------
>  1 file changed, 55 insertions(+), 32 deletions(-)
>
> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> index c2b9992503a4..2e475bdc54dd 100644
> --- a/drivers/block/ublk_drv.c
> +++ b/drivers/block/ublk_drv.c
> @@ -296,7 +296,6 @@ struct ublk_queue {
>
>  /* Per-registered shared memory buffer */
>  struct ublk_buf {
> -       struct page **pages;
>         unsigned int nr_pages;
>  };

It looks like nr_pages doesn't need to be stored either, it could just
be passed to __ublk_ctrl_reg_buf(). Then I think we could get rid of
struct ublk_buf and the xarray entirely. We really just need a bitmap
for allocating buffer indices.

>
> @@ -5261,27 +5260,25 @@ static void ublk_unquiesce_and_resume(struct gendisk *disk)
>   * coalescing consecutive PFNs into single range entries.
>   * Returns 0 on success, negative error with partial insertions unwound.
>   */
> -/* Erase coalesced PFN ranges from the maple tree for pages [0, nr_pages) */
> -static void ublk_buf_erase_ranges(struct ublk_device *ub,
> -                                 struct ublk_buf *ubuf,
> -                                 unsigned long nr_pages)
> +/* Erase coalesced PFN ranges from the maple tree matching buf_index */
> +static void ublk_buf_erase_ranges(struct ublk_device *ub, int buf_index)
>  {
> -       unsigned long i;
> -
> -       for (i = 0; i < nr_pages; ) {
> -               unsigned long pfn = page_to_pfn(ubuf->pages[i]);
> -               unsigned long start = i;
> +       MA_STATE(mas, &ub->buf_tree, 0, ULONG_MAX);
> +       struct ublk_buf_range *range;
>
> -               while (i + 1 < nr_pages &&
> -                      page_to_pfn(ubuf->pages[i + 1]) == pfn + (i - start) + 1)
> -                       i++;
> -               i++;
> -               kfree(mtree_erase(&ub->buf_tree, pfn));
> +       mas_lock(&mas);
> +       mas_for_each(&mas, range, ULONG_MAX) {
> +               if (range->buf_index == buf_index) {
> +                       mas_erase(&mas);
> +                       kfree(range);
> +               }
>         }
> +       mas_unlock(&mas);
>  }
>
>  static int __ublk_ctrl_reg_buf(struct ublk_device *ub,
> -                              struct ublk_buf *ubuf, int index,
> +                              struct ublk_buf *ubuf,
> +                              struct page **pages, int index,
>                                unsigned short flags)
>  {
>         unsigned long nr_pages = ubuf->nr_pages;
> @@ -5289,13 +5286,13 @@ static int __ublk_ctrl_reg_buf(struct ublk_device *ub,
>         int ret;
>
>         for (i = 0; i < nr_pages; ) {
> -               unsigned long pfn = page_to_pfn(ubuf->pages[i]);
> +               unsigned long pfn = page_to_pfn(pages[i]);
>                 unsigned long start = i;
>                 struct ublk_buf_range *range;
>
>                 /* Find run of consecutive PFNs */
>                 while (i + 1 < nr_pages &&
> -                      page_to_pfn(ubuf->pages[i + 1]) == pfn + (i - start) + 1)
> +                      page_to_pfn(pages[i + 1]) == pfn + (i - start) + 1)
>                         i++;
>                 i++;    /* past the last page in this run */
>
> @@ -5320,7 +5317,7 @@ static int __ublk_ctrl_reg_buf(struct ublk_device *ub,
>         return 0;
>
>  unwind:
> -       ublk_buf_erase_ranges(ub, ubuf, i);
> +       ublk_buf_erase_ranges(ub, index);
>         return ret;
>  }
>
> @@ -5335,6 +5332,7 @@ static int ublk_ctrl_reg_buf(struct ublk_device *ub,
>         void __user *argp = (void __user *)(unsigned long)header->addr;
>         struct ublk_shmem_buf_reg buf_reg;
>         unsigned long addr, size, nr_pages;
> +       struct page **pages = NULL;
>         unsigned int gup_flags;
>         struct gendisk *disk;
>         struct ublk_buf *ubuf;
> @@ -5371,9 +5369,8 @@ static int ublk_ctrl_reg_buf(struct ublk_device *ub,
>                 goto put_disk;
>         }
>
> -       ubuf->pages = kvmalloc_array(nr_pages, sizeof(*ubuf->pages),
> -                                    GFP_KERNEL);
> -       if (!ubuf->pages) {
> +       pages = kvmalloc_array(nr_pages, sizeof(*pages), GFP_KERNEL);
> +       if (!pages) {
>                 ret = -ENOMEM;
>                 goto err_free;
>         }
> @@ -5382,7 +5379,7 @@ static int ublk_ctrl_reg_buf(struct ublk_device *ub,
>         if (!(buf_reg.flags & UBLK_SHMEM_BUF_READ_ONLY))
>                 gup_flags |= FOLL_WRITE;
>
> -       pinned = pin_user_pages_fast(addr, nr_pages, gup_flags, ubuf->pages);
> +       pinned = pin_user_pages_fast(addr, nr_pages, gup_flags, pages);
>         if (pinned < 0) {
>                 ret = pinned;
>                 goto err_free_pages;
> @@ -5406,7 +5403,7 @@ static int ublk_ctrl_reg_buf(struct ublk_device *ub,
>         if (ret)
>                 goto err_unlock;
>
> -       ret = __ublk_ctrl_reg_buf(ub, ubuf, index, buf_reg.flags);
> +       ret = __ublk_ctrl_reg_buf(ub, ubuf, pages, index, buf_reg.flags);
>         if (ret) {
>                 xa_erase(&ub->bufs_xa, index);
>                 goto err_unlock;
> @@ -5414,6 +5411,7 @@ static int ublk_ctrl_reg_buf(struct ublk_device *ub,
>
>         mutex_unlock(&ub->mutex);
>
> +       kvfree(pages);
>         ublk_unquiesce_and_resume(disk);
>         ublk_put_disk(disk);
>         return index;
> @@ -5422,9 +5420,9 @@ static int ublk_ctrl_reg_buf(struct ublk_device *ub,
>         mutex_unlock(&ub->mutex);
>         ublk_unquiesce_and_resume(disk);
>  err_unpin:
> -       unpin_user_pages(ubuf->pages, pinned);
> +       unpin_user_pages(pages, pinned);
>  err_free_pages:
> -       kvfree(ubuf->pages);
> +       kvfree(pages);
>  err_free:
>         kfree(ubuf);
>  put_disk:
> @@ -5433,11 +5431,36 @@ static int ublk_ctrl_reg_buf(struct ublk_device *ub,
>  }
>
>  static void __ublk_ctrl_unreg_buf(struct ublk_device *ub,
> -                                 struct ublk_buf *ubuf)
> +                                 struct ublk_buf *ubuf, int buf_index)

ubuf is only passed to kfree() now, maybe it would make sense to move
that to the caller so the argument can be dropped?

Best,
Caleb

>  {
> -       ublk_buf_erase_ranges(ub, ubuf, ubuf->nr_pages);
> -       unpin_user_pages(ubuf->pages, ubuf->nr_pages);
> -       kvfree(ubuf->pages);
> +       MA_STATE(mas, &ub->buf_tree, 0, ULONG_MAX);
> +       struct ublk_buf_range *range;
> +       struct page *pages[32];
> +
> +       mas_lock(&mas);
> +       mas_for_each(&mas, range, ULONG_MAX) {
> +               unsigned long base, nr, off;
> +
> +               if (range->buf_index != buf_index)
> +                       continue;
> +
> +               base = range->base_pfn;
> +               nr = mas.last - mas.index + 1;
> +               mas_erase(&mas);
> +
> +               for (off = 0; off < nr; ) {
> +                       unsigned int batch = min_t(unsigned long,
> +                                                  nr - off, 32);
> +                       unsigned int j;
> +
> +                       for (j = 0; j < batch; j++)
> +                               pages[j] = pfn_to_page(base + off + j);
> +                       unpin_user_pages(pages, batch);
> +                       off += batch;
> +               }
> +               kfree(range);
> +       }
> +       mas_unlock(&mas);
>         kfree(ubuf);
>  }
>
> @@ -5468,7 +5491,7 @@ static int ublk_ctrl_unreg_buf(struct ublk_device *ub,
>                 return -ENOENT;
>         }
>
> -       __ublk_ctrl_unreg_buf(ub, ubuf);
> +       __ublk_ctrl_unreg_buf(ub, ubuf, index);
>
>         mutex_unlock(&ub->mutex);
>
> @@ -5483,7 +5506,7 @@ static void ublk_buf_cleanup(struct ublk_device *ub)
>         unsigned long index;
>
>         xa_for_each(&ub->bufs_xa, index, ubuf)
> -               __ublk_ctrl_unreg_buf(ub, ubuf);
> +               __ublk_ctrl_unreg_buf(ub, ubuf, index);
>         xa_destroy(&ub->bufs_xa);
>         mtree_destroy(&ub->buf_tree);
>  }
> --
> 2.53.0
>


end of thread, other threads:[~2026-04-07 19:50 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-03-31 15:31 [PATCH v2 00/10] ublk: add shared memory zero-copy support Ming Lei
2026-03-31 15:31 ` [PATCH v2 01/10] ublk: add UBLK_U_CMD_REG_BUF/UNREG_BUF control commands Ming Lei
2026-04-07 19:35   ` Caleb Sander Mateos
2026-03-31 15:31 ` [PATCH v2 02/10] ublk: add PFN-based buffer matching in I/O path Ming Lei
2026-04-07 19:47   ` Caleb Sander Mateos
2026-03-31 15:31 ` [PATCH v2 03/10] ublk: enable UBLK_F_SHMEM_ZC feature flag Ming Lei
2026-04-07 19:47   ` Caleb Sander Mateos
2026-03-31 15:31 ` [PATCH v2 04/10] ublk: eliminate permanent pages[] array from struct ublk_buf Ming Lei
2026-04-07 19:50   ` Caleb Sander Mateos
2026-03-31 15:31 ` [PATCH v2 05/10] selftests/ublk: add shared memory zero-copy support in kublk Ming Lei
2026-03-31 15:31 ` [PATCH v2 06/10] selftests/ublk: add UBLK_F_SHMEM_ZC support for loop target Ming Lei
2026-03-31 15:31 ` [PATCH v2 07/10] selftests/ublk: add shared memory zero-copy test Ming Lei
2026-03-31 15:31 ` [PATCH v2 08/10] selftests/ublk: add hugetlbfs shmem_zc test for loop target Ming Lei
2026-03-31 15:32 ` [PATCH v2 09/10] selftests/ublk: add filesystem fio verify test for shmem_zc Ming Lei
2026-03-31 15:32 ` [PATCH v2 10/10] selftests/ublk: add read-only buffer registration test Ming Lei
2026-04-07  2:38 ` [PATCH v2 00/10] ublk: add shared memory zero-copy support Ming Lei
2026-04-07 13:34   ` Jens Axboe
2026-04-07 19:29   ` Caleb Sander Mateos
2026-04-07 13:44 ` Jens Axboe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox