From: Ming Lei <ming.lei@redhat.com>
To: Jens Axboe <axboe@kernel.dk>, linux-block@vger.kernel.org
Cc: Caleb Sander Mateos <csander@purestorage.com>,
Ming Lei <ming.lei@redhat.com>
Subject: [PATCH v2 03/10] ublk: enable UBLK_F_SHMEM_ZC feature flag
Date: Tue, 31 Mar 2026 23:31:54 +0800
Message-ID: <20260331153207.3635125-4-ming.lei@redhat.com>
In-Reply-To: <20260331153207.3635125-1-ming.lei@redhat.com>
Add UBLK_F_SHMEM_ZC (1ULL << 19) to the UAPI header and UBLK_F_ALL.
Switch ublk_support_shmem_zc() and ublk_dev_support_shmem_zc() from
returning false to checking the actual flag, enabling the shared
memory zero-copy feature for devices that request it.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
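The helper change in ublk_drv.c returns the result of a 64-bit mask test
from a function declared bool. A minimal userspace sketch of that pattern
follows, assuming only the flag values shown in the UAPI hunk of this
patch. struct demo_queue and the demo_* functions are hypothetical
stand-ins, not driver code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag values copied from the UAPI hunk in this patch. */
#define UBLK_F_NO_AUTO_PART_SCAN	(1ULL << 18)
#define UBLK_F_SHMEM_ZC			(1ULL << 19)

/* Hypothetical stand-in for the flags field of struct ublk_queue. */
struct demo_queue {
	uint64_t flags;
};

/*
 * Same shape as ublk_support_shmem_zc() after this patch. Converting the
 * masked 64-bit value to bool yields true for any nonzero result, so the
 * bit position (19) does not need to be shifted down first.
 */
static inline bool demo_support_shmem_zc(const struct demo_queue *q)
{
	return q->flags & UBLK_F_SHMEM_ZC;
}

/* Small sanity check: a queue with only the new flag set reports support. */
static inline void demo_selfcheck(void)
{
	struct demo_queue q = { .flags = UBLK_F_SHMEM_ZC };

	assert(demo_support_shmem_zc(&q));
}
```

A queue whose flags contain only UBLK_F_NO_AUTO_PART_SCAN would report
false here, which matches the old behaviour for devices that do not
request the feature.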
Documentation/block/ublk.rst | 117 ++++++++++++++++++++++++++++++++++
drivers/block/ublk_drv.c | 7 +-
include/uapi/linux/ublk_cmd.h | 7 ++
3 files changed, 128 insertions(+), 3 deletions(-)
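The documentation added by this patch describes PFN matching as a lookup
from a page frame number to a registered buffer. As a rough illustration
only, the sketch below swaps the kernel's maple tree for a sorted array
with binary search. struct demo_pfn_range, its fields and
demo_pfn_lookup() are hypothetical names chosen for this example; the
real data layout lives in the driver and may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical record: PFNs [start, start + nr) belong to buffer
 * buf_index, beginning at page offset buf_off inside that buffer.
 */
struct demo_pfn_range {
	uint64_t start;
	uint64_t nr;
	uint64_t buf_off;
	int buf_index;
};

/*
 * Binary search over ranges that are sorted by start and do not overlap.
 * On a hit, stores the page offset within the buffer and returns the
 * buffer index. Returns -1 when the PFN is not registered, which is the
 * case where the I/O would fall back to the copy path.
 */
static inline int demo_pfn_lookup(const struct demo_pfn_range *tbl, size_t n,
				  uint64_t pfn, uint64_t *page_off)
{
	size_t lo = 0, hi = n;

	assert(tbl != NULL || n == 0);

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (pfn < tbl[mid].start) {
			hi = mid;
		} else if (pfn - tbl[mid].start >= tbl[mid].nr) {
			lo = mid + 1;
		} else {
			*page_off = tbl[mid].buf_off + (pfn - tbl[mid].start);
			return tbl[mid].buf_index;
		}
	}
	return -1;
}
```

A sorted array keeps the example short. A tree structure such as the one
the documentation mentions is the better fit when buffers are registered
and unregistered at runtime.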
diff --git a/Documentation/block/ublk.rst b/Documentation/block/ublk.rst
index 6ad28039663d..a818e09a4b66 100644
--- a/Documentation/block/ublk.rst
+++ b/Documentation/block/ublk.rst
@@ -485,6 +485,123 @@ Limitations
in case that too many ublk devices are handled by this single io_ring_ctx
and each one has very large queue depth
+Shared Memory Zero Copy (UBLK_F_SHMEM_ZC)
+------------------------------------------
+
+The ``UBLK_F_SHMEM_ZC`` feature provides an alternative zero-copy path
+that works by sharing physical memory pages between the client application
+and the ublk server. Unlike the io_uring fixed buffer approach above,
+shared memory zero copy does not require io_uring buffer registration
+per I/O — instead, it relies on the kernel matching page frame numbers
+(PFNs) at I/O time. This allows the ublk server to access the shared
+buffer directly, which is not possible with the io_uring fixed buffer
+approach.
+
+Motivation
+~~~~~~~~~~
+
+Shared memory zero copy takes a different approach: if the client
+application and the ublk server both map the same physical memory, there is
+nothing to copy. The kernel detects the shared pages automatically and
+tells the server where the data already lives.
+
+``UBLK_F_SHMEM_ZC`` can be thought of as a supplement for optimized client
+applications — when the client is willing to allocate I/O buffers from
+shared memory, the entire data path becomes zero-copy without any per-I/O
+overhead.
+
+Use Cases
+~~~~~~~~~
+
+This feature is useful when the client application can be configured to
+use a specific shared memory region for its I/O buffers:
+
+- **Custom storage clients** that allocate I/O buffers from shared memory
+ (memfd, hugetlbfs) and issue direct I/O to the ublk device
+- **Database engines** that use pre-allocated buffer pools with O_DIRECT
+
+How It Works
+~~~~~~~~~~~~
+
+1. The ublk server and client both ``mmap()`` the same file (memfd or
+ hugetlbfs) with ``MAP_SHARED``. This gives both processes access to the
+ same physical pages.
+
+2. The ublk server registers its mapping with the kernel::
+
+ struct ublk_shmem_buf_reg buf = { .addr = mmap_va, .len = size };
+ ublk_ctrl_cmd(UBLK_U_CMD_REG_BUF, .addr = &buf);
+
+ The kernel pins the pages and builds a PFN lookup tree.
+
+3. When the client issues direct I/O (``O_DIRECT``) to ``/dev/ublkb*``,
+ the kernel checks whether the I/O buffer pages match any registered
+ pages by comparing PFNs.
+
+4. On a match, the kernel sets ``UBLK_IO_F_SHMEM_ZC`` in the I/O
+ descriptor and encodes the buffer index and offset in ``addr``::
+
+ if (iod->op_flags & UBLK_IO_F_SHMEM_ZC) {
+ /* Data is already in our shared mapping — zero copy */
+ index = ublk_shmem_zc_index(iod->addr);
+ offset = ublk_shmem_zc_offset(iod->addr);
+ buf = shmem_table[index].mmap_base + offset;
+ }
+
+5. If pages do not match (e.g., the client used a non-shared buffer),
+ the I/O falls back to the normal copy path silently.
+
+The shared memory can be set up via two methods:
+
+- **Socket-based**: the client sends a memfd to the ublk server via
+ ``SCM_RIGHTS`` on a unix socket. The server mmaps and registers it.
+- **Hugetlbfs-based**: both processes ``mmap(MAP_SHARED)`` the same
+ hugetlbfs file. No IPC needed — same file gives same physical pages.
+
+Advantages
+~~~~~~~~~~
+
+- **Simple**: no per-I/O buffer registration or unregistration commands.
+ Once the shared buffer is registered, all matching I/O is zero-copy
+ automatically.
+- **Direct buffer access**: the ublk server can read and write the shared
+ buffer directly via its own mmap, without going through io_uring fixed
+ buffer operations. This simplifies server implementations.
+- **Fast**: PFN matching is a single maple tree lookup per bvec. No
+ io_uring command round-trips for buffer management.
+- **Compatible**: non-matching I/O silently falls back to the copy path.
+ The device works normally for any client, with zero-copy as an
+ optimization when shared memory is available.
+
+Limitations
+~~~~~~~~~~~
+
+- **Requires client cooperation**: the client must allocate its I/O
+ buffers from the shared memory region. This requires a custom or
+ configured client — standard applications using their own buffers
+ will not benefit.
+- **Direct I/O only**: buffered I/O (without ``O_DIRECT``) goes through
+ the page cache, which allocates its own pages. These kernel-allocated
+ pages will never match the registered shared buffer. Only ``O_DIRECT``
+ puts the client's buffer pages directly into the block I/O.
+
+Control Commands
+~~~~~~~~~~~~~~~~
+
+- ``UBLK_U_CMD_REG_BUF``
+
+ Register a shared memory buffer. ``ctrl_cmd.addr`` points to a
+ ``struct ublk_shmem_buf_reg`` containing the buffer virtual address and size.
+ Returns the assigned buffer index (>= 0) on success. The kernel pins
+ pages and builds the PFN lookup tree. Queue freeze is handled
+ internally.
+
+- ``UBLK_U_CMD_UNREG_BUF``
+
+ Unregister a previously registered buffer. ``ctrl_cmd.data[0]`` is the
+ buffer index. Unpins pages and removes PFN entries from the lookup
+ tree.
+
References
==========
diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index d53865437600..c2b9992503a4 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -85,7 +85,8 @@
| (IS_ENABLED(CONFIG_BLK_DEV_INTEGRITY) ? UBLK_F_INTEGRITY : 0) \
| UBLK_F_SAFE_STOP_DEV \
| UBLK_F_BATCH_IO \
- | UBLK_F_NO_AUTO_PART_SCAN)
+ | UBLK_F_NO_AUTO_PART_SCAN \
+ | UBLK_F_SHMEM_ZC)
#define UBLK_F_ALL_RECOVERY_FLAGS (UBLK_F_USER_RECOVERY \
| UBLK_F_USER_RECOVERY_REISSUE \
@@ -425,7 +426,7 @@ static inline bool ublk_dev_support_zero_copy(const struct ublk_device *ub)
static inline bool ublk_support_shmem_zc(const struct ublk_queue *ubq)
{
- return false;
+ return ubq->flags & UBLK_F_SHMEM_ZC;
}
static inline bool ublk_iod_is_shmem_zc(const struct ublk_queue *ubq,
@@ -436,7 +437,7 @@ static inline bool ublk_iod_is_shmem_zc(const struct ublk_queue *ubq,
static inline bool ublk_dev_support_shmem_zc(const struct ublk_device *ub)
{
- return false;
+ return ub->dev_info.flags & UBLK_F_SHMEM_ZC;
}
static inline bool ublk_support_auto_buf_reg(const struct ublk_queue *ubq)
diff --git a/include/uapi/linux/ublk_cmd.h b/include/uapi/linux/ublk_cmd.h
index 52bb9b843d73..ecd258847d3d 100644
--- a/include/uapi/linux/ublk_cmd.h
+++ b/include/uapi/linux/ublk_cmd.h
@@ -408,6 +408,13 @@ struct ublk_shmem_buf_reg {
/* Disable automatic partition scanning when device is started */
#define UBLK_F_NO_AUTO_PART_SCAN (1ULL << 18)
+/*
+ * Enable shared memory zero copy. When enabled, the server can register
+ * shared memory buffers via UBLK_U_CMD_REG_BUF. If a block request's
+ * pages match a registered buffer, UBLK_IO_F_SHMEM_ZC is set and addr
+ * encodes the buffer index + offset instead of a userspace buffer address.
+ */
+#define UBLK_F_SHMEM_ZC (1ULL << 19)
/* device state */
#define UBLK_S_DEV_DEAD 0
--
2.53.0
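Step 1 of "How It Works" in the documentation above depends on two
MAP_SHARED mappings of one file being backed by the same physical pages.
The sketch below shows that property from userspace, assuming Linux with
memfd_create() available. It uses two mappings inside a single process as
a stand-in for the client and the ublk server; no ublk device or control
command is involved, and demo_shared_mapping_roundtrip() is a
hypothetical name.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for memfd_create() */
#endif
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Map one memfd twice with MAP_SHARED, write through the "client"
 * mapping and read back through the "server" mapping.
 * Returns 0 if the server view sees the client's bytes, -1 otherwise.
 */
static inline int demo_shared_mapping_roundtrip(size_t len)
{
	static const char msg[] = "written by client";
	char *client = MAP_FAILED, *server = MAP_FAILED;
	int ret = -1;
	int fd;

	assert(len >= sizeof(msg));

	fd = memfd_create("demo-shmem-zc", MFD_CLOEXEC);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, (off_t)len) < 0)
		goto out;

	client = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	server = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (client == MAP_FAILED || server == MAP_FAILED)
		goto out;

	/* Different virtual addresses, same backing pages. */
	memcpy(client, msg, sizeof(msg));
	if (client != server && memcmp(server, msg, sizeof(msg)) == 0)
		ret = 0;
out:
	if (client != MAP_FAILED)
		munmap(client, len);
	if (server != MAP_FAILED)
		munmap(server, len);
	close(fd);
	return ret;
}
```

In the socket-based setup described in the documentation, the second
mapping would be created by the server after receiving the memfd over a
unix socket, rather than in the same process as shown here.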
Thread overview: 32+ messages
2026-03-31 15:31 [PATCH v2 00/10] ublk: add shared memory zero-copy support Ming Lei
2026-03-31 15:31 ` [PATCH v2 01/10] ublk: add UBLK_U_CMD_REG_BUF/UNREG_BUF control commands Ming Lei
2026-04-07 19:35 ` Caleb Sander Mateos
2026-04-08 2:23 ` Ming Lei
2026-04-08 15:20 ` Caleb Sander Mateos
2026-04-09 12:18 ` Ming Lei
2026-04-09 21:22 ` Caleb Sander Mateos
2026-04-10 2:28 ` Ming Lei
2026-03-31 15:31 ` [PATCH v2 02/10] ublk: add PFN-based buffer matching in I/O path Ming Lei
2026-04-07 19:47 ` Caleb Sander Mateos
2026-04-08 2:36 ` Ming Lei
2026-04-08 15:28 ` Caleb Sander Mateos
2026-03-31 15:31 ` Ming Lei [this message]
2026-04-07 19:47 ` [PATCH v2 03/10] ublk: enable UBLK_F_SHMEM_ZC feature flag Caleb Sander Mateos
2026-04-08 2:50 ` Ming Lei
2026-03-31 15:31 ` [PATCH v2 04/10] ublk: eliminate permanent pages[] array from struct ublk_buf Ming Lei
2026-04-07 19:50 ` Caleb Sander Mateos
2026-04-08 2:58 ` Ming Lei
2026-03-31 15:31 ` [PATCH v2 05/10] selftests/ublk: add shared memory zero-copy support in kublk Ming Lei
2026-03-31 15:31 ` [PATCH v2 06/10] selftests/ublk: add UBLK_F_SHMEM_ZC support for loop target Ming Lei
2026-03-31 15:31 ` [PATCH v2 07/10] selftests/ublk: add shared memory zero-copy test Ming Lei
2026-03-31 15:31 ` [PATCH v2 08/10] selftests/ublk: add hugetlbfs shmem_zc test for loop target Ming Lei
2026-03-31 15:32 ` [PATCH v2 09/10] selftests/ublk: add filesystem fio verify test for shmem_zc Ming Lei
2026-03-31 15:32 ` [PATCH v2 10/10] selftests/ublk: add read-only buffer registration test Ming Lei
2026-04-07 2:38 ` [PATCH v2 00/10] ublk: add shared memory zero-copy support Ming Lei
2026-04-07 13:34 ` Jens Axboe
2026-04-07 19:29 ` Caleb Sander Mateos
2026-04-08 3:03 ` Ming Lei
2026-04-08 12:52 ` Jens Axboe
2026-04-07 13:44 ` Jens Axboe
2026-04-14 18:56 ` Keith Busch
2026-04-15 8:38 ` Ming Lei