From: Ming Lei <ming.lei@redhat.com>
To: Jens Axboe <axboe@kernel.dk>, linux-block@vger.kernel.org
Cc: Caleb Sander Mateos <csander@purestorage.com>,
Ming Lei <ming.lei@redhat.com>
Subject: [PATCH v2 03/10] ublk: enable UBLK_F_SHMEM_ZC feature flag
Date: Tue, 31 Mar 2026 23:31:54 +0800
Message-ID: <20260331153207.3635125-4-ming.lei@redhat.com>
In-Reply-To: <20260331153207.3635125-1-ming.lei@redhat.com>
Add UBLK_F_SHMEM_ZC (1ULL << 19) to the UAPI header and UBLK_F_ALL.
Switch ublk_support_shmem_zc() and ublk_dev_support_shmem_zc() from
returning false to checking the actual flag, enabling the shared
memory zero-copy feature for devices that request it.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
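Illustration only, not part of this patch: step 4 of the "How It Works"
section added below shows the server turning a buffer index and offset into
a pointer inside its own mapping. A minimal sketch of that lookup follows,
assuming the index and offset have already been decoded from ``iod->addr``.
The ``struct shmem_buf`` table and the ``shmem_zc_resolve()`` helper are
hypothetical names, and the bounds checks are a defensive addition rather
than something the kernel interface is known to require.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical server-side table entry, one per registered buffer. */
struct shmem_buf {
	void *mmap_base;	/* server's own mmap() address of the buffer */
	size_t len;		/* registered length in bytes */
};

/*
 * Resolve an already-decoded (index, offset) pair to a pointer in the
 * server's mapping. Returns NULL if the index is unknown or if the I/O
 * range [offset, offset + io_len) does not fit inside the buffer.
 */
static void *shmem_zc_resolve(const struct shmem_buf *table, size_t nr_bufs,
			      uint32_t index, uint64_t offset, uint32_t io_len)
{
	const struct shmem_buf *b;

	if (index >= nr_bufs)
		return NULL;

	b = &table[index];
	/* Written to avoid overflow in offset + io_len. */
	if (offset > b->len || io_len > b->len - offset)
		return NULL;

	return (char *)b->mmap_base + offset;
}
```

A server would call this only when ``UBLK_IO_F_SHMEM_ZC`` is set in the I/O
descriptor and take the normal copy path otherwise.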
Documentation/block/ublk.rst | 117 ++++++++++++++++++++++++++++++++++
drivers/block/ublk_drv.c | 7 +-
include/uapi/linux/ublk_cmd.h | 7 ++
3 files changed, 128 insertions(+), 3 deletions(-)
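Illustration only, not part of this patch: step 1 of the "How It Works"
section depends on two ``MAP_SHARED`` mappings of one memfd being backed by
the same pages. The sketch below demonstrates that property inside a single
process, with the two mappings standing in for the client and the ublk
server. ``shared_mapping_demo()`` is a hypothetical helper. It invokes the
raw ``memfd_create`` system call so that it builds without feature-test
macros; glibc's ``memfd_create()`` wrapper is the usual choice. It does not
register anything with ublk.

```c
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Map one memfd twice with MAP_SHARED and check that data written through
 * the "client" mapping is visible through the "server" mapping.
 * Returns 0 on success, -1 on any failure.
 */
static int shared_mapping_demo(size_t size)
{
	static const char msg[] = "hello ublk";
	char *client, *server;
	int ret = -1;
	int fd;

	if (size < sizeof(msg))
		return -1;

	fd = (int)syscall(SYS_memfd_create, "ublk-shmem-demo", 0U);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, (off_t)size) < 0)
		goto out_close;

	client = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (client == MAP_FAILED)
		goto out_close;
	server = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (server == MAP_FAILED)
		goto out_unmap_client;

	/* Different virtual addresses, same underlying pages. */
	memcpy(client, msg, sizeof(msg));
	if (client != server && memcmp(server, msg, sizeof(msg)) == 0)
		ret = 0;

	munmap(server, size);
out_unmap_client:
	munmap(client, size);
out_close:
	close(fd);
	return ret;
}
```

In a real deployment the two mappings live in different processes, with the
file descriptor passed over a unix socket via ``SCM_RIGHTS`` or both sides
opening the same hugetlbfs file, as the documentation below describes.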
diff --git a/Documentation/block/ublk.rst b/Documentation/block/ublk.rst
index 6ad28039663d..a818e09a4b66 100644
--- a/Documentation/block/ublk.rst
+++ b/Documentation/block/ublk.rst
@@ -485,6 +485,123 @@ Limitations
in case that too many ublk devices are handled by this single io_ring_ctx
and each one has very large queue depth
+Shared Memory Zero Copy (UBLK_F_SHMEM_ZC)
+------------------------------------------
+
+The ``UBLK_F_SHMEM_ZC`` feature provides an alternative zero-copy path
+that works by sharing physical memory pages between the client application
+and the ublk server. Unlike the io_uring fixed buffer approach above,
+shared memory zero copy does not require io_uring buffer registration
+per I/O. Instead, it relies on the kernel matching page frame numbers
+(PFNs) at I/O time. This also lets the ublk server access the shared
+buffer directly through its own mapping, which the io_uring fixed buffer
+approach does not allow.
+
+Motivation
+~~~~~~~~~~
+
+Rather than registering a buffer for each I/O, shared memory zero copy
+relies on one observation: if the client application and the ublk server
+both map the same physical memory, there is nothing to copy. The kernel
+detects the shared pages and tells the server where the data already lives.
+
+``UBLK_F_SHMEM_ZC`` can be thought of as a supplement for optimized client
+applications — when the client is willing to allocate I/O buffers from
+shared memory, the entire data path becomes zero-copy without any per-I/O
+overhead.
+
+Use Cases
+~~~~~~~~~
+
+This feature is useful when the client application can be configured to
+use a specific shared memory region for its I/O buffers:
+
+- **Custom storage clients** that allocate I/O buffers from shared memory
+ (memfd, hugetlbfs) and issue direct I/O to the ublk device
+- **Database engines** that use pre-allocated buffer pools with O_DIRECT
+
+How It Works
+~~~~~~~~~~~~
+
+1. The ublk server and client both ``mmap()`` the same file (memfd or
+ hugetlbfs) with ``MAP_SHARED``. This gives both processes access to the
+ same physical pages.
+
+2. The ublk server registers its mapping with the kernel::
+
+ struct ublk_shmem_buf_reg buf = { .addr = mmap_va, .len = size };
+ ublk_ctrl_cmd(UBLK_U_CMD_REG_BUF, .addr = &buf);
+
+ The kernel pins the pages and builds a PFN lookup tree.
+
+3. When the client issues direct I/O (``O_DIRECT``) to ``/dev/ublkb*``,
+ the kernel checks whether the I/O buffer pages match any registered
+ pages by comparing PFNs.
+
+4. On a match, the kernel sets ``UBLK_IO_F_SHMEM_ZC`` in the I/O
+ descriptor and encodes the buffer index and offset in ``addr``::
+
+ if (iod->op_flags & UBLK_IO_F_SHMEM_ZC) {
+ /* Data is already in our shared mapping — zero copy */
+ index = ublk_shmem_zc_index(iod->addr);
+ offset = ublk_shmem_zc_offset(iod->addr);
+ buf = shmem_table[index].mmap_base + offset;
+ }
+
+5. If pages do not match (e.g., the client used a non-shared buffer),
+ the I/O falls back to the normal copy path silently.
+
+The shared memory can be set up via two methods:
+
+- **Socket-based**: the client sends a memfd to the ublk server via
+ ``SCM_RIGHTS`` on a unix socket. The server mmaps and registers it.
+- **Hugetlbfs-based**: both processes ``mmap(MAP_SHARED)`` the same
+ hugetlbfs file. No IPC needed — same file gives same physical pages.
+
+Advantages
+~~~~~~~~~~
+
+- **Simple**: no per-I/O buffer registration or unregistration commands.
+ Once the shared buffer is registered, all matching I/O is zero-copy
+ automatically.
+- **Direct buffer access**: the ublk server can read and write the shared
+ buffer directly via its own mmap, without going through io_uring fixed
+ buffer operations. This simplifies server implementations.
+- **Fast**: PFN matching is a single maple tree lookup per bvec. No
+ io_uring command round-trips for buffer management.
+- **Compatible**: non-matching I/O silently falls back to the copy path.
+ The device works normally for any client, with zero-copy as an
+ optimization when shared memory is available.
+
+Limitations
+~~~~~~~~~~~
+
+- **Requires client cooperation**: the client must allocate its I/O
+ buffers from the shared memory region. This requires a custom or
+ configured client — standard applications using their own buffers
+ will not benefit.
+- **Direct I/O only**: buffered I/O (without ``O_DIRECT``) goes through
+ the page cache, which allocates its own pages. These kernel-allocated
+ pages will never match the registered shared buffer. Only ``O_DIRECT``
+ puts the client's buffer pages directly into the block I/O.
+
+Control Commands
+~~~~~~~~~~~~~~~~
+
+- ``UBLK_U_CMD_REG_BUF``
+
+ Register a shared memory buffer. ``ctrl_cmd.addr`` points to a
+ ``struct ublk_shmem_buf_reg`` containing the buffer's virtual address
+ and size. Returns the assigned buffer index (>= 0) on success. The
+ kernel pins the pages and builds the PFN lookup tree. Queue freeze is
+ handled internally.
+
+- ``UBLK_U_CMD_UNREG_BUF``
+
+ Unregister a previously registered buffer. ``ctrl_cmd.data[0]`` is the
+ buffer index. Unpins pages and removes PFN entries from the lookup
+ tree.
+
References
==========
diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index d53865437600..c2b9992503a4 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -85,7 +85,8 @@
| (IS_ENABLED(CONFIG_BLK_DEV_INTEGRITY) ? UBLK_F_INTEGRITY : 0) \
| UBLK_F_SAFE_STOP_DEV \
| UBLK_F_BATCH_IO \
- | UBLK_F_NO_AUTO_PART_SCAN)
+ | UBLK_F_NO_AUTO_PART_SCAN \
+ | UBLK_F_SHMEM_ZC)
#define UBLK_F_ALL_RECOVERY_FLAGS (UBLK_F_USER_RECOVERY \
| UBLK_F_USER_RECOVERY_REISSUE \
@@ -425,7 +426,7 @@ static inline bool ublk_dev_support_zero_copy(const struct ublk_device *ub)
static inline bool ublk_support_shmem_zc(const struct ublk_queue *ubq)
{
- return false;
+ return ubq->flags & UBLK_F_SHMEM_ZC;
}
static inline bool ublk_iod_is_shmem_zc(const struct ublk_queue *ubq,
@@ -436,7 +437,7 @@ static inline bool ublk_iod_is_shmem_zc(const struct ublk_queue *ubq,
static inline bool ublk_dev_support_shmem_zc(const struct ublk_device *ub)
{
- return false;
+ return ub->dev_info.flags & UBLK_F_SHMEM_ZC;
}
static inline bool ublk_support_auto_buf_reg(const struct ublk_queue *ubq)
diff --git a/include/uapi/linux/ublk_cmd.h b/include/uapi/linux/ublk_cmd.h
index 52bb9b843d73..ecd258847d3d 100644
--- a/include/uapi/linux/ublk_cmd.h
+++ b/include/uapi/linux/ublk_cmd.h
@@ -408,6 +408,13 @@ struct ublk_shmem_buf_reg {
/* Disable automatic partition scanning when device is started */
#define UBLK_F_NO_AUTO_PART_SCAN (1ULL << 18)
+/*
+ * Enable shared memory zero copy. When enabled, the server can register
+ * shared memory buffers via UBLK_U_CMD_REG_BUF. If a block request's
+ * pages match a registered buffer, UBLK_IO_F_SHMEM_ZC is set and addr
+ * encodes the buffer index + offset instead of a userspace buffer address.
+ */
+#define UBLK_F_SHMEM_ZC (1ULL << 19)
/* device state */
#define UBLK_S_DEV_DEAD 0
--
2.53.0