From: Ming Lei <ming.lei@redhat.com>
To: Caleb Sander Mateos <csander@purestorage.com>
Cc: Jens Axboe <axboe@kernel.dk>, linux-block@vger.kernel.org
Subject: Re: [PATCH v2 03/10] ublk: enable UBLK_F_SHMEM_ZC feature flag
Date: Wed, 8 Apr 2026 10:50:51 +0800 [thread overview]
Message-ID: <adXCiySBAocA7Cpm@fedora> (raw)
In-Reply-To: <CADUfDZrFe2jWV1SpA6opQXB_Zx7COGXEeJAFNRWdMAaaDHomWQ@mail.gmail.com>
On Tue, Apr 07, 2026 at 12:47:58PM -0700, Caleb Sander Mateos wrote:
> On Tue, Mar 31, 2026 at 8:32 AM Ming Lei <ming.lei@redhat.com> wrote:
> >
> > Add UBLK_F_SHMEM_ZC (1ULL << 19) to the UAPI header and UBLK_F_ALL.
> > Switch ublk_support_shmem_zc() and ublk_dev_support_shmem_zc() from
> > returning false to checking the actual flag, enabling the shared
> > memory zero-copy feature for devices that request it.
> >
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> > Documentation/block/ublk.rst | 117 ++++++++++++++++++++++++++++++++++
> > drivers/block/ublk_drv.c | 7 +-
> > include/uapi/linux/ublk_cmd.h | 7 ++
> > 3 files changed, 128 insertions(+), 3 deletions(-)
> >
> > diff --git a/Documentation/block/ublk.rst b/Documentation/block/ublk.rst
> > index 6ad28039663d..a818e09a4b66 100644
> > --- a/Documentation/block/ublk.rst
> > +++ b/Documentation/block/ublk.rst
> > @@ -485,6 +485,123 @@ Limitations
> > in case that too many ublk devices are handled by this single io_ring_ctx
> > and each one has very large queue depth
> >
> > +Shared Memory Zero Copy (UBLK_F_SHMEM_ZC)
> > +------------------------------------------
> > +
> > +The ``UBLK_F_SHMEM_ZC`` feature provides an alternative zero-copy path
> > +that works by sharing physical memory pages between the client application
> > +and the ublk server. Unlike the io_uring fixed buffer approach above,
> > +shared memory zero copy does not require io_uring buffer registration
> > +per I/O — instead, it relies on the kernel matching page frame numbers
> > +(PFNs) at I/O time. This allows the ublk server to access the shared
>
> Maybe "physical pages" would be clearer than the kernel-internal
> concept of "page frame numbers"?
OK, but since this is kernel documentation, using PFN shouldn't be a problem.
>
> > +buffer directly, which is not possible with the io_uring fixed buffer
> > +approach.
> > +
> > +Motivation
> > +~~~~~~~~~~
> > +
> > +Shared memory zero copy takes a different approach: if the client
> > +application and the ublk server both map the same physical memory, there is
> > +nothing to copy. The kernel detects the shared pages automatically and
> > +tells the server where the data already lives.
> > +
> > +``UBLK_F_SHMEM_ZC`` can be thought of as a supplement for optimized client
> > +applications — when the client is willing to allocate I/O buffers from
> > +shared memory, the entire data path becomes zero-copy without any per-I/O
> > +overhead.
>
> nit: The shmem buffer lookup still has some overhead. I think just
> "becomes zero-copy" would be fine.
Fine, will do. The maple tree has very small depth, so the lookup cost is
pretty small anyway.
>
> > +
> > +Use Cases
> > +~~~~~~~~~
> > +
> > +This feature is useful when the client application can be configured to
> > +use a specific shared memory region for its I/O buffers:
> > +
> > +- **Custom storage clients** that allocate I/O buffers from shared memory
> > + (memfd, hugetlbfs) and issue direct I/O to the ublk device
> > +- **Database engines** that use pre-allocated buffer pools with O_DIRECT
> > +
> > +How It Works
> > +~~~~~~~~~~~~
> > +
> > +1. The ublk server and client both ``mmap()`` the same file (memfd or
> > + hugetlbfs) with ``MAP_SHARED``. This gives both processes access to the
> > + same physical pages.
> > +
> > +2. The ublk server registers its mapping with the kernel::
> > +
> > + struct ublk_buf_reg buf = { .addr = mmap_va, .len = size };
> > + ublk_ctrl_cmd(UBLK_U_CMD_REG_BUF, .addr = &buf);
>
> This doesn't look like valid C syntax. Maybe it could say something like:
>
>     struct ublksrv_ctrl_cmd cmd = { .dev_id = dev_id, .addr = &buf,
>                                     .len = sizeof(buf) };
>     io_uring_prep_uring_cmd(sqe, UBLK_U_CMD_REG_BUF, ublk_control_fd);
>     memcpy(sqe->cmd, &cmd, sizeof(cmd));
It is meant as pseudocode, so it doesn't look like a big deal to me.
>
> > +
> > + The kernel pins the pages and builds a PFN lookup tree.
> > +
> > +3. When the client issues direct I/O (``O_DIRECT``) to ``/dev/ublkb*``,
> > + the kernel checks whether the I/O buffer pages match any registered
> > + pages by comparing PFNs.
> > +
> > +4. On a match, the kernel sets ``UBLK_IO_F_SHMEM_ZC`` in the I/O
> > + descriptor and encodes the buffer index and offset in ``addr``::
> > +
> > + if (iod->op_flags & UBLK_IO_F_SHMEM_ZC) {
> > + /* Data is already in our shared mapping — zero copy */
> > + index = ublk_shmem_zc_index(iod->addr);
> > + offset = ublk_shmem_zc_offset(iod->addr);
> > + buf = shmem_table[index].mmap_base + offset;
> > + }
> > +
> > +5. If pages do not match (e.g., the client used a non-shared buffer),
> > + the I/O falls back to the normal copy path silently.
> > +
> > +The shared memory can be set up via two methods:
> > +
> > +- **Socket-based**: the client sends a memfd to the ublk server via
> > + ``SCM_RIGHTS`` on a unix socket. The server mmaps and registers it.
> > +- **Hugetlbfs-based**: both processes ``mmap(MAP_SHARED)`` the same
> > + hugetlbfs file. No IPC needed — same file gives same physical pages.
> > +
> > +Advantages
> > +~~~~~~~~~~
> > +
> > +- **Simple**: no per-I/O buffer registration or unregistration commands.
> > + Once the shared buffer is registered, all matching I/O is zero-copy
> > + automatically.
> > +- **Direct buffer access**: the ublk server can read and write the shared
> > + buffer directly via its own mmap, without going through io_uring fixed
> > + buffer operations. This is more friendly for server implementations.
> > +- **Fast**: PFN matching is a single maple tree lookup per bvec. No
> > + io_uring command round-trips for buffer management.
> > +- **Compatible**: non-matching I/O silently falls back to the copy path.
> > + The device works normally for any client, with zero-copy as an
> > + optimization when shared memory is available.
> > +
> > +Limitations
> > +~~~~~~~~~~~
> > +
> > +- **Requires client cooperation**: the client must allocate its I/O
> > + buffers from the shared memory region. This requires a custom or
> > + configured client — standard applications using their own buffers
> > + will not benefit.
> > +- **Direct I/O only**: buffered I/O (without ``O_DIRECT``) goes through
> > + the page cache, which allocates its own pages. These kernel-allocated
> > + pages will never match the registered shared buffer. Only ``O_DIRECT``
> > + puts the client's buffer pages directly into the block I/O.
>
> One other limitation that might be worth mentioning is that
> scatter/gather I/O can't use the SHMEM_ZC optimization, as the
> request's data must be contiguous in the registered virtual address
> range.
Good catch, will document this limit.
It could be supported in the future by introducing BPF: a BPF program
could use its map (such as an arena) to build iovec-like data for
userspace.
Thanks,
Ming
2026-03-31 15:31 [PATCH v2 00/10] ublk: add shared memory zero-copy support Ming Lei
2026-03-31 15:31 ` [PATCH v2 01/10] ublk: add UBLK_U_CMD_REG_BUF/UNREG_BUF control commands Ming Lei
2026-04-07 19:35 ` Caleb Sander Mateos
2026-04-08 2:23 ` Ming Lei
2026-03-31 15:31 ` [PATCH v2 02/10] ublk: add PFN-based buffer matching in I/O path Ming Lei
2026-04-07 19:47 ` Caleb Sander Mateos
2026-04-08 2:36 ` Ming Lei
2026-03-31 15:31 ` [PATCH v2 03/10] ublk: enable UBLK_F_SHMEM_ZC feature flag Ming Lei
2026-04-07 19:47 ` Caleb Sander Mateos
2026-04-08 2:50 ` Ming Lei [this message]
2026-03-31 15:31 ` [PATCH v2 04/10] ublk: eliminate permanent pages[] array from struct ublk_buf Ming Lei
2026-04-07 19:50 ` Caleb Sander Mateos
2026-04-08 2:58 ` Ming Lei
2026-03-31 15:31 ` [PATCH v2 05/10] selftests/ublk: add shared memory zero-copy support in kublk Ming Lei
2026-03-31 15:31 ` [PATCH v2 06/10] selftests/ublk: add UBLK_F_SHMEM_ZC support for loop target Ming Lei
2026-03-31 15:31 ` [PATCH v2 07/10] selftests/ublk: add shared memory zero-copy test Ming Lei
2026-03-31 15:31 ` [PATCH v2 08/10] selftests/ublk: add hugetlbfs shmem_zc test for loop target Ming Lei
2026-03-31 15:32 ` [PATCH v2 09/10] selftests/ublk: add filesystem fio verify test for shmem_zc Ming Lei
2026-03-31 15:32 ` [PATCH v2 10/10] selftests/ublk: add read-only buffer registration test Ming Lei
2026-04-07 2:38 ` [PATCH v2 00/10] ublk: add shared memory zero-copy support Ming Lei
2026-04-07 13:34 ` Jens Axboe
2026-04-07 19:29 ` Caleb Sander Mateos
2026-04-08 3:03 ` Ming Lei
2026-04-07 13:44 ` Jens Axboe