From: Ming Lei <ming.lei@redhat.com>
To: Caleb Sander Mateos <csander@purestorage.com>
Cc: Jens Axboe <axboe@kernel.dk>, linux-block@vger.kernel.org
Subject: Re: [PATCH v2 02/10] ublk: add PFN-based buffer matching in I/O path
Date: Wed, 8 Apr 2026 10:36:01 +0800
Message-ID: <adW_EQW1AAJdaWqN@fedora>
In-Reply-To: <CADUfDZq3rTwcnxAekdan3-fi3qdLwFp-T5LsJJ1viqoUv8RqOw@mail.gmail.com>
On Tue, Apr 07, 2026 at 12:47:29PM -0700, Caleb Sander Mateos wrote:
> On Tue, Mar 31, 2026 at 8:32 AM Ming Lei <ming.lei@redhat.com> wrote:
> >
> > Add ublk_try_buf_match() which walks a request's bio_vecs, looks up
> > each page's PFN in the per-device maple tree, and verifies all pages
> > belong to the same registered buffer at contiguous offsets.
> >
> > Add ublk_iod_is_shmem_zc() inline helper for checking whether a
> > request uses the shmem zero-copy path.
> >
> > Integrate into the I/O path:
> > - ublk_setup_iod(): if pages match a registered buffer, set
> > UBLK_IO_F_SHMEM_ZC and encode buffer index + offset in addr
> > - ublk_start_io(): skip ublk_map_io() for zero-copy requests
> > - __ublk_complete_rq(): skip ublk_unmap_io() for zero-copy requests
> >
> > The feature remains disabled (ublk_support_shmem_zc() returns false)
> > until the UBLK_F_SHMEM_ZC flag is enabled in the next patch.
> >
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> > drivers/block/ublk_drv.c | 77 +++++++++++++++++++++++++++++++++++++++-
> > 1 file changed, 76 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> > index ac6ccc174d44..d53865437600 100644
> > --- a/drivers/block/ublk_drv.c
> > +++ b/drivers/block/ublk_drv.c
> > @@ -356,6 +356,8 @@ struct ublk_params_header {
> >
> > static void ublk_io_release(void *priv);
> > static void ublk_stop_dev_unlocked(struct ublk_device *ub);
> > +static bool ublk_try_buf_match(struct ublk_device *ub, struct request *rq,
> > + u32 *buf_idx, u32 *buf_off);
>
> buf_idx could be a u16 * for consistency?
Yeah.
>
> > static void ublk_buf_cleanup(struct ublk_device *ub);
> > static void ublk_abort_queue(struct ublk_device *ub, struct ublk_queue *ubq);
> > static inline struct request *__ublk_check_and_get_req(struct ublk_device *ub,
> > @@ -426,6 +428,12 @@ static inline bool ublk_support_shmem_zc(const struct ublk_queue *ubq)
> > return false;
> > }
> >
> > +static inline bool ublk_iod_is_shmem_zc(const struct ublk_queue *ubq,
> > + unsigned int tag)
> > +{
> > + return ublk_get_iod(ubq, tag)->op_flags & UBLK_IO_F_SHMEM_ZC;
> > +}
> > +
> > static inline bool ublk_dev_support_shmem_zc(const struct ublk_device *ub)
> > {
> > return false;
> > @@ -1494,6 +1502,18 @@ static blk_status_t ublk_setup_iod(struct ublk_queue *ubq, struct request *req)
> > iod->nr_sectors = blk_rq_sectors(req);
> > iod->start_sector = blk_rq_pos(req);
> >
> > + /* Try shmem zero-copy match before setting addr */
> > + if (ublk_support_shmem_zc(ubq) && ublk_rq_has_data(req)) {
> > + u32 buf_idx, buf_off;
> > +
> > + if (ublk_try_buf_match(ubq->dev, req,
> > + &buf_idx, &buf_off)) {
> > + iod->op_flags |= UBLK_IO_F_SHMEM_ZC;
> > + iod->addr = ublk_shmem_zc_addr(buf_idx, buf_off);
> > + return BLK_STS_OK;
> > + }
> > + }
> > +
> > iod->addr = io->buf.addr;
> >
> > return BLK_STS_OK;
> > @@ -1539,6 +1559,10 @@ static inline void __ublk_complete_rq(struct request *req, struct ublk_io *io,
> > req_op(req) != REQ_OP_DRV_IN)
> > goto exit;
> >
> > + /* shmem zero copy: no data to unmap, pages already shared */
> > + if (ublk_iod_is_shmem_zc(req->mq_hctx->driver_data, req->tag))
>
> This is a lot of pointer chasing. Could we track this with a flag on
> struct ublk_io instead?
Yeah, it can be done by adding and setting a new flag in ublk_start_io(), or
we can just pass `ubq` to __ublk_complete_rq() from its call sites.
I prefer the latter.
>
> > + goto exit;
> > +
> > /* for READ request, writing data in iod->addr to rq buffers */
> > unmapped_bytes = ublk_unmap_io(need_map, req, io);
> >
> > @@ -1697,8 +1721,13 @@ static void ublk_auto_buf_dispatch(const struct ublk_queue *ubq,
> > static bool ublk_start_io(const struct ublk_queue *ubq, struct request *req,
> > struct ublk_io *io)
> > {
> > - unsigned mapped_bytes = ublk_map_io(ubq, req, io);
> > + unsigned mapped_bytes;
> >
> > + /* shmem zero copy: skip data copy, pages already shared */
> > + if (ublk_iod_is_shmem_zc(ubq, req->tag))
> > + return true;
> > +
> > + mapped_bytes = ublk_map_io(ubq, req, io);
> >
> > /* partially mapped, update io descriptor */
> > if (unlikely(mapped_bytes != blk_rq_bytes(req))) {
> > @@ -5458,7 +5487,53 @@ static void ublk_buf_cleanup(struct ublk_device *ub)
> > mtree_destroy(&ub->buf_tree);
> > }
> >
> > +/* Check if request pages match a registered shared memory buffer */
> > +static bool ublk_try_buf_match(struct ublk_device *ub,
> > + struct request *rq,
> > + u32 *buf_idx, u32 *buf_off)
> > +{
> > + struct req_iterator iter;
> > + struct bio_vec bv;
> > + int index = -1;
> > + unsigned long expected_offset = 0;
> > + bool first = true;
>
> Could check index < 0 in place of first?
>
> > +
> > + rq_for_each_bvec(bv, rq, iter) {
> > + unsigned long pfn = page_to_pfn(bv.bv_page);
> > + struct ublk_buf_range *range;
> > + unsigned long off;
> >
> > + range = mtree_load(&ub->buf_tree, pfn);
> > + if (!range)
> > + return false;
> > +
> > + off = range->base_offset +
> > + (pfn - range->base_pfn) * PAGE_SIZE + bv.bv_offset;
>
> Doesn't this need to check that the end of the bvec is less than the
> end of the range? Otherwise, the bvec could extend into physically
> contiguous pages that aren't part of the registered range.
It is actually fixed in my local tree:
diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 2e475bdc54dd..69db3b3b9071 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -5524,13 +5524,20 @@ static bool ublk_try_buf_match(struct ublk_device *ub,
rq_for_each_bvec(bv, rq, iter) {
unsigned long pfn = page_to_pfn(bv.bv_page);
+ unsigned long end_pfn = pfn +
+ ((bv.bv_offset + bv.bv_len - 1) >> PAGE_SHIFT);
struct ublk_buf_range *range;
unsigned long off;
+ MA_STATE(mas, &ub->buf_tree, pfn, pfn);
- range = mtree_load(&ub->buf_tree, pfn);
+ range = mas_walk(&mas);
if (!range)
return false;
+ /* verify all pages in this bvec fall within the range */
+ if (end_pfn > mas.last)
+ return false;
+
off = range->base_offset +
(pfn - range->base_pfn) * PAGE_SIZE + bv.bv_offset;
>
> Also, the range could precompute base_pfn - base_offset / PAGE_SIZE
> instead of base_offset to make this a bit cheaper.
>
> > +
> > + if (first) {
> > + /* Read-only buffer can't serve READ (kernel writes) */
> > + if ((range->flags & UBLK_SHMEM_BUF_READ_ONLY) &&
> > + req_op(rq) != REQ_OP_WRITE)
> > + return false;
> > + index = range->buf_index;
> > + expected_offset = off;
> > + *buf_off = off;
> > + first = false;
> > + } else {
> > + if (range->buf_index != index)
> > + return false;
> > + if (off != expected_offset)
> > + return false;
> > + }
> > + expected_offset += bv.bv_len;
> > + }
> > +
> > + if (first)
> > + return false;
>
> How is this case possible? That would mean the request has no bvecs,
> but ublk_try_buf_match() is only called for requests with data, right?
You are right, will clean it up in the next version.
Thanks,
Ming