public inbox for linux-block@vger.kernel.org
From: Ming Lei <ming.lei@redhat.com>
To: Caleb Sander Mateos <csander@purestorage.com>
Cc: Jens Axboe <axboe@kernel.dk>, linux-block@vger.kernel.org
Subject: Re: [PATCH v2 02/10] ublk: add PFN-based buffer matching in I/O path
Date: Wed, 8 Apr 2026 10:36:01 +0800	[thread overview]
Message-ID: <adW_EQW1AAJdaWqN@fedora> (raw)
In-Reply-To: <CADUfDZq3rTwcnxAekdan3-fi3qdLwFp-T5LsJJ1viqoUv8RqOw@mail.gmail.com>

On Tue, Apr 07, 2026 at 12:47:29PM -0700, Caleb Sander Mateos wrote:
> On Tue, Mar 31, 2026 at 8:32 AM Ming Lei <ming.lei@redhat.com> wrote:
> >
> > Add ublk_try_buf_match() which walks a request's bio_vecs, looks up
> > each page's PFN in the per-device maple tree, and verifies all pages
> > belong to the same registered buffer at contiguous offsets.
> >
> > Add ublk_iod_is_shmem_zc() inline helper for checking whether a
> > request uses the shmem zero-copy path.
> >
> > Integrate into the I/O path:
> > - ublk_setup_iod(): if pages match a registered buffer, set
> >   UBLK_IO_F_SHMEM_ZC and encode buffer index + offset in addr
> > - ublk_start_io(): skip ublk_map_io() for zero-copy requests
> > - __ublk_complete_rq(): skip ublk_unmap_io() for zero-copy requests
> >
> > The feature remains disabled (ublk_support_shmem_zc() returns false)
> > until the UBLK_F_SHMEM_ZC flag is enabled in the next patch.
> >
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >  drivers/block/ublk_drv.c | 77 +++++++++++++++++++++++++++++++++++++++-
> >  1 file changed, 76 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> > index ac6ccc174d44..d53865437600 100644
> > --- a/drivers/block/ublk_drv.c
> > +++ b/drivers/block/ublk_drv.c
> > @@ -356,6 +356,8 @@ struct ublk_params_header {
> >
> >  static void ublk_io_release(void *priv);
> >  static void ublk_stop_dev_unlocked(struct ublk_device *ub);
> > +static bool ublk_try_buf_match(struct ublk_device *ub, struct request *rq,
> > +                                 u32 *buf_idx, u32 *buf_off);
> 
> buf_idx could be a u16 * for consistency?

Yeah.

> 
> >  static void ublk_buf_cleanup(struct ublk_device *ub);
> >  static void ublk_abort_queue(struct ublk_device *ub, struct ublk_queue *ubq);
> >  static inline struct request *__ublk_check_and_get_req(struct ublk_device *ub,
> > @@ -426,6 +428,12 @@ static inline bool ublk_support_shmem_zc(const struct ublk_queue *ubq)
> >         return false;
> >  }
> >
> > +static inline bool ublk_iod_is_shmem_zc(const struct ublk_queue *ubq,
> > +                                       unsigned int tag)
> > +{
> > +       return ublk_get_iod(ubq, tag)->op_flags & UBLK_IO_F_SHMEM_ZC;
> > +}
> > +
> >  static inline bool ublk_dev_support_shmem_zc(const struct ublk_device *ub)
> >  {
> >         return false;
> > @@ -1494,6 +1502,18 @@ static blk_status_t ublk_setup_iod(struct ublk_queue *ubq, struct request *req)
> >         iod->nr_sectors = blk_rq_sectors(req);
> >         iod->start_sector = blk_rq_pos(req);
> >
> > +       /* Try shmem zero-copy match before setting addr */
> > +       if (ublk_support_shmem_zc(ubq) && ublk_rq_has_data(req)) {
> > +               u32 buf_idx, buf_off;
> > +
> > +               if (ublk_try_buf_match(ubq->dev, req,
> > +                                         &buf_idx, &buf_off)) {
> > +                       iod->op_flags |= UBLK_IO_F_SHMEM_ZC;
> > +                       iod->addr = ublk_shmem_zc_addr(buf_idx, buf_off);
> > +                       return BLK_STS_OK;
> > +               }
> > +       }
> > +
> >         iod->addr = io->buf.addr;
> >
> >         return BLK_STS_OK;
> > @@ -1539,6 +1559,10 @@ static inline void __ublk_complete_rq(struct request *req, struct ublk_io *io,
> >             req_op(req) != REQ_OP_DRV_IN)
> >                 goto exit;
> >
> > +       /* shmem zero copy: no data to unmap, pages already shared */
> > +       if (ublk_iod_is_shmem_zc(req->mq_hctx->driver_data, req->tag))
> 
> This is a lot of pointer chasing. Could we track this with a flag on
> struct ublk_io instead?

Yeah, it can be done either by adding a new flag and setting it in
ublk_start_io(), or by just passing `ubq` to __ublk_complete_rq() from its
call sites.

I prefer the latter.
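
Something like the following untested sketch of the latter option (only the
relevant lines; the real change has to touch every call site too):

```diff
-static inline void __ublk_complete_rq(struct request *req, struct ublk_io *io,
+static inline void __ublk_complete_rq(const struct ublk_queue *ubq,
+                                      struct request *req, struct ublk_io *io,
 ...
-       if (ublk_iod_is_shmem_zc(req->mq_hctx->driver_data, req->tag))
+       if (ublk_iod_is_shmem_zc(ubq, req->tag))
                goto exit;
```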

> 
> > +               goto exit;
> > +
> >         /* for READ request, writing data in iod->addr to rq buffers */
> >         unmapped_bytes = ublk_unmap_io(need_map, req, io);
> >
> > @@ -1697,8 +1721,13 @@ static void ublk_auto_buf_dispatch(const struct ublk_queue *ubq,
> >  static bool ublk_start_io(const struct ublk_queue *ubq, struct request *req,
> >                           struct ublk_io *io)
> >  {
> > -       unsigned mapped_bytes = ublk_map_io(ubq, req, io);
> > +       unsigned mapped_bytes;
> >
> > +       /* shmem zero copy: skip data copy, pages already shared */
> > +       if (ublk_iod_is_shmem_zc(ubq, req->tag))
> > +               return true;
> > +
> > +       mapped_bytes = ublk_map_io(ubq, req, io);
> >
> >         /* partially mapped, update io descriptor */
> >         if (unlikely(mapped_bytes != blk_rq_bytes(req))) {
> > @@ -5458,7 +5487,53 @@ static void ublk_buf_cleanup(struct ublk_device *ub)
> >         mtree_destroy(&ub->buf_tree);
> >  }
> >
> > +/* Check if request pages match a registered shared memory buffer */
> > +static bool ublk_try_buf_match(struct ublk_device *ub,
> > +                                  struct request *rq,
> > +                                  u32 *buf_idx, u32 *buf_off)
> > +{
> > +       struct req_iterator iter;
> > +       struct bio_vec bv;
> > +       int index = -1;
> > +       unsigned long expected_offset = 0;
> > +       bool first = true;
> 
> Could check index < 0 in place of first?
> 
> > +
> > +       rq_for_each_bvec(bv, rq, iter) {
> > +               unsigned long pfn = page_to_pfn(bv.bv_page);
> > +               struct ublk_buf_range *range;
> > +               unsigned long off;
> > +
> > +               range = mtree_load(&ub->buf_tree, pfn);
> > +               if (!range)
> > +                       return false;
> > +
> > +               off = range->base_offset +
> > +                       (pfn - range->base_pfn) * PAGE_SIZE + bv.bv_offset;
> 
> Doesn't this need to check that the end of the bvec is less than the
> end of the range? Otherwise, the bvec could extend into physically
> contiguous pages that aren't part of the registered range.

It is actually fixed in my local tree:

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 2e475bdc54dd..69db3b3b9071 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -5524,13 +5524,20 @@ static bool ublk_try_buf_match(struct ublk_device *ub,

        rq_for_each_bvec(bv, rq, iter) {
                unsigned long pfn = page_to_pfn(bv.bv_page);
+               unsigned long end_pfn = pfn +
+                       ((bv.bv_offset + bv.bv_len - 1) >> PAGE_SHIFT);
                struct ublk_buf_range *range;
                unsigned long off;
+               MA_STATE(mas, &ub->buf_tree, pfn, pfn);

-               range = mtree_load(&ub->buf_tree, pfn);
+               range = mas_walk(&mas);
                if (!range)
                        return false;

+               /* verify all pages in this bvec fall within the range */
+               if (end_pfn > mas.last)
+                       return false;
+
                off = range->base_offset +
                        (pfn - range->base_pfn) * PAGE_SIZE + bv.bv_offset;

> 
> Also, the range could precompute base_pfn - base_offset / PAGE_SIZE
> instead of base_offset to make this a bit cheaper.
> 
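That should indeed be a bit cheaper, provided base_offset stays page
aligned. Quick userspace check that the precomputed form is equivalent
(numbers are illustrative):

```c
#include <assert.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE (1UL << PAGE_SHIFT)

/* what the loop computes today */
static unsigned long off_current(unsigned long base_pfn,
				 unsigned long base_offset,
				 unsigned long pfn, unsigned int bv_offset)
{
	return base_offset + (pfn - base_pfn) * PAGE_SIZE + bv_offset;
}

/*
 * With pfn_bias = base_pfn - base_offset / PAGE_SIZE stored once at
 * registration time, one add disappears from the per-bvec path.
 * Only exact when base_offset is page aligned.
 */
static unsigned long off_precomputed(unsigned long pfn_bias,
				     unsigned long pfn,
				     unsigned int bv_offset)
{
	return ((pfn - pfn_bias) << PAGE_SHIFT) + bv_offset;
}
```

Will consider it for the next version.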
> > +
> > +               if (first) {
> > +                       /* Read-only buffer can't serve READ (kernel writes) */
> > +                       if ((range->flags & UBLK_SHMEM_BUF_READ_ONLY) &&
> > +                           req_op(rq) != REQ_OP_WRITE)
> > +                               return false;
> > +                       index = range->buf_index;
> > +                       expected_offset = off;
> > +                       *buf_off = off;
> > +                       first = false;
> > +               } else {
> > +                       if (range->buf_index != index)
> > +                               return false;
> > +                       if (off != expected_offset)
> > +                               return false;
> > +               }
> > +               expected_offset += bv.bv_len;
> > +       }
> > +
> > +       if (first)
> > +               return false;
> 
> How is this case possible? That would mean the request has no bvecs,
> but ublk_try_buf_match() is only called for requests with data, right?

You are right, will clean it up in the next version.
 
Thanks,
Ming



Thread overview: 24+ messages
2026-03-31 15:31 [PATCH v2 00/10] ublk: add shared memory zero-copy support Ming Lei
2026-03-31 15:31 ` [PATCH v2 01/10] ublk: add UBLK_U_CMD_REG_BUF/UNREG_BUF control commands Ming Lei
2026-04-07 19:35   ` Caleb Sander Mateos
2026-04-08  2:23     ` Ming Lei
2026-03-31 15:31 ` [PATCH v2 02/10] ublk: add PFN-based buffer matching in I/O path Ming Lei
2026-04-07 19:47   ` Caleb Sander Mateos
2026-04-08  2:36     ` Ming Lei [this message]
2026-03-31 15:31 ` [PATCH v2 03/10] ublk: enable UBLK_F_SHMEM_ZC feature flag Ming Lei
2026-04-07 19:47   ` Caleb Sander Mateos
2026-04-08  2:50     ` Ming Lei
2026-03-31 15:31 ` [PATCH v2 04/10] ublk: eliminate permanent pages[] array from struct ublk_buf Ming Lei
2026-04-07 19:50   ` Caleb Sander Mateos
2026-04-08  2:58     ` Ming Lei
2026-03-31 15:31 ` [PATCH v2 05/10] selftests/ublk: add shared memory zero-copy support in kublk Ming Lei
2026-03-31 15:31 ` [PATCH v2 06/10] selftests/ublk: add UBLK_F_SHMEM_ZC support for loop target Ming Lei
2026-03-31 15:31 ` [PATCH v2 07/10] selftests/ublk: add shared memory zero-copy test Ming Lei
2026-03-31 15:31 ` [PATCH v2 08/10] selftests/ublk: add hugetlbfs shmem_zc test for loop target Ming Lei
2026-03-31 15:32 ` [PATCH v2 09/10] selftests/ublk: add filesystem fio verify test for shmem_zc Ming Lei
2026-03-31 15:32 ` [PATCH v2 10/10] selftests/ublk: add read-only buffer registration test Ming Lei
2026-04-07  2:38 ` [PATCH v2 00/10] ublk: add shared memory zero-copy support Ming Lei
2026-04-07 13:34   ` Jens Axboe
2026-04-07 19:29   ` Caleb Sander Mateos
2026-04-08  3:03     ` Ming Lei
2026-04-07 13:44 ` Jens Axboe
