public inbox for linux-block@vger.kernel.org
 help / color / mirror / Atom feed
From: Ming Lei <ming.lei@redhat.com>
To: Caleb Sander Mateos <csander@purestorage.com>
Cc: Jens Axboe <axboe@kernel.dk>,
	linux-block@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] ublk: use copy_{to,from}_iter() for user copy
Date: Wed, 5 Nov 2025 09:48:09 +0800	[thread overview]
Message-ID: <aQqs2TlXU0UYlsuy@fedora> (raw)
In-Reply-To: <CADUfDZqKV2SzbWoe4gr4aSPaBtr+VwmEgEidZKo=LQBU9Quf2Q@mail.gmail.com>

On Mon, Nov 03, 2025 at 08:40:30AM -0800, Caleb Sander Mateos wrote:
> On Fri, Oct 31, 2025 at 4:04 PM Ming Lei <ming.lei@redhat.com> wrote:
> >
> > On Fri, Oct 31, 2025 at 09:02:48AM -0700, Caleb Sander Mateos wrote:
> > > On Thu, Oct 30, 2025 at 8:45 PM Ming Lei <ming.lei@redhat.com> wrote:
> > > >
> > > > On Thu, Oct 30, 2025 at 07:05:21PM -0600, Caleb Sander Mateos wrote:
> > > > > ublk_copy_user_pages()/ublk_copy_io_pages() currently uses
> > > > > iov_iter_get_pages2() to extract the pages from the iov_iter and
> > > > > memcpy()s between the bvec_iter and the iov_iter's pages one at a time.
> > > > > Switch to using copy_to_iter()/copy_from_iter() instead. This avoids the
> > > > > user page reference count increments and decrements and needing to split
> > > > > the memcpy() at user page boundaries. It also simplifies the code
> > > > > considerably.
> > > > >
> > > > > Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
> > > > > ---
> > > > >  drivers/block/ublk_drv.c | 62 +++++++++-------------------------------
> > > > >  1 file changed, 14 insertions(+), 48 deletions(-)
> > > > >
> > > > > diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> > > > > index 0c74a41a6753..852350e639d6 100644
> > > > > --- a/drivers/block/ublk_drv.c
> > > > > +++ b/drivers/block/ublk_drv.c
> > > > > @@ -912,58 +912,47 @@ static const struct block_device_operations ub_fops = {
> > > > >       .open =         ublk_open,
> > > > >       .free_disk =    ublk_free_disk,
> > > > >       .report_zones = ublk_report_zones,
> > > > >  };
> > > > >
> > > > > -#define UBLK_MAX_PIN_PAGES   32
> > > > > -
> > > > >  struct ublk_io_iter {
> > > > > -     struct page *pages[UBLK_MAX_PIN_PAGES];
> > > > >       struct bio *bio;
> > > > >       struct bvec_iter iter;
> > > > >  };
> > > >
> > > > ->pages[] is actually for pinning user io pages in batch, so killing it may cause
> > > > perf drop.
> > >
> > > As far as I can tell, copy_to_iter()/copy_from_iter() avoids the page
> > > pinning entirely. It calls copy_to_user_iter() for each contiguous
> > > user address range:
> > >
> > > size_t _copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i)
> > > {
> > >         if (WARN_ON_ONCE(i->data_source))
> > >                 return 0;
> > >         if (user_backed_iter(i))
> > >                 might_fault();
> > >         return iterate_and_advance(i, bytes, (void *)addr,
> > >                                    copy_to_user_iter, memcpy_to_iter);
> > > }
> > >
> > > Which just checks that the address range doesn't include any kernel
> > > addresses and then memcpy()s directly via the userspace virtual
> > > addresses:
> > >
> > > static __always_inline
> > > size_t copy_to_user_iter(void __user *iter_to, size_t progress,
> > >                          size_t len, void *from, void *priv2)
> > > {
> > >         if (should_fail_usercopy())
> > >                 return len;
> > >         if (access_ok(iter_to, len)) {
> > >                 from += progress;
> > >                 instrument_copy_to_user(iter_to, from, len);
> > >                 len = raw_copy_to_user(iter_to, from, len);
> > >         }
> > >         return len;
> > > }
> > >
> > > static __always_inline __must_check unsigned long
> > > raw_copy_to_user(void __user *dst, const void *src, unsigned long size)
> > > {
> > >         return copy_user_generic((__force void *)dst, src, size);
> > > }
> > >
> > > static __always_inline __must_check unsigned long
> > > copy_user_generic(void *to, const void *from, unsigned long len)
> > > {
> > >         stac();
> > >         /*
> > >          * If CPU has FSRM feature, use 'rep movs'.
> > >          * Otherwise, use rep_movs_alternative.
> > >          */
> > >         asm volatile(
> > >                 "1:\n\t"
> > >                 ALTERNATIVE("rep movsb",
> > >                             "call rep_movs_alternative",
> > > ALT_NOT(X86_FEATURE_FSRM))
> > >                 "2:\n"
> > >                 _ASM_EXTABLE_UA(1b, 2b)
> > >                 :"+c" (len), "+D" (to), "+S" (from), ASM_CALL_CONSTRAINT
> > >                 : : "memory", "rax");
> > >         clac();
> > >         return len;
> > > }
> > >
> > > Am I missing something?
> >
> > page is allocated & mapped in page fault handler.
> 
> Right, physical pages certainly need to be allocated for the virtual
> address range being copied to/from. But that would have happened
> previously in iov_iter_get_pages2(), so this isn't a new cost. And as
> you point out, in the common case that the virtual pages are already
> mapped to physical pages, the copy won't cause any page faults.
> 
> >
> > However, in typical cases, pages in io buffer shouldn't be swapped out
> > frequently, so this cleanup may be good, I will run some perf test.
> 
> Thanks for testing.

`fio/t/io_uring` shows 40% improvement on `./kublk -t null -q 2` with this
patch in my test VM, so looks very nice improvement.

Also it works well by forcing to pass IOSQE_ASYNC on the ublk uring_cmd,
and this change is correct because the copy is guaranteed to be done in ublk
daemon context.

> 
> >
> > Also copy_page_from_iter()/copy_page_to_iter() can be used for avoiding
> > bvec_kmap_local(), and the two helper can handle one whole bvec instead
> > of single page.
> 
> Yes, that's a good idea. Thanks, I didn't know about that.
> 
> >
> > Then rq_for_each_bvec() can be used directly, and `ublk_io_iter` may be
> > killed.
> 
> Hmm, we still need a way to offset into the request (i.e. what
> ublk_advance_io_iter() does currently). Are you thinking of a single
> rq_for_each_bvec() loop that would skip bvecs until the offset is
> reached and then copy until reaching the end of the user iterator?

Yeah, that is basically what ublk_advance_io_iter() does.


Thanks,
Ming


  reply	other threads:[~2025-11-05  1:48 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-10-31  1:05 [PATCH] ublk: use copy_{to,from}_iter() for user copy Caleb Sander Mateos
2025-10-31  3:45 ` Ming Lei
2025-10-31 16:02   ` Caleb Sander Mateos
2025-10-31 23:03     ` Ming Lei
2025-11-03 16:40       ` Caleb Sander Mateos
2025-11-05  1:48         ` Ming Lei [this message]
2025-11-05 15:26           ` Jens Axboe
2025-11-05 15:37             ` Caleb Sander Mateos
2025-11-05 15:41               ` Jens Axboe
2025-11-05 16:16         ` Caleb Sander Mateos
2025-11-05 17:10           ` Caleb Sander Mateos
2025-11-05 17:16             ` Caleb Sander Mateos

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aQqs2TlXU0UYlsuy@fedora \
    --to=ming.lei@redhat.com \
    --cc=axboe@kernel.dk \
    --cc=csander@purestorage.com \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox