* [PATCH] ublk: use copy_{to,from}_iter() for user copy
From: Caleb Sander Mateos @ 2025-10-31 1:05 UTC
To: Ming Lei, Jens Axboe; +Cc: Caleb Sander Mateos, linux-block, linux-kernel
ublk_copy_user_pages()/ublk_copy_io_pages() currently uses
iov_iter_get_pages2() to extract the pages from the iov_iter and
memcpy()s between the bvec_iter and the iov_iter's pages one at a time.
Switch to using copy_to_iter()/copy_from_iter() instead. This avoids the
user page reference count increments and decrements and needing to split
the memcpy() at user page boundaries. It also simplifies the code
considerably.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
---
drivers/block/ublk_drv.c | 62 +++++++++-------------------------------
1 file changed, 14 insertions(+), 48 deletions(-)
diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 0c74a41a6753..852350e639d6 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -912,58 +912,47 @@ static const struct block_device_operations ub_fops = {
.open = ublk_open,
.free_disk = ublk_free_disk,
.report_zones = ublk_report_zones,
};
-#define UBLK_MAX_PIN_PAGES 32
-
struct ublk_io_iter {
- struct page *pages[UBLK_MAX_PIN_PAGES];
struct bio *bio;
struct bvec_iter iter;
};
-/* return how many pages are copied */
-static void ublk_copy_io_pages(struct ublk_io_iter *data,
- size_t total, size_t pg_off, int dir)
+/* return how many bytes are copied */
+static size_t ublk_copy_io_pages(struct ublk_io_iter *data,
+ struct iov_iter *uiter, int dir)
{
- unsigned done = 0;
- unsigned pg_idx = 0;
+ size_t done = 0;
- while (done < total) {
+ for (;;) {
struct bio_vec bv = bio_iter_iovec(data->bio, data->iter);
- unsigned int bytes = min3(bv.bv_len, (unsigned)total - done,
- (unsigned)(PAGE_SIZE - pg_off));
void *bv_buf = bvec_kmap_local(&bv);
- void *pg_buf = kmap_local_page(data->pages[pg_idx]);
+ size_t copied;
if (dir == ITER_DEST)
- memcpy(pg_buf + pg_off, bv_buf, bytes);
+ copied = copy_to_iter(bv_buf, bv.bv_len, uiter);
else
- memcpy(bv_buf, pg_buf + pg_off, bytes);
+ copied = copy_from_iter(bv_buf, bv.bv_len, uiter);
- kunmap_local(pg_buf);
kunmap_local(bv_buf);
- /* advance page array */
- pg_off += bytes;
- if (pg_off == PAGE_SIZE) {
- pg_idx += 1;
- pg_off = 0;
- }
-
- done += bytes;
+ done += copied;
+ if (copied < bv.bv_len)
+ break;
/* advance bio */
- bio_advance_iter_single(data->bio, &data->iter, bytes);
+ bio_advance_iter_single(data->bio, &data->iter, copied);
if (!data->iter.bi_size) {
data->bio = data->bio->bi_next;
if (data->bio == NULL)
break;
data->iter = data->bio->bi_iter;
}
}
+ return done;
}
static bool ublk_advance_io_iter(const struct request *req,
struct ublk_io_iter *iter, unsigned int offset)
{
@@ -987,38 +976,15 @@ static bool ublk_advance_io_iter(const struct request *req,
*/
static size_t ublk_copy_user_pages(const struct request *req,
unsigned offset, struct iov_iter *uiter, int dir)
{
struct ublk_io_iter iter;
- size_t done = 0;
if (!ublk_advance_io_iter(req, &iter, offset))
return 0;
- while (iov_iter_count(uiter) && iter.bio) {
- unsigned nr_pages;
- ssize_t len;
- size_t off;
- int i;
-
- len = iov_iter_get_pages2(uiter, iter.pages,
- iov_iter_count(uiter),
- UBLK_MAX_PIN_PAGES, &off);
- if (len <= 0)
- return done;
-
- ublk_copy_io_pages(&iter, len, off, dir);
- nr_pages = DIV_ROUND_UP(len + off, PAGE_SIZE);
- for (i = 0; i < nr_pages; i++) {
- if (dir == ITER_DEST)
- set_page_dirty(iter.pages[i]);
- put_page(iter.pages[i]);
- }
- done += len;
- }
-
- return done;
+ return ublk_copy_io_pages(&iter, uiter, dir);
}
static inline bool ublk_need_map_req(const struct request *req)
{
return ublk_rq_has_data(req) && req_op(req) == REQ_OP_WRITE;
--
2.45.2
* Re: [PATCH] ublk: use copy_{to,from}_iter() for user copy
From: Ming Lei @ 2025-10-31 3:45 UTC
To: Caleb Sander Mateos; +Cc: Jens Axboe, linux-block, linux-kernel
On Thu, Oct 30, 2025 at 07:05:21PM -0600, Caleb Sander Mateos wrote:
> ublk_copy_user_pages()/ublk_copy_io_pages() currently uses
> iov_iter_get_pages2() to extract the pages from the iov_iter and
> memcpy()s between the bvec_iter and the iov_iter's pages one at a time.
> Switch to using copy_to_iter()/copy_from_iter() instead. This avoids the
> user page reference count increments and decrements and needing to split
> the memcpy() at user page boundaries. It also simplifies the code
> considerably.
>
> Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
> ---
> drivers/block/ublk_drv.c | 62 +++++++++-------------------------------
> 1 file changed, 14 insertions(+), 48 deletions(-)
>
> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> index 0c74a41a6753..852350e639d6 100644
> --- a/drivers/block/ublk_drv.c
> +++ b/drivers/block/ublk_drv.c
> @@ -912,58 +912,47 @@ static const struct block_device_operations ub_fops = {
> .open = ublk_open,
> .free_disk = ublk_free_disk,
> .report_zones = ublk_report_zones,
> };
>
> -#define UBLK_MAX_PIN_PAGES 32
> -
> struct ublk_io_iter {
> - struct page *pages[UBLK_MAX_PIN_PAGES];
> struct bio *bio;
> struct bvec_iter iter;
> };
->pages[] is actually for pinning user io pages in batch, so killing it may cause
perf drop.
A similar trick is used in the direct I/O code path too.
Thanks
Ming
* Re: [PATCH] ublk: use copy_{to,from}_iter() for user copy
From: Caleb Sander Mateos @ 2025-10-31 16:02 UTC
To: Ming Lei; +Cc: Jens Axboe, linux-block, linux-kernel
On Thu, Oct 30, 2025 at 8:45 PM Ming Lei <ming.lei@redhat.com> wrote:
>
> On Thu, Oct 30, 2025 at 07:05:21PM -0600, Caleb Sander Mateos wrote:
> > ublk_copy_user_pages()/ublk_copy_io_pages() currently uses
> > iov_iter_get_pages2() to extract the pages from the iov_iter and
> > memcpy()s between the bvec_iter and the iov_iter's pages one at a time.
> > Switch to using copy_to_iter()/copy_from_iter() instead. This avoids the
> > user page reference count increments and decrements and needing to split
> > the memcpy() at user page boundaries. It also simplifies the code
> > considerably.
> >
> > Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
> > ---
> > drivers/block/ublk_drv.c | 62 +++++++++-------------------------------
> > 1 file changed, 14 insertions(+), 48 deletions(-)
> >
> > diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> > index 0c74a41a6753..852350e639d6 100644
> > --- a/drivers/block/ublk_drv.c
> > +++ b/drivers/block/ublk_drv.c
> > @@ -912,58 +912,47 @@ static const struct block_device_operations ub_fops = {
> > .open = ublk_open,
> > .free_disk = ublk_free_disk,
> > .report_zones = ublk_report_zones,
> > };
> >
> > -#define UBLK_MAX_PIN_PAGES 32
> > -
> > struct ublk_io_iter {
> > - struct page *pages[UBLK_MAX_PIN_PAGES];
> > struct bio *bio;
> > struct bvec_iter iter;
> > };
>
> ->pages[] is actually for pinning user io pages in batch, so killing it may cause
> perf drop.
As far as I can tell, copy_to_iter()/copy_from_iter() avoids the page
pinning entirely. It calls copy_to_user_iter() for each contiguous
user address range:
size_t _copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i)
{
if (WARN_ON_ONCE(i->data_source))
return 0;
if (user_backed_iter(i))
might_fault();
return iterate_and_advance(i, bytes, (void *)addr,
copy_to_user_iter, memcpy_to_iter);
}
Which just checks that the address range doesn't include any kernel
addresses and then memcpy()s directly via the userspace virtual
addresses:
static __always_inline
size_t copy_to_user_iter(void __user *iter_to, size_t progress,
size_t len, void *from, void *priv2)
{
if (should_fail_usercopy())
return len;
if (access_ok(iter_to, len)) {
from += progress;
instrument_copy_to_user(iter_to, from, len);
len = raw_copy_to_user(iter_to, from, len);
}
return len;
}
static __always_inline __must_check unsigned long
raw_copy_to_user(void __user *dst, const void *src, unsigned long size)
{
return copy_user_generic((__force void *)dst, src, size);
}
static __always_inline __must_check unsigned long
copy_user_generic(void *to, const void *from, unsigned long len)
{
stac();
/*
* If CPU has FSRM feature, use 'rep movs'.
* Otherwise, use rep_movs_alternative.
*/
asm volatile(
"1:\n\t"
ALTERNATIVE("rep movsb",
"call rep_movs_alternative",
ALT_NOT(X86_FEATURE_FSRM))
"2:\n"
_ASM_EXTABLE_UA(1b, 2b)
:"+c" (len), "+D" (to), "+S" (from), ASM_CALL_CONSTRAINT
: : "memory", "rax");
clac();
return len;
}
Am I missing something?
Best,
Caleb
* Re: [PATCH] ublk: use copy_{to,from}_iter() for user copy
From: Ming Lei @ 2025-10-31 23:03 UTC
To: Caleb Sander Mateos; +Cc: Jens Axboe, linux-block, linux-kernel
On Fri, Oct 31, 2025 at 09:02:48AM -0700, Caleb Sander Mateos wrote:
> On Thu, Oct 30, 2025 at 8:45 PM Ming Lei <ming.lei@redhat.com> wrote:
> >
> > On Thu, Oct 30, 2025 at 07:05:21PM -0600, Caleb Sander Mateos wrote:
> > > ublk_copy_user_pages()/ublk_copy_io_pages() currently uses
> > > iov_iter_get_pages2() to extract the pages from the iov_iter and
> > > memcpy()s between the bvec_iter and the iov_iter's pages one at a time.
> > > Switch to using copy_to_iter()/copy_from_iter() instead. This avoids the
> > > user page reference count increments and decrements and needing to split
> > > the memcpy() at user page boundaries. It also simplifies the code
> > > considerably.
> > >
> > > Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
> > > ---
> > > drivers/block/ublk_drv.c | 62 +++++++++-------------------------------
> > > 1 file changed, 14 insertions(+), 48 deletions(-)
> > >
> > > diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> > > index 0c74a41a6753..852350e639d6 100644
> > > --- a/drivers/block/ublk_drv.c
> > > +++ b/drivers/block/ublk_drv.c
> > > @@ -912,58 +912,47 @@ static const struct block_device_operations ub_fops = {
> > > .open = ublk_open,
> > > .free_disk = ublk_free_disk,
> > > .report_zones = ublk_report_zones,
> > > };
> > >
> > > -#define UBLK_MAX_PIN_PAGES 32
> > > -
> > > struct ublk_io_iter {
> > > - struct page *pages[UBLK_MAX_PIN_PAGES];
> > > struct bio *bio;
> > > struct bvec_iter iter;
> > > };
> >
> > ->pages[] is actually for pinning user io pages in batch, so killing it may cause
> > perf drop.
>
> As far as I can tell, copy_to_iter()/copy_from_iter() avoids the page
> pinning entirely. It calls copy_to_user_iter() for each contiguous
> user address range:
>
> size_t _copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i)
> {
> if (WARN_ON_ONCE(i->data_source))
> return 0;
> if (user_backed_iter(i))
> might_fault();
> return iterate_and_advance(i, bytes, (void *)addr,
> copy_to_user_iter, memcpy_to_iter);
> }
>
> Which just checks that the address range doesn't include any kernel
> addresses and then memcpy()s directly via the userspace virtual
> addresses:
>
> static __always_inline
> size_t copy_to_user_iter(void __user *iter_to, size_t progress,
> size_t len, void *from, void *priv2)
> {
> if (should_fail_usercopy())
> return len;
> if (access_ok(iter_to, len)) {
> from += progress;
> instrument_copy_to_user(iter_to, from, len);
> len = raw_copy_to_user(iter_to, from, len);
> }
> return len;
> }
>
> static __always_inline __must_check unsigned long
> raw_copy_to_user(void __user *dst, const void *src, unsigned long size)
> {
> return copy_user_generic((__force void *)dst, src, size);
> }
>
> static __always_inline __must_check unsigned long
> copy_user_generic(void *to, const void *from, unsigned long len)
> {
> stac();
> /*
> * If CPU has FSRM feature, use 'rep movs'.
> * Otherwise, use rep_movs_alternative.
> */
> asm volatile(
> "1:\n\t"
> ALTERNATIVE("rep movsb",
> "call rep_movs_alternative",
> ALT_NOT(X86_FEATURE_FSRM))
> "2:\n"
> _ASM_EXTABLE_UA(1b, 2b)
> :"+c" (len), "+D" (to), "+S" (from), ASM_CALL_CONSTRAINT
> : : "memory", "rax");
> clac();
> return len;
> }
>
> Am I missing something?
page is allocated & mapped in page fault handler.
However, in typical cases, pages in io buffer shouldn't be swapped out
frequently, so this cleanup may be good, I will run some perf test.
Also copy_page_from_iter()/copy_page_to_iter() can be used for avoiding
bvec_kmap_local(), and the two helper can handle one whole bvec instead
of single page.
Then rq_for_each_bvec() can be used directly, and `ublk_io_iter` may be
killed.
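For example (untested sketch, reusing the patch's `uiter`/`dir` names), one
call can then cover a whole bvec from rq_for_each_bvec(), even when it
spans several pages:

	if (dir == ITER_DEST)
		copied = copy_page_to_iter(bv.bv_page, bv.bv_offset,
				bv.bv_len, uiter);
	else
		copied = copy_page_from_iter(bv.bv_page, bv.bv_offset,
				bv.bv_len, uiter);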
Thanks,
Ming
* Re: [PATCH] ublk: use copy_{to,from}_iter() for user copy
From: Caleb Sander Mateos @ 2025-11-03 16:40 UTC
To: Ming Lei; +Cc: Jens Axboe, linux-block, linux-kernel
On Fri, Oct 31, 2025 at 4:04 PM Ming Lei <ming.lei@redhat.com> wrote:
>
> On Fri, Oct 31, 2025 at 09:02:48AM -0700, Caleb Sander Mateos wrote:
> > On Thu, Oct 30, 2025 at 8:45 PM Ming Lei <ming.lei@redhat.com> wrote:
> > >
> > > On Thu, Oct 30, 2025 at 07:05:21PM -0600, Caleb Sander Mateos wrote:
> > > > ublk_copy_user_pages()/ublk_copy_io_pages() currently uses
> > > > iov_iter_get_pages2() to extract the pages from the iov_iter and
> > > > memcpy()s between the bvec_iter and the iov_iter's pages one at a time.
> > > > Switch to using copy_to_iter()/copy_from_iter() instead. This avoids the
> > > > user page reference count increments and decrements and needing to split
> > > > the memcpy() at user page boundaries. It also simplifies the code
> > > > considerably.
> > > >
> > > > Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
> > > > ---
> > > > drivers/block/ublk_drv.c | 62 +++++++++-------------------------------
> > > > 1 file changed, 14 insertions(+), 48 deletions(-)
> > > >
> > > > diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> > > > index 0c74a41a6753..852350e639d6 100644
> > > > --- a/drivers/block/ublk_drv.c
> > > > +++ b/drivers/block/ublk_drv.c
> > > > @@ -912,58 +912,47 @@ static const struct block_device_operations ub_fops = {
> > > > .open = ublk_open,
> > > > .free_disk = ublk_free_disk,
> > > > .report_zones = ublk_report_zones,
> > > > };
> > > >
> > > > -#define UBLK_MAX_PIN_PAGES 32
> > > > -
> > > > struct ublk_io_iter {
> > > > - struct page *pages[UBLK_MAX_PIN_PAGES];
> > > > struct bio *bio;
> > > > struct bvec_iter iter;
> > > > };
> > >
> > > ->pages[] is actually for pinning user io pages in batch, so killing it may cause
> > > perf drop.
> >
> > As far as I can tell, copy_to_iter()/copy_from_iter() avoids the page
> > pinning entirely. It calls copy_to_user_iter() for each contiguous
> > user address range:
> >
> > size_t _copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i)
> > {
> > if (WARN_ON_ONCE(i->data_source))
> > return 0;
> > if (user_backed_iter(i))
> > might_fault();
> > return iterate_and_advance(i, bytes, (void *)addr,
> > copy_to_user_iter, memcpy_to_iter);
> > }
> >
> > Which just checks that the address range doesn't include any kernel
> > addresses and then memcpy()s directly via the userspace virtual
> > addresses:
> >
> > static __always_inline
> > size_t copy_to_user_iter(void __user *iter_to, size_t progress,
> > size_t len, void *from, void *priv2)
> > {
> > if (should_fail_usercopy())
> > return len;
> > if (access_ok(iter_to, len)) {
> > from += progress;
> > instrument_copy_to_user(iter_to, from, len);
> > len = raw_copy_to_user(iter_to, from, len);
> > }
> > return len;
> > }
> >
> > static __always_inline __must_check unsigned long
> > raw_copy_to_user(void __user *dst, const void *src, unsigned long size)
> > {
> > return copy_user_generic((__force void *)dst, src, size);
> > }
> >
> > static __always_inline __must_check unsigned long
> > copy_user_generic(void *to, const void *from, unsigned long len)
> > {
> > stac();
> > /*
> > * If CPU has FSRM feature, use 'rep movs'.
> > * Otherwise, use rep_movs_alternative.
> > */
> > asm volatile(
> > "1:\n\t"
> > ALTERNATIVE("rep movsb",
> > "call rep_movs_alternative",
> > ALT_NOT(X86_FEATURE_FSRM))
> > "2:\n"
> > _ASM_EXTABLE_UA(1b, 2b)
> > :"+c" (len), "+D" (to), "+S" (from), ASM_CALL_CONSTRAINT
> > : : "memory", "rax");
> > clac();
> > return len;
> > }
> >
> > Am I missing something?
>
> page is allocated & mapped in page fault handler.
Right, physical pages certainly need to be allocated for the virtual
address range being copied to/from. But that would have happened
previously in iov_iter_get_pages2(), so this isn't a new cost. And as
you point out, in the common case that the virtual pages are already
mapped to physical pages, the copy won't cause any page faults.
>
> However, in typical cases, pages in io buffer shouldn't be swapped out
> frequently, so this cleanup may be good, I will run some perf test.
Thanks for testing.
>
> Also copy_page_from_iter()/copy_page_to_iter() can be used for avoiding
> bvec_kmap_local(), and the two helper can handle one whole bvec instead
> of single page.
Yes, that's a good idea. Thanks, I didn't know about that.
>
> Then rq_for_each_bvec() can be used directly, and `ublk_io_iter` may be
> killed.
Hmm, we still need a way to offset into the request (i.e. what
ublk_advance_io_iter() does currently). Are you thinking of a single
rq_for_each_bvec() loop that would skip bvecs until the offset is
reached and then copy until reaching the end of the user iterator?
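Something like this rough, untested sketch, I suppose (relying on
copy_page_{to,from}_iter() to accept a multi-page bvec):

static size_t ublk_copy_user_pages(const struct request *req,
		unsigned offset, struct iov_iter *uiter, int dir)
{
	struct req_iterator rq_iter;
	struct bio_vec bv;
	size_t done = 0;

	rq_for_each_bvec(bv, req, rq_iter) {
		size_t len, copied;

		/* skip bvecs that lie entirely before the start offset */
		if (offset >= bv.bv_len) {
			offset -= bv.bv_len;
			continue;
		}

		len = bv.bv_len - offset;
		if (dir == ITER_DEST)
			copied = copy_page_to_iter(bv.bv_page,
					bv.bv_offset + offset, len, uiter);
		else
			copied = copy_page_from_iter(bv.bv_page,
					bv.bv_offset + offset, len, uiter);
		done += copied;
		/* a short copy means the user iterator is exhausted */
		if (copied < len)
			break;
		offset = 0;
	}
	return done;
}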
Best,
Caleb
* Re: [PATCH] ublk: use copy_{to,from}_iter() for user copy
From: Ming Lei @ 2025-11-05 1:48 UTC
To: Caleb Sander Mateos; +Cc: Jens Axboe, linux-block, linux-kernel
On Mon, Nov 03, 2025 at 08:40:30AM -0800, Caleb Sander Mateos wrote:
> On Fri, Oct 31, 2025 at 4:04 PM Ming Lei <ming.lei@redhat.com> wrote:
> >
> > On Fri, Oct 31, 2025 at 09:02:48AM -0700, Caleb Sander Mateos wrote:
> > > On Thu, Oct 30, 2025 at 8:45 PM Ming Lei <ming.lei@redhat.com> wrote:
> > > >
> > > > On Thu, Oct 30, 2025 at 07:05:21PM -0600, Caleb Sander Mateos wrote:
> > > > > ublk_copy_user_pages()/ublk_copy_io_pages() currently uses
> > > > > iov_iter_get_pages2() to extract the pages from the iov_iter and
> > > > > memcpy()s between the bvec_iter and the iov_iter's pages one at a time.
> > > > > Switch to using copy_to_iter()/copy_from_iter() instead. This avoids the
> > > > > user page reference count increments and decrements and needing to split
> > > > > the memcpy() at user page boundaries. It also simplifies the code
> > > > > considerably.
> > > > >
> > > > > Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
> > > > > ---
> > > > > drivers/block/ublk_drv.c | 62 +++++++++-------------------------------
> > > > > 1 file changed, 14 insertions(+), 48 deletions(-)
> > > > >
> > > > > diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> > > > > index 0c74a41a6753..852350e639d6 100644
> > > > > --- a/drivers/block/ublk_drv.c
> > > > > +++ b/drivers/block/ublk_drv.c
> > > > > @@ -912,58 +912,47 @@ static const struct block_device_operations ub_fops = {
> > > > > .open = ublk_open,
> > > > > .free_disk = ublk_free_disk,
> > > > > .report_zones = ublk_report_zones,
> > > > > };
> > > > >
> > > > > -#define UBLK_MAX_PIN_PAGES 32
> > > > > -
> > > > > struct ublk_io_iter {
> > > > > - struct page *pages[UBLK_MAX_PIN_PAGES];
> > > > > struct bio *bio;
> > > > > struct bvec_iter iter;
> > > > > };
> > > >
> > > > ->pages[] is actually for pinning user io pages in batch, so killing it may cause
> > > > perf drop.
> > >
> > > As far as I can tell, copy_to_iter()/copy_from_iter() avoids the page
> > > pinning entirely. It calls copy_to_user_iter() for each contiguous
> > > user address range:
> > >
> > > size_t _copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i)
> > > {
> > > if (WARN_ON_ONCE(i->data_source))
> > > return 0;
> > > if (user_backed_iter(i))
> > > might_fault();
> > > return iterate_and_advance(i, bytes, (void *)addr,
> > > copy_to_user_iter, memcpy_to_iter);
> > > }
> > >
> > > Which just checks that the address range doesn't include any kernel
> > > addresses and then memcpy()s directly via the userspace virtual
> > > addresses:
> > >
> > > static __always_inline
> > > size_t copy_to_user_iter(void __user *iter_to, size_t progress,
> > > size_t len, void *from, void *priv2)
> > > {
> > > if (should_fail_usercopy())
> > > return len;
> > > if (access_ok(iter_to, len)) {
> > > from += progress;
> > > instrument_copy_to_user(iter_to, from, len);
> > > len = raw_copy_to_user(iter_to, from, len);
> > > }
> > > return len;
> > > }
> > >
> > > static __always_inline __must_check unsigned long
> > > raw_copy_to_user(void __user *dst, const void *src, unsigned long size)
> > > {
> > > return copy_user_generic((__force void *)dst, src, size);
> > > }
> > >
> > > static __always_inline __must_check unsigned long
> > > copy_user_generic(void *to, const void *from, unsigned long len)
> > > {
> > > stac();
> > > /*
> > > * If CPU has FSRM feature, use 'rep movs'.
> > > * Otherwise, use rep_movs_alternative.
> > > */
> > > asm volatile(
> > > "1:\n\t"
> > > ALTERNATIVE("rep movsb",
> > > "call rep_movs_alternative",
> > > ALT_NOT(X86_FEATURE_FSRM))
> > > "2:\n"
> > > _ASM_EXTABLE_UA(1b, 2b)
> > > :"+c" (len), "+D" (to), "+S" (from), ASM_CALL_CONSTRAINT
> > > : : "memory", "rax");
> > > clac();
> > > return len;
> > > }
> > >
> > > Am I missing something?
> >
> > page is allocated & mapped in page fault handler.
>
> Right, physical pages certainly need to be allocated for the virtual
> address range being copied to/from. But that would have happened
> previously in iov_iter_get_pages2(), so this isn't a new cost. And as
> you point out, in the common case that the virtual pages are already
> mapped to physical pages, the copy won't cause any page faults.
>
> >
> > However, in typical cases, pages in io buffer shouldn't be swapped out
> > frequently, so this cleanup may be good, I will run some perf test.
>
> Thanks for testing.
`fio/t/io_uring` shows 40% improvement on `./kublk -t null -q 2` with this
patch in my test VM, so looks very nice improvement.
Also it works well by forcing to pass IOSQE_ASYNC on the ublk uring_cmd,
and this change is correct because the copy is guaranteed to be done in ublk
daemon context.
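For reference, forcing that path is just a one-line tweak when the daemon
preps the uring_cmd SQE (hypothetical liburing snippet):

	struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

	/* prep the ublk uring_cmd as usual, then force io-wq offload: */
	sqe->flags |= IOSQE_ASYNC;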
>
> >
> > Also copy_page_from_iter()/copy_page_to_iter() can be used for avoiding
> > bvec_kmap_local(), and the two helper can handle one whole bvec instead
> > of single page.
>
> Yes, that's a good idea. Thanks, I didn't know about that.
>
> >
> > Then rq_for_each_bvec() can be used directly, and `ublk_io_iter` may be
> > killed.
>
> Hmm, we still need a way to offset into the request (i.e. what
> ublk_advance_io_iter() does currently). Are you thinking of a single
> rq_for_each_bvec() loop that would skip bvecs until the offset is
> reached and then copy until reaching the end of the user iterator?
Yeah, that is basically what ublk_advance_io_iter() does.
Thanks,
Ming
* Re: [PATCH] ublk: use copy_{to,from}_iter() for user copy
From: Jens Axboe @ 2025-11-05 15:26 UTC
To: Ming Lei, Caleb Sander Mateos; +Cc: linux-block, linux-kernel
On 11/4/25 6:48 PM, Ming Lei wrote:
> On Mon, Nov 03, 2025 at 08:40:30AM -0800, Caleb Sander Mateos wrote:
>> On Fri, Oct 31, 2025 at 4:04 PM Ming Lei <ming.lei@redhat.com> wrote:
>>>
>>> On Fri, Oct 31, 2025 at 09:02:48AM -0700, Caleb Sander Mateos wrote:
>>>> On Thu, Oct 30, 2025 at 8:45 PM Ming Lei <ming.lei@redhat.com> wrote:
>>>>>
>>>>> On Thu, Oct 30, 2025 at 07:05:21PM -0600, Caleb Sander Mateos wrote:
>>>>>> ublk_copy_user_pages()/ublk_copy_io_pages() currently uses
>>>>>> iov_iter_get_pages2() to extract the pages from the iov_iter and
>>>>>> memcpy()s between the bvec_iter and the iov_iter's pages one at a time.
>>>>>> Switch to using copy_to_iter()/copy_from_iter() instead. This avoids the
>>>>>> user page reference count increments and decrements and needing to split
>>>>>> the memcpy() at user page boundaries. It also simplifies the code
>>>>>> considerably.
>>>>>>
>>>>>> Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
>>>>>> ---
>>>>>> drivers/block/ublk_drv.c | 62 +++++++++-------------------------------
>>>>>> 1 file changed, 14 insertions(+), 48 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
>>>>>> index 0c74a41a6753..852350e639d6 100644
>>>>>> --- a/drivers/block/ublk_drv.c
>>>>>> +++ b/drivers/block/ublk_drv.c
>>>>>> @@ -912,58 +912,47 @@ static const struct block_device_operations ub_fops = {
>>>>>> .open = ublk_open,
>>>>>> .free_disk = ublk_free_disk,
>>>>>> .report_zones = ublk_report_zones,
>>>>>> };
>>>>>>
>>>>>> -#define UBLK_MAX_PIN_PAGES 32
>>>>>> -
>>>>>> struct ublk_io_iter {
>>>>>> - struct page *pages[UBLK_MAX_PIN_PAGES];
>>>>>> struct bio *bio;
>>>>>> struct bvec_iter iter;
>>>>>> };
>>>>>
>>>>> ->pages[] is actually for pinning user io pages in batch, so killing it may cause
>>>>> perf drop.
>>>>
>>>> As far as I can tell, copy_to_iter()/copy_from_iter() avoids the page
>>>> pinning entirely. It calls copy_to_user_iter() for each contiguous
>>>> user address range:
>>>>
>>>> size_t _copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i)
>>>> {
>>>> if (WARN_ON_ONCE(i->data_source))
>>>> return 0;
>>>> if (user_backed_iter(i))
>>>> might_fault();
>>>> return iterate_and_advance(i, bytes, (void *)addr,
>>>> copy_to_user_iter, memcpy_to_iter);
>>>> }
>>>>
>>>> Which just checks that the address range doesn't include any kernel
>>>> addresses and then memcpy()s directly via the userspace virtual
>>>> addresses:
>>>>
>>>> static __always_inline
>>>> size_t copy_to_user_iter(void __user *iter_to, size_t progress,
>>>> size_t len, void *from, void *priv2)
>>>> {
>>>> if (should_fail_usercopy())
>>>> return len;
>>>> if (access_ok(iter_to, len)) {
>>>> from += progress;
>>>> instrument_copy_to_user(iter_to, from, len);
>>>> len = raw_copy_to_user(iter_to, from, len);
>>>> }
>>>> return len;
>>>> }
>>>>
>>>> static __always_inline __must_check unsigned long
>>>> raw_copy_to_user(void __user *dst, const void *src, unsigned long size)
>>>> {
>>>> return copy_user_generic((__force void *)dst, src, size);
>>>> }
>>>>
>>>> static __always_inline __must_check unsigned long
>>>> copy_user_generic(void *to, const void *from, unsigned long len)
>>>> {
>>>> stac();
>>>> /*
>>>> * If CPU has FSRM feature, use 'rep movs'.
>>>> * Otherwise, use rep_movs_alternative.
>>>> */
>>>> asm volatile(
>>>> "1:\n\t"
>>>> ALTERNATIVE("rep movsb",
>>>> "call rep_movs_alternative",
>>>> ALT_NOT(X86_FEATURE_FSRM))
>>>> "2:\n"
>>>> _ASM_EXTABLE_UA(1b, 2b)
>>>> :"+c" (len), "+D" (to), "+S" (from), ASM_CALL_CONSTRAINT
>>>> : : "memory", "rax");
>>>> clac();
>>>> return len;
>>>> }
>>>>
>>>> Am I missing something?
>>>
>>> page is allocated & mapped in page fault handler.
>>
>> Right, physical pages certainly need to be allocated for the virtual
>> address range being copied to/from. But that would have happened
>> previously in iov_iter_get_pages2(), so this isn't a new cost. And as
>> you point out, in the common case that the virtual pages are already
>> mapped to physical pages, the copy won't cause any page faults.
>>
>>>
>>> However, in typical cases, pages in io buffer shouldn't be swapped out
>>> frequently, so this cleanup may be good, I will run some perf test.
>>
>> Thanks for testing.
>
> `fio/t/io_uring` shows 40% improvement on `./kublk -t null -q 2` with this
> patch in my test VM, so looks very nice improvement.
>
> Also it works well by forcing to pass IOSQE_ASYNC on the ublk uring_cmd,
> and this change is correct because the copy is guaranteed to be done in ublk
> daemon context.
We good to queue this up then?
--
Jens Axboe
* Re: [PATCH] ublk: use copy_{to,from}_iter() for user copy
From: Caleb Sander Mateos @ 2025-11-05 15:37 UTC
To: Jens Axboe; +Cc: Ming Lei, linux-block, linux-kernel
On Wed, Nov 5, 2025 at 7:26 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 11/4/25 6:48 PM, Ming Lei wrote:
> > On Mon, Nov 03, 2025 at 08:40:30AM -0800, Caleb Sander Mateos wrote:
> >> On Fri, Oct 31, 2025 at 4:04 PM Ming Lei <ming.lei@redhat.com> wrote:
> >>>
> >>> On Fri, Oct 31, 2025 at 09:02:48AM -0700, Caleb Sander Mateos wrote:
> >>>> On Thu, Oct 30, 2025 at 8:45 PM Ming Lei <ming.lei@redhat.com> wrote:
> >>>>>
> >>>>> On Thu, Oct 30, 2025 at 07:05:21PM -0600, Caleb Sander Mateos wrote:
> >>>>>> ublk_copy_user_pages()/ublk_copy_io_pages() currently uses
> >>>>>> iov_iter_get_pages2() to extract the pages from the iov_iter and
> >>>>>> memcpy()s between the bvec_iter and the iov_iter's pages one at a time.
> >>>>>> Switch to using copy_to_iter()/copy_from_iter() instead. This avoids the
> >>>>>> user page reference count increments and decrements and needing to split
> >>>>>> the memcpy() at user page boundaries. It also simplifies the code
> >>>>>> considerably.
> >>>>>>
> >>>>>> Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
> >>>>>> ---
> >>>>>> drivers/block/ublk_drv.c | 62 +++++++++-------------------------------
> >>>>>> 1 file changed, 14 insertions(+), 48 deletions(-)
> >>>>>>
> >>>>>> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> >>>>>> index 0c74a41a6753..852350e639d6 100644
> >>>>>> --- a/drivers/block/ublk_drv.c
> >>>>>> +++ b/drivers/block/ublk_drv.c
> >>>>>> @@ -912,58 +912,47 @@ static const struct block_device_operations ub_fops = {
> >>>>>> .open = ublk_open,
> >>>>>> .free_disk = ublk_free_disk,
> >>>>>> .report_zones = ublk_report_zones,
> >>>>>> };
> >>>>>>
> >>>>>> -#define UBLK_MAX_PIN_PAGES 32
> >>>>>> -
> >>>>>> struct ublk_io_iter {
> >>>>>> - struct page *pages[UBLK_MAX_PIN_PAGES];
> >>>>>> struct bio *bio;
> >>>>>> struct bvec_iter iter;
> >>>>>> };
> >>>>>
> >>>>> ->pages[] is actually for pinning user io pages in batch, so killing it may cause
> >>>>> perf drop.
> >>>>
> >>>> As far as I can tell, copy_to_iter()/copy_from_iter() avoids the page
> >>>> pinning entirely. It calls copy_to_user_iter() for each contiguous
> >>>> user address range:
> >>>>
> >>>> size_t _copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i)
> >>>> {
> >>>> if (WARN_ON_ONCE(i->data_source))
> >>>> return 0;
> >>>> if (user_backed_iter(i))
> >>>> might_fault();
> >>>> return iterate_and_advance(i, bytes, (void *)addr,
> >>>> copy_to_user_iter, memcpy_to_iter);
> >>>> }
> >>>>
> >>>> Which just checks that the address range doesn't include any kernel
> >>>> addresses and then memcpy()s directly via the userspace virtual
> >>>> addresses:
> >>>>
> >>>> static __always_inline
> >>>> size_t copy_to_user_iter(void __user *iter_to, size_t progress,
> >>>> size_t len, void *from, void *priv2)
> >>>> {
> >>>> if (should_fail_usercopy())
> >>>> return len;
> >>>> if (access_ok(iter_to, len)) {
> >>>> from += progress;
> >>>> instrument_copy_to_user(iter_to, from, len);
> >>>> len = raw_copy_to_user(iter_to, from, len);
> >>>> }
> >>>> return len;
> >>>> }
> >>>>
> >>>> static __always_inline __must_check unsigned long
> >>>> raw_copy_to_user(void __user *dst, const void *src, unsigned long size)
> >>>> {
> >>>> return copy_user_generic((__force void *)dst, src, size);
> >>>> }
> >>>>
> >>>> static __always_inline __must_check unsigned long
> >>>> copy_user_generic(void *to, const void *from, unsigned long len)
> >>>> {
> >>>> stac();
> >>>> /*
> >>>> * If CPU has FSRM feature, use 'rep movs'.
> >>>> * Otherwise, use rep_movs_alternative.
> >>>> */
> >>>> asm volatile(
> >>>> "1:\n\t"
> >>>> ALTERNATIVE("rep movsb",
> >>>> "call rep_movs_alternative",
> >>>> ALT_NOT(X86_FEATURE_FSRM))
> >>>> "2:\n"
> >>>> _ASM_EXTABLE_UA(1b, 2b)
> >>>> :"+c" (len), "+D" (to), "+S" (from), ASM_CALL_CONSTRAINT
> >>>> : : "memory", "rax");
> >>>> clac();
> >>>> return len;
> >>>> }
> >>>>
> >>>> Am I missing something?
> >>>
> >>> page is allocated & mapped in page fault handler.
> >>
> >> Right, physical pages certainly need to be allocated for the virtual
> >> address range being copied to/from. But that would have happened
> >> previously in iov_iter_get_pages2(), so this isn't a new cost. And as
> >> you point out, in the common case that the virtual pages are already
> >> mapped to physical pages, the copy won't cause any page faults.
> >>
> >>>
> >>> However, in typical cases, pages in io buffer shouldn't be swapped out
> >>> frequently, so this cleanup may be good, I will run some perf test.
> >>
> >> Thanks for testing.
> >
> > `fio/t/io_uring` shows 40% improvement on `./kublk -t null -q 2` with this
> > patch in my test VM, so looks very nice improvement.
> >
> > Also it works well by forcing to pass IOSQE_ASYNC on the ublk uring_cmd,
> > and this change is correct because the copy is guaranteed to be done in ublk
> > daemon context.
>
> We good to queue this up then?
Let me write a v2 implementing Ming's suggestions to use
copy_page_{to,from}_iter() and get rid of the open-coded bvec
iteration.
Thanks,
Caleb
* Re: [PATCH] ublk: use copy_{to,from}_iter() for user copy
From: Jens Axboe @ 2025-11-05 15:41 UTC
To: Caleb Sander Mateos; +Cc: Ming Lei, linux-block, linux-kernel
On 11/5/25 8:37 AM, Caleb Sander Mateos wrote:
> On Wed, Nov 5, 2025 at 7:26 AM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 11/4/25 6:48 PM, Ming Lei wrote:
>>> On Mon, Nov 03, 2025 at 08:40:30AM -0800, Caleb Sander Mateos wrote:
>>>> On Fri, Oct 31, 2025 at 4:04 PM Ming Lei <ming.lei@redhat.com> wrote:
>>>>>
>>>>> On Fri, Oct 31, 2025 at 09:02:48AM -0700, Caleb Sander Mateos wrote:
>>>>>> On Thu, Oct 30, 2025 at 8:45 PM Ming Lei <ming.lei@redhat.com> wrote:
>>>>>>>
>>>>>>> On Thu, Oct 30, 2025 at 07:05:21PM -0600, Caleb Sander Mateos wrote:
>>>>>>>> ublk_copy_user_pages()/ublk_copy_io_pages() currently uses
>>>>>>>> iov_iter_get_pages2() to extract the pages from the iov_iter and
>>>>>>>> memcpy()s between the bvec_iter and the iov_iter's pages one at a time.
>>>>>>>> Switch to using copy_to_iter()/copy_from_iter() instead. This avoids the
>>>>>>>> user page reference count increments and decrements and needing to split
>>>>>>>> the memcpy() at user page boundaries. It also simplifies the code
>>>>>>>> considerably.
>>>>>>>>
>>>>>>>> Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
>>>>>>>> ---
>>>>>>>> drivers/block/ublk_drv.c | 62 +++++++++-------------------------------
>>>>>>>> 1 file changed, 14 insertions(+), 48 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
>>>>>>>> index 0c74a41a6753..852350e639d6 100644
>>>>>>>> --- a/drivers/block/ublk_drv.c
>>>>>>>> +++ b/drivers/block/ublk_drv.c
>>>>>>>> @@ -912,58 +912,47 @@ static const struct block_device_operations ub_fops = {
>>>>>>>> .open = ublk_open,
>>>>>>>> .free_disk = ublk_free_disk,
>>>>>>>> .report_zones = ublk_report_zones,
>>>>>>>> };
>>>>>>>>
>>>>>>>> -#define UBLK_MAX_PIN_PAGES 32
>>>>>>>> -
>>>>>>>> struct ublk_io_iter {
>>>>>>>> - struct page *pages[UBLK_MAX_PIN_PAGES];
>>>>>>>> struct bio *bio;
>>>>>>>> struct bvec_iter iter;
>>>>>>>> };
>>>>>>>
>>>>>>> ->pages[] is actually for pinning user io pages in batch, so killing it may cause
>>>>>>> perf drop.
>>>>>>
>>>>>> As far as I can tell, copy_to_iter()/copy_from_iter() avoids the page
>>>>>> pinning entirely. It calls copy_to_user_iter() for each contiguous
>>>>>> user address range:
>>>>>>
>>>>>> size_t _copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i)
>>>>>> {
>>>>>> if (WARN_ON_ONCE(i->data_source))
>>>>>> return 0;
>>>>>> if (user_backed_iter(i))
>>>>>> might_fault();
>>>>>> return iterate_and_advance(i, bytes, (void *)addr,
>>>>>> copy_to_user_iter, memcpy_to_iter);
>>>>>> }
>>>>>>
>>>>>> Which just checks that the address range doesn't include any kernel
>>>>>> addresses and then memcpy()s directly via the userspace virtual
>>>>>> addresses:
>>>>>>
>>>>>> static __always_inline
>>>>>> size_t copy_to_user_iter(void __user *iter_to, size_t progress,
>>>>>> size_t len, void *from, void *priv2)
>>>>>> {
>>>>>> if (should_fail_usercopy())
>>>>>> return len;
>>>>>> if (access_ok(iter_to, len)) {
>>>>>> from += progress;
>>>>>> instrument_copy_to_user(iter_to, from, len);
>>>>>> len = raw_copy_to_user(iter_to, from, len);
>>>>>> }
>>>>>> return len;
>>>>>> }
>>>>>>
>>>>>> static __always_inline __must_check unsigned long
>>>>>> raw_copy_to_user(void __user *dst, const void *src, unsigned long size)
>>>>>> {
>>>>>> return copy_user_generic((__force void *)dst, src, size);
>>>>>> }
>>>>>>
>>>>>> static __always_inline __must_check unsigned long
>>>>>> copy_user_generic(void *to, const void *from, unsigned long len)
>>>>>> {
>>>>>> stac();
>>>>>> /*
>>>>>> * If CPU has FSRM feature, use 'rep movs'.
>>>>>> * Otherwise, use rep_movs_alternative.
>>>>>> */
>>>>>> asm volatile(
>>>>>> "1:\n\t"
>>>>>> ALTERNATIVE("rep movsb",
>>>>>> "call rep_movs_alternative",
>>>>>> ALT_NOT(X86_FEATURE_FSRM))
>>>>>> "2:\n"
>>>>>> _ASM_EXTABLE_UA(1b, 2b)
>>>>>> :"+c" (len), "+D" (to), "+S" (from), ASM_CALL_CONSTRAINT
>>>>>> : : "memory", "rax");
>>>>>> clac();
>>>>>> return len;
>>>>>> }
>>>>>>
>>>>>> Am I missing something?
>>>>>
>>>>> page is allocated & mapped in page fault handler.
>>>>
>>>> Right, physical pages certainly need to be allocated for the virtual
>>>> address range being copied to/from. But that would have happened
>>>> previously in iov_iter_get_pages2(), so this isn't a new cost. And as
>>>> you point out, in the common case that the virtual pages are already
>>>> mapped to physical pages, the copy won't cause any page faults.
>>>>
>>>>>
>>>>> However, in typical cases, pages in io buffer shouldn't be swapped out
>>>>> frequently, so this cleanup may be good, I will run some perf test.
>>>>
>>>> Thanks for testing.
>>>
>>> `fio/t/io_uring` shows 40% improvement on `./kublk -t null -q 2` with this
>>> patch in my test VM, so looks very nice improvement.
>>>
>>> Also it works well by forcing to pass IOSQE_ASYNC on the ublk uring_cmd,
>>> and this change is correct because the copy is guaranteed to be done in ublk
>>> daemon context.
>>
>> We good to queue this up then?
>
> Let me write a v2 implementing Ming's suggestions to use
> copy_page_{to,from}_iter() and get rid of the open-coded bvec
> iteration.
Sounds good.
--
Jens Axboe
* Re: [PATCH] ublk: use copy_{to,from}_iter() for user copy
From: Caleb Sander Mateos @ 2025-11-05 16:16 UTC
To: Ming Lei; +Cc: Jens Axboe, linux-block, linux-kernel
On Mon, Nov 3, 2025 at 8:40 AM Caleb Sander Mateos
<csander@purestorage.com> wrote:
>
> On Fri, Oct 31, 2025 at 4:04 PM Ming Lei <ming.lei@redhat.com> wrote:
> >
> > On Fri, Oct 31, 2025 at 09:02:48AM -0700, Caleb Sander Mateos wrote:
> > > On Thu, Oct 30, 2025 at 8:45 PM Ming Lei <ming.lei@redhat.com> wrote:
> > > >
> > > > On Thu, Oct 30, 2025 at 07:05:21PM -0600, Caleb Sander Mateos wrote:
> > > > > ublk_copy_user_pages()/ublk_copy_io_pages() currently uses
> > > > > iov_iter_get_pages2() to extract the pages from the iov_iter and
> > > > > memcpy()s between the bvec_iter and the iov_iter's pages one at a time.
> > > > > Switch to using copy_to_iter()/copy_from_iter() instead. This avoids the
> > > > > user page reference count increments and decrements and needing to split
> > > > > the memcpy() at user page boundaries. It also simplifies the code
> > > > > considerably.
> > > > >
> > > > > Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
> > > > > ---
> > > > > drivers/block/ublk_drv.c | 62 +++++++++-------------------------------
> > > > > 1 file changed, 14 insertions(+), 48 deletions(-)
> > > > >
> > > > > diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> > > > > index 0c74a41a6753..852350e639d6 100644
> > > > > --- a/drivers/block/ublk_drv.c
> > > > > +++ b/drivers/block/ublk_drv.c
> > > > > @@ -912,58 +912,47 @@ static const struct block_device_operations ub_fops = {
> > > > > .open = ublk_open,
> > > > > .free_disk = ublk_free_disk,
> > > > > .report_zones = ublk_report_zones,
> > > > > };
> > > > >
> > > > > -#define UBLK_MAX_PIN_PAGES 32
> > > > > -
> > > > > struct ublk_io_iter {
> > > > > - struct page *pages[UBLK_MAX_PIN_PAGES];
> > > > > struct bio *bio;
> > > > > struct bvec_iter iter;
> > > > > };
> > > >
> > > > ->pages[] is actually for pinning user io pages in batch, so killing it may cause
> > > > perf drop.
> > >
> > > As far as I can tell, copy_to_iter()/copy_from_iter() avoids the page
> > > pinning entirely. It calls copy_to_user_iter() for each contiguous
> > > user address range:
> > >
> > > size_t _copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i)
> > > {
> > > if (WARN_ON_ONCE(i->data_source))
> > > return 0;
> > > if (user_backed_iter(i))
> > > might_fault();
> > > return iterate_and_advance(i, bytes, (void *)addr,
> > > copy_to_user_iter, memcpy_to_iter);
> > > }
> > >
> > > Which just checks that the address range doesn't include any kernel
> > > addresses and then memcpy()s directly via the userspace virtual
> > > addresses:
> > >
> > > static __always_inline
> > > size_t copy_to_user_iter(void __user *iter_to, size_t progress,
> > > size_t len, void *from, void *priv2)
> > > {
> > > if (should_fail_usercopy())
> > > return len;
> > > if (access_ok(iter_to, len)) {
> > > from += progress;
> > > instrument_copy_to_user(iter_to, from, len);
> > > len = raw_copy_to_user(iter_to, from, len);
> > > }
> > > return len;
> > > }
> > >
> > > static __always_inline __must_check unsigned long
> > > raw_copy_to_user(void __user *dst, const void *src, unsigned long size)
> > > {
> > > return copy_user_generic((__force void *)dst, src, size);
> > > }
> > >
> > > static __always_inline __must_check unsigned long
> > > copy_user_generic(void *to, const void *from, unsigned long len)
> > > {
> > > stac();
> > > /*
> > > * If CPU has FSRM feature, use 'rep movs'.
> > > * Otherwise, use rep_movs_alternative.
> > > */
> > > asm volatile(
> > > "1:\n\t"
> > > ALTERNATIVE("rep movsb",
> > > "call rep_movs_alternative",
> > > ALT_NOT(X86_FEATURE_FSRM))
> > > "2:\n"
> > > _ASM_EXTABLE_UA(1b, 2b)
> > > :"+c" (len), "+D" (to), "+S" (from), ASM_CALL_CONSTRAINT
> > > : : "memory", "rax");
> > > clac();
> > > return len;
> > > }
> > >
> > > Am I missing something?
> >
> > page is allocated & mapped in page fault handler.
>
> Right, physical pages certainly need to be allocated for the virtual
> address range being copied to/from. But that would have happened
> previously in iov_iter_get_pages2(), so this isn't a new cost. And as
> you point out, in the common case that the virtual pages are already
> mapped to physical pages, the copy won't cause any page faults.
>
> >
> > However, in typical cases, pages in io buffer shouldn't be swapped out
> > frequently, so this cleanup may be good, I will run some perf test.
>
> Thanks for testing.
>
> >
> > Also copy_page_from_iter()/copy_page_to_iter() can be used for avoiding
> > bvec_kmap_local(), and the two helper can handle one whole bvec instead
> > of single page.
>
> Yes, that's a good idea. Thanks, I didn't know about that.
Looking into this further, copy_page_{to,from}_iter() doesn't seem
like an improvement. It still uses kmap_local_page() +
copy_{to,from}_iter() + kunmap_local() internally, and it splits the
copy at page boundaries instead of copying the whole bvec at once. I
don't think it's worth switching just to hide the bvec_kmap_local() +
kunmap_local() calls.
Best,
Caleb
* Re: [PATCH] ublk: use copy_{to,from}_iter() for user copy
From: Caleb Sander Mateos @ 2025-11-05 17:10 UTC
To: Ming Lei; +Cc: Jens Axboe, linux-block, linux-kernel
On Wed, Nov 5, 2025 at 8:16 AM Caleb Sander Mateos
<csander@purestorage.com> wrote:
>
> On Mon, Nov 3, 2025 at 8:40 AM Caleb Sander Mateos
> <csander@purestorage.com> wrote:
> >
> > On Fri, Oct 31, 2025 at 4:04 PM Ming Lei <ming.lei@redhat.com> wrote:
> > >
> > > On Fri, Oct 31, 2025 at 09:02:48AM -0700, Caleb Sander Mateos wrote:
> > > > On Thu, Oct 30, 2025 at 8:45 PM Ming Lei <ming.lei@redhat.com> wrote:
> > > > >
> > > > > On Thu, Oct 30, 2025 at 07:05:21PM -0600, Caleb Sander Mateos wrote:
> > > > > > ublk_copy_user_pages()/ublk_copy_io_pages() currently uses
> > > > > > iov_iter_get_pages2() to extract the pages from the iov_iter and
> > > > > > memcpy()s between the bvec_iter and the iov_iter's pages one at a time.
> > > > > > Switch to using copy_to_iter()/copy_from_iter() instead. This avoids the
> > > > > > user page reference count increments and decrements and needing to split
> > > > > > the memcpy() at user page boundaries. It also simplifies the code
> > > > > > considerably.
> > > > > >
> > > > > > Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
> > > > > > ---
> > > > > > drivers/block/ublk_drv.c | 62 +++++++++-------------------------------
> > > > > > 1 file changed, 14 insertions(+), 48 deletions(-)
> > > > > >
> > > > > > diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> > > > > > index 0c74a41a6753..852350e639d6 100644
> > > > > > --- a/drivers/block/ublk_drv.c
> > > > > > +++ b/drivers/block/ublk_drv.c
> > > > > > @@ -912,58 +912,47 @@ static const struct block_device_operations ub_fops = {
> > > > > > .open = ublk_open,
> > > > > > .free_disk = ublk_free_disk,
> > > > > > .report_zones = ublk_report_zones,
> > > > > > };
> > > > > >
> > > > > > -#define UBLK_MAX_PIN_PAGES 32
> > > > > > -
> > > > > > struct ublk_io_iter {
> > > > > > - struct page *pages[UBLK_MAX_PIN_PAGES];
> > > > > > struct bio *bio;
> > > > > > struct bvec_iter iter;
> > > > > > };
> > > > >
> > > > > ->pages[] is actually for pinning user io pages in batch, so killing it may cause
> > > > > perf drop.
> > > >
> > > > As far as I can tell, copy_to_iter()/copy_from_iter() avoids the page
> > > > pinning entirely. It calls copy_to_user_iter() for each contiguous
> > > > user address range:
> > > >
> > > > size_t _copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i)
> > > > {
> > > > if (WARN_ON_ONCE(i->data_source))
> > > > return 0;
> > > > if (user_backed_iter(i))
> > > > might_fault();
> > > > return iterate_and_advance(i, bytes, (void *)addr,
> > > > copy_to_user_iter, memcpy_to_iter);
> > > > }
> > > >
> > > > Which just checks that the address range doesn't include any kernel
> > > > addresses and then memcpy()s directly via the userspace virtual
> > > > addresses:
> > > >
> > > > static __always_inline
> > > > size_t copy_to_user_iter(void __user *iter_to, size_t progress,
> > > > size_t len, void *from, void *priv2)
> > > > {
> > > > if (should_fail_usercopy())
> > > > return len;
> > > > if (access_ok(iter_to, len)) {
> > > > from += progress;
> > > > instrument_copy_to_user(iter_to, from, len);
> > > > len = raw_copy_to_user(iter_to, from, len);
> > > > }
> > > > return len;
> > > > }
> > > >
> > > > static __always_inline __must_check unsigned long
> > > > raw_copy_to_user(void __user *dst, const void *src, unsigned long size)
> > > > {
> > > > return copy_user_generic((__force void *)dst, src, size);
> > > > }
> > > >
> > > > static __always_inline __must_check unsigned long
> > > > copy_user_generic(void *to, const void *from, unsigned long len)
> > > > {
> > > > stac();
> > > > /*
> > > > * If CPU has FSRM feature, use 'rep movs'.
> > > > * Otherwise, use rep_movs_alternative.
> > > > */
> > > > asm volatile(
> > > > "1:\n\t"
> > > > ALTERNATIVE("rep movsb",
> > > > "call rep_movs_alternative",
> > > > ALT_NOT(X86_FEATURE_FSRM))
> > > > "2:\n"
> > > > _ASM_EXTABLE_UA(1b, 2b)
> > > > :"+c" (len), "+D" (to), "+S" (from), ASM_CALL_CONSTRAINT
> > > > : : "memory", "rax");
> > > > clac();
> > > > return len;
> > > > }
> > > >
> > > > Am I missing something?
> > >
> > > page is allocated & mapped in page fault handler.
> >
> > Right, physical pages certainly need to be allocated for the virtual
> > address range being copied to/from. But that would have happened
> > previously in iov_iter_get_pages2(), so this isn't a new cost. And as
> > you point out, in the common case that the virtual pages are already
> > mapped to physical pages, the copy won't cause any page faults.
> >
> > >
> > > However, in typical cases, pages in io buffer shouldn't be swapped out
> > > frequently, so this cleanup may be good, I will run some perf test.
> >
> > Thanks for testing.
> >
> > >
> > > Also copy_page_from_iter()/copy_page_to_iter() can be used for avoiding
> > > bvec_kmap_local(), and the two helper can handle one whole bvec instead
> > > of single page.
> >
> > Yes, that's a good idea. Thanks, I didn't know about that.
>
> Looking into this further, copy_page_{to,from}_iter() doesn't seem
> like an improvement. It still uses kmap_local_page() +
> copy_{to,from}_iter() + kunmap_local() internally, and it splits the
> copy at page boundaries instead of copying the whole bvec at once. I
> don't think it's worth switching just to hide the bvec_kmap_local() +
> kunmap_local() calls.
Maybe that's because bvec_kmap_local() only maps the first page of the
bvec even if it spans multiple pages? Seems like it may be an existing
bug that ublk_copy_io_pages() can copy past the end of the page
returned from bvec_kmap_local(). But given that kmap_local_page() is a
no-op when CONFIG_HIGHMEM is disabled, it's probably not visible in
typical configurations.
Best,
Caleb
* Re: [PATCH] ublk: use copy_{to,from}_iter() for user copy
From: Caleb Sander Mateos @ 2025-11-05 17:16 UTC
To: Ming Lei; +Cc: Jens Axboe, linux-block, linux-kernel
On Wed, Nov 5, 2025 at 9:10 AM Caleb Sander Mateos
<csander@purestorage.com> wrote:
>
> On Wed, Nov 5, 2025 at 8:16 AM Caleb Sander Mateos
> <csander@purestorage.com> wrote:
> >
> > On Mon, Nov 3, 2025 at 8:40 AM Caleb Sander Mateos
> > <csander@purestorage.com> wrote:
> > >
> > > On Fri, Oct 31, 2025 at 4:04 PM Ming Lei <ming.lei@redhat.com> wrote:
> > > >
> > > > On Fri, Oct 31, 2025 at 09:02:48AM -0700, Caleb Sander Mateos wrote:
> > > > > On Thu, Oct 30, 2025 at 8:45 PM Ming Lei <ming.lei@redhat.com> wrote:
> > > > > >
> > > > > > On Thu, Oct 30, 2025 at 07:05:21PM -0600, Caleb Sander Mateos wrote:
> > > > > > > ublk_copy_user_pages()/ublk_copy_io_pages() currently uses
> > > > > > > iov_iter_get_pages2() to extract the pages from the iov_iter and
> > > > > > > memcpy()s between the bvec_iter and the iov_iter's pages one at a time.
> > > > > > > Switch to using copy_to_iter()/copy_from_iter() instead. This avoids the
> > > > > > > user page reference count increments and decrements and needing to split
> > > > > > > the memcpy() at user page boundaries. It also simplifies the code
> > > > > > > considerably.
> > > > > > >
> > > > > > > Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
> > > > > > > ---
> > > > > > > drivers/block/ublk_drv.c | 62 +++++++++-------------------------------
> > > > > > > 1 file changed, 14 insertions(+), 48 deletions(-)
> > > > > > >
> > > > > > > diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> > > > > > > index 0c74a41a6753..852350e639d6 100644
> > > > > > > --- a/drivers/block/ublk_drv.c
> > > > > > > +++ b/drivers/block/ublk_drv.c
> > > > > > > @@ -912,58 +912,47 @@ static const struct block_device_operations ub_fops = {
> > > > > > > .open = ublk_open,
> > > > > > > .free_disk = ublk_free_disk,
> > > > > > > .report_zones = ublk_report_zones,
> > > > > > > };
> > > > > > >
> > > > > > > -#define UBLK_MAX_PIN_PAGES 32
> > > > > > > -
> > > > > > > struct ublk_io_iter {
> > > > > > > - struct page *pages[UBLK_MAX_PIN_PAGES];
> > > > > > > struct bio *bio;
> > > > > > > struct bvec_iter iter;
> > > > > > > };
> > > > > >
> > > > > > ->pages[] is actually for pinning user io pages in batch, so killing it may cause
> > > > > > perf drop.
> > > > >
> > > > > As far as I can tell, copy_to_iter()/copy_from_iter() avoids the page
> > > > > pinning entirely. It calls copy_to_user_iter() for each contiguous
> > > > > user address range:
> > > > >
> > > > > size_t _copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i)
> > > > > {
> > > > > if (WARN_ON_ONCE(i->data_source))
> > > > > return 0;
> > > > > if (user_backed_iter(i))
> > > > > might_fault();
> > > > > return iterate_and_advance(i, bytes, (void *)addr,
> > > > > copy_to_user_iter, memcpy_to_iter);
> > > > > }
> > > > >
> > > > > Which just checks that the address range doesn't include any kernel
> > > > > addresses and then memcpy()s directly via the userspace virtual
> > > > > addresses:
> > > > >
> > > > > static __always_inline
> > > > > size_t copy_to_user_iter(void __user *iter_to, size_t progress,
> > > > > size_t len, void *from, void *priv2)
> > > > > {
> > > > > if (should_fail_usercopy())
> > > > > return len;
> > > > > if (access_ok(iter_to, len)) {
> > > > > from += progress;
> > > > > instrument_copy_to_user(iter_to, from, len);
> > > > > len = raw_copy_to_user(iter_to, from, len);
> > > > > }
> > > > > return len;
> > > > > }
> > > > >
> > > > > static __always_inline __must_check unsigned long
> > > > > raw_copy_to_user(void __user *dst, const void *src, unsigned long size)
> > > > > {
> > > > > return copy_user_generic((__force void *)dst, src, size);
> > > > > }
> > > > >
> > > > > static __always_inline __must_check unsigned long
> > > > > copy_user_generic(void *to, const void *from, unsigned long len)
> > > > > {
> > > > > stac();
> > > > > /*
> > > > > * If CPU has FSRM feature, use 'rep movs'.
> > > > > * Otherwise, use rep_movs_alternative.
> > > > > */
> > > > > asm volatile(
> > > > > "1:\n\t"
> > > > > ALTERNATIVE("rep movsb",
> > > > > "call rep_movs_alternative",
> > > > > ALT_NOT(X86_FEATURE_FSRM))
> > > > > "2:\n"
> > > > > _ASM_EXTABLE_UA(1b, 2b)
> > > > > :"+c" (len), "+D" (to), "+S" (from), ASM_CALL_CONSTRAINT
> > > > > : : "memory", "rax");
> > > > > clac();
> > > > > return len;
> > > > > }
> > > > >
> > > > > Am I missing something?
> > > >
> > > > page is allocated & mapped in page fault handler.
> > >
> > > Right, physical pages certainly need to be allocated for the virtual
> > > address range being copied to/from. But that would have happened
> > > previously in iov_iter_get_pages2(), so this isn't a new cost. And as
> > > you point out, in the common case that the virtual pages are already
> > > mapped to physical pages, the copy won't cause any page faults.
> > >
> > > >
> > > > However, in typical cases, pages in io buffer shouldn't be swapped out
> > > > frequently, so this cleanup may be good, I will run some perf test.
> > >
> > > Thanks for testing.
> > >
> > > >
> > > > Also copy_page_from_iter()/copy_page_to_iter() can be used for avoiding
> > > > bvec_kmap_local(), and the two helper can handle one whole bvec instead
> > > > of single page.
> > >
> > > Yes, that's a good idea. Thanks, I didn't know about that.
> >
> > Looking into this further, copy_page_{to,from}_iter() doesn't seem
> > like an improvement. It still uses kmap_local_page() +
> > copy_{to,from}_iter() + kunmap_local() internally, and it splits the
> > copy at page boundaries instead of copying the whole bvec at once. I
> > don't think it's worth switching just to hide the bvec_kmap_local() +
> > kunmap_local() calls.
>
> Maybe that's because bvec_kmap_local() only maps the first page of the
> bvec even if it spans multiple pages? Seems like it may be an existing
> bug that ublk_copy_io_pages() can copy past the end of the page
> returned from bvec_kmap_local(). But given that kmap_local_page() is a
> no-op when CONFIG_HIGHMEM is disabled, it's probably not visible in
> typical configurations.
I guess bio_iter_iovec() restricts the bvec to a single page, so
there's no existing issue. If we switch to rq_for_each_bvec(), we'll
need to use copy_page_{to,from}_iter(). Sorry for the thinking out
loud...
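For reference, the single-page capping appears to come from bvec_iter_len()
(paraphrased from include/linux/bvec.h, worth double-checking in the tree):

/* bio_iter_iovec() -> bvec_iter_bvec() caps the bvec length at the
 * page boundary, so each iteration sees a single-page bvec: */
#define bvec_iter_len(bvec, iter)				\
	min_t(unsigned, mp_bvec_iter_len((bvec), (iter)),	\
	      PAGE_SIZE - bvec_iter_offset((bvec), (iter)))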
Best,
Caleb