From: Ming Lei <ming.lei@redhat.com>
To: Ziyang Zhang <ZiyangZhang@linux.alibaba.com>
Cc: linux-block@vger.kernel.org,
Harris James R <james.r.harris@intel.com>,
linux-kernel@vger.kernel.org, io-uring@vger.kernel.org,
Gabriel Krisman Bertazi <krisman@collabora.com>,
Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>,
Stefan Hajnoczi <stefanha@redhat.com>,
Jens Axboe <axboe@kernel.dk>,
ming.lei@redhat.com
Subject: Re: [PATCH V3 1/1] ublk: add io_uring based userspace block driver
Date: Thu, 30 Jun 2022 20:33:20 +0800 [thread overview]
Message-ID: <Yr2YEIoBPOLxq6NB@T590> (raw)
In-Reply-To: <fdd06581-a8aa-5948-6043-fc7e3381eb2d@linux.alibaba.com>
On Thu, Jun 30, 2022 at 07:35:11PM +0800, Ziyang Zhang wrote:
> On 2022/6/29 00:08, Ming Lei wrote:
>
> [...]
>
> > +#define UBLK_MAX_PIN_PAGES 32
> > +
> > +static inline void ublk_release_pages(struct ublk_queue *ubq, struct page **pages,
> > + int nr_pages)
> > +{
> > + int i;
> > +
> > + for (i = 0; i < nr_pages; i++)
> > + put_page(pages[i]);
> > +}
> > +
> > +static inline int ublk_pin_user_pages(struct ublk_queue *ubq, u64 start_vm,
> > + unsigned int nr_pages, unsigned int gup_flags,
> > + struct page **pages)
> > +{
> > + return get_user_pages_fast(start_vm, nr_pages, gup_flags, pages);
> > +}
>
> > +
> > +static inline unsigned ublk_copy_bv(struct bio_vec *bv, void **bv_addr,
> > + void *pg_addr, unsigned int *pg_off,
> > + unsigned int *pg_len, bool to_bv)
> > +{
> > + unsigned len = min_t(unsigned, bv->bv_len, *pg_len);
> > +
> > + if (*bv_addr == NULL)
> > + *bv_addr = kmap_local_page(bv->bv_page);
> > +
> > + if (to_bv)
> > + memcpy(*bv_addr + bv->bv_offset, pg_addr + *pg_off, len);
> > + else
> > + memcpy(pg_addr + *pg_off, *bv_addr + bv->bv_offset, len);
> > +
> > + bv->bv_offset += len;
> > + bv->bv_len -= len;
> > + *pg_off += len;
> > + *pg_len -= len;
> > +
> > + if (!bv->bv_len) {
> > + kunmap_local(*bv_addr);
> > + *bv_addr = NULL;
> > + }
> > +
> > + return len;
> > +}
> > +
> > +/* copy rq pages to ublksrv vm address pointed by io->addr */
> > +static int ublk_copy_pages(struct ublk_queue *ubq, struct request *rq, bool to_rq,
> > + unsigned int max_bytes)
> > +{
> > + unsigned int gup_flags = to_rq ? 0 : FOLL_WRITE;
> > + struct ublk_io *io = &ubq->ios[rq->tag];
> > + struct page *pgs[UBLK_MAX_PIN_PAGES];
> > + struct req_iterator req_iter;
> > + struct bio_vec bv;
> > + const unsigned int rq_bytes = min(blk_rq_bytes(rq), max_bytes);
> > + unsigned long start = io->addr, left = rq_bytes;
> > + unsigned int idx = 0, pg_len = 0, pg_off = 0;
> > + int nr_pin = 0;
> > + void *pg_addr = NULL;
> > + struct page *curr = NULL;
> > +
> > + rq_for_each_segment(bv, rq, req_iter) {
> > + unsigned len, bv_off = bv.bv_offset, bv_len = bv.bv_len;
> > + void *bv_addr = NULL;
> > +
> > +refill:
> > + if (pg_len == 0) {
> > + unsigned int off = 0;
> > +
> > + if (pg_addr) {
> > + kunmap_local(pg_addr);
> > + if (!to_rq)
> > + set_page_dirty_lock(curr);
> > + pg_addr = NULL;
> > + }
> > +
> > + /* refill pages */
> > + if (idx >= nr_pin) {
> > + unsigned int max_pages;
> > +
> > + ublk_release_pages(ubq, pgs, nr_pin);
> > +
> > + off = start & (PAGE_SIZE - 1);
> > + max_pages = min_t(unsigned, (off + left +
> > + PAGE_SIZE - 1) >> PAGE_SHIFT,
> > + UBLK_MAX_PIN_PAGES);
> > + nr_pin = ublk_pin_user_pages(ubq, start,
> > + max_pages, gup_flags, pgs);
> > + if (nr_pin < 0)
> > + goto exit;
> > + idx = 0;
> > + }
> > + pg_off = off;
> > + pg_len = min(PAGE_SIZE - off, left);
> > + off = 0;
> > + curr = pgs[idx++];
> > + pg_addr = kmap_local_page(curr);
> > + }
> > +
> > + len = ublk_copy_bv(&bv, &bv_addr, pg_addr, &pg_off, &pg_len,
> > + to_rq);
> > + /* either one of the two has been consumed */
> > + WARN_ON_ONCE(bv.bv_len && pg_len);
> > + start += len;
> > + left -= len;
> > +
> > + /* overflow */
> > + WARN_ON_ONCE(left > rq_bytes);
> > + WARN_ON_ONCE(bv.bv_len > bv_len);
> > + if (bv.bv_len)
> > + goto refill;
> > +
> > + bv.bv_len = bv_len;
> > + bv.bv_offset = bv_off;
> > + }
> > + if (pg_addr) {
> > + kunmap_local(pg_addr);
> > + if (!to_rq)
> > + set_page_dirty_lock(curr);
> > + }
> > + ublk_release_pages(ubq, pgs, nr_pin);
> > +
> > +exit:
> > + return rq_bytes - left;
> > +}
> > +
>
> Hi Ming,
>
> I note that you pin the user buffer's pages, memcpy() and release them immediately.
>
> 1) I think maybe copy_page_from_iter() is another choice for copying user buffer to biovecs
> since copy_page_from_iter() do not pin pages(But it may raise page fault).
copy_page_from_iter/copy_page_to_iter needs the userspage page,
then copy between the userspace page and bvec_iter pages, what it does
is just kmap/copy/kunmap.
Not see it is useful here.
>
> 2) Or will you design some mechanism such as LRU to manage these pinned pages?
> For example pin those pages frequently required for a long time and release
> those pages not used for a long time.
> I remember you have talked about this LRU on pinned pages?
I'd explain it a bit.
When I worked on v1/v2, 'perf report' shows that get_user_pages_fast()
as one of top samples. Turns out it is a bug, which is fixed in
https://github.com/ming1/linux/commit/3c9fd476951759858cc548dee4cedc074194d0b0
After the issue is fixed, not see get_user_pages_fast() being hot spot
any more. I actually implemented one patch which pins all pages in
the ubd device whole lifetime, but not see obvious improvement, so I gave
up the idea.
In the test VM on my laptop, single job ubd/null randwrite can reach 700K iops.
>
> Which one do you think is better? copy_page_from_iter() or pin pages with LRU?
> Maybe it depends on the user's workload?
So far in the enablement stage, I think the current approach is just fine,
but we still can improve it in future.
Thanks,
Ming
next prev parent reply other threads:[~2022-06-30 12:33 UTC|newest]
Thread overview: 18+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-06-28 16:08 [PATCH V3 0/1] ublk: add io_uring based userspace block driver Ming Lei
2022-06-28 16:08 ` [PATCH V3 1/1] " Ming Lei
2022-06-30 11:35 ` Ziyang Zhang
2022-06-30 12:33 ` Ming Lei [this message]
2022-07-01 2:47 ` Ziyang Zhang
2022-07-01 4:06 ` Ming Lei
2022-07-04 11:17 ` Sagi Grimberg
2022-07-04 12:34 ` Ming Lei
2022-07-04 14:00 ` Sagi Grimberg
2022-07-04 16:13 ` Gabriel Krisman Bertazi
2022-07-04 16:19 ` Sagi Grimberg
2022-07-05 0:43 ` Ming Lei
2022-07-04 22:10 ` Gabriel Krisman Bertazi
2022-07-05 4:16 ` Ziyang Zhang
2022-07-05 8:12 ` Ming Lei
2022-07-05 8:06 ` Ming Lei
2022-07-07 7:49 ` Ming Lei
2022-07-07 7:58 ` Ming Lei
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Yr2YEIoBPOLxq6NB@T590 \
--to=ming.lei@redhat.com \
--cc=ZiyangZhang@linux.alibaba.com \
--cc=axboe@kernel.dk \
--cc=io-uring@vger.kernel.org \
--cc=james.r.harris@intel.com \
--cc=krisman@collabora.com \
--cc=linux-block@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=stefanha@redhat.com \
--cc=xiaoguang.wang@linux.alibaba.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.