From: Ming Lei <ming.lei@redhat.com>
To: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Cc: ZiyangZhang <ZiyangZhang@linux.alibaba.com>,
Gabriel Krisman Bertazi <krisman@collabora.com>,
joseph.qi@linux.alibaba.com, linux-block@vger.kernel.org,
Jens Axboe <axboe@kernel.dk>
Subject: Re: Follow up on UBD discussion
Date: Mon, 9 May 2022 15:36:08 +0800 [thread overview]
Message-ID: <YnjEaM5T2aO0mlyi@T590> (raw)
In-Reply-To: <55edea6e-e7dc-054a-b79b-fcfc40c22f2f@linux.alibaba.com>
On Mon, May 09, 2022 at 12:05:54PM +0800, Xiaoguang Wang wrote:
> hi,
>
> > On Sat, May 07, 2022 at 12:20:17PM +0800, ZiyangZhang wrote:
> >> On 2022/5/3 16:02, Ming Lei wrote:
> >>> Hello Gabriel,
> >>>
> >>> CC linux-block and hope you don't mind, :-)
> >>>
> >>> On Mon, May 02, 2022 at 01:41:13PM -0400, Gabriel Krisman Bertazi wrote:
> >>>> Hi Ming,
> >>>>
> >>>> First of all, I hope I didn't put you on the spot too much during the
> >>>> discussion. My original proposal was to propose my design, but your
> >>>> implementation quite solved the questions I had. :)
> >>> I think that is open source, then we can put efforts together to make things
> >>> better.
> >>>
> >>>> I'd like to follow up to restart the communication and see
> >>>> where I can help more with UBD. As I said during the talk, I've
> >>>> done some fio runs and I was able to saturate NBD much faster than UBD:
> >>>>
> >>>> https://people.collabora.com/~krisman/mingl-ubd/bw.png
> >>> Yeah, that is true since NBD has extra socket communication cost which
> >>> can't be efficient as io_uring.
> >>>
> >>>> I've also wrote some fixes to the initialization path, which I
> >>>> planned to send to you as soon as you published your code, but I think
> >>>> you might want to take a look already and see if you want to just squash
> >>>> it into your code base.
> >>>>
> >>>> I pushed those fixes here:
> >>>>
> >>>> https://gitlab.collabora.com/krisman/linux/-/tree/mingl-ubd
> >>> I have added the 1st fix and 3rd patch into my tree:
> >>>
> >>> https://github.com/ming1/linux/commits/v5.17-ubd-dev
> >>>
> >>> The added check in 2nd patch is done lockless, which may not be reliable
> >>> enough, so I didn't add it. Also adding device is in slow path, and no
> >>> necessary to improve in that code path.
> >>>
> >>> I also cleaned up ubd driver a bit: debug code cleanup, remove zero copy
> >>> code, remove command of UBD_IO_GET_DATA and always make ubd driver
> >>> builtin.
> >>>
> >>> ubdsrv part has been cleaned up too:
> >>>
> >>> https://github.com/ming1/ubdsrv
> >>>
> >>>> I'm looking into adding support for multiple driver queues next, and
> >>>> should be able to share some patches on that shortly.
> >>> OK, please post them on linux-block so that more eyes can look at the
> >>> code, meantime the ubdsrv side needs to handle MQ too.
> >>>
> >>> Sooner or later, the single ubdsrv task may be saturated by copying data and
> >>> io_uring command communication only, which can be shown by running io on
> >>> ubd-null target. In my lattop, the ubdsrv cpu utilization is close to
> >>> 90% when IOPS is > 500K. So MQ may help some fast backing cases.
> >>>
> >>>
> >>> Thanks,
> >>> Ming
> >> Hi Ming,
> >>
> >> Now I am learning your userspace block driver(UBD) [1][2] and we plan to
> >> replace TCMU by UBD as a new choice for implementing userspace bdev for
> >> its high performance and simplicity.
> >>
> >> First, we have conducted some tests by fio and perf to evaluate UBD.
> >>
> >> 1) UBD achieves higher throughput than TCMU. We think TCMU suffers from
> >> the complicated SCSI layer and does not support multiqueue. However
> >> UBD is simply using io_uring passthrough and may support multiqueue in
> >> the future.(Note that even with a single queue now , UBD outperforms TCMU)
> > MQ isn't hard to support, and it is basically workable now:
> >
> > https://github.com/ming1/ubdsrv/commits/devel
> > https://github.com/ming1/linux/commits/my_for-5.18-ubd-devel
> >
> > Just the affinity of pthread for each queue isn't setup yet.
> >
> >> 2) Some functions in UBD result in high CPU utilization and we guess
> >> they also lower throughput. For example, ubdsrv_submit_fetch_commands()
> >> frequently iterates on the array of UBD IOs and wastes CPU when no IO is
> >> ready to be submitted. Besides, ubd_copy_pages() asks CPU to copy data
> >> between bio vectors and UBD internal buffers while handling write and
> >> read requests and it could be eliminated by supporting zero-copy.
> > copy itself doesn't take much cpu, see the following trace:
> >
> > - 34.36% 3.73% ubd [kernel.kallsyms] [k] ubd_copy_pages.isra.0 ▒
> > - 30.63% ubd_copy_pages.isra.0 ▒
> > - 23.86% internal_get_user_pages_fast ▒
> > + 21.14% get_user_pages_unlocked ▒
> > + 2.62% lockless_pages_from_mm ▒
> > 6.42% ubd_release_pages.constprop.0
> >
> > And we may provide option to allow to pin pages in the disk lifetime for avoiding
> > the cost in _get_user_pages_fast().
> >
> > zero-copy has to touch page table, and its cost may be expensive too.
> > The big problem is that MM doesn't provide mechanism to support generic
> > remapping kernel pages to userspace.
> >
> >> Second, I'd like to share some ideas on UBD. I'm not sure if they are
> >> reasonable so please figure out my mistakes.
> >>
> >> 1) UBD issues one sqe to commit last completed request and fetch a new
> >> one. Then, blk-mq's queue_rq() issues a new UBD IO request and completes
> >> one cqe for the fetch command. We have evaluated that io_submit_sqes()
> >> costs some CPU and steps of building a new sqe may lower throughput.
> >> Here I'd like to give a new solution: never submit sqe but trump up a
> >> cqe(with information of new UBD IO request) when calling queue_rq(). I
> >> am inspired by one io_uring flag: IORING_POLL_ADD_MULTI, with which a
> >> user issues only one sqe for polling an fd and repeatedly gets multiple
> >> cqes when new events occur. Dose this solution break the architecture of
> >> UBD?
> > But each cqe has to be associated with one sqe, if I understand
> > correctly.
> Yeah, for current io_uring implementation, it is. But if io_uring offers below
> helper:
> void io_gen_cqe_direct(struct file *file, u64 user_data, s32 res, u32 cflags)
> {
> struct io_ring_ctx *ctx;
> ctx = file->private_data;
>
> spin_lock(&ctx->completion_lock);
> __io_fill_cqe(ctx, user_data, res, cflags);
> io_commit_cqring(ctx);
> spin_unlock(&ctx->completion_lock);
> io_cqring_ev_posted(ctx);
> }
>
> Then in ubd driver:
> 1) device setup stage
> We attach io_uring files and user_data to every ubd hard queue.
>
> 2) when blk-mq->queue_rq() is called.
> io_gen_cqe_direct() will be called in ubd's queue_rq, and we put ubd io request's
> qid and tag info into cqe's res field, then we don't need to issue sqe to fetch io cmds.
The above way is actually anti io_uring design, and I don't think it may
improve much since submitting UBD_IO_COMMIT_AND_FETCH_REQ is pretty lightweight.
Also without re-submitting UBD_IO_COMMIT_AND_FETCH_REQ command, how can you
commit io handling result from ubd server and ask ubd driver to complete
io request?
Thanks,
Ming
next prev parent reply other threads:[~2022-05-09 7:52 UTC|newest]
Thread overview: 13+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <874k27rfwm.fsf@collabora.com>
2022-05-03 8:02 ` Follow up on UBD discussion Ming Lei
2022-05-07 4:20 ` ZiyangZhang
2022-05-07 17:13 ` Ming Lei
2022-05-09 4:05 ` Xiaoguang Wang
2022-05-09 7:36 ` Ming Lei [this message]
2022-05-09 11:53 ` Xiaoguang Wang
2022-05-09 13:15 ` Ming Lei
2022-05-10 4:38 ` Xiaoguang Wang
2022-05-09 8:11 ` Ziyang Zhang
2022-05-09 11:24 ` Ming Lei
2022-05-10 4:46 ` Ziyang Zhang
2022-05-11 1:52 ` Ming Lei
2022-05-10 4:55 ` Ziyang Zhang
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=YnjEaM5T2aO0mlyi@T590 \
--to=ming.lei@redhat.com \
--cc=ZiyangZhang@linux.alibaba.com \
--cc=axboe@kernel.dk \
--cc=joseph.qi@linux.alibaba.com \
--cc=krisman@collabora.com \
--cc=linux-block@vger.kernel.org \
--cc=xiaoguang.wang@linux.alibaba.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox