From: Bernd Schubert <bschubert@ddn.com>
To: Ming Lei <ming.lei@redhat.com>, Jens Axboe <axboe@kernel.dk>,
Pavel Begunkov <asml.silence@gmail.com>,
Miklos Szeredi <mszeredi@redhat.com>,
Christoph Hellwig <hch@lst.de>,
Ziyang Zhang <ZiyangZhang@linux.alibaba.com>,
Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Cc: "lsf-pc@lists.linux-foundation.org"
<lsf-pc@lists.linux-foundation.org>,
"io-uring@vger.kernel.org" <io-uring@vger.kernel.org>,
"linux-block@vger.kernel.org" <linux-block@vger.kernel.org>
Subject: Re: [LSF/MM/BPF TOPIC] ublk & io_uring: ublk zero copy support
Date: Fri, 5 May 2023 21:57:47 +0000 [thread overview]
Message-ID: <41cfb9c2-9774-e9e1-d8e7-4999a710f2e7@ddn.com> (raw)
In-Reply-To: <ZEx+h/iFf46XiWG1@ovpn-8-24.pek2.redhat.com>
Hi Ming,
On 4/29/23 04:18, Ming Lei wrote:
> Hello,
>
> ublk zero copy is observed to improve large-chunk (64KB+) sequential IO
> performance a lot: IOPS of ublk-loop over tmpfs increases by 1~2X [1], and Jens
> also observed that IOPS of ublk-qcow2 can increase by ~1X [2]. It also saves
> memory bandwidth.
>
> So this is an important performance improvement.
>
> So far there are three proposals:
It looks like there is no dedicated session. Could we still have a
discussion in a free slot, if possible?
Thanks,
Bernd
>
> 1) splice based
>
> - a spliced page from ->splice_read() can't be written to
>
> ublk READ requests can't be handled because the spliced page can't be written
> to, and extending splice for ublk zero copy isn't a good solution [3]
>
> - it is very hard to meet the above requirements wrt. request buffer lifetime
>
> splice/pipe focuses on page reference lifetime, but ublk zero copy cares about
> ublk request buffer lifetime. It is very inefficient to track the request
> buffer lifetime via every pipe buffer's ->release(), because that requires the
> pipe and all of its buffers to be kept alive while the ublk server handles the
> IO. That means a single dedicated ``pipe_inode_info`` has to be allocated at
> runtime for each provided buffer, and the pipe has to be populated with the
> pages of the ublk request buffer.
>
> IMO, splice isn't a good approach from either a correctness or a performance
> viewpoint.
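As an aside, the lifetime mismatch above can be modeled in a few lines of
userspace Python (a toy sketch with illustrative names, not kernel code): each
pipe buffer gets its own ->release() callback, but the request buffer is only
reusable once the *last* one fires, so the whole dedicated pipe has to stay
alive for the duration of the request:

```python
# Toy model (not kernel code): a ublk request buffer is only safe to reuse
# once every pipe buffer referencing its pages has been released.
class PipeBuffer:
    def __init__(self, pipe, page):
        self.pipe, self.page = pipe, page

    def release(self):
        # splice tracks lifetime per page reference...
        self.pipe.on_release(self)

class Pipe:
    """Analogue of one dedicated pipe_inode_info per provided buffer."""
    def __init__(self, request_pages):
        self.buffers = [PipeBuffer(self, p) for p in request_pages]
        self.request_done = False

    def on_release(self, buf):
        self.buffers.remove(buf)
        # ...but the request buffer only becomes reusable when *all* pipe
        # buffers are gone, so the pipe itself must be kept until then.
        if not self.buffers:
            self.request_done = True

pipe = Pipe(request_pages=["page0", "page1", "page2"])
for b in list(pipe.buffers):
    b.release()
assert pipe.request_done  # request completes only after the last release
```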
>
> 2) io_uring register buffer based
>
> - the main idea is to register a buffer at runtime in the fast IO path, and
> unregister it after the buffer has been used by the following OPs
>
> - the main problem is bad performance caused by the io_uring link model
>
> registering the buffer has to be one OP, and so does unregistering it; the
> following normal OPs (such as FS IO) have to depend on the buffer-register
> OP, so io_uring links have to be used.
>
> It is normal for more than one normal OP to depend on the buffer-register OP,
> so all these OPs (register buffer, normal (FS IO) OPs and unregister buffer)
> have to be linked together. The normal (FS IO) OPs then have to be submitted
> one by one, which is slow, because there is often no dependency among the
> normal FS OPs themselves. Basically, the io_uring link model does not support
> this kind of 1:N dependency.
>
> No one has posted code for this approach yet.
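The serialization cost can be sketched with a toy step counter (Python; the OP
names are illustrative, not real io_uring opcodes). In a link chain every SQE
waits for the previous one, so N independent FS OPs bracketed by register and
unregister take N+2 serial steps, whereas a true 1:N dependency would let all
FS OPs run in a single parallel wave:

```python
def linked_steps(n_io_ops):
    # io_uring link: every SQE in the chain depends on the previous one, so
    # register -> io_1 -> ... -> io_N -> unregister executes fully serially.
    chain = ["register_buf"] + [f"io_{i}" for i in range(n_io_ops)] + ["unregister_buf"]
    return len(chain)  # one step per OP in the chain

def fanout_steps(n_io_ops):
    # Ideal 1:N dependency: all FS OPs depend only on register_buf and can
    # run concurrently; unregister_buf then waits for the whole wave.
    return 1 + 1 + 1  # register, one parallel IO wave, unregister

print(linked_steps(4), fanout_steps(4))  # 6 vs 3
```

The gap grows with N: the linked chain pays one step per FS OP even though the
FS OPs have no dependency on each other.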
>
> 3) io_uring fused command[1]
>
> - fused command extends current io_uring usage by allowing the following FS
> OPs (called secondary OPs) to be submitted after the primary command provides
> the buffer; the primary command won't be completed until all secondary OPs
> are done.
>
> This solves the problem in 2), and meanwhile avoids the buffer register and
> unregister cost in both the submission and completion fast paths: because the
> primary command won't be completed until all secondary OPs are done, there is
> no need to write/read the buffer into/from a per-context global data
> structure.
>
> Meanwhile, the buffer lifetime problem is addressed simply, so correctness is
> guaranteed and performance is quite good; IOPS of 4k IO even improves a
> little in some workloads, and at least no perf regression is observed for
> small-size IO.
>
> A fused command can be thought of logically as one single request; it just
> has more than one SQE (all sharing the same link flag), which is why it is
> named a fused command.
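That completion rule can be modeled with a simple reference count (a hedged
Python sketch; class and method names are illustrative, not the actual kernel
implementation). The primary command holds one reference per secondary OP and
completes only when the last secondary drops its reference, which is what
keeps the buffer valid without registering it in per-context state:

```python
class FusedCommand:
    # Toy model of a primary command providing a buffer: it completes only
    # after every secondary OP that borrows the buffer has finished.
    def __init__(self, n_secondaries):
        self.pending = n_secondaries
        self.completed = False

    def secondary_done(self):
        self.pending -= 1
        if self.pending == 0:
            self.completed = True  # buffer ownership returns to ublk here

cmd = FusedCommand(n_secondaries=3)
for _ in range(3):
    cmd.secondary_done()
assert cmd.completed  # primary completes only after the last secondary
```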
>
> - the only concern is that fused command introduces a new usage of io_uring,
> but I still haven't seen comments on what/why would be bad about this kind
> of new usage/interface.
>
> I propose this topic and want to discuss how to move forward with this
> feature.
>
>
> [1] https://lore.kernel.org/linux-block/20230330113630.1388860-1-ming.lei@redhat.com/
> [2] https://lore.kernel.org/linux-block/b3fc9991-4c53-9218-a8cc-5b4dd3952108@kernel.dk/
> [3] https://lore.kernel.org/linux-block/CAHk-=wgJsi7t7YYpuo6ewXGnHz2nmj67iWR6KPGoz5TBu34mWQ@mail.gmail.com/
>
>
> Thanks,
> Ming
>
Thread overview: 4+ messages
2023-04-29  2:18 [LSF/MM/BPF TOPIC] ublk & io_uring: ublk zero copy support Ming Lei
2023-05-05 21:57 ` Bernd Schubert [this message]
2023-05-06  1:38 ` Ming Lei
2023-05-08  2:16 ` Pavel Begunkov