From: Maxim Levitsky <mlevitsk@redhat.com>
To: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Fam Zheng <fam@euphon.net>, Kevin Wolf <kwolf@redhat.com>,
Sergio Lopez <slp@redhat.com>,
qemu-block@nongnu.org, qemu-devel@nongnu.org,
Max Reitz <mreitz@redhat.com>,
saket.sinha89@gmail.com, Paolo Bonzini <pbonzini@redhat.com>,
Julia Suvorova <jusual@mail.ru>,
Aarushi Mehta <mehta.aaru20@gmail.com>,
Markus Armbruster <armbru@redhat.com>
Subject: Re: [Qemu-devel] [PATCH v5 04/12] block/io_uring: implements interfaces for io_uring
Date: Wed, 19 Jun 2019 13:47:01 +0300
Message-ID: <f24926a73f00be2f1d7de2c8b09680a98c595fd5.camel@redhat.com>
In-Reply-To: <20190619101414.GA13569@stefanha-x1.localdomain>
On Wed, 2019-06-19 at 11:14 +0100, Stefan Hajnoczi wrote:
> On Mon, Jun 17, 2019 at 03:26:50PM +0300, Maxim Levitsky wrote:
> > On Mon, 2019-06-10 at 19:18 +0530, Aarushi Mehta wrote:
> > > + if (!cqes) {
> > > + break;
> > > + }
> > > + LuringAIOCB *luringcb = io_uring_cqe_get_data(cqes);
> > > + ret = cqes->res;
> > > +
> > > + if (ret == luringcb->qiov->size) {
> > > + ret = 0;
> > > + } else if (ret >= 0) {
> >
> >
> > You should very carefully check the allowed return values here.
> >
> > It looks like you can get '-EINTR' here, which would ask you to rerun the read operation; otherwise
> > you will get the number of bytes read, which might be less than what was asked for. That implies you
> > need to retry the read operation with the remainder of the buffer, rather than zeroing the end of the buffer, IMHO.
> >
> > (0 is returned on EOF according to 'read' semantics, which I think are used here, so a short read is not necessarily an EOF.)
> >
> >
> > Looking at linux-aio.c though, I do see that it just passes through the returned value with no special
> > treatment, including no check for -EINTR.
> >
> > I assume that since Linux AIO is Linux-specific and only supports direct I/O, it happens to get away with
> > assuming no short reads/-EINTR (but since libaio has very sparse documentation, I can't verify this).
> >
> > On the other hand, the aio=threads implementation actually does everything as specified in the 'read'/'write'
> > manpages, retrying on -EINTR and issuing additional reads if fewer than the requested number of bytes were read.
> >
> > Looking at the io_uring implementation in the kernel, I see that it does support buffered (non-O_DIRECT) I/O,
> > and in this case it goes through the same ->read_iter path that a regular read() takes, so it might return
> > short reads and/or -EINTR.
>
> Interesting point. Investigating EINTR should at least be a TODO
> comment and needs to be resolved before io_uring lands in a QEMU
> release.
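Agreed. For illustration, the completion loop could handle these cases roughly as follows (just a sketch;
luring_resubmit() and luring_resubmit_short_read() are hypothetical helpers, not part of this patch):

    ret = cqes->res;
    if (ret == luringcb->qiov->size) {
        ret = 0;                            /* full transfer: success */
    } else if (ret == -EINTR) {
        /* interrupted before transferring anything: retry the request as-is */
        luring_resubmit(s, luringcb);
        continue;                           /* after marking the cqe as seen */
    } else if (ret >= 0) {
        /*
         * Short read: resubmit for the remainder of the qiov instead of
         * zeroing the tail of the buffer. 0 means EOF under read()
         * semantics, so a short read is not necessarily end-of-file.
         */
        luring_resubmit_short_read(s, luringcb, ret);
        continue;
    }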
>
> > > +static int ioq_submit(LuringState *s)
> > > +{
> > > + int ret = 0;
> > > + LuringAIOCB *luringcb, *luringcb_next;
> > > +
> > > + while (s->io_q.in_queue > 0) {
> > > + QSIMPLEQ_FOREACH_SAFE(luringcb, &s->io_q.sq_overflow, next,
> > > + luringcb_next) {
> >
> > I am torn about the 'sq_overflow' name. It seems to me that it's not immediately clear that these
> > are the requests that are waiting because the io_uring SQ ring got full, but I can't think of a better name right now.
> >
> > Maybe add a comment here to explain what is going on?
>
> Hmm...I suggested this name because I thought it was clear. But the
> fact that it puzzled you proves it wasn't clear :-).
>
> Can anyone think of a better name? It's the queue we keep in QEMU to
> hold requests while the io_uring sq ring is full.
>
> > Also, maybe we could somehow utilize the plug/unplug facility to avoid reaching that state in the first place?
> > Maybe the block layer has some kind of 'max outstanding requests' limit that could be used?
> >
> > In my nvme-mdev I opted not to process the input queues when such a condition is detected, but here you can't,
> > as the block layer pretty much calls you to process the requests.
>
> Block layer callers are allowed to submit as many I/O requests as they
> like and there is no feedback mechanism. It's up to linux-aio.c and
> io_uring.c to handle the case where host kernel I/O submission resources
> are exhausted.
>
> Plug/unplug is a batching performance optimization to reduce the number
> of io_uring_enter() calls but it does not stop the callers from
> submitting more I/O requests. So plug/unplug isn't directly applicable
> here.
Thanks for the explanation! I guess we can leave the name as is, but add a comment explaining what is
going on in the place where the queue is accessed, perhaps along the lines of the sketch below.
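For example, something like this above the drain loop in ioq_submit() (suggested wording only):

    /*
     * These requests arrived while the io_uring SQ ring was full, so
     * they were queued up in QEMU instead. Now that completions have
     * freed up SQ ring space, move them into the ring and submit.
     */
    QSIMPLEQ_FOREACH_SAFE(luringcb, &s->io_q.sq_overflow, next,
                          luringcb_next) {
        ...
    }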
>
> > > +static int luring_do_submit(int fd, LuringAIOCB *luringcb, LuringState *s,
> > > + uint64_t offset, int type)
> > > +{
> > > + struct io_uring_sqe *sqes = io_uring_get_sqe(&s->ring);
> > > + if (!sqes) {
> > > + sqes = &luringcb->sqeq;
> > > + QSIMPLEQ_INSERT_TAIL(&s->io_q.sq_overflow, luringcb, next);
> > > + }
> > > +
> > > + switch (type) {
> > > + case QEMU_AIO_WRITE:
> > > + io_uring_prep_writev(sqes, fd, luringcb->qiov->iov,
> > > + luringcb->qiov->niov, offset);
> > > + break;
> > > + case QEMU_AIO_READ:
> > > + io_uring_prep_readv(sqes, fd, luringcb->qiov->iov,
> > > + luringcb->qiov->niov, offset);
> > > + break;
> > > + case QEMU_AIO_FLUSH:
> > > + io_uring_prep_fsync(sqes, fd, 0);
> > > + break;
> > > + default:
> > > + fprintf(stderr, "%s: invalid AIO request type, aborting 0x%x.\n",
> > > + __func__, type);
> >
> > Nitpick: don't we use some kind of error-printing function like 'error_setg' rather than fprintf?
>
> Here we're not in a context where an Error object can be returned (e.g.
> printed by the QMP monitor). We only have an errno return value that
> the emulated storage controller may squash down further to a single
> EIO-type error code.
>
> 'type' is a QEMU-internal value so the default case is basically
> assert(false); /* we should never get here */
>
> For these reasons the fprintf() seems okay here.
All right then.
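In other words, the default case effectively amounts to this (a sketch, with abort() standing in for
assert(false)):

    default:
        fprintf(stderr, "%s: invalid AIO request type 0x%x.\n",
                __func__, type);
        abort();   /* 'type' is QEMU-internal; we should never get here */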
>
> > I plan on this or next week to do some benchmarks of the code and I will share the results as soon
> > as I do them.
>
> Excellent, Aarushi has been benchmarking too. Perhaps you can share the
> QEMU command-line and fio configuration so that the results can be
> compared.
I'll do that soon. I already did some initial benchmarks (I haven't saved the results yet), and in its current
state io_uring is a tiny bit slower than libaio. That is nothing to worry about, because io_uring has a lot of
benefits beyond raw performance, and the current code can be optimized further.
I'll redo the benchmarks and post all the results as soon as I have them.
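For reference, this is roughly the kind of fio job I have been running inside the guest (illustrative only;
the exact device names and parameters of my final runs may differ):

    [global]
    ioengine=libaio
    direct=1
    bs=4k
    iodepth=32
    runtime=60
    time_based=1

    [randread]
    rw=randread
    filename=/dev/vdb

with the test drive attached on the host once with aio=native and once with aio=io_uring, e.g.:

    -drive file=test.img,format=raw,if=none,id=d0,cache=none,aio=io_uring \
    -device virtio-blk-pci,drive=d0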
Best regards,
Maxim Levitsky