From: Kevin Wolf <kwolf@redhat.com>
To: Ming Lei <ming.lei@canonical.com>
Cc: Peter Maydell <peter.maydell@linaro.org>,
Fam Zheng <famz@redhat.com>,
"Michael S. Tsirkin" <mst@redhat.com>,
tom.leiming@gmail.com, qemu-devel <qemu-devel@nongnu.org>,
Stefan Hajnoczi <stefanha@redhat.com>,
Paolo Bonzini <pbonzini@redhat.com>
Subject: Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
Date: Mon, 11 Aug 2014 16:03:56 +0200
Message-ID: <20140811140356.GA3980@noname.redhat.com>
In-Reply-To: <20140810114624.0305b7af@tom-ThinkPad-T410>
On 10.08.2014 at 05:46, Ming Lei wrote:
> Hi Kevin, Paolo, Stefan and all,
>
>
> On Wed, 6 Aug 2014 10:48:55 +0200
> Kevin Wolf <kwolf@redhat.com> wrote:
>
> > On 06.08.2014 at 07:33, Ming Lei wrote:
>
> >
> > Anyhow, the coroutine version of your benchmark is buggy, it leaks all
> > coroutines instead of exiting them, so it can't make any use of the
> > coroutine pool. On my laptop, I get this (where fixed coroutine is a
> > version that simply removes the yield at the end):
> >
> > | bypass | fixed coro | buggy coro
> > ----------------+---------------+---------------+--------------
> > time | 1.09s | 1.10s | 1.62s
> > L1-dcache-loads | 921,836,360 | 932,781,747 | 1,298,067,438
> > insns per cycle | 2.39 | 2.39 | 1.90
> >
> > This begs the question whether you see a similar effect on a real
> > qemu and whether the coroutine pool is still not big enough. With
> > correct use of coroutines, the difference seems to be barely
> > measurable even without any I/O involved.
>
> Now I have fixed the coroutine leak bug. The previous crypt benchmark
> was quite heavily loaded, which kept operations per second very low
> (~40K/sec), so I wrote a new, simpler one that generates hundreds of
> thousands of operations per second; that number should match some fast
> storage devices. It does show that the effect of coroutines is not
> small.
>
> In the extreme case where just a getppid() syscall is run in each
> iteration, only 3M operations/sec are achieved with coroutines, while
> without them the number reaches 16M/sec: a difference of more than 4
> times!
I see that you're measuring a lot of things, but the one thing that is
unclear to me is what question those benchmarks are supposed to answer.
Basically I see two different, useful types of benchmark:
1. Look at coroutines in isolation and try to get a directly coroutine-
related function (like create/destroy or yield/reenter) faster. This
is what tests/test-coroutine does.
This is quite good at telling you what costs the coroutine functions
have and where you need to optimise - without taking the practical
benefits into account, so it's not suitable for comparison.
2. Look at the whole thing in its realistic environment. This should
probably involve at least some asynchronous I/O, but ideally use the
whole block layer. qemu-img bench tries to do this. For being even
closer to the real environment you'd have to use the virtio-blk code
as well, which you currently only get with a full VM (perhaps qtest
could do something interesting here in theory).
This is good for telling how big the costs are in relation to the
total workload (and code saved elsewhere) in practice. This is the
set of tests that can meaningfully be compared to a callback-based
solution.
Running arbitrary workloads like getppid() or open/read/close isn't as
useful as these. It doesn't isolate the coroutines as well as tests that
run literally nothing else than coroutine functions, and it is too
removed from the actual use case to get the relation between additional
costs, saving and total workload figured out for the real case.
> Another file read benchmark, which is the default one, just does
> open(file), read(fd, buf on stack, 512), sum and close() in each
> iteration.
>
> Without coroutines, operations per second increase by ~20% compared to
> using them. When reading 1024 bytes each time, the number still
> increases by ~10%. The rate is between 200K and 400K operations per
> second, which should match the IOPS in the dataplane test. The tests
> were done on my Lenovo T410 notebook (CPU: 2.6GHz, dual core, four
> threads).
>
> When reading 8192 bytes or more each time, no obvious difference
> between using coroutines and not using them can be observed.
All it tells you is that the variation of the workload can make the
coroutine cost disappear in the noise. It doesn't tell you much about
the real use case.
And you're comparing apples and oranges anyway: The real question in
qemu is whether you use coroutines or pass around heap-allocated state
between callbacks. Your benchmark doesn't have a single callback because
it hasn't got any asynchronous operations and doesn't need to allocate
and pass any state.
It does, however, have an unnecessary yield() for the coroutine case
because you felt that the real case is more complex and does yield
(which is true, but it's more complex for both coroutines and
callbacks).
> Surely the test results depend on how fast the machine is, but even on
> a fast machine, I guess a similar result can still be observed by
> decreasing the number of bytes read each time.
Yes, results looked similar on my laptop. (They just don't tell me
much.)
Let's have a look at some fio results from my laptop:
aggrb KB/s | master | coroutine | bypass
------------+-----------+-----------+------------
run 1 | 419934 | 449518 | 445823
run 2 | 444358 | 456365 | 448332
run 3 | 444076 | 455209 | 441552
And here from my lab test box:
aggrb KB/s | master | coroutine | bypass
------------+-----------+-----------+------------
run 1 | 25330 | 56378 | 53541
run 2 | 26041 | 55709 | 54136
run 3 | 25811 | 56829 | 49080
The improvement of the bypass patches is barely measurable on my laptop
(if it even exists), whereas it seems to be a pretty big thing for my
lab test box. In any case, the optimised coroutine code seems to beat
the bypass on both machines. (That is for random reads anyway. For
sequential, I get a much larger variation, and on my lab test box bypass
is ahead, whereas on my laptop both are roughly on the same level.)
Another thing I tried is creating the coroutine already in virtio-blk to
avoid the overhead of the bdrv_aio_* emulation. I don't quite understand
the result of my benchmarks there, maybe you have an idea: For random
reads, I see a significant improvement, for sequential however a clear
degradation.
aggrb MB/s | bypass | coroutine | virtio-blk-created coroutine
------------+-----------+-----------+------------------------------
seq. read | 738 | 738 | 694
random read | 442 | 459 | 475
I would appreciate any ideas about what's going on with sequential reads
here and how it can be fixed. Anyway, on my machines, coroutines don't
look like a lost case at all.
Kevin