From: Kevin Wolf <kwolf@redhat.com>
To: Ming Lei <ming.lei@canonical.com>
Cc: Peter Maydell <peter.maydell@linaro.org>,
Fam Zheng <famz@redhat.com>,
"Michael S. Tsirkin" <mst@redhat.com>,
qemu-devel <qemu-devel@nongnu.org>,
Stefan Hajnoczi <stefanha@redhat.com>,
Paolo Bonzini <pbonzini@redhat.com>
Subject: Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
Date: Wed, 6 Aug 2014 12:09:18 +0200
Message-ID: <20140806100918.GC4090@noname.str.redhat.com>
In-Reply-To: <CACVXFVMsbuYdto_Vz8n9VZKFfYpYN2-0nRy2ksWgcE4DgypC8g@mail.gmail.com>
On 06.08.2014 at 11:37, Ming Lei wrote:
> On Wed, Aug 6, 2014 at 4:48 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> > On 06.08.2014 at 07:33, Ming Lei wrote:
> >> Hi Kevin,
> >>
> >> On Tue, Aug 5, 2014 at 10:47 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> >> > On 05.08.2014 at 15:48, Stefan Hajnoczi wrote:
> >> >> I have been wondering how to prove that the root cause is the ucontext
> >> >> coroutine mechanism (stack switching). Here is an idea:
> >> >>
> >> >> Hack your "bypass" code path to run the request inside a coroutine.
> >> >> That way you can compare "bypass without coroutine" against "bypass with
> >> >> coroutine".
> >> >>
> >> >> Right now I think there are doubts because the bypass code path is
> >> >> indeed a different (and not 100% correct) code path. So this approach
> >> >> might prove that the coroutines are adding the overhead and not
> >> >> something that you bypassed.
> >> >
> >> > My doubts aren't only that the overhead might not come from the
> >> > coroutines, but also whether any coroutine-related overhead is really
> >> > unavoidable. If we can optimise coroutines, I'd strongly prefer to do
> >> > just that instead of introducing additional code paths.
> >>
> >> OK, thank you for taking a look at the problem; I hope we can
> >> figure out the root cause. :-)
> >>
> >> >
> >> > Another thought I had was this: If the performance difference is indeed
> >> > only coroutines, then that is completely inside the block layer and we
> >> > don't actually need a VM to test it. We could instead have something
> >> > like a simple qemu-img based benchmark and should observe the same effect.
> >>
> >> It is even simpler to run a coroutine-only benchmark, and I just
> >> wrote a rough one; it looks like coroutines do decrease performance
> >> a lot, please see the attached patch. Thanks for your template,
> >> which helped me add the 'co_bench' command to qemu-img.
> >
> > Yes, we can look at coroutine microbenchmarks in isolation. I actually
> > did do that yesterday with the yield test from tests/test-coroutine.c.
> > And in fact profiling immediately showed something to optimise:
> > pthread_getspecific() was quite high; replacing it with __thread on
> > systems where it works is more efficient and helped the numbers a bit.
> > Also, a lot of time seems to be spent in pthread_mutex_lock/unlock (even
> > in qemu-img bench); maybe there's even something that can be done here.
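
For reference, a minimal sketch of the kind of TLS change meant above;
the names are illustrative only, not the actual QEMU code:

#include <pthread.h>

typedef struct Coroutine Coroutine;

/* Before: every lookup of the current coroutine is a
 * pthread_getspecific() function call. */
static pthread_key_t current_key;

static Coroutine *get_current(void)
{
    return pthread_getspecific(current_key);
}

/* After: on systems with working compiler TLS, a __thread variable
 * turns the lookup into a plain thread-local load. */
static __thread Coroutine *current;

static Coroutine *get_current_tls(void)
{
    return current;
}
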
>
> The lock/unlock in dataplane often comes from memory_region_find(), and
> Paolo has already done a lot of work on that.
>
> >
> > However, I just wasn't sure whether a change on this level would be
> > relevant in a realistic environment. This is the reason why I wanted to
> > get a benchmark involving the block layer and some I/O.
> >
> >> From the profiling data in below link:
> >>
> >> http://pastebin.com/YwH2uwbq
> >>
> >> With coroutines, the running time for the same workload increases by
> >> ~50% (1.325s vs. 0.903s), dcache load events increase by ~35%
> >> (693M vs. 512M), and insns per cycle drops by ~17% (1.35 vs. 1.63),
> >> compared with bypassing coroutines (the -b parameter).
> >>
> >> The bypass code in the benchmark is very similar to the approach
> >> used in the bypass patch, since linux-aio with O_DIRECT seldom
> >> blocks in the kernel I/O path.
> >>
> >> Maybe the benchmark is a bit extreme, but given that modern storage
> >> devices may reach millions of IOPS, it is very easy for coroutines to
> >> slow down the I/O.
> >
> > I think in order to optimise coroutines, such benchmarks are fair game.
> > It's just not guaranteed that the effects are exactly the same on real
> > workloads, so we should take the results with a grain of salt.
> >
> > Anyhow, the coroutine version of your benchmark is buggy: it leaks all
> > coroutines instead of exiting them, so it can't make any use of the
> > coroutine pool. On my laptop, I get this (where "fixed coro" is a
> > version that simply removes the yield at the end):
> >
> >                 |    bypass     |  fixed coro   |  buggy coro
> > ----------------+---------------+---------------+--------------
> > time            |     1.09s     |     1.10s     |     1.62s
> > L1-dcache-loads |   921,836,360 |   932,781,747 | 1,298,067,438
> > insns per cycle |      2.39     |      2.39     |       1.90
> >
> > This raises the question whether you're seeing a similar effect on a real
> > qemu because the coroutine pool is still not big enough. With correct use
> > of coroutines, the difference seems to be barely measurable even without
> > any I/O involved.
>
> When I comment out qemu_coroutine_yield(), the results for
> bypass and fixed coro look very similar, as in your test, and I am just
> wondering whether the stack is always switched in qemu_coroutine_enter(),
> even without calling qemu_coroutine_yield().

Yes, definitely. qemu_coroutine_enter() always involves calling
qemu_coroutine_switch(), which is the stack switch.
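A rough sketch of that path (simplified, with error handling and the
handling of the switch's return value omitted; not the literal source):

void qemu_coroutine_enter(Coroutine *co, void *opaque)
{
    Coroutine *self = qemu_coroutine_self();

    /* Record who entered the coroutine and what its argument is. */
    co->caller = self;
    co->entry_arg = opaque;

    /* Unconditional stack switch from the caller to the coroutine,
     * whether or not the coroutine ever yields. */
    qemu_coroutine_switch(self, co, COROUTINE_ENTER);
}
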
> Without the yield, the benchmark can't emulate the coroutine usage in the
> bdrv_aio_readv/writev() path any more, and the bypass in the patchset
> skips two qemu_coroutine_enter() calls and one qemu_coroutine_yield()
> for each bdrv_aio_readv/writev().

It's not completely comparable anyway because you're not going through a
main loop and callbacks from there for your benchmark.
But fair enough: keep the yield, but then enter the coroutine twice. You
get slightly worse results, but that's more like doubling the very
small difference between "bypass" and "fixed coro" (1.11s / 946,434,327
/ 2.37), not like the horrible performance of the buggy version.
Actually, that's within the error of measurement for time and
insns/cycle, so running it for a bit longer:
                |  bypass   |   coro    |  + yield  |  buggy coro
----------------+-----------+-----------+-----------+--------------
time            |   21.45s  |   21.68s  |   21.83s  |    97.05s
L1-dcache-loads |  18,049 M |  18,387 M |  18,618 M |   26,062 M
insns per cycle |    2.42   |    2.40   |    2.41   |     1.75
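
For clarity, the "keep the yield, enter twice" variant is shaped roughly
like this; a sketch against the coroutine API, not the actual co_bench
patch, and the helper names are made up:

static void coroutine_fn bench_co_entry(void *opaque)
{
    /* ... per-iteration work would go here ... */

    qemu_coroutine_yield();   /* give control back to the caller once,
                               * like a request waiting for completion */

    /* Resumed by the second enter; returning terminates the coroutine,
     * so it can be recycled by the coroutine pool. In the buggy version
     * this second half never runs and the pool is never reused. */
}

static void bench_one_iteration(void)
{
    Coroutine *co = qemu_coroutine_create(bench_co_entry);

    qemu_coroutine_enter(co, NULL);   /* runs up to the yield */
    qemu_coroutine_enter(co, NULL);   /* resumes it and lets it finish */
}
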
> >> > I played a bit with the following, I hope it's not too naive. I couldn't
> >> > see a difference with your patches, but at least one reason for this is
> >> > probably that my laptop SSD isn't fast enough to make the CPU the
> >> > bottleneck. Haven't tried ramdisk yet, that would probably be the next
> >> > thing. (I actually wrote the patch up just for some profiling on my own,
> >> > not for comparing throughput, but it should be usable for that as well.)
> >>
> >> This might not be good for the test, since it is basically a sequential
> >> read test, which can be optimized a lot by the kernel. I always use a
> >> randread benchmark.
> >
> > Yes, I briefly pondered whether I should implement random offsets
> > instead. But then I realised that a quicker kernel operation would only
> > help the benchmark because we want it to test the CPU consumption in
> > userspace. So the faster the kernel gets, the better for us, because it
> > should make the impact of coroutines bigger.
>
> OK, I will compare coroutine vs. bypass-co with the benchmark.
Ok, thanks.
Kevin