Re: [Qemu-devel] QEMU event loop optimizations

From: Sergio Lopez <slp@redhat.com>
To: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Sergio Lopez <slp@redhat.com>,
	qemu-devel@nongnu.org, Paolo Bonzini <pbonzini@redhat.com>
Subject: Re: [Qemu-devel] QEMU event loop optimizations
Date: Fri, 05 Apr 2019 18:29:49 +0200
Message-ID: <878swomn42.fsf@redhat.com>
In-Reply-To: <20190326131822.GD15011@stefanha-x1.localdomain>

Stefan Hajnoczi writes:

> Hi Sergio,
> Here are the forgotten event loop optimizations I mentioned:
>
>   https://github.com/stefanha/qemu/commits/event-loop-optimizations
>
> The goal was to eliminate or reorder syscalls so that useful work (like
> executing BHs) occurs as soon as possible after an event is detected.
>
> I remember that these optimizations only shave off a handful of
> microseconds, so they aren't a huge win.  They do become attractive on
> fast SSDs with <10us read/write latency.
>
> These optimizations are aggressive and there is a possibility of
> introducing regressions.
>
> If you have time to pick up this work, try benchmarking each commit
> individually so performance changes are attributed individually.
> There's no need to send them together in a single patch series, the
> changes are quite independent.

It took me a while to find a way to get meaningful numbers to evaluate
those optimizations. The problem is that here (Xeon E5-2640 v3 and EPYC
7351P) the cost of event_notifier_set() is just ~0.4us when the code
path is hot, which is hard to distinguish from the noise.

To do so, I've used a kernel patched with a naive io_poll implementation
for virtio_blk [1] and a QEMU patched with poll-inflight [2] (just to
be sure we're polling), and ran the test on semi-isolated cores
(nohz_full + rcu_nocbs + systemd_isolation) with idle siblings. The
storage is simulated by null_blk with "completion_nsec=0 no_sched=1
irqmode=0".

# fio --time_based --runtime=30 --rw=randread --name=randread \
 --filename=/dev/vdb --direct=1 --ioengine=pvsync2 --iodepth=1 --hipri=1

| avg_lat (us) | master | qbsn* |
|   run1       | 11.32  | 10.96 |
|   run2       | 11.37  | 10.79 |
|   run3       | 11.42  | 10.67 |
|   run4       | 11.32  | 11.06 |
|   run5       | 11.42  | 11.19 |
|   run6       | 11.42  | 10.91 |
 * patched with aio: add optimized qemu_bh_schedule_nested() API

Even though there's still some variance in the numbers, the 0.4us
improvement is clearly visible.
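
As a quick sanity check on the table above, the per-column means can be
computed directly from the six runs. This is only a sketch in Python, not
part of the original test setup; the variable names are made up for
illustration and the values are copied from the table.

```python
from statistics import mean

# avg_lat (us) per run, copied from the table above
master = [11.32, 11.37, 11.42, 11.32, 11.42, 11.42]
qbsn = [10.96, 10.79, 10.67, 11.06, 11.19, 10.91]

master_mean = mean(master)
qbsn_mean = mean(qbsn)

# Positive delta means the patched build (qbsn) has lower latency
delta = master_mean - qbsn_mean

print(f"master: {master_mean:.2f} us")
print(f"qbsn:   {qbsn_mean:.2f} us")
print(f"delta:  {delta:.2f} us")
```

The means come out at about 11.38us for master and 10.93us for qbsn, a
difference of roughly 0.45us, in line with the ~0.4us cost of
event_notifier_set() mentioned earlier.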

I haven't tested the other 3 patches, as their optimizations only take
effect when the event loop is not running in polling mode. Without
polling, we get an additional overhead of at least 10us, plus a lot of
noise, due to both direct costs (ppoll()...) and indirect ones
(re-scheduling and TLB/cache pollution), so I don't think we can
reliably benchmark them. Their impact probably won't be significant
either, given the costs I've just mentioned.

Sergio.

[1] https://github.com/slp/linux/commit/d369b37db3e298933e8bb88c6eeacff07f39bc13
[2] https://lists.nongnu.org/archive/html/qemu-devel/2019-04/msg00447.html

