public inbox for kvm@vger.kernel.org
From: Avi Kivity <avi.kivity@gmail.com>
To: Paolo Bonzini <pbonzini@redhat.com>, Gleb Natapov <gleb@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>, kvm list <kvm@vger.kernel.org>
Subject: Re: Seeking a KVM benchmark
Date: Mon, 10 Nov 2014 16:23:45 +0200	[thread overview]
Message-ID: <5460CA71.2050701@gmail.com> (raw)
In-Reply-To: <5460AC7C.8040409@redhat.com>


On 11/10/2014 02:15 PM, Paolo Bonzini wrote:
>
> On 10/11/2014 11:45, Gleb Natapov wrote:
>>> I tried making also the other shared MSRs the same between guest and
>>> host (STAR, LSTAR, CSTAR, SYSCALL_MASK), so that the user return notifier
>>> has nothing to do.  That saves about 400-500 cycles on inl_from_qemu.  I
>>> do want to dig out my old Core 2 and see how the new test fares, but it
>>> really looks like your patch will be in 3.19.
>> Please test on wide variety of HW before final decision.
> Yes, definitely.
>
>> Also it would
>> be nice to ask Intel what the expected overhead is. It is awesome if they
>> manage to add EFER switching with non-measurable overhead, but also hard
>> to believe :)
> So let's see what happens.  Sneak preview: the result is definitely worth
> asking Intel about.
>
> I ran these benchmarks with a stock 3.16.6 KVM; instead of patching KVM,
> I patched kvm-unit-tests to set EFER.SCE in enable_nx.  This makes it
> much simpler for others to reproduce the results.  I only ran the
> inl_from_qemu test.
>
> Perf stat reports that the processor goes from 0.46 to 0.66
> instructions per cycle, which is consistent with the improvement from
> 19k to 12k cycles per iteration.
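Just to double-check the arithmetic, the IPC figures can be recomputed from the raw counters quoted in the perf output (a quick Python sketch; the constants are copied from the two runs below):

```python
# Recompute instructions-per-cycle from the quoted perf counters.
unpatched_ipc = 1_573_854_041 / 3_385_586_563
patched_ipc   = 2_133_698_018 / 3_252_297_378
print(f"unpatched IPC: {unpatched_ipc:.2f}")  # 0.46
print(f"patched IPC:   {patched_ipc:.2f}")    # 0.66
```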
>
> Unpatched KVM-unit-tests:
>
>       3,385,586,563 cycles                    #    3.189 GHz                     [83.25%]
>       2,475,979,685 stalled-cycles-frontend   #   73.13% frontend cycles idle    [83.37%]
>       2,083,556,270 stalled-cycles-backend    #   61.54% backend  cycles idle    [66.71%]
>       1,573,854,041 instructions              #    0.46  insns per cycle
>                                               #    1.57  stalled cycles per insn [83.20%]
>         1.108486526 seconds time elapsed
>
>
> Patched KVM-unit-tests:
>
>       3,252,297,378 cycles                    #    3.147 GHz                     [83.32%]
>       2,010,266,184 stalled-cycles-frontend   #   61.81% frontend cycles idle    [83.36%]
>       1,560,371,769 stalled-cycles-backend    #   47.98% backend  cycles idle    [66.51%]
>       2,133,698,018 instructions              #    0.66  insns per cycle
>                                               #    0.94  stalled cycles per insn [83.45%]
>         1.072395697 seconds time elapsed
>
> Playing with other events shows that the unpatched benchmark has an
> awful load of TLB misses
>
> Unpatched:
>
>              30,311 iTLB-loads
>         464,641,844 dTLB-loads
>          10,813,839 dTLB-load-misses          #    2.33% of all dTLB cache hits
>          20,436,027 iTLB-load-misses         #  67421.16% of all iTLB cache hits
>
> Patched:
>
>           1,440,033 iTLB-loads
>         640,970,836 dTLB-loads
>           2,345,112 dTLB-load-misses          #    0.37% of all dTLB cache hits
>             270,884 iTLB-load-misses          #   18.81% of all iTLB cache hits
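The percentages perf prints are simply misses divided by the STLB-hit (load) counts; recomputing them from the raw numbers above (a small sketch, reading the unpatched iTLB miss count as 20,436,027):

```python
# perf's "% of all ... cache hits" column: misses / loads * 100.
def miss_pct(misses, loads):
    return 100.0 * misses / loads

print(f"{miss_pct(10_813_839, 464_641_844):.2f}%")  # unpatched dTLB: 2.33%
print(f"{miss_pct(20_436_027, 30_311):.2f}%")       # unpatched iTLB: 67421.16%
print(f"{miss_pct(2_345_112, 640_970_836):.2f}%")   # patched dTLB: 0.37%
print(f"{miss_pct(270_884, 1_440_033):.2f}%")       # patched iTLB: 18.81%
```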
>
> This is 100% reproducible.  The meaning of the numbers is clearer if you
> look up the raw event numbers in the Intel manuals:
>
> - iTLB-loads is 85h/10h aka "perf -e r1085": "Number of cache load STLB [second-level
> TLB] hits. No page walk."
>
> - iTLB-load-misses is 85h/01h aka r185: "Misses in all ITLB levels that
> cause page walks."
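For reference, perf's rNNNN raw-event syntax on Intel encodes (umask << 8) | event_select, which is how 85h/10h becomes r1085 and 85h/01h becomes r185. A tiny helper (the name is made up) illustrates the encoding:

```python
# Encode an Intel event_select/umask pair as a perf raw event string.
def perf_raw(event_select, umask):
    return f"r{(umask << 8) | event_select:x}"

print(perf_raw(0x85, 0x10))  # r1085 (iTLB: STLB hits, no page walk)
print(perf_raw(0x85, 0x01))  # r185  (iTLB misses causing page walks)
print(perf_raw(0x85, 0x04))  # r485  (cycles the PMH is busy with a walk)
```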
>
> So for example event 85h/04h aka r485 ("Cycle PMH is busy with a walk.") and
> friends show that the unpatched KVM wastes about 0.1 seconds more than
> the patched KVM on page walks:
>
> Unpatched:
>
>          24,430,676 r449             (cycles on dTLB store miss page walks)
>         196,017,693 r408             (cycles on dTLB load miss page walks)
>         213,266,243 r485             (cycles on iTLB miss page walks)
> -------------------------
>         433,714,612 total
>
> Patched:
>
>          22,583,440 r449             (cycles on dTLB store miss page walks)
>          40,452,018 r408             (cycles on dTLB load miss page walks)
>           2,115,981 r485             (cycles on iTLB miss page walks)
> ------------------------
>          65,151,439 total
>
> These 0.1 seconds are probably all spent on instructions that would
> otherwise have been fast, since the slow instructions responsible for
> the low IPC are the microcoded ones, including VMX and other privileged
> stuff.
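The 0.1 second figure is just the difference between the two page-walk totals divided by the clock rate (roughly 3.15 GHz per the perf output above; counter multiplexing adds some noise, so this is only an estimate):

```python
# Difference in page-walk cycles between the two runs, in seconds.
delta_cycles = 433_714_612 - 65_151_439
print(f"{delta_cycles / 3.15e9:.3f} s")  # 0.117 s
```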
>
> Similarly, BDh/20h counts STLB flushes, which are 260k in unpatched KVM
> and 3k in patched KVM.  Let's see where they come from:
>
> Unpatched:
>
> +  98.97%  qemu-kvm  [kernel.kallsyms]  [k] native_write_msr_safe
> +   0.70%  qemu-kvm  [kernel.kallsyms]  [k] page_fault
>
> It's expected that most TLB misses happen just before a page fault (there
> are also events to count how many TLB misses do result in a page fault,
> if you care about that), and thus are accounted to the first instruction of the
> exception handler.
>
> We do not know what causes second-level TLB _flushes_ but it's quite
> expected that you'll have a TLB miss after them and possibly a page fault.
> And anyway 98.97% of them coming from native_write_msr_safe is totally
> anomalous.
>
> A patched benchmark shows no second-level TLB flush occurs after a WRMSR:
>
> +  72.41%  qemu-kvm  [kernel.kallsyms]  [k] page_fault
> +   9.07%  qemu-kvm  [kvm_intel]        [k] vmx_flush_tlb
> +   6.60%  qemu-kvm  [kernel.kallsyms]  [k] set_pte_vaddr_pud
> +   5.68%  qemu-kvm  [kernel.kallsyms]  [k] flush_tlb_mm_range
> +   4.87%  qemu-kvm  [kernel.kallsyms]  [k] native_flush_tlb
> +   1.36%  qemu-kvm  [kernel.kallsyms]  [k] flush_tlb_page
>
>
> So basically VMX EFER writes are optimized, while non-VMX EFER writes
> cause a TLB flush, at least on a Sandy Bridge.  Ouch!
>

It's not surprising [1].  Since the meaning of some PTE bits changes [2],
the TLB has to be flushed.  In VMX we have VPIDs, so we only need to
flush if EFER changed between two invocations of the same VPID, which
isn't the case.

[1] after the fact
[2] although those bits were reserved with NXE=0, so they shouldn't have 
any TLB footprint
