* Seeking a KVM benchmark
From: Andy Lutomirski @ 2014-11-07  6:27 UTC (permalink / raw)
To: kvm list

Is there an easy benchmark that's sensitive to the time it takes to
round-trip from userspace to guest and back to userspace?  I think I
may have a big speedup.

--Andy

--
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: Seeking a KVM benchmark
From: Paolo Bonzini @ 2014-11-07  7:17 UTC (permalink / raw)
To: Andy Lutomirski, kvm list

On 07/11/2014 07:27, Andy Lutomirski wrote:
> Is there an easy benchmark that's sensitive to the time it takes to
> round-trip from userspace to guest and back to userspace?  I think I
> may have a big speedup.

The simplest is vmexit.flat from
git://git.kernel.org/pub/scm/virt/kvm/kvm-unit-tests.git

Run it with "x86/run x86/vmexit.flat" and look at the inl_from_qemu
benchmark.

Paolo
* Re: Seeking a KVM benchmark
From: Andy Lutomirski @ 2014-11-07 17:59 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: kvm list

On Thu, Nov 6, 2014 at 11:17 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> [...]
> The simplest is vmexit.flat from
> git://git.kernel.org/pub/scm/virt/kvm/kvm-unit-tests.git
>
> Run it with "x86/run x86/vmexit.flat" and look at the inl_from_qemu
> benchmark.

Thanks!

That test case is slower than I expected.  I think my change is likely
to save somewhat under 100ns, which is only a couple percent.  I'll
look for more impressive improvements.

On a barely related note, in the process of poking around with this
test, I noticed:

	/* On ept, can't emulate nx, and must switch nx atomically */
	if (enable_ept && ((vmx->vcpu.arch.efer ^ host_efer) & EFER_NX)) {
		guest_efer = vmx->vcpu.arch.efer;
		if (!(guest_efer & EFER_LMA))
			guest_efer &= ~EFER_LME;
		add_atomic_switch_msr(vmx, MSR_EFER, guest_efer, host_efer);
		return false;
	}

	return true;

This heuristic seems wrong to me.  wrmsr is serializing and therefore
extremely slow, whereas I imagine that, on CPUs that support it,
atomically switching EFER ought to be reasonably fast.

Indeed, changing vmexit.c to disable NX (thereby forcing atomic EFER
switching, and having no other relevant effect that I've thought of)
speeds up inl_from_qemu by ~30% on Sandy Bridge.  Would it make sense
to always use atomic EFER switching, at least when
cpu_has_load_ia32_efer?

--Andy
* Re: Seeking a KVM benchmark
From: Andy Lutomirski @ 2014-11-07 18:11 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: kvm list

On Fri, Nov 7, 2014 at 9:59 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> [...]
> Indeed, changing vmexit.c to disable NX (thereby forcing atomic EFER
> switching, and having no other relevant effect that I've thought of)
> speeds up inl_from_qemu by ~30% on Sandy Bridge.  Would it make sense
> to always use atomic EFER switching, at least when
> cpu_has_load_ia32_efer?

Digging in to the history suggests that I might be right.  There's this:

    commit 110312c84b5fbd4daf5de2417fa8ab5ec883858d
    Author: Avi Kivity <avi@redhat.com>
    Date:   Tue Dec 21 12:54:20 2010 +0200

        KVM: VMX: Optimize atomic EFER load

        When NX is enabled on the host but not on the guest, we use the
        entry/exit msr load facility, which is slow.  Optimize it to use
        entry/exit efer load, which is ~1200 cycles faster.

        Signed-off-by: Avi Kivity <avi@redhat.com>
        Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

The NX and atomic EFER heuristic seems to be considerably older than
that.  It could just be that no one ever noticed entry/exit efer load
becoming faster than wrmsr on modern hardware.

Someone should double-check that I'm not nuts here, though.

--Andy
* Re: Seeking a KVM benchmark
From: Gleb Natapov @ 2014-11-08 12:01 UTC (permalink / raw)
To: Andy Lutomirski; +Cc: Paolo Bonzini, kvm list

On Fri, Nov 07, 2014 at 09:59:55AM -0800, Andy Lutomirski wrote:
> [...]
> Indeed, changing vmexit.c to disable NX (thereby forcing atomic EFER
> switching, and having no other relevant effect that I've thought of)
> speeds up inl_from_qemu by ~30% on Sandy Bridge.  Would it make sense
> to always use atomic EFER switching, at least when
> cpu_has_load_ia32_efer?

The idea behind the current logic is that we want to avoid writing an
MSR at all for lightweight exits (those that do not exit to userspace).
So if the NX bit is the same for host and guest, we can avoid writing
EFER on exit and run with the guest's EFER in the kernel.  Only when a
userspace exit is required do we write the host's MSR back, and only if
the guest and host MSRs differ, of course.  What bit has to be restored
on userspace exit in the vmexit tests?  Is it SCE?  What if you set it
instead of unsetting NXE?

Your change reduced userspace exit cost by ~30%, but what about exits
to the kernel?  We have many more of those.

--
			Gleb.
* Re: Seeking a KVM benchmark
From: Andy Lutomirski @ 2014-11-08 16:00 UTC (permalink / raw)
To: Gleb Natapov; +Cc: kvm list, Paolo Bonzini

On Nov 8, 2014 4:01 AM, "Gleb Natapov" <gleb@kernel.org> wrote:
> [...]
> The idea behind current logic is that we want to avoid writing an MSR
> at all for lightweight exists (those that do not exit to userspace). So
> if NX bit is the same for host and guest we can avoid writing EFER on
> exit and run with guest's EFER in the kernel. But if userspace exit is
> required only then we write host's MSR back, only if guest and host MSRs
> are different of course. What bit should be restored on userspace exit
> in vmexit tests? Is it SCE? What if you set it instead of unsetting NXE?

I don't understand.  AFAICT there are really only two cases: EFER
switched atomically using the best available mechanism on the host
CPU, or EFER switched on userspace exit.  I think there's a
theoretical third possibility: if the guest and host EFER match, then
EFER doesn't need to be switched at all, but this doesn't seem to be
implemented.

> Your change reduced userspace exit cost by ~30%, but what about exit
> to kernel?  We have much more of those.

My KVM patch to change the heuristic didn't seem to slow down kernel
exits at all.  In fact, it got faster, but possibly not significantly.
This makes me suspect that the newer EFER entry/exit controls are
actually free.  This wouldn't surprise me all that much, since the
microcode has to fiddle with LME and such anyway, and just switching
the whole register could be easier than thinking about which bits to
switch.

My KVM patch and actual benchmarks are here:

http://article.gmane.org/gmane.linux.kernel/1824469

I used the wrong email address for you, and it doesn't seem to have
made it to the KVM list, though.

--Andy
* Re: Seeking a KVM benchmark
From: Andy Lutomirski @ 2014-11-08 16:44 UTC (permalink / raw)
To: Gleb Natapov; +Cc: kvm list, Paolo Bonzini

On Sat, Nov 8, 2014 at 8:00 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> [...]
> I don't understand.  AFAICT there are really only two cases: EFER
> switched atomically using the best available mechanism on the host
> CPU, or EFER switched on userspace exit.  I think there's a
> theoretical third possibility: if the guest and host EFER match, then
> EFER doesn't need to be switched at all, but this doesn't seem to be
> implemented.

I got this part wrong.  It looks like the user return notifier is
smart enough not to set EFER at all if the guest and host values
match.  Indeed, with stock KVM, if I modify vmexit.c to have exactly
the same EFER as the host (NX and SCE both set), then it runs quickly.
But I get almost exactly the same performance if NX is clear, which is
the case where the built-in entry/exit switching is used.

Admittedly, most guests probably do match the host, so this effect may
be rare in practice.  But possibly the code should be changed either
the way I patched it (always use the built-in switching if available)
or to only do it if the guest and host EFER values differ.  ISTM that,
on modern CPUs, switching EFER on return to userspace is always a big
loss.

If neither change is made, then maybe the test should change to set
SCE so that it isn't so misleadingly slow.

--Andy

--
Andy Lutomirski
AMA Capital Management, LLC
* Re: Seeking a KVM benchmark
From: Gleb Natapov @ 2014-11-09  8:52 UTC (permalink / raw)
To: Andy Lutomirski; +Cc: kvm list, Paolo Bonzini

On Sat, Nov 08, 2014 at 08:44:42AM -0800, Andy Lutomirski wrote:
> [...]
> I got this part wrong.  It looks like the user return notifier is
> smart enough not to set EFER at all if the guest and host values
> match.  Indeed, with stock KVM, if I modify vmexit.c to have exactly
> the same EFER as the host (NX and SCE both set), then it runs quickly.
> But I get almost exactly the same performance if NX is clear, which is
> the case where the built-in entry/exit switching is used.

What's the performance difference?

> Admittedly, most guests probably do match the host, so this effect may
> be rare in practice.  But possibly the code should be changed either
> the way I patched it (always use the built-in switching if available)
> or to only do it if the guest and host EFER values differ.  ISTM that,
> on modern CPUs, switching EFER on return to userspace is always a big
> loss.

We should be careful not to optimize for the wrong case.  In the
common case, userspace exits are extremely rare.  Try tracing common
workloads with a Linux guest.  Windows as a guest has its share of
userspace exits, but this is due to the lack of PV timer support (was
it fixed already?).  So if switching EFER has measurable overhead,
doing it on each exit is a net loss.

> If neither change is made, then maybe the test should change to set
> SCE so that it isn't so misleadingly slow.

The purpose of the vmexit test is to show us various overheads, so why
not measure the EFER switch overhead by having two tests, one with
equal EFER and another with different EFER, instead of hiding it?

--
			Gleb.
* Re: Seeking a KVM benchmark
From: Andy Lutomirski @ 2014-11-09 16:36 UTC (permalink / raw)
To: Gleb Natapov; +Cc: kvm list, Paolo Bonzini

On Sun, Nov 9, 2014 at 12:52 AM, Gleb Natapov <gleb@kernel.org> wrote:
> [...]
> What's the performance difference?

Negative.  That is, switching EFER atomically was faster than not
switching it at all.  But this could just be noise.

Here are the numbers comparing the status quo (SCE cleared in
vmexit.c, so switch on user return) vs. switching atomically at
entry/exit:

    Test                              Before     After    Change
    cpuid                               2000      1932    -3.40%
    vmcall                              1914      1817    -5.07%
    mov_from_cr8                          13        13     0.00%
    mov_to_cr8                            19        19     0.00%
    inl_from_pmtimer                   19164     10619   -44.59%
    inl_from_qemu                      15662     10302   -34.22%
    inl_from_kernel                     3916      3802    -2.91%
    outl_to_kernel                      2230      2194    -1.61%
    mov_dr                               172       176     2.33%
    ipi                            (skipped) (skipped)
    ipi+halt                       (skipped) (skipped)
    ple-round-robin                       13        13     0.00%
    wr_tsc_adjust_msr                   1920      1845    -3.91%
    rd_tsc_adjust_msr                   1892      1814    -4.12%
    mmio-no-eventfd:pci-mem            16394     11165   -31.90%
    mmio-wildcard-eventfd:pci-mem       4607      4645     0.82%
    mmio-datamatch-eventfd:pci-mem      4601      4610     0.20%
    portio-no-eventfd:pci-io           11507      7942   -30.98%
    portio-wildcard-eventfd:pci-io      2239      2225    -0.63%
    portio-datamatch-eventfd:pci-io     2250      2234    -0.71%

The tiny differences for the non-userspace exits could be just noise
or CPU temperature at the time or anything else.

> We should be careful to not optimise for a wrong case.  In common case
> userspace exits are extremely rare.  Try to trace common workloads with
> Linux guest.  Windows as a guest has its share of userspace exists, but
> this is due to the lack of PV timer support (was it fixed already?).
> So if switching EFER has measurable overhead doing it on each exit is a
> net loss.
>
> The purpose of vmexit test is to show us various overheads, so why not
> measure EFER switch overhead by having two tests one with equal EFER
> another with different EFER, instead of hiding it.

I'll try this.  We might need three tests, though: NX different, NX
same but SCE different, and all flags the same.

--Andy
* Re: Seeking a KVM benchmark
From: Paolo Bonzini @ 2014-11-10 10:03 UTC (permalink / raw)
To: Andy Lutomirski, Gleb Natapov; +Cc: kvm list

On 09/11/2014 17:36, Andy Lutomirski wrote:
>> The purpose of vmexit test is to show us various overheads, so why not
>> measure EFER switch overhead by having two tests one with equal EFER
>> another with different EFER, instead of hiding it.
>
> I'll try this.  We might need three tests, though: NX different, NX
> same but SCE different, and all flags the same.

The test actually explicitly enables NX in order to put itself in the
"common case":

    commit 82d4ccb9daf67885a0316b1d763ce5ace57cff36
    Author: Marcelo Tosatti <mtosatti@redhat.com>
    Date:   Tue Jun 8 15:33:29 2010 -0300

        test: vmexit: enable NX

        Enable NX to disable MSR autoload/save.  This is the common case
        anyway.

        Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
        Signed-off-by: Avi Kivity <avi@redhat.com>

(this commit is in qemu-kvm.git), so I guess forgetting to set SCE is
just a bug.  The results on my Xeon Sandy Bridge are very interesting:

    NX different            ~11.5k  (load/save EFER path)
    NX same, SCE different  ~19.5k  (urn path)
    all flags the same      ~10.2k

The inl_from_kernel results have absolutely no change, usually at most
5 cycles difference.  This could be because I've added the SCE=1
variant directly to vmexit.c, so I'm running the tests one next to the
other.

I tried making the other shared MSRs the same between guest and host
as well (STAR, LSTAR, CSTAR, SYSCALL_MASK), so that the user return
notifier has nothing to do.  That saves about 400-500 cycles on
inl_from_qemu.  I do want to dig out my old Core 2 and see how the new
test fares, but it really looks like your patch will be in 3.19.

Paolo
* Re: Seeking a KVM benchmark 2014-11-10 10:03 ` Paolo Bonzini @ 2014-11-10 10:45 ` Gleb Natapov 2014-11-10 12:15 ` Paolo Bonzini 2014-11-10 19:17 ` Andy Lutomirski 0 siblings, 2 replies; 30+ messages in thread From: Gleb Natapov @ 2014-11-10 10:45 UTC (permalink / raw) To: Paolo Bonzini; +Cc: Andy Lutomirski, kvm list On Mon, Nov 10, 2014 at 11:03:35AM +0100, Paolo Bonzini wrote: > > > On 09/11/2014 17:36, Andy Lutomirski wrote: > >> The purpose of vmexit test is to show us various overheads, so why not > >> measure EFER switch overhead by having two tests one with equal EFER > >> another with different EFER, instead of hiding it. > > > > I'll try this. We might need three tests, though: NX different, NX > > same but SCE different, and all flags the same. > > The test actually explicitly enables NX in order to put itself in the > "common case": > > commit 82d4ccb9daf67885a0316b1d763ce5ace57cff36 > Author: Marcelo Tosatti <mtosatti@redhat.com> > Date: Tue Jun 8 15:33:29 2010 -0300 > > test: vmexit: enable NX > > Enable NX to disable MSR autoload/save. This is the common case anyway. > > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> > Signed-off-by: Avi Kivity <avi@redhat.com> > > (this commit is in qemu-kvm.git), so I guess forgetting to set SCE is > just a bug. The results on my Xeon Sandy Bridge are very interesting: > > NX different ~11.5k (load/save EFER path) > NX same, SCE different ~19.5k (urn path) > all flags the same ~10.2k > > The inl_from_kernel results have absolutely no change, usually at most 5 > cycles difference. This could be because I've added the SCE=1 variant > directly to vmexit.c, so I'm running the tests one next to the other. > > I tried making also the other shared MSRs the same between guest and > host (STAR, LSTAR, CSTAR, SYSCALL_MASK), so that the user return notifier > has nothing to do. That saves about 4-500 cycles on inl_from_qemu. 
I > do want to dig out my old Core 2 and see how the new test fares, but it > really looks like your patch will be in 3.19. > Please test on wide variety of HW before final decision. Also it would be nice to ask Intel what is expected overhead. It is awesome if they manage to add EFER switching with non-measurable overhead, but also hard to believe :) Also Andy had an idea to disable switching in case host and guest EFERs are the same but IIRC his patch does not include it yet. -- Gleb. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Seeking a KVM benchmark 2014-11-10 10:45 ` Gleb Natapov @ 2014-11-10 12:15 ` Paolo Bonzini 2014-11-10 14:23 ` Avi Kivity 2014-11-11 11:07 ` Paolo Bonzini 2014-11-10 19:17 ` Andy Lutomirski 1 sibling, 2 replies; 30+ messages in thread From: Paolo Bonzini @ 2014-11-10 12:15 UTC (permalink / raw) To: Gleb Natapov; +Cc: Andy Lutomirski, kvm list On 10/11/2014 11:45, Gleb Natapov wrote: > > I tried making also the other shared MSRs the same between guest and > > host (STAR, LSTAR, CSTAR, SYSCALL_MASK), so that the user return notifier > > has nothing to do. That saves about 4-500 cycles on inl_from_qemu. I > > do want to dig out my old Core 2 and see how the new test fares, but it > > really looks like your patch will be in 3.19. > > Please test on wide variety of HW before final decision. Yes, definitely. > Also it would > be nice to ask Intel what is expected overhead. It is awesome if they > manage to add EFER switching with non-measurable overhead, but also hard > to believe :) So let's see what happens. Sneak preview: the result is definitely worth asking Intel about. I ran these benchmarks with a stock 3.16.6 KVM. Instead I patched kvm-unit-tests to set EFER.SCE in enable_nx. This makes it much simpler for others to reproduce the results. I only ran the inl_from_qemu test. Perf stat reports that the processor goes from 0.46 to 0.66 instructions per cycle, which is consistent with the improvement from 19k to 12k cycles per iteration. 
Unpatched KVM-unit-tests: 3,385,586,563 cycles # 3.189 GHz [83.25%] 2,475,979,685 stalled-cycles-frontend # 73.13% frontend cycles idle [83.37%] 2,083,556,270 stalled-cycles-backend # 61.54% backend cycles idle [66.71%] 1,573,854,041 instructions # 0.46 insns per cycle # 1.57 stalled cycles per insn [83.20%] 1.108486526 seconds time elapsed Patched KVM-unit-tests: 3,252,297,378 cycles # 3.147 GHz [83.32%] 2,010,266,184 stalled-cycles-frontend # 61.81% frontend cycles idle [83.36%] 1,560,371,769 stalled-cycles-backend # 47.98% backend cycles idle [66.51%] 2,133,698,018 instructions # 0.66 insns per cycle # 0.94 stalled cycles per insn [83.45%] 1.072395697 seconds time elapsed Playing with other events shows that the unpatched benchmark has an awful load of TLB misses Unpatched: 30,311 iTLB-loads 464,641,844 dTLB-loads 10,813,839 dTLB-load-misses # 2.33% of all dTLB cache hits 20,436,027 iTLB-load-misses # 67421.16% of all iTLB cache hits Patched: 1,440,033 iTLB-loads 640,970,836 dTLB-loads 2,345,112 dTLB-load-misses # 0.37% of all dTLB cache hits 270,884 iTLB-load-misses # 18.81% of all iTLB cache hits This is 100% reproducible. The meaning of the numbers is clearer if you look up the raw event numbers in the Intel manuals: - iTLB-loads is 85h/10h aka "perf -e r1085": "Number of cache load STLB [second-level TLB] hits. No page walk." - iTLB-load-misses is 85h/01h aka r185: "Misses in all ITLB levels that cause page walks." 
So for example event 85h/04h aka r485 ("Cycle PMH is busy with a walk.") and friends show that the unpatched KVM wastes about 0.1 seconds more than the patched KVM on page walks: Unpatched: 24,430,676 r449 (cycles on dTLB store miss page walks) 196,017,693 r408 (cycles on dTLB load miss page walks) 213,266,243 r485 (cycles on iTLB miss page walks) ------------------------- 433,714,612 total Patched: 22,583,440 r449 (cycles on dTLB store miss page walks) 40,452,018 r408 (cycles on dTLB load miss page walks) 2,115,981 r485 (cycles on iTLB miss page walks) ------------------------ 65,151,439 total These 0.1 seconds probably are all on instructions that would have been fast, since the slow instructions responsible for the low IPC are the microcoded instructions including VMX and other privileged stuff. Similarly, BDh/20h counts STLB flushes, which are 260k in unpatched KVM and 3k in patched KVM. Let's see where they come from: Unpatched: + 98.97% qemu-kvm [kernel.kallsyms] [k] native_write_msr_safe + 0.70% qemu-kvm [kernel.kallsyms] [k] page_fault It's expected that most TLB misses happen just before a page fault (there are also events to count how many TLB misses do result in a page fault, if you care about that), and thus are accounted to the first instruction of the exception handler. We do not know what causes second-level TLB _flushes_ but it's quite expected that you'll have a TLB miss after them and possibly a page fault. And anyway 98.97% of them coming from native_write_msr_safe is totally anomalous. 
A patched benchmark shows no second-level TLB flush occurs after a WRMSR: + 72.41% qemu-kvm [kernel.kallsyms] [k] page_fault + 9.07% qemu-kvm [kvm_intel] [k] vmx_flush_tlb + 6.60% qemu-kvm [kernel.kallsyms] [k] set_pte_vaddr_pud + 5.68% qemu-kvm [kernel.kallsyms] [k] flush_tlb_mm_range + 4.87% qemu-kvm [kernel.kallsyms] [k] native_flush_tlb + 1.36% qemu-kvm [kernel.kallsyms] [k] flush_tlb_page So basically VMX EFER writes are optimized, while non-VMX EFER writes cause a TLB flush, at least on a Sandy Bridge. Ouch! I'll try to reproduce on the Core 2 Duo soon, and inquire Intel about it. Paolo ^ permalink raw reply [flat|nested] 30+ messages in thread
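The rNNN tokens used throughout the measurements above are perf's raw hardware event syntax on x86: the low byte is the event select and the next byte the unit mask. A minimal sketch of the encoding (ignoring the edge/inv/cmask fields, which are zero for all events used here):

```python
def perf_raw(event: int, umask: int) -> str:
    """Build a perf raw event token ("rUUEE") from event select and umask."""
    return "r%x" % ((umask << 8) | event)

# The events discussed in this thread, matching the Intel SDM event tables:
assert perf_raw(0x85, 0x10) == "r1085"  # second-level TLB hits (iTLB-loads)
assert perf_raw(0x85, 0x01) == "r185"   # ITLB misses causing page walks
assert perf_raw(0x85, 0x04) == "r485"   # cycles of iTLB-miss page walks
assert perf_raw(0x49, 0x04) == "r449"   # cycles of dTLB store-miss page walks
assert perf_raw(0x08, 0x04) == "r408"   # cycles of dTLB load-miss page walks
```

With this mapping, "perf stat -e r485" reproduces the iTLB page-walk cycle counts quoted above.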
* Re: Seeking a KVM benchmark 2014-11-10 12:15 ` Paolo Bonzini @ 2014-11-10 14:23 ` Avi Kivity 2014-11-10 17:28 ` Paolo Bonzini 1 sibling, 1 reply; 30+ messages in thread From: Avi Kivity @ 2014-11-10 14:23 UTC (permalink / raw) To: Paolo Bonzini, Gleb Natapov; +Cc: Andy Lutomirski, kvm list On 11/10/2014 02:15 PM, Paolo Bonzini wrote: > > On 10/11/2014 11:45, Gleb Natapov wrote: >>> I tried making also the other shared MSRs the same between guest and >>> host (STAR, LSTAR, CSTAR, SYSCALL_MASK), so that the user return notifier >>> has nothing to do. That saves about 4-500 cycles on inl_from_qemu. I >>> do want to dig out my old Core 2 and see how the new test fares, but it >>> really looks like your patch will be in 3.19. >> Please test on wide variety of HW before final decision. > Yes, definitely. > >> Also it would >> be nice to ask Intel what is expected overhead. It is awesome if they >> manage to add EFER switching with non-measurable overhead, but also hard >> to believe :) > So let's see what happens. Sneak preview: the result is definitely worth > asking Intel about. > > I ran these benchmarks with a stock 3.16.6 KVM. Instead I patched > kvm-unit-tests to set EFER.SCE in enable_nx. This makes it much simpler > for others to reproduce the results. I only ran the inl_from_qemu test. > > Perf stat reports that the processor goes from 0.46 to 0.66 > instructions per cycle, which is consistent with the improvement from > 19k to 12k cycles per iteration. 
> > Unpatched KVM-unit-tests: > > 3,385,586,563 cycles # 3.189 GHz [83.25%] > 2,475,979,685 stalled-cycles-frontend # 73.13% frontend cycles idle [83.37%] > 2,083,556,270 stalled-cycles-backend # 61.54% backend cycles idle [66.71%] > 1,573,854,041 instructions # 0.46 insns per cycle > # 1.57 stalled cycles per insn [83.20%] > 1.108486526 seconds time elapsed > > > Patched KVM-unit-tests: > > 3,252,297,378 cycles # 3.147 GHz [83.32%] > 2,010,266,184 stalled-cycles-frontend # 61.81% frontend cycles idle [83.36%] > 1,560,371,769 stalled-cycles-backend # 47.98% backend cycles idle [66.51%] > 2,133,698,018 instructions # 0.66 insns per cycle > # 0.94 stalled cycles per insn [83.45%] > 1.072395697 seconds time elapsed > > Playing with other events shows that the unpatched benchmark has an > awful load of TLB misses > > Unpatched: > > 30,311 iTLB-loads > 464,641,844 dTLB-loads > 10,813,839 dTLB-load-misses # 2.33% of all dTLB cache hits > 20,436,027 iTLB-load-misses # 67421.16% of all iTLB cache hits > > Patched: > > 1,440,033 iTLB-loads > 640,970,836 dTLB-loads > 2,345,112 dTLB-load-misses # 0.37% of all dTLB cache hits > 270,884 iTLB-load-misses # 18.81% of all iTLB cache hits > > This is 100% reproducible. The meaning of the numbers is clearer if you > look up the raw event numbers in the Intel manuals: > > - iTLB-loads is 85h/10h aka "perf -e r1085": "Number of cache load STLB [second-level > TLB] hits. No page walk." > > - iTLB-load-misses is 85h/01h aka r185: "Misses in all ITLB levels that > cause page walks." 
> So for example event 85h/04h aka r485 ("Cycle PMH is busy with a walk.") and > friends show that the unpatched KVM wastes about 0.1 seconds more than > the patched KVM on page walks: > > Unpatched: > > 24,430,676 r449 (cycles on dTLB store miss page walks) > 196,017,693 r408 (cycles on dTLB load miss page walks) > 213,266,243 r485 (cycles on iTLB miss page walks) > ------------------------- > 433,714,612 total > > Patched: > > 22,583,440 r449 (cycles on dTLB store miss page walks) > 40,452,018 r408 (cycles on dTLB load miss page walks) > 2,115,981 r485 (cycles on iTLB miss page walks) > ------------------------ > 65,151,439 total > > These 0.1 seconds probably are all on instructions that would have been > fast, since the slow instructions responsible for the low IPC are the > microcoded instructions including VMX and other privileged stuff. > > Similarly, BDh/20h counts STLB flushes, which are 260k in unpatched KVM > and 3k in patched KVM. Let's see where they come from: > > Unpatched: > > + 98.97% qemu-kvm [kernel.kallsyms] [k] native_write_msr_safe > + 0.70% qemu-kvm [kernel.kallsyms] [k] page_fault > > It's expected that most TLB misses happen just before a page fault (there > are also events to count how many TLB misses do result in a page fault, > if you care about that), and thus are accounted to the first instruction of the > exception handler. > > We do not know what causes second-level TLB _flushes_ but it's quite > expected that you'll have a TLB miss after them and possibly a page fault. > And anyway 98.97% of them coming from native_write_msr_safe is totally > anomalous. 
> > A patched benchmark shows no second-level TLB flush occurs after a WRMSR: > > + 72.41% qemu-kvm [kernel.kallsyms] [k] page_fault > + 9.07% qemu-kvm [kvm_intel] [k] vmx_flush_tlb > + 6.60% qemu-kvm [kernel.kallsyms] [k] set_pte_vaddr_pud > + 5.68% qemu-kvm [kernel.kallsyms] [k] flush_tlb_mm_range > + 4.87% qemu-kvm [kernel.kallsyms] [k] native_flush_tlb > + 1.36% qemu-kvm [kernel.kallsyms] [k] flush_tlb_page > > > So basically VMX EFER writes are optimized, while non-VMX EFER writes > cause a TLB flush, at least on a Sandy Bridge. Ouch! > It's not surprising [1]. Since the meaning of some PTE bits change [2], the TLB has to be flushed. In VMX we have VPIDs, so we only need to flush if EFER changed between two invocations of the same VPID, which isn't the case. [1] after the fact [2] although those bits were reserved with NXE=0, so they shouldn't have any TLB footprint ^ permalink raw reply [flat|nested] 30+ messages in thread
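For reference, the EFER bits being toggled in these experiments sit at fixed positions (per the Intel SDM and AMD APM). A small sketch of the "NX differs" test that decides between the two paths in the vmx.c heuristic quoted at the top of the thread (the helper name is illustrative, not from the sources):

```python
# EFER bit positions (Intel SDM vol. 3 / AMD APM vol. 2).
EFER_SCE = 1 << 0    # SYSCALL/SYSRET enable
EFER_LME = 1 << 8    # long mode enable
EFER_LMA = 1 << 10   # long mode active
EFER_NXE = 1 << 11   # no-execute enable (EFER_NX in the kernel sources)

def nx_differs(guest_efer: int, host_efer: int) -> bool:
    # With EPT enabled, the quoted heuristic takes the atomic
    # add_atomic_switch_msr() path only when this is true.
    return bool((guest_efer ^ host_efer) & EFER_NXE)

host = EFER_SCE | EFER_LME | EFER_LMA | EFER_NXE
assert nx_differs(host & ~EFER_NXE, host)      # NX toggled: load/save EFER
assert not nx_differs(host & ~EFER_SCE, host)  # only SCE differs: urn path
```

This is why a guest that sets NX but forgets SCE lands on the slower user-return-notifier path even though only bit 0 differs.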
* Re: Seeking a KVM benchmark 2014-11-10 14:23 ` Avi Kivity @ 2014-11-10 17:28 ` Paolo Bonzini 2014-11-10 17:38 ` Gleb Natapov 2014-11-17 11:17 ` Wanpeng Li 0 siblings, 2 replies; 30+ messages in thread From: Paolo Bonzini @ 2014-11-10 17:28 UTC (permalink / raw) To: Avi Kivity, Gleb Natapov; +Cc: Andy Lutomirski, kvm list On 10/11/2014 15:23, Avi Kivity wrote: > It's not surprising [1]. Since the meaning of some PTE bits change [2], > the TLB has to be flushed. In VMX we have VPIDs, so we only need to flush > if EFER changed between two invocations of the same VPID, which isn't the > case. > > [1] after the fact > [2] although those bits were reserved with NXE=0, so they shouldn't have > any TLB footprint You're right that this is not that surprising after the fact, and that both Sandy Bridge and Ivy Bridge have VPIDs (even the non-Xeon ones). This is also why I'm curious about the Nehalem. However note that even toggling the SCE bit is flushing the TLB. The NXE bit is not being toggled here! That's the more surprising part. Paolo ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Seeking a KVM benchmark 2014-11-10 17:28 ` Paolo Bonzini @ 2014-11-10 17:38 ` Gleb Natapov 2014-11-12 11:33 ` Paolo Bonzini 0 siblings, 1 reply; 30+ messages in thread From: Gleb Natapov @ 2014-11-10 17:38 UTC (permalink / raw) To: Paolo Bonzini; +Cc: Avi Kivity, Andy Lutomirski, kvm list On Mon, Nov 10, 2014 at 06:28:25PM +0100, Paolo Bonzini wrote: > On 10/11/2014 15:23, Avi Kivity wrote: > > It's not surprising [1]. Since the meaning of some PTE bits change [2], > > the TLB has to be flushed. In VMX we have VPIDs, so we only need to flush > > if EFER changed between two invocations of the same VPID, which isn't the > > case. > > > > [1] after the fact > > [2] although those bits were reserved with NXE=0, so they shouldn't have > > any TLB footprint > > You're right that this is not that surprising after the fact, and that > both Sandy Bridge and Ivy Bridge have VPIDs (even the non-Xeon ones). > This is also why I'm curious about the Nehalem. > > However note that even toggling the SCE bit is flushing the TLB. The > NXE bit is not being toggled here! That's the more surprising part. > Just a guess, but maybe because writing EFER is not something that happens often in regular OSes, it is not optimized to handle different bits differently. -- Gleb. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Seeking a KVM benchmark 2014-11-10 17:38 ` Gleb Natapov @ 2014-11-12 11:33 ` Paolo Bonzini 2014-11-12 15:22 ` Gleb Natapov 0 siblings, 1 reply; 30+ messages in thread From: Paolo Bonzini @ 2014-11-12 11:33 UTC (permalink / raw) To: Gleb Natapov; +Cc: Avi Kivity, Andy Lutomirski, kvm list On 10/11/2014 18:38, Gleb Natapov wrote: > On Mon, Nov 10, 2014 at 06:28:25PM +0100, Paolo Bonzini wrote: >> On 10/11/2014 15:23, Avi Kivity wrote: >>> It's not surprising [1]. Since the meaning of some PTE bits change [2], >>> the TLB has to be flushed. In VMX we have VPIDs, so we only need to flush >>> if EFER changed between two invocations of the same VPID, which isn't the >>> case. >>> >>> [1] after the fact >>> [2] although those bits were reserved with NXE=0, so they shouldn't have >>> any TLB footprint >> >> You're right that this is not that surprising after the fact, and that >> both Sandy Bridge and Ivy Bridge have VPIDs (even the non-Xeon ones). >> This is also why I'm curious about the Nehalem. >> >> However note that even toggling the SCE bit is flushing the TLB. The >> NXE bit is not being toggled here! That's the more surprising part. >> > Just a guess, but may be because writing EFER is not something that happens > often in regular OSes it is not optimized to handle different bits differently. Yes, that's what Intel said too. Nehalem results: userspace exit, urn 17560 17726 17628 17572 17417 lightweight exit, urn 3316 3342 3342 3319 3328 userspace exit, LOAD_EFER, guest!=host 12200 11772 12130 12164 12327 lightweight exit, LOAD_EFER, guest!=host 3214 3220 3238 3218 3337 userspace exit, LOAD_EFER, guest=host 11983 11780 11920 11919 12040 lightweight exit, LOAD_EFER, guest=host 3178 3193 3193 3187 3220 So the benchmark results also explain why skipping the LOAD_EFER does not give a benefit for guest EFER=host EFER. Paolo ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Seeking a KVM benchmark 2014-11-12 11:33 ` Paolo Bonzini @ 2014-11-12 15:22 ` Gleb Natapov 2014-11-12 15:26 ` Paolo Bonzini 0 siblings, 1 reply; 30+ messages in thread From: Gleb Natapov @ 2014-11-12 15:22 UTC (permalink / raw) To: Paolo Bonzini; +Cc: Avi Kivity, Andy Lutomirski, kvm list On Wed, Nov 12, 2014 at 12:33:32PM +0100, Paolo Bonzini wrote: > > > On 10/11/2014 18:38, Gleb Natapov wrote: > > On Mon, Nov 10, 2014 at 06:28:25PM +0100, Paolo Bonzini wrote: > >> On 10/11/2014 15:23, Avi Kivity wrote: > >>> It's not surprising [1]. Since the meaning of some PTE bits change [2], > >>> the TLB has to be flushed. In VMX we have VPIDs, so we only need to flush > >>> if EFER changed between two invocations of the same VPID, which isn't the > >>> case. > >>> > >>> [1] after the fact > >>> [2] although those bits were reserved with NXE=0, so they shouldn't have > >>> any TLB footprint > >> > >> You're right that this is not that surprising after the fact, and that > >> both Sandy Bridge and Ivy Bridge have VPIDs (even the non-Xeon ones). > >> This is also why I'm curious about the Nehalem. > >> > >> However note that even toggling the SCE bit is flushing the TLB. The > >> NXE bit is not being toggled here! That's the more surprising part. > >> > > Just a guess, but may be because writing EFER is not something that happens > > often in regular OSes it is not optimized to handle different bits differently. > > Yes, that's what Intel said too. > > Nehalem results: > > userspace exit, urn 17560 17726 17628 17572 17417 > lightweight exit, urn 3316 3342 3342 3319 3328 > userspace exit, LOAD_EFER, guest!=host 12200 11772 12130 12164 12327 > lightweight exit, LOAD_EFER, guest!=host 3214 3220 3238 3218 3337 > userspace exit, LOAD_EFER, guest=host 11983 11780 11920 11919 12040 > lightweight exit, LOAD_EFER, guest=host 3178 3193 3193 3187 3220 > Is this with Andy's patch that skips LOAD_EFER when guest=host, or the one that always switch LOAD_EFER? -- Gleb. 
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Seeking a KVM benchmark 2014-11-12 15:22 ` Gleb Natapov @ 2014-11-12 15:26 ` Paolo Bonzini 2014-11-12 15:32 ` Gleb Natapov 0 siblings, 1 reply; 30+ messages in thread From: Paolo Bonzini @ 2014-11-12 15:26 UTC (permalink / raw) To: Gleb Natapov; +Cc: Avi Kivity, Andy Lutomirski, kvm list On 12/11/2014 16:22, Gleb Natapov wrote: >> > Nehalem results: >> > >> > userspace exit, urn 17560 17726 17628 17572 17417 >> > lightweight exit, urn 3316 3342 3342 3319 3328 >> > userspace exit, LOAD_EFER, guest!=host 12200 11772 12130 12164 12327 >> > lightweight exit, LOAD_EFER, guest!=host 3214 3220 3238 3218 3337 >> > userspace exit, LOAD_EFER, guest=host 11983 11780 11920 11919 12040 >> > lightweight exit, LOAD_EFER, guest=host 3178 3193 3193 3187 3220 >> > > Is this with Andy's patch that skips LOAD_EFER when guest=host, or the one > that always switch LOAD_EFER? Skip LOAD_EFER when guest=host. Paolo ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Seeking a KVM benchmark 2014-11-12 15:26 ` Paolo Bonzini @ 2014-11-12 15:32 ` Gleb Natapov 2014-11-12 15:51 ` Paolo Bonzini 0 siblings, 1 reply; 30+ messages in thread From: Gleb Natapov @ 2014-11-12 15:32 UTC (permalink / raw) To: Paolo Bonzini; +Cc: Avi Kivity, Andy Lutomirski, kvm list On Wed, Nov 12, 2014 at 04:26:29PM +0100, Paolo Bonzini wrote: > > > On 12/11/2014 16:22, Gleb Natapov wrote: > >> > Nehalem results: > >> > > >> > userspace exit, urn 17560 17726 17628 17572 17417 > >> > lightweight exit, urn 3316 3342 3342 3319 3328 > >> > userspace exit, LOAD_EFER, guest!=host 12200 11772 12130 12164 12327 > >> > lightweight exit, LOAD_EFER, guest!=host 3214 3220 3238 3218 3337 > >> > userspace exit, LOAD_EFER, guest=host 11983 11780 11920 11919 12040 > >> > lightweight exit, LOAD_EFER, guest=host 3178 3193 3193 3187 3220 > >> > > > Is this with Andy's patch that skips LOAD_EFER when guest=host, or the one > > that always switch LOAD_EFER? > > Skip LOAD_EFER when guest=host. > So guest=host is a little bit better than guest!=host so looks like skipping LOAD_EFER helps, but why "lightweight exit, urn" worse than guest=host though, it should be exactly the same as long as NX bit is the same in urn test, no? -- Gleb. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Seeking a KVM benchmark 2014-11-12 15:32 ` Gleb Natapov @ 2014-11-12 15:51 ` Paolo Bonzini 2014-11-12 16:07 ` Andy Lutomirski 0 siblings, 1 reply; 30+ messages in thread From: Paolo Bonzini @ 2014-11-12 15:51 UTC (permalink / raw) To: Gleb Natapov; +Cc: Avi Kivity, Andy Lutomirski, kvm list On 12/11/2014 16:32, Gleb Natapov wrote: > > > > userspace exit, urn 17560 17726 17628 17572 17417 > > > > lightweight exit, urn 3316 3342 3342 3319 3328 > > > > userspace exit, LOAD_EFER, guest!=host 12200 11772 12130 12164 12327 > > > > lightweight exit, LOAD_EFER, guest!=host 3214 3220 3238 3218 3337 > > > > userspace exit, LOAD_EFER, guest=host 11983 11780 11920 11919 12040 > > > > lightweight exit, LOAD_EFER, guest=host 3178 3193 3193 3187 3220 > > > > > > Is this with Andy's patch that skips LOAD_EFER when guest=host, or the one > > > that always switch LOAD_EFER? > > > > Skip LOAD_EFER when guest=host. > > So guest=host is a little bit better than guest!=host so looks like > skipping LOAD_EFER helps, but why "lightweight exit, urn" worse than > guest=host though, it should be exactly the same as long as NX bit is > the same in urn test, no? I don't know---it is very much reproducible though. It is not my machine so I cannot run perf on it, but I can try to find a similar one in the next few days. Paolo ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Seeking a KVM benchmark 2014-11-12 15:51 ` Paolo Bonzini @ 2014-11-12 16:07 ` Andy Lutomirski 2014-11-12 17:56 ` Paolo Bonzini 0 siblings, 1 reply; 30+ messages in thread From: Andy Lutomirski @ 2014-11-12 16:07 UTC (permalink / raw) To: Paolo Bonzini; +Cc: Gleb Natapov, Avi Kivity, kvm list On Wed, Nov 12, 2014 at 7:51 AM, Paolo Bonzini <pbonzini@redhat.com> wrote: > > > On 12/11/2014 16:32, Gleb Natapov wrote: >> > > > userspace exit, urn 17560 17726 17628 17572 17417 >> > > > lightweight exit, urn 3316 3342 3342 3319 3328 >> > > > userspace exit, LOAD_EFER, guest!=host 12200 11772 12130 12164 12327 >> > > > lightweight exit, LOAD_EFER, guest!=host 3214 3220 3238 3218 3337 >> > > > userspace exit, LOAD_EFER, guest=host 11983 11780 11920 11919 12040 >> > > > lightweight exit, LOAD_EFER, guest=host 3178 3193 3193 3187 3220 >> > > >> > > Is this with Andy's patch that skips LOAD_EFER when guest=host, or the one >> > > that always switch LOAD_EFER? >> > >> > Skip LOAD_EFER when guest=host. >> >> So guest=host is a little bit better than guest!=host so looks like >> skipping LOAD_EFER helps, but why "lightweight exit, urn" worse than >> guest=host though, it should be exactly the same as long as NX bit is >> the same in urn test, no? > > I don't know---it is very much reproducible though. It is not my > machine so I cannot run perf on it, but I can try to find a similar one > in the next few days. Assuming you're running both of my patches (LOAD_EFER regardless of nx, but skip LOAD_EFER if guest == host), then some of the speedup may be just less code running. I haven't figured out exactly when vmx_save_host_state runs, but my patches avoid a call to kvm_set_shared_msr, which is worth a few cycles. --Andy > > Paolo -- Andy Lutomirski AMA Capital Management, LLC ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Seeking a KVM benchmark 2014-11-12 16:07 ` Andy Lutomirski @ 2014-11-12 17:56 ` Paolo Bonzini 0 siblings, 0 replies; 30+ messages in thread From: Paolo Bonzini @ 2014-11-12 17:56 UTC (permalink / raw) To: Andy Lutomirski; +Cc: Gleb Natapov, Avi Kivity, kvm list > Assuming you're running both of my patches (LOAD_EFER regardless of > nx, but skip LOAD_EFER if guest == host), then some of the speedup may > be just less code running. I haven't figured out exactly when > vmx_save_host_state runs, but my patches avoid a call to > kvm_set_shared_msr, which is worth a few cycles. Yes, that's possible. vmx_save_host_state is here:

	preempt_disable();
	kvm_x86_ops->prepare_guest_switch(vcpu);   // <<<<
	if (vcpu->fpu_active)
		kvm_load_guest_fpu(vcpu);
	kvm_load_guest_xcr0(vcpu);

	vcpu->mode = IN_GUEST_MODE;

	srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);

and it's a fairly hot function. Paolo ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Seeking a KVM benchmark 2014-11-10 17:28 ` Paolo Bonzini 2014-11-10 17:38 ` Gleb Natapov @ 2014-11-17 11:17 ` Wanpeng Li 2014-11-17 11:18 ` Paolo Bonzini 1 sibling, 1 reply; 30+ messages in thread From: Wanpeng Li @ 2014-11-17 11:17 UTC (permalink / raw) To: Paolo Bonzini, Avi Kivity, Gleb Natapov; +Cc: Andy Lutomirski, kvm list Hi Paolo, On 11/11/14, 1:28 AM, Paolo Bonzini wrote: > On 10/11/2014 15:23, Avi Kivity wrote: >> It's not surprising [1]. Since the meaning of some PTE bits change [2], >> the TLB has to be flushed. In VMX we have VPIDs, so we only need to flush >> if EFER changed between two invocations of the same VPID, which isn't the >> case. If there need a TLB flush if guest is UP? Regards, Wanpeng Li >> >> [1] after the fact >> [2] although those bits were reserved with NXE=0, so they shouldn't have >> any TLB footprint > You're right that this is not that surprising after the fact, and that > both Sandy Bridge and Ivy Bridge have VPIDs (even the non-Xeon ones). > This is also why I'm curious about the Nehalem. > > However note that even toggling the SCE bit is flushing the TLB. The > NXE bit is not being toggled here! That's the more surprising part. > > Paolo > -- > To unsubscribe from this list: send the line "unsubscribe kvm" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Seeking a KVM benchmark 2014-11-17 11:17 ` Wanpeng Li @ 2014-11-17 11:18 ` Paolo Bonzini 2014-11-17 12:00 ` Wanpeng Li 0 siblings, 1 reply; 30+ messages in thread From: Paolo Bonzini @ 2014-11-17 11:18 UTC (permalink / raw) To: Wanpeng Li, Avi Kivity, Gleb Natapov; +Cc: Andy Lutomirski, kvm list On 17/11/2014 12:17, Wanpeng Li wrote: >> >>> It's not surprising [1]. Since the meaning of some PTE bits change [2], >>> the TLB has to be flushed. In VMX we have VPIDs, so we only need to flush >>> if EFER changed between two invocations of the same VPID, which isn't >>> the case. > > If there need a TLB flush if guest is UP? The wrmsr is in the host, and the TLB flush is done in the processor microcode. Paolo ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Seeking a KVM benchmark 2014-11-17 11:18 ` Paolo Bonzini @ 2014-11-17 12:00 ` Wanpeng Li 2014-11-17 12:04 ` Paolo Bonzini 0 siblings, 1 reply; 30+ messages in thread From: Wanpeng Li @ 2014-11-17 12:00 UTC (permalink / raw) To: Paolo Bonzini, Avi Kivity, Gleb Natapov; +Cc: Andy Lutomirski, kvm list Hi Paolo, On 11/17/14, 7:18 PM, Paolo Bonzini wrote: > > On 17/11/2014 12:17, Wanpeng Li wrote: >>>> It's not surprising [1]. Since the meaning of some PTE bits change [2], >>>> the TLB has to be flushed. In VMX we have VPIDs, so we only need to flush >>>> if EFER changed between two invocations of the same VPID, which isn't >>>> the case. >> If there need a TLB flush if guest is UP? > The wrmsr is in the host, and the TLB flush is done in the processor > microcode. Sorry, maybe I didn't state my question clearly. As Avi mentioned above "In VMX we have VPIDs, so we only need to flush if EFER changed between two invocations of the same VPID", so there is only one VPID if the guest is UP, my question is if there need a TLB flush when guest's EFER has been changed? Regards, Wanpeng Li > > Paolo ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Seeking a KVM benchmark 2014-11-17 12:00 ` Wanpeng Li @ 2014-11-17 12:04 ` Paolo Bonzini 2014-11-17 12:14 ` Wanpeng Li 0 siblings, 1 reply; 30+ messages in thread From: Paolo Bonzini @ 2014-11-17 12:04 UTC (permalink / raw) To: Wanpeng Li, Avi Kivity, Gleb Natapov; +Cc: Andy Lutomirski, kvm list On 17/11/2014 13:00, Wanpeng Li wrote: > Sorry, maybe I didn't state my question clearly. As Avi mentioned above > "In VMX we have VPIDs, so we only need to flush if EFER changed between > two invocations of the same VPID", so there is only one VPID if the > guest is UP, my question is if there need a TLB flush when guest's EFER > has been changed? Yes, because the meaning of the page table entries has changed. Paolo ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Seeking a KVM benchmark 2014-11-17 12:04 ` Paolo Bonzini @ 2014-11-17 12:14 ` Wanpeng Li 2014-11-17 12:22 ` Paolo Bonzini 0 siblings, 1 reply; 30+ messages in thread From: Wanpeng Li @ 2014-11-17 12:14 UTC (permalink / raw) To: Paolo Bonzini, Avi Kivity, Gleb Natapov; +Cc: Andy Lutomirski, kvm list Hi Paolo, On 11/17/14, 8:04 PM, Paolo Bonzini wrote: > > On 17/11/2014 13:00, Wanpeng Li wrote: >> Sorry, maybe I didn't state my question clearly. As Avi mentioned above >> "In VMX we have VPIDs, so we only need to flush if EFER changed between >> two invocations of the same VPID", so there is only one VPID if the >> guest is UP, my question is if there need a TLB flush when guest's EFER >> has been changed? > Yes, because the meaning of the page table entries has changed. So both VMX EFER writes and non-VMX EFER writes cause a TLB flush for UP guest, is there still a performance improvement in this case? Regards, Wanpeng Li > > Paolo ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Seeking a KVM benchmark 2014-11-17 12:14 ` Wanpeng Li @ 2014-11-17 12:22 ` Paolo Bonzini 0 siblings, 0 replies; 30+ messages in thread From: Paolo Bonzini @ 2014-11-17 12:22 UTC (permalink / raw) To: Wanpeng Li, Avi Kivity, Gleb Natapov; +Cc: Andy Lutomirski, kvm list On 17/11/2014 13:14, Wanpeng Li wrote: >> >>> Sorry, maybe I didn't state my question clearly. As Avi mentioned above >>> "In VMX we have VPIDs, so we only need to flush if EFER changed between >>> two invocations of the same VPID", so there is only one VPID if the >>> guest is UP, my question is if there need a TLB flush when guest's EFER >>> has been changed? >> Yes, because the meaning of the page table entries has changed. > > So both VMX EFER writes and non-VMX EFER writes cause a TLB flush for UP > guest, is there still a performance improvement in this case? Note that the guest's EFER does not change, so no TLB flush happens. The guest EFER, however, is different from the host's, so if you change it with a wrmsr in the host you will get a TLB flush on every userspace exit. Paolo ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Seeking a KVM benchmark 2014-11-10 12:15 ` Paolo Bonzini 2014-11-10 14:23 ` Avi Kivity @ 2014-11-11 11:07 ` Paolo Bonzini 1 sibling, 0 replies; 30+ messages in thread From: Paolo Bonzini @ 2014-11-11 11:07 UTC (permalink / raw) To: Gleb Natapov; +Cc: Andy Lutomirski, kvm list On 10/11/2014 13:15, Paolo Bonzini wrote: > > > On 10/11/2014 11:45, Gleb Natapov wrote: >>> I tried making also the other shared MSRs the same between guest and >>> host (STAR, LSTAR, CSTAR, SYSCALL_MASK), so that the user return notifier >>> has nothing to do. That saves about 4-500 cycles on inl_from_qemu. I >>> do want to dig out my old Core 2 and see how the new test fares, but it >>> really looks like your patch will be in 3.19. >> >> Please test on wide variety of HW before final decision. > > Yes, definitely. I've reproduced Andy's results on Ivy Bridge: NX off ~6900 cycles (EFER) NX on, SCE off ~14600 cycles (urn) NX on, SCE on ~6900 cycles (same value) I also asked Intel about clarifications. On Core 2 Duo the results are weird. There is no LOAD_EFER control, so Andy's patch does not apply and the only interesting paths are urn and same value. The pessimization of EFER writes does _seem_ to be there, since I can profile for iTLB flushes (r4082 on this microarchitecture) and get: 0.14% qemu-kvm [kernel.kallsyms] [k] native_write_msr_safe 0.14% qemu-kvm [kernel.kallsyms] [k] native_flush_tlb but these are the top two results and it is not clear to me why perf only records them as "0.14%"... Also, this machine has no EPT, so virt suffers a lot from TLB misses anyway. Nevertheless I tried running kvm-unit-tests with different values of the MSRs to see what's the behavior. 
                                    NX=1/SCE=0  NX=1/SCE=1  all MSRs equal
    cpuid                                 3374        3448            3608
    vmcall                                3274        3337            3478
    mov_from_cr8                            11          11              11
    mov_to_cr8                              15          15              15
    inl_from_pmtimer                     17803       16346           15156
    inl_from_qemu                        17858       16375           15163
    inl_from_kernel                       6351        6492            6622
    outl_to_kernel                        3850        3900            4053
    mov_dr                                 116         116             117
    ple-round-robin                         15          16              16
    wr_tsc_adjust_msr                     3334        3417            3570
    rd_tsc_adjust_msr                     3374        3404            3605
    mmio-no-eventfd:pci-mem              19188       17866           16660
    mmio-wildcard-eventfd:pci-mem         7319        7414            7595
    mmio-datamatch-eventfd:pci-mem        7304        7470            7605
    portio-no-eventfd:pci-io             13219       11780           10447
    portio-wildcard-eventfd:pci-io        3951        4024            4149
    portio-datamatch-eventfd:pci-io       3940        4026            4228

In the last column, all shared MSRs are equal (*) between host and guest.
The difference is very noisy on newer processors, but quite visible on
the older processor.  It is weird, though, that the light-weight exits
become _more_ expensive as more MSRs are equal between guest and host.
Anyhow, this is more of a curiosity, since the proposed patch has no
effect here.

Next will come Nehalem.  Nehalem has both LOAD_EFER and EPT, so it's
already a good target.  I can test Westmere too, as soon as I find
someone that has it, but it shouldn't give surprises.

Paolo

(*) run this:

    #! /usr/bin/env python
    # Python 2 script; reading /dev/cpu/0/msr needs root and the msr module.
    class msr(object):
        def __init__(self):
            try:
                self.f = open('/dev/cpu/0/msr', 'r', 0)
            except:
                self.f = open('/dev/msr0', 'r', 0)
        def read(self, index, default = None):
            import struct
            self.f.seek(index)
            try:
                return struct.unpack('Q', self.f.read(8))[0]
            except:
                return default

    m = msr()
    for i in [0xc0000080, 0xc0000081, 0xc0000082, 0xc0000083, 0xc0000084]:
        print ("wrmsr(0x%x, 0x%x);" % (i, m.read(i)))

and add the result to the enable_nx function.

^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: Seeking a KVM benchmark
  2014-11-10 10:45 ` Gleb Natapov
  2014-11-10 12:15   ` Paolo Bonzini
@ 2014-11-10 19:17   ` Andy Lutomirski
  1 sibling, 0 replies; 30+ messages in thread
From: Andy Lutomirski @ 2014-11-10 19:17 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Paolo Bonzini, kvm list

On Mon, Nov 10, 2014 at 2:45 AM, Gleb Natapov <gleb@kernel.org> wrote:
> On Mon, Nov 10, 2014 at 11:03:35AM +0100, Paolo Bonzini wrote:
>>
>>
>> On 09/11/2014 17:36, Andy Lutomirski wrote:
>> >> The purpose of the vmexit test is to show us various overheads, so why
>> >> not measure the EFER switch overhead by having two tests, one with
>> >> equal EFER and another with different EFER, instead of hiding it.
>> >
>> > I'll try this.  We might need three tests, though: NX different, NX
>> > same but SCE different, and all flags the same.
>>
>> The test actually explicitly enables NX in order to put itself in the
>> "common case":
>>
>>     commit 82d4ccb9daf67885a0316b1d763ce5ace57cff36
>>     Author: Marcelo Tosatti <mtosatti@redhat.com>
>>     Date:   Tue Jun 8 15:33:29 2010 -0300
>>
>>         test: vmexit: enable NX
>>
>>         Enable NX to disable MSR autoload/save. This is the common case anyway.
>>
>>         Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
>>         Signed-off-by: Avi Kivity <avi@redhat.com>
>>
>> (this commit is in qemu-kvm.git), so I guess forgetting to set SCE is
>> just a bug.  The results on my Xeon Sandy Bridge are very interesting:
>>
>>     NX different            ~11.5k  (load/save EFER path)
>>     NX same, SCE different  ~19.5k  (urn path)
>>     all flags the same      ~10.2k
>>
>> The inl_from_kernel results show absolutely no change, usually at most 5
>> cycles of difference.  This could be because I've added the SCE=1 variant
>> directly to vmexit.c, so I'm running the tests one next to the other.
>>
>> I tried making also the other shared MSRs the same between guest and
>> host (STAR, LSTAR, CSTAR, SYSCALL_MASK), so that the user return notifier
>> has nothing to do.  That saves about 400-500 cycles on inl_from_qemu.  I
>> do want to dig out my old Core 2 and see how the new test fares, but it
>> really looks like your patch will be in 3.19.
>>
> Please test on a wide variety of HW before the final decision.  Also it
> would be nice to ask Intel what the expected overhead is.  It would be
> awesome if they managed to add EFER switching with no measurable
> overhead, but that is also hard to believe :)  Also, Andy had an idea to
> disable switching in case host and guest EFERs are the same, but IIRC
> his patch does not include it yet.

I'll send that patch as a followup in a sec.  It doesn't seem to make a
difference, which reinforces my hypothesis that microcode is fiddling
with EFER on entry and exit anyway to handle LME and LMA, so adjusting
the other bits doesn't affect performance.

--Andy

>
> --
>             Gleb.

--
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 30+ messages in thread
end of thread, other threads:[~2014-11-17 12:22 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-11-07  6:27 Seeking a KVM benchmark Andy Lutomirski
2014-11-07  7:17 ` Paolo Bonzini
2014-11-07 17:59 ` Andy Lutomirski
2014-11-07 18:11 ` Andy Lutomirski
2014-11-08 12:01 ` Gleb Natapov
2014-11-08 16:00 ` Andy Lutomirski
2014-11-08 16:44 ` Andy Lutomirski
2014-11-09  8:52 ` Gleb Natapov
2014-11-09 16:36 ` Andy Lutomirski
2014-11-10 10:03 ` Paolo Bonzini
2014-11-10 10:45 ` Gleb Natapov
2014-11-10 12:15 ` Paolo Bonzini
2014-11-10 14:23 ` Avi Kivity
2014-11-10 17:28 ` Paolo Bonzini
2014-11-10 17:38 ` Gleb Natapov
2014-11-12 11:33 ` Paolo Bonzini
2014-11-12 15:22 ` Gleb Natapov
2014-11-12 15:26 ` Paolo Bonzini
2014-11-12 15:32 ` Gleb Natapov
2014-11-12 15:51 ` Paolo Bonzini
2014-11-12 16:07 ` Andy Lutomirski
2014-11-12 17:56 ` Paolo Bonzini
2014-11-17 11:17 ` Wanpeng Li
2014-11-17 11:18 ` Paolo Bonzini
2014-11-17 12:00 ` Wanpeng Li
2014-11-17 12:04 ` Paolo Bonzini
2014-11-17 12:14 ` Wanpeng Li
2014-11-17 12:22 ` Paolo Bonzini
2014-11-11 11:07 ` Paolo Bonzini
2014-11-10 19:17 ` Andy Lutomirski