* Seeking a KVM benchmark
From: Andy Lutomirski @ 2014-11-07  6:27 UTC (permalink / raw)
To: kvm list

Is there an easy benchmark that's sensitive to the time it takes to
round-trip from userspace to guest and back to userspace?  I think I
may have a big speedup.

--Andy

--
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: Seeking a KVM benchmark
From: Paolo Bonzini @ 2014-11-07  7:17 UTC (permalink / raw)
To: Andy Lutomirski, kvm list

On 07/11/2014 07:27, Andy Lutomirski wrote:
> Is there an easy benchmark that's sensitive to the time it takes to
> round-trip from userspace to guest and back to userspace?  I think I
> may have a big speedup.

The simplest is vmexit.flat from
git://git.kernel.org/pub/scm/virt/kvm/kvm-unit-tests.git

Run it with "x86/run x86/vmexit.flat" and look at the inl_from_qemu
benchmark.

Paolo
* Re: Seeking a KVM benchmark
From: Andy Lutomirski @ 2014-11-07 17:59 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: kvm list

On Thu, Nov 6, 2014 at 11:17 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> [...]
> The simplest is vmexit.flat from
> git://git.kernel.org/pub/scm/virt/kvm/kvm-unit-tests.git
>
> Run it with "x86/run x86/vmexit.flat" and look at the inl_from_qemu
> benchmark.

Thanks!

That test case is slower than I expected.  I think my change is likely
to save somewhat under 100ns, which is only a couple percent.  I'll
look for more impressive improvements.

On a barely related note, in the process of poking around with this
test, I noticed:

	/* On ept, can't emulate nx, and must switch nx atomically */
	if (enable_ept && ((vmx->vcpu.arch.efer ^ host_efer) & EFER_NX)) {
		guest_efer = vmx->vcpu.arch.efer;
		if (!(guest_efer & EFER_LMA))
			guest_efer &= ~EFER_LME;
		add_atomic_switch_msr(vmx, MSR_EFER, guest_efer, host_efer);
		return false;
	}

	return true;

This heuristic seems wrong to me.  wrmsr is serializing and therefore
extremely slow, whereas I imagine that, on CPUs that support it,
atomically switching EFER ought to be reasonably fast.

Indeed, changing vmexit.c to disable NX (thereby forcing atomic EFER
switching, and having no other relevant effect that I've thought of)
speeds up inl_from_qemu by ~30% on Sandy Bridge.  Would it make sense
to always use atomic EFER switching, at least when
cpu_has_load_ia32_efer?

--Andy
* Re: Seeking a KVM benchmark
From: Andy Lutomirski @ 2014-11-07 18:11 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: kvm list

On Fri, Nov 7, 2014 at 9:59 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> [...]
> Indeed, changing vmexit.c to disable NX (thereby forcing atomic EFER
> switching, and having no other relevant effect that I've thought of)
> speeds up inl_from_qemu by ~30% on Sandy Bridge.  Would it make sense
> to always use atomic EFER switching, at least when
> cpu_has_load_ia32_efer?

Digging in to the history suggests that I might be right.  There's this:

    commit 110312c84b5fbd4daf5de2417fa8ab5ec883858d
    Author: Avi Kivity <avi@redhat.com>
    Date:   Tue Dec 21 12:54:20 2010 +0200

        KVM: VMX: Optimize atomic EFER load

        When NX is enabled on the host but not on the guest, we use the
        entry/exit msr load facility, which is slow.  Optimize it to use
        entry/exit efer load, which is ~1200 cycles faster.

        Signed-off-by: Avi Kivity <avi@redhat.com>
        Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

The NX and atomic EFER heuristic seems to be considerably older than
that.  It could just be that no one ever noticed entry/exit efer load
becoming faster than wrmsr on modern hardware.

Someone should double-check that I'm not nuts here, though.

--Andy
* Re: Seeking a KVM benchmark
From: Gleb Natapov @ 2014-11-08 12:01 UTC (permalink / raw)
To: Andy Lutomirski; +Cc: Paolo Bonzini, kvm list

On Fri, Nov 07, 2014 at 09:59:55AM -0800, Andy Lutomirski wrote:
> [...]
> Indeed, changing vmexit.c to disable NX (thereby forcing atomic EFER
> switching, and having no other relevant effect that I've thought of)
> speeds up inl_from_qemu by ~30% on Sandy Bridge.  Would it make sense
> to always use atomic EFER switching, at least when
> cpu_has_load_ia32_efer?

The idea behind the current logic is that we want to avoid writing an
MSR at all for lightweight exits (those that do not exit to userspace).
So if the NX bit is the same for host and guest, we can avoid writing
EFER on exit and run with the guest's EFER in the kernel.  Only when a
userspace exit is required do we write the host's MSR back, and only if
the guest and host MSRs differ, of course.  What bit has to be restored
on userspace exit in the vmexit tests?  Is it SCE?  What if you set it
instead of unsetting NXE?

Your change reduced userspace exit cost by ~30%, but what about exits
to the kernel?  We have many more of those.

--
			Gleb.
* Re: Seeking a KVM benchmark
From: Andy Lutomirski @ 2014-11-08 16:00 UTC (permalink / raw)
To: Gleb Natapov; +Cc: kvm list, Paolo Bonzini

On Nov 8, 2014 4:01 AM, "Gleb Natapov" <gleb@kernel.org> wrote:
> [...]
> The idea behind current logic is that we want to avoid writing an MSR
> at all for lightweight exists (those that do not exit to userspace). So
> if NX bit is the same for host and guest we can avoid writing EFER on
> exit and run with guest's EFER in the kernel. But if userspace exit is
> required only then we write host's MSR back, only if guest and host MSRs
> are different of course. What bit should be restored on userspace exit
> in vmexit tests? Is it SCE? What if you set it instead of unsetting NXE?

I don't understand.  AFAICT there are really only two cases: EFER
switched atomically using the best available mechanism on the host
CPU, or EFER switched on userspace exit.  I think there's a
theoretical third possibility: if the guest and host EFER match, then
EFER doesn't need to be switched at all, but this doesn't seem to be
implemented.

> Your change reduced userspace exit cost by ~30%, but what about exit
> to kernel?  We have much more of those.

My KVM patch to change the heuristic didn't seem to slow down kernel
exits at all.  In fact, it got faster, but possibly not significantly.
This makes me suspect that the newer EFER entry/exit controls are
actually free.  This wouldn't surprise me all that much, since the
microcode has to fiddle with LME and such anyway, and just switching
the whole register could be easier than thinking about which bits to
switch.

My KVM patch and actual benchmarks are here:

http://article.gmane.org/gmane.linux.kernel/1824469

I used the wrong email address for you, and it doesn't seem to have
made it to the KVM list, though.

--Andy
* Re: Seeking a KVM benchmark
From: Andy Lutomirski @ 2014-11-08 16:44 UTC (permalink / raw)
To: Gleb Natapov; +Cc: kvm list, Paolo Bonzini

On Sat, Nov 8, 2014 at 8:00 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> [...]
> I don't understand.  AFAICT there are really only two cases: EFER
> switched atomically using the best available mechanism on the host
> CPU, or EFER switched on userspace exit.  I think there's a
> theoretical third possibility: if the guest and host EFER match, then
> EFER doesn't need to be switched at all, but this doesn't seem to be
> implemented.

I got this part wrong.  It looks like the user return notifier is
smart enough not to set EFER at all if the guest and host values
match.  Indeed, with stock KVM, if I modify vmexit.c to have exactly
the same EFER as the host (NX and SCE both set), then it runs quickly.
But I get almost exactly the same performance if NX is clear, which is
the case where the built-in entry/exit switching is used.

Admittedly, most guests probably do match the host, so this effect may
be rare in practice.  But possibly the code should be changed either
the way I patched it (always use the built-in switching if available)
or to only do it if the guest and host EFER values differ.  ISTM that,
on modern CPUs, switching EFER on return to userspace is always a big
loss.

If neither change is made, then maybe the test should change to set
SCE so that it isn't so misleadingly slow.

--Andy

--
Andy Lutomirski
AMA Capital Management, LLC
* Re: Seeking a KVM benchmark
From: Gleb Natapov @ 2014-11-09  8:52 UTC (permalink / raw)
To: Andy Lutomirski; +Cc: kvm list, Paolo Bonzini

On Sat, Nov 08, 2014 at 08:44:42AM -0800, Andy Lutomirski wrote:
> [...]
> I got this part wrong.  It looks like the user return notifier is
> smart enough not to set EFER at all if the guest and host values
> match.  Indeed, with stock KVM, if I modify vmexit.c to have exactly
> the same EFER as the host (NX and SCE both set), then it runs quickly.
> But I get almost exactly the same performance if NX is clear, which is
> the case where the built-in entry/exit switching is used.

What's the performance difference?

> Admittedly, most guests probably do match the host, so this effect may
> be rare in practice.  But possibly the code should be changed either
> the way I patched it (always use the built-in switching if available)
> or to only do it if the guest and host EFER values differ.  ISTM that,
> on modern CPUs, switching EFER on return to userspace is always a big
> loss.

We should be careful not to optimize for the wrong case.  In the
common case, userspace exits are extremely rare.  Try tracing common
workloads with a Linux guest.  Windows as a guest has its share of
userspace exits, but this is due to the lack of PV timer support (was
it fixed already?).  So if switching EFER has measurable overhead,
doing it on each exit is a net loss.

> If neither change is made, then maybe the test should change to set
> SCE so that it isn't so misleadingly slow.

The purpose of the vmexit test is to show us various overheads, so why
not measure the EFER switch overhead by having two tests, one with
equal EFER and another with different EFER, instead of hiding it?

--
			Gleb.
* Re: Seeking a KVM benchmark
From: Andy Lutomirski @ 2014-11-09 16:36 UTC (permalink / raw)
To: Gleb Natapov; +Cc: kvm list, Paolo Bonzini

On Sun, Nov 9, 2014 at 12:52 AM, Gleb Natapov <gleb@kernel.org> wrote:
> [...]
> What's the performance difference?

Negative.  That is, switching EFER atomically was faster than not
switching it at all.  But this could just be noise.

Here are the numbers comparing the status quo (SCE cleared in
vmexit.c, so switch on user return) vs. switching atomically at
entry/exit:

    Test                              Before     After    Change
    cpuid                               2000      1932    -3.40%
    vmcall                              1914      1817    -5.07%
    mov_from_cr8                          13        13     0.00%
    mov_to_cr8                            19        19     0.00%
    inl_from_pmtimer                   19164     10619   -44.59%
    inl_from_qemu                      15662     10302   -34.22%
    inl_from_kernel                     3916      3802    -2.91%
    outl_to_kernel                      2230      2194    -1.61%
    mov_dr                               172       176     2.33%
    ipi                            (skipped) (skipped)
    ipi+halt                       (skipped) (skipped)
    ple-round-robin                       13        13     0.00%
    wr_tsc_adjust_msr                   1920      1845    -3.91%
    rd_tsc_adjust_msr                   1892      1814    -4.12%
    mmio-no-eventfd:pci-mem            16394     11165   -31.90%
    mmio-wildcard-eventfd:pci-mem       4607      4645     0.82%
    mmio-datamatch-eventfd:pci-mem      4601      4610     0.20%
    portio-no-eventfd:pci-io           11507      7942   -30.98%
    portio-wildcard-eventfd:pci-io      2239      2225    -0.63%
    portio-datamatch-eventfd:pci-io     2250      2234    -0.71%

The tiny differences for the non-userspace exits could be just noise
or CPU temperature at the time or anything else.

> We should be careful to not optimise for a wrong case.  In common case
> userspace exits are extremely rare.  Try to trace common workloads with
> Linux guest.  Windows as a guest has its share of userspace exists, but
> this is due to the lack of PV timer support (was it fixed already?).
> So if switching EFER has measurable overhead doing it on each exit is a
> net loss.
>
> The purpose of vmexit test is to show us various overheads, so why not
> measure EFER switch overhead by having two tests one with equal EFER
> another with different EFER, instead of hiding it.

I'll try this.  We might need three tests, though: NX different, NX
same but SCE different, and all flags the same.

--Andy
* Re: Seeking a KVM benchmark
From: Paolo Bonzini @ 2014-11-10 10:03 UTC (permalink / raw)
To: Andy Lutomirski, Gleb Natapov; +Cc: kvm list

On 09/11/2014 17:36, Andy Lutomirski wrote:
>> The purpose of vmexit test is to show us various overheads, so why not
>> measure EFER switch overhead by having two tests one with equal EFER
>> another with different EFER, instead of hiding it.
>
> I'll try this.  We might need three tests, though: NX different, NX
> same but SCE different, and all flags the same.

The test actually explicitly enables NX in order to put itself in the
"common case":

    commit 82d4ccb9daf67885a0316b1d763ce5ace57cff36
    Author: Marcelo Tosatti <mtosatti@redhat.com>
    Date:   Tue Jun 8 15:33:29 2010 -0300

        test: vmexit: enable NX

        Enable NX to disable MSR autoload/save.  This is the common case
        anyway.

        Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
        Signed-off-by: Avi Kivity <avi@redhat.com>

(this commit is in qemu-kvm.git), so I guess forgetting to set SCE is
just a bug.  The results on my Xeon Sandy Bridge are very interesting:

    NX different            ~11.5k  (load/save EFER path)
    NX same, SCE different  ~19.5k  (urn path)
    all flags the same      ~10.2k

The inl_from_kernel results have absolutely no change, usually at most
5 cycles difference.  This could be because I've added the SCE=1
variant directly to vmexit.c, so I'm running the tests one next to the
other.

I tried making the other shared MSRs the same between guest and host
as well (STAR, LSTAR, CSTAR, SYSCALL_MASK), so that the user return
notifier has nothing to do.  That saves about 400-500 cycles on
inl_from_qemu.  I do want to dig out my old Core 2 and see how the new
test fares, but it really looks like your patch will be in 3.19.

Paolo
* Re: Seeking a KVM benchmark 2014-11-10 10:03 ` Paolo Bonzini @ 2014-11-10 10:45 ` Gleb Natapov 2014-11-10 12:15 ` Paolo Bonzini 2014-11-10 19:17 ` Andy Lutomirski 0 siblings, 2 replies; 30+ messages in thread From: Gleb Natapov @ 2014-11-10 10:45 UTC (permalink / raw) To: Paolo Bonzini; +Cc: Andy Lutomirski, kvm list On Mon, Nov 10, 2014 at 11:03:35AM +0100, Paolo Bonzini wrote: > > > On 09/11/2014 17:36, Andy Lutomirski wrote: > >> The purpose of vmexit test is to show us various overheads, so why not > >> measure EFER switch overhead by having two tests one with equal EFER > >> another with different EFER, instead of hiding it. > > > > I'll try this. We might need three tests, though: NX different, NX > > same but SCE different, and all flags the same. > > The test actually explicitly enables NX in order to put itself in the > "common case": > > commit 82d4ccb9daf67885a0316b1d763ce5ace57cff36 > Author: Marcelo Tosatti <mtosatti@redhat.com> > Date: Tue Jun 8 15:33:29 2010 -0300 > > test: vmexit: enable NX > > Enable NX to disable MSR autoload/save. This is the common case anyway. > > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> > Signed-off-by: Avi Kivity <avi@redhat.com> > > (this commit is in qemu-kvm.git), so I guess forgetting to set SCE is > just a bug. The results on my Xeon Sandy Bridge are very interesting: > > NX different ~11.5k (load/save EFER path) > NX same, SCE different ~19.5k (urn path) > all flags the same ~10.2k > > The inl_from_kernel results have absolutely no change, usually at most 5 > cycles difference. This could be because I've added the SCE=1 variant > directly to vmexit.c, so I'm running the tests one next to the other. > > I tried making also the other shared MSRs the same between guest and > host (STAR, LSTAR, CSTAR, SYSCALL_MASK), so that the user return notifier > has nothing to do. That saves about 4-500 cycles on inl_from_qemu. 
I > do want to dig out my old Core 2 and see how the new test fares, but it > really looks like your patch will be in 3.19. > Please test on wide variety of HW before final decision. Also it would be nice to ask Intel what is expected overhead. It is awesome if they manage to add EFER switching with non-measurable overhead, but also hard to believe :) Also Andy had an idea to disable switching in case host and guest EFERs are the same but IIRC his patch does not include it yet. -- Gleb. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Seeking a KVM benchmark 2014-11-10 10:45 ` Gleb Natapov @ 2014-11-10 12:15 ` Paolo Bonzini 2014-11-10 14:23 ` Avi Kivity 2014-11-11 11:07 ` Paolo Bonzini 2014-11-10 19:17 ` Andy Lutomirski 1 sibling, 2 replies; 30+ messages in thread From: Paolo Bonzini @ 2014-11-10 12:15 UTC (permalink / raw) To: Gleb Natapov; +Cc: Andy Lutomirski, kvm list On 10/11/2014 11:45, Gleb Natapov wrote: > > I tried making also the other shared MSRs the same between guest and > > host (STAR, LSTAR, CSTAR, SYSCALL_MASK), so that the user return notifier > > has nothing to do. That saves about 4-500 cycles on inl_from_qemu. I > > do want to dig out my old Core 2 and see how the new test fares, but it > > really looks like your patch will be in 3.19. > > Please test on wide variety of HW before final decision. Yes, definitely. > Also it would > be nice to ask Intel what is expected overhead. It is awesome if they > manage to add EFER switching with non-measurable overhead, but also hard > to believe :) So let's see what happens. Sneak preview: the result is definitely worth asking Intel about. I ran these benchmarks with a stock 3.16.6 KVM. Instead I patched kvm-unit-tests to set EFER.SCE in enable_nx. This makes it much simpler for others to reproduce the results. I only ran the inl_from_qemu test. Perf stat reports that the processor goes from 0.46 to 0.66 instructions per cycle, which is consistent with the improvement from 19k to 12k cycles per iteration. 
Unpatched KVM-unit-tests: 3,385,586,563 cycles # 3.189 GHz [83.25%] 2,475,979,685 stalled-cycles-frontend # 73.13% frontend cycles idle [83.37%] 2,083,556,270 stalled-cycles-backend # 61.54% backend cycles idle [66.71%] 1,573,854,041 instructions # 0.46 insns per cycle # 1.57 stalled cycles per insn [83.20%] 1.108486526 seconds time elapsed Patched KVM-unit-tests: 3,252,297,378 cycles # 3.147 GHz [83.32%] 2,010,266,184 stalled-cycles-frontend # 61.81% frontend cycles idle [83.36%] 1,560,371,769 stalled-cycles-backend # 47.98% backend cycles idle [66.51%] 2,133,698,018 instructions # 0.66 insns per cycle # 0.94 stalled cycles per insn [83.45%] 1.072395697 seconds time elapsed Playing with other events shows that the unpatched benchmark has an awful load of TLB misses Unpatched: 30,311 iTLB-loads 464,641,844 dTLB-loads 10,813,839 dTLB-load-misses # 2.33% of all dTLB cache hits 20,436,027 iTLB-load-misses # 67421.16% of all iTLB cache hits Patched: 1,440,033 iTLB-loads 640,970,836 dTLB-loads 2,345,112 dTLB-load-misses # 0.37% of all dTLB cache hits 270,884 iTLB-load-misses # 18.81% of all iTLB cache hits This is 100% reproducible. The meaning of the numbers is clearer if you look up the raw event numbers in the Intel manuals: - iTLB-loads is 85h/10h aka "perf -e r1085": "Number of cache load STLB [second-level TLB] hits. No page walk." - iTLB-load-misses is 85h/01h aka r185: "Misses in all ITLB levels that cause page walks." 
So for example event 85h/04h aka r485 ("Cycle PMH is busy with a walk.") and friends show that the unpatched KVM wastes about 0.1 seconds more than the patched KVM on page walks: Unpatched: 24,430,676 r449 (cycles on dTLB store miss page walks) 196,017,693 r408 (cycles on dTLB load miss page walks) 213,266,243 r485 (cycles on iTLB miss page walks) ------------------------- 433,714,612 total Patched: 22,583,440 r449 (cycles on dTLB store miss page walks) 40,452,018 r408 (cycles on dTLB load miss page walks) 2,115,981 r485 (cycles on iTLB miss page walks) ------------------------ 65,151,439 total These 0.1 seconds probably are all on instructions that would have been fast, since the slow instructions responsible for the low IPC are the microcoded instructions including VMX and other privileged stuff. Similarly, BDh/20h counts STLB flushes, which are 260k in unpatched KVM and 3k in patched KVM. Let's see where they come from: Unpatched: + 98.97% qemu-kvm [kernel.kallsyms] [k] native_write_msr_safe + 0.70% qemu-kvm [kernel.kallsyms] [k] page_fault It's expected that most TLB misses happen just before a page fault (there are also events to count how many TLB misses do result in a page fault, if you care about that), and thus are accounted to the first instruction of the exception handler. We do not know what causes second-level TLB _flushes_ but it's quite expected that you'll have a TLB miss after them and possibly a page fault. And anyway 98.97% of them coming from native_write_msr_safe is totally anomalous. 
A patched benchmark shows no second-level TLB flush occurs after a WRMSR: + 72.41% qemu-kvm [kernel.kallsyms] [k] page_fault + 9.07% qemu-kvm [kvm_intel] [k] vmx_flush_tlb + 6.60% qemu-kvm [kernel.kallsyms] [k] set_pte_vaddr_pud + 5.68% qemu-kvm [kernel.kallsyms] [k] flush_tlb_mm_range + 4.87% qemu-kvm [kernel.kallsyms] [k] native_flush_tlb + 1.36% qemu-kvm [kernel.kallsyms] [k] flush_tlb_page So basically VMX EFER writes are optimized, while non-VMX EFER writes cause a TLB flush, at least on a Sandy Bridge. Ouch! I'll try to reproduce on the Core 2 Duo soon, and inquire Intel about it. Paolo ^ permalink raw reply [flat|nested] 30+ messages in thread
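The rNNN tokens used throughout the measurements above are perf's raw hardware event syntax on x86: the low byte is the event select and the next byte the unit mask. A minimal sketch of the encoding (ignoring the edge/inv/cmask fields, which are zero for all events used here):

```python
def perf_raw(event: int, umask: int) -> str:
    """Build a perf raw event token ("rUUEE") from event select and umask."""
    return "r%x" % ((umask << 8) | event)

# The events discussed in this thread, matching the Intel SDM event tables:
assert perf_raw(0x85, 0x10) == "r1085"  # second-level TLB hits (iTLB-loads)
assert perf_raw(0x85, 0x01) == "r185"   # ITLB misses causing page walks
assert perf_raw(0x85, 0x04) == "r485"   # cycles of iTLB-miss page walks
assert perf_raw(0x49, 0x04) == "r449"   # cycles of dTLB store-miss page walks
assert perf_raw(0x08, 0x04) == "r408"   # cycles of dTLB load-miss page walks
```

With this mapping, "perf stat -e r485" reproduces the iTLB page-walk cycle counts quoted above.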
* Re: Seeking a KVM benchmark 2014-11-10 12:15 ` Paolo Bonzini @ 2014-11-10 14:23 ` Avi Kivity 2014-11-10 17:28 ` Paolo Bonzini 1 sibling, 1 reply; 30+ messages in thread From: Avi Kivity @ 2014-11-10 14:23 UTC (permalink / raw) To: Paolo Bonzini, Gleb Natapov; +Cc: Andy Lutomirski, kvm list On 11/10/2014 02:15 PM, Paolo Bonzini wrote: > > On 10/11/2014 11:45, Gleb Natapov wrote: >>> I tried making also the other shared MSRs the same between guest and >>> host (STAR, LSTAR, CSTAR, SYSCALL_MASK), so that the user return notifier >>> has nothing to do. That saves about 4-500 cycles on inl_from_qemu. I >>> do want to dig out my old Core 2 and see how the new test fares, but it >>> really looks like your patch will be in 3.19. >> Please test on wide variety of HW before final decision. > Yes, definitely. > >> Also it would >> be nice to ask Intel what is expected overhead. It is awesome if they >> manage to add EFER switching with non-measurable overhead, but also hard >> to believe :) > So let's see what happens. Sneak preview: the result is definitely worth > asking Intel about. > > I ran these benchmarks with a stock 3.16.6 KVM. Instead I patched > kvm-unit-tests to set EFER.SCE in enable_nx. This makes it much simpler > for others to reproduce the results. I only ran the inl_from_qemu test. > > Perf stat reports that the processor goes from 0.46 to 0.66 > instructions per cycle, which is consistent with the improvement from > 19k to 12k cycles per iteration. 
> > Unpatched KVM-unit-tests: > > 3,385,586,563 cycles # 3.189 GHz [83.25%] > 2,475,979,685 stalled-cycles-frontend # 73.13% frontend cycles idle [83.37%] > 2,083,556,270 stalled-cycles-backend # 61.54% backend cycles idle [66.71%] > 1,573,854,041 instructions # 0.46 insns per cycle > # 1.57 stalled cycles per insn [83.20%] > 1.108486526 seconds time elapsed > > > Patched KVM-unit-tests: > > 3,252,297,378 cycles # 3.147 GHz [83.32%] > 2,010,266,184 stalled-cycles-frontend # 61.81% frontend cycles idle [83.36%] > 1,560,371,769 stalled-cycles-backend # 47.98% backend cycles idle [66.51%] > 2,133,698,018 instructions # 0.66 insns per cycle > # 0.94 stalled cycles per insn [83.45%] > 1.072395697 seconds time elapsed > > Playing with other events shows that the unpatched benchmark has an > awful load of TLB misses > > Unpatched: > > 30,311 iTLB-loads > 464,641,844 dTLB-loads > 10,813,839 dTLB-load-misses # 2.33% of all dTLB cache hits > 20,436,027 iTLB-load-misses # 67421.16% of all iTLB cache hits > > Patched: > > 1,440,033 iTLB-loads > 640,970,836 dTLB-loads > 2,345,112 dTLB-load-misses # 0.37% of all dTLB cache hits > 270,884 iTLB-load-misses # 18.81% of all iTLB cache hits > > This is 100% reproducible. The meaning of the numbers is clearer if you > look up the raw event numbers in the Intel manuals: > > - iTLB-loads is 85h/10h aka "perf -e r1085": "Number of cache load STLB [second-level > TLB] hits. No page walk." > > - iTLB-load-misses is 85h/01h aka r185: "Misses in all ITLB levels that > cause page walks." 
> So for example event 85h/04h aka r485 ("Cycle PMH is busy with a walk.") and > friends show that the unpatched KVM wastes about 0.1 seconds more than > the patched KVM on page walks: > > Unpatched: > > 24,430,676 r449 (cycles on dTLB store miss page walks) > 196,017,693 r408 (cycles on dTLB load miss page walks) > 213,266,243 r485 (cycles on iTLB miss page walks) > ------------------------- > 433,714,612 total > > Patched: > > 22,583,440 r449 (cycles on dTLB store miss page walks) > 40,452,018 r408 (cycles on dTLB load miss page walks) > 2,115,981 r485 (cycles on iTLB miss page walks) > ------------------------ > 65,151,439 total > > These 0.1 seconds probably are all on instructions that would have been > fast, since the slow instructions responsible for the low IPC are the > microcoded instructions including VMX and other privileged stuff. > > Similarly, BDh/20h counts STLB flushes, which are 260k in unpatched KVM > and 3k in patched KVM. Let's see where they come from: > > Unpatched: > > + 98.97% qemu-kvm [kernel.kallsyms] [k] native_write_msr_safe > + 0.70% qemu-kvm [kernel.kallsyms] [k] page_fault > > It's expected that most TLB misses happen just before a page fault (there > are also events to count how many TLB misses do result in a page fault, > if you care about that), and thus are accounted to the first instruction of the > exception handler. > > We do not know what causes second-level TLB _flushes_ but it's quite > expected that you'll have a TLB miss after them and possibly a page fault. > And anyway 98.97% of them coming from native_write_msr_safe is totally > anomalous. 
> > A patched benchmark shows no second-level TLB flush occurs after a WRMSR: > > + 72.41% qemu-kvm [kernel.kallsyms] [k] page_fault > + 9.07% qemu-kvm [kvm_intel] [k] vmx_flush_tlb > + 6.60% qemu-kvm [kernel.kallsyms] [k] set_pte_vaddr_pud > + 5.68% qemu-kvm [kernel.kallsyms] [k] flush_tlb_mm_range > + 4.87% qemu-kvm [kernel.kallsyms] [k] native_flush_tlb > + 1.36% qemu-kvm [kernel.kallsyms] [k] flush_tlb_page > > > So basically VMX EFER writes are optimized, while non-VMX EFER writes > cause a TLB flush, at least on a Sandy Bridge. Ouch! > It's not surprising [1]. Since the meaning of some PTE bits change [2], the TLB has to be flushed. In VMX we have VPIDs, so we only need to flush if EFER changed between two invocations of the same VPID, which isn't the case. [1] after the fact [2] although those bits were reserved with NXE=0, so they shouldn't have any TLB footprint ^ permalink raw reply [flat|nested] 30+ messages in thread
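For reference, the EFER bits being toggled in these experiments sit at fixed positions (per the Intel SDM and AMD APM). A small sketch of the "NX differs" test that decides between the two paths in the vmx.c heuristic quoted at the top of the thread (the helper name is illustrative, not from the sources):

```python
# EFER bit positions (Intel SDM vol. 3 / AMD APM vol. 2).
EFER_SCE = 1 << 0    # SYSCALL/SYSRET enable
EFER_LME = 1 << 8    # long mode enable
EFER_LMA = 1 << 10   # long mode active
EFER_NXE = 1 << 11   # no-execute enable (EFER_NX in the kernel sources)

def nx_differs(guest_efer: int, host_efer: int) -> bool:
    # With EPT enabled, the quoted heuristic takes the atomic
    # add_atomic_switch_msr() path only when this is true.
    return bool((guest_efer ^ host_efer) & EFER_NXE)

host = EFER_SCE | EFER_LME | EFER_LMA | EFER_NXE
assert nx_differs(host & ~EFER_NXE, host)      # NX toggled: load/save EFER
assert not nx_differs(host & ~EFER_SCE, host)  # only SCE differs: urn path
```

This is why a guest that sets NX but forgets SCE lands on the slower user-return-notifier path even though only bit 0 differs.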
* Re: Seeking a KVM benchmark 2014-11-10 14:23 ` Avi Kivity @ 2014-11-10 17:28 ` Paolo Bonzini 2014-11-10 17:38 ` Gleb Natapov 2014-11-17 11:17 ` Wanpeng Li 0 siblings, 2 replies; 30+ messages in thread From: Paolo Bonzini @ 2014-11-10 17:28 UTC (permalink / raw) To: Avi Kivity, Gleb Natapov; +Cc: Andy Lutomirski, kvm list On 10/11/2014 15:23, Avi Kivity wrote: > It's not surprising [1]. Since the meaning of some PTE bits change [2], > the TLB has to be flushed. In VMX we have VPIDs, so we only need to flush > if EFER changed between two invocations of the same VPID, which isn't the > case. > > [1] after the fact > [2] although those bits were reserved with NXE=0, so they shouldn't have > any TLB footprint You're right that this is not that surprising after the fact, and that both Sandy Bridge and Ivy Bridge have VPIDs (even the non-Xeon ones). This is also why I'm curious about the Nehalem. However note that even toggling the SCE bit is flushing the TLB. The NXE bit is not being toggled here! That's the more surprising part. Paolo ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Seeking a KVM benchmark 2014-11-10 17:28 ` Paolo Bonzini @ 2014-11-10 17:38 ` Gleb Natapov 2014-11-12 11:33 ` Paolo Bonzini 0 siblings, 1 reply; 30+ messages in thread From: Gleb Natapov @ 2014-11-10 17:38 UTC (permalink / raw) To: Paolo Bonzini; +Cc: Avi Kivity, Andy Lutomirski, kvm list On Mon, Nov 10, 2014 at 06:28:25PM +0100, Paolo Bonzini wrote: > On 10/11/2014 15:23, Avi Kivity wrote: > > It's not surprising [1]. Since the meaning of some PTE bits change [2], > > the TLB has to be flushed. In VMX we have VPIDs, so we only need to flush > > if EFER changed between two invocations of the same VPID, which isn't the > > case. > > > > [1] after the fact > > [2] although those bits were reserved with NXE=0, so they shouldn't have > > any TLB footprint > > You're right that this is not that surprising after the fact, and that > both Sandy Bridge and Ivy Bridge have VPIDs (even the non-Xeon ones). > This is also why I'm curious about the Nehalem. > > However note that even toggling the SCE bit is flushing the TLB. The > NXE bit is not being toggled here! That's the more surprising part. > Just a guess, but maybe because writing EFER is not something that happens often in regular OSes, it is not optimized to handle different bits differently. -- Gleb. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Seeking a KVM benchmark 2014-11-10 17:38 ` Gleb Natapov @ 2014-11-12 11:33 ` Paolo Bonzini 2014-11-12 15:22 ` Gleb Natapov 0 siblings, 1 reply; 30+ messages in thread From: Paolo Bonzini @ 2014-11-12 11:33 UTC (permalink / raw) To: Gleb Natapov; +Cc: Avi Kivity, Andy Lutomirski, kvm list On 10/11/2014 18:38, Gleb Natapov wrote: > On Mon, Nov 10, 2014 at 06:28:25PM +0100, Paolo Bonzini wrote: >> On 10/11/2014 15:23, Avi Kivity wrote: >>> It's not surprising [1]. Since the meaning of some PTE bits change [2], >>> the TLB has to be flushed. In VMX we have VPIDs, so we only need to flush >>> if EFER changed between two invocations of the same VPID, which isn't the >>> case. >>> >>> [1] after the fact >>> [2] although those bits were reserved with NXE=0, so they shouldn't have >>> any TLB footprint >> >> You're right that this is not that surprising after the fact, and that >> both Sandy Bridge and Ivy Bridge have VPIDs (even the non-Xeon ones). >> This is also why I'm curious about the Nehalem. >> >> However note that even toggling the SCE bit is flushing the TLB. The >> NXE bit is not being toggled here! That's the more surprising part. >> > Just a guess, but may be because writing EFER is not something that happens > often in regular OSes it is not optimized to handle different bits differently. Yes, that's what Intel said too. Nehalem results: userspace exit, urn 17560 17726 17628 17572 17417 lightweight exit, urn 3316 3342 3342 3319 3328 userspace exit, LOAD_EFER, guest!=host 12200 11772 12130 12164 12327 lightweight exit, LOAD_EFER, guest!=host 3214 3220 3238 3218 3337 userspace exit, LOAD_EFER, guest=host 11983 11780 11920 11919 12040 lightweight exit, LOAD_EFER, guest=host 3178 3193 3193 3187 3220 So the benchmark results also explain why skipping the LOAD_EFER does not give a benefit for guest EFER=host EFER. Paolo ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Seeking a KVM benchmark 2014-11-12 11:33 ` Paolo Bonzini @ 2014-11-12 15:22 ` Gleb Natapov 2014-11-12 15:26 ` Paolo Bonzini 0 siblings, 1 reply; 30+ messages in thread From: Gleb Natapov @ 2014-11-12 15:22 UTC (permalink / raw) To: Paolo Bonzini; +Cc: Avi Kivity, Andy Lutomirski, kvm list On Wed, Nov 12, 2014 at 12:33:32PM +0100, Paolo Bonzini wrote: > > > On 10/11/2014 18:38, Gleb Natapov wrote: > > On Mon, Nov 10, 2014 at 06:28:25PM +0100, Paolo Bonzini wrote: > >> On 10/11/2014 15:23, Avi Kivity wrote: > >>> It's not surprising [1]. Since the meaning of some PTE bits change [2], > >>> the TLB has to be flushed. In VMX we have VPIDs, so we only need to flush > >>> if EFER changed between two invocations of the same VPID, which isn't the > >>> case. > >>> > >>> [1] after the fact > >>> [2] although those bits were reserved with NXE=0, so they shouldn't have > >>> any TLB footprint > >> > >> You're right that this is not that surprising after the fact, and that > >> both Sandy Bridge and Ivy Bridge have VPIDs (even the non-Xeon ones). > >> This is also why I'm curious about the Nehalem. > >> > >> However note that even toggling the SCE bit is flushing the TLB. The > >> NXE bit is not being toggled here! That's the more surprising part. > >> > > Just a guess, but may be because writing EFER is not something that happens > > often in regular OSes it is not optimized to handle different bits differently. > > Yes, that's what Intel said too. > > Nehalem results: > > userspace exit, urn 17560 17726 17628 17572 17417 > lightweight exit, urn 3316 3342 3342 3319 3328 > userspace exit, LOAD_EFER, guest!=host 12200 11772 12130 12164 12327 > lightweight exit, LOAD_EFER, guest!=host 3214 3220 3238 3218 3337 > userspace exit, LOAD_EFER, guest=host 11983 11780 11920 11919 12040 > lightweight exit, LOAD_EFER, guest=host 3178 3193 3193 3187 3220 > Is this with Andy's patch that skips LOAD_EFER when guest=host, or the one that always switch LOAD_EFER? -- Gleb. 
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Seeking a KVM benchmark 2014-11-12 15:22 ` Gleb Natapov @ 2014-11-12 15:26 ` Paolo Bonzini 2014-11-12 15:32 ` Gleb Natapov 0 siblings, 1 reply; 30+ messages in thread From: Paolo Bonzini @ 2014-11-12 15:26 UTC (permalink / raw) To: Gleb Natapov; +Cc: Avi Kivity, Andy Lutomirski, kvm list On 12/11/2014 16:22, Gleb Natapov wrote: >> > Nehalem results: >> > >> > userspace exit, urn 17560 17726 17628 17572 17417 >> > lightweight exit, urn 3316 3342 3342 3319 3328 >> > userspace exit, LOAD_EFER, guest!=host 12200 11772 12130 12164 12327 >> > lightweight exit, LOAD_EFER, guest!=host 3214 3220 3238 3218 3337 >> > userspace exit, LOAD_EFER, guest=host 11983 11780 11920 11919 12040 >> > lightweight exit, LOAD_EFER, guest=host 3178 3193 3193 3187 3220 >> > > Is this with Andy's patch that skips LOAD_EFER when guest=host, or the one > that always switch LOAD_EFER? Skip LOAD_EFER when guest=host. Paolo ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Seeking a KVM benchmark 2014-11-12 15:26 ` Paolo Bonzini @ 2014-11-12 15:32 ` Gleb Natapov 2014-11-12 15:51 ` Paolo Bonzini 0 siblings, 1 reply; 30+ messages in thread From: Gleb Natapov @ 2014-11-12 15:32 UTC (permalink / raw) To: Paolo Bonzini; +Cc: Avi Kivity, Andy Lutomirski, kvm list On Wed, Nov 12, 2014 at 04:26:29PM +0100, Paolo Bonzini wrote: > > > On 12/11/2014 16:22, Gleb Natapov wrote: > >> > Nehalem results: > >> > > >> > userspace exit, urn 17560 17726 17628 17572 17417 > >> > lightweight exit, urn 3316 3342 3342 3319 3328 > >> > userspace exit, LOAD_EFER, guest!=host 12200 11772 12130 12164 12327 > >> > lightweight exit, LOAD_EFER, guest!=host 3214 3220 3238 3218 3337 > >> > userspace exit, LOAD_EFER, guest=host 11983 11780 11920 11919 12040 > >> > lightweight exit, LOAD_EFER, guest=host 3178 3193 3193 3187 3220 > >> > > > Is this with Andy's patch that skips LOAD_EFER when guest=host, or the one > > that always switch LOAD_EFER? > > Skip LOAD_EFER when guest=host. > So guest=host is a little bit better than guest!=host so looks like skipping LOAD_EFER helps, but why "lightweight exit, urn" worse than guest=host though, it should be exactly the same as long as NX bit is the same in urn test, no? -- Gleb. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Seeking a KVM benchmark 2014-11-12 15:32 ` Gleb Natapov @ 2014-11-12 15:51 ` Paolo Bonzini 2014-11-12 16:07 ` Andy Lutomirski 0 siblings, 1 reply; 30+ messages in thread From: Paolo Bonzini @ 2014-11-12 15:51 UTC (permalink / raw) To: Gleb Natapov; +Cc: Avi Kivity, Andy Lutomirski, kvm list On 12/11/2014 16:32, Gleb Natapov wrote: > > > > userspace exit, urn 17560 17726 17628 17572 17417 > > > > lightweight exit, urn 3316 3342 3342 3319 3328 > > > > userspace exit, LOAD_EFER, guest!=host 12200 11772 12130 12164 12327 > > > > lightweight exit, LOAD_EFER, guest!=host 3214 3220 3238 3218 3337 > > > > userspace exit, LOAD_EFER, guest=host 11983 11780 11920 11919 12040 > > > > lightweight exit, LOAD_EFER, guest=host 3178 3193 3193 3187 3220 > > > > > > Is this with Andy's patch that skips LOAD_EFER when guest=host, or the one > > > that always switch LOAD_EFER? > > > > Skip LOAD_EFER when guest=host. > > So guest=host is a little bit better than guest!=host so looks like > skipping LOAD_EFER helps, but why "lightweight exit, urn" worse than > guest=host though, it should be exactly the same as long as NX bit is > the same in urn test, no? I don't know---it is very much reproducible though. It is not my machine so I cannot run perf on it, but I can try to find a similar one in the next few days. Paolo ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Seeking a KVM benchmark 2014-11-12 15:51 ` Paolo Bonzini @ 2014-11-12 16:07 ` Andy Lutomirski 2014-11-12 17:56 ` Paolo Bonzini 0 siblings, 1 reply; 30+ messages in thread From: Andy Lutomirski @ 2014-11-12 16:07 UTC (permalink / raw) To: Paolo Bonzini; +Cc: Gleb Natapov, Avi Kivity, kvm list On Wed, Nov 12, 2014 at 7:51 AM, Paolo Bonzini <pbonzini@redhat.com> wrote: > > > On 12/11/2014 16:32, Gleb Natapov wrote: >> > > > userspace exit, urn 17560 17726 17628 17572 17417 >> > > > lightweight exit, urn 3316 3342 3342 3319 3328 >> > > > userspace exit, LOAD_EFER, guest!=host 12200 11772 12130 12164 12327 >> > > > lightweight exit, LOAD_EFER, guest!=host 3214 3220 3238 3218 3337 >> > > > userspace exit, LOAD_EFER, guest=host 11983 11780 11920 11919 12040 >> > > > lightweight exit, LOAD_EFER, guest=host 3178 3193 3193 3187 3220 >> > > >> > > Is this with Andy's patch that skips LOAD_EFER when guest=host, or the one >> > > that always switch LOAD_EFER? >> > >> > Skip LOAD_EFER when guest=host. >> >> So guest=host is a little bit better than guest!=host so looks like >> skipping LOAD_EFER helps, but why "lightweight exit, urn" worse than >> guest=host though, it should be exactly the same as long as NX bit is >> the same in urn test, no? > > I don't know---it is very much reproducible though. It is not my > machine so I cannot run perf on it, but I can try to find a similar one > in the next few days. Assuming you're running both of my patches (LOAD_EFER regardless of nx, but skip LOAD_EFER if guest == host), then some of the speedup may be just less code running. I haven't figured out exactly when vmx_save_host_state runs, but my patches avoid a call to kvm_set_shared_msr, which is worth a few cycles. --Andy > > Paolo -- Andy Lutomirski AMA Capital Management, LLC ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Seeking a KVM benchmark 2014-11-12 16:07 ` Andy Lutomirski @ 2014-11-12 17:56 ` Paolo Bonzini 0 siblings, 0 replies; 30+ messages in thread From: Paolo Bonzini @ 2014-11-12 17:56 UTC (permalink / raw) To: Andy Lutomirski; +Cc: Gleb Natapov, Avi Kivity, kvm list > Assuming you're running both of my patches (LOAD_EFER regardless of > nx, but skip LOAD_EFER if guest == host), then some of the speedup may > be just less code running. I haven't figured out exactly when > vmx_save_host_state runs, but my patches avoid a call to > kvm_set_shared_msr, which is worth a few cycles. Yes, that's possible. vmx_save_host_state is here:

	preempt_disable();
	kvm_x86_ops->prepare_guest_switch(vcpu);   // <<<<
	if (vcpu->fpu_active)
		kvm_load_guest_fpu(vcpu);
	kvm_load_guest_xcr0(vcpu);

	vcpu->mode = IN_GUEST_MODE;

	srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);

and it's a fairly hot function. Paolo ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Seeking a KVM benchmark 2014-11-10 17:28 ` Paolo Bonzini 2014-11-10 17:38 ` Gleb Natapov @ 2014-11-17 11:17 ` Wanpeng Li 2014-11-17 11:18 ` Paolo Bonzini 1 sibling, 1 reply; 30+ messages in thread From: Wanpeng Li @ 2014-11-17 11:17 UTC (permalink / raw) To: Paolo Bonzini, Avi Kivity, Gleb Natapov; +Cc: Andy Lutomirski, kvm list Hi Paolo, On 11/11/14, 1:28 AM, Paolo Bonzini wrote: > On 10/11/2014 15:23, Avi Kivity wrote: >> It's not surprising [1]. Since the meaning of some PTE bits change [2], >> the TLB has to be flushed. In VMX we have VPIDs, so we only need to flush >> if EFER changed between two invocations of the same VPID, which isn't the >> case. If there need a TLB flush if guest is UP? Regards, Wanpeng Li >> >> [1] after the fact >> [2] although those bits were reserved with NXE=0, so they shouldn't have >> any TLB footprint > You're right that this is not that surprising after the fact, and that > both Sandy Bridge and Ivy Bridge have VPIDs (even the non-Xeon ones). > This is also why I'm curious about the Nehalem. > > However note that even toggling the SCE bit is flushing the TLB. The > NXE bit is not being toggled here! That's the more surprising part. > > Paolo > -- > To unsubscribe from this list: send the line "unsubscribe kvm" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Seeking a KVM benchmark 2014-11-17 11:17 ` Wanpeng Li @ 2014-11-17 11:18 ` Paolo Bonzini 2014-11-17 12:00 ` Wanpeng Li 0 siblings, 1 reply; 30+ messages in thread From: Paolo Bonzini @ 2014-11-17 11:18 UTC (permalink / raw) To: Wanpeng Li, Avi Kivity, Gleb Natapov; +Cc: Andy Lutomirski, kvm list On 17/11/2014 12:17, Wanpeng Li wrote: >> >>> It's not surprising [1]. Since the meaning of some PTE bits change [2], >>> the TLB has to be flushed. In VMX we have VPIDs, so we only need to flush >>> if EFER changed between two invocations of the same VPID, which isn't >>> the case. > > If there need a TLB flush if guest is UP? The wrmsr is in the host, and the TLB flush is done in the processor microcode. Paolo ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Seeking a KVM benchmark 2014-11-17 11:18 ` Paolo Bonzini @ 2014-11-17 12:00 ` Wanpeng Li 2014-11-17 12:04 ` Paolo Bonzini 0 siblings, 1 reply; 30+ messages in thread From: Wanpeng Li @ 2014-11-17 12:00 UTC (permalink / raw) To: Paolo Bonzini, Avi Kivity, Gleb Natapov; +Cc: Andy Lutomirski, kvm list Hi Paolo, On 11/17/14, 7:18 PM, Paolo Bonzini wrote: > > On 17/11/2014 12:17, Wanpeng Li wrote: >>>> It's not surprising [1]. Since the meaning of some PTE bits change [2], >>>> the TLB has to be flushed. In VMX we have VPIDs, so we only need to flush >>>> if EFER changed between two invocations of the same VPID, which isn't >>>> the case. >> If there need a TLB flush if guest is UP? > The wrmsr is in the host, and the TLB flush is done in the processor > microcode. Sorry, maybe I didn't state my question clearly. As Avi mentioned above "In VMX we have VPIDs, so we only need to flush if EFER changed between two invocations of the same VPID", so there is only one VPID if the guest is UP, my question is if there need a TLB flush when guest's EFER has been changed? Regards, Wanpeng Li > > Paolo ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Seeking a KVM benchmark 2014-11-17 12:00 ` Wanpeng Li @ 2014-11-17 12:04 ` Paolo Bonzini 2014-11-17 12:14 ` Wanpeng Li 0 siblings, 1 reply; 30+ messages in thread From: Paolo Bonzini @ 2014-11-17 12:04 UTC (permalink / raw) To: Wanpeng Li, Avi Kivity, Gleb Natapov; +Cc: Andy Lutomirski, kvm list On 17/11/2014 13:00, Wanpeng Li wrote: > Sorry, maybe I didn't state my question clearly. As Avi mentioned above > "In VMX we have VPIDs, so we only need to flush if EFER changed between > two invocations of the same VPID", so there is only one VPID if the > guest is UP, my question is if there need a TLB flush when guest's EFER > has been changed? Yes, because the meaning of the page table entries has changed. Paolo ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Seeking a KVM benchmark 2014-11-17 12:04 ` Paolo Bonzini @ 2014-11-17 12:14 ` Wanpeng Li 2014-11-17 12:22 ` Paolo Bonzini 0 siblings, 1 reply; 30+ messages in thread From: Wanpeng Li @ 2014-11-17 12:14 UTC (permalink / raw) To: Paolo Bonzini, Avi Kivity, Gleb Natapov; +Cc: Andy Lutomirski, kvm list Hi Paolo, On 11/17/14, 8:04 PM, Paolo Bonzini wrote: > > On 17/11/2014 13:00, Wanpeng Li wrote: >> Sorry, maybe I didn't state my question clearly. As Avi mentioned above >> "In VMX we have VPIDs, so we only need to flush if EFER changed between >> two invocations of the same VPID", so there is only one VPID if the >> guest is UP, my question is if there need a TLB flush when guest's EFER >> has been changed? > Yes, because the meaning of the page table entries has changed. So both VMX EFER writes and non-VMX EFER writes cause a TLB flush for UP guest, is there still a performance improvement in this case? Regards, Wanpeng Li > > Paolo ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Seeking a KVM benchmark 2014-11-17 12:14 ` Wanpeng Li @ 2014-11-17 12:22 ` Paolo Bonzini 0 siblings, 0 replies; 30+ messages in thread From: Paolo Bonzini @ 2014-11-17 12:22 UTC (permalink / raw) To: Wanpeng Li, Avi Kivity, Gleb Natapov; +Cc: Andy Lutomirski, kvm list On 17/11/2014 13:14, Wanpeng Li wrote: >> >>> Sorry, maybe I didn't state my question clearly. As Avi mentioned above >>> "In VMX we have VPIDs, so we only need to flush if EFER changed between >>> two invocations of the same VPID", so there is only one VPID if the >>> guest is UP, my question is if there need a TLB flush when guest's EFER >>> has been changed? >> Yes, because the meaning of the page table entries has changed. > > So both VMX EFER writes and non-VMX EFER writes cause a TLB flush for UP > guest, is there still a performance improvement in this case? Note that the guest's EFER does not change, so no TLB flush happens. The guest EFER, however, is different from the host's, so if you change it with a wrmsr in the host you will get a TLB flush on every userspace exit. Paolo ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Seeking a KVM benchmark 2014-11-10 12:15 ` Paolo Bonzini 2014-11-10 14:23 ` Avi Kivity @ 2014-11-11 11:07 ` Paolo Bonzini 1 sibling, 0 replies; 30+ messages in thread From: Paolo Bonzini @ 2014-11-11 11:07 UTC (permalink / raw) To: Gleb Natapov; +Cc: Andy Lutomirski, kvm list On 10/11/2014 13:15, Paolo Bonzini wrote: > > > On 10/11/2014 11:45, Gleb Natapov wrote: >>> I tried making also the other shared MSRs the same between guest and >>> host (STAR, LSTAR, CSTAR, SYSCALL_MASK), so that the user return notifier >>> has nothing to do. That saves about 4-500 cycles on inl_from_qemu. I >>> do want to dig out my old Core 2 and see how the new test fares, but it >>> really looks like your patch will be in 3.19. >> >> Please test on wide variety of HW before final decision. > > Yes, definitely. I've reproduced Andy's results on Ivy Bridge: NX off ~6900 cycles (EFER) NX on, SCE off ~14600 cycles (urn) NX on, SCE on ~6900 cycles (same value) I also asked Intel about clarifications. On Core 2 Duo the results are weird. There is no LOAD_EFER control, so Andy's patch does not apply and the only interesting paths are urn and same value. The pessimization of EFER writes does _seem_ to be there, since I can profile for iTLB flushes (r4082 on this microarchitecture) and get: 0.14% qemu-kvm [kernel.kallsyms] [k] native_write_msr_safe 0.14% qemu-kvm [kernel.kallsyms] [k] native_flush_tlb but these are the top two results and it is not clear to me why perf only records them as "0.14%"... Also, this machine has no EPT, so virt suffers a lot from TLB misses anyway. Nevertheless I tried running kvm-unit-tests with different values of the MSRs to see what's the behavior. 
                                    NX=1/SCE=0  NX=1/SCE=1  all MSRs equal
    cpuid                                 3374        3448            3608
    vmcall                                3274        3337            3478
    mov_from_cr8                            11          11              11
    mov_to_cr8                              15          15              15
    inl_from_pmtimer                     17803       16346           15156
    inl_from_qemu                        17858       16375           15163
    inl_from_kernel                       6351        6492            6622
    outl_to_kernel                        3850        3900            4053
    mov_dr                                 116         116             117
    ple-round-robin                         15          16              16
    wr_tsc_adjust_msr                     3334        3417            3570
    rd_tsc_adjust_msr                     3374        3404            3605
    mmio-no-eventfd:pci-mem              19188       17866           16660
    mmio-wildcard-eventfd:pci-mem         7319        7414            7595
    mmio-datamatch-eventfd:pci-mem        7304        7470            7605
    portio-no-eventfd:pci-io             13219       11780           10447
    portio-wildcard-eventfd:pci-io        3951        4024            4149
    portio-datamatch-eventfd:pci-io       3940        4026            4228

In the last column, all shared MSRs are equal (*) between host and guest.
The difference is very noisy on newer processors, but quite visible on
the older processor.  It is weird, though, that the light-weight exits
become _more_ expensive as more MSRs are equal between guest and host.
Anyhow, this is more of a curiosity, since the proposed patch has no
effect here.

Next will come Nehalem.  Nehalem has both LOAD_EFER and EPT, so it's
already a good target.  I can test Westmere too, as soon as I find
someone that has it, but it shouldn't give surprises.

Paolo

(*) run this:

    #! /usr/bin/env python
    # Python 2 script; reading /dev/cpu/0/msr needs root and the msr module.
    class msr(object):
        def __init__(self):
            try:
                self.f = open('/dev/cpu/0/msr', 'r', 0)
            except:
                self.f = open('/dev/msr0', 'r', 0)
        def read(self, index, default = None):
            import struct
            self.f.seek(index)
            try:
                return struct.unpack('Q', self.f.read(8))[0]
            except:
                return default

    m = msr()
    for i in [0xc0000080, 0xc0000081, 0xc0000082, 0xc0000083, 0xc0000084]:
        print ("wrmsr(0x%x, 0x%x);" % (i, m.read(i)))

and add the result to the enable_nx function.

^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: Seeking a KVM benchmark
  2014-11-10 10:45 ` Gleb Natapov
  2014-11-10 12:15   ` Paolo Bonzini
@ 2014-11-10 19:17   ` Andy Lutomirski
  1 sibling, 0 replies; 30+ messages in thread
From: Andy Lutomirski @ 2014-11-10 19:17 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Paolo Bonzini, kvm list

On Mon, Nov 10, 2014 at 2:45 AM, Gleb Natapov <gleb@kernel.org> wrote:
> On Mon, Nov 10, 2014 at 11:03:35AM +0100, Paolo Bonzini wrote:
>>
>>
>> On 09/11/2014 17:36, Andy Lutomirski wrote:
>> >> The purpose of the vmexit test is to show us various overheads, so why
>> >> not measure the EFER switch overhead by having two tests, one with
>> >> equal EFER and another with different EFER, instead of hiding it.
>> >
>> > I'll try this.  We might need three tests, though: NX different, NX
>> > same but SCE different, and all flags the same.
>>
>> The test actually explicitly enables NX in order to put itself in the
>> "common case":
>>
>>     commit 82d4ccb9daf67885a0316b1d763ce5ace57cff36
>>     Author: Marcelo Tosatti <mtosatti@redhat.com>
>>     Date:   Tue Jun 8 15:33:29 2010 -0300
>>
>>         test: vmexit: enable NX
>>
>>         Enable NX to disable MSR autoload/save. This is the common case anyway.
>>
>>         Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
>>         Signed-off-by: Avi Kivity <avi@redhat.com>
>>
>> (this commit is in qemu-kvm.git), so I guess forgetting to set SCE is
>> just a bug.  The results on my Xeon Sandy Bridge are very interesting:
>>
>>     NX different            ~11.5k  (load/save EFER path)
>>     NX same, SCE different  ~19.5k  (urn path)
>>     all flags the same      ~10.2k
>>
>> The inl_from_kernel results show absolutely no change, usually at most 5
>> cycles of difference.  This could be because I've added the SCE=1 variant
>> directly to vmexit.c, so I'm running the tests one next to the other.
>>
>> I tried making also the other shared MSRs the same between guest and
>> host (STAR, LSTAR, CSTAR, SYSCALL_MASK), so that the user return notifier
>> has nothing to do.  That saves about 400-500 cycles on inl_from_qemu.  I
>> do want to dig out my old Core 2 and see how the new test fares, but it
>> really looks like your patch will be in 3.19.
>>
> Please test on a wide variety of HW before the final decision.  Also it
> would be nice to ask Intel what the expected overhead is.  It would be
> awesome if they managed to add EFER switching with no measurable
> overhead, but that is also hard to believe :)  Also, Andy had an idea to
> disable switching in case host and guest EFERs are the same, but IIRC
> his patch does not include it yet.

I'll send that patch as a followup in a sec.  It doesn't seem to make a
difference, which reinforces my hypothesis that microcode is fiddling
with EFER on entry and exit anyway to handle LME and LMA, so adjusting
the other bits doesn't affect performance.

--Andy

>
> --
>             Gleb.

--
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 30+ messages in thread
end of thread, other threads:[~2014-11-17 12:22 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-11-07  6:27 Seeking a KVM benchmark Andy Lutomirski
2014-11-07  7:17 ` Paolo Bonzini
2014-11-07 17:59 ` Andy Lutomirski
2014-11-07 18:11 ` Andy Lutomirski
2014-11-08 12:01 ` Gleb Natapov
2014-11-08 16:00 ` Andy Lutomirski
2014-11-08 16:44 ` Andy Lutomirski
2014-11-09  8:52 ` Gleb Natapov
2014-11-09 16:36 ` Andy Lutomirski
2014-11-10 10:03 ` Paolo Bonzini
2014-11-10 10:45 ` Gleb Natapov
2014-11-10 12:15 ` Paolo Bonzini
2014-11-10 14:23 ` Avi Kivity
2014-11-10 17:28 ` Paolo Bonzini
2014-11-10 17:38 ` Gleb Natapov
2014-11-12 11:33 ` Paolo Bonzini
2014-11-12 15:22 ` Gleb Natapov
2014-11-12 15:26 ` Paolo Bonzini
2014-11-12 15:32 ` Gleb Natapov
2014-11-12 15:51 ` Paolo Bonzini
2014-11-12 16:07 ` Andy Lutomirski
2014-11-12 17:56 ` Paolo Bonzini
2014-11-17 11:17 ` Wanpeng Li
2014-11-17 11:18 ` Paolo Bonzini
2014-11-17 12:00 ` Wanpeng Li
2014-11-17 12:04 ` Paolo Bonzini
2014-11-17 12:14 ` Wanpeng Li
2014-11-17 12:22 ` Paolo Bonzini
2014-11-11 11:07 ` Paolo Bonzini
2014-11-10 19:17 ` Andy Lutomirski