* Nested kvm_intel broken on pre 3.3 hosts
@ 2012-08-01 11:29 Stefan Bader
  2012-08-01 13:39 ` Gleb Natapov
  0 siblings, 1 reply; 23+ messages in thread
From: Stefan Bader @ 2012-08-01 11:29 UTC (permalink / raw)
  To: kvm; +Cc: Avi Kivity, Gleb Natapov, Andy Whitcroft

I have been looking at a report[1] about the kvm_intel module failing to load on
linux v3.3 and newer guests when running on a v3.2 host. Bisection turned up the
following patch:

commit fee84b079d5ddee2247b5c1f53162c330c622902
Author: Avi Kivity <avi@redhat.com>
Date:   Thu Nov 10 14:57:25 2011 +0200

    KVM: VMX: Intercept RDPMC

    Intercept RDPMC and forward it to the PMU emulation code.

    Signed-off-by: Avi Kivity <avi@redhat.com>
    Signed-off-by: Gleb Natapov <gleb@redhat.com>
    Signed-off-by: Avi Kivity <avi@redhat.com>

It looks like requiring the feature based on what the CPU reports fails when
the host (outer kvm module) code does not support it. So maybe that feature
should be optional instead of required?
It also seems that kvm_amd does not "suffer" from any test that could fail, so
it should be OK (though I did not test it personally).
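
For context, the load failure comes from the required-feature check in the
guest's setup_vmcs_config() path; roughly, sketched along the lines of the
3.x-era adjust_vmx_controls() (not quoted verbatim):

static __init int adjust_vmx_controls(u32 ctl_min, u32 ctl_opt,
				      u32 msr, u32 *result)
{
	u32 vmx_msr_low, vmx_msr_high;
	u32 ctl = ctl_min | ctl_opt;

	/* The VMX capability MSR reports which control bits the (here:
	 * emulated) CPU allows to be clear/set. */
	rdmsr(msr, vmx_msr_low, vmx_msr_high);

	ctl &= vmx_msr_high; /* bit == 0 in high word ==> must be zero */
	ctl |= vmx_msr_low;  /* bit == 1 in low word  ==> must be one  */

	/* Ensure the minimum (required) set of control bits is supported;
	 * a missing "min" bit is what aborts loading kvm_intel in the
	 * L1 guest. */
	if (ctl_min & ~ctl)
		return -EIO;

	*result = ctl;
	return 0;
}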

-Stefan

[1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1031090



* Re: Nested kvm_intel broken on pre 3.3 hosts
  2012-08-01 11:29 Nested kvm_intel broken on pre 3.3 hosts Stefan Bader
@ 2012-08-01 13:39 ` Gleb Natapov
  2012-08-01 14:08   ` Avi Kivity
  0 siblings, 1 reply; 23+ messages in thread
From: Gleb Natapov @ 2012-08-01 13:39 UTC (permalink / raw)
  To: Stefan Bader; +Cc: kvm, Avi Kivity, Andy Whitcroft

On Wed, Aug 01, 2012 at 01:29:11PM +0200, Stefan Bader wrote:
> I have been looking at a report[1] about the kvm_intel module failing to load on
> linux v3.3 and newer guests when running on a v3.2 host. Bisection turned up the
> following patch:
> 
> commit fee84b079d5ddee2247b5c1f53162c330c622902
> Author: Avi Kivity <avi@redhat.com>
> Date:   Thu Nov 10 14:57:25 2011 +0200
> 
>     KVM: VMX: Intercept RDPMC
> 
>     Intercept RDPMC and forward it to the PMU emulation code.
> 
>     Signed-off-by: Avi Kivity <avi@redhat.com>
>     Signed-off-by: Gleb Natapov <gleb@redhat.com>
>     Signed-off-by: Avi Kivity <avi@redhat.com>
> 
> It looks like requiring the feature based on what the CPU reports fails when
> the host (outer kvm module) code does not support it. So maybe that feature
> should be optional instead of required?
According to the Intel SDM there has never been a CPU that didn't support
RDPMC exiting. This looks like an unfortunate nested VMX bug.

> It also seems that kvm_amd does not "suffer" from any test that could fail, so
> it should be OK (though I did not test it personally).
> 
> -Stefan
> 
> [1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1031090
> 



--
			Gleb.


* Re: Nested kvm_intel broken on pre 3.3 hosts
  2012-08-01 13:39 ` Gleb Natapov
@ 2012-08-01 14:08   ` Avi Kivity
  2012-08-01 14:26     ` Stefan Bader
  0 siblings, 1 reply; 23+ messages in thread
From: Avi Kivity @ 2012-08-01 14:08 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Stefan Bader, kvm, Andy Whitcroft

On 08/01/2012 04:39 PM, Gleb Natapov wrote:
> On Wed, Aug 01, 2012 at 01:29:11PM +0200, Stefan Bader wrote:
>> I have been looking at a report[1] about the kvm_intel module failing to load on
>> linux v3.3 and newer guests when running on a v3.2 host. Bisection turned up the
>> following patch:
>> 
>> commit fee84b079d5ddee2247b5c1f53162c330c622902
>> Author: Avi Kivity <avi@redhat.com>
>> Date:   Thu Nov 10 14:57:25 2011 +0200
>> 
>>     KVM: VMX: Intercept RDPMC
>> 
>>     Intercept RDPMC and forward it to the PMU emulation code.
>> 
>>     Signed-off-by: Avi Kivity <avi@redhat.com>
>>     Signed-off-by: Gleb Natapov <gleb@redhat.com>
>>     Signed-off-by: Avi Kivity <avi@redhat.com>
>> 
>> It looks like requiring the feature based on what the CPU reports fails when
>> the host (outer kvm module) code does not support it. So maybe that feature
>> should be optional instead of required?
> According to the Intel SDM there has never been a CPU that didn't support
> RDPMC exiting. This looks like an unfortunate nested VMX bug.

Moreover, that same commit fixes the bug in nested vmx.  So if you
update your host kernel to the same version as your L1 guest (or, at
your option, any later version) it should work.
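
(The host-side part of that fix is essentially one line in
nested_vmx_setup_ctls_msrs(), adding CPU_BASED_RDPMC_EXITING to the controls
nested VMX advertises to L1 -- sketched here from memory, not the verbatim
hunk from commit fee84b079:

	nested_vmx_procbased_ctls_high &=
		CPU_BASED_VIRTUAL_INTR_PENDING |
		/* ... the controls that were already advertised ... */
		CPU_BASED_RDPMC_EXITING |	/* missing on pre-3.3 hosts */
		CPU_BASED_USE_MSR_BITMAPS;

Without that bit set in the VMX capability MSRs seen by L1, a newer guest
kernel's "min" check cannot be satisfied and kvm_intel refuses to load.)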

We could backport that part of the patch, though as nested vmx is still
experimental, I don't think it's worth it.

-- 
error compiling committee.c: too many arguments to function


* Re: Nested kvm_intel broken on pre 3.3 hosts
  2012-08-01 14:08   ` Avi Kivity
@ 2012-08-01 14:26     ` Stefan Bader
  2012-08-01 14:29       ` Avi Kivity
  0 siblings, 1 reply; 23+ messages in thread
From: Stefan Bader @ 2012-08-01 14:26 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Gleb Natapov, kvm, Andy Whitcroft

On 01.08.2012 16:08, Avi Kivity wrote:
> On 08/01/2012 04:39 PM, Gleb Natapov wrote:
>> On Wed, Aug 01, 2012 at 01:29:11PM +0200, Stefan Bader wrote:
>>> I have been looking at a report[1] about the kvm_intel module failing to load on
>>> linux v3.3 and newer guests when running on a v3.2 host. Bisection turned up the
>>> following patch:
>>>
>>> commit fee84b079d5ddee2247b5c1f53162c330c622902
>>> Author: Avi Kivity <avi@redhat.com>
>>> Date:   Thu Nov 10 14:57:25 2011 +0200
>>>
>>>     KVM: VMX: Intercept RDPMC
>>>
>>>     Intercept RDPMC and forward it to the PMU emulation code.
>>>
>>>     Signed-off-by: Avi Kivity <avi@redhat.com>
>>>     Signed-off-by: Gleb Natapov <gleb@redhat.com>
>>>     Signed-off-by: Avi Kivity <avi@redhat.com>
>>>
>>> It looks like requiring the feature based on what the CPU reports fails when
>>> the host (outer kvm module) code does not support it. So maybe that feature
>>> should be optional instead of required?
>> According to the Intel SDM there has never been a CPU that didn't support
>> RDPMC exiting. This looks like an unfortunate nested VMX bug.
> 
> Moreover, that same commit fixes the bug in nested vmx.  So if you
> update your host kernel to the same version as your L1 guest (or, at
> your option, any later version) it should work.
> 
> We could backport that part of the patch, though as nested vmx is still
> experimental, I don't think it's worth it.
> 
Though this is probably what people will (or will have to) do. The host is not
always under your control. Even with the feature being experimental, it was
working before (at least the module was loadable) and is now broken on Intel
hosts.
But OK, so the recommendation is to backport support to the host kernel rather
than to handle this differently in the guest module, right?




* Re: Nested kvm_intel broken on pre 3.3 hosts
  2012-08-01 14:26     ` Stefan Bader
@ 2012-08-01 14:29       ` Avi Kivity
  2012-08-01 15:07         ` Nadav Har'El
  0 siblings, 1 reply; 23+ messages in thread
From: Avi Kivity @ 2012-08-01 14:29 UTC (permalink / raw)
  To: Stefan Bader; +Cc: Gleb Natapov, kvm, Andy Whitcroft

On 08/01/2012 05:26 PM, Stefan Bader wrote:

>>> According to the Intel SDM there has never been a CPU that didn't support
>>> RDPMC exiting. This looks like an unfortunate nested VMX bug.
>> 
>> Moreover, that same commit fixes the bug in nested vmx.  So if you
>> update your host kernel to the same version as your L1 guest (or, at
>> your option, any later version) it should work.
>> 
>> We could backport that part of the patch, though as nested vmx is still
>> experimental, I don't think it's worth it.
>> 
> Though this is probably what people will (or will have to) do. The host is not
> always under your control. Even with the feature being experimental, it was
> working before (at least the module was loadable) and is now broken on Intel
> hosts.
> But OK, so the recommendation is to backport support to the host kernel rather
> than to handle this differently in the guest module, right?

Right - it's not just kvm-as-a-guest that will trip on this.  But
there's no point in everyone backporting it on their own.  If you're
doing the backport, please post it here and we'll forward it to the
stable branch.

-- 
error compiling committee.c: too many arguments to function


* Re: Nested kvm_intel broken on pre 3.3 hosts
  2012-08-01 14:29       ` Avi Kivity
@ 2012-08-01 15:07         ` Nadav Har'El
  2012-08-01 15:10           ` Gleb Natapov
  2012-08-01 15:11           ` Avi Kivity
  0 siblings, 2 replies; 23+ messages in thread
From: Nadav Har'El @ 2012-08-01 15:07 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Stefan Bader, Gleb Natapov, kvm, Andy Whitcroft

On Wed, Aug 01, 2012, Avi Kivity wrote about "Re: Nested kvm_intel broken on pre 3.3 hosts":
> Right - it's not just kvm-as-a-guest that will trip on this.  But
> there's no point in everyone backporting it on their own.  If you're
> doing the backport, please post it here and we'll forward it to the
> stable branch.

If I understand correctly, the failure occurs because new versions of
KVM refuse to work if the processor doesn't support CPU_BASED_RDPMC_EXITING -
a control that older versions of nested VMX didn't claim to support.

But must the KVM guest refuse to work if this feature isn't supported?
I.e., why not move CPU_BASED_RDPMC_EXITING from "min" to "opt" in
setup_vmcs_config()? Isn't losing the PMU feature a lesser evil than
not working at all?  In any case, perhaps the original reporter can at
least use this as a workaround, because it requires modifying the (L1)
guest, not the host.
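
(Concretely, the workaround would look something like this in the L1 guest's
arch/x86/kvm/vmx.c, setup_vmcs_config() - an untested sketch, not a proposed
patch:

	min = CPU_BASED_HLT_EXITING |
	      /* ... other required controls, unchanged ... */
	      CPU_BASED_INVLPG_EXITING;	/* RDPMC_EXITING dropped from "min" */

	opt = CPU_BASED_TPR_SHADOW |
	      CPU_BASED_RDPMC_EXITING |	/* ... and made optional instead */
	      CPU_BASED_USE_MSR_BITMAPS |
	      CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
	if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_PROCBASED_CTLS,
				&_cpu_based_exec_control) < 0)
		return -EIO;

With the control merely optional, the module would load on hosts that don't
advertise it, at the cost of RDPMC running unintercepted, i.e. no emulated
PMU.)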


-- 
Nadav Har'El                        |        Wednesday, Aug 1 2012, 13 Av 5772
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |May you live as long as you want - and
http://nadav.harel.org.il           |never want as long as you live.


* Re: Nested kvm_intel broken on pre 3.3 hosts
  2012-08-01 15:07         ` Nadav Har'El
@ 2012-08-01 15:10           ` Gleb Natapov
  2012-08-01 15:11           ` Avi Kivity
  1 sibling, 0 replies; 23+ messages in thread
From: Gleb Natapov @ 2012-08-01 15:10 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: Avi Kivity, Stefan Bader, kvm, Andy Whitcroft

On Wed, Aug 01, 2012 at 06:07:07PM +0300, Nadav Har'El wrote:
> On Wed, Aug 01, 2012, Avi Kivity wrote about "Re: Nested kvm_intel broken on pre 3.3 hosts":
> > Right - it's not just kvm-as-a-guest that will trip on this.  But
> > there's no point in everyone backporting it on their own.  If you're
> > doing the backport, please post it here and we'll forward it to the
> > stable branch.
> 
> If I understand correctly, the failure occurs because new versions of
> KVM refuse to work if the processor doesn't support CPU_BASED_RDPMC_EXITING -
> a control that older versions of nested VMX didn't claim to support.
> 
> But must the KVM guest refuse to work if this feature isn't supported?
> I.e., why not move CPU_BASED_RDPMC_EXITING from "min" to "opt" in
> setup_vmcs_config()? Isn't losing the PMU feature a lesser evil than
> not working at all?  In any case, perhaps the original reporter can at
> least use this as a workaround, because it requires modifying the (L1)
> guest, not the host.
> 
> 
Since either L1 or L2 has to be fixed anyway for this setup to work, it is
better to fix L1 to follow the spec than to make L2 work around the L1 bug.

--
			Gleb.


* Re: Nested kvm_intel broken on pre 3.3 hosts
  2012-08-01 15:07         ` Nadav Har'El
  2012-08-01 15:10           ` Gleb Natapov
@ 2012-08-01 15:11           ` Avi Kivity
  2012-08-02 15:19             ` Stefan Bader
  1 sibling, 1 reply; 23+ messages in thread
From: Avi Kivity @ 2012-08-01 15:11 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: Stefan Bader, Gleb Natapov, kvm, Andy Whitcroft

On 08/01/2012 06:07 PM, Nadav Har'El wrote:
> On Wed, Aug 01, 2012, Avi Kivity wrote about "Re: Nested kvm_intel broken on pre 3.3 hosts":
>> Right - it's not just kvm-as-a-guest that will trip on this.  But
>> there's no point in everyone backporting it on their own.  If you're
>> doing the backport, please post it here and we'll forward it to the
>> stable branch.
> 
> If I understand correctly, the failure occurs because new versions of
> KVM refuse to work if the processor doesn't support CPU_BASED_RDPMC_EXITING -
> a control that older versions of nested VMX didn't claim to support.

Right.

> But must the KVM guest refuse to work if this feature isn't supported?
> I.e., why not move CPU_BASED_RDPMC_EXITING from "min" to "opt" in
> setup_vmcs_config()? Isn't losing the PMU feature a lesser evil than
> not working at all?  In any case, perhaps the original reporter can at
> least use this as a workaround, because it requires modifying the (L1)
> guest, not the host.

Real processors that don't support RDPMC exiting don't exist (and
logically cannot exist unless they also drop support for the RDPMC
instruction).  Given that it's a clear host bug, I'd rather fix it than
make the guest more complicated, even by a small amount.

The hypervisor needs to be updated on a regular schedule anyway, so
there's no risk of locking users out for more than a short while.

-- 
error compiling committee.c: too many arguments to function


* Re: Nested kvm_intel broken on pre 3.3 hosts
  2012-08-01 15:11           ` Avi Kivity
@ 2012-08-02 15:19             ` Stefan Bader
  2012-08-02 15:19               ` [PATCH 1/7] KVM: Move cpuid code to new file Stefan Bader
                                 ` (7 more replies)
  0 siblings, 8 replies; 23+ messages in thread
From: Stefan Bader @ 2012-08-02 15:19 UTC (permalink / raw)
  To: Avi Kivity; +Cc: nyh, Stefan Bader, Gleb Natapov, Andy Whitcroft, kvm

I started by picking #7 (#6 is included to keep SVM and VMX in
sync). Most of the other patches were then needed as dependencies.
The only difference here is #2, which I found applied together
with #1 (which is a dependency). Since #2 is a change that adds
support rather than one that fixes a bug, it was applied only to
our 3.2 tree. But maybe this would be the time to get it into
upstream 3.2 stable as well. I would leave that decision to you;
just note that the testing I have done was basically with that
addition. The test was just an installation of a nested guest (so
not necessarily exercising RDPMC, only verifying that with this
applied to a 3.2 host, a 3.5 guest is able to load the kvm-intel
module).

What I also did not do is look much in the other direction, i.e.
which follow-up patches may be important now that this feature
would be supported in 3.2. I think people on this list are in a
much better position to decide that.
So here is the series I used to compile and test on top of 3.2:

0001-KVM-Move-cpuid-code-to-new-file.patch
0002-KVM-expose-latest-Intel-cpu-new-features-BMI1-BMI2-F.patch
0003-KVM-Expose-kvm_lapic_local_deliver.patch
0004-KVM-Expose-a-version-2-architectural-PMU-to-a-guests.patch
0005-KVM-Add-generic-RDPMC-support.patch
0006-KVM-SVM-Intercept-RDPMC.patch
0007-KVM-VMX-Intercept-RDPMC.patch



* [PATCH 1/7] KVM: Move cpuid code to new file
  2012-08-02 15:19             ` Stefan Bader
@ 2012-08-02 15:19               ` Stefan Bader
  2012-08-02 15:19               ` [PATCH 2/7] KVM: expose latest Intel cpu new features (BMI1/BMI2/FMA/AVX2) to guest Stefan Bader
                                 ` (6 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Stefan Bader @ 2012-08-02 15:19 UTC (permalink / raw)
  To: Avi Kivity; +Cc: nyh, Stefan Bader, Gleb Natapov, Andy Whitcroft, kvm

From: Avi Kivity <avi@redhat.com>

commit 00b27a3efb116062ca5a276ad5cb01ea1b80b5f6 upstream

The cpuid code has grown; put it into a separate file.

Signed-off-by: Avi Kivity <avi@redhat.com>

BugLink: http://bugs.launchpad.net/bugs/960466
(cherry picked from commit 00b27a3efb116062ca5a276ad5cb01ea1b80b5f6)
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
---
 arch/x86/kvm/Makefile |    2 +-
 arch/x86/kvm/cpuid.c  |  625 ++++++++++++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/cpuid.h  |   46 ++++
 arch/x86/kvm/lapic.c  |    1 +
 arch/x86/kvm/vmx.c    |    1 +
 arch/x86/kvm/x86.c    |  634 +------------------------------------------------
 arch/x86/kvm/x86.h    |    5 +-
 7 files changed, 679 insertions(+), 635 deletions(-)
 create mode 100644 arch/x86/kvm/cpuid.c
 create mode 100644 arch/x86/kvm/cpuid.h

diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index f15501f43..161b76a 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -12,7 +12,7 @@ kvm-$(CONFIG_IOMMU_API)	+= $(addprefix ../../../virt/kvm/, iommu.o)
 kvm-$(CONFIG_KVM_ASYNC_PF)	+= $(addprefix ../../../virt/kvm/, async_pf.o)
 
 kvm-y			+= x86.o mmu.o emulate.o i8259.o irq.o lapic.o \
-			   i8254.o timer.o
+			   i8254.o timer.o cpuid.o
 kvm-intel-y		+= vmx.o
 kvm-amd-y		+= svm.o
 
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
new file mode 100644
index 0000000..0a332ec
--- /dev/null
+++ b/arch/x86/kvm/cpuid.c
@@ -0,0 +1,625 @@
+/*
+ * Kernel-based Virtual Machine driver for Linux
+ * cpuid support routines
+ *
+ * derived from arch/x86/kvm/x86.c
+ *
+ * Copyright 2011 Red Hat, Inc. and/or its affiliates.
+ * Copyright IBM Corporation, 2008
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ */
+
+#include <linux/kvm_host.h>
+#include <linux/module.h>
+#include <asm/user.h>
+#include <asm/xsave.h>
+#include "cpuid.h"
+#include "lapic.h"
+#include "mmu.h"
+#include "trace.h"
+
+void kvm_update_cpuid(struct kvm_vcpu *vcpu)
+{
+	struct kvm_cpuid_entry2 *best;
+	struct kvm_lapic *apic = vcpu->arch.apic;
+
+	best = kvm_find_cpuid_entry(vcpu, 1, 0);
+	if (!best)
+		return;
+
+	/* Update OSXSAVE bit */
+	if (cpu_has_xsave && best->function == 0x1) {
+		best->ecx &= ~(bit(X86_FEATURE_OSXSAVE));
+		if (kvm_read_cr4_bits(vcpu, X86_CR4_OSXSAVE))
+			best->ecx |= bit(X86_FEATURE_OSXSAVE);
+	}
+
+	if (apic) {
+		if (best->ecx & bit(X86_FEATURE_TSC_DEADLINE_TIMER))
+			apic->lapic_timer.timer_mode_mask = 3 << 17;
+		else
+			apic->lapic_timer.timer_mode_mask = 1 << 17;
+	}
+}
+
+static int is_efer_nx(void)
+{
+	unsigned long long efer = 0;
+
+	rdmsrl_safe(MSR_EFER, &efer);
+	return efer & EFER_NX;
+}
+
+static void cpuid_fix_nx_cap(struct kvm_vcpu *vcpu)
+{
+	int i;
+	struct kvm_cpuid_entry2 *e, *entry;
+
+	entry = NULL;
+	for (i = 0; i < vcpu->arch.cpuid_nent; ++i) {
+		e = &vcpu->arch.cpuid_entries[i];
+		if (e->function == 0x80000001) {
+			entry = e;
+			break;
+		}
+	}
+	if (entry && (entry->edx & (1 << 20)) && !is_efer_nx()) {
+		entry->edx &= ~(1 << 20);
+		printk(KERN_INFO "kvm: guest NX capability removed\n");
+	}
+}
+
+/* when an old userspace process fills a new kernel module */
+int kvm_vcpu_ioctl_set_cpuid(struct kvm_vcpu *vcpu,
+			     struct kvm_cpuid *cpuid,
+			     struct kvm_cpuid_entry __user *entries)
+{
+	int r, i;
+	struct kvm_cpuid_entry *cpuid_entries;
+
+	r = -E2BIG;
+	if (cpuid->nent > KVM_MAX_CPUID_ENTRIES)
+		goto out;
+	r = -ENOMEM;
+	cpuid_entries = vmalloc(sizeof(struct kvm_cpuid_entry) * cpuid->nent);
+	if (!cpuid_entries)
+		goto out;
+	r = -EFAULT;
+	if (copy_from_user(cpuid_entries, entries,
+			   cpuid->nent * sizeof(struct kvm_cpuid_entry)))
+		goto out_free;
+	for (i = 0; i < cpuid->nent; i++) {
+		vcpu->arch.cpuid_entries[i].function = cpuid_entries[i].function;
+		vcpu->arch.cpuid_entries[i].eax = cpuid_entries[i].eax;
+		vcpu->arch.cpuid_entries[i].ebx = cpuid_entries[i].ebx;
+		vcpu->arch.cpuid_entries[i].ecx = cpuid_entries[i].ecx;
+		vcpu->arch.cpuid_entries[i].edx = cpuid_entries[i].edx;
+		vcpu->arch.cpuid_entries[i].index = 0;
+		vcpu->arch.cpuid_entries[i].flags = 0;
+		vcpu->arch.cpuid_entries[i].padding[0] = 0;
+		vcpu->arch.cpuid_entries[i].padding[1] = 0;
+		vcpu->arch.cpuid_entries[i].padding[2] = 0;
+	}
+	vcpu->arch.cpuid_nent = cpuid->nent;
+	cpuid_fix_nx_cap(vcpu);
+	r = 0;
+	kvm_apic_set_version(vcpu);
+	kvm_x86_ops->cpuid_update(vcpu);
+	kvm_update_cpuid(vcpu);
+
+out_free:
+	vfree(cpuid_entries);
+out:
+	return r;
+}
+
+int kvm_vcpu_ioctl_set_cpuid2(struct kvm_vcpu *vcpu,
+			      struct kvm_cpuid2 *cpuid,
+			      struct kvm_cpuid_entry2 __user *entries)
+{
+	int r;
+
+	r = -E2BIG;
+	if (cpuid->nent > KVM_MAX_CPUID_ENTRIES)
+		goto out;
+	r = -EFAULT;
+	if (copy_from_user(&vcpu->arch.cpuid_entries, entries,
+			   cpuid->nent * sizeof(struct kvm_cpuid_entry2)))
+		goto out;
+	vcpu->arch.cpuid_nent = cpuid->nent;
+	kvm_apic_set_version(vcpu);
+	kvm_x86_ops->cpuid_update(vcpu);
+	kvm_update_cpuid(vcpu);
+	return 0;
+
+out:
+	return r;
+}
+
+int kvm_vcpu_ioctl_get_cpuid2(struct kvm_vcpu *vcpu,
+			      struct kvm_cpuid2 *cpuid,
+			      struct kvm_cpuid_entry2 __user *entries)
+{
+	int r;
+
+	r = -E2BIG;
+	if (cpuid->nent < vcpu->arch.cpuid_nent)
+		goto out;
+	r = -EFAULT;
+	if (copy_to_user(entries, &vcpu->arch.cpuid_entries,
+			 vcpu->arch.cpuid_nent * sizeof(struct kvm_cpuid_entry2)))
+		goto out;
+	return 0;
+
+out:
+	cpuid->nent = vcpu->arch.cpuid_nent;
+	return r;
+}
+
+static void cpuid_mask(u32 *word, int wordnum)
+{
+	*word &= boot_cpu_data.x86_capability[wordnum];
+}
+
+static void do_cpuid_1_ent(struct kvm_cpuid_entry2 *entry, u32 function,
+			   u32 index)
+{
+	entry->function = function;
+	entry->index = index;
+	cpuid_count(entry->function, entry->index,
+		    &entry->eax, &entry->ebx, &entry->ecx, &entry->edx);
+	entry->flags = 0;
+}
+
+static bool supported_xcr0_bit(unsigned bit)
+{
+	u64 mask = ((u64)1 << bit);
+
+	return mask & (XSTATE_FP | XSTATE_SSE | XSTATE_YMM) & host_xcr0;
+}
+
+#define F(x) bit(X86_FEATURE_##x)
+
+static void do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
+			 u32 index, int *nent, int maxnent)
+{
+	unsigned f_nx = is_efer_nx() ? F(NX) : 0;
+#ifdef CONFIG_X86_64
+	unsigned f_gbpages = (kvm_x86_ops->get_lpage_level() == PT_PDPE_LEVEL)
+				? F(GBPAGES) : 0;
+	unsigned f_lm = F(LM);
+#else
+	unsigned f_gbpages = 0;
+	unsigned f_lm = 0;
+#endif
+	unsigned f_rdtscp = kvm_x86_ops->rdtscp_supported() ? F(RDTSCP) : 0;
+
+	/* cpuid 1.edx */
+	const u32 kvm_supported_word0_x86_features =
+		F(FPU) | F(VME) | F(DE) | F(PSE) |
+		F(TSC) | F(MSR) | F(PAE) | F(MCE) |
+		F(CX8) | F(APIC) | 0 /* Reserved */ | F(SEP) |
+		F(MTRR) | F(PGE) | F(MCA) | F(CMOV) |
+		F(PAT) | F(PSE36) | 0 /* PSN */ | F(CLFLSH) |
+		0 /* Reserved, DS, ACPI */ | F(MMX) |
+		F(FXSR) | F(XMM) | F(XMM2) | F(SELFSNOOP) |
+		0 /* HTT, TM, Reserved, PBE */;
+	/* cpuid 0x80000001.edx */
+	const u32 kvm_supported_word1_x86_features =
+		F(FPU) | F(VME) | F(DE) | F(PSE) |
+		F(TSC) | F(MSR) | F(PAE) | F(MCE) |
+		F(CX8) | F(APIC) | 0 /* Reserved */ | F(SYSCALL) |
+		F(MTRR) | F(PGE) | F(MCA) | F(CMOV) |
+		F(PAT) | F(PSE36) | 0 /* Reserved */ |
+		f_nx | 0 /* Reserved */ | F(MMXEXT) | F(MMX) |
+		F(FXSR) | F(FXSR_OPT) | f_gbpages | f_rdtscp |
+		0 /* Reserved */ | f_lm | F(3DNOWEXT) | F(3DNOW);
+	/* cpuid 1.ecx */
+	const u32 kvm_supported_word4_x86_features =
+		F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ |
+		0 /* DS-CPL, VMX, SMX, EST */ |
+		0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ |
+		0 /* Reserved */ | F(CX16) | 0 /* xTPR Update, PDCM */ |
+		0 /* Reserved, DCA */ | F(XMM4_1) |
+		F(XMM4_2) | F(X2APIC) | F(MOVBE) | F(POPCNT) |
+		0 /* Reserved*/ | F(AES) | F(XSAVE) | 0 /* OSXSAVE */ | F(AVX) |
+		F(F16C) | F(RDRAND);
+	/* cpuid 0x80000001.ecx */
+	const u32 kvm_supported_word6_x86_features =
+		F(LAHF_LM) | F(CMP_LEGACY) | 0 /*SVM*/ | 0 /* ExtApicSpace */ |
+		F(CR8_LEGACY) | F(ABM) | F(SSE4A) | F(MISALIGNSSE) |
+		F(3DNOWPREFETCH) | 0 /* OSVW */ | 0 /* IBS */ | F(XOP) |
+		0 /* SKINIT, WDT, LWP */ | F(FMA4) | F(TBM);
+
+	/* cpuid 0xC0000001.edx */
+	const u32 kvm_supported_word5_x86_features =
+		F(XSTORE) | F(XSTORE_EN) | F(XCRYPT) | F(XCRYPT_EN) |
+		F(ACE2) | F(ACE2_EN) | F(PHE) | F(PHE_EN) |
+		F(PMM) | F(PMM_EN);
+
+	/* cpuid 7.0.ebx */
+	const u32 kvm_supported_word9_x86_features =
+		F(SMEP) | F(FSGSBASE) | F(ERMS);
+
+	/* all calls to cpuid_count() should be made on the same cpu */
+	get_cpu();
+	do_cpuid_1_ent(entry, function, index);
+	++*nent;
+
+	switch (function) {
+	case 0:
+		entry->eax = min(entry->eax, (u32)0xd);
+		break;
+	case 1:
+		entry->edx &= kvm_supported_word0_x86_features;
+		cpuid_mask(&entry->edx, 0);
+		entry->ecx &= kvm_supported_word4_x86_features;
+		cpuid_mask(&entry->ecx, 4);
+		/* we support x2apic emulation even if host does not support
+		 * it since we emulate x2apic in software */
+		entry->ecx |= F(X2APIC);
+		break;
+	/* function 2 entries are STATEFUL. That is, repeated cpuid commands
+	 * may return different values. This forces us to get_cpu() before
+	 * issuing the first command, and also to emulate this annoying behavior
+	 * in kvm_emulate_cpuid() using KVM_CPUID_FLAG_STATE_READ_NEXT */
+	case 2: {
+		int t, times = entry->eax & 0xff;
+
+		entry->flags |= KVM_CPUID_FLAG_STATEFUL_FUNC;
+		entry->flags |= KVM_CPUID_FLAG_STATE_READ_NEXT;
+		for (t = 1; t < times && *nent < maxnent; ++t) {
+			do_cpuid_1_ent(&entry[t], function, 0);
+			entry[t].flags |= KVM_CPUID_FLAG_STATEFUL_FUNC;
+			++*nent;
+		}
+		break;
+	}
+	/* function 4 has additional index. */
+	case 4: {
+		int i, cache_type;
+
+		entry->flags |= KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
+		/* read more entries until cache_type is zero */
+		for (i = 1; *nent < maxnent; ++i) {
+			cache_type = entry[i - 1].eax & 0x1f;
+			if (!cache_type)
+				break;
+			do_cpuid_1_ent(&entry[i], function, i);
+			entry[i].flags |=
+			       KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
+			++*nent;
+		}
+		break;
+	}
+	case 7: {
+		entry->flags |= KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
+		/* Mask ebx against host capbability word 9 */
+		if (index == 0) {
+			entry->ebx &= kvm_supported_word9_x86_features;
+			cpuid_mask(&entry->ebx, 9);
+		} else
+			entry->ebx = 0;
+		entry->eax = 0;
+		entry->ecx = 0;
+		entry->edx = 0;
+		break;
+	}
+	case 9:
+		break;
+	/* function 0xb has additional index. */
+	case 0xb: {
+		int i, level_type;
+
+		entry->flags |= KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
+		/* read more entries until level_type is zero */
+		for (i = 1; *nent < maxnent; ++i) {
+			level_type = entry[i - 1].ecx & 0xff00;
+			if (!level_type)
+				break;
+			do_cpuid_1_ent(&entry[i], function, i);
+			entry[i].flags |=
+			       KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
+			++*nent;
+		}
+		break;
+	}
+	case 0xd: {
+		int idx, i;
+
+		entry->flags |= KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
+		for (idx = 1, i = 1; *nent < maxnent && idx < 64; ++idx) {
+			do_cpuid_1_ent(&entry[i], function, idx);
+			if (entry[i].eax == 0 || !supported_xcr0_bit(idx))
+				continue;
+			entry[i].flags |=
+			       KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
+			++*nent;
+			++i;
+		}
+		break;
+	}
+	case KVM_CPUID_SIGNATURE: {
+		char signature[12] = "KVMKVMKVM\0\0";
+		u32 *sigptr = (u32 *)signature;
+		entry->eax = 0;
+		entry->ebx = sigptr[0];
+		entry->ecx = sigptr[1];
+		entry->edx = sigptr[2];
+		break;
+	}
+	case KVM_CPUID_FEATURES:
+		entry->eax = (1 << KVM_FEATURE_CLOCKSOURCE) |
+			     (1 << KVM_FEATURE_NOP_IO_DELAY) |
+			     (1 << KVM_FEATURE_CLOCKSOURCE2) |
+			     (1 << KVM_FEATURE_ASYNC_PF) |
+			     (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT);
+
+		if (sched_info_on())
+			entry->eax |= (1 << KVM_FEATURE_STEAL_TIME);
+
+		entry->ebx = 0;
+		entry->ecx = 0;
+		entry->edx = 0;
+		break;
+	case 0x80000000:
+		entry->eax = min(entry->eax, 0x8000001a);
+		break;
+	case 0x80000001:
+		entry->edx &= kvm_supported_word1_x86_features;
+		cpuid_mask(&entry->edx, 1);
+		entry->ecx &= kvm_supported_word6_x86_features;
+		cpuid_mask(&entry->ecx, 6);
+		break;
+	case 0x80000008: {
+		unsigned g_phys_as = (entry->eax >> 16) & 0xff;
+		unsigned virt_as = max((entry->eax >> 8) & 0xff, 48U);
+		unsigned phys_as = entry->eax & 0xff;
+
+		if (!g_phys_as)
+			g_phys_as = phys_as;
+		entry->eax = g_phys_as | (virt_as << 8);
+		entry->ebx = entry->edx = 0;
+		break;
+	}
+	case 0x80000019:
+		entry->ecx = entry->edx = 0;
+		break;
+	case 0x8000001a:
+		break;
+	case 0x8000001d:
+		break;
+	/*Add support for Centaur's CPUID instruction*/
+	case 0xC0000000:
+		/*Just support up to 0xC0000004 now*/
+		entry->eax = min(entry->eax, 0xC0000004);
+		break;
+	case 0xC0000001:
+		entry->edx &= kvm_supported_word5_x86_features;
+		cpuid_mask(&entry->edx, 5);
+		break;
+	case 3: /* Processor serial number */
+	case 5: /* MONITOR/MWAIT */
+	case 6: /* Thermal management */
+	case 0xA: /* Architectural Performance Monitoring */
+	case 0x80000007: /* Advanced power management */
+	case 0xC0000002:
+	case 0xC0000003:
+	case 0xC0000004:
+	default:
+		entry->eax = entry->ebx = entry->ecx = entry->edx = 0;
+		break;
+	}
+
+	kvm_x86_ops->set_supported_cpuid(function, entry);
+
+	put_cpu();
+}
+
+#undef F
+
+int kvm_dev_ioctl_get_supported_cpuid(struct kvm_cpuid2 *cpuid,
+				      struct kvm_cpuid_entry2 __user *entries)
+{
+	struct kvm_cpuid_entry2 *cpuid_entries;
+	int limit, nent = 0, r = -E2BIG;
+	u32 func;
+
+	if (cpuid->nent < 1)
+		goto out;
+	if (cpuid->nent > KVM_MAX_CPUID_ENTRIES)
+		cpuid->nent = KVM_MAX_CPUID_ENTRIES;
+	r = -ENOMEM;
+	cpuid_entries = vmalloc(sizeof(struct kvm_cpuid_entry2) * cpuid->nent);
+	if (!cpuid_entries)
+		goto out;
+
+	do_cpuid_ent(&cpuid_entries[0], 0, 0, &nent, cpuid->nent);
+	limit = cpuid_entries[0].eax;
+	for (func = 1; func <= limit && nent < cpuid->nent; ++func)
+		do_cpuid_ent(&cpuid_entries[nent], func, 0,
+			     &nent, cpuid->nent);
+	r = -E2BIG;
+	if (nent >= cpuid->nent)
+		goto out_free;
+
+	do_cpuid_ent(&cpuid_entries[nent], 0x80000000, 0, &nent, cpuid->nent);
+	limit = cpuid_entries[nent - 1].eax;
+	for (func = 0x80000001; func <= limit && nent < cpuid->nent; ++func)
+		do_cpuid_ent(&cpuid_entries[nent], func, 0,
+			     &nent, cpuid->nent);
+
+
+
+	r = -E2BIG;
+	if (nent >= cpuid->nent)
+		goto out_free;
+
+	/* Add support for Centaur's CPUID instruction. */
+	if (boot_cpu_data.x86_vendor == X86_VENDOR_CENTAUR) {
+		do_cpuid_ent(&cpuid_entries[nent], 0xC0000000, 0,
+				&nent, cpuid->nent);
+
+		r = -E2BIG;
+		if (nent >= cpuid->nent)
+			goto out_free;
+
+		limit = cpuid_entries[nent - 1].eax;
+		for (func = 0xC0000001;
+			func <= limit && nent < cpuid->nent; ++func)
+			do_cpuid_ent(&cpuid_entries[nent], func, 0,
+					&nent, cpuid->nent);
+
+		r = -E2BIG;
+		if (nent >= cpuid->nent)
+			goto out_free;
+	}
+
+	do_cpuid_ent(&cpuid_entries[nent], KVM_CPUID_SIGNATURE, 0, &nent,
+		     cpuid->nent);
+
+	r = -E2BIG;
+	if (nent >= cpuid->nent)
+		goto out_free;
+
+	do_cpuid_ent(&cpuid_entries[nent], KVM_CPUID_FEATURES, 0, &nent,
+		     cpuid->nent);
+
+	r = -E2BIG;
+	if (nent >= cpuid->nent)
+		goto out_free;
+
+	r = -EFAULT;
+	if (copy_to_user(entries, cpuid_entries,
+			 nent * sizeof(struct kvm_cpuid_entry2)))
+		goto out_free;
+	cpuid->nent = nent;
+	r = 0;
+
+out_free:
+	vfree(cpuid_entries);
+out:
+	return r;
+}
+
+static int move_to_next_stateful_cpuid_entry(struct kvm_vcpu *vcpu, int i)
+{
+	struct kvm_cpuid_entry2 *e = &vcpu->arch.cpuid_entries[i];
+	int j, nent = vcpu->arch.cpuid_nent;
+
+	e->flags &= ~KVM_CPUID_FLAG_STATE_READ_NEXT;
+	/* when no next entry is found, the current entry[i] is reselected */
+	for (j = i + 1; ; j = (j + 1) % nent) {
+		struct kvm_cpuid_entry2 *ej = &vcpu->arch.cpuid_entries[j];
+		if (ej->function == e->function) {
+			ej->flags |= KVM_CPUID_FLAG_STATE_READ_NEXT;
+			return j;
+		}
+	}
+	return 0; /* silence gcc, even though control never reaches here */
+}
+
+/* find an entry with matching function, matching index (if needed), and that
+ * should be read next (if it's stateful) */
+static int is_matching_cpuid_entry(struct kvm_cpuid_entry2 *e,
+	u32 function, u32 index)
+{
+	if (e->function != function)
+		return 0;
+	if ((e->flags & KVM_CPUID_FLAG_SIGNIFCANT_INDEX) && e->index != index)
+		return 0;
+	if ((e->flags & KVM_CPUID_FLAG_STATEFUL_FUNC) &&
+	    !(e->flags & KVM_CPUID_FLAG_STATE_READ_NEXT))
+		return 0;
+	return 1;
+}
+
+struct kvm_cpuid_entry2 *kvm_find_cpuid_entry(struct kvm_vcpu *vcpu,
+					      u32 function, u32 index)
+{
+	int i;
+	struct kvm_cpuid_entry2 *best = NULL;
+
+	for (i = 0; i < vcpu->arch.cpuid_nent; ++i) {
+		struct kvm_cpuid_entry2 *e;
+
+		e = &vcpu->arch.cpuid_entries[i];
+		if (is_matching_cpuid_entry(e, function, index)) {
+			if (e->flags & KVM_CPUID_FLAG_STATEFUL_FUNC)
+				move_to_next_stateful_cpuid_entry(vcpu, i);
+			best = e;
+			break;
+		}
+	}
+	return best;
+}
+EXPORT_SYMBOL_GPL(kvm_find_cpuid_entry);
+
+int cpuid_maxphyaddr(struct kvm_vcpu *vcpu)
+{
+	struct kvm_cpuid_entry2 *best;
+
+	best = kvm_find_cpuid_entry(vcpu, 0x80000000, 0);
+	if (!best || best->eax < 0x80000008)
+		goto not_found;
+	best = kvm_find_cpuid_entry(vcpu, 0x80000008, 0);
+	if (best)
+		return best->eax & 0xff;
+not_found:
+	return 36;
+}
+
+/*
+ * If no match is found, check whether we exceed the vCPU's limit
+ * and return the content of the highest valid _standard_ leaf instead.
+ * This is to satisfy the CPUID specification.
+ */
+static struct kvm_cpuid_entry2* check_cpuid_limit(struct kvm_vcpu *vcpu,
+                                                  u32 function, u32 index)
+{
+	struct kvm_cpuid_entry2 *maxlevel;
+
+	maxlevel = kvm_find_cpuid_entry(vcpu, function & 0x80000000, 0);
+	if (!maxlevel || maxlevel->eax >= function)
+		return NULL;
+	if (function & 0x80000000) {
+		maxlevel = kvm_find_cpuid_entry(vcpu, 0, 0);
+		if (!maxlevel)
+			return NULL;
+	}
+	return kvm_find_cpuid_entry(vcpu, maxlevel->eax, index);
+}
+
+void kvm_emulate_cpuid(struct kvm_vcpu *vcpu)
+{
+	u32 function, index;
+	struct kvm_cpuid_entry2 *best;
+
+	function = kvm_register_read(vcpu, VCPU_REGS_RAX);
+	index = kvm_register_read(vcpu, VCPU_REGS_RCX);
+	kvm_register_write(vcpu, VCPU_REGS_RAX, 0);
+	kvm_register_write(vcpu, VCPU_REGS_RBX, 0);
+	kvm_register_write(vcpu, VCPU_REGS_RCX, 0);
+	kvm_register_write(vcpu, VCPU_REGS_RDX, 0);
+	best = kvm_find_cpuid_entry(vcpu, function, index);
+
+	if (!best)
+		best = check_cpuid_limit(vcpu, function, index);
+
+	if (best) {
+		kvm_register_write(vcpu, VCPU_REGS_RAX, best->eax);
+		kvm_register_write(vcpu, VCPU_REGS_RBX, best->ebx);
+		kvm_register_write(vcpu, VCPU_REGS_RCX, best->ecx);
+		kvm_register_write(vcpu, VCPU_REGS_RDX, best->edx);
+	}
+	kvm_x86_ops->skip_emulated_instruction(vcpu);
+	trace_kvm_cpuid(function,
+			kvm_register_read(vcpu, VCPU_REGS_RAX),
+			kvm_register_read(vcpu, VCPU_REGS_RBX),
+			kvm_register_read(vcpu, VCPU_REGS_RCX),
+			kvm_register_read(vcpu, VCPU_REGS_RDX));
+}
+EXPORT_SYMBOL_GPL(kvm_emulate_cpuid);
diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
new file mode 100644
index 0000000..5b97e17
--- /dev/null
+++ b/arch/x86/kvm/cpuid.h
@@ -0,0 +1,46 @@
+#ifndef ARCH_X86_KVM_CPUID_H
+#define ARCH_X86_KVM_CPUID_H
+
+#include "x86.h"
+
+void kvm_update_cpuid(struct kvm_vcpu *vcpu);
+struct kvm_cpuid_entry2 *kvm_find_cpuid_entry(struct kvm_vcpu *vcpu,
+					      u32 function, u32 index);
+int kvm_dev_ioctl_get_supported_cpuid(struct kvm_cpuid2 *cpuid,
+				      struct kvm_cpuid_entry2 __user *entries);
+int kvm_vcpu_ioctl_set_cpuid(struct kvm_vcpu *vcpu,
+			     struct kvm_cpuid *cpuid,
+			     struct kvm_cpuid_entry __user *entries);
+int kvm_vcpu_ioctl_set_cpuid2(struct kvm_vcpu *vcpu,
+			      struct kvm_cpuid2 *cpuid,
+			      struct kvm_cpuid_entry2 __user *entries);
+int kvm_vcpu_ioctl_get_cpuid2(struct kvm_vcpu *vcpu,
+			      struct kvm_cpuid2 *cpuid,
+			      struct kvm_cpuid_entry2 __user *entries);
+
+
+static inline bool guest_cpuid_has_xsave(struct kvm_vcpu *vcpu)
+{
+	struct kvm_cpuid_entry2 *best;
+
+	best = kvm_find_cpuid_entry(vcpu, 1, 0);
+	return best && (best->ecx & bit(X86_FEATURE_XSAVE));
+}
+
+static inline bool guest_cpuid_has_smep(struct kvm_vcpu *vcpu)
+{
+	struct kvm_cpuid_entry2 *best;
+
+	best = kvm_find_cpuid_entry(vcpu, 7, 0);
+	return best && (best->ebx & bit(X86_FEATURE_SMEP));
+}
+
+static inline bool guest_cpuid_has_fsgsbase(struct kvm_vcpu *vcpu)
+{
+	struct kvm_cpuid_entry2 *best;
+
+	best = kvm_find_cpuid_entry(vcpu, 7, 0);
+	return best && (best->ebx & bit(X86_FEATURE_FSGSBASE));
+}
+
+#endif
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 54abb40..a7f3e65 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -38,6 +38,7 @@
 #include "irq.h"
 #include "trace.h"
 #include "x86.h"
+#include "cpuid.h"
 
 #ifndef CONFIG_X86_64
 #define mod_64(x, y) ((x) - (y) * div64_u64(x, y))
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 7315488..114fe29 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -18,6 +18,7 @@
 
 #include "irq.h"
 #include "mmu.h"
+#include "cpuid.h"
 
 #include <linux/kvm_host.h>
 #include <linux/module.h>
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4fc5323..759af84 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -26,6 +26,7 @@
 #include "tss.h"
 #include "kvm_cache_regs.h"
 #include "x86.h"
+#include "cpuid.h"
 
 #include <linux/clocksource.h>
 #include <linux/interrupt.h>
@@ -82,8 +83,6 @@ static u64 __read_mostly efer_reserved_bits = ~((u64)EFER_SCE);
 #define VCPU_STAT(x) offsetof(struct kvm_vcpu, stat.x), KVM_STAT_VCPU
 
 static void update_cr8_intercept(struct kvm_vcpu *vcpu);
-static int kvm_dev_ioctl_get_supported_cpuid(struct kvm_cpuid2 *cpuid,
-				    struct kvm_cpuid_entry2 __user *entries);
 static void process_nmi(struct kvm_vcpu *vcpu);
 
 struct kvm_x86_ops *kvm_x86_ops;
@@ -574,54 +573,6 @@ int kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr)
 }
 EXPORT_SYMBOL_GPL(kvm_set_xcr);
 
-static bool guest_cpuid_has_xsave(struct kvm_vcpu *vcpu)
-{
-	struct kvm_cpuid_entry2 *best;
-
-	best = kvm_find_cpuid_entry(vcpu, 1, 0);
-	return best && (best->ecx & bit(X86_FEATURE_XSAVE));
-}
-
-static bool guest_cpuid_has_smep(struct kvm_vcpu *vcpu)
-{
-	struct kvm_cpuid_entry2 *best;
-
-	best = kvm_find_cpuid_entry(vcpu, 7, 0);
-	return best && (best->ebx & bit(X86_FEATURE_SMEP));
-}
-
-static bool guest_cpuid_has_fsgsbase(struct kvm_vcpu *vcpu)
-{
-	struct kvm_cpuid_entry2 *best;
-
-	best = kvm_find_cpuid_entry(vcpu, 7, 0);
-	return best && (best->ebx & bit(X86_FEATURE_FSGSBASE));
-}
-
-static void update_cpuid(struct kvm_vcpu *vcpu)
-{
-	struct kvm_cpuid_entry2 *best;
-	struct kvm_lapic *apic = vcpu->arch.apic;
-
-	best = kvm_find_cpuid_entry(vcpu, 1, 0);
-	if (!best)
-		return;
-
-	/* Update OSXSAVE bit */
-	if (cpu_has_xsave && best->function == 0x1) {
-		best->ecx &= ~(bit(X86_FEATURE_OSXSAVE));
-		if (kvm_read_cr4_bits(vcpu, X86_CR4_OSXSAVE))
-			best->ecx |= bit(X86_FEATURE_OSXSAVE);
-	}
-
-	if (apic) {
-		if (best->ecx & bit(X86_FEATURE_TSC_DEADLINE_TIMER))
-			apic->lapic_timer.timer_mode_mask = 3 << 17;
-		else
-			apic->lapic_timer.timer_mode_mask = 1 << 17;
-	}
-}
-
 int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 {
 	unsigned long old_cr4 = kvm_read_cr4(vcpu);
@@ -655,7 +606,7 @@ int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 		kvm_mmu_reset_context(vcpu);
 
 	if ((cr4 ^ old_cr4) & X86_CR4_OSXSAVE)
-		update_cpuid(vcpu);
+		kvm_update_cpuid(vcpu);
 
 	return 0;
 }
@@ -2265,466 +2216,6 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
 	vcpu->arch.last_guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu);
 }
 
-static int is_efer_nx(void)
-{
-	unsigned long long efer = 0;
-
-	rdmsrl_safe(MSR_EFER, &efer);
-	return efer & EFER_NX;
-}
-
-static void cpuid_fix_nx_cap(struct kvm_vcpu *vcpu)
-{
-	int i;
-	struct kvm_cpuid_entry2 *e, *entry;
-
-	entry = NULL;
-	for (i = 0; i < vcpu->arch.cpuid_nent; ++i) {
-		e = &vcpu->arch.cpuid_entries[i];
-		if (e->function == 0x80000001) {
-			entry = e;
-			break;
-		}
-	}
-	if (entry && (entry->edx & (1 << 20)) && !is_efer_nx()) {
-		entry->edx &= ~(1 << 20);
-		printk(KERN_INFO "kvm: guest NX capability removed\n");
-	}
-}
-
-/* when an old userspace process fills a new kernel module */
-static int kvm_vcpu_ioctl_set_cpuid(struct kvm_vcpu *vcpu,
-				    struct kvm_cpuid *cpuid,
-				    struct kvm_cpuid_entry __user *entries)
-{
-	int r, i;
-	struct kvm_cpuid_entry *cpuid_entries;
-
-	r = -E2BIG;
-	if (cpuid->nent > KVM_MAX_CPUID_ENTRIES)
-		goto out;
-	r = -ENOMEM;
-	cpuid_entries = vmalloc(sizeof(struct kvm_cpuid_entry) * cpuid->nent);
-	if (!cpuid_entries)
-		goto out;
-	r = -EFAULT;
-	if (copy_from_user(cpuid_entries, entries,
-			   cpuid->nent * sizeof(struct kvm_cpuid_entry)))
-		goto out_free;
-	for (i = 0; i < cpuid->nent; i++) {
-		vcpu->arch.cpuid_entries[i].function = cpuid_entries[i].function;
-		vcpu->arch.cpuid_entries[i].eax = cpuid_entries[i].eax;
-		vcpu->arch.cpuid_entries[i].ebx = cpuid_entries[i].ebx;
-		vcpu->arch.cpuid_entries[i].ecx = cpuid_entries[i].ecx;
-		vcpu->arch.cpuid_entries[i].edx = cpuid_entries[i].edx;
-		vcpu->arch.cpuid_entries[i].index = 0;
-		vcpu->arch.cpuid_entries[i].flags = 0;
-		vcpu->arch.cpuid_entries[i].padding[0] = 0;
-		vcpu->arch.cpuid_entries[i].padding[1] = 0;
-		vcpu->arch.cpuid_entries[i].padding[2] = 0;
-	}
-	vcpu->arch.cpuid_nent = cpuid->nent;
-	cpuid_fix_nx_cap(vcpu);
-	r = 0;
-	kvm_apic_set_version(vcpu);
-	kvm_x86_ops->cpuid_update(vcpu);
-	update_cpuid(vcpu);
-
-out_free:
-	vfree(cpuid_entries);
-out:
-	return r;
-}
-
-static int kvm_vcpu_ioctl_set_cpuid2(struct kvm_vcpu *vcpu,
-				     struct kvm_cpuid2 *cpuid,
-				     struct kvm_cpuid_entry2 __user *entries)
-{
-	int r;
-
-	r = -E2BIG;
-	if (cpuid->nent > KVM_MAX_CPUID_ENTRIES)
-		goto out;
-	r = -EFAULT;
-	if (copy_from_user(&vcpu->arch.cpuid_entries, entries,
-			   cpuid->nent * sizeof(struct kvm_cpuid_entry2)))
-		goto out;
-	vcpu->arch.cpuid_nent = cpuid->nent;
-	kvm_apic_set_version(vcpu);
-	kvm_x86_ops->cpuid_update(vcpu);
-	update_cpuid(vcpu);
-	return 0;
-
-out:
-	return r;
-}
-
-static int kvm_vcpu_ioctl_get_cpuid2(struct kvm_vcpu *vcpu,
-				     struct kvm_cpuid2 *cpuid,
-				     struct kvm_cpuid_entry2 __user *entries)
-{
-	int r;
-
-	r = -E2BIG;
-	if (cpuid->nent < vcpu->arch.cpuid_nent)
-		goto out;
-	r = -EFAULT;
-	if (copy_to_user(entries, &vcpu->arch.cpuid_entries,
-			 vcpu->arch.cpuid_nent * sizeof(struct kvm_cpuid_entry2)))
-		goto out;
-	return 0;
-
-out:
-	cpuid->nent = vcpu->arch.cpuid_nent;
-	return r;
-}
-
-static void cpuid_mask(u32 *word, int wordnum)
-{
-	*word &= boot_cpu_data.x86_capability[wordnum];
-}
-
-static void do_cpuid_1_ent(struct kvm_cpuid_entry2 *entry, u32 function,
-			   u32 index)
-{
-	entry->function = function;
-	entry->index = index;
-	cpuid_count(entry->function, entry->index,
-		    &entry->eax, &entry->ebx, &entry->ecx, &entry->edx);
-	entry->flags = 0;
-}
-
-static bool supported_xcr0_bit(unsigned bit)
-{
-	u64 mask = ((u64)1 << bit);
-
-	return mask & (XSTATE_FP | XSTATE_SSE | XSTATE_YMM) & host_xcr0;
-}
-
-#define F(x) bit(X86_FEATURE_##x)
-
-static void do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
-			 u32 index, int *nent, int maxnent)
-{
-	unsigned f_nx = is_efer_nx() ? F(NX) : 0;
-#ifdef CONFIG_X86_64
-	unsigned f_gbpages = (kvm_x86_ops->get_lpage_level() == PT_PDPE_LEVEL)
-				? F(GBPAGES) : 0;
-	unsigned f_lm = F(LM);
-#else
-	unsigned f_gbpages = 0;
-	unsigned f_lm = 0;
-#endif
-	unsigned f_rdtscp = kvm_x86_ops->rdtscp_supported() ? F(RDTSCP) : 0;
-
-	/* cpuid 1.edx */
-	const u32 kvm_supported_word0_x86_features =
-		F(FPU) | F(VME) | F(DE) | F(PSE) |
-		F(TSC) | F(MSR) | F(PAE) | F(MCE) |
-		F(CX8) | F(APIC) | 0 /* Reserved */ | F(SEP) |
-		F(MTRR) | F(PGE) | F(MCA) | F(CMOV) |
-		F(PAT) | F(PSE36) | 0 /* PSN */ | F(CLFLSH) |
-		0 /* Reserved, DS, ACPI */ | F(MMX) |
-		F(FXSR) | F(XMM) | F(XMM2) | F(SELFSNOOP) |
-		0 /* HTT, TM, Reserved, PBE */;
-	/* cpuid 0x80000001.edx */
-	const u32 kvm_supported_word1_x86_features =
-		F(FPU) | F(VME) | F(DE) | F(PSE) |
-		F(TSC) | F(MSR) | F(PAE) | F(MCE) |
-		F(CX8) | F(APIC) | 0 /* Reserved */ | F(SYSCALL) |
-		F(MTRR) | F(PGE) | F(MCA) | F(CMOV) |
-		F(PAT) | F(PSE36) | 0 /* Reserved */ |
-		f_nx | 0 /* Reserved */ | F(MMXEXT) | F(MMX) |
-		F(FXSR) | F(FXSR_OPT) | f_gbpages | f_rdtscp |
-		0 /* Reserved */ | f_lm | F(3DNOWEXT) | F(3DNOW);
-	/* cpuid 1.ecx */
-	const u32 kvm_supported_word4_x86_features =
-		F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ |
-		0 /* DS-CPL, VMX, SMX, EST */ |
-		0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ |
-		0 /* Reserved */ | F(CX16) | 0 /* xTPR Update, PDCM */ |
-		0 /* Reserved, DCA */ | F(XMM4_1) |
-		F(XMM4_2) | F(X2APIC) | F(MOVBE) | F(POPCNT) |
-		0 /* Reserved*/ | F(AES) | F(XSAVE) | 0 /* OSXSAVE */ | F(AVX) |
-		F(F16C) | F(RDRAND);
-	/* cpuid 0x80000001.ecx */
-	const u32 kvm_supported_word6_x86_features =
-		F(LAHF_LM) | F(CMP_LEGACY) | 0 /*SVM*/ | 0 /* ExtApicSpace */ |
-		F(CR8_LEGACY) | F(ABM) | F(SSE4A) | F(MISALIGNSSE) |
-		F(3DNOWPREFETCH) | 0 /* OSVW */ | 0 /* IBS */ | F(XOP) |
-		0 /* SKINIT, WDT, LWP */ | F(FMA4) | F(TBM);
-
-	/* cpuid 0xC0000001.edx */
-	const u32 kvm_supported_word5_x86_features =
-		F(XSTORE) | F(XSTORE_EN) | F(XCRYPT) | F(XCRYPT_EN) |
-		F(ACE2) | F(ACE2_EN) | F(PHE) | F(PHE_EN) |
-		F(PMM) | F(PMM_EN);
-
-	/* cpuid 7.0.ebx */
-	const u32 kvm_supported_word9_x86_features =
-		F(SMEP) | F(FSGSBASE) | F(ERMS);
-
-	/* all calls to cpuid_count() should be made on the same cpu */
-	get_cpu();
-	do_cpuid_1_ent(entry, function, index);
-	++*nent;
-
-	switch (function) {
-	case 0:
-		entry->eax = min(entry->eax, (u32)0xd);
-		break;
-	case 1:
-		entry->edx &= kvm_supported_word0_x86_features;
-		cpuid_mask(&entry->edx, 0);
-		entry->ecx &= kvm_supported_word4_x86_features;
-		cpuid_mask(&entry->ecx, 4);
-		/* we support x2apic emulation even if host does not support
-		 * it since we emulate x2apic in software */
-		entry->ecx |= F(X2APIC);
-		break;
-	/* function 2 entries are STATEFUL. That is, repeated cpuid commands
-	 * may return different values. This forces us to get_cpu() before
-	 * issuing the first command, and also to emulate this annoying behavior
-	 * in kvm_emulate_cpuid() using KVM_CPUID_FLAG_STATE_READ_NEXT */
-	case 2: {
-		int t, times = entry->eax & 0xff;
-
-		entry->flags |= KVM_CPUID_FLAG_STATEFUL_FUNC;
-		entry->flags |= KVM_CPUID_FLAG_STATE_READ_NEXT;
-		for (t = 1; t < times && *nent < maxnent; ++t) {
-			do_cpuid_1_ent(&entry[t], function, 0);
-			entry[t].flags |= KVM_CPUID_FLAG_STATEFUL_FUNC;
-			++*nent;
-		}
-		break;
-	}
-	/* function 4 has additional index. */
-	case 4: {
-		int i, cache_type;
-
-		entry->flags |= KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
-		/* read more entries until cache_type is zero */
-		for (i = 1; *nent < maxnent; ++i) {
-			cache_type = entry[i - 1].eax & 0x1f;
-			if (!cache_type)
-				break;
-			do_cpuid_1_ent(&entry[i], function, i);
-			entry[i].flags |=
-			       KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
-			++*nent;
-		}
-		break;
-	}
-	case 7: {
-		entry->flags |= KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
-		/* Mask ebx against host capbability word 9 */
-		if (index == 0) {
-			entry->ebx &= kvm_supported_word9_x86_features;
-			cpuid_mask(&entry->ebx, 9);
-		} else
-			entry->ebx = 0;
-		entry->eax = 0;
-		entry->ecx = 0;
-		entry->edx = 0;
-		break;
-	}
-	case 9:
-		break;
-	/* function 0xb has additional index. */
-	case 0xb: {
-		int i, level_type;
-
-		entry->flags |= KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
-		/* read more entries until level_type is zero */
-		for (i = 1; *nent < maxnent; ++i) {
-			level_type = entry[i - 1].ecx & 0xff00;
-			if (!level_type)
-				break;
-			do_cpuid_1_ent(&entry[i], function, i);
-			entry[i].flags |=
-			       KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
-			++*nent;
-		}
-		break;
-	}
-	case 0xd: {
-		int idx, i;
-
-		entry->flags |= KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
-		for (idx = 1, i = 1; *nent < maxnent && idx < 64; ++idx) {
-			do_cpuid_1_ent(&entry[i], function, idx);
-			if (entry[i].eax == 0 || !supported_xcr0_bit(idx))
-				continue;
-			entry[i].flags |=
-			       KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
-			++*nent;
-			++i;
-		}
-		break;
-	}
-	case KVM_CPUID_SIGNATURE: {
-		char signature[12] = "KVMKVMKVM\0\0";
-		u32 *sigptr = (u32 *)signature;
-		entry->eax = 0;
-		entry->ebx = sigptr[0];
-		entry->ecx = sigptr[1];
-		entry->edx = sigptr[2];
-		break;
-	}
-	case KVM_CPUID_FEATURES:
-		entry->eax = (1 << KVM_FEATURE_CLOCKSOURCE) |
-			     (1 << KVM_FEATURE_NOP_IO_DELAY) |
-			     (1 << KVM_FEATURE_CLOCKSOURCE2) |
-			     (1 << KVM_FEATURE_ASYNC_PF) |
-			     (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT);
-
-		if (sched_info_on())
-			entry->eax |= (1 << KVM_FEATURE_STEAL_TIME);
-
-		entry->ebx = 0;
-		entry->ecx = 0;
-		entry->edx = 0;
-		break;
-	case 0x80000000:
-		entry->eax = min(entry->eax, 0x8000001a);
-		break;
-	case 0x80000001:
-		entry->edx &= kvm_supported_word1_x86_features;
-		cpuid_mask(&entry->edx, 1);
-		entry->ecx &= kvm_supported_word6_x86_features;
-		cpuid_mask(&entry->ecx, 6);
-		break;
-	case 0x80000008: {
-		unsigned g_phys_as = (entry->eax >> 16) & 0xff;
-		unsigned virt_as = max((entry->eax >> 8) & 0xff, 48U);
-		unsigned phys_as = entry->eax & 0xff;
-
-		if (!g_phys_as)
-			g_phys_as = phys_as;
-		entry->eax = g_phys_as | (virt_as << 8);
-		entry->ebx = entry->edx = 0;
-		break;
-	}
-	case 0x80000019:
-		entry->ecx = entry->edx = 0;
-		break;
-	case 0x8000001a:
-		break;
-	case 0x8000001d:
-		break;
-	/*Add support for Centaur's CPUID instruction*/
-	case 0xC0000000:
-		/*Just support up to 0xC0000004 now*/
-		entry->eax = min(entry->eax, 0xC0000004);
-		break;
-	case 0xC0000001:
-		entry->edx &= kvm_supported_word5_x86_features;
-		cpuid_mask(&entry->edx, 5);
-		break;
-	case 3: /* Processor serial number */
-	case 5: /* MONITOR/MWAIT */
-	case 6: /* Thermal management */
-	case 0xA: /* Architectural Performance Monitoring */
-	case 0x80000007: /* Advanced power management */
-	case 0xC0000002:
-	case 0xC0000003:
-	case 0xC0000004:
-	default:
-		entry->eax = entry->ebx = entry->ecx = entry->edx = 0;
-		break;
-	}
-
-	kvm_x86_ops->set_supported_cpuid(function, entry);
-
-	put_cpu();
-}
-
-#undef F
-
-static int kvm_dev_ioctl_get_supported_cpuid(struct kvm_cpuid2 *cpuid,
-				     struct kvm_cpuid_entry2 __user *entries)
-{
-	struct kvm_cpuid_entry2 *cpuid_entries;
-	int limit, nent = 0, r = -E2BIG;
-	u32 func;
-
-	if (cpuid->nent < 1)
-		goto out;
-	if (cpuid->nent > KVM_MAX_CPUID_ENTRIES)
-		cpuid->nent = KVM_MAX_CPUID_ENTRIES;
-	r = -ENOMEM;
-	cpuid_entries = vmalloc(sizeof(struct kvm_cpuid_entry2) * cpuid->nent);
-	if (!cpuid_entries)
-		goto out;
-
-	do_cpuid_ent(&cpuid_entries[0], 0, 0, &nent, cpuid->nent);
-	limit = cpuid_entries[0].eax;
-	for (func = 1; func <= limit && nent < cpuid->nent; ++func)
-		do_cpuid_ent(&cpuid_entries[nent], func, 0,
-			     &nent, cpuid->nent);
-	r = -E2BIG;
-	if (nent >= cpuid->nent)
-		goto out_free;
-
-	do_cpuid_ent(&cpuid_entries[nent], 0x80000000, 0, &nent, cpuid->nent);
-	limit = cpuid_entries[nent - 1].eax;
-	for (func = 0x80000001; func <= limit && nent < cpuid->nent; ++func)
-		do_cpuid_ent(&cpuid_entries[nent], func, 0,
-			     &nent, cpuid->nent);
-
-
-
-	r = -E2BIG;
-	if (nent >= cpuid->nent)
-		goto out_free;
-
-	/* Add support for Centaur's CPUID instruction. */
-	if (boot_cpu_data.x86_vendor == X86_VENDOR_CENTAUR) {
-		do_cpuid_ent(&cpuid_entries[nent], 0xC0000000, 0,
-				&nent, cpuid->nent);
-
-		r = -E2BIG;
-		if (nent >= cpuid->nent)
-			goto out_free;
-
-		limit = cpuid_entries[nent - 1].eax;
-		for (func = 0xC0000001;
-			func <= limit && nent < cpuid->nent; ++func)
-			do_cpuid_ent(&cpuid_entries[nent], func, 0,
-					&nent, cpuid->nent);
-
-		r = -E2BIG;
-		if (nent >= cpuid->nent)
-			goto out_free;
-	}
-
-	do_cpuid_ent(&cpuid_entries[nent], KVM_CPUID_SIGNATURE, 0, &nent,
-		     cpuid->nent);
-
-	r = -E2BIG;
-	if (nent >= cpuid->nent)
-		goto out_free;
-
-	do_cpuid_ent(&cpuid_entries[nent], KVM_CPUID_FEATURES, 0, &nent,
-		     cpuid->nent);
-
-	r = -E2BIG;
-	if (nent >= cpuid->nent)
-		goto out_free;
-
-	r = -EFAULT;
-	if (copy_to_user(entries, cpuid_entries,
-			 nent * sizeof(struct kvm_cpuid_entry2)))
-		goto out_free;
-	cpuid->nent = nent;
-	r = 0;
-
-out_free:
-	vfree(cpuid_entries);
-out:
-	return r;
-}
-
 static int kvm_vcpu_ioctl_get_lapic(struct kvm_vcpu *vcpu,
 				    struct kvm_lapic_state *s)
 {
@@ -5395,125 +4886,6 @@ int emulator_fix_hypercall(struct x86_emulate_ctxt *ctxt)
 	return emulator_write_emulated(ctxt, rip, instruction, 3, NULL);
 }
 
-static int move_to_next_stateful_cpuid_entry(struct kvm_vcpu *vcpu, int i)
-{
-	struct kvm_cpuid_entry2 *e = &vcpu->arch.cpuid_entries[i];
-	int j, nent = vcpu->arch.cpuid_nent;
-
-	e->flags &= ~KVM_CPUID_FLAG_STATE_READ_NEXT;
-	/* when no next entry is found, the current entry[i] is reselected */
-	for (j = i + 1; ; j = (j + 1) % nent) {
-		struct kvm_cpuid_entry2 *ej = &vcpu->arch.cpuid_entries[j];
-		if (ej->function == e->function) {
-			ej->flags |= KVM_CPUID_FLAG_STATE_READ_NEXT;
-			return j;
-		}
-	}
-	return 0; /* silence gcc, even though control never reaches here */
-}
-
-/* find an entry with matching function, matching index (if needed), and that
- * should be read next (if it's stateful) */
-static int is_matching_cpuid_entry(struct kvm_cpuid_entry2 *e,
-	u32 function, u32 index)
-{
-	if (e->function != function)
-		return 0;
-	if ((e->flags & KVM_CPUID_FLAG_SIGNIFCANT_INDEX) && e->index != index)
-		return 0;
-	if ((e->flags & KVM_CPUID_FLAG_STATEFUL_FUNC) &&
-	    !(e->flags & KVM_CPUID_FLAG_STATE_READ_NEXT))
-		return 0;
-	return 1;
-}
-
-struct kvm_cpuid_entry2 *kvm_find_cpuid_entry(struct kvm_vcpu *vcpu,
-					      u32 function, u32 index)
-{
-	int i;
-	struct kvm_cpuid_entry2 *best = NULL;
-
-	for (i = 0; i < vcpu->arch.cpuid_nent; ++i) {
-		struct kvm_cpuid_entry2 *e;
-
-		e = &vcpu->arch.cpuid_entries[i];
-		if (is_matching_cpuid_entry(e, function, index)) {
-			if (e->flags & KVM_CPUID_FLAG_STATEFUL_FUNC)
-				move_to_next_stateful_cpuid_entry(vcpu, i);
-			best = e;
-			break;
-		}
-	}
-	return best;
-}
-EXPORT_SYMBOL_GPL(kvm_find_cpuid_entry);
-
-int cpuid_maxphyaddr(struct kvm_vcpu *vcpu)
-{
-	struct kvm_cpuid_entry2 *best;
-
-	best = kvm_find_cpuid_entry(vcpu, 0x80000000, 0);
-	if (!best || best->eax < 0x80000008)
-		goto not_found;
-	best = kvm_find_cpuid_entry(vcpu, 0x80000008, 0);
-	if (best)
-		return best->eax & 0xff;
-not_found:
-	return 36;
-}
-
-/*
- * If no match is found, check whether we exceed the vCPU's limit
- * and return the content of the highest valid _standard_ leaf instead.
- * This is to satisfy the CPUID specification.
- */
-static struct kvm_cpuid_entry2* check_cpuid_limit(struct kvm_vcpu *vcpu,
-                                                  u32 function, u32 index)
-{
-	struct kvm_cpuid_entry2 *maxlevel;
-
-	maxlevel = kvm_find_cpuid_entry(vcpu, function & 0x80000000, 0);
-	if (!maxlevel || maxlevel->eax >= function)
-		return NULL;
-	if (function & 0x80000000) {
-		maxlevel = kvm_find_cpuid_entry(vcpu, 0, 0);
-		if (!maxlevel)
-			return NULL;
-	}
-	return kvm_find_cpuid_entry(vcpu, maxlevel->eax, index);
-}
-
-void kvm_emulate_cpuid(struct kvm_vcpu *vcpu)
-{
-	u32 function, index;
-	struct kvm_cpuid_entry2 *best;
-
-	function = kvm_register_read(vcpu, VCPU_REGS_RAX);
-	index = kvm_register_read(vcpu, VCPU_REGS_RCX);
-	kvm_register_write(vcpu, VCPU_REGS_RAX, 0);
-	kvm_register_write(vcpu, VCPU_REGS_RBX, 0);
-	kvm_register_write(vcpu, VCPU_REGS_RCX, 0);
-	kvm_register_write(vcpu, VCPU_REGS_RDX, 0);
-	best = kvm_find_cpuid_entry(vcpu, function, index);
-
-	if (!best)
-		best = check_cpuid_limit(vcpu, function, index);
-
-	if (best) {
-		kvm_register_write(vcpu, VCPU_REGS_RAX, best->eax);
-		kvm_register_write(vcpu, VCPU_REGS_RBX, best->ebx);
-		kvm_register_write(vcpu, VCPU_REGS_RCX, best->ecx);
-		kvm_register_write(vcpu, VCPU_REGS_RDX, best->edx);
-	}
-	kvm_x86_ops->skip_emulated_instruction(vcpu);
-	trace_kvm_cpuid(function,
-			kvm_register_read(vcpu, VCPU_REGS_RAX),
-			kvm_register_read(vcpu, VCPU_REGS_RBX),
-			kvm_register_read(vcpu, VCPU_REGS_RCX),
-			kvm_register_read(vcpu, VCPU_REGS_RDX));
-}
-EXPORT_SYMBOL_GPL(kvm_emulate_cpuid);
-
 /*
  * Check if userspace requested an interrupt window, and that the
  * interrupt window is open.
@@ -6174,7 +5546,7 @@ int kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu,
 	mmu_reset_needed |= kvm_read_cr4(vcpu) != sregs->cr4;
 	kvm_x86_ops->set_cr4(vcpu, sregs->cr4);
 	if (sregs->cr4 & X86_CR4_OSXSAVE)
-		update_cpuid(vcpu);
+		kvm_update_cpuid(vcpu);
 
 	idx = srcu_read_lock(&vcpu->kvm->srcu);
 	if (!is_long_mode(vcpu) && is_pae(vcpu)) {
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index d36fe23..cb80c29 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -33,9 +33,6 @@ static inline bool kvm_exception_is_soft(unsigned int nr)
 	return (nr == BP_VECTOR) || (nr == OF_VECTOR);
 }
 
-struct kvm_cpuid_entry2 *kvm_find_cpuid_entry(struct kvm_vcpu *vcpu,
-                                             u32 function, u32 index);
-
 static inline bool is_protmode(struct kvm_vcpu *vcpu)
 {
 	return kvm_read_cr0_bits(vcpu, X86_CR0_PE);
@@ -125,4 +122,6 @@ int kvm_write_guest_virt_system(struct x86_emulate_ctxt *ctxt,
 	gva_t addr, void *val, unsigned int bytes,
 	struct x86_exception *exception);
 
+extern u64 host_xcr0;
+
 #endif
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 2/7] KVM: expose latest Intel cpu new features (BMI1/BMI2/FMA/AVX2) to guest
  2012-08-02 15:19             ` Stefan Bader
  2012-08-02 15:19               ` [PATCH 1/7] KVM: Move cpuid code to new file Stefan Bader
@ 2012-08-02 15:19               ` Stefan Bader
  2012-08-02 15:19               ` [PATCH 3/7] KVM: Expose kvm_lapic_local_deliver() Stefan Bader
                                 ` (5 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Stefan Bader @ 2012-08-02 15:19 UTC (permalink / raw)
  To: Avi Kivity; +Cc: nyh, Stefan Bader, Gleb Natapov, Andy Whitcroft, kvm

From: "Liu, Jinsong" <jinsong.liu@intel.com>

commit fb215366b3c7320ac25dca766a0152df16534932 upstream

Intel's latest CPUs add 6 new features; refer to http://software.intel.com/file/36945
The new feature CPUID bits are listed below:

1. FMA		CPUID.EAX=01H:ECX.FMA[bit 12]
2. MOVBE	CPUID.EAX=01H:ECX.MOVBE[bit 22]
3. BMI1		CPUID.EAX=07H,ECX=0H:EBX.BMI1[bit 3]
4. AVX2		CPUID.EAX=07H,ECX=0H:EBX.AVX2[bit 5]
5. BMI2		CPUID.EAX=07H,ECX=0H:EBX.BMI2[bit 8]
6. LZCNT	CPUID.EAX=80000001H:ECX.LZCNT[bit 5]

This patch exposes these features to the guest.
Among them, FMA/MOVBE/LZCNT have already been defined, and MOVBE/LZCNT
have already been exposed.

This patch defines BMI1/AVX2/BMI2 and exposes FMA/BMI1/AVX2/BMI2 to the
guest.

Signed-off-by: Liu, Jinsong <jinsong.liu@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

BugLink: http://bugs.launchpad.net/bugs/960466
(cherry picked from commit fb215366b3c7320ac25dca766a0152df16534932)
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
---
 arch/x86/include/asm/cpufeature.h |    3 +++
 arch/x86/kvm/cpuid.c              |    4 ++--
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index 0c3b775..6306be4 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -197,7 +197,10 @@
 
 /* Intel-defined CPU features, CPUID level 0x00000007:0 (ebx), word 9 */
 #define X86_FEATURE_FSGSBASE	(9*32+ 0) /* {RD/WR}{FS/GS}BASE instructions*/
+#define X86_FEATURE_BMI1	(9*32+ 3) /* 1st group bit manipulation extensions */
+#define X86_FEATURE_AVX2	(9*32+ 5) /* AVX2 instructions */
 #define X86_FEATURE_SMEP	(9*32+ 7) /* Supervisor Mode Execution Protection */
+#define X86_FEATURE_BMI2	(9*32+ 8) /* 2nd group bit manipulation extensions */
 #define X86_FEATURE_ERMS	(9*32+ 9) /* Enhanced REP MOVSB/STOSB */
 
 #if defined(__KERNEL__) && !defined(__ASSEMBLY__)
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 0a332ec..47be763 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -222,7 +222,7 @@ static void do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
 		F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ |
 		0 /* DS-CPL, VMX, SMX, EST */ |
 		0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ |
-		0 /* Reserved */ | F(CX16) | 0 /* xTPR Update, PDCM */ |
+		F(FMA) | F(CX16) | 0 /* xTPR Update, PDCM */ |
 		0 /* Reserved, DCA */ | F(XMM4_1) |
 		F(XMM4_2) | F(X2APIC) | F(MOVBE) | F(POPCNT) |
 		0 /* Reserved*/ | F(AES) | F(XSAVE) | 0 /* OSXSAVE */ | F(AVX) |
@@ -242,7 +242,7 @@ static void do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
 
 	/* cpuid 7.0.ebx */
 	const u32 kvm_supported_word9_x86_features =
-		F(SMEP) | F(FSGSBASE) | F(ERMS);
+		F(FSGSBASE) | F(BMI1) | F(AVX2) | F(SMEP) | F(BMI2) | F(ERMS);
 
 	/* all calls to cpuid_count() should be made on the same cpu */
 	get_cpu();
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 3/7] KVM: Expose kvm_lapic_local_deliver()
  2012-08-02 15:19             ` Stefan Bader
  2012-08-02 15:19               ` [PATCH 1/7] KVM: Move cpuid code to new file Stefan Bader
  2012-08-02 15:19               ` [PATCH 2/7] KVM: expose latest Intel cpu new features (BMI1/BMI2/FMA/AVX2) to guest Stefan Bader
@ 2012-08-02 15:19               ` Stefan Bader
  2012-08-02 15:19               ` [PATCH 4/7] KVM: Expose a version 2 architectural PMU to a guest Stefan Bader
                                 ` (4 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Stefan Bader @ 2012-08-02 15:19 UTC (permalink / raw)
  To: Avi Kivity; +Cc: nyh, Stefan Bader, Gleb Natapov, Andy Whitcroft, kvm

From: Avi Kivity <avi@redhat.com>

commit 893420822192f717af6fde927c9e78c9b82f8327 upstream

Needed to deliver performance monitoring interrupts.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

BugLink: http://bugs.launchpad.net/bugs/1031090
(cherry-picked from commit 893420822192f717af6fde927c9e78c9b82f8327 upstream)
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
---
 arch/x86/kvm/lapic.c |    2 +-
 arch/x86/kvm/lapic.h |    1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index a7f3e65..cfdc6e0 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1121,7 +1121,7 @@ int apic_has_pending_timer(struct kvm_vcpu *vcpu)
 	return 0;
 }
 
-static int kvm_apic_local_deliver(struct kvm_lapic *apic, int lvt_type)
+int kvm_apic_local_deliver(struct kvm_lapic *apic, int lvt_type)
 {
 	u32 reg = apic_get_reg(apic, lvt_type);
 	int vector, mode, trig_mode;
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 138e8cc..6f4ce25 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -34,6 +34,7 @@ void kvm_apic_set_version(struct kvm_vcpu *vcpu);
 int kvm_apic_match_physical_addr(struct kvm_lapic *apic, u16 dest);
 int kvm_apic_match_logical_addr(struct kvm_lapic *apic, u8 mda);
 int kvm_apic_set_irq(struct kvm_vcpu *vcpu, struct kvm_lapic_irq *irq);
+int kvm_apic_local_deliver(struct kvm_lapic *apic, int lvt_type);
 
 u64 kvm_get_apic_base(struct kvm_vcpu *vcpu);
 void kvm_set_apic_base(struct kvm_vcpu *vcpu, u64 data);
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 4/7] KVM: Expose a version 2 architectural PMU to a guest
  2012-08-02 15:19             ` Stefan Bader
                                 ` (2 preceding siblings ...)
  2012-08-02 15:19               ` [PATCH 3/7] KVM: Expose kvm_lapic_local_deliver() Stefan Bader
@ 2012-08-02 15:19               ` Stefan Bader
  2012-08-02 15:19               ` [PATCH 5/7] KVM: Add generic RDPMC support Stefan Bader
                                 ` (3 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Stefan Bader @ 2012-08-02 15:19 UTC (permalink / raw)
  To: Avi Kivity; +Cc: nyh, Stefan Bader, Gleb Natapov, Andy Whitcroft, kvm

From: Gleb Natapov <gleb@redhat.com>

commit f5132b01386b5a67f1ff673bb2b96a507a3f7e41 upstream

Use perf_events to emulate an architectural PMU, version 2.

Based on PMU version 1 emulation by Avi Kivity.

[avi: adjust for cpuid.c]
[jan: fix anonymous field initialization for older gcc]

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

BugLink: http://bugs.launchpad.net/bugs/1031090
(backported from commit f5132b01386b5a67f1ff673bb2b96a507a3f7e41 upstream)
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
---
 arch/x86/include/asm/kvm_host.h |   48 ++++
 arch/x86/kvm/Kconfig            |    1 +
 arch/x86/kvm/Makefile           |    2 +-
 arch/x86/kvm/cpuid.c            |    2 +
 arch/x86/kvm/pmu.c              |  533 +++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/x86.c              |   22 +-
 include/linux/kvm_host.h        |    2 +
 7 files changed, 600 insertions(+), 10 deletions(-)
 create mode 100644 arch/x86/kvm/pmu.c

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index b4973f4..80a5f1e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -16,10 +16,12 @@
 #include <linux/mmu_notifier.h>
 #include <linux/tracepoint.h>
 #include <linux/cpumask.h>
+#include <linux/irq_work.h>
 
 #include <linux/kvm.h>
 #include <linux/kvm_para.h>
 #include <linux/kvm_types.h>
+#include <linux/perf_event.h>
 
 #include <asm/pvclock-abi.h>
 #include <asm/desc.h>
@@ -294,6 +296,37 @@ struct kvm_mmu {
 	u64 pdptrs[4]; /* pae */
 };
 
+enum pmc_type {
+	KVM_PMC_GP = 0,
+	KVM_PMC_FIXED,
+};
+
+struct kvm_pmc {
+	enum pmc_type type;
+	u8 idx;
+	u64 counter;
+	u64 eventsel;
+	struct perf_event *perf_event;
+	struct kvm_vcpu *vcpu;
+};
+
+struct kvm_pmu {
+	unsigned nr_arch_gp_counters;
+	unsigned nr_arch_fixed_counters;
+	unsigned available_event_types;
+	u64 fixed_ctr_ctrl;
+	u64 global_ctrl;
+	u64 global_status;
+	u64 global_ovf_ctrl;
+	u64 counter_bitmask[2];
+	u64 global_ctrl_mask;
+	u8 version;
+	struct kvm_pmc gp_counters[X86_PMC_MAX_GENERIC];
+	struct kvm_pmc fixed_counters[X86_PMC_MAX_FIXED];
+	struct irq_work irq_work;
+	u64 reprogram_pmi;
+};
+
 struct kvm_vcpu_arch {
 	/*
 	 * rip and regs accesses must go through
@@ -436,6 +469,8 @@ struct kvm_vcpu_arch {
 	unsigned access;
 	gfn_t mmio_gfn;
 
+	struct kvm_pmu pmu;
+
 	/* used for guest single stepping over the given code position */
 	unsigned long singlestep_rip;
 
@@ -894,4 +929,17 @@ extern bool kvm_find_async_pf_gfn(struct kvm_vcpu *vcpu, gfn_t gfn);
 
 void kvm_complete_insn_gp(struct kvm_vcpu *vcpu, int err);
 
+int kvm_is_in_guest(void);
+
+void kvm_pmu_init(struct kvm_vcpu *vcpu);
+void kvm_pmu_destroy(struct kvm_vcpu *vcpu);
+void kvm_pmu_reset(struct kvm_vcpu *vcpu);
+void kvm_pmu_cpuid_update(struct kvm_vcpu *vcpu);
+bool kvm_pmu_msr(struct kvm_vcpu *vcpu, u32 msr);
+int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *data);
+int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data);
+int kvm_pmu_read_pmc(struct kvm_vcpu *vcpu, unsigned pmc, u64 *data);
+void kvm_handle_pmu_event(struct kvm_vcpu *vcpu);
+void kvm_deliver_pmi(struct kvm_vcpu *vcpu);
+
 #endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index ff5790d..c27dd11 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -35,6 +35,7 @@ config KVM
 	select KVM_MMIO
 	select TASKSTATS
 	select TASK_DELAY_ACCT
+	select PERF_EVENTS
 	---help---
 	  Support hosting fully virtualized guest machines using hardware
 	  virtualization extensions.  You will need a fairly recent
diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index 161b76a..4f579e8 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -12,7 +12,7 @@ kvm-$(CONFIG_IOMMU_API)	+= $(addprefix ../../../virt/kvm/, iommu.o)
 kvm-$(CONFIG_KVM_ASYNC_PF)	+= $(addprefix ../../../virt/kvm/, async_pf.o)
 
 kvm-y			+= x86.o mmu.o emulate.o i8259.o irq.o lapic.o \
-			   i8254.o timer.o cpuid.o
+			   i8254.o timer.o cpuid.o pmu.o
 kvm-intel-y		+= vmx.o
 kvm-amd-y		+= svm.o
 
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 47be763..3c471a0 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -43,6 +43,8 @@ void kvm_update_cpuid(struct kvm_vcpu *vcpu)
 		else
 			apic->lapic_timer.timer_mode_mask = 1 << 17;
 	}
+
+	kvm_pmu_cpuid_update(vcpu);
 }
 
 static int is_efer_nx(void)
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
new file mode 100644
index 0000000..7aad544
--- /dev/null
+++ b/arch/x86/kvm/pmu.c
@@ -0,0 +1,533 @@
+/*
+ * Kernel-based Virtual Machine -- Performance Monitoring Unit support
+ *
+ * Copyright 2011 Red Hat, Inc. and/or its affiliates.
+ *
+ * Authors:
+ *   Avi Kivity   <avi@redhat.com>
+ *   Gleb Natapov <gleb@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ */
+
+#include <linux/types.h>
+#include <linux/kvm_host.h>
+#include <linux/perf_event.h>
+#include "x86.h"
+#include "cpuid.h"
+#include "lapic.h"
+
+static struct kvm_arch_event_perf_mapping {
+	u8 eventsel;
+	u8 unit_mask;
+	unsigned event_type;
+	bool inexact;
+} arch_events[] = {
+	/* Index must match CPUID 0x0A.EBX bit vector */
+	[0] = { 0x3c, 0x00, PERF_COUNT_HW_CPU_CYCLES },
+	[1] = { 0xc0, 0x00, PERF_COUNT_HW_INSTRUCTIONS },
+	[2] = { 0x3c, 0x01, PERF_COUNT_HW_BUS_CYCLES  },
+	[3] = { 0x2e, 0x4f, PERF_COUNT_HW_CACHE_REFERENCES },
+	[4] = { 0x2e, 0x41, PERF_COUNT_HW_CACHE_MISSES },
+	[5] = { 0xc4, 0x00, PERF_COUNT_HW_BRANCH_INSTRUCTIONS },
+	[6] = { 0xc5, 0x00, PERF_COUNT_HW_BRANCH_MISSES },
+};
+
+/* mapping between fixed pmc index and arch_events array */
+int fixed_pmc_events[] = {1, 0, 2};
+
+static bool pmc_is_gp(struct kvm_pmc *pmc)
+{
+	return pmc->type == KVM_PMC_GP;
+}
+
+static inline u64 pmc_bitmask(struct kvm_pmc *pmc)
+{
+	struct kvm_pmu *pmu = &pmc->vcpu->arch.pmu;
+
+	return pmu->counter_bitmask[pmc->type];
+}
+
+static inline bool pmc_enabled(struct kvm_pmc *pmc)
+{
+	struct kvm_pmu *pmu = &pmc->vcpu->arch.pmu;
+	return test_bit(pmc->idx, (unsigned long *)&pmu->global_ctrl);
+}
+
+static inline struct kvm_pmc *get_gp_pmc(struct kvm_pmu *pmu, u32 msr,
+					 u32 base)
+{
+	if (msr >= base && msr < base + pmu->nr_arch_gp_counters)
+		return &pmu->gp_counters[msr - base];
+	return NULL;
+}
+
+static inline struct kvm_pmc *get_fixed_pmc(struct kvm_pmu *pmu, u32 msr)
+{
+	int base = MSR_CORE_PERF_FIXED_CTR0;
+	if (msr >= base && msr < base + pmu->nr_arch_fixed_counters)
+		return &pmu->fixed_counters[msr - base];
+	return NULL;
+}
+
+static inline struct kvm_pmc *get_fixed_pmc_idx(struct kvm_pmu *pmu, int idx)
+{
+	return get_fixed_pmc(pmu, MSR_CORE_PERF_FIXED_CTR0 + idx);
+}
+
+static struct kvm_pmc *global_idx_to_pmc(struct kvm_pmu *pmu, int idx)
+{
+	if (idx < X86_PMC_IDX_FIXED)
+		return get_gp_pmc(pmu, MSR_P6_EVNTSEL0 + idx, MSR_P6_EVNTSEL0);
+	else
+		return get_fixed_pmc_idx(pmu, idx - X86_PMC_IDX_FIXED);
+}
+
+void kvm_deliver_pmi(struct kvm_vcpu *vcpu)
+{
+	if (vcpu->arch.apic)
+		kvm_apic_local_deliver(vcpu->arch.apic, APIC_LVTPC);
+}
+
+static void trigger_pmi(struct irq_work *irq_work)
+{
+	struct kvm_pmu *pmu = container_of(irq_work, struct kvm_pmu,
+			irq_work);
+	struct kvm_vcpu *vcpu = container_of(pmu, struct kvm_vcpu,
+			arch.pmu);
+
+	kvm_deliver_pmi(vcpu);
+}
+
+static void kvm_perf_overflow(struct perf_event *perf_event,
+			      struct perf_sample_data *data,
+			      struct pt_regs *regs)
+{
+	struct kvm_pmc *pmc = perf_event->overflow_handler_context;
+	struct kvm_pmu *pmu = &pmc->vcpu->arch.pmu;
+	__set_bit(pmc->idx, (unsigned long *)&pmu->global_status);
+}
+
+static void kvm_perf_overflow_intr(struct perf_event *perf_event,
+		struct perf_sample_data *data, struct pt_regs *regs)
+{
+	struct kvm_pmc *pmc = perf_event->overflow_handler_context;
+	struct kvm_pmu *pmu = &pmc->vcpu->arch.pmu;
+	if (!test_and_set_bit(pmc->idx, (unsigned long *)&pmu->reprogram_pmi)) {
+		kvm_perf_overflow(perf_event, data, regs);
+		kvm_make_request(KVM_REQ_PMU, pmc->vcpu);
+		/*
+		 * Inject PMI. If vcpu was in a guest mode during NMI PMI
+		 * can be ejected on a guest mode re-entry. Otherwise we can't
+		 * be sure that vcpu wasn't executing hlt instruction at the
+		 * time of vmexit and is not going to re-enter guest mode until,
+		 * woken up. So we should wake it, but this is impossible from
+		 * NMI context. Do it from irq work instead.
+		 */
+		if (!kvm_is_in_guest())
+			irq_work_queue(&pmc->vcpu->arch.pmu.irq_work);
+		else
+			kvm_make_request(KVM_REQ_PMI, pmc->vcpu);
+	}
+}
+
+static u64 read_pmc(struct kvm_pmc *pmc)
+{
+	u64 counter, enabled, running;
+
+	counter = pmc->counter;
+
+	if (pmc->perf_event)
+		counter += perf_event_read_value(pmc->perf_event,
+						 &enabled, &running);
+
+	/* FIXME: Scaling needed? */
+
+	return counter & pmc_bitmask(pmc);
+}
+
+static void stop_counter(struct kvm_pmc *pmc)
+{
+	if (pmc->perf_event) {
+		pmc->counter = read_pmc(pmc);
+		perf_event_release_kernel(pmc->perf_event);
+		pmc->perf_event = NULL;
+	}
+}
+
+static void reprogram_counter(struct kvm_pmc *pmc, u32 type,
+		unsigned config, bool exclude_user, bool exclude_kernel,
+		bool intr)
+{
+	struct perf_event *event;
+	struct perf_event_attr attr = {
+		.type = type,
+		.size = sizeof(attr),
+		.pinned = true,
+		.exclude_idle = true,
+		.exclude_host = 1,
+		.exclude_user = exclude_user,
+		.exclude_kernel = exclude_kernel,
+		.config = config,
+	};
+
+	attr.sample_period = (-pmc->counter) & pmc_bitmask(pmc);
+
+	event = perf_event_create_kernel_counter(&attr, -1, current,
+						 intr ? kvm_perf_overflow_intr :
+						 kvm_perf_overflow, pmc);
+	if (IS_ERR(event)) {
+		printk_once("kvm: pmu event creation failed %ld\n",
+				PTR_ERR(event));
+		return;
+	}
+
+	pmc->perf_event = event;
+	clear_bit(pmc->idx, (unsigned long*)&pmc->vcpu->arch.pmu.reprogram_pmi);
+}
+
+static unsigned find_arch_event(struct kvm_pmu *pmu, u8 event_select,
+		u8 unit_mask)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(arch_events); i++)
+		if (arch_events[i].eventsel == event_select
+				&& arch_events[i].unit_mask == unit_mask
+				&& (pmu->available_event_types & (1 << i)))
+			break;
+
+	if (i == ARRAY_SIZE(arch_events))
+		return PERF_COUNT_HW_MAX;
+
+	return arch_events[i].event_type;
+}
+
+static void reprogram_gp_counter(struct kvm_pmc *pmc, u64 eventsel)
+{
+	unsigned config, type = PERF_TYPE_RAW;
+	u8 event_select, unit_mask;
+
+	pmc->eventsel = eventsel;
+
+	stop_counter(pmc);
+
+	if (!(eventsel & ARCH_PERFMON_EVENTSEL_ENABLE) || !pmc_enabled(pmc))
+		return;
+
+	event_select = eventsel & ARCH_PERFMON_EVENTSEL_EVENT;
+	unit_mask = (eventsel & ARCH_PERFMON_EVENTSEL_UMASK) >> 8;
+
+	if (!(event_select & (ARCH_PERFMON_EVENTSEL_EDGE |
+				ARCH_PERFMON_EVENTSEL_INV |
+				ARCH_PERFMON_EVENTSEL_CMASK))) {
+		config = find_arch_event(&pmc->vcpu->arch.pmu, event_select,
+				unit_mask);
+		if (config != PERF_COUNT_HW_MAX)
+			type = PERF_TYPE_HARDWARE;
+	}
+
+	if (type == PERF_TYPE_RAW)
+		config = eventsel & X86_RAW_EVENT_MASK;
+
+	reprogram_counter(pmc, type, config,
+			!(eventsel & ARCH_PERFMON_EVENTSEL_USR),
+			!(eventsel & ARCH_PERFMON_EVENTSEL_OS),
+			eventsel & ARCH_PERFMON_EVENTSEL_INT);
+}
+
+static void reprogram_fixed_counter(struct kvm_pmc *pmc, u8 en_pmi, int idx)
+{
+	unsigned en = en_pmi & 0x3;
+	bool pmi = en_pmi & 0x8;
+
+	stop_counter(pmc);
+
+	if (!en || !pmc_enabled(pmc))
+		return;
+
+	reprogram_counter(pmc, PERF_TYPE_HARDWARE,
+			arch_events[fixed_pmc_events[idx]].event_type,
+			!(en & 0x2), /* exclude user */
+			!(en & 0x1), /* exclude kernel */
+			pmi);
+}
+
+static inline u8 fixed_en_pmi(u64 ctrl, int idx)
+{
+	return (ctrl >> (idx * 4)) & 0xf;
+}
+
+static void reprogram_fixed_counters(struct kvm_pmu *pmu, u64 data)
+{
+	int i;
+
+	for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
+		u8 en_pmi = fixed_en_pmi(data, i);
+		struct kvm_pmc *pmc = get_fixed_pmc_idx(pmu, i);
+
+		if (fixed_en_pmi(pmu->fixed_ctr_ctrl, i) == en_pmi)
+			continue;
+
+		reprogram_fixed_counter(pmc, en_pmi, i);
+	}
+
+	pmu->fixed_ctr_ctrl = data;
+}
+
+static void reprogram_idx(struct kvm_pmu *pmu, int idx)
+{
+	struct kvm_pmc *pmc = global_idx_to_pmc(pmu, idx);
+
+	if (!pmc)
+		return;
+
+	if (pmc_is_gp(pmc))
+		reprogram_gp_counter(pmc, pmc->eventsel);
+	else {
+		int fidx = idx - X86_PMC_IDX_FIXED;
+		reprogram_fixed_counter(pmc,
+				fixed_en_pmi(pmu->fixed_ctr_ctrl, fidx), fidx);
+	}
+}
+
+static void global_ctrl_changed(struct kvm_pmu *pmu, u64 data)
+{
+	int bit;
+	u64 diff = pmu->global_ctrl ^ data;
+
+	pmu->global_ctrl = data;
+
+	for_each_set_bit(bit, (unsigned long *)&diff, X86_PMC_IDX_MAX)
+		reprogram_idx(pmu, bit);
+}
+
+bool kvm_pmu_msr(struct kvm_vcpu *vcpu, u32 msr)
+{
+	struct kvm_pmu *pmu = &vcpu->arch.pmu;
+	int ret;
+
+	switch (msr) {
+	case MSR_CORE_PERF_FIXED_CTR_CTRL:
+	case MSR_CORE_PERF_GLOBAL_STATUS:
+	case MSR_CORE_PERF_GLOBAL_CTRL:
+	case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
+		ret = pmu->version > 1;
+		break;
+	default:
+		ret = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)
+			|| get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0)
+			|| get_fixed_pmc(pmu, msr);
+		break;
+	}
+	return ret;
+}
+
+int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data)
+{
+	struct kvm_pmu *pmu = &vcpu->arch.pmu;
+	struct kvm_pmc *pmc;
+
+	switch (index) {
+	case MSR_CORE_PERF_FIXED_CTR_CTRL:
+		*data = pmu->fixed_ctr_ctrl;
+		return 0;
+	case MSR_CORE_PERF_GLOBAL_STATUS:
+		*data = pmu->global_status;
+		return 0;
+	case MSR_CORE_PERF_GLOBAL_CTRL:
+		*data = pmu->global_ctrl;
+		return 0;
+	case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
+		*data = pmu->global_ovf_ctrl;
+		return 0;
+	default:
+		if ((pmc = get_gp_pmc(pmu, index, MSR_IA32_PERFCTR0)) ||
+				(pmc = get_fixed_pmc(pmu, index))) {
+			*data = read_pmc(pmc);
+			return 0;
+		} else if ((pmc = get_gp_pmc(pmu, index, MSR_P6_EVNTSEL0))) {
+			*data = pmc->eventsel;
+			return 0;
+		}
+	}
+	return 1;
+}
+
+int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, u32 index, u64 data)
+{
+	struct kvm_pmu *pmu = &vcpu->arch.pmu;
+	struct kvm_pmc *pmc;
+
+	switch (index) {
+	case MSR_CORE_PERF_FIXED_CTR_CTRL:
+		if (pmu->fixed_ctr_ctrl == data)
+			return 0;
+		if (!(data & 0xfffffffffffff444)) {
+			reprogram_fixed_counters(pmu, data);
+			return 0;
+		}
+		break;
+	case MSR_CORE_PERF_GLOBAL_STATUS:
+		break; /* RO MSR */
+	case MSR_CORE_PERF_GLOBAL_CTRL:
+		if (pmu->global_ctrl == data)
+			return 0;
+		if (!(data & pmu->global_ctrl_mask)) {
+			global_ctrl_changed(pmu, data);
+			return 0;
+		}
+		break;
+	case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
+		if (!(data & (pmu->global_ctrl_mask & ~(3ull<<62)))) {
+			pmu->global_status &= ~data;
+			pmu->global_ovf_ctrl = data;
+			return 0;
+		}
+		break;
+	default:
+		if ((pmc = get_gp_pmc(pmu, index, MSR_IA32_PERFCTR0)) ||
+				(pmc = get_fixed_pmc(pmu, index))) {
+			data = (s64)(s32)data;
+			pmc->counter += data - read_pmc(pmc);
+			return 0;
+		} else if ((pmc = get_gp_pmc(pmu, index, MSR_P6_EVNTSEL0))) {
+			if (data == pmc->eventsel)
+				return 0;
+			if (!(data & 0xffffffff00200000ull)) {
+				reprogram_gp_counter(pmc, data);
+				return 0;
+			}
+		}
+	}
+	return 1;
+}
+
+int kvm_pmu_read_pmc(struct kvm_vcpu *vcpu, unsigned pmc, u64 *data)
+{
+	struct kvm_pmu *pmu = &vcpu->arch.pmu;
+	bool fast_mode = pmc & (1u << 31);
+	bool fixed = pmc & (1u << 30);
+	struct kvm_pmc *counters;
+	u64 ctr;
+
+	pmc &= (3u << 30) - 1;
+	if (!fixed && pmc >= pmu->nr_arch_gp_counters)
+		return 1;
+	if (fixed && pmc >= pmu->nr_arch_fixed_counters)
+		return 1;
+	counters = fixed ? pmu->fixed_counters : pmu->gp_counters;
+	ctr = read_pmc(&counters[pmc]);
+	if (fast_mode)
+		ctr = (u32)ctr;
+	*data = ctr;
+
+	return 0;
+}
+
+void kvm_pmu_cpuid_update(struct kvm_vcpu *vcpu)
+{
+	struct kvm_pmu *pmu = &vcpu->arch.pmu;
+	struct kvm_cpuid_entry2 *entry;
+	unsigned bitmap_len;
+
+	pmu->nr_arch_gp_counters = 0;
+	pmu->nr_arch_fixed_counters = 0;
+	pmu->counter_bitmask[KVM_PMC_GP] = 0;
+	pmu->counter_bitmask[KVM_PMC_FIXED] = 0;
+	pmu->version = 0;
+
+	entry = kvm_find_cpuid_entry(vcpu, 0xa, 0);
+	if (!entry)
+		return;
+
+	pmu->version = entry->eax & 0xff;
+	if (!pmu->version)
+		return;
+
+	pmu->nr_arch_gp_counters = min((int)(entry->eax >> 8) & 0xff,
+			X86_PMC_MAX_GENERIC);
+	pmu->counter_bitmask[KVM_PMC_GP] =
+		((u64)1 << ((entry->eax >> 16) & 0xff)) - 1;
+	bitmap_len = (entry->eax >> 24) & 0xff;
+	pmu->available_event_types = ~entry->ebx & ((1ull << bitmap_len) - 1);
+
+	if (pmu->version == 1) {
+		pmu->global_ctrl = (1 << pmu->nr_arch_gp_counters) - 1;
+		return;
+	}
+
+	pmu->nr_arch_fixed_counters = min((int)(entry->edx & 0x1f),
+			X86_PMC_MAX_FIXED);
+	pmu->counter_bitmask[KVM_PMC_FIXED] =
+		((u64)1 << ((entry->edx >> 5) & 0xff)) - 1;
+	pmu->global_ctrl_mask = ~(((1 << pmu->nr_arch_gp_counters) - 1)
+			| (((1ull << pmu->nr_arch_fixed_counters) - 1)
+				<< X86_PMC_IDX_FIXED));
+}
+
+void kvm_pmu_init(struct kvm_vcpu *vcpu)
+{
+	int i;
+	struct kvm_pmu *pmu = &vcpu->arch.pmu;
+
+	memset(pmu, 0, sizeof(*pmu));
+	for (i = 0; i < X86_PMC_MAX_GENERIC; i++) {
+		pmu->gp_counters[i].type = KVM_PMC_GP;
+		pmu->gp_counters[i].vcpu = vcpu;
+		pmu->gp_counters[i].idx = i;
+	}
+	for (i = 0; i < X86_PMC_MAX_FIXED; i++) {
+		pmu->fixed_counters[i].type = KVM_PMC_FIXED;
+		pmu->fixed_counters[i].vcpu = vcpu;
+		pmu->fixed_counters[i].idx = i + X86_PMC_IDX_FIXED;
+	}
+	init_irq_work(&pmu->irq_work, trigger_pmi);
+	kvm_pmu_cpuid_update(vcpu);
+}
+
+void kvm_pmu_reset(struct kvm_vcpu *vcpu)
+{
+	struct kvm_pmu *pmu = &vcpu->arch.pmu;
+	int i;
+
+	irq_work_sync(&pmu->irq_work);
+	for (i = 0; i < X86_PMC_MAX_GENERIC; i++) {
+		struct kvm_pmc *pmc = &pmu->gp_counters[i];
+		stop_counter(pmc);
+		pmc->counter = pmc->eventsel = 0;
+	}
+
+	for (i = 0; i < X86_PMC_MAX_FIXED; i++)
+		stop_counter(&pmu->fixed_counters[i]);
+
+	pmu->fixed_ctr_ctrl = pmu->global_ctrl = pmu->global_status =
+		pmu->global_ovf_ctrl = 0;
+}
+
+void kvm_pmu_destroy(struct kvm_vcpu *vcpu)
+{
+	kvm_pmu_reset(vcpu);
+}
+
+void kvm_handle_pmu_event(struct kvm_vcpu *vcpu)
+{
+	struct kvm_pmu *pmu = &vcpu->arch.pmu;
+	u64 bitmask;
+	int bit;
+
+	bitmask = pmu->reprogram_pmi;
+
+	for_each_set_bit(bit, (unsigned long *)&bitmask, X86_PMC_IDX_MAX) {
+		struct kvm_pmc *pmc = global_idx_to_pmc(pmu, bit);
+
+		if (unlikely(!pmc || !pmc->perf_event)) {
+			clear_bit(bit, (unsigned long *)&pmu->reprogram_pmi);
+			continue;
+		}
+
+		reprogram_idx(pmu, bit);
+	}
+}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 759af84..29ece09 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1603,8 +1603,6 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 	 * which we perfectly emulate ;-). Any other value should be at least
 	 * reported, some guests depend on them.
 	 */
-	case MSR_P6_EVNTSEL0:
-	case MSR_P6_EVNTSEL1:
 	case MSR_K7_EVNTSEL0:
 	case MSR_K7_EVNTSEL1:
 	case MSR_K7_EVNTSEL2:
@@ -1616,8 +1614,6 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 	/* at least RHEL 4 unconditionally writes to the perfctr registers,
 	 * so we ignore writes to make it happy.
 	 */
-	case MSR_P6_PERFCTR0:
-	case MSR_P6_PERFCTR1:
 	case MSR_K7_PERFCTR0:
 	case MSR_K7_PERFCTR1:
 	case MSR_K7_PERFCTR2:
@@ -1654,6 +1650,8 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 	default:
 		if (msr && (msr == vcpu->kvm->arch.xen_hvm_config.msr))
 			return xen_hvm_config(vcpu, data);
+		if (kvm_pmu_msr(vcpu, msr))
+			return kvm_pmu_set_msr(vcpu, msr, data);
 		if (!ignore_msrs) {
 			pr_unimpl(vcpu, "unhandled wrmsr: 0x%x data %llx\n",
 				msr, data);
@@ -1816,10 +1814,6 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
 	case MSR_K8_SYSCFG:
 	case MSR_K7_HWCR:
 	case MSR_VM_HSAVE_PA:
-	case MSR_P6_PERFCTR0:
-	case MSR_P6_PERFCTR1:
-	case MSR_P6_EVNTSEL0:
-	case MSR_P6_EVNTSEL1:
 	case MSR_K7_EVNTSEL0:
 	case MSR_K7_PERFCTR0:
 	case MSR_K8_INT_PENDING_MSG:
@@ -1930,6 +1924,8 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
 		data = 0xbe702111;
 		break;
 	default:
+		if (kvm_pmu_msr(vcpu, msr))
+			return kvm_pmu_get_msr(vcpu, msr, pdata);
 		if (!ignore_msrs) {
 			pr_unimpl(vcpu, "unhandled rdmsr: 0x%x\n", msr);
 			return 1;
@@ -4612,7 +4608,7 @@ static void kvm_timer_init(void)
 
 static DEFINE_PER_CPU(struct kvm_vcpu *, current_vcpu);
 
-static int kvm_is_in_guest(void)
+int kvm_is_in_guest(void)
 {
 	return percpu_read(current_vcpu) != NULL;
 }
@@ -5085,6 +5081,10 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 			record_steal_time(vcpu);
 		if (kvm_check_request(KVM_REQ_NMI, vcpu))
 			process_nmi(vcpu);
+		if (kvm_check_request(KVM_REQ_PMU, vcpu))
+			kvm_handle_pmu_event(vcpu);
+		if (kvm_check_request(KVM_REQ_PMI, vcpu))
+			kvm_deliver_pmi(vcpu);
 
 	}
 
@@ -5823,6 +5823,8 @@ int kvm_arch_vcpu_reset(struct kvm_vcpu *vcpu)
 	kvm_async_pf_hash_reset(vcpu);
 	vcpu->arch.apf.halted = false;
 
+	kvm_pmu_reset(vcpu);
+
 	return kvm_x86_ops->vcpu_reset(vcpu);
 }
 
@@ -5916,6 +5918,7 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
 		goto fail_free_mce_banks;
 
 	kvm_async_pf_hash_reset(vcpu);
+	kvm_pmu_init(vcpu);
 
 	return 0;
 fail_free_mce_banks:
@@ -5934,6 +5937,7 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
 {
 	int idx;
 
+	kvm_pmu_destroy(vcpu);
 	kfree(vcpu->arch.mce_banks);
 	kvm_free_lapic(vcpu);
 	idx = srcu_read_lock(&vcpu->kvm->srcu);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 6136821..c22dc2f 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -50,6 +50,8 @@
 #define KVM_REQ_APF_HALT          12
 #define KVM_REQ_STEAL_UPDATE      13
 #define KVM_REQ_NMI               14
+#define KVM_REQ_PMU               16
+#define KVM_REQ_PMI               17
 
 #define KVM_USERSPACE_IRQ_SOURCE_ID	0
 
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 5/7] KVM: Add generic RDPMC support
  2012-08-02 15:19             ` Stefan Bader
                                 ` (3 preceding siblings ...)
  2012-08-02 15:19               ` [PATCH 4/7] KVM: Expose a version 2 architectural PMU to a guest Stefan Bader
@ 2012-08-02 15:19               ` Stefan Bader
  2012-08-02 15:19               ` [PATCH 6/7] KVM: SVM: Intercept RDPMC Stefan Bader
                                 ` (2 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Stefan Bader @ 2012-08-02 15:19 UTC (permalink / raw)
  To: Avi Kivity; +Cc: nyh, Stefan Bader, Gleb Natapov, Andy Whitcroft, kvm

From: Avi Kivity <avi@redhat.com>

commit 022cd0e84020eec8b589bc119699c935c7b29584 upstream

Add a helper function that emulates the RDPMC instruction operation.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

BugLink: http://bugs.launchpad.net/bugs/1031090
(cherry-picked from commit 022cd0e84020eec8b589bc119699c935c7b29584 upstream)
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
---
 arch/x86/include/asm/kvm_host.h |    1 +
 arch/x86/kvm/x86.c              |   15 +++++++++++++++
 2 files changed, 16 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 80a5f1e..c819342 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -769,6 +769,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data);
 
 unsigned long kvm_get_rflags(struct kvm_vcpu *vcpu);
 void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags);
+bool kvm_rdpmc(struct kvm_vcpu *vcpu);
 
 void kvm_queue_exception(struct kvm_vcpu *vcpu, unsigned nr);
 void kvm_queue_exception_e(struct kvm_vcpu *vcpu, unsigned nr, u32 error_code);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 29ece09..e524ee8 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -760,6 +760,21 @@ int kvm_get_dr(struct kvm_vcpu *vcpu, int dr, unsigned long *val)
 }
 EXPORT_SYMBOL_GPL(kvm_get_dr);
 
+bool kvm_rdpmc(struct kvm_vcpu *vcpu)
+{
+	u32 ecx = kvm_register_read(vcpu, VCPU_REGS_RCX);
+	u64 data;
+	int err;
+
+	err = kvm_pmu_read_pmc(vcpu, ecx, &data);
+	if (err)
+		return err;
+	kvm_register_write(vcpu, VCPU_REGS_RAX, (u32)data);
+	kvm_register_write(vcpu, VCPU_REGS_RDX, data >> 32);
+	return err;
+}
+EXPORT_SYMBOL_GPL(kvm_rdpmc);
+
 /*
  * List of msr numbers which we expose to userspace through KVM_GET_MSRS
  * and KVM_SET_MSRS, and KVM_GET_MSR_INDEX_LIST.
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 6/7] KVM: SVM: Intercept RDPMC
  2012-08-02 15:19             ` Stefan Bader
                                 ` (4 preceding siblings ...)
  2012-08-02 15:19               ` [PATCH 5/7] KVM: Add generic RDPMC support Stefan Bader
@ 2012-08-02 15:19               ` Stefan Bader
  2012-08-02 15:19               ` [PATCH 7/7] KVM: VMX: " Stefan Bader
  2012-08-02 15:26               ` Nested kvm_intel broken on pre 3.3 hosts Avi Kivity
  7 siblings, 0 replies; 23+ messages in thread
From: Stefan Bader @ 2012-08-02 15:19 UTC (permalink / raw)
  To: Avi Kivity; +Cc: nyh, Stefan Bader, Gleb Natapov, Andy Whitcroft, kvm

From: Avi Kivity <avi@redhat.com>

commit 332b56e4841ef62db4dbf1b4b92195575e1c7338 upstream

Intercept RDPMC and forward it to the PMU emulation code.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

BugLink: http://bugs.launchpad.net/bugs/1031090
(cherry-picked from commit 332b56e4841ef62db4dbf1b4b92195575e1c7338 upstream)
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
---
 arch/x86/kvm/svm.c |   15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 94a4672..e385214 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -1019,6 +1019,7 @@ static void init_vmcb(struct vcpu_svm *svm)
 	set_intercept(svm, INTERCEPT_NMI);
 	set_intercept(svm, INTERCEPT_SMI);
 	set_intercept(svm, INTERCEPT_SELECTIVE_CR0);
+	set_intercept(svm, INTERCEPT_RDPMC);
 	set_intercept(svm, INTERCEPT_CPUID);
 	set_intercept(svm, INTERCEPT_INVD);
 	set_intercept(svm, INTERCEPT_HLT);
@@ -2775,6 +2776,19 @@ static int emulate_on_interception(struct vcpu_svm *svm)
 	return emulate_instruction(&svm->vcpu, 0) == EMULATE_DONE;
 }
 
+static int rdpmc_interception(struct vcpu_svm *svm)
+{
+	int err;
+
+	if (!static_cpu_has(X86_FEATURE_NRIPS))
+		return emulate_on_interception(svm);
+
+	err = kvm_rdpmc(&svm->vcpu);
+	kvm_complete_insn_gp(&svm->vcpu, err);
+
+	return 1;
+}
+
 bool check_selective_cr0_intercepted(struct vcpu_svm *svm, unsigned long val)
 {
 	unsigned long cr0 = svm->vcpu.arch.cr0;
@@ -3195,6 +3209,7 @@ static int (*svm_exit_handlers[])(struct vcpu_svm *svm) = {
 	[SVM_EXIT_SMI]				= nop_on_interception,
 	[SVM_EXIT_INIT]				= nop_on_interception,
 	[SVM_EXIT_VINTR]			= interrupt_window_interception,
+	[SVM_EXIT_RDPMC]			= rdpmc_interception,
 	[SVM_EXIT_CPUID]			= cpuid_interception,
 	[SVM_EXIT_IRET]                         = iret_interception,
 	[SVM_EXIT_INVD]                         = emulate_on_interception,
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 7/7] KVM: VMX: Intercept RDPMC
  2012-08-02 15:19             ` Stefan Bader
                                 ` (5 preceding siblings ...)
  2012-08-02 15:19               ` [PATCH 6/7] KVM: SVM: Intercept RDPMC Stefan Bader
@ 2012-08-02 15:19               ` Stefan Bader
  2012-08-02 15:26               ` Nested kvm_intel broken on pre 3.3 hosts Avi Kivity
  7 siblings, 0 replies; 23+ messages in thread
From: Stefan Bader @ 2012-08-02 15:19 UTC (permalink / raw)
  To: Avi Kivity; +Cc: nyh, Stefan Bader, Gleb Natapov, Andy Whitcroft, kvm

From: Avi Kivity <avi@redhat.com>

commit fee84b079d5ddee2247b5c1f53162c330c622902 upstream

Intercept RDPMC and forward it to the PMU emulation code.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

BugLink: http://bugs.launchpad.net/bugs/1031090
(cherry-picked from fee84b079d5ddee2247b5c1f53162c330c622902 upstream)
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
---
 arch/x86/kvm/vmx.c |   15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 114fe29..bcd59a9 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1957,6 +1957,7 @@ static __init void nested_vmx_setup_ctls_msrs(void)
 #endif
 		CPU_BASED_MOV_DR_EXITING | CPU_BASED_UNCOND_IO_EXITING |
 		CPU_BASED_USE_IO_BITMAPS | CPU_BASED_MONITOR_EXITING |
+		CPU_BASED_RDPMC_EXITING |
 		CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
 	/*
 	 * We can allow some features even when not supported by the
@@ -2415,7 +2416,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
 	      CPU_BASED_USE_TSC_OFFSETING |
 	      CPU_BASED_MWAIT_EXITING |
 	      CPU_BASED_MONITOR_EXITING |
-	      CPU_BASED_INVLPG_EXITING;
+	      CPU_BASED_INVLPG_EXITING |
+	      CPU_BASED_RDPMC_EXITING;
 
 	if (yield_on_hlt)
 		min |= CPU_BASED_HLT_EXITING;
@@ -4614,6 +4616,16 @@ static int handle_invlpg(struct kvm_vcpu *vcpu)
 	return 1;
 }
 
+static int handle_rdpmc(struct kvm_vcpu *vcpu)
+{
+	int err;
+
+	err = kvm_rdpmc(vcpu);
+	kvm_complete_insn_gp(vcpu, err);
+
+	return 1;
+}
+
 static int handle_wbinvd(struct kvm_vcpu *vcpu)
 {
 	skip_emulated_instruction(vcpu);
@@ -5564,6 +5576,7 @@ static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
 	[EXIT_REASON_HLT]                     = handle_halt,
 	[EXIT_REASON_INVD]		      = handle_invd,
 	[EXIT_REASON_INVLPG]		      = handle_invlpg,
+	[EXIT_REASON_RDPMC]                   = handle_rdpmc,
 	[EXIT_REASON_VMCALL]                  = handle_vmcall,
 	[EXIT_REASON_VMCLEAR]	              = handle_vmclear,
 	[EXIT_REASON_VMLAUNCH]                = handle_vmlaunch,
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: Nested kvm_intel broken on pre 3.3 hosts
  2012-08-02 15:19             ` Stefan Bader
                                 ` (6 preceding siblings ...)
  2012-08-02 15:19               ` [PATCH 7/7] KVM: VMX: " Stefan Bader
@ 2012-08-02 15:26               ` Avi Kivity
  2012-08-03 10:55                 ` (unknown), Stefan Bader
  2012-08-03 10:57                 ` Nested kvm_intel broken on pre 3.3 hosts Stefan Bader
  7 siblings, 2 replies; 23+ messages in thread
From: Avi Kivity @ 2012-08-02 15:26 UTC (permalink / raw)
  To: Stefan Bader; +Cc: nyh, Gleb Natapov, Andy Whitcroft, kvm

On 08/02/2012 06:19 PM, Stefan Bader wrote:
> I started by picking #7 (#6 is included to keep things in sync
> between SVM and VMX). Most of the other patches were then needed as
> dependencies. The only difference here is #2, which I found applied
> together with #1 (which is a dependency). Since #2 is a change to add
> support rather than to fix a bug, it was applied only to our 3.2
> tree. But maybe this would be the time to get it into upstream 3.2
> stable as well. I would leave that decision to you; just note that
> the testing I have done was basically with that addition. The test
> was just an installation of a nested guest (so not necessarily
> testing RDPMC, only the fact that with this applied to 3.2, a 3.5
> guest is able to load the kvm-intel module).
> 
> What I did not do is look much in the other direction, i.e. which
> patches may be important now that this feature would be supported in
> 3.2. I think people on this list are likely in a much better position
> to decide that.
> 
> So here is the series I used to compile and test on top of 3.2:
> 
> 0001-KVM-Move-cpuid-code-to-new-file.patch
> 0002-KVM-expose-latest-Intel-cpu-new-features-BMI1-BMI2-F.patch
> 0003-KVM-Expose-kvm_lapic_local_deliver.patch
> 0004-KVM-Expose-a-version-2-architectural-PMU-to-a-guests.patch
> 0005-KVM-Add-generic-RDPMC-support.patch
> 0006-KVM-SVM-Intercept-RDPMC.patch
> 0007-KVM-VMX-Intercept-RDPMC.patch

No, you're backporting the entire feature.  All we need is to expose
RDPMC intercept to the guest.

It should be sufficient to backport the bits in
nested_vmx_setup_ctls_msrs() and nested_vmx_exit_handled().
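
Roughly, the nested_vmx_setup_ctls_msrs() bit is the same one-line
hunk that patch 7/7 already carries, adding the control to the
advertised proc-based mask (a sketch against plain 3.2, untested):

 		CPU_BASED_USE_IO_BITMAPS | CPU_BASED_MONITOR_EXITING |
+		CPU_BASED_RDPMC_EXITING |
 		CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;

The nested_vmx_exit_handled() side would be a matching case in its
switch; a sketch of that follows later in the thread.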

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Nested kvm_intel broken on pre 3.3 hosts
  2012-08-02 15:26               ` Nested kvm_intel broken on pre 3.3 hosts Avi Kivity
  2012-08-03 10:55                 ` (unknown), Stefan Bader
@ 2012-08-03 10:57                 ` Stefan Bader
  2012-08-05  9:18                   ` Avi Kivity
  1 sibling, 1 reply; 23+ messages in thread
From: Stefan Bader @ 2012-08-03 10:57 UTC (permalink / raw)
  To: Avi Kivity; +Cc: nyh, Stefan Bader, Gleb Natapov, Andy Whitcroft, kvm

> No, you're backporting the entire feature.  All we need is to expose
> RDPMC intercept to the guest.

Oh well, I thought that was the thing you asked for...

> It should be sufficient to backport the bits in
> nested_vmx_setup_ctls_msrs() and nested_vmx_exit_handled().

Ok, how about that? It is probably wrong again, but at least it
allows loading the kvm-intel module from within a nested guest, and
since the feature is not implemented, pretending the call fails seems
the closest thing to do...

---

From 0aeb99348363b7aeb2b0bd92428cb212159fa468 Mon Sep 17 00:00:00 2001
From: Stefan Bader <stefan.bader@canonical.com>
Date: Thu, 10 Nov 2011 14:57:25 +0200
Subject: [PATCH] KVM: VMX: Fake intercept RDPMC

Based on commit fee84b079d5ddee2247b5c1f53162c330c622902 upstream.

  Intercept RDPMC and forward it to the PMU emulation code.

But drop the requirement for the feature to be present and, instead
of forwarding, cause a GP as if the call had failed.

BugLink: http://bugs.launchpad.net/bugs/1031090
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
---
 arch/x86/kvm/vmx.c |   10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 7315488..fc937f2 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1956,6 +1956,7 @@ static __init void nested_vmx_setup_ctls_msrs(void)
 #endif
 		CPU_BASED_MOV_DR_EXITING | CPU_BASED_UNCOND_IO_EXITING |
 		CPU_BASED_USE_IO_BITMAPS | CPU_BASED_MONITOR_EXITING |
+		CPU_BASED_RDPMC_EXITING |
 		CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
 	/*
 	 * We can allow some features even when not supported by the
@@ -4613,6 +4614,14 @@ static int handle_invlpg(struct kvm_vcpu *vcpu)
 	return 1;
 }
 
+static int handle_rdpmc(struct kvm_vcpu *vcpu)
+{
+	/* Instead of implementing the feature, cause a GP */
+	kvm_complete_insn_gp(vcpu, 1);
+
+	return 1;
+}
+
 static int handle_wbinvd(struct kvm_vcpu *vcpu)
 {
 	skip_emulated_instruction(vcpu);
@@ -5563,6 +5572,7 @@ static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
 	[EXIT_REASON_HLT]                     = handle_halt,
 	[EXIT_REASON_INVD]		      = handle_invd,
 	[EXIT_REASON_INVLPG]		      = handle_invlpg,
+	[EXIT_REASON_RDPMC]                   = handle_rdpmc,
 	[EXIT_REASON_VMCALL]                  = handle_vmcall,
 	[EXIT_REASON_VMCLEAR]	              = handle_vmclear,
 	[EXIT_REASON_VMLAUNCH]                = handle_vmlaunch,
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: Nested kvm_intel broken on pre 3.3 hosts
  2012-08-03 10:57                 ` Nested kvm_intel broken on pre 3.3 hosts Stefan Bader
@ 2012-08-05  9:18                   ` Avi Kivity
  2012-08-06 14:40                     ` Stefan Bader
  0 siblings, 1 reply; 23+ messages in thread
From: Avi Kivity @ 2012-08-05  9:18 UTC (permalink / raw)
  To: Stefan Bader; +Cc: nyh, Gleb Natapov, Andy Whitcroft, kvm

On 08/03/2012 01:57 PM, Stefan Bader wrote:
>> No, you're backporting the entire feature.  All we need is to expose
>> RDPMC intercept to the guest.
> 
> Oh well, I thought that was the thing you asked for...

Sorry for being unclear.

> 
>> It should be sufficient to backport the bits in
>> nested_vmx_setup_ctls_msrs() and nested_vmx_exit_handled().
> 
> Ok, how about that? It is probably wrong again, but at least it
> allows loading the kvm-intel module from within a nested guest, and
> since the feature is not implemented, pretending the call fails seems
> the closest thing to do...
> 
> ---
> 
> From 0aeb99348363b7aeb2b0bd92428cb212159fa468 Mon Sep 17 00:00:00 2001
> From: Stefan Bader <stefan.bader@canonical.com>
> Date: Thu, 10 Nov 2011 14:57:25 +0200
> Subject: [PATCH] KVM: VMX: Fake intercept RDPMC
> 
> Based on commit fee84b079d5ddee2247b5c1f53162c330c622902 upstream.
> 
>   Intercept RDPMC and forward it to the PMU emulation code.
> 
> But drop the requirement for the feature to be present and, instead
> of forwarding, cause a GP as if the call had failed.
> 
> BugLink: http://bugs.launchpad.net/bugs/1031090
> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
> ---
>  arch/x86/kvm/vmx.c |   10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 7315488..fc937f2 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -1956,6 +1956,7 @@ static __init void nested_vmx_setup_ctls_msrs(void)
>  #endif
>  		CPU_BASED_MOV_DR_EXITING | CPU_BASED_UNCOND_IO_EXITING |
>  		CPU_BASED_USE_IO_BITMAPS | CPU_BASED_MONITOR_EXITING |
> +		CPU_BASED_RDPMC_EXITING |
>  		CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
>  	/*
>  	 * We can allow some features even when not supported by the
> @@ -4613,6 +4614,14 @@ static int handle_invlpg(struct kvm_vcpu *vcpu)
>  	return 1;
>  }
>  
> +static int handle_rdpmc(struct kvm_vcpu *vcpu)
> +{
> +	/* Instead of implementing the feature, cause a GP */
> +	kvm_complete_insn_gp(vcpu, 1);
> +
> +	return 1;
> +}

In fact this should never be called, since we never request RDPMC
exiting for L1.
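
(To spell that out: judging by the setup_vmcs_config() context quoted
in patch 7/7 earlier, the 3.2 required control mask is, abridged,

	min = CPU_BASED_USE_TSC_OFFSETING |
	      CPU_BASED_MWAIT_EXITING |
	      CPU_BASED_MONITOR_EXITING |
	      CPU_BASED_INVLPG_EXITING;

with no CPU_BASED_RDPMC_EXITING anywhere, so vmcs01 never enables the
exit and handle_rdpmc() stays unreachable for L1.)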

> +
>  static int handle_wbinvd(struct kvm_vcpu *vcpu)
>  {
>  	skip_emulated_instruction(vcpu);
> @@ -5563,6 +5572,7 @@ static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
>  	[EXIT_REASON_HLT]                     = handle_halt,
>  	[EXIT_REASON_INVD]		      = handle_invd,
>  	[EXIT_REASON_INVLPG]		      = handle_invlpg,
> +	[EXIT_REASON_RDPMC]                   = handle_rdpmc,
>  	[EXIT_REASON_VMCALL]                  = handle_vmcall,
>  	[EXIT_REASON_VMCLEAR]	              = handle_vmclear,
>  	[EXIT_REASON_VMLAUNCH]                = handle_vmlaunch,
> 

Provided you backport the bit in nested_vmx_exit_handled().  That takes
the L2->L1 RDPMC exit and forwards it to L1.
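
Presumably a case along these lines in the nested_vmx_exit_handled()
switch (a sketch that assumes the 3.1+ nested_cpu_has() helper; not a
tested hunk):

	case EXIT_REASON_RDPMC:
		/* reflect the exit to L1 only if L1 enabled the control */
		return nested_cpu_has(vmcs12, CPU_BASED_RDPMC_EXITING);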

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Nested kvm_intel broken on pre 3.3 hosts
  2012-08-05  9:18                   ` Avi Kivity
@ 2012-08-06 14:40                     ` Stefan Bader
  2012-08-09  7:13                       ` Stefan Bader
  0 siblings, 1 reply; 23+ messages in thread
From: Stefan Bader @ 2012-08-06 14:40 UTC (permalink / raw)
  To: Avi Kivity; +Cc: nyh, Gleb Natapov, Andy Whitcroft, kvm


[-- Attachment #1.1: Type: text/plain, Size: 3709 bytes --]

On 05.08.2012 11:18, Avi Kivity wrote:
> On 08/03/2012 01:57 PM, Stefan Bader wrote:
>>> No, you're backporting the entire feature.  All we need is to expose
>>> RDPMC intercept to the guest.
>>
>> Oh well, I thought that was the thing you asked for...
> 
> Sorry for being unclear.
> 
>>
>>> It should be sufficient to backport the bits in
>>> nested_vmx_setup_ctls_msrs() and nested_vmx_exit_handled().
>>
>> Ok, how about that? It is probably wrong again, but at least it
>> allows loading the kvm-intel module from within a nested guest, and
>> since the feature is not implemented, pretending the call fails seems
>> the closest thing to do...
>>
>> ---
>>
>> From 0aeb99348363b7aeb2b0bd92428cb212159fa468 Mon Sep 17 00:00:00 2001
>> From: Stefan Bader <stefan.bader@canonical.com>
>> Date: Thu, 10 Nov 2011 14:57:25 +0200
>> Subject: [PATCH] KVM: VMX: Fake intercept RDPMC
>>
>> Based on commit fee84b079d5ddee2247b5c1f53162c330c622902 upstream.
>>
>>   Intercept RDPMC and forward it to the PMU emulation code.
>>
>> But drop the requirement for the feature to be present and, instead
>> of forwarding, cause a GP as if the call had failed.
>>
>> BugLink: http://bugs.launchpad.net/bugs/1031090
>> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
>> ---
>>  arch/x86/kvm/vmx.c |   10 ++++++++++
>>  1 file changed, 10 insertions(+)
>>
>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> index 7315488..fc937f2 100644
>> --- a/arch/x86/kvm/vmx.c
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -1956,6 +1956,7 @@ static __init void nested_vmx_setup_ctls_msrs(void)
>>  #endif
>>  		CPU_BASED_MOV_DR_EXITING | CPU_BASED_UNCOND_IO_EXITING |
>>  		CPU_BASED_USE_IO_BITMAPS | CPU_BASED_MONITOR_EXITING |
>> +		CPU_BASED_RDPMC_EXITING |
>>  		CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
>>  	/*
>>  	 * We can allow some features even when not supported by the
>> @@ -4613,6 +4614,14 @@ static int handle_invlpg(struct kvm_vcpu *vcpu)
>>  	return 1;
>>  }
>>  
>> +static int handle_rdpmc(struct kvm_vcpu *vcpu)
>> +{
>> +	/* Instead of implementing the feature, cause a GP */
>> +	kvm_complete_insn_gp(vcpu, 1);
>> +
>> +	return 1;
>> +}
> 
> In fact this should never be called, since we never request RDPMC
> exiting for L1.
> 
>> +
>>  static int handle_wbinvd(struct kvm_vcpu *vcpu)
>>  {
>>  	skip_emulated_instruction(vcpu);
>> @@ -5563,6 +5572,7 @@ static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
>>  	[EXIT_REASON_HLT]                     = handle_halt,
>>  	[EXIT_REASON_INVD]		      = handle_invd,
>>  	[EXIT_REASON_INVLPG]		      = handle_invlpg,
>> +	[EXIT_REASON_RDPMC]                   = handle_rdpmc,
>>  	[EXIT_REASON_VMCALL]                  = handle_vmcall,
>>  	[EXIT_REASON_VMCLEAR]	              = handle_vmclear,
>>  	[EXIT_REASON_VMLAUNCH]                = handle_vmlaunch,
>>
> 
> Provided you backport the bit in nested_vmx_exit_handled().  That takes
> the L2->L1 RDPMC exit and forwards it to L1.
> 
The nested_vmx_exit_handled() function is already there (in 3.1).

commit 644d711aa0e16111d8aba6d289caebec013e26ea
Author: Nadav Har'El <nyh@il.ibm.com>
Date:   Wed May 25 23:12:35 2011 +0300

    KVM: nVMX: Deciding if L0 or L1 should handle an L2 exit

Though I checked what happens if I drop everything besides the change
in nested_vmx_setup_ctls_msrs. If running "perf test" (which does some
RDPMC testing) is sufficient to prove things are ok, then it seems to
be good enough. That test already fails on the perf_event_open syscall
(returning -ENOENT).

So this remaining change (the attached patch) would be enough to allow
the kvm_intel module to be loaded from newer kernels on a 3.2 host.
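
As an aside, a quick way to poke RDPMC from inside the guest without
perf would be a throwaway probe along these lines (illustrative only,
not taken from any of the patches; a #GP raised by RDPMC reaches user
space as SIGSEGV):

	#include <setjmp.h>
	#include <signal.h>
	#include <stdint.h>
	#include <stdio.h>

	static sigjmp_buf env;

	static void on_fault(int sig)
	{
		siglongjmp(env, 1);
	}

	int main(void)
	{
		uint32_t lo, hi;

		/* rdpmc with CR4.PCE clear (or no PMU) raises #GP -> SIGSEGV */
		signal(SIGSEGV, on_fault);
		if (sigsetjmp(env, 1)) {
			printf("rdpmc(0) faulted (#GP)\n");
			return 0;
		}
		asm volatile("rdpmc" : "=a"(lo), "=d"(hi) : "c"(0));
		printf("rdpmc(0) = %#llx\n",
		       ((unsigned long long)hi << 32) | lo);
		return 0;
	}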


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1.2: 0001-KVM-VMX-Set-CPU_BASED_RDPMC_EXITING-for-nested.patch --]
[-- Type: text/x-diff; name="0001-KVM-VMX-Set-CPU_BASED_RDPMC_EXITING-for-nested.patch", Size: 1220 bytes --]

From b79a5f03b4d9a1a56949d6ef38fd4879ff1b8aee Mon Sep 17 00:00:00 2001
From: Stefan Bader <stefan.bader@canonical.com>
Date: Thu, 10 Nov 2011 14:57:25 +0200
Subject: [PATCH] KVM: VMX: Set CPU_BASED_RDPMC_EXITING for nested

Based on commit fee84b079d5ddee2247b5c1f53162c330c622902 upstream.

  Intercept RDPMC and forward it to the PMU emulation code.

Newer vmx code will only allow the kvm_intel module to be loaded if
RDPMC_EXITING is supported. Even without the actual RDPMC emulation,
this part of the change is required on 3.2 hosts.

BugLink: http://bugs.launchpad.net/bugs/1031090
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
---
 arch/x86/kvm/vmx.c |    1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 114fe29..94e6749 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1957,6 +1957,7 @@ static __init void nested_vmx_setup_ctls_msrs(void)
 #endif
 		CPU_BASED_MOV_DR_EXITING | CPU_BASED_UNCOND_IO_EXITING |
 		CPU_BASED_USE_IO_BITMAPS | CPU_BASED_MONITOR_EXITING |
+		CPU_BASED_RDPMC_EXITING |
 		CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
 	/*
 	 * We can allow some features even when not supported by the
-- 
1.7.9.5


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 900 bytes --]

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: Nested kvm_intel broken on pre 3.3 hosts
  2012-08-06 14:40                     ` Stefan Bader
@ 2012-08-09  7:13                       ` Stefan Bader
  2012-08-09  9:34                         ` Avi Kivity
  0 siblings, 1 reply; 23+ messages in thread
From: Stefan Bader @ 2012-08-09  7:13 UTC (permalink / raw)
  To: Avi Kivity; +Cc: nyh, Gleb Natapov, Andy Whitcroft, kvm

[-- Attachment #1: Type: text/plain, Size: 5247 bytes --]

On 06.08.2012 16:40, Stefan Bader wrote:
> On 05.08.2012 11:18, Avi Kivity wrote:
>> On 08/03/2012 01:57 PM, Stefan Bader wrote:
>>>> No, you're backporting the entire feature.  All we need is to expose
>>>> RDPMC intercept to the guest.
>>>
>>> Oh well, I thought that was the thing you asked for...
>>
>> Sorry for being unclear.
>>
>>>
>>>> It should be sufficient to backport the bits in
>>>> nested_vmx_setup_ctls_msrs() and nested_vmx_exit_handled().
>>>
>>> Ok, how about that? It is probably wrong again, but at least it
>>> allows loading the kvm-intel module from within a nested guest, and
>>> since the feature is not implemented, pretending the call fails seems
>>> the closest thing to do...
>>>
>>> ---
>>>
>>> From 0aeb99348363b7aeb2b0bd92428cb212159fa468 Mon Sep 17 00:00:00 2001
>>> From: Stefan Bader <stefan.bader@canonical.com>
>>> Date: Thu, 10 Nov 2011 14:57:25 +0200
>>> Subject: [PATCH] KVM: VMX: Fake intercept RDPMC
>>>
>>> Based on commit fee84b079d5ddee2247b5c1f53162c330c622902 upstream.
>>>
>>>   Intercept RDPMC and forward it to the PMU emulation code.
>>>
>>> But drop the requirement for the feature to be present and, instead
>>> of forwarding, cause a GP as if the call had failed.
>>>
>>> BugLink: http://bugs.launchpad.net/bugs/1031090
>>> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
>>> ---
>>>  arch/x86/kvm/vmx.c |   10 ++++++++++
>>>  1 file changed, 10 insertions(+)
>>>
>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>>> index 7315488..fc937f2 100644
>>> --- a/arch/x86/kvm/vmx.c
>>> +++ b/arch/x86/kvm/vmx.c
>>> @@ -1956,6 +1956,7 @@ static __init void nested_vmx_setup_ctls_msrs(void)
>>>  #endif
>>>  		CPU_BASED_MOV_DR_EXITING | CPU_BASED_UNCOND_IO_EXITING |
>>>  		CPU_BASED_USE_IO_BITMAPS | CPU_BASED_MONITOR_EXITING |
>>> +		CPU_BASED_RDPMC_EXITING |
>>>  		CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
>>>  	/*
>>>  	 * We can allow some features even when not supported by the
>>> @@ -4613,6 +4614,14 @@ static int handle_invlpg(struct kvm_vcpu *vcpu)
>>>  	return 1;
>>>  }
>>>  
>>> +static int handle_rdpmc(struct kvm_vcpu *vcpu)
>>> +{
>>> +	/* Instead of implementing the feature, cause a GP */
>>> +	kvm_complete_insn_gp(vcpu, 1);
>>> +
>>> +	return 1;
>>> +}
>>
>> In fact this should never be called, since we never request RDPMC
>> exiting for L1.
>>
>>> +
>>>  static int handle_wbinvd(struct kvm_vcpu *vcpu)
>>>  {
>>>  	skip_emulated_instruction(vcpu);
>>> @@ -5563,6 +5572,7 @@ static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
>>>  	[EXIT_REASON_HLT]                     = handle_halt,
>>>  	[EXIT_REASON_INVD]		      = handle_invd,
>>>  	[EXIT_REASON_INVLPG]		      = handle_invlpg,
>>> +	[EXIT_REASON_RDPMC]                   = handle_rdpmc,
>>>  	[EXIT_REASON_VMCALL]                  = handle_vmcall,
>>>  	[EXIT_REASON_VMCLEAR]	              = handle_vmclear,
>>>  	[EXIT_REASON_VMLAUNCH]                = handle_vmlaunch,
>>>
>>
>> Provided you backport the bit in nested_vmx_exit_handled().  That takes
>> the L2->L1 RDPMC exit and forwards it to L1.
>>
> The nested_vmx_exit_handled function is already there (in 3.1).
> 
> commit 644d711aa0e16111d8aba6d289caebec013e26ea
> Author: Nadav Har'El <nyh@il.ibm.com>
> Date:   Wed May 25 23:12:35 2011 +0300
> 
>     KVM: nVMX: Deciding if L0 or L1 should handle an L2 exit
> 
> Though I checked what happens if I drop everything besides the change
> in nested_vmx_setup_ctls_msrs. If running "perf test" (which does some
> RDPMC testing) is sufficient to prove things are ok, then it seems to
> be good enough. That test already fails on the perf_event_open syscall
> (returning -ENOENT).
> 
> So this remaining change (the attached patch) would be enough to allow
> the kvm_intel module to be loaded by newer kernels on a 3.2 host.
> 
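For completeness, a sketch of the L2->L1 forwarding path in
nested_vmx_exit_handled() that Avi referred to (assuming our 3.2 tree
matches upstream here; vmcs12 is the VMCS that L1 prepared for L2):

	case EXIT_REASON_RDPMC:
		/* Let L1 handle the exit iff it enabled the intercept. */
		return nested_cpu_has(vmcs12, CPU_BASED_RDPMC_EXITING);
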
Avi, was the last version of the patch (only adding the flag to the nested MSRs)
good for submitting to stable from your point of view?

(inlining the patch below, though its whitespace will likely be broken)
From b79a5f03b4d9a1a56949d6ef38fd4879ff1b8aee Mon Sep 17 00:00:00 2001
From: Stefan Bader <stefan.bader@canonical.com>
Date: Thu, 10 Nov 2011 14:57:25 +0200
Subject: [PATCH] KVM: VMX: Set CPU_BASED_RDPMC_EXITING for nested

Based on commit fee84b079d5ddee2247b5c1f53162c330c622902 upstream.

  Intercept RDPMC and forward it to the PMU emulation code.

Newer vmx code will only allow the kvm_intel module to load
if RDPMC_EXITING is supported. Even without the actual RDPMC
emulation, this part of the change is required on 3.2 hosts.

BugLink: http://bugs.launchpad.net/bugs/1031090
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
---
 arch/x86/kvm/vmx.c |    1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 114fe29..94e6749 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1957,6 +1957,7 @@ static __init void nested_vmx_setup_ctls_msrs(void)
 #endif
 		CPU_BASED_MOV_DR_EXITING | CPU_BASED_UNCOND_IO_EXITING |
 		CPU_BASED_USE_IO_BITMAPS | CPU_BASED_MONITOR_EXITING |
+		CPU_BASED_RDPMC_EXITING |
 		CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
 	/*
 	 * We can allow some features even when not supported by the
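
With the flag added there, an L1 guest reads it back through the
emulated VMX capability MSR. A sketch of that read path, assuming the
3.2 code matches upstream vmx_get_vmx_msr() (vmx_control_msr() just
packs the allowed-0 low half and allowed-1 high half into one value):

	case MSR_IA32_VMX_PROCBASED_CTLS:
		*pdata = vmx_control_msr(nested_vmx_procbased_ctls_low,
					 nested_vmx_procbased_ctls_high);
		break;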


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 900 bytes --]

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: Nested kvm_intel broken on pre 3.3 hosts
  2012-08-09  7:13                       ` Stefan Bader
@ 2012-08-09  9:34                         ` Avi Kivity
  0 siblings, 0 replies; 23+ messages in thread
From: Avi Kivity @ 2012-08-09  9:34 UTC (permalink / raw)
  To: Stefan Bader; +Cc: nyh, Gleb Natapov, Andy Whitcroft, kvm

On 08/09/2012 10:13 AM, Stefan Bader wrote:

> Avi, was the last version of the patch (only adding the flag to the nested MSRs)
> good for submitting to stable from your point of view?
> 

Yes, it is correct.  I forwarded it to stable, thanks.


-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread

Thread overview: 23+ messages
2012-08-01 11:29 Nested kvm_intel broken on pre 3.3 hosts Stefan Bader
2012-08-01 13:39 ` Gleb Natapov
2012-08-01 14:08   ` Avi Kivity
2012-08-01 14:26     ` Stefan Bader
2012-08-01 14:29       ` Avi Kivity
2012-08-01 15:07         ` Nadav Har'El
2012-08-01 15:10           ` Gleb Natapov
2012-08-01 15:11           ` Avi Kivity
2012-08-02 15:19             ` Stefan Bader
2012-08-02 15:19               ` [PATCH 1/7] KVM: Move cpuid code to new file Stefan Bader
2012-08-02 15:19               ` [PATCH 2/7] KVM: expose latest Intel cpu new features (BMI1/BMI2/FMA/AVX2) to guest Stefan Bader
2012-08-02 15:19               ` [PATCH 3/7] KVM: Expose kvm_lapic_local_deliver() Stefan Bader
2012-08-02 15:19               ` [PATCH 4/7] KVM: Expose a version 2 architectural PMU to a guests Stefan Bader
2012-08-02 15:19               ` [PATCH 5/7] KVM: Add generic RDPMC support Stefan Bader
2012-08-02 15:19               ` [PATCH 6/7] KVM: SVM: Intercept RDPMC Stefan Bader
2012-08-02 15:19               ` [PATCH 7/7] KVM: VMX: " Stefan Bader
2012-08-02 15:26               ` Nested kvm_intel broken on pre 3.3 hosts Avi Kivity
2012-08-03 10:55                 ` (unknown), Stefan Bader
2012-08-03 10:57                 ` Nested kvm_intel broken on pre 3.3 hosts Stefan Bader
2012-08-05  9:18                   ` Avi Kivity
2012-08-06 14:40                     ` Stefan Bader
2012-08-09  7:13                       ` Stefan Bader
2012-08-09  9:34                         ` Avi Kivity
