* [PATCH v5 01/14] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1
2013-07-31 14:48 [PATCH v5 00/14] Nested EPT Gleb Natapov
@ 2013-07-31 14:48 ` Gleb Natapov
2013-08-01 11:22 ` Orit Wasserman
2013-07-31 14:48 ` [PATCH v5 02/14] nEPT: Fix cr3 handling in nested exit and entry Gleb Natapov
` (12 subsequent siblings)
13 siblings, 1 reply; 42+ messages in thread
From: Gleb Natapov @ 2013-07-31 14:48 UTC (permalink / raw)
To: kvm; +Cc: Xiao Guangrong, Jun Nakajima, Yang Zhang, pbonzini
From: Nadav Har'El <nyh@il.ibm.com>
Recent KVM, since http://kerneltrap.org/mailarchive/linux-kvm/2010/5/2/6261577
switches the EFER MSR when EPT is used and the host and guest have different
NX bits. So if we add support for nested EPT (an L1 guest using EPT to run L2)
and want to be able to run recent KVM as L1, we need to allow L1 to use this
EFER-switching feature.
To do this EFER switching, KVM uses VM_ENTRY/EXIT_LOAD_IA32_EFER if available,
and if it isn't, it uses the generic VM_ENTRY/EXIT_MSR_LOAD. This patch adds
support for the former (the latter is still unsupported).
Nested entry and exit emulation (prepare_vmcs02 and load_vmcs12_host_state,
respectively) already handled VM_ENTRY/EXIT_LOAD_IA32_EFER correctly. So all
that's left to do in this patch is to properly advertise this feature to L1.
Note that vmcs12's VM_ENTRY/EXIT_LOAD_IA32_EFER are emulated by L0, by using
vmx_set_efer (which itself sets one of several vmcs02 fields), so we always
support this feature, regardless of whether the host supports it.
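For context, a minimal sketch of the EFER-switching choice described above.
This is illustrative only, not KVM's actual vmx_set_efer(): the helper name
load_guest_efer() is invented here, and it assumes the cpu_has_load_ia32_efer
flag, host_efer variable and add_atomic_switch_msr() helper as they exist in
vmx.c of this era.

/*
 * Sketch: prefer the dedicated VMCS controls; fall back to the generic
 * MSR autoload/autostore lists when the CPU lacks them.
 */
static void load_guest_efer(struct vcpu_vmx *vmx, u64 guest_efer)
{
	if (cpu_has_load_ia32_efer) {
		/* Hardware swaps EFER atomically on VM entry/exit. */
		vmcs_write64(GUEST_IA32_EFER, guest_efer);
		vmcs_write32(VM_ENTRY_CONTROLS,
			     vmcs_read32(VM_ENTRY_CONTROLS) |
			     VM_ENTRY_LOAD_IA32_EFER);
		vmcs_write32(VM_EXIT_CONTROLS,
			     vmcs_read32(VM_EXIT_CONTROLS) |
			     VM_EXIT_LOAD_IA32_EFER);
	} else {
		/* Entry loads guest_efer; exit restores host_efer. */
		add_atomic_switch_msr(vmx, MSR_EFER, guest_efer, host_efer);
	}
}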
Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
arch/x86/kvm/vmx.c | 23 ++++++++++++++++-------
1 file changed, 16 insertions(+), 7 deletions(-)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index e999dc7..27efa6a 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2198,7 +2198,8 @@ static __init void nested_vmx_setup_ctls_msrs(void)
#else
nested_vmx_exit_ctls_high = 0;
#endif
- nested_vmx_exit_ctls_high |= VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR;
+ nested_vmx_exit_ctls_high |= (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
+ VM_EXIT_LOAD_IA32_EFER);
/* entry controls */
rdmsr(MSR_IA32_VMX_ENTRY_CTLS,
@@ -2207,8 +2208,8 @@ static __init void nested_vmx_setup_ctls_msrs(void)
nested_vmx_entry_ctls_low = VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;
nested_vmx_entry_ctls_high &=
VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_IA32E_MODE;
- nested_vmx_entry_ctls_high |= VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;
-
+ nested_vmx_entry_ctls_high |= (VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR |
+ VM_ENTRY_LOAD_IA32_EFER);
/* cpu-based controls */
rdmsr(MSR_IA32_VMX_PROCBASED_CTLS,
nested_vmx_procbased_ctls_low, nested_vmx_procbased_ctls_high);
@@ -7529,10 +7530,18 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
vcpu->arch.cr0_guest_owned_bits &= ~vmcs12->cr0_guest_host_mask;
vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits);
- /* Note: IA32_MODE, LOAD_IA32_EFER are modified by vmx_set_efer below */
- vmcs_write32(VM_EXIT_CONTROLS,
- vmcs12->vm_exit_controls | vmcs_config.vmexit_ctrl);
- vmcs_write32(VM_ENTRY_CONTROLS, vmcs12->vm_entry_controls |
+ /* L2->L1 exit controls are emulated - the hardware exit is to L0 so
+ * we should use its exit controls. Note that VM_EXIT_LOAD_IA32_EFER
+ * bits are further modified by vmx_set_efer() below.
+ */
+ vmcs_write32(VM_EXIT_CONTROLS, vmcs_config.vmexit_ctrl);
+
+ /* vmcs12's VM_ENTRY_LOAD_IA32_EFER and VM_ENTRY_IA32E_MODE are
+ * emulated by vmx_set_efer(), below.
+ */
+ vmcs_write32(VM_ENTRY_CONTROLS,
+ (vmcs12->vm_entry_controls & ~VM_ENTRY_LOAD_IA32_EFER &
+ ~VM_ENTRY_IA32E_MODE) |
(vmcs_config.vmentry_ctrl & ~VM_ENTRY_IA32E_MODE));
if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PAT)
--
1.7.10.4
^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: [PATCH v5 01/14] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1
2013-07-31 14:48 ` [PATCH v5 01/14] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1 Gleb Natapov
@ 2013-08-01 11:22 ` Orit Wasserman
0 siblings, 0 replies; 42+ messages in thread
From: Orit Wasserman @ 2013-08-01 11:22 UTC (permalink / raw)
To: Gleb Natapov; +Cc: kvm, Xiao Guangrong, Jun Nakajima, Yang Zhang, pbonzini
On 07/31/2013 05:48 PM, Gleb Natapov wrote:
> From: Nadav Har'El <nyh@il.ibm.com>
>
> Recent KVM, since http://kerneltrap.org/mailarchive/linux-kvm/2010/5/2/6261577
> switches the EFER MSR when EPT is used and the host and guest have different
> NX bits. So if we add support for nested EPT (an L1 guest using EPT to run L2)
> and want to be able to run recent KVM as L1, we need to allow L1 to use this
> EFER-switching feature.
>
> To do this EFER switching, KVM uses VM_ENTRY/EXIT_LOAD_IA32_EFER if available,
> and if it isn't, it uses the generic VM_ENTRY/EXIT_MSR_LOAD. This patch adds
> support for the former (the latter is still unsupported).
>
> Nested entry and exit emulation (prepare_vmcs02 and load_vmcs12_host_state,
> respectively) already handled VM_ENTRY/EXIT_LOAD_IA32_EFER correctly. So all
> that's left to do in this patch is to properly advertise this feature to L1.
>
> Note that vmcs12's VM_ENTRY/EXIT_LOAD_IA32_EFER are emulated by L0, by using
> vmx_set_efer (which itself sets one of several vmcs02 fields), so we always
> support this feature, regardless of whether the host supports it.
>
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
> Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
> Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
> Signed-off-by: Gleb Natapov <gleb@redhat.com>
> ---
> arch/x86/kvm/vmx.c | 23 ++++++++++++++++-------
> 1 file changed, 16 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index e999dc7..27efa6a 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -2198,7 +2198,8 @@ static __init void nested_vmx_setup_ctls_msrs(void)
> #else
> nested_vmx_exit_ctls_high = 0;
> #endif
> - nested_vmx_exit_ctls_high |= VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR;
> + nested_vmx_exit_ctls_high |= (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
> + VM_EXIT_LOAD_IA32_EFER);
>
> /* entry controls */
> rdmsr(MSR_IA32_VMX_ENTRY_CTLS,
> @@ -2207,8 +2208,8 @@ static __init void nested_vmx_setup_ctls_msrs(void)
> nested_vmx_entry_ctls_low = VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;
> nested_vmx_entry_ctls_high &=
> VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_IA32E_MODE;
> - nested_vmx_entry_ctls_high |= VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;
> -
> + nested_vmx_entry_ctls_high |= (VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR |
> + VM_ENTRY_LOAD_IA32_EFER);
> /* cpu-based controls */
> rdmsr(MSR_IA32_VMX_PROCBASED_CTLS,
> nested_vmx_procbased_ctls_low, nested_vmx_procbased_ctls_high);
> @@ -7529,10 +7530,18 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
> vcpu->arch.cr0_guest_owned_bits &= ~vmcs12->cr0_guest_host_mask;
> vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits);
>
> - /* Note: IA32_MODE, LOAD_IA32_EFER are modified by vmx_set_efer below */
> - vmcs_write32(VM_EXIT_CONTROLS,
> - vmcs12->vm_exit_controls | vmcs_config.vmexit_ctrl);
> - vmcs_write32(VM_ENTRY_CONTROLS, vmcs12->vm_entry_controls |
> + /* L2->L1 exit controls are emulated - the hardware exit is to L0 so
> + * we should use its exit controls. Note that VM_EXIT_LOAD_IA32_EFER
> + * bits are further modified by vmx_set_efer() below.
> + */
> + vmcs_write32(VM_EXIT_CONTROLS, vmcs_config.vmexit_ctrl);
> +
> + /* vmcs12's VM_ENTRY_LOAD_IA32_EFER and VM_ENTRY_IA32E_MODE are
> + * emulated by vmx_set_efer(), below.
> + */
> + vmcs_write32(VM_ENTRY_CONTROLS,
> + (vmcs12->vm_entry_controls & ~VM_ENTRY_LOAD_IA32_EFER &
> + ~VM_ENTRY_IA32E_MODE) |
> (vmcs_config.vmentry_ctrl & ~VM_ENTRY_IA32E_MODE));
>
> if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PAT)
>
Reviewed-by: Orit Wasserman <owasserm@redhat.com>
^ permalink raw reply [flat|nested] 42+ messages in thread
* [PATCH v5 02/14] nEPT: Fix cr3 handling in nested exit and entry
2013-07-31 14:48 [PATCH v5 00/14] Nested EPT Gleb Natapov
2013-07-31 14:48 ` [PATCH v5 01/14] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1 Gleb Natapov
@ 2013-07-31 14:48 ` Gleb Natapov
2013-08-01 11:28 ` Orit Wasserman
2013-07-31 14:48 ` [PATCH v5 03/14] nEPT: Fix wrong test in kvm_set_cr3 Gleb Natapov
` (11 subsequent siblings)
13 siblings, 1 reply; 42+ messages in thread
From: Gleb Natapov @ 2013-07-31 14:48 UTC (permalink / raw)
To: kvm; +Cc: Xiao Guangrong, Jun Nakajima, Yang Zhang, pbonzini
From: Nadav Har'El <nyh@il.ibm.com>
The existing code for handling cr3 and related VMCS fields during nested
exit and entry wasn't correct in all cases:
If L2 is allowed to control cr3 (and this is indeed the case in nested EPT),
during nested exit we must copy the modified cr3 from vmcs02 to vmcs12, and
we forgot to do so. This patch adds this copy.
If L0 isn't controlling cr3 when running L2 (i.e., L0 is using EPT), and
whoever does control cr3 (L1 or L2) is using PAE, the processor might have
saved PDPTEs and we should also save them in vmcs12 (and restore later).
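To make the two far-apart hunks below easier to follow, here is the
save/restore symmetry in a nutshell (a sketch of what the patch does, not
additional code):

/* Nested VM entry, prepare_vmcs02(): vmcs12 -> vmcs02, so the PDPTEs
 * that L1 (or a previous exit) left in vmcs12 reach the hardware. */
vmcs_write64(GUEST_PDPTR0, vmcs12->guest_pdptr0);
/* ... likewise GUEST_PDPTR1..GUEST_PDPTR3 ... */

/* Nested VM exit, prepare_vmcs12(): vmcs02 -> vmcs12, so the PDPTEs
 * the CPU may have cached for L2's PAE paging are not lost. */
vmcs12->guest_pdptr0 = vmcs_read64(GUEST_PDPTR0);
/* ... likewise GUEST_PDPTR1..GUEST_PDPTR3 ... */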
Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
arch/x86/kvm/vmx.c | 26 ++++++++++++++++++++++++++
1 file changed, 26 insertions(+)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 27efa6a..4e98764 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -7595,6 +7595,16 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
kvm_set_cr3(vcpu, vmcs12->guest_cr3);
kvm_mmu_reset_context(vcpu);
+ /*
+ * L1 may access L2's PDPTRs, so load them into vmcs02 from vmcs12
+ */
+ if (enable_ept) {
+ vmcs_write64(GUEST_PDPTR0, vmcs12->guest_pdptr0);
+ vmcs_write64(GUEST_PDPTR1, vmcs12->guest_pdptr1);
+ vmcs_write64(GUEST_PDPTR2, vmcs12->guest_pdptr2);
+ vmcs_write64(GUEST_PDPTR3, vmcs12->guest_pdptr3);
+ }
+
kvm_register_write(vcpu, VCPU_REGS_RSP, vmcs12->guest_rsp);
kvm_register_write(vcpu, VCPU_REGS_RIP, vmcs12->guest_rip);
}
@@ -7917,6 +7927,22 @@ static void prepare_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
vmcs12->guest_pending_dbg_exceptions =
vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
+ /*
+ * In some cases (usually, nested EPT), L2 is allowed to change its
+ * own CR3 without exiting. If it has changed it, we must keep it.
+ * Of course, if L0 is using shadow page tables, GUEST_CR3 was defined
+ * by L0, not L1 or L2, so we mustn't unconditionally copy it to vmcs12.
+ *
+ * Additionally, save L2's PDPTRs into vmcs12.
+ */
+ if (enable_ept) {
+ vmcs12->guest_cr3 = vmcs_read64(GUEST_CR3);
+ vmcs12->guest_pdptr0 = vmcs_read64(GUEST_PDPTR0);
+ vmcs12->guest_pdptr1 = vmcs_read64(GUEST_PDPTR1);
+ vmcs12->guest_pdptr2 = vmcs_read64(GUEST_PDPTR2);
+ vmcs12->guest_pdptr3 = vmcs_read64(GUEST_PDPTR3);
+ }
+
vmcs12->vm_entry_controls =
(vmcs12->vm_entry_controls & ~VM_ENTRY_IA32E_MODE) |
(vmcs_read32(VM_ENTRY_CONTROLS) & VM_ENTRY_IA32E_MODE);
--
1.7.10.4
^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: [PATCH v5 02/14] nEPT: Fix cr3 handling in nested exit and entry
2013-07-31 14:48 ` [PATCH v5 02/14] nEPT: Fix cr3 handling in nested exit and entry Gleb Natapov
@ 2013-08-01 11:28 ` Orit Wasserman
0 siblings, 0 replies; 42+ messages in thread
From: Orit Wasserman @ 2013-08-01 11:28 UTC (permalink / raw)
To: Gleb Natapov; +Cc: kvm, Xiao Guangrong, Jun Nakajima, Yang Zhang, pbonzini
On 07/31/2013 05:48 PM, Gleb Natapov wrote:
> From: Nadav Har'El <nyh@il.ibm.com>
>
> The existing code for handling cr3 and related VMCS fields during nested
> exit and entry wasn't correct in all cases:
>
> If L2 is allowed to control cr3 (and this is indeed the case in nested EPT),
> during nested exit we must copy the modified cr3 from vmcs02 to vmcs12, and
> we forgot to do so. This patch adds this copy.
>
> If L0 isn't controlling cr3 when running L2 (i.e., L0 is using EPT), and
> whoever does control cr3 (L1 or L2) is using PAE, the processor might have
> saved PDPTEs and we should also save them in vmcs12 (and restore later).
>
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
> Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
> Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
> Signed-off-by: Gleb Natapov <gleb@redhat.com>
> ---
> arch/x86/kvm/vmx.c | 26 ++++++++++++++++++++++++++
> 1 file changed, 26 insertions(+)
>
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 27efa6a..4e98764 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -7595,6 +7595,16 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
> kvm_set_cr3(vcpu, vmcs12->guest_cr3);
> kvm_mmu_reset_context(vcpu);
>
> + /*
> + * L1 may access L2's PDPTRs, so load them into vmcs02 from vmcs12
> + */
> + if (enable_ept) {
> + vmcs_write64(GUEST_PDPTR0, vmcs12->guest_pdptr0);
> + vmcs_write64(GUEST_PDPTR1, vmcs12->guest_pdptr1);
> + vmcs_write64(GUEST_PDPTR2, vmcs12->guest_pdptr2);
> + vmcs_write64(GUEST_PDPTR3, vmcs12->guest_pdptr3);
> + }
> +
> kvm_register_write(vcpu, VCPU_REGS_RSP, vmcs12->guest_rsp);
> kvm_register_write(vcpu, VCPU_REGS_RIP, vmcs12->guest_rip);
> }
> @@ -7917,6 +7927,22 @@ static void prepare_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
> vmcs12->guest_pending_dbg_exceptions =
> vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
>
> + /*
> + * In some cases (usually, nested EPT), L2 is allowed to change its
> + * own CR3 without exiting. If it has changed it, we must keep it.
> + * Of course, if L0 is using shadow page tables, GUEST_CR3 was defined
> + * by L0, not L1 or L2, so we mustn't unconditionally copy it to vmcs12.
> + *
> + * Additionally, save L2's PDPTRs into vmcs12.
> + */
> + if (enable_ept) {
> + vmcs12->guest_cr3 = vmcs_read64(GUEST_CR3);
> + vmcs12->guest_pdptr0 = vmcs_read64(GUEST_PDPTR0);
> + vmcs12->guest_pdptr1 = vmcs_read64(GUEST_PDPTR1);
> + vmcs12->guest_pdptr2 = vmcs_read64(GUEST_PDPTR2);
> + vmcs12->guest_pdptr3 = vmcs_read64(GUEST_PDPTR3);
> + }
> +
> vmcs12->vm_entry_controls =
> (vmcs12->vm_entry_controls & ~VM_ENTRY_IA32E_MODE) |
> (vmcs_read32(VM_ENTRY_CONTROLS) & VM_ENTRY_IA32E_MODE);
>
Reviewed-by: Orit Wasserman <owasserm@redhat.com>
^ permalink raw reply [flat|nested] 42+ messages in thread
* [PATCH v5 03/14] nEPT: Fix wrong test in kvm_set_cr3
2013-07-31 14:48 [PATCH v5 00/14] Nested EPT Gleb Natapov
2013-07-31 14:48 ` [PATCH v5 01/14] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1 Gleb Natapov
2013-07-31 14:48 ` [PATCH v5 02/14] nEPT: Fix cr3 handling in nested exit and entry Gleb Natapov
@ 2013-07-31 14:48 ` Gleb Natapov
2013-08-01 12:07 ` Orit Wasserman
2013-07-31 14:48 ` [PATCH v5 04/14] nEPT: Move common code to paging_tmpl.h Gleb Natapov
` (10 subsequent siblings)
13 siblings, 1 reply; 42+ messages in thread
From: Gleb Natapov @ 2013-07-31 14:48 UTC (permalink / raw)
To: kvm; +Cc: Xiao Guangrong, Jun Nakajima, Yang Zhang, pbonzini
From: Nadav Har'El <nyh@il.ibm.com>
kvm_set_cr3() attempts to check if the new cr3 is a valid guest physical
address. The problem is that with nested EPT, cr3 is an *L2* physical
address, not an L1 physical address as this test expects.
As the comment above this test explains, it isn't necessary, and doesn't
correspond to anything a real processor would do. So this patch removes it.
Note that this wrong test could have also theoretically caused problems
in nested NPT, not just in nested EPT. However, in practice, the problem
was avoided: nested_svm_vmexit()/vmrun() do not call kvm_set_cr3 in the
nested NPT case, and instead set the vmcb (and arch.cr3) directly, thus
circumventing the problem. Additional potential calls to the buggy function
are avoided because we don't trap cr3 modifications when nested NPT is
enabled. However, because in nested VMX we did want to use kvm_set_cr3()
(as requested in Avi Kivity's review of the original nested VMX patches),
we can't avoid this problem and need to fix it.
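For reference, here is the removed check with an annotation (the annotation
is editorial, not part of the patch) showing why it misfires under nested EPT:

/* The old check treated cr3 as an L1-physical address: */
if (unlikely(!gfn_to_memslot(vcpu->kvm, cr3 >> PAGE_SHIFT)))
	return 1;
/* With nested EPT, cr3 holds an L2-physical address, so
 * cr3 >> PAGE_SHIFT is an L2 gfn; probing L1's memslots with it
 * wrongly rejects perfectly valid values. */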
Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
arch/x86/kvm/x86.c | 11 -----------
1 file changed, 11 deletions(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d2caeb9..e2fef8b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -682,17 +682,6 @@ int kvm_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
*/
}
- /*
- * Does the new cr3 value map to physical memory? (Note, we
- * catch an invalid cr3 even in real-mode, because it would
- * cause trouble later on when we turn on paging anyway.)
- *
- * A real CPU would silently accept an invalid cr3 and would
- * attempt to use it - with largely undefined (and often hard
- * to debug) behavior on the guest side.
- */
- if (unlikely(!gfn_to_memslot(vcpu->kvm, cr3 >> PAGE_SHIFT)))
- return 1;
vcpu->arch.cr3 = cr3;
__set_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail);
vcpu->arch.mmu.new_cr3(vcpu);
--
1.7.10.4
^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: [PATCH v5 03/14] nEPT: Fix wrong test in kvm_set_cr3
2013-07-31 14:48 ` [PATCH v5 03/14] nEPT: Fix wrong test in kvm_set_cr3 Gleb Natapov
@ 2013-08-01 12:07 ` Orit Wasserman
0 siblings, 0 replies; 42+ messages in thread
From: Orit Wasserman @ 2013-08-01 12:07 UTC (permalink / raw)
To: Gleb Natapov; +Cc: kvm, Xiao Guangrong, Jun Nakajima, Yang Zhang, pbonzini
On 07/31/2013 05:48 PM, Gleb Natapov wrote:
> From: Nadav Har'El <nyh@il.ibm.com>
>
> kvm_set_cr3() attempts to check if the new cr3 is a valid guest physical
> address. The problem is that with nested EPT, cr3 is an *L2* physical
> address, not an L1 physical address as this test expects.
>
> As the comment above this test explains, it isn't necessary, and doesn't
> correspond to anything a real processor would do. So this patch removes it.
>
> Note that this wrong test could have also theoretically caused problems
> in nested NPT, not just in nested EPT. However, in practice, the problem
> was avoided: nested_svm_vmexit()/vmrun() do not call kvm_set_cr3 in the
> nested NPT case, and instead set the vmcb (and arch.cr3) directly, thus
> circumventing the problem. Additional potential calls to the buggy function
> are avoided because we don't trap cr3 modifications when nested NPT is
> enabled. However, because in nested VMX we did want to use kvm_set_cr3()
> (as requested in Avi Kivity's review of the original nested VMX patches),
> we can't avoid this problem and need to fix it.
>
> Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
> Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
> Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
> Signed-off-by: Gleb Natapov <gleb@redhat.com>
> ---
> arch/x86/kvm/x86.c | 11 -----------
> 1 file changed, 11 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index d2caeb9..e2fef8b 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -682,17 +682,6 @@ int kvm_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
> */
> }
>
> - /*
> - * Does the new cr3 value map to physical memory? (Note, we
> - * catch an invalid cr3 even in real-mode, because it would
> - * cause trouble later on when we turn on paging anyway.)
> - *
> - * A real CPU would silently accept an invalid cr3 and would
> - * attempt to use it - with largely undefined (and often hard
> - * to debug) behavior on the guest side.
> - */
> - if (unlikely(!gfn_to_memslot(vcpu->kvm, cr3 >> PAGE_SHIFT)))
> - return 1;
> vcpu->arch.cr3 = cr3;
> __set_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail);
> vcpu->arch.mmu.new_cr3(vcpu);
>
Reviewed-by: Orit Wasserman <owasserm@redhat.com>
^ permalink raw reply [flat|nested] 42+ messages in thread
* [PATCH v5 04/14] nEPT: Move common code to paging_tmpl.h
2013-07-31 14:48 [PATCH v5 00/14] Nested EPT Gleb Natapov
` (2 preceding siblings ...)
2013-07-31 14:48 ` [PATCH v5 03/14] nEPT: Fix wrong test in kvm_set_cr3 Gleb Natapov
@ 2013-07-31 14:48 ` Gleb Natapov
2013-07-31 14:48 ` [PATCH v5 05/14] nEPT: make guest's A/D bits depend on guest's paging mode Gleb Natapov
` (9 subsequent siblings)
13 siblings, 0 replies; 42+ messages in thread
From: Gleb Natapov @ 2013-07-31 14:48 UTC (permalink / raw)
To: kvm; +Cc: Xiao Guangrong, Jun Nakajima, Yang Zhang, pbonzini
From: Nadav Har'El <nyh@il.ibm.com>
In preparation, we just move gpte_access(), prefetch_invalid_gpte(),
is_rsvd_bits_set(), protect_clean_gpte() and is_dirty_gpte() from mmu.c
to paging_tmpl.h.
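For readers new to paging_tmpl.h, this is the multiple-inclusion pattern the
moved functions plug into (a sketch of the existing arrangement in mmu.c):

/* Each inclusion stamps out one set of walker functions, named via the
 * FNAME() macro: paging64_walk_addr(), paging32_walk_addr(), and so on. */
#define PTTYPE 64
#include "paging_tmpl.h"
#undef PTTYPE

#define PTTYPE 32
#include "paging_tmpl.h"
#undef PTTYPE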
Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
arch/x86/kvm/mmu.c | 55 ------------------------------
arch/x86/kvm/paging_tmpl.h | 80 +++++++++++++++++++++++++++++++++++++-------
2 files changed, 68 insertions(+), 67 deletions(-)
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 3a9493a..4c4274d 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -331,11 +331,6 @@ static int is_large_pte(u64 pte)
return pte & PT_PAGE_SIZE_MASK;
}
-static int is_dirty_gpte(unsigned long pte)
-{
- return pte & PT_DIRTY_MASK;
-}
-
static int is_rmap_spte(u64 pte)
{
return is_shadow_present_pte(pte);
@@ -2574,14 +2569,6 @@ static void nonpaging_new_cr3(struct kvm_vcpu *vcpu)
mmu_free_roots(vcpu);
}
-static bool is_rsvd_bits_set(struct kvm_mmu *mmu, u64 gpte, int level)
-{
- int bit7;
-
- bit7 = (gpte >> 7) & 1;
- return (gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0;
-}
-
static pfn_t pte_prefetch_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn,
bool no_dirty_log)
{
@@ -2594,26 +2581,6 @@ static pfn_t pte_prefetch_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn,
return gfn_to_pfn_memslot_atomic(slot, gfn);
}
-static bool prefetch_invalid_gpte(struct kvm_vcpu *vcpu,
- struct kvm_mmu_page *sp, u64 *spte,
- u64 gpte)
-{
- if (is_rsvd_bits_set(&vcpu->arch.mmu, gpte, PT_PAGE_TABLE_LEVEL))
- goto no_present;
-
- if (!is_present_gpte(gpte))
- goto no_present;
-
- if (!(gpte & PT_ACCESSED_MASK))
- goto no_present;
-
- return false;
-
-no_present:
- drop_spte(vcpu->kvm, spte);
- return true;
-}
-
static int direct_pte_prefetch_many(struct kvm_vcpu *vcpu,
struct kvm_mmu_page *sp,
u64 *start, u64 *end)
@@ -3501,18 +3468,6 @@ static void paging_free(struct kvm_vcpu *vcpu)
nonpaging_free(vcpu);
}
-static inline void protect_clean_gpte(unsigned *access, unsigned gpte)
-{
- unsigned mask;
-
- BUILD_BUG_ON(PT_WRITABLE_MASK != ACC_WRITE_MASK);
-
- mask = (unsigned)~ACC_WRITE_MASK;
- /* Allow write access to dirty gptes */
- mask |= (gpte >> (PT_DIRTY_SHIFT - PT_WRITABLE_SHIFT)) & PT_WRITABLE_MASK;
- *access &= mask;
-}
-
static bool sync_mmio_spte(struct kvm *kvm, u64 *sptep, gfn_t gfn,
unsigned access, int *nr_present)
{
@@ -3530,16 +3485,6 @@ static bool sync_mmio_spte(struct kvm *kvm, u64 *sptep, gfn_t gfn,
return false;
}
-static inline unsigned gpte_access(struct kvm_vcpu *vcpu, u64 gpte)
-{
- unsigned access;
-
- access = (gpte & (PT_WRITABLE_MASK | PT_USER_MASK)) | ACC_EXEC_MASK;
- access &= ~(gpte >> PT64_NX_SHIFT);
-
- return access;
-}
-
static inline bool is_last_gpte(struct kvm_mmu *mmu, unsigned level, unsigned gpte)
{
unsigned index;
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 7769699..fb26ca9 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -80,6 +80,31 @@ static gfn_t gpte_to_gfn_lvl(pt_element_t gpte, int lvl)
return (gpte & PT_LVL_ADDR_MASK(lvl)) >> PAGE_SHIFT;
}
+static inline void FNAME(protect_clean_gpte)(unsigned *access, unsigned gpte)
+{
+ unsigned mask;
+
+ BUILD_BUG_ON(PT_WRITABLE_MASK != ACC_WRITE_MASK);
+
+ mask = (unsigned)~ACC_WRITE_MASK;
+ /* Allow write access to dirty gptes */
+ mask |= (gpte >> (PT_DIRTY_SHIFT - PT_WRITABLE_SHIFT)) & PT_WRITABLE_MASK;
+ *access &= mask;
+}
+
+static bool FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte, int level)
+{
+ int bit7;
+
+ bit7 = (gpte >> 7) & 1;
+ return (gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0;
+}
+
+static inline int FNAME(is_present_gpte)(unsigned long pte)
+{
+ return is_present_gpte(pte);
+}
+
static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
pt_element_t __user *ptep_user, unsigned index,
pt_element_t orig_pte, pt_element_t new_pte)
@@ -103,6 +128,36 @@ static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
return (ret != orig_pte);
}
+static bool FNAME(prefetch_invalid_gpte)(struct kvm_vcpu *vcpu,
+ struct kvm_mmu_page *sp, u64 *spte,
+ u64 gpte)
+{
+ if (FNAME(is_rsvd_bits_set)(&vcpu->arch.mmu, gpte, PT_PAGE_TABLE_LEVEL))
+ goto no_present;
+
+ if (!FNAME(is_present_gpte)(gpte))
+ goto no_present;
+
+ if (!(gpte & PT_ACCESSED_MASK))
+ goto no_present;
+
+ return false;
+
+no_present:
+ drop_spte(vcpu->kvm, spte);
+ return true;
+}
+
+static inline unsigned FNAME(gpte_access)(struct kvm_vcpu *vcpu, u64 gpte)
+{
+ unsigned access;
+
+ access = (gpte & (PT_WRITABLE_MASK | PT_USER_MASK)) | ACC_EXEC_MASK;
+ access &= ~(gpte >> PT64_NX_SHIFT);
+
+ return access;
+}
+
static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu,
struct kvm_mmu *mmu,
struct guest_walker *walker,
@@ -123,7 +178,8 @@ static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu,
trace_kvm_mmu_set_accessed_bit(table_gfn, index, sizeof(pte));
pte |= PT_ACCESSED_MASK;
}
- if (level == walker->level && write_fault && !is_dirty_gpte(pte)) {
+ if (level == walker->level && write_fault &&
+ !(pte & PT_DIRTY_MASK)) {
trace_kvm_mmu_set_dirty_bit(table_gfn, index, sizeof(pte));
pte |= PT_DIRTY_MASK;
}
@@ -170,7 +226,7 @@ retry_walk:
if (walker->level == PT32E_ROOT_LEVEL) {
pte = mmu->get_pdptr(vcpu, (addr >> 30) & 3);
trace_kvm_mmu_paging_element(pte, walker->level);
- if (!is_present_gpte(pte))
+ if (!FNAME(is_present_gpte)(pte))
goto error;
--walker->level;
}
@@ -215,17 +271,17 @@ retry_walk:
trace_kvm_mmu_paging_element(pte, walker->level);
- if (unlikely(!is_present_gpte(pte)))
+ if (unlikely(!FNAME(is_present_gpte)(pte)))
goto error;
- if (unlikely(is_rsvd_bits_set(&vcpu->arch.mmu, pte,
+ if (unlikely(FNAME(is_rsvd_bits_set)(&vcpu->arch.mmu, pte,
walker->level))) {
errcode |= PFERR_RSVD_MASK | PFERR_PRESENT_MASK;
goto error;
}
accessed_dirty &= pte;
- pte_access = pt_access & gpte_access(vcpu, pte);
+ pte_access = pt_access & FNAME(gpte_access)(vcpu, pte);
walker->ptes[walker->level - 1] = pte;
} while (!is_last_gpte(mmu, walker->level, pte));
@@ -248,7 +304,7 @@ retry_walk:
walker->gfn = real_gpa >> PAGE_SHIFT;
if (!write_fault)
- protect_clean_gpte(&pte_access, pte);
+ FNAME(protect_clean_gpte)(&pte_access, pte);
else
/*
* On a write fault, fold the dirty bit into accessed_dirty by
@@ -309,14 +365,14 @@ FNAME(prefetch_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
gfn_t gfn;
pfn_t pfn;
- if (prefetch_invalid_gpte(vcpu, sp, spte, gpte))
+ if (FNAME(prefetch_invalid_gpte)(vcpu, sp, spte, gpte))
return false;
pgprintk("%s: gpte %llx spte %p\n", __func__, (u64)gpte, spte);
gfn = gpte_to_gfn(gpte);
- pte_access = sp->role.access & gpte_access(vcpu, gpte);
- protect_clean_gpte(&pte_access, gpte);
+ pte_access = sp->role.access & FNAME(gpte_access)(vcpu, gpte);
+ FNAME(protect_clean_gpte)(&pte_access, gpte);
pfn = pte_prefetch_gfn_to_pfn(vcpu, gfn,
no_dirty_log && (pte_access & ACC_WRITE_MASK));
if (is_error_pfn(pfn))
@@ -785,15 +841,15 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
sizeof(pt_element_t)))
return -EINVAL;
- if (prefetch_invalid_gpte(vcpu, sp, &sp->spt[i], gpte)) {
+ if (FNAME(prefetch_invalid_gpte)(vcpu, sp, &sp->spt[i], gpte)) {
vcpu->kvm->tlbs_dirty++;
continue;
}
gfn = gpte_to_gfn(gpte);
pte_access = sp->role.access;
- pte_access &= gpte_access(vcpu, gpte);
- protect_clean_gpte(&pte_access, gpte);
+ pte_access &= FNAME(gpte_access)(vcpu, gpte);
+ FNAME(protect_clean_gpte)(&pte_access, gpte);
if (sync_mmio_spte(vcpu->kvm, &sp->spt[i], gfn, pte_access,
&nr_present))
--
1.7.10.4
^ permalink raw reply related [flat|nested] 42+ messages in thread
* [PATCH v5 05/14] nEPT: make guest's A/D bits depend on guest's paging mode
2013-07-31 14:48 [PATCH v5 00/14] Nested EPT Gleb Natapov
` (3 preceding siblings ...)
2013-07-31 14:48 ` [PATCH v5 04/14] nEPT: Move common code to paging_tmpl.h Gleb Natapov
@ 2013-07-31 14:48 ` Gleb Natapov
2013-08-01 6:51 ` Xiao Guangrong
2013-07-31 14:48 ` [PATCH v5 06/14] nEPT: Support shadow paging for guest paging without A/D bits Gleb Natapov
` (8 subsequent siblings)
13 siblings, 1 reply; 42+ messages in thread
From: Gleb Natapov @ 2013-07-31 14:48 UTC (permalink / raw)
To: kvm; +Cc: Xiao Guangrong, Jun Nakajima, Yang Zhang, pbonzini
This patch makes the definition of the guest's A/D bits depend on the
paging mode, so that when EPT support is added later they can be defined
differently.
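A sketch of where this is heading; the EPT branch shown here is what patch
07 of this series does with the new macros (including PTTYPE_EPT), not part
of this patch:

#if PTTYPE == 64 || PTTYPE == 32
 /* The new macros alias the architectural A/D bits. */
 #define PT_GUEST_ACCESSED_MASK	PT_ACCESSED_MASK	/* bit 5 */
 #define PT_GUEST_DIRTY_MASK	PT_DIRTY_MASK		/* bit 6 */
#elif PTTYPE == PTTYPE_EPT
 /* EPT here has no A/D bits, so the masks become 0 and the A/D
  * handling compiles away. */
 #define PT_GUEST_ACCESSED_MASK	0
 #define PT_GUEST_DIRTY_MASK	0
#endif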
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
arch/x86/kvm/paging_tmpl.h | 30 ++++++++++++++++++++++--------
1 file changed, 22 insertions(+), 8 deletions(-)
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index fb26ca9..7581395 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -32,6 +32,10 @@
#define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
#define PT_INDEX(addr, level) PT64_INDEX(addr, level)
#define PT_LEVEL_BITS PT64_LEVEL_BITS
+ #define PT_GUEST_ACCESSED_MASK PT_ACCESSED_MASK
+ #define PT_GUEST_DIRTY_MASK PT_DIRTY_MASK
+ #define PT_GUEST_DIRTY_SHIFT PT_DIRTY_SHIFT
+ #define PT_GUEST_ACCESSED_SHIFT PT_ACCESSED_SHIFT
#ifdef CONFIG_X86_64
#define PT_MAX_FULL_LEVELS 4
#define CMPXCHG cmpxchg
@@ -49,6 +53,10 @@
#define PT_INDEX(addr, level) PT32_INDEX(addr, level)
#define PT_LEVEL_BITS PT32_LEVEL_BITS
#define PT_MAX_FULL_LEVELS 2
+ #define PT_GUEST_ACCESSED_MASK PT_ACCESSED_MASK
+ #define PT_GUEST_DIRTY_MASK PT_DIRTY_MASK
+ #define PT_GUEST_DIRTY_SHIFT PT_DIRTY_SHIFT
+ #define PT_GUEST_ACCESSED_SHIFT PT_ACCESSED_SHIFT
#define CMPXCHG cmpxchg
#else
#error Invalid PTTYPE value
@@ -88,7 +96,8 @@ static inline void FNAME(protect_clean_gpte)(unsigned *access, unsigned gpte)
mask = (unsigned)~ACC_WRITE_MASK;
/* Allow write access to dirty gptes */
- mask |= (gpte >> (PT_DIRTY_SHIFT - PT_WRITABLE_SHIFT)) & PT_WRITABLE_MASK;
+ mask |= (gpte >> (PT_GUEST_DIRTY_SHIFT - PT_WRITABLE_SHIFT)) &
+ PT_WRITABLE_MASK;
*access &= mask;
}
@@ -138,7 +147,7 @@ static bool FNAME(prefetch_invalid_gpte)(struct kvm_vcpu *vcpu,
if (!FNAME(is_present_gpte)(gpte))
goto no_present;
- if (!(gpte & PT_ACCESSED_MASK))
+ if (!(gpte & PT_GUEST_ACCESSED_MASK))
goto no_present;
return false;
@@ -174,14 +183,14 @@ static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu,
table_gfn = walker->table_gfn[level - 1];
ptep_user = walker->ptep_user[level - 1];
index = offset_in_page(ptep_user) / sizeof(pt_element_t);
- if (!(pte & PT_ACCESSED_MASK)) {
+ if (!(pte & PT_GUEST_ACCESSED_MASK)) {
trace_kvm_mmu_set_accessed_bit(table_gfn, index, sizeof(pte));
- pte |= PT_ACCESSED_MASK;
+ pte |= PT_GUEST_ACCESSED_MASK;
}
if (level == walker->level && write_fault &&
- !(pte & PT_DIRTY_MASK)) {
+ !(pte & PT_GUEST_DIRTY_MASK)) {
trace_kvm_mmu_set_dirty_bit(table_gfn, index, sizeof(pte));
- pte |= PT_DIRTY_MASK;
+ pte |= PT_GUEST_DIRTY_MASK;
}
if (pte == orig_pte)
continue;
@@ -235,7 +244,7 @@ retry_walk:
ASSERT((!is_long_mode(vcpu) && is_pae(vcpu)) ||
(mmu->get_cr3(vcpu) & CR3_NONPAE_RESERVED_BITS) == 0);
- accessed_dirty = PT_ACCESSED_MASK;
+ accessed_dirty = PT_GUEST_ACCESSED_MASK;
pt_access = pte_access = ACC_ALL;
++walker->level;
@@ -310,7 +319,8 @@ retry_walk:
* On a write fault, fold the dirty bit into accessed_dirty by
* shifting it one place right.
*/
- accessed_dirty &= pte >> (PT_DIRTY_SHIFT - PT_ACCESSED_SHIFT);
+ accessed_dirty &= pte >>
+ (PT_GUEST_DIRTY_SHIFT - PT_GUEST_ACCESSED_SHIFT);
if (unlikely(!accessed_dirty)) {
ret = FNAME(update_accessed_dirty_bits)(vcpu, mmu, walker, write_fault);
@@ -886,3 +896,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
#undef gpte_to_gfn
#undef gpte_to_gfn_lvl
#undef CMPXCHG
+#undef PT_GUEST_ACCESSED_MASK
+#undef PT_GUEST_DIRTY_MASK
+#undef PT_GUEST_DIRTY_SHIFT
+#undef PT_GUEST_ACCESSED_SHIFT
--
1.7.10.4
^ permalink raw reply related [flat|nested] 42+ messages in thread
* [PATCH v5 06/14] nEPT: Support shadow paging for guest paging without A/D bits
2013-07-31 14:48 [PATCH v5 00/14] Nested EPT Gleb Natapov
` (4 preceding siblings ...)
2013-07-31 14:48 ` [PATCH v5 05/14] nEPT: make guest's A/D bits depends on guest's paging mode Gleb Natapov
@ 2013-07-31 14:48 ` Gleb Natapov
2013-08-01 6:54 ` Xiao Guangrong
2013-07-31 14:48 ` [PATCH v5 07/14] nEPT: Add EPT tables support to paging_tmpl.h Gleb Natapov
` (7 subsequent siblings)
13 siblings, 1 reply; 42+ messages in thread
From: Gleb Natapov @ 2013-07-31 14:48 UTC (permalink / raw)
To: kvm; +Cc: Xiao Guangrong, Jun Nakajima, Yang Zhang, pbonzini
Some guest paging modes do not support A/D bits. Add support for such
modes in the shadow paging code. For such modes, PT_GUEST_DIRTY_MASK,
PT_GUEST_ACCESSED_MASK, PT_GUEST_DIRTY_SHIFT and PT_GUEST_ACCESSED_SHIFT
should be set to zero.
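Why a zero mask is sufficient, in a nutshell: the masks are compile-time
constants, so the guards added below cost nothing at runtime. A minimal
sketch (the function name here is illustrative):

static int update_accessed_dirty_bits_sketch(u64 *gpte)
{
	/* For the EPT instantiation PT_GUEST_DIRTY_MASK is the constant 0,
	 * so the compiler resolves this test at build time and discards
	 * the rest of the function as dead code; the paging modes that do
	 * have A/D bits are unaffected. */
	if (!PT_GUEST_DIRTY_MASK)
		return 0;

	*gpte |= PT_GUEST_DIRTY_MASK | PT_GUEST_ACCESSED_MASK;
	return 1;
}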
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
arch/x86/kvm/paging_tmpl.h | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 7581395..2c2f635 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -92,6 +92,10 @@ static inline void FNAME(protect_clean_gpte)(unsigned *access, unsigned gpte)
{
unsigned mask;
+ /* dirty bit is not supported, so no need to track it */
+ if (!PT_GUEST_DIRTY_MASK)
+ return;
+
BUILD_BUG_ON(PT_WRITABLE_MASK != ACC_WRITE_MASK);
mask = (unsigned)~ACC_WRITE_MASK;
@@ -147,7 +151,8 @@ static bool FNAME(prefetch_invalid_gpte)(struct kvm_vcpu *vcpu,
if (!FNAME(is_present_gpte)(gpte))
goto no_present;
- if (!(gpte & PT_GUEST_ACCESSED_MASK))
+ /* if accessed bit is not supported prefetch non accessed gpte */
+ if (PT_GUEST_ACCESSED_MASK && !(gpte & PT_GUEST_ACCESSED_MASK))
goto no_present;
return false;
@@ -178,6 +183,10 @@ static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu,
gfn_t table_gfn;
int ret;
+ /* dirty/accessed bits are not supported, so no need to update them */
+ if (!PT_GUEST_DIRTY_MASK)
+ return 0;
+
for (level = walker->max_level; level >= walker->level; --level) {
pte = orig_pte = walker->ptes[level - 1];
table_gfn = walker->table_gfn[level - 1];
@@ -316,8 +325,9 @@ retry_walk:
FNAME(protect_clean_gpte)(&pte_access, pte);
else
/*
- * On a write fault, fold the dirty bit into accessed_dirty by
- * shifting it one place right.
+ * On a write fault, fold the dirty bit into accessed_dirty.
+ * For modes without A/D bits support accessed_dirty will be
+ * always clear.
*/
accessed_dirty &= pte >>
(PT_GUEST_DIRTY_SHIFT - PT_GUEST_ACCESSED_SHIFT);
--
1.7.10.4
^ permalink raw reply related [flat|nested] 42+ messages in thread
* [PATCH v5 07/14] nEPT: Add EPT tables support to paging_tmpl.h
2013-07-31 14:48 [PATCH v5 00/14] Nested EPT Gleb Natapov
` (5 preceding siblings ...)
2013-07-31 14:48 ` [PATCH v5 06/14] nEPT: Support shadow paging for guest paging without A/D bits Gleb Natapov
@ 2013-07-31 14:48 ` Gleb Natapov
2013-08-01 7:00 ` Xiao Guangrong
2013-07-31 14:48 ` [PATCH v5 08/14] nEPT: Redefine EPT-specific link_shadow_page() Gleb Natapov
` (6 subsequent siblings)
13 siblings, 1 reply; 42+ messages in thread
From: Gleb Natapov @ 2013-07-31 14:48 UTC (permalink / raw)
To: kvm; +Cc: Xiao Guangrong, Jun Nakajima, Yang Zhang, pbonzini
From: Nadav Har'El <nyh@il.ibm.com>
This is the first patch in a series which adds nested EPT support to KVM's
nested VMX. Nested EPT means emulating EPT for an L1 guest so that L1 can use
EPT when running a nested guest L2. When L1 uses EPT, it allows the L2 guest
to set its own cr3 and take its own page faults without either of L0 or L1
getting involved. This often significantly improves L2's performance over the
previous two alternatives (shadow page tables over EPT, and shadow page
tables over shadow page tables).
This patch adds EPT support to paging_tmpl.h.
paging_tmpl.h contains the code for reading and writing page tables. The code
for 32-bit and 64-bit tables is very similar, but not identical, so
paging_tmpl.h is #include'd twice in mmu.c, once with PTTYPE=32 and once
with PTTYPE=64, and this generates the two sets of similar functions.
There are subtle but important differences between the format of EPT tables
and that of ordinary x86 64-bit page tables, so for nested EPT we need a
third set of functions to read the guest EPT table and to write the shadow
EPT table.
So this patch adds a third PTTYPE, PTTYPE_EPT, which creates functions (prefixed
with "EPT") which correctly read and write EPT tables.
Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
arch/x86/kvm/mmu.c | 5 +++++
arch/x86/kvm/paging_tmpl.h | 37 ++++++++++++++++++++++++++++++++++++-
2 files changed, 41 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 4c4274d..b5273c3 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3494,6 +3494,11 @@ static inline bool is_last_gpte(struct kvm_mmu *mmu, unsigned level, unsigned gp
return mmu->last_pte_bitmap & (1 << index);
}
+#define PTTYPE_EPT 18 /* arbitrary */
+#define PTTYPE PTTYPE_EPT
+#include "paging_tmpl.h"
+#undef PTTYPE
+
#define PTTYPE 64
#include "paging_tmpl.h"
#undef PTTYPE
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 2c2f635..762c904 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -23,6 +23,13 @@
* so the code in this file is compiled twice, once per pte size.
*/
+/*
+ * This is used to catch non optimized PT_GUEST_(DIRTY|ACCESS)_SHIFT macro
+ * uses for EPT without A/D paging type.
+ */
+extern u64 __pure __using_nonexistent_pte_bit(void)
+ __compiletime_error("wrong use of PT_GUEST_(DIRTY|ACCESS)_SHIFT");
+
#if PTTYPE == 64
#define pt_element_t u64
#define guest_walker guest_walker64
@@ -58,6 +65,21 @@
#define PT_GUEST_DIRTY_SHIFT PT_DIRTY_SHIFT
#define PT_GUEST_ACCESSED_SHIFT PT_ACCESSED_SHIFT
#define CMPXCHG cmpxchg
+#elif PTTYPE == PTTYPE_EPT
+ #define pt_element_t u64
+ #define guest_walker guest_walkerEPT
+ #define FNAME(name) ept_##name
+ #define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
+ #define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
+ #define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
+ #define PT_INDEX(addr, level) PT64_INDEX(addr, level)
+ #define PT_LEVEL_BITS PT64_LEVEL_BITS
+ #define PT_GUEST_ACCESSED_MASK 0
+ #define PT_GUEST_DIRTY_MASK 0
+ #define PT_GUEST_DIRTY_SHIFT __using_nonexistent_pte_bit()
+ #define PT_GUEST_ACCESSED_SHIFT __using_nonexistent_pte_bit()
+ #define CMPXCHG cmpxchg64
+ #define PT_MAX_FULL_LEVELS 4
#else
#error Invalid PTTYPE value
#endif
@@ -115,7 +137,11 @@ static bool FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte, int level)
static inline int FNAME(is_present_gpte)(unsigned long pte)
{
+#if PTTYPE != PTTYPE_EPT
return is_present_gpte(pte);
+#else
+ return pte & 7;
+#endif
}
static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
@@ -165,9 +191,14 @@ no_present:
static inline unsigned FNAME(gpte_access)(struct kvm_vcpu *vcpu, u64 gpte)
{
unsigned access;
-
+#if PTTYPE == PTTYPE_EPT
+ access = ((gpte & VMX_EPT_WRITABLE_MASK) ? ACC_WRITE_MASK : 0) |
+ ((gpte & VMX_EPT_EXECUTABLE_MASK) ? ACC_EXEC_MASK : 0) |
+ ACC_USER_MASK;
+#else
access = (gpte & (PT_WRITABLE_MASK | PT_USER_MASK)) | ACC_EXEC_MASK;
access &= ~(gpte >> PT64_NX_SHIFT);
+#endif
return access;
}
@@ -369,6 +400,7 @@ static int FNAME(walk_addr)(struct guest_walker *walker,
access);
}
+#if PTTYPE != PTTYPE_EPT
static int FNAME(walk_addr_nested)(struct guest_walker *walker,
struct kvm_vcpu *vcpu, gva_t addr,
u32 access)
@@ -376,6 +408,7 @@ static int FNAME(walk_addr_nested)(struct guest_walker *walker,
return FNAME(walk_addr_generic)(walker, vcpu, &vcpu->arch.nested_mmu,
addr, access);
}
+#endif
static bool
FNAME(prefetch_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
@@ -803,6 +836,7 @@ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, gva_t vaddr, u32 access,
return gpa;
}
+#if PTTYPE != PTTYPE_EPT
static gpa_t FNAME(gva_to_gpa_nested)(struct kvm_vcpu *vcpu, gva_t vaddr,
u32 access,
struct x86_exception *exception)
@@ -821,6 +855,7 @@ static gpa_t FNAME(gva_to_gpa_nested)(struct kvm_vcpu *vcpu, gva_t vaddr,
return gpa;
}
+#endif
/*
* Using the cached information from sp->gfns is safe because:
--
1.7.10.4
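A side note on the __compiletime_error() declaration in the hunk above:
PT_GUEST_(DIRTY|ACCESSED)_SHIFT expand, for the EPT instantiation, to a call
to a never-defined function, so any use of the shifts that survives dead-code
elimination breaks the build instead of miscompiling. A standalone sketch of
the same trick (names are illustrative; requires optimization so that the
constant-false call folds away):

/* Declared but never defined; gcc rejects any surviving call site. */
extern int __never_use_this(void)
	__attribute__((error("shift is meaningless for this PTTYPE")));

#define BAD_SHIFT __never_use_this()

static inline int demo(int x)
{
	if (0)			/* constant-false: the call is eliminated */
		return x >> BAD_SHIFT;
	return x;
}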
^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: [PATCH v5 07/14] nEPT: Add EPT tables support to paging_tmpl.h
2013-07-31 14:48 ` [PATCH v5 07/14] nEPT: Add EPT tables support to paging_tmpl.h Gleb Natapov
@ 2013-08-01 7:00 ` Xiao Guangrong
2013-08-01 7:10 ` Gleb Natapov
0 siblings, 1 reply; 42+ messages in thread
From: Xiao Guangrong @ 2013-08-01 7:00 UTC (permalink / raw)
To: Gleb Natapov; +Cc: kvm, Jun Nakajima, Yang Zhang, pbonzini
On 07/31/2013 10:48 PM, Gleb Natapov wrote:
> From: Nadav Har'El <nyh@il.ibm.com>
>
> This is the first patch in a series which adds nested EPT support to KVM's
> nested VMX. Nested EPT means emulating EPT for an L1 guest so that L1 can use
> EPT when running a nested guest L2. When L1 uses EPT, it allows the L2 guest
> to set its own cr3 and take its own page faults without either of L0 or L1
> getting involved. This often significantly improves L2's performance over the
> previous two alternatives (shadow page tables over EPT, and shadow page
> tables over shadow page tables).
>
> This patch adds EPT support to paging_tmpl.h.
>
> paging_tmpl.h contains the code for reading and writing page tables. The code
> for 32-bit and 64-bit tables is very similar, but not identical, so
> paging_tmpl.h is #include'd twice in mmu.c, once with PTTYPE=32 and once
> with PTTYPE=64, and this generates the two sets of similar functions.
>
> There are subtle but important differences between the format of EPT tables
> and that of ordinary x86 64-bit page tables, so for nested EPT we need a
> third set of functions to read the guest EPT table and to write the shadow
> EPT table.
>
> So this patch adds a third PTTYPE, PTTYPE_EPT, which creates functions (prefixed
> with "EPT") which correctly read and write EPT tables.
>
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
> Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
> Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
> Signed-off-by: Gleb Natapov <gleb@redhat.com>
> ---
> arch/x86/kvm/mmu.c | 5 +++++
> arch/x86/kvm/paging_tmpl.h | 37 ++++++++++++++++++++++++++++++++++++-
> 2 files changed, 41 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 4c4274d..b5273c3 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -3494,6 +3494,11 @@ static inline bool is_last_gpte(struct kvm_mmu *mmu, unsigned level, unsigned gp
> return mmu->last_pte_bitmap & (1 << index);
> }
>
> +#define PTTYPE_EPT 18 /* arbitrary */
> +#define PTTYPE PTTYPE_EPT
> +#include "paging_tmpl.h"
> +#undef PTTYPE
> +
> #define PTTYPE 64
> #include "paging_tmpl.h"
> #undef PTTYPE
> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> index 2c2f635..762c904 100644
> --- a/arch/x86/kvm/paging_tmpl.h
> +++ b/arch/x86/kvm/paging_tmpl.h
> @@ -23,6 +23,13 @@
> * so the code in this file is compiled twice, once per pte size.
> */
>
> +/*
> + * This is used to catch non optimized PT_GUEST_(DIRTY|ACCESS)_SHIFT macro
> + * uses for EPT without A/D paging type.
> + */
> +extern u64 __pure __using_nonexistent_pte_bit(void)
> + __compiletime_error("wrong use of PT_GUEST_(DIRTY|ACCESS)_SHIFT");
> +
> #if PTTYPE == 64
> #define pt_element_t u64
> #define guest_walker guest_walker64
> @@ -58,6 +65,21 @@
> #define PT_GUEST_DIRTY_SHIFT PT_DIRTY_SHIFT
> #define PT_GUEST_ACCESSED_SHIFT PT_ACCESSED_SHIFT
> #define CMPXCHG cmpxchg
> +#elif PTTYPE == PTTYPE_EPT
> + #define pt_element_t u64
> + #define guest_walker guest_walkerEPT
> + #define FNAME(name) ept_##name
> + #define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
> + #define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
> + #define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
> + #define PT_INDEX(addr, level) PT64_INDEX(addr, level)
> + #define PT_LEVEL_BITS PT64_LEVEL_BITS
> + #define PT_GUEST_ACCESSED_MASK 0
> + #define PT_GUEST_DIRTY_MASK 0
> + #define PT_GUEST_DIRTY_SHIFT __using_nonexistent_pte_bit()
> + #define PT_GUEST_ACCESSED_SHIFT __using_nonexistent_pte_bit()
> + #define CMPXCHG cmpxchg64
> + #define PT_MAX_FULL_LEVELS 4
> #else
> #error Invalid PTTYPE value
> #endif
> @@ -115,7 +137,11 @@ static bool FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte, int level)
>
> static inline int FNAME(is_present_gpte)(unsigned long pte)
> {
> +#if PTTYPE != PTTYPE_EPT
> return is_present_gpte(pte);
> +#else
> + return pte & 7;
> +#endif
> }
>
> static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> @@ -165,9 +191,14 @@ no_present:
> static inline unsigned FNAME(gpte_access)(struct kvm_vcpu *vcpu, u64 gpte)
> {
> unsigned access;
> -
> +#if PTTYPE == PTTYPE_EPT
> + access = ((gpte & VMX_EPT_WRITABLE_MASK) ? ACC_WRITE_MASK : 0) |
> + ((gpte & VMX_EPT_EXECUTABLE_MASK) ? ACC_EXEC_MASK : 0) |
> + ACC_USER_MASK;
> +#else
> access = (gpte & (PT_WRITABLE_MASK | PT_USER_MASK)) | ACC_EXEC_MASK;
> access &= ~(gpte >> PT64_NX_SHIFT);
> +#endif
Hmm, why not use shadow_x_mask and shadow_user_mask instead? PT_WRITABLE_MASK
is also suitable for EPT; I guess we can remove the "#if/#else/#endif" after
that.
>
> return access;
> }
> @@ -369,6 +400,7 @@ static int FNAME(walk_addr)(struct guest_walker *walker,
> access);
> }
>
> +#if PTTYPE != PTTYPE_EPT
> static int FNAME(walk_addr_nested)(struct guest_walker *walker,
> struct kvm_vcpu *vcpu, gva_t addr,
> u32 access)
> @@ -376,6 +408,7 @@ static int FNAME(walk_addr_nested)(struct guest_walker *walker,
> return FNAME(walk_addr_generic)(walker, vcpu, &vcpu->arch.nested_mmu,
> addr, access);
> }
> +#endif
>
> static bool
> FNAME(prefetch_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> @@ -803,6 +836,7 @@ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, gva_t vaddr, u32 access,
> return gpa;
> }
>
> +#if PTTYPE != PTTYPE_EPT
> static gpa_t FNAME(gva_to_gpa_nested)(struct kvm_vcpu *vcpu, gva_t vaddr,
> u32 access,
> struct x86_exception *exception)
> @@ -821,6 +855,7 @@ static gpa_t FNAME(gva_to_gpa_nested)(struct kvm_vcpu *vcpu, gva_t vaddr,
>
> return gpa;
> }
> +#endif
Can we move the "walk" together, then one "#if" can be dropped?
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v5 07/14] nEPT: Add EPT tables support to paging_tmpl.h
2013-08-01 7:00 ` Xiao Guangrong
@ 2013-08-01 7:10 ` Gleb Natapov
2013-08-01 7:18 ` Xiao Guangrong
0 siblings, 1 reply; 42+ messages in thread
From: Gleb Natapov @ 2013-08-01 7:10 UTC (permalink / raw)
To: Xiao Guangrong; +Cc: kvm, Jun Nakajima, Yang Zhang, pbonzini
On Thu, Aug 01, 2013 at 03:00:12PM +0800, Xiao Guangrong wrote:
> On 07/31/2013 10:48 PM, Gleb Natapov wrote:
> > From: Nadav Har'El <nyh@il.ibm.com>
> >
> > This is the first patch in a series which adds nested EPT support to KVM's
> > nested VMX. Nested EPT means emulating EPT for an L1 guest so that L1 can use
> > EPT when running a nested guest L2. When L1 uses EPT, it allows the L2 guest
> > to set its own cr3 and take its own page faults without either of L0 or L1
> > getting involved. This often significantly improves L2's performance over the
> > previous two alternatives (shadow page tables over EPT, and shadow page
> > tables over shadow page tables).
> >
> > This patch adds EPT support to paging_tmpl.h.
> >
> > paging_tmpl.h contains the code for reading and writing page tables. The code
> > for 32-bit and 64-bit tables is very similar, but not identical, so
> > paging_tmpl.h is #include'd twice in mmu.c, once with PTTYPE=32 and once
> > with PTTYPE=64, and this generates the two sets of similar functions.
> >
> > There are subtle but important differences between the format of EPT tables
> > and that of ordinary x86 64-bit page tables, so for nested EPT we need a
> > third set of functions to read the guest EPT table and to write the shadow
> > EPT table.
> >
> > So this patch adds a third PTTYPE, PTTYPE_EPT, which creates functions (prefixed
> > with "EPT") which correctly read and write EPT tables.
> >
> > Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> > Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
> > Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
> > Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
> > Signed-off-by: Gleb Natapov <gleb@redhat.com>
> > ---
> > arch/x86/kvm/mmu.c | 5 +++++
> > arch/x86/kvm/paging_tmpl.h | 37 ++++++++++++++++++++++++++++++++++++-
> > 2 files changed, 41 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> > index 4c4274d..b5273c3 100644
> > --- a/arch/x86/kvm/mmu.c
> > +++ b/arch/x86/kvm/mmu.c
> > @@ -3494,6 +3494,11 @@ static inline bool is_last_gpte(struct kvm_mmu *mmu, unsigned level, unsigned gp
> > return mmu->last_pte_bitmap & (1 << index);
> > }
> >
> > +#define PTTYPE_EPT 18 /* arbitrary */
> > +#define PTTYPE PTTYPE_EPT
> > +#include "paging_tmpl.h"
> > +#undef PTTYPE
> > +
> > #define PTTYPE 64
> > #include "paging_tmpl.h"
> > #undef PTTYPE
> > diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> > index 2c2f635..762c904 100644
> > --- a/arch/x86/kvm/paging_tmpl.h
> > +++ b/arch/x86/kvm/paging_tmpl.h
> > @@ -23,6 +23,13 @@
> > * so the code in this file is compiled twice, once per pte size.
> > */
> >
> > +/*
> > + * This is used to catch non optimized PT_GUEST_(DIRTY|ACCESS)_SHIFT macro
> > + * uses for EPT without A/D paging type.
> > + */
> > +extern u64 __pure __using_nonexistent_pte_bit(void)
> > + __compiletime_error("wrong use of PT_GUEST_(DIRTY|ACCESS)_SHIFT");
> > +
> > #if PTTYPE == 64
> > #define pt_element_t u64
> > #define guest_walker guest_walker64
> > @@ -58,6 +65,21 @@
> > #define PT_GUEST_DIRTY_SHIFT PT_DIRTY_SHIFT
> > #define PT_GUEST_ACCESSED_SHIFT PT_ACCESSED_SHIFT
> > #define CMPXCHG cmpxchg
> > +#elif PTTYPE == PTTYPE_EPT
> > + #define pt_element_t u64
> > + #define guest_walker guest_walkerEPT
> > + #define FNAME(name) ept_##name
> > + #define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
> > + #define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
> > + #define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
> > + #define PT_INDEX(addr, level) PT64_INDEX(addr, level)
> > + #define PT_LEVEL_BITS PT64_LEVEL_BITS
> > + #define PT_GUEST_ACCESSED_MASK 0
> > + #define PT_GUEST_DIRTY_MASK 0
> > + #define PT_GUEST_DIRTY_SHIFT __using_nonexistent_pte_bit()
> > + #define PT_GUEST_ACCESSED_SHIFT __using_nonexistent_pte_bit()
> > + #define CMPXCHG cmpxchg64
> > + #define PT_MAX_FULL_LEVELS 4
> > #else
> > #error Invalid PTTYPE value
> > #endif
> > @@ -115,7 +137,11 @@ static bool FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte, int level)
> >
> > static inline int FNAME(is_present_gpte)(unsigned long pte)
> > {
> > +#if PTTYPE != PTTYPE_EPT
> > return is_present_gpte(pte);
> > +#else
> > + return pte & 7;
> > +#endif
> > }
> >
> > static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> > @@ -165,9 +191,14 @@ no_present:
> > static inline unsigned FNAME(gpte_access)(struct kvm_vcpu *vcpu, u64 gpte)
> > {
> > unsigned access;
> > -
> > +#if PTTYPE == PTTYPE_EPT
> > + access = ((gpte & VMX_EPT_WRITABLE_MASK) ? ACC_WRITE_MASK : 0) |
> > + ((gpte & VMX_EPT_EXECUTABLE_MASK) ? ACC_EXEC_MASK : 0) |
> > + ACC_USER_MASK;
> > +#else
> > access = (gpte & (PT_WRITABLE_MASK | PT_USER_MASK)) | ACC_EXEC_MASK;
> > access &= ~(gpte >> PT64_NX_SHIFT);
> > +#endif
>
> Hmm, why not use shadow_x_mask and shadow_user_mask instead? PT_WRITABLE_MASK
> is also suitable for EPT; I guess we can remove the "#if/#else/#endif" after
> that.
>
shadow_x_mask and shadow_user_mask do not depend on the guest paging mode,
so they cannot be used here. Since we have to use ifdefs anyway, relying on
VMX_EPT_WRITABLE_MASK == PT_WRITABLE_MASK is not necessary; keeping the two
cases separate makes the code easier to read.
> >
> > return access;
> > }
> > @@ -369,6 +400,7 @@ static int FNAME(walk_addr)(struct guest_walker *walker,
> > access);
> > }
> >
> > +#if PTTYPE != PTTYPE_EPT
> > static int FNAME(walk_addr_nested)(struct guest_walker *walker,
> > struct kvm_vcpu *vcpu, gva_t addr,
> > u32 access)
> > @@ -376,6 +408,7 @@ static int FNAME(walk_addr_nested)(struct guest_walker *walker,
> > return FNAME(walk_addr_generic)(walker, vcpu, &vcpu->arch.nested_mmu,
> > addr, access);
> > }
> > +#endif
> >
> > static bool
> > FNAME(prefetch_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> > @@ -803,6 +836,7 @@ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, gva_t vaddr, u32 access,
> > return gpa;
> > }
> >
> > +#if PTTYPE != PTTYPE_EPT
> > static gpa_t FNAME(gva_to_gpa_nested)(struct kvm_vcpu *vcpu, gva_t vaddr,
> > u32 access,
> > struct x86_exception *exception)
> > @@ -821,6 +855,7 @@ static gpa_t FNAME(gva_to_gpa_nested)(struct kvm_vcpu *vcpu, gva_t vaddr,
> >
> > return gpa;
> > }
> > +#endif
>
> Can we move the "walk" together, then one "#if" can be dropped?
Currently walk_addr_nested() is near walk_addr() and gva_to_gpa_nested()
is near gva_to_gpa(). I like this grouping.
--
Gleb.
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v5 07/14] nEPT: Add EPT tables support to paging_tmpl.h
2013-08-01 7:10 ` Gleb Natapov
@ 2013-08-01 7:18 ` Xiao Guangrong
2013-08-01 7:31 ` Xiao Guangrong
0 siblings, 1 reply; 42+ messages in thread
From: Xiao Guangrong @ 2013-08-01 7:18 UTC (permalink / raw)
To: Gleb Natapov; +Cc: kvm, Jun Nakajima, Yang Zhang, pbonzini
On 08/01/2013 03:10 PM, Gleb Natapov wrote:
> On Thu, Aug 01, 2013 at 03:00:12PM +0800, Xiao Guangrong wrote:
>> On 07/31/2013 10:48 PM, Gleb Natapov wrote:
>>> From: Nadav Har'El <nyh@il.ibm.com>
>>>
>>> This is the first patch in a series which adds nested EPT support to KVM's
>>> nested VMX. Nested EPT means emulating EPT for an L1 guest so that L1 can use
>>> EPT when running a nested guest L2. When L1 uses EPT, it allows the L2 guest
>>> to set its own cr3 and take its own page faults without either of L0 or L1
>>> getting involved. This often significantly improves L2's performance over the
>>> previous two alternatives (shadow page tables over EPT, and shadow page
>>> tables over shadow page tables).
>>>
>>> This patch adds EPT support to paging_tmpl.h.
>>>
>>> paging_tmpl.h contains the code for reading and writing page tables. The code
>>> for 32-bit and 64-bit tables is very similar, but not identical, so
>>> paging_tmpl.h is #include'd twice in mmu.c, once with PTTYPE=32 and once
>>> with PTTYPE=64, and this generates the two sets of similar functions.
>>>
>>> There are subtle but important differences between the format of EPT tables
>>> and that of ordinary x86 64-bit page tables, so for nested EPT we need a
>>> third set of functions to read the guest EPT table and to write the shadow
>>> EPT table.
>>>
>>> So this patch adds a third PTTYPE, PTTYPE_EPT, which creates functions (prefixed
>>> with "EPT") which correctly read and write EPT tables.
>>>
>>> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
>>> Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
>>> Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
>>> Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
>>> Signed-off-by: Gleb Natapov <gleb@redhat.com>
>>> ---
>>> arch/x86/kvm/mmu.c | 5 +++++
>>> arch/x86/kvm/paging_tmpl.h | 37 ++++++++++++++++++++++++++++++++++++-
>>> 2 files changed, 41 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>>> index 4c4274d..b5273c3 100644
>>> --- a/arch/x86/kvm/mmu.c
>>> +++ b/arch/x86/kvm/mmu.c
>>> @@ -3494,6 +3494,11 @@ static inline bool is_last_gpte(struct kvm_mmu *mmu, unsigned level, unsigned gp
>>> return mmu->last_pte_bitmap & (1 << index);
>>> }
>>>
>>> +#define PTTYPE_EPT 18 /* arbitrary */
>>> +#define PTTYPE PTTYPE_EPT
>>> +#include "paging_tmpl.h"
>>> +#undef PTTYPE
>>> +
>>> #define PTTYPE 64
>>> #include "paging_tmpl.h"
>>> #undef PTTYPE
>>> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
>>> index 2c2f635..762c904 100644
>>> --- a/arch/x86/kvm/paging_tmpl.h
>>> +++ b/arch/x86/kvm/paging_tmpl.h
>>> @@ -23,6 +23,13 @@
>>> * so the code in this file is compiled twice, once per pte size.
>>> */
>>>
>>> +/*
>>> + * This is used to catch non optimized PT_GUEST_(DIRTY|ACCESS)_SHIFT macro
>>> + * uses for EPT without A/D paging type.
>>> + */
>>> +extern u64 __pure __using_nonexistent_pte_bit(void)
>>> + __compiletime_error("wrong use of PT_GUEST_(DIRTY|ACCESS)_SHIFT");
>>> +
>>> #if PTTYPE == 64
>>> #define pt_element_t u64
>>> #define guest_walker guest_walker64
>>> @@ -58,6 +65,21 @@
>>> #define PT_GUEST_DIRTY_SHIFT PT_DIRTY_SHIFT
>>> #define PT_GUEST_ACCESSED_SHIFT PT_ACCESSED_SHIFT
>>> #define CMPXCHG cmpxchg
>>> +#elif PTTYPE == PTTYPE_EPT
>>> + #define pt_element_t u64
>>> + #define guest_walker guest_walkerEPT
>>> + #define FNAME(name) ept_##name
>>> + #define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
>>> + #define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
>>> + #define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
>>> + #define PT_INDEX(addr, level) PT64_INDEX(addr, level)
>>> + #define PT_LEVEL_BITS PT64_LEVEL_BITS
>>> + #define PT_GUEST_ACCESSED_MASK 0
>>> + #define PT_GUEST_DIRTY_MASK 0
>>> + #define PT_GUEST_DIRTY_SHIFT __using_nonexistent_pte_bit()
>>> + #define PT_GUEST_ACCESSED_SHIFT __using_nonexistent_pte_bit()
>>> + #define CMPXCHG cmpxchg64
>>> + #define PT_MAX_FULL_LEVELS 4
>>> #else
>>> #error Invalid PTTYPE value
>>> #endif
>>> @@ -115,7 +137,11 @@ static bool FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte, int level)
>>>
>>> static inline int FNAME(is_present_gpte)(unsigned long pte)
>>> {
>>> +#if PTTYPE != PTTYPE_EPT
>>> return is_present_gpte(pte);
>>> +#else
>>> + return pte & 7;
>>> +#endif
>>> }
>>>
>>> static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
>>> @@ -165,9 +191,14 @@ no_present:
>>> static inline unsigned FNAME(gpte_access)(struct kvm_vcpu *vcpu, u64 gpte)
>>> {
>>> unsigned access;
>>> -
>>> +#if PTTYPE == PTTYPE_EPT
>>> + access = ((gpte & VMX_EPT_WRITABLE_MASK) ? ACC_WRITE_MASK : 0) |
>>> + ((gpte & VMX_EPT_EXECUTABLE_MASK) ? ACC_EXEC_MASK : 0) |
>>> + ACC_USER_MASK;
>>> +#else
>>> access = (gpte & (PT_WRITABLE_MASK | PT_USER_MASK)) | ACC_EXEC_MASK;
>>> access &= ~(gpte >> PT64_NX_SHIFT);
>>> +#endif
>>
>> Hmm, why not use shadow_x_mask, shadow_user_mask instead? PT_WRITABLE_MASK
>> is also suitable for ept, i guess we can remove the "#if/#else/#endif" after
>> that.
>>
> shadow_x_mask and shadow_user_mask do not depend on guest paging mode,
> so cannot be used here. Since we have to use ifdefs anyway relying on
> VMX_EPT_WRITABLE_MASK == PT_WRITABLE_MASK is not necessary. Makes code
> easier to read.
Oh, yes, you are right.
Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
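To make the difference concrete, a small worked example using the constants from the patch (illustration only, not code from the series):

/* An EPT gpte with XWR == 0b101 (readable + executable):
 *   gpte & 7                        -> nonzero, so FNAME(is_present_gpte) reports present
 *   gpte & VMX_EPT_WRITABLE_MASK    -> 0, so no ACC_WRITE_MASK
 *   gpte & VMX_EPT_EXECUTABLE_MASK  -> set, so ACC_EXEC_MASK
 * Ordinary x86-64 PTEs encode execute permission negatively via NX
 * (bit 63), which is why the two gpte_access() paths cannot share code
 * even though VMX_EPT_WRITABLE_MASK happens to equal PT_WRITABLE_MASK.
 */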
* Re: [PATCH v5 07/14] nEPT: Add EPT tables support to paging_tmpl.h
2013-08-01 7:18 ` Xiao Guangrong
@ 2013-08-01 7:31 ` Xiao Guangrong
2013-08-01 7:42 ` Gleb Natapov
0 siblings, 1 reply; 42+ messages in thread
From: Xiao Guangrong @ 2013-08-01 7:31 UTC (permalink / raw)
To: Xiao Guangrong; +Cc: Gleb Natapov, kvm, Jun Nakajima, Yang Zhang, pbonzini
On 08/01/2013 03:18 PM, Xiao Guangrong wrote:
+#endif
>>>
>>> Hmm, why not use shadow_x_mask, shadow_user_mask instead? PT_WRITABLE_MASK
>>> is also suitable for ept, i guess we can remove the "#if/#else/#endif" after
>>> that.
>>>
>> shadow_x_mask and shadow_user_mask do not depend on guest paging mode,
>> so cannot be used here. Since we have to use ifdefs anyway relying on
>> VMX_EPT_WRITABLE_MASK == PT_WRITABLE_MASK is not necessary. Makes code
>> easier to read.
>
> Oh, yes, you are right.
>
> Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
BTW, i notice the code in mmu.c uses PT64_NX_MASK to check the permission,
i.e:
static bool need_remote_flush(u64 old, u64 new)
{
if (!is_shadow_present_pte(old))
return false;
if (!is_shadow_present_pte(new))
return true;
if ((old ^ new) & PT64_BASE_ADDR_MASK)
return true;
old ^= PT64_NX_MASK;
new ^= PT64_NX_MASK;
return (old & ~new & PT64_PERM_MASK) != 0;
}
It checks the shadow page table, and the mask is the wrong one for nested EPT sptes.
* Re: [PATCH v5 07/14] nEPT: Add EPT tables support to paging_tmpl.h
2013-08-01 7:31 ` Xiao Guangrong
@ 2013-08-01 7:42 ` Gleb Natapov
2013-08-01 7:51 ` Xiao Guangrong
0 siblings, 1 reply; 42+ messages in thread
From: Gleb Natapov @ 2013-08-01 7:42 UTC (permalink / raw)
To: Xiao Guangrong; +Cc: kvm, Jun Nakajima, Yang Zhang, pbonzini
On Thu, Aug 01, 2013 at 03:31:01PM +0800, Xiao Guangrong wrote:
> On 08/01/2013 03:18 PM, Xiao Guangrong wrote:
> +#endif
> >>>
> >>> Hmm, why not use shadow_x_mask, shadow_user_mask instead? PT_WRITABLE_MASK
> >>> is also suitable for ept, i guess we can remove the "#if/#else/#endif" after
> >>> that.
> >>>
> >> shadow_x_mask and shadow_user_mask do not depend on guest paging mode,
> >> so cannot be used here. Since we have to use ifdefs anyway relying on
> >> VMX_EPT_WRITABLE_MASK == PT_WRITABLE_MASK is not necessary. Makes code
> >> easier to read.
> >
> > Oh, yes, you are right.
> >
> > Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
>
> BTW, i notice the code in mmu.c uses PT64_NX_MASK to check the permission,
> i.e:
>
> static bool need_remote_flush(u64 old, u64 new)
> {
> if (!is_shadow_present_pte(old))
> return false;
> if (!is_shadow_present_pte(new))
> return true;
> if ((old ^ new) & PT64_BASE_ADDR_MASK)
> return true;
> old ^= PT64_NX_MASK;
> new ^= PT64_NX_MASK;
> return (old & ~new & PT64_PERM_MASK) != 0;
> }
>
> It checks the shadow page table, and the mask is the wrong one for nested EPT sptes.
So shadow_x_mask need to be used here, correct?
--
Gleb.
* Re: [PATCH v5 07/14] nEPT: Add EPT tables support to paging_tmpl.h
2013-08-01 7:42 ` Gleb Natapov
@ 2013-08-01 7:51 ` Xiao Guangrong
2013-08-01 7:56 ` Gleb Natapov
2013-08-01 11:05 ` Paolo Bonzini
0 siblings, 2 replies; 42+ messages in thread
From: Xiao Guangrong @ 2013-08-01 7:51 UTC (permalink / raw)
To: Gleb Natapov; +Cc: kvm, Jun Nakajima, Yang Zhang, pbonzini
On 08/01/2013 03:42 PM, Gleb Natapov wrote:
> On Thu, Aug 01, 2013 at 03:31:01PM +0800, Xiao Guangrong wrote:
>> On 08/01/2013 03:18 PM, Xiao Guangrong wrote:
>> +#endif
>>>>>
>>>>> Hmm, why not use shadow_x_mask, shadow_user_mask instead? PT_WRITABLE_MASK
>>>>> is also suitable for ept, i guess we can remove the "#if/#else/#endif" after
>>>>> that.
>>>>>
>>>> shadow_x_mask and shadow_user_mask do not depend on guest paging mode,
>>>> so cannot be used here. Since we have to use ifdefs anyway relying on
>>>> VMX_EPT_WRITABLE_MASK == PT_WRITABLE_MASK is not necessary. Makes code
>>>> easier to read.
>>>
>>> Oh, yes, you are right.
>>>
>>> Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
>>
>> BTW, i notice the code in mmu.c uses PT64_NX_MASK to check the permission,
>> i.e:
>>
>> static bool need_remote_flush(u64 old, u64 new)
>> {
>> if (!is_shadow_present_pte(old))
>> return false;
>> if (!is_shadow_present_pte(new))
>> return true;
>> if ((old ^ new) & PT64_BASE_ADDR_MASK)
>> return true;
>> old ^= PT64_NX_MASK;
>> new ^= PT64_NX_MASK;
>> return (old & ~new & PT64_PERM_MASK) != 0;
>> }
>>
>> It checks the shadow page table, and the mask is the wrong one for nested EPT sptes.
> So shadow_x_mask need to be used here, correct?
Yes. The code checks shadow page table which does not depend on guest mode. :)
* Re: [PATCH v5 07/14] nEPT: Add EPT tables support to paging_tmpl.h
2013-08-01 7:51 ` Xiao Guangrong
@ 2013-08-01 7:56 ` Gleb Natapov
2013-08-01 11:05 ` Paolo Bonzini
1 sibling, 0 replies; 42+ messages in thread
From: Gleb Natapov @ 2013-08-01 7:56 UTC (permalink / raw)
To: Xiao Guangrong; +Cc: kvm, Jun Nakajima, Yang Zhang, pbonzini
On Thu, Aug 01, 2013 at 03:51:35PM +0800, Xiao Guangrong wrote:
> On 08/01/2013 03:42 PM, Gleb Natapov wrote:
> > On Thu, Aug 01, 2013 at 03:31:01PM +0800, Xiao Guangrong wrote:
> >> On 08/01/2013 03:18 PM, Xiao Guangrong wrote:
> >> +#endif
> >>>>>
> >>>>> Hmm, why not use shadow_x_mask, shadow_user_mask instead? PT_WRITABLE_MASK
> >>>>> is also suitable for ept, i guess we can remove the "#if/#else/#endif" after
> >>>>> that.
> >>>>>
> >>>> shadow_x_mask and shadow_user_mask do not depend on guest paging mode,
> >>>> so cannot be used here. Since we have to use ifdefs anyway relying on
> >>>> VMX_EPT_WRITABLE_MASK == PT_WRITABLE_MASK is not necessary. Makes code
> >>>> easier to read.
> >>>
> >>> Oh, yes, you are right.
> >>>
> >>> Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
> >>
> >> BTW, i notice the code in mmu.c uses PT64_NX_MASK to check the permission,
> >> i.e:
> >>
> >> static bool need_remote_flush(u64 old, u64 new)
> >> {
> >> if (!is_shadow_present_pte(old))
> >> return false;
> >> if (!is_shadow_present_pte(new))
> >> return true;
> >> if ((old ^ new) & PT64_BASE_ADDR_MASK)
> >> return true;
> >> old ^= PT64_NX_MASK;
> >> new ^= PT64_NX_MASK;
> >> return (old & ~new & PT64_PERM_MASK) != 0;
> >> }
> >>
> >> It checks the shadow page table, and the mask is the wrong one for nested EPT sptes.
> > So shadow_x_mask need to be used here, correct?
>
> Yes. The code checks shadow page table which does not depend on guest mode. :)
>
Yes, good catch. It would have been hard to find when hit in practice.
Will add a separate patch to fix that after you review the rest of the
series :)
--
Gleb.
* Re: [PATCH v5 07/14] nEPT: Add EPT tables support to paging_tmpl.h
2013-08-01 7:51 ` Xiao Guangrong
2013-08-01 7:56 ` Gleb Natapov
@ 2013-08-01 11:05 ` Paolo Bonzini
2013-08-01 11:07 ` Gleb Natapov
1 sibling, 1 reply; 42+ messages in thread
From: Paolo Bonzini @ 2013-08-01 11:05 UTC (permalink / raw)
To: Xiao Guangrong; +Cc: Gleb Natapov, kvm, Jun Nakajima, Yang Zhang
On Aug 01 2013, Xiao Guangrong wrote:
> On 08/01/2013 03:42 PM, Gleb Natapov wrote:
> > On Thu, Aug 01, 2013 at 03:31:01PM +0800, Xiao Guangrong wrote:
> >> On 08/01/2013 03:18 PM, Xiao Guangrong wrote:
> >> +#endif
> >>>>>
> >>>>> Hmm, why not use shadow_x_mask, shadow_user_mask instead? PT_WRITABLE_MASK
> >>>>> is also suitable for ept, i guess we can remove the "#if/#else/#endif" after
> >>>>> that.
> >>>>>
> >>>> shadow_x_mask and shadow_user_mask do not depend on guest paging mode,
> >>>> so cannot be used here. Since we have to use ifdefs anyway relying on
> >>>> VMX_EPT_WRITABLE_MASK == PT_WRITABLE_MASK is not necessary. Makes code
> >>>> easier to read.
> >>>
> >>> Oh, yes, you are right.
> >>>
> >>> Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
> >>
> >> BTW, i notice the code in mmu.c uses PT64_NX_MASK to check the permission,
> >> i.e:
> >>
> >> static bool need_remote_flush(u64 old, u64 new)
> >> {
> >> if (!is_shadow_present_pte(old))
> >> return false;
> >> if (!is_shadow_present_pte(new))
> >> return true;
> >> if ((old ^ new) & PT64_BASE_ADDR_MASK)
> >> return true;
> >> old ^= PT64_NX_MASK;
> >> new ^= PT64_NX_MASK;
> >> return (old & ~new & PT64_PERM_MASK) != 0;
> >> }
> >>
> >> It checks the shadow page table, and the mask is the wrong one for nested EPT sptes.
> > So shadow_x_mask need to be used here, correct?
>
> Yes. The code checks shadow page table which does not depend on guest mode. :)
The XOR should be with shadow_nx_mask, no? And PT64_PERM_MASK
should include both shadow_x_mask and shadow_nx_mask, I think.
Paolo
* Re: [PATCH v5 07/14] nEPT: Add EPT tables support to paging_tmpl.h
2013-08-01 11:05 ` Paolo Bonzini
@ 2013-08-01 11:07 ` Gleb Natapov
0 siblings, 0 replies; 42+ messages in thread
From: Gleb Natapov @ 2013-08-01 11:07 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: Xiao Guangrong, kvm, Jun Nakajima, Yang Zhang
On Thu, Aug 01, 2013 at 01:05:47PM +0200, Paolo Bonzini wrote:
> On Aug 01 2013, Xiao Guangrong wrote:
> > On 08/01/2013 03:42 PM, Gleb Natapov wrote:
> > > On Thu, Aug 01, 2013 at 03:31:01PM +0800, Xiao Guangrong wrote:
> > >> On 08/01/2013 03:18 PM, Xiao Guangrong wrote:
> > >> +#endif
> > >>>>>
> > >>>>> Hmm, why not use shadow_x_mask, shadow_user_mask instead? PT_WRITABLE_MASK
> > >>>>> is also suitable for ept, i guess we can remove the "#if/#else/#endif" after
> > >>>>> that.
> > >>>>>
> > >>>> shadow_x_mask and shadow_user_mask do not depend on guest paging mode,
> > >>>> so cannot be used here. Since we have to use ifdefs anyway relying on
> > >>>> VMX_EPT_WRITABLE_MASK == PT_WRITABLE_MASK is not necessary. Makes code
> > >>>> easier to read.
> > >>>
> > >>> Oh, yes, you are right.
> > >>>
> > >>> Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
> > >>
> > >> BTW, i notice the code in mmu.c uses PT64_NX_MASK to check the permission,
> > >> i.e:
> > >>
> > >> static bool need_remote_flush(u64 old, u64 new)
> > >> {
> > >> if (!is_shadow_present_pte(old))
> > >> return false;
> > >> if (!is_shadow_present_pte(new))
> > >> return true;
> > >> if ((old ^ new) & PT64_BASE_ADDR_MASK)
> > >> return true;
> > >> old ^= PT64_NX_MASK;
> > >> new ^= PT64_NX_MASK;
> > >> return (old & ~new & PT64_PERM_MASK) != 0;
> > >> }
> > >>
> > >> It checks the shadow page table, and the mask is the wrong one for nested EPT sptes.
> > > So shadow_x_mask need to be used here, correct?
> >
> > Yes. The code checks shadow page table which does not depend on guest mode. :)
>
> The XOR should be with shadow_nx_mask, no? And PT64_PERM_MASK
> should include both shadow_x_mask and shadow_nx_mask, I think.
>
Yes :) That what I did eventually.
--
Gleb.
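For reference, a minimal sketch of the fix agreed in this subthread, assuming PT64_PERM_MASK is extended to cover both shadow_nx_mask and shadow_x_mask (the patch that eventually landed may differ in detail):

static bool need_remote_flush(u64 old, u64 new)
{
	if (!is_shadow_present_pte(old))
		return false;
	if (!is_shadow_present_pte(new))
		return true;
	if ((old ^ new) & PT64_BASE_ADDR_MASK)
		return true;
	old ^= shadow_nx_mask;	/* normalize the negated execute bit */
	new ^= shadow_nx_mask;
	return (old & ~new & PT64_PERM_MASK) != 0;
}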
* [PATCH v5 08/14] nEPT: Redefine EPT-specific link_shadow_page()
2013-07-31 14:48 [PATCH v5 00/14] Nested EPT Gleb Natapov
` (6 preceding siblings ...)
2013-07-31 14:48 ` [PATCH v5 07/14] nEPT: Add EPT tables support to paging_tmpl.h Gleb Natapov
@ 2013-07-31 14:48 ` Gleb Natapov
2013-08-01 7:24 ` Xiao Guangrong
2013-07-31 14:48 ` [PATCH v5 09/14] nEPT: Nested INVEPT Gleb Natapov
` (5 subsequent siblings)
13 siblings, 1 reply; 42+ messages in thread
From: Gleb Natapov @ 2013-07-31 14:48 UTC (permalink / raw)
To: kvm; +Cc: Xiao Guangrong, Jun Nakajima, Yang Zhang, pbonzini
From: Yang Zhang <yang.z.zhang@Intel.com>
Since nEPT doesn't support A/D bits, we should not set those bits
when building the shadow page table.
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
arch/x86/kvm/mmu.c | 12 +++++++++---
arch/x86/kvm/paging_tmpl.h | 4 ++--
2 files changed, 11 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index b5273c3..9e0f467 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2047,12 +2047,18 @@ static void shadow_walk_next(struct kvm_shadow_walk_iterator *iterator)
return __shadow_walk_next(iterator, *iterator->sptep);
}
-static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp)
+static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp, bool accessed)
{
u64 spte;
+ BUILD_BUG_ON(VMX_EPT_READABLE_MASK != PT_PRESENT_MASK ||
+ VMX_EPT_WRITABLE_MASK != PT_WRITABLE_MASK);
+
spte = __pa(sp->spt) | PT_PRESENT_MASK | PT_WRITABLE_MASK |
- shadow_user_mask | shadow_x_mask | shadow_accessed_mask;
+ shadow_user_mask | shadow_x_mask;
+
+ if (accessed)
+ spte |= shadow_accessed_mask;
mmu_spte_set(sptep, spte);
}
@@ -2677,7 +2683,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
iterator.level - 1,
1, ACC_ALL, iterator.sptep);
- link_shadow_page(iterator.sptep, sp);
+ link_shadow_page(iterator.sptep, sp, true);
}
}
return emulate;
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 762c904..0d25351 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -555,7 +555,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
goto out_gpte_changed;
if (sp)
- link_shadow_page(it.sptep, sp);
+ link_shadow_page(it.sptep, sp, PTTYPE != PTTYPE_EPT);
}
for (;
@@ -575,7 +575,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
sp = kvm_mmu_get_page(vcpu, direct_gfn, addr, it.level-1,
true, direct_access, it.sptep);
- link_shadow_page(it.sptep, sp);
+ link_shadow_page(it.sptep, sp, PTTYPE != PTTYPE_EPT);
}
clear_sp_write_flooding_count(it.sptep);
--
1.7.10.4
* Re: [PATCH v5 08/14] nEPT: Redefine EPT-specific link_shadow_page()
2013-07-31 14:48 ` [PATCH v5 08/14] nEPT: Redefine EPT-specific link_shadow_page() Gleb Natapov
@ 2013-08-01 7:24 ` Xiao Guangrong
2013-08-01 7:27 ` Gleb Natapov
0 siblings, 1 reply; 42+ messages in thread
From: Xiao Guangrong @ 2013-08-01 7:24 UTC (permalink / raw)
To: Gleb Natapov; +Cc: kvm, Jun Nakajima, Yang Zhang, pbonzini
On 07/31/2013 10:48 PM, Gleb Natapov wrote:
> From: Yang Zhang <yang.z.zhang@Intel.com>
> }
> }
> return emulate;
> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> index 762c904..0d25351 100644
> --- a/arch/x86/kvm/paging_tmpl.h
> +++ b/arch/x86/kvm/paging_tmpl.h
> @@ -555,7 +555,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
> goto out_gpte_changed;
>
> if (sp)
> - link_shadow_page(it.sptep, sp);
> + link_shadow_page(it.sptep, sp, PTTYPE != PTTYPE_EPT);
It is better to use "!!PT_GUEST_ACCESSED_MASK" instead? It will be easier when
we export A/D to the guest in the future.
Others look good to me.
Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
* Re: [PATCH v5 08/14] nEPT: Redefine EPT-specific link_shadow_page()
2013-08-01 7:24 ` Xiao Guangrong
@ 2013-08-01 7:27 ` Gleb Natapov
0 siblings, 0 replies; 42+ messages in thread
From: Gleb Natapov @ 2013-08-01 7:27 UTC (permalink / raw)
To: Xiao Guangrong; +Cc: kvm, Jun Nakajima, Yang Zhang, pbonzini
On Thu, Aug 01, 2013 at 03:24:06PM +0800, Xiao Guangrong wrote:
> On 07/31/2013 10:48 PM, Gleb Natapov wrote:
> > From: Yang Zhang <yang.z.zhang@Intel.com>
>
> > }
> > }
> > return emulate;
> > diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> > index 762c904..0d25351 100644
> > --- a/arch/x86/kvm/paging_tmpl.h
> > +++ b/arch/x86/kvm/paging_tmpl.h
> > @@ -555,7 +555,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
> > goto out_gpte_changed;
> >
> > if (sp)
> > - link_shadow_page(it.sptep, sp);
> > + link_shadow_page(it.sptep, sp, PTTYPE != PTTYPE_EPT);
>
> It is better to use "!!PT_GUEST_ACCESSED_MASK" instead? It will be easier when
> we export A/D to the guest in the future.
>
Yes, this is indeed better. It shows more clearly what is going on here. Will
change and add your Reviewed-by.
> Others look good to me.
>
> Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
--
Gleb.
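A sketch of the call sites after the change agreed above (illustrative; since patch 07 defines PT_GUEST_ACCESSED_MASK as 0 for PTTYPE_EPT, EPT shadow pages are still linked without the accessed bit):

	if (sp)
		link_shadow_page(it.sptep, sp, !!PT_GUEST_ACCESSED_MASK);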
* [PATCH v5 09/14] nEPT: Nested INVEPT
2013-07-31 14:48 [PATCH v5 00/14] Nested EPT Gleb Natapov
` (7 preceding siblings ...)
2013-07-31 14:48 ` [PATCH v5 08/14] nEPT: Redefine EPT-specific link_shadow_page() Gleb Natapov
@ 2013-07-31 14:48 ` Gleb Natapov
2013-07-31 14:48 ` [PATCH v5 10/14] nEPT: Add nEPT violation/misconfiguration support Gleb Natapov
` (4 subsequent siblings)
13 siblings, 0 replies; 42+ messages in thread
From: Gleb Natapov @ 2013-07-31 14:48 UTC (permalink / raw)
To: kvm; +Cc: Xiao Guangrong, Jun Nakajima, Yang Zhang, pbonzini
From: Nadav Har'El <nyh@il.ibm.com>
If we let L1 use EPT, we should probably also support the INVEPT instruction.
In our current nested EPT implementation, when L1 changes its EPT table
for L2 (i.e., EPT12), L0 modifies the shadow EPT table (EPT02), and in
the course of this modification already calls INVEPT. But if the last level
of the shadow page is unsync, not all of L1's changes to EPT12 are intercepted,
which means roots need to be synced when L1 calls INVEPT. Global INVEPT
should not be different since roots are synced by kvm_mmu_load() each
time EPTP02 changes.
Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
arch/x86/include/asm/vmx.h | 2 ++
arch/x86/include/uapi/asm/vmx.h | 1 +
arch/x86/kvm/mmu.c | 2 ++
arch/x86/kvm/vmx.c | 67 +++++++++++++++++++++++++++++++++++++++
4 files changed, 72 insertions(+)
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index f3e01a2..966502d 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -387,6 +387,7 @@ enum vmcs_field {
#define VMX_EPT_EXTENT_INDIVIDUAL_ADDR 0
#define VMX_EPT_EXTENT_CONTEXT 1
#define VMX_EPT_EXTENT_GLOBAL 2
+#define VMX_EPT_EXTENT_SHIFT 24
#define VMX_EPT_EXECUTE_ONLY_BIT (1ull)
#define VMX_EPT_PAGE_WALK_4_BIT (1ull << 6)
@@ -394,6 +395,7 @@ enum vmcs_field {
#define VMX_EPTP_WB_BIT (1ull << 14)
#define VMX_EPT_2MB_PAGE_BIT (1ull << 16)
#define VMX_EPT_1GB_PAGE_BIT (1ull << 17)
+#define VMX_EPT_INVEPT_BIT (1ull << 20)
#define VMX_EPT_AD_BIT (1ull << 21)
#define VMX_EPT_EXTENT_CONTEXT_BIT (1ull << 25)
#define VMX_EPT_EXTENT_GLOBAL_BIT (1ull << 26)
diff --git a/arch/x86/include/uapi/asm/vmx.h b/arch/x86/include/uapi/asm/vmx.h
index d651082..7a34e8f 100644
--- a/arch/x86/include/uapi/asm/vmx.h
+++ b/arch/x86/include/uapi/asm/vmx.h
@@ -65,6 +65,7 @@
#define EXIT_REASON_EOI_INDUCED 45
#define EXIT_REASON_EPT_VIOLATION 48
#define EXIT_REASON_EPT_MISCONFIG 49
+#define EXIT_REASON_INVEPT 50
#define EXIT_REASON_PREEMPTION_TIMER 52
#define EXIT_REASON_WBINVD 54
#define EXIT_REASON_XSETBV 55
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 9e0f467..3df3ac3 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3182,6 +3182,7 @@ void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu)
mmu_sync_roots(vcpu);
spin_unlock(&vcpu->kvm->mmu_lock);
}
+EXPORT_SYMBOL_GPL(kvm_mmu_sync_roots);
static gpa_t nonpaging_gva_to_gpa(struct kvm_vcpu *vcpu, gva_t vaddr,
u32 access, struct x86_exception *exception)
@@ -3451,6 +3452,7 @@ void kvm_mmu_flush_tlb(struct kvm_vcpu *vcpu)
++vcpu->stat.tlb_flush;
kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
}
+EXPORT_SYMBOL_GPL(kvm_mmu_flush_tlb);
static void paging_new_cr3(struct kvm_vcpu *vcpu)
{
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 4e98764..a4d9385 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2156,6 +2156,7 @@ static u32 nested_vmx_pinbased_ctls_low, nested_vmx_pinbased_ctls_high;
static u32 nested_vmx_exit_ctls_low, nested_vmx_exit_ctls_high;
static u32 nested_vmx_entry_ctls_low, nested_vmx_entry_ctls_high;
static u32 nested_vmx_misc_low, nested_vmx_misc_high;
+static u32 nested_vmx_ept_caps;
static __init void nested_vmx_setup_ctls_msrs(void)
{
/*
@@ -6270,6 +6271,70 @@ static int handle_vmptrst(struct kvm_vcpu *vcpu)
return 1;
}
+/* Emulate the INVEPT instruction */
+static int handle_invept(struct kvm_vcpu *vcpu)
+{
+ u32 vmx_instruction_info;
+ bool ok;
+ unsigned long type;
+ gva_t gva;
+ struct x86_exception e;
+ struct {
+ u64 eptp, gpa;
+ } operand;
+
+ if (!(nested_vmx_secondary_ctls_high & SECONDARY_EXEC_ENABLE_EPT) ||
+ !(nested_vmx_ept_caps & VMX_EPT_INVEPT_BIT)) {
+ kvm_queue_exception(vcpu, UD_VECTOR);
+ return 1;
+ }
+
+ if (!nested_vmx_check_permission(vcpu))
+ return 1;
+
+ if (!kvm_read_cr0_bits(vcpu, X86_CR0_PE)) {
+ kvm_queue_exception(vcpu, UD_VECTOR);
+ return 1;
+ }
+
+ /* According to the Intel VMX instruction reference, the memory
+ * operand is read even if it isn't needed (e.g., for type==global)
+ */
+ vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
+ if (get_vmx_mem_address(vcpu, vmcs_readl(EXIT_QUALIFICATION),
+ vmx_instruction_info, &gva))
+ return 1;
+ if (kvm_read_guest_virt(&vcpu->arch.emulate_ctxt, gva, &operand,
+ sizeof(operand), &e)) {
+ kvm_inject_page_fault(vcpu, &e);
+ return 1;
+ }
+
+ type = kvm_register_read(vcpu, (vmx_instruction_info >> 28) & 0xf);
+
+ switch (type) {
+ case VMX_EPT_EXTENT_GLOBAL:
+ case VMX_EPT_EXTENT_CONTEXT:
+ ok = !!(nested_vmx_ept_caps &
+ (1UL << (type + VMX_EPT_EXTENT_SHIFT)));
+ break;
+ default:
+ ok = false;
+ }
+
+ if (ok) {
+ kvm_mmu_sync_roots(vcpu);
+ kvm_mmu_flush_tlb(vcpu);
+ nested_vmx_succeed(vcpu);
+ }
+ else
+ nested_vmx_failValid(vcpu,
+ VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
+
+ skip_emulated_instruction(vcpu);
+ return 1;
+}
+
/*
* The exit handlers return 1 if the exit was handled fully and guest execution
* may resume. Otherwise they set the kvm_run parameter to indicate what needs
@@ -6314,6 +6379,7 @@ static int (*const kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
[EXIT_REASON_PAUSE_INSTRUCTION] = handle_pause,
[EXIT_REASON_MWAIT_INSTRUCTION] = handle_invalid_op,
[EXIT_REASON_MONITOR_INSTRUCTION] = handle_invalid_op,
+ [EXIT_REASON_INVEPT] = handle_invept,
};
static const int kvm_vmx_max_exit_handlers =
@@ -6540,6 +6606,7 @@ static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu)
case EXIT_REASON_VMPTRST: case EXIT_REASON_VMREAD:
case EXIT_REASON_VMRESUME: case EXIT_REASON_VMWRITE:
case EXIT_REASON_VMOFF: case EXIT_REASON_VMON:
+ case EXIT_REASON_INVEPT:
/*
* VMX instructions trap unconditionally. This allows L1 to
* emulate them for its L2 guest, i.e., allows 3-level nesting!
--
1.7.10.4
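A quick worked example of the extent-type capability check in handle_invept() above, using the constants added by this patch:

/* type == VMX_EPT_EXTENT_GLOBAL (2):
 *   1UL << (2 + VMX_EPT_EXTENT_SHIFT) == 1UL << 26 == VMX_EPT_EXTENT_GLOBAL_BIT
 * type == VMX_EPT_EXTENT_CONTEXT (1):
 *   1UL << (1 + VMX_EPT_EXTENT_SHIFT) == 1UL << 25 == VMX_EPT_EXTENT_CONTEXT_BIT
 * i.e. adding VMX_EPT_EXTENT_SHIFT (24) to the extent type indexes
 * directly into the supported-extent bits of IA32_VMX_EPT_VPID_CAP.
 */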
* [PATCH v5 10/14] nEPT: Add nEPT violation/misconfiguration support
2013-07-31 14:48 [PATCH v5 00/14] Nested EPT Gleb Natapov
` (8 preceding siblings ...)
2013-07-31 14:48 ` [PATCH v5 09/14] nEPT: Nested INVEPT Gleb Natapov
@ 2013-07-31 14:48 ` Gleb Natapov
2013-08-01 8:31 ` Xiao Guangrong
2013-07-31 14:48 ` [PATCH v5 11/14] nEPT: MMU context for nested EPT Gleb Natapov
` (3 subsequent siblings)
13 siblings, 1 reply; 42+ messages in thread
From: Gleb Natapov @ 2013-07-31 14:48 UTC (permalink / raw)
To: kvm; +Cc: Xiao Guangrong, Jun Nakajima, Yang Zhang, pbonzini
From: Yang Zhang <yang.z.zhang@Intel.com>
Inject nEPT faults into the L1 guest. This patch is originally from Xinhao.
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
arch/x86/include/asm/kvm_host.h | 4 ++++
arch/x86/kvm/mmu.c | 34 ++++++++++++++++++++++++++++++++++
arch/x86/kvm/paging_tmpl.h | 28 ++++++++++++++++++++++++----
arch/x86/kvm/vmx.c | 17 +++++++++++++++++
4 files changed, 79 insertions(+), 4 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 531f47c..58a17c0 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -286,6 +286,7 @@ struct kvm_mmu {
u64 *pae_root;
u64 *lm_root;
u64 rsvd_bits_mask[2][4];
+ u64 bad_mt_xwr;
/*
* Bitmap: bit set = last pte in walk
@@ -512,6 +513,9 @@ struct kvm_vcpu_arch {
* instruction.
*/
bool write_fault_to_shadow_pgtable;
+
+ /* set at EPT violation at this point */
+ unsigned long exit_qualification;
};
struct kvm_lpage_info {
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 3df3ac3..58ae9db 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3521,6 +3521,8 @@ static void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu,
int maxphyaddr = cpuid_maxphyaddr(vcpu);
u64 exb_bit_rsvd = 0;
+ context->bad_mt_xwr = 0;
+
if (!context->nx)
exb_bit_rsvd = rsvd_bits(63, 63);
switch (context->root_level) {
@@ -3576,6 +3578,38 @@ static void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu,
}
}
+static void reset_rsvds_bits_mask_ept(struct kvm_vcpu *vcpu,
+ struct kvm_mmu *context, bool execonly)
+{
+ int maxphyaddr = cpuid_maxphyaddr(vcpu);
+ int pte;
+
+ context->rsvd_bits_mask[0][3] =
+ rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 7);
+ context->rsvd_bits_mask[0][2] =
+ rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 6);
+ context->rsvd_bits_mask[0][1] =
+ rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 6);
+ context->rsvd_bits_mask[0][0] = rsvd_bits(maxphyaddr, 51);
+
+ /* large page */
+ context->rsvd_bits_mask[1][3] = context->rsvd_bits_mask[0][3];
+ context->rsvd_bits_mask[1][2] =
+ rsvd_bits(maxphyaddr, 51) | rsvd_bits(12, 29);
+ context->rsvd_bits_mask[1][1] =
+ rsvd_bits(maxphyaddr, 51) | rsvd_bits(12, 20);
+ context->rsvd_bits_mask[1][0] = context->rsvd_bits_mask[0][0];
+
+ for (pte = 0; pte < 64; pte++) {
+ int rwx_bits = pte & 7;
+ int mt = pte >> 3;
+ if (mt == 0x2 || mt == 0x3 || mt == 0x7 ||
+ rwx_bits == 0x2 || rwx_bits == 0x6 ||
+ (rwx_bits == 0x4 && !execonly))
+ context->bad_mt_xwr |= (1ull << pte);
+ }
+}
+
static void update_permission_bitmask(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
{
unsigned bit, byte, pfec;
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 0d25351..ed6773e 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -127,12 +127,13 @@ static inline void FNAME(protect_clean_gpte)(unsigned *access, unsigned gpte)
*access &= mask;
}
-static bool FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte, int level)
+static bool inline FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte,
+ int level)
{
- int bit7;
+ int bit7 = (gpte >> 7) & 1, low5 = gpte & 0x3f;
- bit7 = (gpte >> 7) & 1;
- return (gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0;
+ return ((gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0) |
+ ((mmu->bad_mt_xwr & (1ull << low5)) != 0);
}
static inline int FNAME(is_present_gpte)(unsigned long pte)
@@ -386,6 +387,25 @@ error:
walker->fault.vector = PF_VECTOR;
walker->fault.error_code_valid = true;
walker->fault.error_code = errcode;
+
+#if PTTYPE == PTTYPE_EPT
+ /*
+ * Use PFERR_RSVD_MASK in error_code to tell if an EPT
+ * misconfiguration needs to be injected. The detection is
+ * done by is_rsvd_bits_set() above.
+ *
+ * We set up the value of exit_qualification to inject:
+ * [2:0] - Derive from [2:0] of real exit_qualification at EPT violation
+ * [5:3] - Calculated by the page walk of the guest EPT page tables
+ * [7:8] - Clear to 0.
+ *
+ * The other bits are set to 0.
+ */
+ if (!(errcode & PFERR_RSVD_MASK)) {
+ vcpu->arch.exit_qualification &= 0x7;
+ vcpu->arch.exit_qualification |= ((pt_access & pte) & 0x7) << 3;
+ }
+#endif
walker->fault.address = addr;
walker->fault.nested_page_fault = mmu != vcpu->arch.walk_mmu;
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index a4d9385..f3514d7 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5321,6 +5321,8 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
/* ept page table is present? */
error_code |= (exit_qualification >> 3) & 0x1;
+ vcpu->arch.exit_qualification = exit_qualification;
+
return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
}
@@ -7415,6 +7417,21 @@ static void vmx_set_supported_cpuid(u32 func, struct kvm_cpuid_entry2 *entry)
entry->ecx |= bit(X86_FEATURE_VMX);
}
+static void nested_ept_inject_page_fault(struct kvm_vcpu *vcpu,
+ struct x86_exception *fault)
+{
+ struct vmcs12 *vmcs12;
+ nested_vmx_vmexit(vcpu);
+ vmcs12 = get_vmcs12(vcpu);
+
+ if (fault->error_code & PFERR_RSVD_MASK)
+ vmcs12->vm_exit_reason = EXIT_REASON_EPT_MISCONFIG;
+ else
+ vmcs12->vm_exit_reason = EXIT_REASON_EPT_VIOLATION;
+ vmcs12->exit_qualification = vcpu->arch.exit_qualification;
+ vmcs12->guest_physical_address = fault->address;
+}
+
/*
* prepare_vmcs02 is called when the L1 guest hypervisor runs its nested
* L2 guest. L1 has a vmcs for L2 (vmcs12), and this function "merges" it
--
1.7.10.4
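To illustrate how the bad_mt_xwr bitmap built by reset_rsvds_bits_mask_ept() is consulted, a small sketch (the helper name is hypothetical; in the patch the check lives inside FNAME(is_rsvd_bits_set)):

/* The low 6 bits of an EPT PTE are (MT << 3) | XWR. Example: a
 * write-only leaf with write-back memory type has MT == 6 and
 * XWR == 0x2, so its index is (6 << 3) | 2 == 50; the loop above
 * sets bit 50 because rwx_bits == 0x2 is always a misconfiguration.
 */
static bool ept_gpte_misconfigured(struct kvm_mmu *mmu, u64 gpte)
{
	return (mmu->bad_mt_xwr & (1ull << (gpte & 0x3f))) != 0;
}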
* Re: [PATCH v5 10/14] nEPT: Add nEPT violation/misconfiguration support
2013-07-31 14:48 ` [PATCH v5 10/14] nEPT: Add nEPT violation/misconfiguration support Gleb Natapov
@ 2013-08-01 8:31 ` Xiao Guangrong
2013-08-01 8:45 ` Gleb Natapov
0 siblings, 1 reply; 42+ messages in thread
From: Xiao Guangrong @ 2013-08-01 8:31 UTC (permalink / raw)
To: Gleb Natapov; +Cc: kvm, Jun Nakajima, Yang Zhang, pbonzini
On 07/31/2013 10:48 PM, Gleb Natapov wrote:
> From: Yang Zhang <yang.z.zhang@Intel.com>
>
> Inject nEPT faults into the L1 guest. This patch is originally from Xinhao.
>
> Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
> Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
> Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
> Signed-off-by: Gleb Natapov <gleb@redhat.com>
> ---
> arch/x86/include/asm/kvm_host.h | 4 ++++
> arch/x86/kvm/mmu.c | 34 ++++++++++++++++++++++++++++++++++
> arch/x86/kvm/paging_tmpl.h | 28 ++++++++++++++++++++++++----
> arch/x86/kvm/vmx.c | 17 +++++++++++++++++
> 4 files changed, 79 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 531f47c..58a17c0 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -286,6 +286,7 @@ struct kvm_mmu {
> u64 *pae_root;
> u64 *lm_root;
> u64 rsvd_bits_mask[2][4];
> + u64 bad_mt_xwr;
>
> /*
> * Bitmap: bit set = last pte in walk
> @@ -512,6 +513,9 @@ struct kvm_vcpu_arch {
> * instruction.
> */
> bool write_fault_to_shadow_pgtable;
> +
> + /* set at EPT violation at this point */
> + unsigned long exit_qualification;
> };
>
> struct kvm_lpage_info {
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 3df3ac3..58ae9db 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -3521,6 +3521,8 @@ static void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu,
> int maxphyaddr = cpuid_maxphyaddr(vcpu);
> u64 exb_bit_rsvd = 0;
>
> + context->bad_mt_xwr = 0;
> +
> if (!context->nx)
> exb_bit_rsvd = rsvd_bits(63, 63);
> switch (context->root_level) {
> @@ -3576,6 +3578,38 @@ static void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu,
> }
> }
>
> +static void reset_rsvds_bits_mask_ept(struct kvm_vcpu *vcpu,
> + struct kvm_mmu *context, bool execonly)
> +{
> + int maxphyaddr = cpuid_maxphyaddr(vcpu);
> + int pte;
> +
> + context->rsvd_bits_mask[0][3] =
> + rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 7);
> + context->rsvd_bits_mask[0][2] =
> + rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 6);
> + context->rsvd_bits_mask[0][1] =
> + rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 6);
> + context->rsvd_bits_mask[0][0] = rsvd_bits(maxphyaddr, 51);
> +
> + /* large page */
> + context->rsvd_bits_mask[1][3] = context->rsvd_bits_mask[0][3];
> + context->rsvd_bits_mask[1][2] =
> + rsvd_bits(maxphyaddr, 51) | rsvd_bits(12, 29);
> + context->rsvd_bits_mask[1][1] =
> + rsvd_bits(maxphyaddr, 51) | rsvd_bits(12, 20);
> + context->rsvd_bits_mask[1][0] = context->rsvd_bits_mask[0][0];
> +
> + for (pte = 0; pte < 64; pte++) {
> + int rwx_bits = pte & 7;
> + int mt = pte >> 3;
> + if (mt == 0x2 || mt == 0x3 || mt == 0x7 ||
> + rwx_bits == 0x2 || rwx_bits == 0x6 ||
> + (rwx_bits == 0x4 && !execonly))
> + context->bad_mt_xwr |= (1ull << pte);
> + }
> +}
> +
> static void update_permission_bitmask(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
> {
> unsigned bit, byte, pfec;
> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> index 0d25351..ed6773e 100644
> --- a/arch/x86/kvm/paging_tmpl.h
> +++ b/arch/x86/kvm/paging_tmpl.h
> @@ -127,12 +127,13 @@ static inline void FNAME(protect_clean_gpte)(unsigned *access, unsigned gpte)
> *access &= mask;
> }
>
> -static bool FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte, int level)
> +static bool inline FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte,
> + int level)
Not sure this explicit "inline" is needed... Gcc always inlines the small and
static functions.
> {
> - int bit7;
> + int bit7 = (gpte >> 7) & 1, low5 = gpte & 0x3f;
>
> - bit7 = (gpte >> 7) & 1;
> - return (gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0;
> + return ((gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0) |
> + ((mmu->bad_mt_xwr & (1ull << low5)) != 0);
> }
>
> static inline int FNAME(is_present_gpte)(unsigned long pte)
> @@ -386,6 +387,25 @@ error:
> walker->fault.vector = PF_VECTOR;
> walker->fault.error_code_valid = true;
> walker->fault.error_code = errcode;
> +
> +#if PTTYPE == PTTYPE_EPT
> + /*
> + * Use PFERR_RSVD_MASK in error_code to to tell if EPT
> + * misconfiguration requires to be injected. The detection is
> + * done by is_rsvd_bits_set() above.
> + *
> + * We set up the value of exit_qualification to inject:
> + * [2:0] - Derive from [2:0] of real exit_qualification at EPT violation
> + * [5:3] - Calculated by the page walk of the guest EPT page tables
> + * [7:8] - Clear to 0.
I do not know why bits 7 and 8 are always cleared, especially since bit 7 is
always set in the normal case. The SDM says this about bit 7:
The guest linear-address field is valid for all EPT violations except those
resulting from an attempt to load the guest PDPTEs as part of the execution of
the MOV CR instruction.
So L0 always tells L1 that the fault is caused by "the guest PDPTEs as part of
the execution of the MOV CR instruction".
* Re: [PATCH v5 10/14] nEPT: Add nEPT violation/misconfiguration support
2013-08-01 8:31 ` Xiao Guangrong
@ 2013-08-01 8:45 ` Gleb Natapov
2013-08-01 11:19 ` Paolo Bonzini
0 siblings, 1 reply; 42+ messages in thread
From: Gleb Natapov @ 2013-08-01 8:45 UTC (permalink / raw)
To: Xiao Guangrong; +Cc: kvm, Jun Nakajima, Yang Zhang, pbonzini
On Thu, Aug 01, 2013 at 04:31:31PM +0800, Xiao Guangrong wrote:
> On 07/31/2013 10:48 PM, Gleb Natapov wrote:
> > From: Yang Zhang <yang.z.zhang@Intel.com>
> >
> > Inject nEPT faults into the L1 guest. This patch is originally from Xinhao.
> >
> > Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
> > Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
> > Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
> > Signed-off-by: Gleb Natapov <gleb@redhat.com>
> > ---
> > arch/x86/include/asm/kvm_host.h | 4 ++++
> > arch/x86/kvm/mmu.c | 34 ++++++++++++++++++++++++++++++++++
> > arch/x86/kvm/paging_tmpl.h | 28 ++++++++++++++++++++++++----
> > arch/x86/kvm/vmx.c | 17 +++++++++++++++++
> > 4 files changed, 79 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 531f47c..58a17c0 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -286,6 +286,7 @@ struct kvm_mmu {
> > u64 *pae_root;
> > u64 *lm_root;
> > u64 rsvd_bits_mask[2][4];
> > + u64 bad_mt_xwr;
> >
> > /*
> > * Bitmap: bit set = last pte in walk
> > @@ -512,6 +513,9 @@ struct kvm_vcpu_arch {
> > * instruction.
> > */
> > bool write_fault_to_shadow_pgtable;
> > +
> > + /* set at EPT violation at this point */
> > + unsigned long exit_qualification;
> > };
> >
> > struct kvm_lpage_info {
> > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> > index 3df3ac3..58ae9db 100644
> > --- a/arch/x86/kvm/mmu.c
> > +++ b/arch/x86/kvm/mmu.c
> > @@ -3521,6 +3521,8 @@ static void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu,
> > int maxphyaddr = cpuid_maxphyaddr(vcpu);
> > u64 exb_bit_rsvd = 0;
> >
> > + context->bad_mt_xwr = 0;
> > +
> > if (!context->nx)
> > exb_bit_rsvd = rsvd_bits(63, 63);
> > switch (context->root_level) {
> > @@ -3576,6 +3578,38 @@ static void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu,
> > }
> > }
> >
> > +static void reset_rsvds_bits_mask_ept(struct kvm_vcpu *vcpu,
> > + struct kvm_mmu *context, bool execonly)
> > +{
> > + int maxphyaddr = cpuid_maxphyaddr(vcpu);
> > + int pte;
> > +
> > + context->rsvd_bits_mask[0][3] =
> > + rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 7);
> > + context->rsvd_bits_mask[0][2] =
> > + rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 6);
> > + context->rsvd_bits_mask[0][1] =
> > + rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 6);
> > + context->rsvd_bits_mask[0][0] = rsvd_bits(maxphyaddr, 51);
> > +
> > + /* large page */
> > + context->rsvd_bits_mask[1][3] = context->rsvd_bits_mask[0][3];
> > + context->rsvd_bits_mask[1][2] =
> > + rsvd_bits(maxphyaddr, 51) | rsvd_bits(12, 29);
> > + context->rsvd_bits_mask[1][1] =
> > + rsvd_bits(maxphyaddr, 51) | rsvd_bits(12, 20);
> > + context->rsvd_bits_mask[1][0] = context->rsvd_bits_mask[0][0];
> > +
> > + for (pte = 0; pte < 64; pte++) {
> > + int rwx_bits = pte & 7;
> > + int mt = pte >> 3;
> > + if (mt == 0x2 || mt == 0x3 || mt == 0x7 ||
> > + rwx_bits == 0x2 || rwx_bits == 0x6 ||
> > + (rwx_bits == 0x4 && !execonly))
> > + context->bad_mt_xwr |= (1ull << pte);
> > + }
> > +}
> > +
> > static void update_permission_bitmask(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
> > {
> > unsigned bit, byte, pfec;
> > diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> > index 0d25351..ed6773e 100644
> > --- a/arch/x86/kvm/paging_tmpl.h
> > +++ b/arch/x86/kvm/paging_tmpl.h
> > @@ -127,12 +127,13 @@ static inline void FNAME(protect_clean_gpte)(unsigned *access, unsigned gpte)
> > *access &= mask;
> > }
> >
> > -static bool FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte, int level)
> > +static bool inline FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte,
> > + int level)
>
> Not sure this explicit "inline" is needed... Gcc always inlines the small and
> static functions.
>
Paolo asked for it, but now I see that I did it in the wrong patch. I do
not care much personally either way; I agree with you, though, that the
compiler will inline it anyway.
> > {
> > - int bit7;
> > + int bit7 = (gpte >> 7) & 1, low5 = gpte & 0x3f;
> >
> > - bit7 = (gpte >> 7) & 1;
> > - return (gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0;
> > + return ((gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0) |
> > + ((mmu->bad_mt_xwr & (1ull << low5)) != 0);
> > }
> >
> > static inline int FNAME(is_present_gpte)(unsigned long pte)
> > @@ -386,6 +387,25 @@ error:
> > walker->fault.vector = PF_VECTOR;
> > walker->fault.error_code_valid = true;
> > walker->fault.error_code = errcode;
> > +
> > +#if PTTYPE == PTTYPE_EPT
> > + /*
> > + * Use PFERR_RSVD_MASK in error_code to to tell if EPT
> > + * misconfiguration requires to be injected. The detection is
> > + * done by is_rsvd_bits_set() above.
> > + *
> > + * We set up the value of exit_qualification to inject:
> > + * [2:0] - Derive from [2:0] of real exit_qualification at EPT violation
> > + * [5:3] - Calculated by the page walk of the guest EPT page tables
> > + * [7:8] - Clear to 0.
>
> Do not know why always clear bit 7 and bit 8, especially bit 7 is always set for
> the normal case. The SDM says these about bit 7:
> The guest linear-address field is valid for all EPT violations except those
> resulting from an attempt to load the guest PDPTEs as part of the execution of
> the MOV CR instruction.
>
> So L0 always tells L1 that the fault is caused by "the guest PDPTEs as part of
> the execution of the MOV CR instruction".
You are right, it looks like we need to copy bits 7:8 from
exit_qualification.
--
Gleb.
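A sketch of the correction being agreed here, keeping bits 8:7 of the hardware exit qualification instead of clearing them (the 0x187 mask is illustrative):

	if (!(errcode & PFERR_RSVD_MASK)) {
		/* keep bits 2:0 plus bits 8:7 (guest-linear-address valid,
		 * access during guest page-table walk) from the real exit
		 * qualification */
		vcpu->arch.exit_qualification &= 0x187;
		vcpu->arch.exit_qualification |= ((pt_access & pte) & 0x7) << 3;
	}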
* Re: [PATCH v5 10/14] nEPT: Add nEPT violation/misconfiguration support
2013-08-01 8:45 ` Gleb Natapov
@ 2013-08-01 11:19 ` Paolo Bonzini
2013-08-01 11:47 ` Gleb Natapov
0 siblings, 1 reply; 42+ messages in thread
From: Paolo Bonzini @ 2013-08-01 11:19 UTC (permalink / raw)
To: Gleb Natapov; +Cc: Xiao Guangrong, kvm, Jun Nakajima, Yang Zhang
On Aug 01 2013, Gleb Natapov wrote:
> On Thu, Aug 01, 2013 at 04:31:31PM +0800, Xiao Guangrong wrote:
> > On 07/31/2013 10:48 PM, Gleb Natapov wrote:
> > > From: Yang Zhang <yang.z.zhang@Intel.com>
> > >
> > > Inject nEPT faults into the L1 guest. This patch is originally from Xinhao.
> > >
> > > Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
> > > Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
> > > Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
> > > Signed-off-by: Gleb Natapov <gleb@redhat.com>
> > > ---
> > > arch/x86/include/asm/kvm_host.h | 4 ++++
> > > arch/x86/kvm/mmu.c | 34 ++++++++++++++++++++++++++++++++++
> > > arch/x86/kvm/paging_tmpl.h | 28 ++++++++++++++++++++++++----
> > > arch/x86/kvm/vmx.c | 17 +++++++++++++++++
> > > 4 files changed, 79 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > index 531f47c..58a17c0 100644
> > > --- a/arch/x86/include/asm/kvm_host.h
> > > +++ b/arch/x86/include/asm/kvm_host.h
> > > @@ -286,6 +286,7 @@ struct kvm_mmu {
> > > u64 *pae_root;
> > > u64 *lm_root;
> > > u64 rsvd_bits_mask[2][4];
> > > + u64 bad_mt_xwr;
> > >
> > > /*
> > > * Bitmap: bit set = last pte in walk
> > > @@ -512,6 +513,9 @@ struct kvm_vcpu_arch {
> > > * instruction.
> > > */
> > > bool write_fault_to_shadow_pgtable;
> > > +
> > > + /* set at EPT violation at this point */
> > > + unsigned long exit_qualification;
> > > };
> > >
> > > struct kvm_lpage_info {
> > > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> > > index 3df3ac3..58ae9db 100644
> > > --- a/arch/x86/kvm/mmu.c
> > > +++ b/arch/x86/kvm/mmu.c
> > > @@ -3521,6 +3521,8 @@ static void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu,
> > > int maxphyaddr = cpuid_maxphyaddr(vcpu);
> > > u64 exb_bit_rsvd = 0;
> > >
> > > + context->bad_mt_xwr = 0;
> > > +
> > > if (!context->nx)
> > > exb_bit_rsvd = rsvd_bits(63, 63);
> > > switch (context->root_level) {
> > > @@ -3576,6 +3578,38 @@ static void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu,
> > > }
> > > }
> > >
> > > +static void reset_rsvds_bits_mask_ept(struct kvm_vcpu *vcpu,
> > > + struct kvm_mmu *context, bool execonly)
> > > +{
> > > + int maxphyaddr = cpuid_maxphyaddr(vcpu);
> > > + int pte;
> > > +
> > > + context->rsvd_bits_mask[0][3] =
> > > + rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 7);
> > > + context->rsvd_bits_mask[0][2] =
> > > + rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 6);
> > > + context->rsvd_bits_mask[0][1] =
> > > + rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 6);
> > > + context->rsvd_bits_mask[0][0] = rsvd_bits(maxphyaddr, 51);
> > > +
> > > + /* large page */
> > > + context->rsvd_bits_mask[1][3] = context->rsvd_bits_mask[0][3];
> > > + context->rsvd_bits_mask[1][2] =
> > > + rsvd_bits(maxphyaddr, 51) | rsvd_bits(12, 29);
> > > + context->rsvd_bits_mask[1][1] =
> > > + rsvd_bits(maxphyaddr, 51) | rsvd_bits(12, 20);
> > > + context->rsvd_bits_mask[1][0] = context->rsvd_bits_mask[0][0];
> > > +
> > > + for (pte = 0; pte < 64; pte++) {
> > > + int rwx_bits = pte & 7;
> > > + int mt = pte >> 3;
> > > + if (mt == 0x2 || mt == 0x3 || mt == 0x7 ||
> > > + rwx_bits == 0x2 || rwx_bits == 0x6 ||
> > > + (rwx_bits == 0x4 && !execonly))
> > > + context->bad_mt_xwr |= (1ull << pte);
> > > + }
> > > +}
> > > +
> > > static void update_permission_bitmask(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
> > > {
> > > unsigned bit, byte, pfec;
> > > diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> > > index 0d25351..ed6773e 100644
> > > --- a/arch/x86/kvm/paging_tmpl.h
> > > +++ b/arch/x86/kvm/paging_tmpl.h
> > > @@ -127,12 +127,13 @@ static inline void FNAME(protect_clean_gpte)(unsigned *access, unsigned gpte)
> > > *access &= mask;
> > > }
> > >
> > > -static bool FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte, int level)
> > > +static bool inline FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte,
> > > + int level)
> >
> > Not sure this explicit "inline" is needed... Gcc always inlines the small and
> > static functions.
>
> Paolo asked for it, but now I see that I did it in the wrong patch. I do
> not care much personally either way; I agree with you, though, that the
> compiler will inline it anyway.
The point here was that if we use "||" below (or multiple "if"s as I
suggested in my review), we really want to inline the function to ensure
that the branches here are merged with the one in the caller's "if()".
With the "|" there is not much effect.
> > > {
> > > - int bit7;
> > > + int bit7 = (gpte >> 7) & 1, low5 = gpte & 0x3f;
Low 6, actually.
> > > - bit7 = (gpte >> 7) & 1;
> > > - return (gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0;
> > > + return ((gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0) |
> > > + ((mmu->bad_mt_xwr & (1ull << low5)) != 0);
If you really want to optimize this thing to avoid branches, you can
also change it to
return ((gpte & mmu->rsvd_bits_mask[bit7][level-1]) |
(mmu->bad_mt_xwr & (1ull << low5))) != 0;
and consider adding the bits to bad_mt_xwr to detect non-presence
at the same time as reserved bits (as non-presence is also tested
near both callers of is_rsvd_bits_set).
On the other hand, it starts to look like useless complication not
backed by any benchmark (and probably unmeasurable anyway). Neither of
the conditions is particularly easy to compute, and they are probably
more expensive than a well-predicted branch. Thus, in this case I
would prefer to have clearer code and just use two "if"s, the second
of which can be guarded to be done only for EPT (it doesn't have to
be an #ifdef, no?).
If you do not want to do that, now that this is checked also on non-EPT
PTEs we should rename it to something else like bad_low_six_bits?
Regular page tables have no MT and XWR bits.
Paolo
> > > }
> > >
> > > static inline int FNAME(is_present_gpte)(unsigned long pte)
> > > @@ -386,6 +387,25 @@ error:
> > > walker->fault.vector = PF_VECTOR;
> > > walker->fault.error_code_valid = true;
> > > walker->fault.error_code = errcode;
> > > +
> > > +#if PTTYPE == PTTYPE_EPT
> > > + /*
> > > + * Use PFERR_RSVD_MASK in error_code to to tell if EPT
> > > + * misconfiguration requires to be injected. The detection is
> > > + * done by is_rsvd_bits_set() above.
> > > + *
> > > + * We set up the value of exit_qualification to inject:
> > > + * [2:0] - Derive from [2:0] of real exit_qualification at EPT violation
> > > + * [5:3] - Calculated by the page walk of the guest EPT page tables
> > > + * [7:8] - Clear to 0.
> >
> > Do not know why always clear bit 7 and bit 8, especially bit 7 is always set for
> > the normal case. The SDM says these about bit 7:
> > The guest linear-address field is valid for all EPT violations except those
> > resulting from an attempt to load the guest PDPTEs as part of the execution of
> > the MOV CR instruction.
> >
> > So L0 always tells L1 that the fault is caused by "the guest PDPTEs as part of
> > the execution of the MOV CR instruction".
> You are right, it looks like we need to copy bits 7:8 from
> exit_qualification.
>
> --
> Gleb.
* Re: [PATCH v5 10/14] nEPT: Add nEPT violation/misconfiguration support
2013-08-01 11:19 ` Paolo Bonzini
@ 2013-08-01 11:47 ` Gleb Natapov
2013-08-01 12:03 ` Paolo Bonzini
0 siblings, 1 reply; 42+ messages in thread
From: Gleb Natapov @ 2013-08-01 11:47 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: Xiao Guangrong, kvm, Jun Nakajima, Yang Zhang
On Thu, Aug 01, 2013 at 01:19:11PM +0200, Paolo Bonzini wrote:
> On Aug 01 2013, Gleb Natapov wrote:
> > On Thu, Aug 01, 2013 at 04:31:31PM +0800, Xiao Guangrong wrote:
> > > On 07/31/2013 10:48 PM, Gleb Natapov wrote:
> > > > From: Yang Zhang <yang.z.zhang@Intel.com>
> > > >
> > > > Inject nEPT faults into the L1 guest. This patch is originally from Xinhao.
> > > >
> > > > Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
> > > > Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
> > > > Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
> > > > Signed-off-by: Gleb Natapov <gleb@redhat.com>
> > > > ---
> > > > arch/x86/include/asm/kvm_host.h | 4 ++++
> > > > arch/x86/kvm/mmu.c | 34 ++++++++++++++++++++++++++++++++++
> > > > arch/x86/kvm/paging_tmpl.h | 28 ++++++++++++++++++++++++----
> > > > arch/x86/kvm/vmx.c | 17 +++++++++++++++++
> > > > 4 files changed, 79 insertions(+), 4 deletions(-)
> > > >
> > > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > > index 531f47c..58a17c0 100644
> > > > --- a/arch/x86/include/asm/kvm_host.h
> > > > +++ b/arch/x86/include/asm/kvm_host.h
> > > > @@ -286,6 +286,7 @@ struct kvm_mmu {
> > > > u64 *pae_root;
> > > > u64 *lm_root;
> > > > u64 rsvd_bits_mask[2][4];
> > > > + u64 bad_mt_xwr;
> > > >
> > > > /*
> > > > * Bitmap: bit set = last pte in walk
> > > > @@ -512,6 +513,9 @@ struct kvm_vcpu_arch {
> > > > * instruction.
> > > > */
> > > > bool write_fault_to_shadow_pgtable;
> > > > +
> > > > + /* set at EPT violation at this point */
> > > > + unsigned long exit_qualification;
> > > > };
> > > >
> > > > struct kvm_lpage_info {
> > > > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> > > > index 3df3ac3..58ae9db 100644
> > > > --- a/arch/x86/kvm/mmu.c
> > > > +++ b/arch/x86/kvm/mmu.c
> > > > @@ -3521,6 +3521,8 @@ static void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu,
> > > > int maxphyaddr = cpuid_maxphyaddr(vcpu);
> > > > u64 exb_bit_rsvd = 0;
> > > >
> > > > + context->bad_mt_xwr = 0;
> > > > +
> > > > if (!context->nx)
> > > > exb_bit_rsvd = rsvd_bits(63, 63);
> > > > switch (context->root_level) {
> > > > @@ -3576,6 +3578,38 @@ static void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu,
> > > > }
> > > > }
> > > >
> > > > +static void reset_rsvds_bits_mask_ept(struct kvm_vcpu *vcpu,
> > > > + struct kvm_mmu *context, bool execonly)
> > > > +{
> > > > + int maxphyaddr = cpuid_maxphyaddr(vcpu);
> > > > + int pte;
> > > > +
> > > > + context->rsvd_bits_mask[0][3] =
> > > > + rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 7);
> > > > + context->rsvd_bits_mask[0][2] =
> > > > + rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 6);
> > > > + context->rsvd_bits_mask[0][1] =
> > > > + rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 6);
> > > > + context->rsvd_bits_mask[0][0] = rsvd_bits(maxphyaddr, 51);
> > > > +
> > > > + /* large page */
> > > > + context->rsvd_bits_mask[1][3] = context->rsvd_bits_mask[0][3];
> > > > + context->rsvd_bits_mask[1][2] =
> > > > + rsvd_bits(maxphyaddr, 51) | rsvd_bits(12, 29);
> > > > + context->rsvd_bits_mask[1][1] =
> > > > + rsvd_bits(maxphyaddr, 51) | rsvd_bits(12, 20);
> > > > + context->rsvd_bits_mask[1][0] = context->rsvd_bits_mask[0][0];
> > > > +
> > > > + for (pte = 0; pte < 64; pte++) {
> > > > + int rwx_bits = pte & 7;
> > > > + int mt = pte >> 3;
> > > > + if (mt == 0x2 || mt == 0x3 || mt == 0x7 ||
> > > > + rwx_bits == 0x2 || rwx_bits == 0x6 ||
> > > > + (rwx_bits == 0x4 && !execonly))
> > > > + context->bad_mt_xwr |= (1ull << pte);
> > > > + }
> > > > +}
> > > > +
> > > > static void update_permission_bitmask(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
> > > > {
> > > > unsigned bit, byte, pfec;
> > > > diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> > > > index 0d25351..ed6773e 100644
> > > > --- a/arch/x86/kvm/paging_tmpl.h
> > > > +++ b/arch/x86/kvm/paging_tmpl.h
> > > > @@ -127,12 +127,13 @@ static inline void FNAME(protect_clean_gpte)(unsigned *access, unsigned gpte)
> > > > *access &= mask;
> > > > }
> > > >
> > > > -static bool FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte, int level)
> > > > +static bool inline FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte,
> > > > + int level)
> > >
> > > Not sure this explicit "inline" is needed... Gcc always inlines the small and
> > > static functions.
> >
> > Paolo asked for it, but now I see that I did it in the wrong patch. I do
> > not care much personally either way; I agree with you, though, that the
> > compiler will inline it anyway.
>
> The point here was that if we use "||" below (or multiple "if"s as I
> suggested in my review), we really want to inline the function to ensure
> that the branches here are merged with the one in the caller's "if()".
>
> With the "|" there is not much effect.
>
Even with if(), do you really think there is a chance the function will not be
inlined? I see that much, much bigger functions are inlined.
> > > > {
> > > > - int bit7;
> > > > + int bit7 = (gpte >> 7) & 1, low5 = gpte & 0x3f;
>
> Low 6, actually.
Well, 5. This is a bug; it should be 0x1f. Good catch :)
>
> > > > - bit7 = (gpte >> 7) & 1;
> > > > - return (gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0;
> > > > + return ((gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0) |
> > > > + ((mmu->bad_mt_xwr & (1ull << low5)) != 0);
>
> If you really want to optimize this thing to avoid branches, you can
> also change it to
>
> return ((gpte & mmu->rsvd_bits_mask[bit7][level-1]) |
> (mmu->bad_mt_xwr & (1ull << low5))) != 0;
>
Why not drop the second != 0 then?
> and consider adding the bits to bad_mt_xwr to detect non-presence
> at the same time as reserved bits (as non-presence is also tested
> near both callers of is_rsvd_bits_set).
We need to distinguish between the two of them in at least one call site.
>
> On the other hand, it starts to look like useless complication not
> backed by any benchmark (and probably unmeasurable anyway). Neither of
> the conditions are particularly easy to compute, and they are probably
> more expensive than a well-predicted branch. Thus, in this case I
> would prefer to have clearer code and just use two "if"s, the second
> of which can be guarded to be done only for EPT (it doesn't have to
> be an #ifdef, no?).
return a | b
is less clear than:
if (a)
return true;
if (b)
return true;
?
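For concreteness, a minimal standalone sketch of the two shapes being compared. The struct is a simplified stand-in for the relevant kvm_mmu fields, not kernel code, and low6 follows the rename agreed later in the thread:

#include <stdbool.h>
#include <stdint.h>

struct mmu_stub {
	uint64_t rsvd_bits_mask[2][4];	/* [bit 7 of pte][level - 1] */
	uint64_t bad_mt_xwr;		/* one bit per low-6-bit combination */
};

/* Branchless form: evaluate both masks, OR them, compare once. */
static bool rsvd_check_branchless(struct mmu_stub *mmu, uint64_t gpte, int level)
{
	int bit7 = (gpte >> 7) & 1;
	int low6 = gpte & 0x3f;

	return ((gpte & mmu->rsvd_bits_mask[bit7][level - 1]) |
		(mmu->bad_mt_xwr & (1ull << low6))) != 0;
}

/* Two-if form: reads sequentially, and the second test is trivially
 * false for non-EPT tables, where bad_mt_xwr stays 0. */
static bool rsvd_check_two_ifs(struct mmu_stub *mmu, uint64_t gpte, int level)
{
	int bit7 = (gpte >> 7) & 1;
	int low6 = gpte & 0x3f;

	if (gpte & mmu->rsvd_bits_mask[bit7][level - 1])
		return true;
	if (mmu->bad_mt_xwr & (1ull << low6))
		return true;
	return false;
}

The two predicates are equivalent; the disagreement is only about which reads better and whether the compiler merges the branches after inlining.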
>
> If you do not want to do that, now that this is checked also on non-EPT
> PTEs we should rename it to something else like bad_low_six_bits?
> Regular page tables have no MT and XWR bits.
>
I think the current name better describes the purpose of the field. It
shows that for non-EPT it is irrelevant, but if we use it to detect
non-present PTEs for regular page tables too, the rename makes perfect
sense. Let's rename it then.
--
Gleb.
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v5 10/14] nEPT: Add nEPT violation/misconfiguration support
2013-08-01 11:47 ` Gleb Natapov
@ 2013-08-01 12:03 ` Paolo Bonzini
2013-08-01 12:14 ` Gleb Natapov
0 siblings, 1 reply; 42+ messages in thread
From: Paolo Bonzini @ 2013-08-01 12:03 UTC (permalink / raw)
To: Gleb Natapov; +Cc: Xiao Guangrong, kvm, Jun Nakajima, Yang Zhang
On Aug 01 2013, Gleb Natapov wrote:
> On Thu, Aug 01, 2013 at 01:19:11PM +0200, Paolo Bonzini wrote:
> > On Aug 01 2013, Gleb Natapov wrote:
> > > On Thu, Aug 01, 2013 at 04:31:31PM +0800, Xiao Guangrong wrote:
> > > > On 07/31/2013 10:48 PM, Gleb Natapov wrote:
> > > > > From: Yang Zhang <yang.z.zhang@Intel.com>
> > > > >
> > > > > Inject nEPT fault to L1 guest. This patch is originally from Xinhao.
> > > > >
> > > > > Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
> > > > > Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
> > > > > Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
> > > > > Signed-off-by: Gleb Natapov <gleb@redhat.com>
> > > > > ---
> > > > > arch/x86/include/asm/kvm_host.h | 4 ++++
> > > > > arch/x86/kvm/mmu.c | 34 ++++++++++++++++++++++++++++++++++
> > > > > arch/x86/kvm/paging_tmpl.h | 28 ++++++++++++++++++++++++----
> > > > > arch/x86/kvm/vmx.c | 17 +++++++++++++++++
> > > > > 4 files changed, 79 insertions(+), 4 deletions(-)
> > > > >
> > > > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > > > index 531f47c..58a17c0 100644
> > > > > --- a/arch/x86/include/asm/kvm_host.h
> > > > > +++ b/arch/x86/include/asm/kvm_host.h
> > > > > @@ -286,6 +286,7 @@ struct kvm_mmu {
> > > > > u64 *pae_root;
> > > > > u64 *lm_root;
> > > > > u64 rsvd_bits_mask[2][4];
> > > > > + u64 bad_mt_xwr;
> > > > >
> > > > > /*
> > > > > * Bitmap: bit set = last pte in walk
> > > > > @@ -512,6 +513,9 @@ struct kvm_vcpu_arch {
> > > > > * instruction.
> > > > > */
> > > > > bool write_fault_to_shadow_pgtable;
> > > > > +
> > > > > + /* set at EPT violation at this point */
> > > > > + unsigned long exit_qualification;
> > > > > };
> > > > >
> > > > > struct kvm_lpage_info {
> > > > > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> > > > > index 3df3ac3..58ae9db 100644
> > > > > --- a/arch/x86/kvm/mmu.c
> > > > > +++ b/arch/x86/kvm/mmu.c
> > > > > @@ -3521,6 +3521,8 @@ static void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu,
> > > > > int maxphyaddr = cpuid_maxphyaddr(vcpu);
> > > > > u64 exb_bit_rsvd = 0;
> > > > >
> > > > > + context->bad_mt_xwr = 0;
> > > > > +
> > > > > if (!context->nx)
> > > > > exb_bit_rsvd = rsvd_bits(63, 63);
> > > > > switch (context->root_level) {
> > > > > @@ -3576,6 +3578,38 @@ static void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu,
> > > > > }
> > > > > }
> > > > >
> > > > > +static void reset_rsvds_bits_mask_ept(struct kvm_vcpu *vcpu,
> > > > > + struct kvm_mmu *context, bool execonly)
> > > > > +{
> > > > > + int maxphyaddr = cpuid_maxphyaddr(vcpu);
> > > > > + int pte;
> > > > > +
> > > > > + context->rsvd_bits_mask[0][3] =
> > > > > + rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 7);
> > > > > + context->rsvd_bits_mask[0][2] =
> > > > > + rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 6);
> > > > > + context->rsvd_bits_mask[0][1] =
> > > > > + rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 6);
> > > > > + context->rsvd_bits_mask[0][0] = rsvd_bits(maxphyaddr, 51);
> > > > > +
> > > > > + /* large page */
> > > > > + context->rsvd_bits_mask[1][3] = context->rsvd_bits_mask[0][3];
> > > > > + context->rsvd_bits_mask[1][2] =
> > > > > + rsvd_bits(maxphyaddr, 51) | rsvd_bits(12, 29);
> > > > > + context->rsvd_bits_mask[1][1] =
> > > > > + rsvd_bits(maxphyaddr, 51) | rsvd_bits(12, 20);
> > > > > + context->rsvd_bits_mask[1][0] = context->rsvd_bits_mask[0][0];
> > > > > +
> > > > > + for (pte = 0; pte < 64; pte++) {
> > > > > + int rwx_bits = pte & 7;
> > > > > + int mt = pte >> 3;
> > > > > + if (mt == 0x2 || mt == 0x3 || mt == 0x7 ||
> > > > > + rwx_bits == 0x2 || rwx_bits == 0x6 ||
> > > > > + (rwx_bits == 0x4 && !execonly))
> > > > > + context->bad_mt_xwr |= (1ull << pte);
> > > > > + }
> > > > > +}
> > > > > +
> > > > > static void update_permission_bitmask(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
> > > > > {
> > > > > unsigned bit, byte, pfec;
> > > > > diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> > > > > index 0d25351..ed6773e 100644
> > > > > --- a/arch/x86/kvm/paging_tmpl.h
> > > > > +++ b/arch/x86/kvm/paging_tmpl.h
> > > > > @@ -127,12 +127,13 @@ static inline void FNAME(protect_clean_gpte)(unsigned *access, unsigned gpte)
> > > > > *access &= mask;
> > > > > }
> > > > >
> > > > > -static bool FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte, int level)
> > > > > +static bool inline FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte,
> > > > > + int level)
> > > >
> > > > Not sure this explicit "inline" is needed... GCC always inlines small and
> > > > static functions.
> > >
> > > Paolo asked for it, but now I see that I did it in the wrong patch. I do
> > > not care much personally either way; I agree with you though, the compiler
> > > will inline it anyway.
> >
> > The point here was that if we use "||" below (or multiple "if"s as I
> > suggested in my review), we really want to inline the function to ensure
> > that the branches here are merged with the one in the caller's "if()".
> >
> > With the "|" there is not much effect.
>
> Even with if() do you really think there is a chance the function will not be
> inlined? I see that much much bigger functions are inlined.
I don't know, it depends on the compiler flags, how much the function
is used... Reducing the chance to zero is not bad.
> > > > > {
> > > > > - int bit7;
> > > > > + int bit7 = (gpte >> 7) & 1, low5 = gpte & 0x3f;
> >
> > Low 6, actually.
>
> Well, 5. This is a bug, should be 0x1f. Good catch :)
MT is three bits, actually.
> > > > > - bit7 = (gpte >> 7) & 1;
> > > > > - return (gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0;
> > > > > + return ((gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0) |
> > > > > + ((mmu->bad_mt_xwr & (1ull << low5)) != 0);
> >
> > If you really want to optimize this thing to avoid branches, you can
> > also change it to
> >
> > return ((gpte & mmu->rsvd_bits_mask[bit7][level-1]) |
> > (mmu->bad_mt_xwr & (1ull << low5))) != 0;
>
> Why not drop second != 0 then?
Because the function is "bool". I dislike the magic "!= 0"
that the compiler adds on conversion to bool. It always seemed
like a recipe for trouble since "int" and "bool" are otherwise
interchangeable... Without that "!= 0", s/bool/int/ would ignore
the upper 32 bits and break.
> > and consider adding the bits to bad_mt_xwr to detect non-presence
> > at the same time as reserved bits (as non-presence is also tested
> > near both callers of is_rsvd_bits_set).
>
> We need to distinguish between the two at least at one call site.
You can always check for present afterwards, like
if (is_rsvd_bit_set_or_nonpresent(pte)) {
if (!is_present_pte(pte))
...
}
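One way to realize that for EPT, where an entry is present iff any of its XWR bits is set, is to extend the bad_mt_xwr setup loop from the patch so the xwr == 0 combinations are flagged too. This is only an illustration with a stub type, not the eventual kernel change:

#include <stdbool.h>
#include <stdint.h>

struct mmu_stub {
	uint64_t bad_mt_xwr;
};

/* Flag reserved memory types, invalid XWR combinations, and also the
 * non-present (xwr == 0) values, so a single bitmap test catches
 * reserved bits and non-presence at once. */
static void setup_bad_mt_xwr(struct mmu_stub *mmu, bool execonly)
{
	int pte;

	mmu->bad_mt_xwr = 0;
	for (pte = 0; pte < 64; pte++) {
		int rwx_bits = pte & 7;
		int mt = pte >> 3;

		if (rwx_bits == 0 ||			/* non-present entry */
		    mt == 0x2 || mt == 0x3 || mt == 0x7 ||
		    rwx_bits == 0x2 || rwx_bits == 0x6 ||
		    (rwx_bits == 0x4 && !execonly))
			mmu->bad_mt_xwr |= 1ull << pte;
	}
}

The one call site that must distinguish the two failures can then re-test presence only on the already-failing path, exactly as in the snippet above.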
> > On the other hand, it starts to look like useless complication not
> > backed by any benchmark (and probably unmeasurable anyway). Neither of
> > the conditions are particularly easy to compute, and they are probably
> > more expensive than a well-predicted branch. Thus, in this case I
> > would prefer to have clearer code and just use two "if"s, the second
> > of which can be guarded to be done only for EPT (it doesn't have to
> > be an #ifdef, no?).
>
> return a | b
>
> is less clear than:
>
> if (a)
> return true;
> if (b)
> return true;
If each of a and b is actually a complicated expression having
AND/SHIFT/arrays, then yes. :)
> > If you do not want to do that, now that this is checked also on non-EPT
> > PTEs we should rename it to something else like bad_low_six_bits?
> > Regular page tables have no MT and XWR bits.
>
> I think the current name better describes the purpose of the field. It
> shows that for non-EPT it is irrelevant, but if we use it to detect
> non-present PTEs for regular page tables too, the rename makes perfect
> sense. Let's rename it then.
Agreed on both counts (leaving it as is for the current name, renaming
it if used for nonpresent PTEs).
Paolo
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v5 10/14] nEPT: Add nEPT violation/misconfiguration support
2013-08-01 12:03 ` Paolo Bonzini
@ 2013-08-01 12:14 ` Gleb Natapov
2013-08-01 13:13 ` Paolo Bonzini
0 siblings, 1 reply; 42+ messages in thread
From: Gleb Natapov @ 2013-08-01 12:14 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: Xiao Guangrong, kvm, Jun Nakajima, Yang Zhang
On Thu, Aug 01, 2013 at 02:03:01PM +0200, Paolo Bonzini wrote:
> On Aug 01 2013, Gleb Natapov wrote:
> > On Thu, Aug 01, 2013 at 01:19:11PM +0200, Paolo Bonzini wrote:
> > > On Aug 01 2013, Gleb Natapov wrote:
> > > > On Thu, Aug 01, 2013 at 04:31:31PM +0800, Xiao Guangrong wrote:
> > > > > On 07/31/2013 10:48 PM, Gleb Natapov wrote:
> > > > > > From: Yang Zhang <yang.z.zhang@Intel.com>
> > > > > >
> > > > > > Inject nEPT fault to L1 guest. This patch is originally from Xinhao.
> > > > > >
> > > > > > Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
> > > > > > Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
> > > > > > Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
> > > > > > Signed-off-by: Gleb Natapov <gleb@redhat.com>
> > > > > > ---
> > > > > > arch/x86/include/asm/kvm_host.h | 4 ++++
> > > > > > arch/x86/kvm/mmu.c | 34 ++++++++++++++++++++++++++++++++++
> > > > > > arch/x86/kvm/paging_tmpl.h | 28 ++++++++++++++++++++++++----
> > > > > > arch/x86/kvm/vmx.c | 17 +++++++++++++++++
> > > > > > 4 files changed, 79 insertions(+), 4 deletions(-)
> > > > > >
> > > > > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > > > > index 531f47c..58a17c0 100644
> > > > > > --- a/arch/x86/include/asm/kvm_host.h
> > > > > > +++ b/arch/x86/include/asm/kvm_host.h
> > > > > > @@ -286,6 +286,7 @@ struct kvm_mmu {
> > > > > > u64 *pae_root;
> > > > > > u64 *lm_root;
> > > > > > u64 rsvd_bits_mask[2][4];
> > > > > > + u64 bad_mt_xwr;
> > > > > >
> > > > > > /*
> > > > > > * Bitmap: bit set = last pte in walk
> > > > > > @@ -512,6 +513,9 @@ struct kvm_vcpu_arch {
> > > > > > * instruction.
> > > > > > */
> > > > > > bool write_fault_to_shadow_pgtable;
> > > > > > +
> > > > > > + /* set at EPT violation at this point */
> > > > > > + unsigned long exit_qualification;
> > > > > > };
> > > > > >
> > > > > > struct kvm_lpage_info {
> > > > > > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> > > > > > index 3df3ac3..58ae9db 100644
> > > > > > --- a/arch/x86/kvm/mmu.c
> > > > > > +++ b/arch/x86/kvm/mmu.c
> > > > > > @@ -3521,6 +3521,8 @@ static void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu,
> > > > > > int maxphyaddr = cpuid_maxphyaddr(vcpu);
> > > > > > u64 exb_bit_rsvd = 0;
> > > > > >
> > > > > > + context->bad_mt_xwr = 0;
> > > > > > +
> > > > > > if (!context->nx)
> > > > > > exb_bit_rsvd = rsvd_bits(63, 63);
> > > > > > switch (context->root_level) {
> > > > > > @@ -3576,6 +3578,38 @@ static void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu,
> > > > > > }
> > > > > > }
> > > > > >
> > > > > > +static void reset_rsvds_bits_mask_ept(struct kvm_vcpu *vcpu,
> > > > > > + struct kvm_mmu *context, bool execonly)
> > > > > > +{
> > > > > > + int maxphyaddr = cpuid_maxphyaddr(vcpu);
> > > > > > + int pte;
> > > > > > +
> > > > > > + context->rsvd_bits_mask[0][3] =
> > > > > > + rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 7);
> > > > > > + context->rsvd_bits_mask[0][2] =
> > > > > > + rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 6);
> > > > > > + context->rsvd_bits_mask[0][1] =
> > > > > > + rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 6);
> > > > > > + context->rsvd_bits_mask[0][0] = rsvd_bits(maxphyaddr, 51);
> > > > > > +
> > > > > > + /* large page */
> > > > > > + context->rsvd_bits_mask[1][3] = context->rsvd_bits_mask[0][3];
> > > > > > + context->rsvd_bits_mask[1][2] =
> > > > > > + rsvd_bits(maxphyaddr, 51) | rsvd_bits(12, 29);
> > > > > > + context->rsvd_bits_mask[1][1] =
> > > > > > + rsvd_bits(maxphyaddr, 51) | rsvd_bits(12, 20);
> > > > > > + context->rsvd_bits_mask[1][0] = context->rsvd_bits_mask[0][0];
> > > > > > +
> > > > > > + for (pte = 0; pte < 64; pte++) {
> > > > > > + int rwx_bits = pte & 7;
> > > > > > + int mt = pte >> 3;
> > > > > > + if (mt == 0x2 || mt == 0x3 || mt == 0x7 ||
> > > > > > + rwx_bits == 0x2 || rwx_bits == 0x6 ||
> > > > > > + (rwx_bits == 0x4 && !execonly))
> > > > > > + context->bad_mt_xwr |= (1ull << pte);
> > > > > > + }
> > > > > > +}
> > > > > > +
> > > > > > static void update_permission_bitmask(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
> > > > > > {
> > > > > > unsigned bit, byte, pfec;
> > > > > > diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> > > > > > index 0d25351..ed6773e 100644
> > > > > > --- a/arch/x86/kvm/paging_tmpl.h
> > > > > > +++ b/arch/x86/kvm/paging_tmpl.h
> > > > > > @@ -127,12 +127,13 @@ static inline void FNAME(protect_clean_gpte)(unsigned *access, unsigned gpte)
> > > > > > *access &= mask;
> > > > > > }
> > > > > >
> > > > > > -static bool FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte, int level)
> > > > > > +static bool inline FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte,
> > > > > > + int level)
> > > > >
> > > > > Not sure this explicit "inline" is needed... GCC always inlines small and
> > > > > static functions.
> > > >
> > > > Paolo asked for it, but now I see that I did it in the wrong patch. I do
> > > > not care much personally either way; I agree with you though, the compiler
> > > > will inline it anyway.
> > >
> > > The point here was that if we use "||" below (or multiple "if"s as I
> > > suggested in my review), we really want to inline the function to ensure
> > > that the branches here are merged with the one in the caller's "if()".
> > >
> > > With the "|" there is not much effect.
> >
> > Even with if() do you really think there is a chance the function will not be
> > inlined? I see that much much bigger functions are inlined.
>
> I don't know, it depends on the compiler flags, how much the function
> is used... Reducing the chance to zero is not bad.
>
AFAIK inline is just a hint anyway and the compiler is free to ignore it.
That is why we have __always_inline, but the compiler should know better
here; I do not see a reason for __always_inline.
> > > > > > {
> > > > > > - int bit7;
> > > > > > + int bit7 = (gpte >> 7) & 1, low5 = gpte & 0x3f;
> > >
> > > Low 6, actually.
> >
> > Well, 5. This is a bug, should be 0x1f. Good catch :)
>
> MT is three bits, actually.
>
Yep, scrap that. Will rename to low6.
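To spell the layout out: in a leaf EPT PTE, bits 0-2 are the read/write/execute permissions and bits 5:3 the memory type, so the index into bad_mt_xwr really is the low six bits; the 0x3f mask in the patch was already right, only the variable name was off. A standalone illustration of the decode and of the bad combinations, assuming the memory-type encoding from the Intel SDM:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Decode the low six bits of a leaf EPT PTE: bit 0 = read, bit 1 = write,
 * bit 2 = execute, bits 5:3 = memory type (0 UC, 1 WC, 4 WT, 5 WP, 6 WB;
 * values 2, 3 and 7 are reserved). */
static bool low6_is_bad(uint64_t gpte, bool execonly)
{
	int xwr = gpte & 0x7;
	int mt = (gpte >> 3) & 0x7;

	return mt == 2 || mt == 3 || mt == 7 ||	/* reserved memory type */
	       xwr == 2 || xwr == 6 ||		/* writable but not readable */
	       (xwr == 4 && !execonly);		/* execute-only unsupported */
}

int main(void)
{
	printf("%d\n", low6_is_bad(0x33, false));	/* WB, read+write: 0 */
	printf("%d\n", low6_is_bad(0x13, false));	/* reserved MT 2:   1 */
	return 0;
}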
> > > > > > - bit7 = (gpte >> 7) & 1;
> > > > > > - return (gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0;
> > > > > > + return ((gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0) |
> > > > > > + ((mmu->bad_mt_xwr & (1ull << low5)) != 0);
> > >
> > > If you really want to optimize this thing to avoid branches, you can
> > > also change it to
> > >
> > > return ((gpte & mmu->rsvd_bits_mask[bit7][level-1]) |
> > > (mmu->bad_mt_xwr & (1ull << low5))) != 0;
> >
> > Why not drop second != 0 then?
>
> Because the function is "bool". I dislike the magic "!= 0"
> that the compiler adds on conversion to bool. It always seemed
> like a recipe for trouble since "int" and "bool" are otherwise
> interchangeable... Without that "!= 0", s/bool/int/ would ignore
> the upper 32 bits and break.
I actually checked that before proposing.
printf("%d\n", (bool)0x1000000000) prints one, but of course if bool is
typedefed to int it will not work, but it should not be.
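A self-contained version of that experiment, with the plain-int counterpart Paolo is worried about next to it. The value is the one from the thread, and the int result assumes the usual 32-bit int with modular conversion:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t v = 0x1000000000ull;	/* only a bit above bit 31 is set */
	bool b = v;	/* _Bool conversion: any non-zero value becomes 1 */
	int i = v;	/* truncation: the low 32 bits are all zero */

	printf("bool: %d, int: %d\n", b, i);	/* bool: 1, int: 0 */
	return 0;
}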
>
> > > and consider adding the bits to bad_mt_xwr to detect non-presence
> > > at the same time as reserved bits (as non-presence is also tested
> > > near both callers of is_rsvd_bits_set).
> >
> > We need to distinguish between the two at least at one call site.
>
> You can always check for present afterwards, like
>
> if (is_rsvd_bit_set_or_nonpresent(pte)) {
> if (!is_present_pte(pte))
> ...
> }
I can, but this is unrelated; we can consider it afterwards.
>
> > > On the other hand, it starts to look like useless complication not
> > > backed by any benchmark (and probably unmeasurable anyway). Neither of
> > > the conditions are particularly easy to compute, and they are probably
> > > more expensive than a well-predicted branch. Thus, in this case I
> > > would prefer to have clearer code and just use two "if"s, the second
> > > of which can be guarded to be done only for EPT (it doesn't have to
> > > be an #ifdef, no?).
> >
> > return a | b
> >
> > is less clear than:
> >
> > if (a)
> > return true;
> > if (b)
> > return true;
>
> If each of a and b is actually a complicated expression having
> AND/SHIFT/arrays, then yes. :)
>
This is not the most complex expression possible :)
Just having them on different lines is not enough?
> > > If you do not want to do that, now that this is checked also on non-EPT
> > > PTEs we should rename it to something else like bad_low_six_bits?
> > > Regular page tables have no MT and XWR bits.
> >
> > I think the current name better describes the purpose of the field. It
> > shows that for non-EPT it is irrelevant, but if we use it to detect
> > non-present PTEs for regular page tables too, the rename makes perfect
> > sense. Let's rename it then.
>
> Agreed on both counts (leaving it as is for the current name, renaming
> it if used for nonpresent PTEs).
>
> Paolo
--
Gleb.
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v5 10/14] nEPT: Add nEPT violation/misconfiguration support
2013-08-01 12:14 ` Gleb Natapov
@ 2013-08-01 13:13 ` Paolo Bonzini
2013-08-01 13:20 ` Gleb Natapov
0 siblings, 1 reply; 42+ messages in thread
From: Paolo Bonzini @ 2013-08-01 13:13 UTC (permalink / raw)
To: Gleb Natapov; +Cc: Xiao Guangrong, kvm, Jun Nakajima, Yang Zhang
> > > > > > Not sure this explicit "inline" is needed... GCC always inlines small and
> > > > > > static functions.
> > > > >
> > > > > Paolo asked for it, but now I see that I did it in the wrong patch. I do
> > > > > not care much personally either way; I agree with you though, the compiler
> > > > > will inline it anyway.
> > > >
> > > > The point here was that if we use "||" below (or multiple "if"s as I
> > > > suggested in my review), we really want to inline the function to ensure
> > > > that the branches here are merged with the one in the caller's "if()".
> > > >
> > > > With the "|" there is not much effect.
> > >
> > > Even with if() do you really think there is a chance the function will not be
> > > inlined? I see that much much bigger functions are inlined.
> >
> > I don't know, it depends on the compiler flags, how much the function
> > is used... Reducing the chance to zero is not bad.
>
> AFAIK inline is just a hint anyway and the compiler is free to ignore it.
> That is why we have __always_inline, but the compiler should know better
> here; I do not see a reason for __always_inline.
Yes, but with "inline" the compiler will use more generous thresholds.
The GCC docs say that without "inline", -O2 only inlines "functions into
their callers when their body is smaller than expected function call
code (so overall size of program gets smaller)". I'm not at all sure
this is the case for the new is_rsvd_bits_set, and anyway "making the
program smaller" is not the reason why we want the compiler to inline it.
Did you check that the compiler inlines it? Perhaps you should really
use __always_inline since that's what we really want.
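For reference, outside the kernel tree the effect of __always_inline can be reproduced with the GCC/Clang attribute that the kernel macro boils down to; a minimal sketch assuming one of those compilers:

/* Force inlining regardless of the optimizer's size thresholds. */
#define force_inline inline __attribute__((always_inline))

static force_inline int add_one(int x)
{
	return x + 1;	/* inlined into every caller, even at -O0 */
}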
> > Because the function is "bool". I dislike the magic "!= 0"
> > that the compiler adds on conversion to bool. It always seemed
> > like a recipe for trouble since "int" and "bool" are otherwise
> > interchangeable... Without that "!= 0", s/bool/int/ would ignore
> > the upper 32 bits and break.
>
> I actually checked that before proposing.
>
> printf("%d\n", (bool)0x1000000000) prints one, but of course if bool is
> typedefed to int it will not work, but it should not be.
No, it should not be indeed, but not everyone uses bool in the same way;
it is quite common to use "int" for something that is 0/1, and the magic
"!= 0" is dangerous if you cut-and-paste the expression where the compiler
will not do it... It can even be a function argument where you do not
see directly if it is bool, int, u64 or what.
I don't think omitting "!= 0" is common at all in the kernel, so I would
not start doing it here. :)
> > > > On the other hand, it starts to look like useless complication not
> > > > backed by any benchmark (and probably unmeasurable anyway). Neither of
> > > > the conditions are particularly easy to compute, and they are probably
> > > > more expensive than a well-predicted branch. Thus, in this case I
> > > > would prefer to have clearer code and just use two "if"s, the second
> > > > of which can be guarded to be done only for EPT (it doesn't have to
> > > > be an #ifdef, no?).
> > >
> > > return a | b
> > >
> > > is less clear than:
> > >
> > > if (a)
> > > return true;
> > > if (b)
> > > return true;
> >
> > If each of a and b is actually a complicated expression having
> > AND/SHIFT/arrays, then yes. :)
>
> This is not the most complex expression possible :)
> Just having them on different lines is not enough?
No big deal, especially if we plan to merge the present check together.
Paolo
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v5 10/14] nEPT: Add nEPT violation/misconfiguration support
2013-08-01 13:13 ` Paolo Bonzini
@ 2013-08-01 13:20 ` Gleb Natapov
0 siblings, 0 replies; 42+ messages in thread
From: Gleb Natapov @ 2013-08-01 13:20 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: Xiao Guangrong, kvm, Jun Nakajima, Yang Zhang
On Thu, Aug 01, 2013 at 03:13:02PM +0200, Paolo Bonzini wrote:
> > > > > > > Not sure this explicit "inline" is needed... GCC always inlines small and
> > > > > > > static functions.
> > > > > >
> > > > > > Paolo asked for it, but now I see that I did it in the wrong patch. I do
> > > > > > not care much personally either way; I agree with you though, the compiler
> > > > > > will inline it anyway.
> > > > >
> > > > > The point here was that if we use "||" below (or multiple "if"s as I
> > > > > suggested in my review), we really want to inline the function to ensure
> > > > > that the branches here are merged with the one in the caller's "if()".
> > > > >
> > > > > With the "|" there is not much effect.
> > > >
> > > > Even with if() do you really think there is a chance the function will not be
> > > > inlined? I see that much much bigger functions are inlined.
> > >
> > > I don't know, it depends on the compiler flags, how much the function
> > > is used... Reducing the chance to zero is not bad.
> >
> > AFAIK inline is just a hint anyway and the compiler is free to ignore it.
> > That is why we have __always_inline, but the compiler should know better
> > here; I do not see a reason for __always_inline.
>
> Yes, but with "inline" the compiler will use more generous thresholds.
>
> The GCC docs say that without "inline", -O2 only inlines "functions into
> their callers when their body is smaller than expected function call
> code (so overall size of program gets smaller)". I'm not at all sure
> this is the case for the new is_rsvd_bits_set, and anyway "making the
> program smaller" is not the reason why we want the compiler to inline it.
>
> Did you check that the compiler inlines it? Perhaps you should really
> use __always_inline since that's what we really want.
>
Since I am going to use | and not ||, you were saying inline is not
needed, no? But yes, I do see that the compiler inlines it. I see that it
even inlines update_accessed_dirty_bits(), which is much bigger.
> > > Because the function is "bool". I dislike the magic "!= 0"
> > > that the compiler adds on conversion to bool. It always seemed
> > > like a recipe for trouble since "int" and "bool" are otherwise
> > > interchangeable... Without that "!= 0", s/bool/int/ would ignore
> > > the upper 32 bits and break.
> >
> > I actually checked that before proposing.
> >
> > printf("%d\n", (bool)0x1000000000) prints one, but of course if bool is
> > typedefed to int it will not work, but it should not be.
>
> No, it should not be indeed, but not everyone uses bool in the same way;
> it is quite common to use "int" for something that is 0/1, and the magic
> "!= 0" is dangerous if you cut-and-paste the expression where the compiler
> will not do it... It can even be a function argument where you do not
> see directly if it is bool, int, u64 or what.
>
> I don't think omitting "!= 0" is common at all in the kernel, so I would
> not start doing it here. :)
>
OK, will leave != 0. But I checked and Linux defines bool to be _Bool,
so we should be safe here.
--
Gleb.
^ permalink raw reply [flat|nested] 42+ messages in thread
* [PATCH v5 11/14] nEPT: MMU context for nested EPT
2013-07-31 14:48 [PATCH v5 00/14] Nested EPT Gleb Natapov
` (9 preceding siblings ...)
2013-07-31 14:48 ` [PATCH v5 10/14] nEPT: Add nEPT violation/misconfiguration support Gleb Natapov
@ 2013-07-31 14:48 ` Gleb Natapov
2013-08-01 9:16 ` Xiao Guangrong
2013-07-31 14:48 ` [PATCH v5 12/14] nEPT: Advertise EPT to L1 Gleb Natapov
` (2 subsequent siblings)
13 siblings, 1 reply; 42+ messages in thread
From: Gleb Natapov @ 2013-07-31 14:48 UTC (permalink / raw)
To: kvm; +Cc: Xiao Guangrong, Jun Nakajima, Yang Zhang, pbonzini
From: Nadav Har'El <nyh@il.ibm.com>
KVM's existing shadow MMU code already supports nested TDP. To use it, we
need to set up a new "MMU context" for nested EPT, and create a few callbacks
for it (nested_ept_*()). This context should also use the EPT versions of
the page table access functions (defined in the previous patch).
Then, we need to switch back and forth between this nested context and the
regular MMU context when switching between L1 and L2 (when L1 runs this L2
with EPT).
Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
arch/x86/kvm/mmu.c | 26 ++++++++++++++++++++++++++
arch/x86/kvm/mmu.h | 2 ++
arch/x86/kvm/vmx.c | 41 ++++++++++++++++++++++++++++++++++++++++-
3 files changed, 68 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 58ae9db..37fff14 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3792,6 +3792,32 @@ int kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context)
}
EXPORT_SYMBOL_GPL(kvm_init_shadow_mmu);
+int kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context,
+ bool execonly)
+{
+ ASSERT(vcpu);
+ ASSERT(!VALID_PAGE(vcpu->arch.mmu.root_hpa));
+
+ context->shadow_root_level = kvm_x86_ops->get_tdp_level();
+
+ context->nx = true;
+ context->new_cr3 = paging_new_cr3;
+ context->page_fault = ept_page_fault;
+ context->gva_to_gpa = ept_gva_to_gpa;
+ context->sync_page = ept_sync_page;
+ context->invlpg = ept_invlpg;
+ context->update_pte = ept_update_pte;
+ context->free = paging_free;
+ context->root_level = context->shadow_root_level;
+ context->root_hpa = INVALID_PAGE;
+ context->direct_map = false;
+
+ reset_rsvds_bits_mask_ept(vcpu, context, execonly);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(kvm_init_shadow_ept_mmu);
+
static int init_kvm_softmmu(struct kvm_vcpu *vcpu)
{
int r = kvm_init_shadow_mmu(vcpu, vcpu->arch.walk_mmu);
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 5b59c57..77e044a 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -71,6 +71,8 @@ enum {
int handle_mmio_page_fault_common(struct kvm_vcpu *vcpu, u64 addr, bool direct);
int kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context);
+int kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context,
+ bool execonly);
static inline unsigned int kvm_mmu_available_pages(struct kvm *kvm)
{
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index f3514d7..f41751a 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1046,6 +1046,11 @@ static inline bool nested_cpu_has_virtual_nmis(struct vmcs12 *vmcs12,
return vmcs12->pin_based_vm_exec_control & PIN_BASED_VIRTUAL_NMIS;
}
+static inline int nested_cpu_has_ept(struct vmcs12 *vmcs12)
+{
+ return nested_cpu_has2(vmcs12, SECONDARY_EXEC_ENABLE_EPT);
+}
+
static inline bool is_exception(u32 intr_info)
{
return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK))
@@ -7432,6 +7437,33 @@ static void nested_ept_inject_page_fault(struct kvm_vcpu *vcpu,
vmcs12->guest_physical_address = fault->address;
}
+/* Callbacks for nested_ept_init_mmu_context: */
+
+static unsigned long nested_ept_get_cr3(struct kvm_vcpu *vcpu)
+{
+ /* return the page table to be shadowed - in our case, EPT12 */
+ return get_vmcs12(vcpu)->ept_pointer;
+}
+
+static int nested_ept_init_mmu_context(struct kvm_vcpu *vcpu)
+{
+ int r = kvm_init_shadow_ept_mmu(vcpu, &vcpu->arch.mmu,
+ nested_vmx_ept_caps & VMX_EPT_EXECUTE_ONLY_BIT);
+
+ vcpu->arch.mmu.set_cr3 = vmx_set_cr3;
+ vcpu->arch.mmu.get_cr3 = nested_ept_get_cr3;
+ vcpu->arch.mmu.inject_page_fault = nested_ept_inject_page_fault;
+
+ vcpu->arch.walk_mmu = &vcpu->arch.nested_mmu;
+
+ return r;
+}
+
+static void nested_ept_uninit_mmu_context(struct kvm_vcpu *vcpu)
+{
+ vcpu->arch.walk_mmu = &vcpu->arch.mmu;
+}
+
/*
* prepare_vmcs02 is called when the L1 guest hypervisor runs its nested
* L2 guest. L1 has a vmcs for L2 (vmcs12), and this function "merges" it
@@ -7652,6 +7684,11 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
vmx_flush_tlb(vcpu);
}
+ if (nested_cpu_has_ept(vmcs12)) {
+ kvm_mmu_unload(vcpu);
+ nested_ept_init_mmu_context(vcpu);
+ }
+
if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_EFER)
vcpu->arch.efer = vmcs12->guest_ia32_efer;
else if (vmcs12->vm_entry_controls & VM_ENTRY_IA32E_MODE)
@@ -8124,7 +8161,9 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
vcpu->arch.cr4_guest_owned_bits = ~vmcs_readl(CR4_GUEST_HOST_MASK);
kvm_set_cr4(vcpu, vmcs12->host_cr4);
- /* shadow page tables on either EPT or shadow page tables */
+ if (nested_cpu_has_ept(vmcs12))
+ nested_ept_uninit_mmu_context(vcpu);
+
kvm_set_cr3(vcpu, vmcs12->host_cr3);
kvm_mmu_reset_context(vcpu);
--
1.7.10.4
^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: [PATCH v5 11/14] nEPT: MMU context for nested EPT
2013-07-31 14:48 ` [PATCH v5 11/14] nEPT: MMU context for nested EPT Gleb Natapov
@ 2013-08-01 9:16 ` Xiao Guangrong
2013-08-01 9:37 ` Gleb Natapov
2013-08-01 9:51 ` Xiao Guangrong
0 siblings, 2 replies; 42+ messages in thread
From: Xiao Guangrong @ 2013-08-01 9:16 UTC (permalink / raw)
To: Gleb Natapov; +Cc: kvm, Jun Nakajima, Yang Zhang, pbonzini
On 07/31/2013 10:48 PM, Gleb Natapov wrote:
> From: Nadav Har'El <nyh@il.ibm.com>
>
> KVM's existing shadow MMU code already supports nested TDP. To use it, we
> need to set up a new "MMU context" for nested EPT, and create a few callbacks
> for it (nested_ept_*()). This context should also use the EPT versions of
> the page table access functions (defined in the previous patch).
> Then, we need to switch back and forth between this nested context and the
> regular MMU context when switching between L1 and L2 (when L1 runs this L2
> with EPT).
This patch looks good to me.
Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
But I am confused that update_permission_bitmask() is not adjusted in this
series. That function depends on kvm_read_cr4_bits(X86_CR4_SMEP) and
is_write_protection(); these two functions read the registers from the
L2 guest, so using the L2 status to check L1's page table seems strange.
The same issue exists in nested NPT. Anything I missed?
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v5 11/14] nEPT: MMU context for nested EPT
2013-08-01 9:16 ` Xiao Guangrong
@ 2013-08-01 9:37 ` Gleb Natapov
2013-08-01 9:51 ` Xiao Guangrong
1 sibling, 0 replies; 42+ messages in thread
From: Gleb Natapov @ 2013-08-01 9:37 UTC (permalink / raw)
To: Xiao Guangrong; +Cc: kvm, Jun Nakajima, Yang Zhang, pbonzini
On Thu, Aug 01, 2013 at 05:16:07PM +0800, Xiao Guangrong wrote:
> On 07/31/2013 10:48 PM, Gleb Natapov wrote:
> > From: Nadav Har'El <nyh@il.ibm.com>
> >
> > KVM's existing shadow MMU code already supports nested TDP. To use it, we
> > need to set up a new "MMU context" for nested EPT, and create a few callbacks
> > for it (nested_ept_*()). This context should also use the EPT versions of
> > the page table access functions (defined in the previous patch).
> > Then, we need to switch back and forth between this nested context and the
> > regular MMU context when switching between L1 and L2 (when L1 runs this L2
> > with EPT).
>
> This patch looks good to me.
>
> Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
>
> But I am confused that update_permission_bitmask() is not adjusted in this
> series. That function depends on kvm_read_cr4_bits(X86_CR4_SMEP) and
> is_write_protection(); these two functions read the registers from the
> L2 guest, so using the L2 status to check L1's page table seems strange.
> The same issue exists in nested NPT. Anything I missed?
Good catch again. Looks like we need update_permission_bitmask_ept()
that uses different logic to calculate permissions.
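A rough sketch of the difference being pointed at: EPT permissions depend only
on the XWR bits accumulated over the walk, with no user/supervisor distinction
and no CR0.WP or CR4.SMEP influence, so an EPT variant of the bitmask would not
consult vCPU register state at all. Illustrative only, not the eventual patch:

#include <stdbool.h>

/* Access type of the faulting operation. */
enum ept_access { EPT_READ, EPT_WRITE, EPT_FETCH };

/* pte_access carries the XWR bits ANDed over the whole EPT walk:
 * bit 0 = read, bit 1 = write, bit 2 = execute. */
static bool ept_permission_fault(unsigned int pte_access, enum ept_access acc)
{
	switch (acc) {
	case EPT_READ:	return !(pte_access & 1);
	case EPT_WRITE:	return !(pte_access & 2);
	case EPT_FETCH:	return !(pte_access & 4);
	}
	return true;
}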
--
Gleb.
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v5 11/14] nEPT: MMU context for nested EPT
2013-08-01 9:16 ` Xiao Guangrong
2013-08-01 9:37 ` Gleb Natapov
@ 2013-08-01 9:51 ` Xiao Guangrong
1 sibling, 0 replies; 42+ messages in thread
From: Xiao Guangrong @ 2013-08-01 9:51 UTC (permalink / raw)
To: Xiao Guangrong; +Cc: Gleb Natapov, kvm, Jun Nakajima, Yang Zhang, pbonzini
On 08/01/2013 05:16 PM, Xiao Guangrong wrote:
> On 07/31/2013 10:48 PM, Gleb Natapov wrote:
>> From: Nadav Har'El <nyh@il.ibm.com>
>>
>> KVM's existing shadow MMU code already supports nested TDP. To use it, we
>> need to set up a new "MMU context" for nested EPT, and create a few callbacks
>> for it (nested_ept_*()). This context should also use the EPT versions of
>> the page table access functions (defined in the previous patch).
>> Then, we need to switch back and forth between this nested context and the
>> regular MMU context when switching between L1 and L2 (when L1 runs this L2
>> with EPT).
>
> This patch looks good to me.
>
> Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
>
> But I am confused that update_permission_bitmask() is not adjusted in this
> series. That function depends on kvm_read_cr4_bits(X86_CR4_SMEP) and
> is_write_protection(); these two functions read the registers from the
> L2 guest, so using the L2 status to check L1's page table seems strange.
> The same issue exists in nested NPT. Anything I missed?
After checking the code, I found vcpu->arch.mmu is not updated when switching
to the nested MMU; that means my earlier "using the L2 status to check L1's
page table seems strange" comment is wrong. That is fine for nested NPT, but
nested EPT should adjust the logic anyway.
^ permalink raw reply [flat|nested] 42+ messages in thread
* [PATCH v5 12/14] nEPT: Advertise EPT to L1
2013-07-31 14:48 [PATCH v5 00/14] Nested EPT Gleb Natapov
` (10 preceding siblings ...)
2013-07-31 14:48 ` [PATCH v5 11/14] nEPT: MMU context for nested EPT Gleb Natapov
@ 2013-07-31 14:48 ` Gleb Natapov
2013-07-31 14:48 ` [PATCH v5 13/14] nEPT: Some additional comments Gleb Natapov
2013-07-31 14:48 ` [PATCH v5 14/14] nEPT: Miscellaneous cleanups Gleb Natapov
13 siblings, 0 replies; 42+ messages in thread
From: Gleb Natapov @ 2013-07-31 14:48 UTC (permalink / raw)
To: kvm; +Cc: Xiao Guangrong, Jun Nakajima, Yang Zhang, pbonzini
From: Nadav Har'El <nyh@il.ibm.com>
Advertise the support of EPT to the L1 guest, through the appropriate MSR.
This is the last patch of the basic Nested EPT feature, so as to allow
bisection through this patch series: The guest will not see EPT support until
this last patch, and will not attempt to use the half-applied feature.
Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
arch/x86/kvm/vmx.c | 20 ++++++++++++++++++--
1 file changed, 18 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index f41751a..1336ec0 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2249,6 +2249,22 @@ static __init void nested_vmx_setup_ctls_msrs(void)
SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |
SECONDARY_EXEC_WBINVD_EXITING;
+ if (enable_ept) {
+ /* nested EPT: emulate EPT also to L1 */
+ nested_vmx_secondary_ctls_high |= SECONDARY_EXEC_ENABLE_EPT;
+ nested_vmx_ept_caps = VMX_EPT_PAGE_WALK_4_BIT;
+ nested_vmx_ept_caps |= VMX_EPT_INVEPT_BIT;
+ nested_vmx_ept_caps &= vmx_capability.ept;
+ /*
+ * Since invept is completely emulated we support both global
+ * and context invalidation independent of what host cpu
+ * supports
+ */
+ nested_vmx_ept_caps |= VMX_EPT_EXTENT_GLOBAL_BIT |
+ VMX_EPT_EXTENT_CONTEXT_BIT;
+ } else
+ nested_vmx_ept_caps = 0;
+
/* miscellaneous data */
rdmsr(MSR_IA32_VMX_MISC, nested_vmx_misc_low, nested_vmx_misc_high);
nested_vmx_misc_low &= VMX_MISC_PREEMPTION_TIMER_RATE_MASK |
@@ -2357,8 +2373,8 @@ static int vmx_get_vmx_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata)
nested_vmx_secondary_ctls_high);
break;
case MSR_IA32_VMX_EPT_VPID_CAP:
- /* Currently, no nested ept or nested vpid */
- *pdata = 0;
+ /* Currently, no nested vpid support */
+ *pdata = nested_vmx_ept_caps;
break;
default:
return 0;
--
1.7.10.4
^ permalink raw reply related [flat|nested] 42+ messages in thread
* [PATCH v5 13/14] nEPT: Some additional comments
2013-07-31 14:48 [PATCH v5 00/14] Nested EPT Gleb Natapov
` (11 preceding siblings ...)
2013-07-31 14:48 ` [PATCH v5 12/14] nEPT: Advertise EPT to L1 Gleb Natapov
@ 2013-07-31 14:48 ` Gleb Natapov
2013-07-31 14:48 ` [PATCH v5 14/14] nEPT: Miscellaneous cleanups Gleb Natapov
13 siblings, 0 replies; 42+ messages in thread
From: Gleb Natapov @ 2013-07-31 14:48 UTC (permalink / raw)
To: kvm; +Cc: Xiao Guangrong, Jun Nakajima, Yang Zhang, pbonzini
From: Nadav Har'El <nyh@il.ibm.com>
Some additional comments to preexisting code:
Explain who (L0 or L1) handles EPT violation and misconfiguration exits.
Don't mention "shadow on either EPT or shadow" as the only two options.
Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
arch/x86/kvm/vmx.c | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 1336ec0..83d49ff 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -6662,7 +6662,20 @@ static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu)
return nested_cpu_has2(vmcs12,
SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES);
case EXIT_REASON_EPT_VIOLATION:
+ /*
+ * L0 always deals with the EPT violation. If nested EPT is
+ * used, and the nested mmu code discovers that the address is
+ * missing in the guest EPT table (EPT12), the EPT violation
+ * will be injected with nested_ept_inject_page_fault()
+ */
+ return 0;
case EXIT_REASON_EPT_MISCONFIG:
+ /*
+ * L2 never uses directly L1's EPT, but rather L0's own EPT
+ * table (shadow on EPT) or a merged EPT table that L0 built
+ * (EPT on EPT). So any problems with the structure of the
+ * table is L0's fault.
+ */
return 0;
case EXIT_REASON_PREEMPTION_TIMER:
return vmcs12->pin_based_vm_exec_control &
--
1.7.10.4
^ permalink raw reply related [flat|nested] 42+ messages in thread
* [PATCH v5 14/14] nEPT: Miscellaneous cleanups
2013-07-31 14:48 [PATCH v5 00/14] Nested EPT Gleb Natapov
` (12 preceding siblings ...)
2013-07-31 14:48 ` [PATCH v5 13/14] nEPT: Some additional comments Gleb Natapov
@ 2013-07-31 14:48 ` Gleb Natapov
13 siblings, 0 replies; 42+ messages in thread
From: Gleb Natapov @ 2013-07-31 14:48 UTC (permalink / raw)
To: kvm; +Cc: Xiao Guangrong, Jun Nakajima, Yang Zhang, pbonzini
From: Nadav Har'El <nyh@il.ibm.com>
Some trivial code cleanups not really related to nested EPT.
Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
arch/x86/kvm/vmx.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 83d49ff..4545532 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -715,7 +715,6 @@ static void nested_release_page_clean(struct page *page)
static u64 construct_eptp(unsigned long root_hpa);
static void kvm_cpu_vmxon(u64 addr);
static void kvm_cpu_vmxoff(void);
-static void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3);
static int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr);
static void vmx_set_segment(struct kvm_vcpu *vcpu,
struct kvm_segment *var, int seg);
@@ -1040,8 +1039,7 @@ static inline bool nested_cpu_has2(struct vmcs12 *vmcs12, u32 bit)
(vmcs12->secondary_vm_exec_control & bit);
}
-static inline bool nested_cpu_has_virtual_nmis(struct vmcs12 *vmcs12,
- struct kvm_vcpu *vcpu)
+static inline bool nested_cpu_has_virtual_nmis(struct vmcs12 *vmcs12)
{
return vmcs12->pin_based_vm_exec_control & PIN_BASED_VIRTUAL_NMIS;
}
@@ -6763,7 +6761,7 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu)
if (unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked &&
!(is_guest_mode(vcpu) && nested_cpu_has_virtual_nmis(
- get_vmcs12(vcpu), vcpu)))) {
+ get_vmcs12(vcpu))))) {
if (vmx_interrupt_allowed(vcpu)) {
vmx->soft_vnmi_blocked = 0;
} else if (vmx->vnmi_blocked_time > 1000000000LL &&
--
1.7.10.4
^ permalink raw reply related [flat|nested] 42+ messages in thread