kvm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] KVM: nVMX: fix CR3 load if L2 uses PAE paging and EPT
@ 2016-11-24 13:29 Ladi Prosek
  2016-11-24 16:28 ` Paolo Bonzini
  2016-11-24 18:25 ` Radim Krčmář
  0 siblings, 2 replies; 7+ messages in thread
From: Ladi Prosek @ 2016-11-24 13:29 UTC (permalink / raw)
  To: kvm; +Cc: pbonzini, rkrcmar, bsd

KVM does not correctly handle L1 hypervisors that emulate L2 real mode with
PAE and EPT, such as Hyper-V. In this mode, the L1 hypervisor populates guest
PDPTE VMCS fields and leaves guest CR3 uninitialized because it is not used
(see 26.3.2.4 Loading Page-Directory-Pointer-Table Entries). KVM always
dereferences CR3 and tries to load PDPTEs if PAE is on. This leads to two
related issues:

1) On the first nested vmentry, the guest PDPTEs, as populated by L1, are
overwritten in ept_load_pdptrs because the registers are believed to have
been loaded in load_pdptrs as part of kvm_set_cr3. This is incorrect. L2 is
running with PAE enabled but PDPTRs have been set up by L1.

2) When L2 is about to enable paging and loads its CR3, we, again, attempt
to load PDPTEs in load_pdptrs called from kvm_set_cr3. There are no guarantees
that this will succeed (it's just a CR3 load, paging is not enabled yet) and
if it doesn't, kvm_set_cr3 returns early without persisting the CR3 which is
then lost and L2 crashes right after it enables paging.

This patch replaces the kvm_set_cr3 call with a simple register write if PAE
and EPT are both on. CR3 is not to be interpreted in this case.

Signed-off-by: Ladi Prosek <lprosek@redhat.com>
---

tl;dr This makes Hyper-V (Windows Server 2016) work on KVM.

 arch/x86/kvm/vmx.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 81fbda0..d4ad6a9 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -9808,6 +9808,7 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	u32 exec_control;
+	u32 nested_ept_enabled = 0;
 
 	vmcs_write16(GUEST_ES_SELECTOR, vmcs12->guest_es_selector);
 	vmcs_write16(GUEST_CS_SELECTOR, vmcs12->guest_cs_selector);
@@ -9972,6 +9973,7 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
 				vmcs12->guest_intr_status);
 		}
 
+		nested_ept_enabled = exec_control & SECONDARY_EXEC_ENABLE_EPT;
 		vmcs_write32(SECONDARY_VM_EXEC_CONTROL, exec_control);
 	}
 
@@ -10113,8 +10115,17 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
 	vmx_set_cr4(vcpu, vmcs12->guest_cr4);
 	vmcs_writel(CR4_READ_SHADOW, nested_read_cr4(vmcs12));
 
-	/* shadow page tables on either EPT or shadow page tables */
-	kvm_set_cr3(vcpu, vmcs12->guest_cr3);
+	/*
+	 * Shadow page tables on either EPT or shadow page tables.
+	 * If PAE and EPT are both on, CR3 is not used by the CPU and must not
+	 * be dereferenced.
+	 */
+	if (is_pae(vcpu) && is_paging(vcpu) && nested_ept_enabled) {
+		vcpu->arch.cr3 = vmcs12->guest_cr3;
+		__set_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail);
+	} else
+		kvm_set_cr3(vcpu, vmcs12->guest_cr3);
+
 	kvm_mmu_reset_context(vcpu);
 
 	if (!enable_ept)
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] KVM: nVMX: fix CR3 load if L2 uses PAE paging and EPT
  2016-11-24 13:29 [PATCH] KVM: nVMX: fix CR3 load if L2 uses PAE paging and EPT Ladi Prosek
@ 2016-11-24 16:28 ` Paolo Bonzini
  2016-11-24 18:25 ` Radim Krčmář
  1 sibling, 0 replies; 7+ messages in thread
From: Paolo Bonzini @ 2016-11-24 16:28 UTC (permalink / raw)
  To: Ladi Prosek; +Cc: kvm, rkrcmar, bsd


> KVM does not correctly handle L1 hypervisors that emulate L2 real mode with
> PAE and EPT, such as Hyper-V. In this mode, the L1 hypervisor populates guest
> PDPTE VMCS fields and leaves guest CR3 uninitialized because it is not used
> (see 26.3.2.4 Loading Page-Directory-Pointer-Table Entries). KVM always
> dereferences CR3 and tries to load PDPTEs if PAE is on. This leads to two
> related issues:
> 
> 1) On the first nested vmentry, the guest PDPTEs, as populated by L1, are
> overwritten in ept_load_pdptrs because the registers are believed to have
> been loaded in load_pdptrs as part of kvm_set_cr3. This is incorrect. L2 is
> running with PAE enabled but PDPTRs have been set up by L1.
> 
> 2) When L2 is about to enable paging and loads its CR3, we, again, attempt
> to load PDPTEs in load_pdptrs called from kvm_set_cr3. There are no
> guarantees
> that this will succeed (it's just a CR3 load, paging is not enabled yet) and
> if it doesn't, kvm_set_cr3 returns early without persisting the CR3 which is
> then lost and L2 crashes right after it enables paging.
> 
> This patch replaces the kvm_set_cr3 call with a simple register write if PAE
> and EPT are both on. CR3 is not to be interpreted in this case.
> 
> Signed-off-by: Ladi Prosek <lprosek@redhat.com>

Wow, that's weird!  And a very nice analysis.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

Thanks,

Paolo

> ---
> 
> tl;dr This makes Hyper-V (Windows Server 2016) work on KVM.
> 
>  arch/x86/kvm/vmx.c | 15 +++++++++++++--
>  1 file changed, 13 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 81fbda0..d4ad6a9 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -9808,6 +9808,7 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu,
> struct vmcs12 *vmcs12)
>  {
>  	struct vcpu_vmx *vmx = to_vmx(vcpu);
>  	u32 exec_control;
> +	u32 nested_ept_enabled = 0;
>  
>  	vmcs_write16(GUEST_ES_SELECTOR, vmcs12->guest_es_selector);
>  	vmcs_write16(GUEST_CS_SELECTOR, vmcs12->guest_cs_selector);
> @@ -9972,6 +9973,7 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu,
> struct vmcs12 *vmcs12)
>  				vmcs12->guest_intr_status);
>  		}
>  
> +		nested_ept_enabled = exec_control & SECONDARY_EXEC_ENABLE_EPT;
>  		vmcs_write32(SECONDARY_VM_EXEC_CONTROL, exec_control);
>  	}
>  
> @@ -10113,8 +10115,17 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu,
> struct vmcs12 *vmcs12)
>  	vmx_set_cr4(vcpu, vmcs12->guest_cr4);
>  	vmcs_writel(CR4_READ_SHADOW, nested_read_cr4(vmcs12));
>  
> -	/* shadow page tables on either EPT or shadow page tables */
> -	kvm_set_cr3(vcpu, vmcs12->guest_cr3);
> +	/*
> +	 * Shadow page tables on either EPT or shadow page tables.
> +	 * If PAE and EPT are both on, CR3 is not used by the CPU and must not
> +	 * be dereferenced.
> +	 */
> +	if (is_pae(vcpu) && is_paging(vcpu) && nested_ept_enabled) {
> +		vcpu->arch.cr3 = vmcs12->guest_cr3;
> +		__set_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail);
> +	} else
> +		kvm_set_cr3(vcpu, vmcs12->guest_cr3);
> +
>  	kvm_mmu_reset_context(vcpu);
>  
>  	if (!enable_ept)
> --
> 2.7.4
> 
> 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] KVM: nVMX: fix CR3 load if L2 uses PAE paging and EPT
  2016-11-24 13:29 [PATCH] KVM: nVMX: fix CR3 load if L2 uses PAE paging and EPT Ladi Prosek
  2016-11-24 16:28 ` Paolo Bonzini
@ 2016-11-24 18:25 ` Radim Krčmář
  2016-11-25  8:44   ` Ladi Prosek
  1 sibling, 1 reply; 7+ messages in thread
From: Radim Krčmář @ 2016-11-24 18:25 UTC (permalink / raw)
  To: Ladi Prosek; +Cc: kvm, pbonzini, bsd

2016-11-24 14:29+0100, Ladi Prosek:
> KVM does not correctly handle L1 hypervisors that emulate L2 real mode with
> PAE and EPT, such as Hyper-V. In this mode, the L1 hypervisor populates guest
> PDPTE VMCS fields and leaves guest CR3 uninitialized because it is not used
> (see 26.3.2.4 Loading Page-Directory-Pointer-Table Entries). KVM always
> dereferences CR3 and tries to load PDPTEs if PAE is on. This leads to two
> related issues:
> 
> 1) On the first nested vmentry, the guest PDPTEs, as populated by L1, are
> overwritten in ept_load_pdptrs because the registers are believed to have
> been loaded in load_pdptrs as part of kvm_set_cr3. This is incorrect. L2 is
> running with PAE enabled but PDPTRs have been set up by L1.
> 
> 2) When L2 is about to enable paging and loads its CR3, we, again, attempt
> to load PDPTEs in load_pdptrs called from kvm_set_cr3. There are no guarantees
> that this will succeed (it's just a CR3 load, paging is not enabled yet) and
> if it doesn't, kvm_set_cr3 returns early without persisting the CR3 which is
> then lost and L2 crashes right after it enables paging.
> 
> This patch replaces the kvm_set_cr3 call with a simple register write if PAE
> and EPT are both on. CR3 is not to be interpreted in this case.
> 
> Signed-off-by: Ladi Prosek <lprosek@redhat.com>
> ---
> 
> tl;dr This makes Hyper-V (Windows Server 2016) work on KVM.

Great work!

> @@ -10113,8 +10115,17 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
>  	vmx_set_cr4(vcpu, vmcs12->guest_cr4);
>  	vmcs_writel(CR4_READ_SHADOW, nested_read_cr4(vmcs12));
>  
> -	/* shadow page tables on either EPT or shadow page tables */
> -	kvm_set_cr3(vcpu, vmcs12->guest_cr3);
> +	/*
> +	 * Shadow page tables on either EPT or shadow page tables.
> +	 * If PAE and EPT are both on, CR3 is not used by the CPU and must not
> +	 * be dereferenced.
> +	 */
> +	if (is_pae(vcpu) && is_paging(vcpu) && nested_ept_enabled) {

The spec also has a "&& !is_long_mode(vcpu)" condition for PAE paging.
I'm pretty sure that is_pae() and is_paging() is set in long mode ...
the stuff in kvm_set_cr3() is not needed even then?

> +		vcpu->arch.cr3 = vmcs12->guest_cr3;
> +		__set_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail);
> +	} else
> +		kvm_set_cr3(vcpu, vmcs12->guest_cr3);
> +
>  	kvm_mmu_reset_context(vcpu);
>  
>  	if (!enable_ept)
> -- 
> 2.7.4
> 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] KVM: nVMX: fix CR3 load if L2 uses PAE paging and EPT
  2016-11-24 18:25 ` Radim Krčmář
@ 2016-11-25  8:44   ` Ladi Prosek
  2016-11-25 14:15     ` Radim Krčmář
  0 siblings, 1 reply; 7+ messages in thread
From: Ladi Prosek @ 2016-11-25  8:44 UTC (permalink / raw)
  To: Radim Krčmář; +Cc: KVM list, Paolo Bonzini, Bandan Das

On Thu, Nov 24, 2016 at 7:25 PM, Radim Krčmář <rkrcmar@redhat.com> wrote:
> 2016-11-24 14:29+0100, Ladi Prosek:
>> KVM does not correctly handle L1 hypervisors that emulate L2 real mode with
>> PAE and EPT, such as Hyper-V. In this mode, the L1 hypervisor populates guest
>> PDPTE VMCS fields and leaves guest CR3 uninitialized because it is not used
>> (see 26.3.2.4 Loading Page-Directory-Pointer-Table Entries). KVM always
>> dereferences CR3 and tries to load PDPTEs if PAE is on. This leads to two
>> related issues:
>>
>> 1) On the first nested vmentry, the guest PDPTEs, as populated by L1, are
>> overwritten in ept_load_pdptrs because the registers are believed to have
>> been loaded in load_pdptrs as part of kvm_set_cr3. This is incorrect. L2 is
>> running with PAE enabled but PDPTRs have been set up by L1.
>>
>> 2) When L2 is about to enable paging and loads its CR3, we, again, attempt
>> to load PDPTEs in load_pdptrs called from kvm_set_cr3. There are no guarantees
>> that this will succeed (it's just a CR3 load, paging is not enabled yet) and
>> if it doesn't, kvm_set_cr3 returns early without persisting the CR3 which is
>> then lost and L2 crashes right after it enables paging.
>>
>> This patch replaces the kvm_set_cr3 call with a simple register write if PAE
>> and EPT are both on. CR3 is not to be interpreted in this case.
>>
>> Signed-off-by: Ladi Prosek <lprosek@redhat.com>
>> ---
>>
>> tl;dr This makes Hyper-V (Windows Server 2016) work on KVM.
>
> Great work!
>
>> @@ -10113,8 +10115,17 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
>>       vmx_set_cr4(vcpu, vmcs12->guest_cr4);
>>       vmcs_writel(CR4_READ_SHADOW, nested_read_cr4(vmcs12));
>>
>> -     /* shadow page tables on either EPT or shadow page tables */
>> -     kvm_set_cr3(vcpu, vmcs12->guest_cr3);
>> +     /*
>> +      * Shadow page tables on either EPT or shadow page tables.
>> +      * If PAE and EPT are both on, CR3 is not used by the CPU and must not
>> +      * be dereferenced.
>> +      */
>> +     if (is_pae(vcpu) && is_paging(vcpu) && nested_ept_enabled) {
>
> The spec also has a "&& !is_long_mode(vcpu)" condition for PAE paging.
> I'm pretty sure that is_pae() and is_paging() is set in long mode ...
> the stuff in kvm_set_cr3() is not needed even then?

Good catch, yes the spec also has !is_long_mode and it should be here
for completeness. It won't make much of a difference though, I believe
that kvm_set_cr3 is to a large extent unnecessary here.

What kvm_set_cr3 does:

* conditional  kvm_mmu_sync_roots + kvm_make_request(KVM_REQ_TLB_FLUSH)
** kvm_mmu_sync_roots will run anyway: kvm_mmu_load <- kvm_mmu_reload
<- vcpu_enter_guest
** tlb flush will be done anyway: vmx_flush_tlb <- vmx_set_cr3 <-
kvm_mmu_load <- kvm_mmu_reload <- vcpu_enter_guest

* in long mode, it fails if (cr3 & CR3_L_MODE_RESERVED_BITS)
** nobody checks the return value
** Intel manual says "Reserved bits in CR0 and CR3 remain clear after
any load of those registers; attempts to set them have no impact."
Should we just clear the bits and carry on then? This is in conflict
with "#GP(0) .. If an attempt is made to write a 1 to any reserved bit
in CR3." Hmm.

* if (is_pae(vcpu) && is_paging(vcpu), load_pdptrs is executed
** seems legit!

* kvm_mmu_new_cr3
** will run anyway, handled in kvm_mmu_reset_context called right
after kvm_set_cr3

I am tempted to make more invasive changes. What do you think?

>> +             vcpu->arch.cr3 = vmcs12->guest_cr3;
>> +             __set_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail);
>> +     } else
>> +             kvm_set_cr3(vcpu, vmcs12->guest_cr3);
>> +
>>       kvm_mmu_reset_context(vcpu);
>>
>>       if (!enable_ept)
>> --
>> 2.7.4
>>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] KVM: nVMX: fix CR3 load if L2 uses PAE paging and EPT
  2016-11-25  8:44   ` Ladi Prosek
@ 2016-11-25 14:15     ` Radim Krčmář
  2016-11-28 18:12       ` Ladi Prosek
  0 siblings, 1 reply; 7+ messages in thread
From: Radim Krčmář @ 2016-11-25 14:15 UTC (permalink / raw)
  To: Ladi Prosek; +Cc: KVM list, Paolo Bonzini, Bandan Das

2016-11-25 09:44+0100, Ladi Prosek:
> On Thu, Nov 24, 2016 at 7:25 PM, Radim Krčmář <rkrcmar@redhat.com> wrote:
>> 2016-11-24 14:29+0100, Ladi Prosek:
>>> KVM does not correctly handle L1 hypervisors that emulate L2 real mode with
>>> PAE and EPT, such as Hyper-V. In this mode, the L1 hypervisor populates guest
>>> PDPTE VMCS fields and leaves guest CR3 uninitialized because it is not used
>>> (see 26.3.2.4 Loading Page-Directory-Pointer-Table Entries). KVM always
>>> dereferences CR3 and tries to load PDPTEs if PAE is on. This leads to two
>>> related issues:
>>>
>>> 1) On the first nested vmentry, the guest PDPTEs, as populated by L1, are
>>> overwritten in ept_load_pdptrs because the registers are believed to have
>>> been loaded in load_pdptrs as part of kvm_set_cr3. This is incorrect. L2 is
>>> running with PAE enabled but PDPTRs have been set up by L1.
>>>
>>> 2) When L2 is about to enable paging and loads its CR3, we, again, attempt
>>> to load PDPTEs in load_pdptrs called from kvm_set_cr3. There are no guarantees
>>> that this will succeed (it's just a CR3 load, paging is not enabled yet) and
>>> if it doesn't, kvm_set_cr3 returns early without persisting the CR3 which is
>>> then lost and L2 crashes right after it enables paging.
>>>
>>> This patch replaces the kvm_set_cr3 call with a simple register write if PAE
>>> and EPT are both on. CR3 is not to be interpreted in this case.
>>>
>>> Signed-off-by: Ladi Prosek <lprosek@redhat.com>
>>> ---
>>>
>>> tl;dr This makes Hyper-V (Windows Server 2016) work on KVM.
>>
>> Great work!
>>
>>> @@ -10113,8 +10115,17 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
>>>       vmx_set_cr4(vcpu, vmcs12->guest_cr4);
>>>       vmcs_writel(CR4_READ_SHADOW, nested_read_cr4(vmcs12));
>>>
>>> -     /* shadow page tables on either EPT or shadow page tables */
>>> -     kvm_set_cr3(vcpu, vmcs12->guest_cr3);
>>> +     /*
>>> +      * Shadow page tables on either EPT or shadow page tables.
>>> +      * If PAE and EPT are both on, CR3 is not used by the CPU and must not
>>> +      * be dereferenced.
>>> +      */
>>> +     if (is_pae(vcpu) && is_paging(vcpu) && nested_ept_enabled) {
>>
>> The spec also has a "&& !is_long_mode(vcpu)" condition for PAE paging.
>> I'm pretty sure that is_pae() and is_paging() is set in long mode ...
>> the stuff in kvm_set_cr3() is not needed even then?
> 
> Good catch, yes the spec also has !is_long_mode and it should be here
> for completeness. It won't make much of a difference though, I believe
> that kvm_set_cr3 is to a large extent unnecessary here.

I agree.  The patch is in kvm/queue to get some testing -- I'll replace
it if you write an update.

> What kvm_set_cr3 does:
> 
> * conditional  kvm_mmu_sync_roots + kvm_make_request(KVM_REQ_TLB_FLUSH)
> ** kvm_mmu_sync_roots will run anyway: kvm_mmu_load <- kvm_mmu_reload
> <- vcpu_enter_guest
> ** tlb flush will be done anyway: vmx_flush_tlb <- vmx_set_cr3 <-
> kvm_mmu_load <- kvm_mmu_reload <- vcpu_enter_guest
> 
> * in long mode, it fails if (cr3 & CR3_L_MODE_RESERVED_BITS)
> ** nobody checks the return value
> ** Intel manual says "Reserved bits in CR0 and CR3 remain clear after
> any load of those registers; attempts to set them have no impact."
> Should we just clear the bits and carry on then? This is in conflict
> with "#GP(0) .. If an attempt is made to write a 1 to any reserved bit
> in CR3." Hmm.

The spec is quite clear on this.  26.3.1.1 Checks on Guest Control
Registers, Debug Registers, and MSRs:

  The following checks are performed on processors that support Intel 64
  architecture: The CR3 field must be such that bits 63:52 and bits in
  the range 51:32 beyond the processor’s physical-address width are 0.

To verify, I tried these two options on top of vmx_vcpu_run

  vmcs_writel(GUEST_CR3, vmcs_readl(GUEST_CR3) | 1UL << boot_cpu_data.x86_phys_bits);
  vmcs_writel(GUEST_CR3, vmcs_readl(GUEST_CR3) | CR3_PCID_INVD);

and both failed VM entry.  We should fail the nested VM entry as well
and use cpuid_maxphyaddr() to determine when.

(And I have a bad feeling that guest's physical address width is not
 being limited by hardware's ...)

> * if (is_pae(vcpu) && is_paging(vcpu), load_pdptrs is executed
> ** seems legit!
> 
> * kvm_mmu_new_cr3
> ** will run anyway, handled in kvm_mmu_reset_context called right
> after kvm_set_cr3
> 
> I am tempted to make more invasive changes. What do you think?

Go ahead!

Thanks.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] KVM: nVMX: fix CR3 load if L2 uses PAE paging and EPT
  2016-11-25 14:15     ` Radim Krčmář
@ 2016-11-28 18:12       ` Ladi Prosek
  2016-11-29 16:15         ` Radim Krčmář
  0 siblings, 1 reply; 7+ messages in thread
From: Ladi Prosek @ 2016-11-28 18:12 UTC (permalink / raw)
  To: Radim Krčmář; +Cc: KVM list, Paolo Bonzini, Bandan Das

On Fri, Nov 25, 2016 at 3:15 PM, Radim Krčmář <rkrcmar@redhat.com> wrote:
> 2016-11-25 09:44+0100, Ladi Prosek:
>> On Thu, Nov 24, 2016 at 7:25 PM, Radim Krčmář <rkrcmar@redhat.com> wrote:
>>> 2016-11-24 14:29+0100, Ladi Prosek:
>>>> KVM does not correctly handle L1 hypervisors that emulate L2 real mode with
>>>> PAE and EPT, such as Hyper-V. In this mode, the L1 hypervisor populates guest
>>>> PDPTE VMCS fields and leaves guest CR3 uninitialized because it is not used
>>>> (see 26.3.2.4 Loading Page-Directory-Pointer-Table Entries). KVM always
>>>> dereferences CR3 and tries to load PDPTEs if PAE is on. This leads to two
>>>> related issues:
>>>>
>>>> 1) On the first nested vmentry, the guest PDPTEs, as populated by L1, are
>>>> overwritten in ept_load_pdptrs because the registers are believed to have
>>>> been loaded in load_pdptrs as part of kvm_set_cr3. This is incorrect. L2 is
>>>> running with PAE enabled but PDPTRs have been set up by L1.
>>>>
>>>> 2) When L2 is about to enable paging and loads its CR3, we, again, attempt
>>>> to load PDPTEs in load_pdptrs called from kvm_set_cr3. There are no guarantees
>>>> that this will succeed (it's just a CR3 load, paging is not enabled yet) and
>>>> if it doesn't, kvm_set_cr3 returns early without persisting the CR3 which is
>>>> then lost and L2 crashes right after it enables paging.
>>>>
>>>> This patch replaces the kvm_set_cr3 call with a simple register write if PAE
>>>> and EPT are both on. CR3 is not to be interpreted in this case.
>>>>
>>>> Signed-off-by: Ladi Prosek <lprosek@redhat.com>
>>>> ---
>>>>
>>>> tl;dr This makes Hyper-V (Windows Server 2016) work on KVM.
>>>
>>> Great work!
>>>
>>>> @@ -10113,8 +10115,17 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
>>>>       vmx_set_cr4(vcpu, vmcs12->guest_cr4);
>>>>       vmcs_writel(CR4_READ_SHADOW, nested_read_cr4(vmcs12));
>>>>
>>>> -     /* shadow page tables on either EPT or shadow page tables */
>>>> -     kvm_set_cr3(vcpu, vmcs12->guest_cr3);
>>>> +     /*
>>>> +      * Shadow page tables on either EPT or shadow page tables.
>>>> +      * If PAE and EPT are both on, CR3 is not used by the CPU and must not
>>>> +      * be dereferenced.
>>>> +      */
>>>> +     if (is_pae(vcpu) && is_paging(vcpu) && nested_ept_enabled) {
>>>
>>> The spec also has a "&& !is_long_mode(vcpu)" condition for PAE paging.
>>> I'm pretty sure that is_pae() and is_paging() is set in long mode ...
>>> the stuff in kvm_set_cr3() is not needed even then?
>>
>> Good catch, yes the spec also has !is_long_mode and it should be here
>> for completeness. It won't make much of a difference though, I believe
>> that kvm_set_cr3 is to a large extent unnecessary here.
>
> I agree.  The patch is in kvm/queue to get some testing -- I'll replace
> it if you write an update.

Yes, I am planning to send a new version.

>> What kvm_set_cr3 does:
>>
>> * conditional  kvm_mmu_sync_roots + kvm_make_request(KVM_REQ_TLB_FLUSH)
>> ** kvm_mmu_sync_roots will run anyway: kvm_mmu_load <- kvm_mmu_reload
>> <- vcpu_enter_guest
>> ** tlb flush will be done anyway: vmx_flush_tlb <- vmx_set_cr3 <-
>> kvm_mmu_load <- kvm_mmu_reload <- vcpu_enter_guest
>>
>> * in long mode, it fails if (cr3 & CR3_L_MODE_RESERVED_BITS)
>> ** nobody checks the return value
>> ** Intel manual says "Reserved bits in CR0 and CR3 remain clear after
>> any load of those registers; attempts to set them have no impact."
>> Should we just clear the bits and carry on then? This is in conflict
>> with "#GP(0) .. If an attempt is made to write a 1 to any reserved bit
>> in CR3." Hmm.
>
> The spec is quite clear on this.  26.3.1.1 Checks on Guest Control
> Registers, Debug Registers, and MSRs:
>
>   The following checks are performed on processors that support Intel 64
>   architecture: The CR3 field must be such that bits 63:52 and bits in
>   the range 51:32 beyond the processor’s physical-address width are 0.
>
> To verify, I tried these two options on top of vmx_vcpu_run
>
>   vmcs_writel(GUEST_CR3, vmcs_readl(GUEST_CR3) | 1UL << boot_cpu_data.x86_phys_bits);
>   vmcs_writel(GUEST_CR3, vmcs_readl(GUEST_CR3) | CR3_PCID_INVD);
>
> and both failed VM entry.  We should fail the nested VM entry as well
> and use cpuid_maxphyaddr() to determine when.

Thanks, I hadn't realized that the rules for VM entry are different
from regular CR3 loads. One more reason for not using kvm_set_cr3
here.

> (And I have a bad feeling that guest's physical address width is not
>  being limited by hardware's ...)

Can you elaborate? In which MMU modes would it be causing problems?

>> * if (is_pae(vcpu) && is_paging(vcpu), load_pdptrs is executed
>> ** seems legit!
>>
>> * kvm_mmu_new_cr3
>> ** will run anyway, handled in kvm_mmu_reset_context called right
>> after kvm_set_cr3
>>
>> I am tempted to make more invasive changes. What do you think?
>
> Go ahead!
>
> Thanks.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] KVM: nVMX: fix CR3 load if L2 uses PAE paging and EPT
  2016-11-28 18:12       ` Ladi Prosek
@ 2016-11-29 16:15         ` Radim Krčmář
  0 siblings, 0 replies; 7+ messages in thread
From: Radim Krčmář @ 2016-11-29 16:15 UTC (permalink / raw)
  To: Ladi Prosek; +Cc: KVM list, Paolo Bonzini, Bandan Das

2016-11-28 19:12+0100, Ladi Prosek:
> On Fri, Nov 25, 2016 at 3:15 PM, Radim Krčmář <rkrcmar@redhat.com> wrote:
> > 2016-11-25 09:44+0100, Ladi Prosek:
>>> What kvm_set_cr3 does:
>>>
>>> * conditional  kvm_mmu_sync_roots + kvm_make_request(KVM_REQ_TLB_FLUSH)
>>> ** kvm_mmu_sync_roots will run anyway: kvm_mmu_load <- kvm_mmu_reload
>>> <- vcpu_enter_guest
>>> ** tlb flush will be done anyway: vmx_flush_tlb <- vmx_set_cr3 <-
>>> kvm_mmu_load <- kvm_mmu_reload <- vcpu_enter_guest
>>>
>>> * in long mode, it fails if (cr3 & CR3_L_MODE_RESERVED_BITS)
>>> ** nobody checks the return value
>>> ** Intel manual says "Reserved bits in CR0 and CR3 remain clear after
>>> any load of those registers; attempts to set them have no impact."
>>> Should we just clear the bits and carry on then? This is in conflict
>>> with "#GP(0) .. If an attempt is made to write a 1 to any reserved bit
>>> in CR3." Hmm.
>>
>> The spec is quite clear on this.  26.3.1.1 Checks on Guest Control
>> Registers, Debug Registers, and MSRs:
>>
>>   The following checks are performed on processors that support Intel 64
>>   architecture: The CR3 field must be such that bits 63:52 and bits in
>>   the range 51:32 beyond the processor’s physical-address width are 0.
>>
>> To verify, I tried these two options on top of vmx_vcpu_run
>>
>>   vmcs_writel(GUEST_CR3, vmcs_readl(GUEST_CR3) | 1UL << boot_cpu_data.x86_phys_bits);
>>   vmcs_writel(GUEST_CR3, vmcs_readl(GUEST_CR3) | CR3_PCID_INVD);
>>
>> and both failed VM entry.  We should fail the nested VM entry as well
>> and use cpuid_maxphyaddr() to determine when.
> 
> Thanks, I hadn't realized that the rules for VM entry are different
> from regular CR3 loads. One more reason for not using kvm_set_cr3
> here.
> 
>> (And I have a bad feeling that guest's physical address width is not
>>  being limited by hardware's ...)
> 
> Can you elaborate? In which MMU modes would it be causing problems?

It's a corner case on hardware that has less physical bits than the
guest is configure for.

If L1 then sets bits between its maximum and the hardware maximum, then
VM entry in L0 will fail and that will kill L1 (report hardware error to
userspace).  L1 did nothing wrong and the bug is in L0, so killing L1 if
we hit the corner case is the best way ...

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2016-11-29 16:15 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-11-24 13:29 [PATCH] KVM: nVMX: fix CR3 load if L2 uses PAE paging and EPT Ladi Prosek
2016-11-24 16:28 ` Paolo Bonzini
2016-11-24 18:25 ` Radim Krčmář
2016-11-25  8:44   ` Ladi Prosek
2016-11-25 14:15     ` Radim Krčmář
2016-11-28 18:12       ` Ladi Prosek
2016-11-29 16:15         ` Radim Krčmář

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).