public inbox for kvm@vger.kernel.org
* KVM: MMU: remove prefault from invlpg handler
@ 2009-12-05 14:34 Marcelo Tosatti
  2009-12-05 16:57 ` Avi Kivity
  0 siblings, 1 reply; 7+ messages in thread
From: Marcelo Tosatti @ 2009-12-05 14:34 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm


The invlpg prefault optimization breaks Windows 2008 R2 occasionally. 

The visible effect is that the invlpg handler instantiates a pte which
is, microseconds later, written with a different gfn by another vcpu.

The OS could have other mechanisms to prevent a present translation from 
being used, which the hypervisor is unaware of.

Fix by making invlpg emulation follow documented behaviour.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>


diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index a601713..58a0f1e 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -455,8 +455,6 @@ out_unlock:
 static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
 {
 	struct kvm_shadow_walk_iterator iterator;
-	pt_element_t gpte;
-	gpa_t pte_gpa = -1;
 	int level;
 	u64 *sptep;
 	int need_flush = 0;
@@ -470,10 +468,6 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
 		if (level == PT_PAGE_TABLE_LEVEL  ||
 		    ((level == PT_DIRECTORY_LEVEL && is_large_pte(*sptep))) ||
 		    ((level == PT_PDPE_LEVEL && is_large_pte(*sptep)))) {
-			struct kvm_mmu_page *sp = page_header(__pa(sptep));
-
-			pte_gpa = (sp->gfn << PAGE_SHIFT);
-			pte_gpa += (sptep - sp->spt) * sizeof(pt_element_t);
 
 			if (is_shadow_present_pte(*sptep)) {
 				rmap_remove(vcpu->kvm, sptep);
@@ -492,18 +486,6 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
 	if (need_flush)
 		kvm_flush_remote_tlbs(vcpu->kvm);
 	spin_unlock(&vcpu->kvm->mmu_lock);
-
-	if (pte_gpa == -1)
-		return;
-	if (kvm_read_guest_atomic(vcpu->kvm, pte_gpa, &gpte,
-				  sizeof(pt_element_t)))
-		return;
-	if (is_present_gpte(gpte) && (gpte & PT_ACCESSED_MASK)) {
-		if (mmu_topup_memory_caches(vcpu))
-			return;
-		kvm_mmu_pte_write(vcpu, pte_gpa, (const u8 *)&gpte,
-				  sizeof(pt_element_t), 0);
-	}
 }
 
 static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, gva_t vaddr)

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: KVM: MMU: remove prefault from invlpg handler
  2009-12-05 14:34 KVM: MMU: remove prefault from invlpg handler Marcelo Tosatti
@ 2009-12-05 16:57 ` Avi Kivity
  2009-12-05 17:04   ` Avi Kivity
  2009-12-05 19:42   ` Marcelo Tosatti
  0 siblings, 2 replies; 7+ messages in thread
From: Avi Kivity @ 2009-12-05 16:57 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm

On 12/05/2009 04:34 PM, Marcelo Tosatti wrote:
> The invlpg prefault optimization breaks Windows 2008 R2 occasionally.
>
> The visible effect is that the invlpg handler instantiates a pte which
> is, microseconds later, written with a different gfn by another vcpu.
>
> The OS could have other mechanisms to prevent a present translation from
> being used, which the hypervisor is unaware of.
>
> Fix by making invlpg emulation follow documented behaviour.
>
>    

Good catch.  How did you track it down?

I don't think the OS has "other mechanisms", though - the processor can 
speculate the tlb so that would be an OS bug.  It looks like a race:

> Signed-off-by: Marcelo Tosatti<mtosatti@redhat.com>
>
>
> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> index a601713..58a0f1e 100644
> --- a/arch/x86/kvm/paging_tmpl.h
> +++ b/arch/x86/kvm/paging_tmpl.h
> @@ -455,8 +455,6 @@ out_unlock:
>   static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
>   {
>   	struct kvm_shadow_walk_iterator iterator;
> -	pt_element_t gpte;
> -	gpa_t pte_gpa = -1;
>   	int level;
>   	u64 *sptep;
>   	int need_flush = 0;
> @@ -470,10 +468,6 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
>   		if (level == PT_PAGE_TABLE_LEVEL  ||
>   		    ((level == PT_DIRECTORY_LEVEL && is_large_pte(*sptep))) ||
>   		    ((level == PT_PDPE_LEVEL && is_large_pte(*sptep)))) {
> -			struct kvm_mmu_page *sp = page_header(__pa(sptep));
> -
> -			pte_gpa = (sp->gfn << PAGE_SHIFT);
> -			pte_gpa += (sptep - sp->spt) * sizeof(pt_element_t);
>
>   			if (is_shadow_present_pte(*sptep)) {
>   				rmap_remove(vcpu->kvm, sptep);
> @@ -492,18 +486,6 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
>   	if (need_flush)
>   		kvm_flush_remote_tlbs(vcpu->kvm);
>   	spin_unlock(&vcpu->kvm->mmu_lock);
> -
> -	if (pte_gpa == -1)
> -		return;
> -	if (kvm_read_guest_atomic(vcpu->kvm, pte_gpa, &gpte,
> -				  sizeof(pt_element_t)))
> -		return;
>    


Here, another vcpu updates the gpte and issues a new invlpg.


> -	if (is_present_gpte(gpte) && (gpte & PT_ACCESSED_MASK)) {
> -		if (mmu_topup_memory_caches(vcpu))
> -			return;
> -		kvm_mmu_pte_write(vcpu, pte_gpa, (const u8 *)&gpte,
> -				  sizeof(pt_element_t), 0);
> -	}
>    


And here we undo the correct invlpg with the outdated gpte.

Looks like we considered this, since kvm_read_guest_atomic() is only 
needed when running inside the spinlock, but some other change moved the 
spin_unlock() upwards.  Will investigate the history.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.



* Re: KVM: MMU: remove prefault from invlpg handler
  2009-12-05 16:57 ` Avi Kivity
@ 2009-12-05 17:04   ` Avi Kivity
  2009-12-05 19:42   ` Marcelo Tosatti
  1 sibling, 0 replies; 7+ messages in thread
From: Avi Kivity @ 2009-12-05 17:04 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm

On 12/05/2009 06:57 PM, Avi Kivity wrote:
>
> Looks like we considered this, since kvm_read_guest_atomic() is only 
> needed when running inside the spinlock, but some other change moved the 
> spin_unlock() upwards.  Will investigate the history.
>

No, the bug was there from day one (and survived a year):

+       spin_lock(&vcpu->kvm->mmu_lock);
         walk_shadow(&walker.walker, vcpu, gva);
+       spin_unlock(&vcpu->kvm->mmu_lock);
+       if (walker.pte_gpa == -1)
+               return;
+       if (kvm_read_guest_atomic(vcpu->kvm, walker.pte_gpa, &gpte,
+                                 sizeof(pt_element_t)))
+               return;
+       if (is_present_pte(gpte) && (gpte & PT_ACCESSED_MASK)) {
+               if (mmu_topup_memory_caches(vcpu))
+                       return;
+               kvm_mmu_pte_write(vcpu, walker.pte_gpa, (const u8 *)&gpte,
+                                 sizeof(pt_element_t), 0);
+       }


-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.



* Re: KVM: MMU: remove prefault from invlpg handler
  2009-12-05 16:57 ` Avi Kivity
  2009-12-05 17:04   ` Avi Kivity
@ 2009-12-05 19:42   ` Marcelo Tosatti
  2009-12-05 20:15     ` Avi Kivity
  1 sibling, 1 reply; 7+ messages in thread
From: Marcelo Tosatti @ 2009-12-05 19:42 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm

On Sat, Dec 05, 2009 at 06:57:44PM +0200, Avi Kivity wrote:
> On 12/05/2009 04:34 PM, Marcelo Tosatti wrote:
>> The invlpg prefault optimization breaks Windows 2008 R2 occasionally.
>>
>> The visible effect is that the invlpg handler instantiates a pte which
>> is, microseconds later, written with a different gfn by another vcpu.
>>
>> The OS could have other mechanisms to prevent a present translation from
>> being used, which the hypervisor is unaware of.
>>
>> Fix by making invlpg emulation follow documented behaviour.
>>
>>    
>
> Good catch.  How did you track it down?

Lots of tracing.

> I don't think the OS has "other mechanisms", though - the processor can  
> speculate the tlb so that would be an OS bug.

Can it? I figured it relied on the fact that no access (therefore no TLB
entry instantiation) meant there is no need to invlpg (since there is
nothing in the TLB to invalidate), before updating a particular pte.

The documentation states that invlpg invalidates any entries for the
linear address.

> It looks like a race:
>
> [...]
> Looks like we considered this, since kvm_read_guest_atomic() is only  
> needed when running inside the spinlock, but some other change moved the  
> spin_unlock() upwards.  Will investigate the history.

Isn't it the OS responsibility to serialize pte updates + invlpg between
CPUs?



* Re: KVM: MMU: remove prefault from invlpg handler
  2009-12-05 19:42   ` Marcelo Tosatti
@ 2009-12-05 20:15     ` Avi Kivity
  2009-12-07 20:51       ` Marcelo Tosatti
  0 siblings, 1 reply; 7+ messages in thread
From: Avi Kivity @ 2009-12-05 20:15 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm

On 12/05/2009 09:42 PM, Marcelo Tosatti wrote:
>
>> I don't think the OS has "other mechanisms", though - the processor can
>> speculate the tlb so that would be an OS bug.
>
> Can it? I figured it relied on the fact that no access (therefore no TLB
> entry instantiation) meant there is no need to invlpg (since there is
> nothing in the TLB to invalidate), before updating a particular pte.
>
> The documentation states that invlpg invalidates any entries for the
> linear address.
>

4.10.1.3 says, "The processor may cache translations required for 
prefetches and for accesses that are a result of speculative execution 
that would never actually occur in the executed code path.", so there is 
no way for the OS to ensure no access has occurred.  If you change a 
present pte, you must execute invlpg afterwards to ensure speculation 
hasn't instantiated the old pte.


>> [...]
>
> Isn't it the OS responsibility to serialize pte updates + invlpg between
> CPUs?

It is.  Do you still have a trace of the error?  Maybe we can understand 
what the guest thought it was doing.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.



* Re: KVM: MMU: remove prefault from invlpg handler
  2009-12-05 20:15     ` Avi Kivity
@ 2009-12-07 20:51       ` Marcelo Tosatti
  2009-12-08 10:14         ` Avi Kivity
  0 siblings, 1 reply; 7+ messages in thread
From: Marcelo Tosatti @ 2009-12-07 20:51 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm

On Sat, Dec 05, 2009 at 10:15:44PM +0200, Avi Kivity wrote:
> On 12/05/2009 09:42 PM, Marcelo Tosatti wrote:
> [...]
>>
>> Isn't it the OS responsibility to serialize pte updates + invlpg between
>> CPUs?
>
> It is.  Do you still have a trace of the error?  Maybe we can understand  
> what the guest thought it was doing.


BAD_POOL_HEADER (19)
The pool is already corrupt at the time of the current request.
This may or may not be due to the caller.
The internal pool links must be walked to figure out a possible cause of
the problem, and then special pool applied to the suspect tags or the driver
verifier to a suspect driver.
Arguments:
Arg1: 00000021, the data following the pool block being freed is corrupt.  Typically this means the consumer (call stack ) has overrun the block.
Arg2: 95424000, The pool pointer being freed.
Arg3: 00001010, The number of bytes allocated for the pool block.
Arg4: 00000000, The corrupted value found following the pool block.


The BAD_POOL_HEADER BSOD happens at address 0xFFFFF8A000DDD000 (complaining it contains "000000", Arg4).
Walking the pagetables takes us to 0x18996 as the pte page:

(qemu) xp 0x18996ee8  (vaddr 0xFFFFF8A000DDD000)
0000000018996ee8: 0x153c9963

(qemu) xp 0x18996ef0  (vaddr 0xFFFFF8A000DDE000)
0000000018996ef0: 0x1528a963


 qemu-system-x86-13667 [007] 425860.260987: kvm_mmu_pte_write: sp->gfn 18996 (offset=ef0) gfn 15f11 invlpg=1
 qemu-system-x86-13670 [004] 425860.264977: kvm_mmu_pte_write: sp->gfn 18996 (offset=ef0) gfn 15253 invlpg=1
 qemu-system-x86-13670 [004] 425860.265039: kvm_mmu_pte_write: sp->gfn 18996 (offset=ef0) gfn 15f15 invlpg=1
 qemu-system-x86-13670 [004] 425860.266591: kvm_mmu_pte_write: sp->gfn 18996 (offset=ef0) gfn 146f3 invlpg=1
 qemu-system-x86-13670 [004] 425860.268128: kvm_mmu_pte_write: sp->gfn 18996 (offset=ee8) gfn 14688 invlpg=1
 qemu-system-x86-13670 [004] 425860.268592: kvm_mmu_pte_write: sp->gfn 18996 (offset=ef0) gfn 159c7 invlpg=1
 qemu-system-x86-13669 [003] 425861.267453: kvm_mmu_zap_page: sp gfn 18996 1/4 q0 w-x pge nxe root 0 unsync

The page is not shadowed again after this.

gfn 0x159c7 above contains a valid "end of pool block" header:

00000000  00 00 00 00 00 00 00 00  10 10 00 00 00 00 00 00  |................|
00000010  00 01 01 03 46 72 61 67  1c 00 33 d2 48 8d 44 24  |....Frag..3.H.D$|
00000020  01 01 fe 00 46 72 65 65  48 8d 44 04 30 48 f7 f1  |....FreeH.D.0H..|

But not the one which is in the pagetable at the time of BSOD:

(qemu) xp /20x 0x1528a000
000000001528a000: 0x00000000 0x00000000 0x00000000 0x00000000
000000001528a010: 0x00000000 0x00000000 0x00000000 0x00000000
000000001528a020: 0x00000000 0x00000000 0x00000000 0x00000000
000000001528a030: 0x00000000 0x00000000 0x00000000 0x00000000
000000001528a040: 0x00000000 0x00000000 0x00000000 0x00000000

So my theory was, Windows wrote gfn 0x1528a to offset 0xef0, but skipped
the invlpg, so a write to address 0xFFFFF8A000DDE000 ended up in the wrong
gfn (the one prefaulted by the invlpg handler).




* Re: KVM: MMU: remove prefault from invlpg handler
  2009-12-07 20:51       ` Marcelo Tosatti
@ 2009-12-08 10:14         ` Avi Kivity
  0 siblings, 0 replies; 7+ messages in thread
From: Avi Kivity @ 2009-12-08 10:14 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm

On 12/07/2009 10:51 PM, Marcelo Tosatti wrote:
>
> [...]
> So my theory was, Windows wrote gfn 0x1528a to offset 0xef0, but skipped
> the invlpg, so a write to address 0xFFFFF8A000DDE000 ended up in the wrong
> gfn (the one prefaulted by the invlpg handler).
>    

An alternative is that something fishy happened to the pte mapping the 
page table.  But that's far-fetched.

So I'll commit the patch.  I'm sorry to lose that optimization; I'll try 
later to track down what happened by reinstating the prefault, but with 
the page marked not present, so we can see if a later fault fails to 
match the prefetch.

-- 
error compiling committee.c: too many arguments to function


