* [PATCH 1/2] KVM: mmu_notifier: Flush TLBs before releasing mmu_lock
@ 2012-02-10  6:28 Takuya Yoshikawa
  2012-02-10  6:29 ` [PATCH 2/2] KVM: MMU: Flush TLBs only once in invlpg() " Takuya Yoshikawa
                   ` (3 more replies)
  0 siblings, 4 replies; 24+ messages in thread
From: Takuya Yoshikawa @ 2012-02-10  6:28 UTC (permalink / raw)
  To: avi, mtosatti; +Cc: kvm, aarcange, xiaoguangrong

Other threads may process the same page in that small window, skip the
TLB flush, and then return before these functions do the flush.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
---
 virt/kvm/kvm_main.c |   19 ++++++++++---------
 1 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 470e305..2b4bc77 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -289,15 +289,15 @@ static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
 	 */
 	idx = srcu_read_lock(&kvm->srcu);
 	spin_lock(&kvm->mmu_lock);
+
 	kvm->mmu_notifier_seq++;
 	need_tlb_flush = kvm_unmap_hva(kvm, address) | kvm->tlbs_dirty;
-	spin_unlock(&kvm->mmu_lock);
-	srcu_read_unlock(&kvm->srcu, idx);
-
 	/* we've to flush the tlb before the pages can be freed */
 	if (need_tlb_flush)
 		kvm_flush_remote_tlbs(kvm);
 
+	spin_unlock(&kvm->mmu_lock);
+	srcu_read_unlock(&kvm->srcu, idx);
 }
 
 static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
@@ -335,12 +335,12 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 	for (; start < end; start += PAGE_SIZE)
 		need_tlb_flush |= kvm_unmap_hva(kvm, start);
 	need_tlb_flush |= kvm->tlbs_dirty;
-	spin_unlock(&kvm->mmu_lock);
-	srcu_read_unlock(&kvm->srcu, idx);
-
 	/* we've to flush the tlb before the pages can be freed */
 	if (need_tlb_flush)
 		kvm_flush_remote_tlbs(kvm);
+
+	spin_unlock(&kvm->mmu_lock);
+	srcu_read_unlock(&kvm->srcu, idx);
 }
 
 static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
@@ -378,13 +378,14 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
 
 	idx = srcu_read_lock(&kvm->srcu);
 	spin_lock(&kvm->mmu_lock);
-	young = kvm_age_hva(kvm, address);
-	spin_unlock(&kvm->mmu_lock);
-	srcu_read_unlock(&kvm->srcu, idx);
 
+	young = kvm_age_hva(kvm, address);
 	if (young)
 		kvm_flush_remote_tlbs(kvm);
 
+	spin_unlock(&kvm->mmu_lock);
+	srcu_read_unlock(&kvm->srcu, idx);
+
 	return young;
 }
 
-- 
1.7.5.4



* [PATCH 2/2] KVM: MMU: Flush TLBs only once in invlpg() before releasing mmu_lock
  2012-02-10  6:28 [PATCH 1/2] KVM: mmu_notifier: Flush TLBs before releasing mmu_lock Takuya Yoshikawa
@ 2012-02-10  6:29 ` Takuya Yoshikawa
  2012-02-10  6:55   ` Xiao Guangrong
  2012-02-10  7:52 ` [PATCH 1/2] KVM: mmu_notifier: Flush TLBs " Xiao Guangrong
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 24+ messages in thread
From: Takuya Yoshikawa @ 2012-02-10  6:29 UTC (permalink / raw)
  To: avi, mtosatti; +Cc: kvm, aarcange, xiaoguangrong

This function's TLB flush has been moved a few times in the past:

1.
  commit 4539b35881ae9664b0e2953438dd83f5ee02c0b4
  KVM: Fix missing smp tlb flush in invlpg

inserted it in the critical section.

2.
  commit 505aef8f30a95f7e4abf2c07e54ded1521587ba0
  KVM: MMU: cleanup FNAME(invlpg)

moved it inside the loop:  right after mmu_page_zap_pte().

3.
  commit f57f2ef58f6703e6df70ed52a198920cb3e8edba
  KVM: MMU: fast prefetch spte on invlpg path

inserted update_pte() after it.

In addition, the caller, kvm_mmu_invlpg(), still does a flush.

This patch simplifies the logic and removes this redundancy.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
---
 arch/x86/kvm/mmu.c         |    1 -
 arch/x86/kvm/paging_tmpl.h |    6 ++++--
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index ae76cc3..ea27ee3 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3769,7 +3769,6 @@ EXPORT_SYMBOL_GPL(kvm_mmu_page_fault);
 void kvm_mmu_invlpg(struct kvm_vcpu *vcpu, gva_t gva)
 {
 	vcpu->arch.mmu.invlpg(vcpu, gva);
-	kvm_mmu_flush_tlb(vcpu);
 	++vcpu->stat.invlpg;
 }
 EXPORT_SYMBOL_GPL(kvm_mmu_invlpg);
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 1561028..69d06f5 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -682,6 +682,7 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
 	mmu_topup_memory_caches(vcpu);
 
 	spin_lock(&vcpu->kvm->mmu_lock);
+
 	for_each_shadow_entry(vcpu, gva, iterator) {
 		level = iterator.level;
 		sptep = iterator.sptep;
@@ -697,8 +698,7 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
 			pte_gpa = FNAME(get_level1_sp_gpa)(sp);
 			pte_gpa += (sptep - sp->spt) * sizeof(pt_element_t);
 
-			if (mmu_page_zap_pte(vcpu->kvm, sp, sptep))
-				kvm_flush_remote_tlbs(vcpu->kvm);
+			mmu_page_zap_pte(vcpu->kvm, sp, sptep);
 
 			if (!rmap_can_add(vcpu))
 				break;
@@ -713,6 +713,8 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
 		if (!is_shadow_present_pte(*sptep) || !sp->unsync_children)
 			break;
 	}
+
+	kvm_flush_remote_tlbs(vcpu->kvm);
 	spin_unlock(&vcpu->kvm->mmu_lock);
 }
 
-- 
1.7.5.4



* Re: [PATCH 2/2] KVM: MMU: Flush TLBs only once in invlpg() before releasing mmu_lock
  2012-02-10  6:29 ` [PATCH 2/2] KVM: MMU: Flush TLBs only once in invlpg() " Takuya Yoshikawa
@ 2012-02-10  6:55   ` Xiao Guangrong
  2012-02-10  7:21     ` Takuya Yoshikawa
  0 siblings, 1 reply; 24+ messages in thread
From: Xiao Guangrong @ 2012-02-10  6:55 UTC (permalink / raw)
  To: Takuya Yoshikawa; +Cc: avi, mtosatti, kvm, aarcange

On 02/10/2012 02:29 PM, Takuya Yoshikawa wrote:

> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> index 1561028..69d06f5 100644
> --- a/arch/x86/kvm/paging_tmpl.h
> +++ b/arch/x86/kvm/paging_tmpl.h
> @@ -682,6 +682,7 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
>  	mmu_topup_memory_caches(vcpu);
> 
>  	spin_lock(&vcpu->kvm->mmu_lock);
> +
>  	for_each_shadow_entry(vcpu, gva, iterator) {
>  		level = iterator.level;
>  		sptep = iterator.sptep;
> @@ -697,8 +698,7 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
>  			pte_gpa = FNAME(get_level1_sp_gpa)(sp);
>  			pte_gpa += (sptep - sp->spt) * sizeof(pt_element_t);
> 
> -			if (mmu_page_zap_pte(vcpu->kvm, sp, sptep))
> -				kvm_flush_remote_tlbs(vcpu->kvm);
> +			mmu_page_zap_pte(vcpu->kvm, sp, sptep);
> 
>  			if (!rmap_can_add(vcpu))
>  				break;
> @@ -713,6 +713,8 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
>  		if (!is_shadow_present_pte(*sptep) || !sp->unsync_children)
>  			break;
>  	}
> +
> +	kvm_flush_remote_tlbs(vcpu->kvm);
>  	spin_unlock(&vcpu->kvm->mmu_lock);


It is obvious wrong, i do not think all tlbs always need be flushed...



* Re: [PATCH 2/2] KVM: MMU: Flush TLBs only once in invlpg() before releasing mmu_lock
  2012-02-10  6:55   ` Xiao Guangrong
@ 2012-02-10  7:21     ` Takuya Yoshikawa
  2012-02-10  7:42       ` Xiao Guangrong
  0 siblings, 1 reply; 24+ messages in thread
From: Takuya Yoshikawa @ 2012-02-10  7:21 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: avi, mtosatti, kvm, aarcange

(2012/02/10 15:55), Xiao Guangrong wrote:
> On 02/10/2012 02:29 PM, Takuya Yoshikawa wrote:
>
>> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
>> index 1561028..69d06f5 100644
>> --- a/arch/x86/kvm/paging_tmpl.h
>> +++ b/arch/x86/kvm/paging_tmpl.h
>> @@ -682,6 +682,7 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
>>   	mmu_topup_memory_caches(vcpu);
>>
>>   	spin_lock(&vcpu->kvm->mmu_lock);
>> +
>>   	for_each_shadow_entry(vcpu, gva, iterator) {
>>   		level = iterator.level;
>>   		sptep = iterator.sptep;
>> @@ -697,8 +698,7 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
>>   			pte_gpa = FNAME(get_level1_sp_gpa)(sp);
>>   			pte_gpa += (sptep - sp->spt) * sizeof(pt_element_t);
>>
>> -			if (mmu_page_zap_pte(vcpu->kvm, sp, sptep))
>> -				kvm_flush_remote_tlbs(vcpu->kvm);
>> +			mmu_page_zap_pte(vcpu->kvm, sp, sptep);
>>
>>   			if (!rmap_can_add(vcpu))
>>   				break;
>> @@ -713,6 +713,8 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
>>   		if (!is_shadow_present_pte(*sptep) || !sp->unsync_children)
>>   			break;
>>   	}
>> +
>> +	kvm_flush_remote_tlbs(vcpu->kvm);
>>   	spin_unlock(&vcpu->kvm->mmu_lock);
>
>
> It is obvious wrong, i do not think all tlbs always need be flushed...
>

What do you mean by "obvious wrong" ?

Even before this patch, we were always flushing TLBs from the caller.

I have a question:  your patches apparently changed the timing of TLB flush
but all I could see from the changelogs were:


     KVM: MMU: cleanup FNAME(invlpg)

     Directly Use mmu_page_zap_pte to zap spte in FNAME(invlpg), also remove the
     same code between FNAME(invlpg) and FNAME(sync_page)


     KVM: MMU: fast prefetch spte on invlpg path

     Fast prefetch spte for the unsync shadow page on invlpg path


Please explain what your patches semantically changed before checking the
details of my patch.

Thanks,
	Takuya


* Re: [PATCH 2/2] KVM: MMU: Flush TLBs only once in invlpg() before releasing mmu_lock
  2012-02-10  7:21     ` Takuya Yoshikawa
@ 2012-02-10  7:42       ` Xiao Guangrong
  2012-02-14  4:36         ` Takuya Yoshikawa
  0 siblings, 1 reply; 24+ messages in thread
From: Xiao Guangrong @ 2012-02-10  7:42 UTC (permalink / raw)
  To: Takuya Yoshikawa; +Cc: avi, mtosatti, kvm, aarcange

On 02/10/2012 03:21 PM, Takuya Yoshikawa wrote:

> (2012/02/10 15:55), Xiao Guangrong wrote:
>> On 02/10/2012 02:29 PM, Takuya Yoshikawa wrote:
>>
>>> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
>>> index 1561028..69d06f5 100644
>>> --- a/arch/x86/kvm/paging_tmpl.h
>>> +++ b/arch/x86/kvm/paging_tmpl.h
>>> @@ -682,6 +682,7 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
>>>       mmu_topup_memory_caches(vcpu);
>>>
>>>       spin_lock(&vcpu->kvm->mmu_lock);
>>> +
>>>       for_each_shadow_entry(vcpu, gva, iterator) {
>>>           level = iterator.level;
>>>           sptep = iterator.sptep;
>>> @@ -697,8 +698,7 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
>>>               pte_gpa = FNAME(get_level1_sp_gpa)(sp);
>>>               pte_gpa += (sptep - sp->spt) * sizeof(pt_element_t);
>>>
>>> -            if (mmu_page_zap_pte(vcpu->kvm, sp, sptep))
>>> -                kvm_flush_remote_tlbs(vcpu->kvm);
>>> +            mmu_page_zap_pte(vcpu->kvm, sp, sptep);
>>>
>>>               if (!rmap_can_add(vcpu))
>>>                   break;
>>> @@ -713,6 +713,8 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
>>>           if (!is_shadow_present_pte(*sptep) || !sp->unsync_children)
>>>               break;
>>>       }
>>> +
>>> +    kvm_flush_remote_tlbs(vcpu->kvm);
>>>       spin_unlock(&vcpu->kvm->mmu_lock);
>>
>>
>> It is obvious wrong, i do not think all tlbs always need be flushed...
>>
> 
> What do you mean by "obvious wrong" ?
> 


In the current code, all tlbs are flushed only when a spte is zapped, but after
your change, they are always flushed.

> Even before this patch, we were always flushing TLBs from the caller.
> 


Oh, could you please tell me where tlbs can be flushed except when a spte is
zapped in this path?

> I have a question:  your patches apparently changed the timing of TLB flush
> but all I could see from the changelogs were:
> 
> 
>     KVM: MMU: cleanup FNAME(invlpg)
> 
>     Directly Use mmu_page_zap_pte to zap spte in FNAME(invlpg), also remove the
>     same code between FNAME(invlpg) and FNAME(sync_page)
> 


This patch does not change the logic, and the time at which the tlb is flushed
is also not changed; it just directly calls kvm_flush_remote_tlbs when a spte is zapped.

> 
>     KVM: MMU: fast prefetch spte on invlpg path
> 
>     Fast prefetch spte for the unsync shadow page on invlpg path
> 
> 


This patch did not change when kvm_flush_remote_tlbs is called.

What caused your confusion?



* Re: [PATCH 1/2] KVM: mmu_notifier: Flush TLBs before releasing mmu_lock
  2012-02-10  6:28 [PATCH 1/2] KVM: mmu_notifier: Flush TLBs before releasing mmu_lock Takuya Yoshikawa
  2012-02-10  6:29 ` [PATCH 2/2] KVM: MMU: Flush TLBs only once in invlpg() " Takuya Yoshikawa
@ 2012-02-10  7:52 ` Xiao Guangrong
  2012-02-13  6:00   ` Takuya Yoshikawa
  2012-02-14 17:27   ` Andrea Arcangeli
  2012-02-10 17:26 ` Marcelo Tosatti
  2012-02-14 17:10 ` Andrea Arcangeli
  3 siblings, 2 replies; 24+ messages in thread
From: Xiao Guangrong @ 2012-02-10  7:52 UTC (permalink / raw)
  To: Takuya Yoshikawa; +Cc: avi, mtosatti, kvm, aarcange

On 02/10/2012 02:28 PM, Takuya Yoshikawa wrote:

> Other threads may process the same page in that small window and skip
> TLB flush and then return before these functions do flush.
> 


Is it possible to flush the tlb inside the mmu lock only when a writable
spte is invalidated? Sometimes, kvm_flush_remote_tlbs needs a
long time to wait.
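
(For concreteness, roughly the following sketch; the FLUSH_HAD_WRITABLE_SPTE
flag is hypothetical, kvm_unmap_hva() does not report that today:)

	spin_lock(&kvm->mmu_lock);
	flush = kvm_unmap_hva(kvm, address);
	if (flush & FLUSH_HAD_WRITABLE_SPTE)	/* hypothetical: a writable spte was zapped */
		kvm_flush_remote_tlbs(kvm);	/* flush while still holding mmu_lock */
	spin_unlock(&kvm->mmu_lock);
	if (flush && !(flush & FLUSH_HAD_WRITABLE_SPTE))
		kvm_flush_remote_tlbs(kvm);	/* read-only case deferred outside the lock */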



* Re: [PATCH 1/2] KVM: mmu_notifier: Flush TLBs before releasing mmu_lock
  2012-02-10  6:28 [PATCH 1/2] KVM: mmu_notifier: Flush TLBs before releasing mmu_lock Takuya Yoshikawa
  2012-02-10  6:29 ` [PATCH 2/2] KVM: MMU: Flush TLBs only once in invlpg() " Takuya Yoshikawa
  2012-02-10  7:52 ` [PATCH 1/2] KVM: mmu_notifier: Flush TLBs " Xiao Guangrong
@ 2012-02-10 17:26 ` Marcelo Tosatti
  2012-02-14 17:10 ` Andrea Arcangeli
  3 siblings, 0 replies; 24+ messages in thread
From: Marcelo Tosatti @ 2012-02-10 17:26 UTC (permalink / raw)
  To: Takuya Yoshikawa; +Cc: avi, kvm, aarcange, xiaoguangrong

On Fri, Feb 10, 2012 at 03:28:31PM +0900, Takuya Yoshikawa wrote:
> Other threads may process the same page in that small window and skip
> TLB flush and then return before these functions do flush.
> 
> Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
> ---
>  virt/kvm/kvm_main.c |   19 ++++++++++---------
>  1 files changed, 10 insertions(+), 9 deletions(-)

Applied, thanks.



* Re: [PATCH 1/2] KVM: mmu_notifier: Flush TLBs before releasing mmu_lock
  2012-02-10  7:52 ` [PATCH 1/2] KVM: mmu_notifier: Flush TLBs " Xiao Guangrong
@ 2012-02-13  6:00   ` Takuya Yoshikawa
  2012-02-14 17:27   ` Andrea Arcangeli
  1 sibling, 0 replies; 24+ messages in thread
From: Takuya Yoshikawa @ 2012-02-13  6:00 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: avi, mtosatti, kvm, aarcange

(2012/02/10 16:52), Xiao Guangrong wrote:
> On 02/10/2012 02:28 PM, Takuya Yoshikawa wrote:
>
>> Other threads may process the same page in that small window and skip
>> TLB flush and then return before these functions do flush.
>>
>
>
> It is possible that flush tlb in mmu lock only when writeable
> spte is invalided? Sometimes, kvm_flush_remote_tlbs need
> long time to wait.
>

I am sorry but I don't see the point of adding yet another complexity
for that.


OK, let me clarify my position.

When I made this patch, I did a simple test to see if the mmu notifier would
work normally after my patch was applied:  I allocated a large amount of
memory behind the guest and made the host call the mmu_notifier functions.

In that situation the host as a whole became slow, e.g. I saw really bad
response times for some seconds.

In such a case, do you mind the additional time VCPU threads may have to
wait for the mmu notifier to finish the flush, which only matters when they
are trying to take the mmu_lock?


	Takuya


* Re: [PATCH 2/2] KVM: MMU: Flush TLBs only once in invlpg() before releasing mmu_lock
  2012-02-10  7:42       ` Xiao Guangrong
@ 2012-02-14  4:36         ` Takuya Yoshikawa
  2012-02-14  4:56           ` Takuya Yoshikawa
  0 siblings, 1 reply; 24+ messages in thread
From: Takuya Yoshikawa @ 2012-02-14  4:36 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: avi, mtosatti, kvm, aarcange

(Sorry for the delay, I was in bad form this weekend.)

I am sorry, I see what I was misreading.
My eyes misread "kvm_mmu_flush_tlb()" as "kvm_flush_remote_tlbs()".
That's why I could not understand what you said, really sorry.

(2012/02/10 16:42), Xiao Guangrong wrote:

>>> It is obvious wrong, i do not think all tlbs always need be flushed...
>>>
>>
>> What do you mean by "obvious wrong" ?
>>
>
>
> In the current code, all tlbs are flushed only when s spte is zapped, but after
> your change, they are always changed.

Above description will probably tell you what I was thinking.

>> Even before this patch, we were always flushing TLBs from the caller.
>>

> Oh, could you please tell me where tlbs can be flushed except when a spte is
> zapped in this path?

Ditto.

>
>> I have a question:  your patches apparently changed the timing of TLB flush
>> but all I could see from the changelogs were:
>>
>>
>>      KVM: MMU: cleanup FNAME(invlpg)
>>
>>      Directly Use mmu_page_zap_pte to zap spte in FNAME(invlpg), also remove the
>>      same code between FNAME(invlpg) and FNAME(sync_page)

> This patch dose not change the logic, the tlb flushed time is also not changed,
> it just directly call kvm_flush_remote_tlbs when a spte is zapped.

>>
>>      KVM: MMU: fast prefetch spte on invlpg path
>>
>>      Fast prefetch spte for the unsync shadow page on invlpg path
>>
> This patch did not change the code when kvm_flush_remote_tlbs is called.
>
> Where cause your confused?
>

Thank you for your explanation!

Probably I should reread the code after taking enough sleep.

BTW, do you think that "kvm_mmu_flush_tlb()" should be moved inside of the
mmu_lock critical section?


	Takuya


* Re: [PATCH 2/2] KVM: MMU: Flush TLBs only once in invlpg() before releasing mmu_lock
  2012-02-14  4:36         ` Takuya Yoshikawa
@ 2012-02-14  4:56           ` Takuya Yoshikawa
  2012-02-14 17:21             ` Andrea Arcangeli
  0 siblings, 1 reply; 24+ messages in thread
From: Takuya Yoshikawa @ 2012-02-14  4:56 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: avi, mtosatti, kvm, aarcange

(2012/02/14 13:36), Takuya Yoshikawa wrote:

> BTW, do you think that "kvm_mmu_flush_tlb()" should be moved inside of the
> mmu_lock critical section?
>

Ah, forget about this.  Trivially no.
I really need to take a rest.

Sorry,
	Takuya


* Re: [PATCH 1/2] KVM: mmu_notifier: Flush TLBs before releasing mmu_lock
  2012-02-10  6:28 [PATCH 1/2] KVM: mmu_notifier: Flush TLBs before releasing mmu_lock Takuya Yoshikawa
                   ` (2 preceding siblings ...)
  2012-02-10 17:26 ` Marcelo Tosatti
@ 2012-02-14 17:10 ` Andrea Arcangeli
  2012-02-14 17:29   ` Marcelo Tosatti
  3 siblings, 1 reply; 24+ messages in thread
From: Andrea Arcangeli @ 2012-02-14 17:10 UTC (permalink / raw)
  To: Takuya Yoshikawa; +Cc: avi, mtosatti, kvm, xiaoguangrong

On Fri, Feb 10, 2012 at 03:28:31PM +0900, Takuya Yoshikawa wrote:
> Other threads may process the same page in that small window and skip
> TLB flush and then return before these functions do flush.

It's correct to flush the shadow MMU TLB without the mmu_lock only in
the context of mmu notifier methods. So the patch below, while it won't
hurt, is a performance regression and shouldn't be applied (and
it obfuscates the code by not being strict anymore).

To the contrary, every other place that does a shadow/secondary MMU smp
tlb flush _must_ do it inside the mmu_lock, otherwise the
serialization isn't correct anymore against the mmu_lock taken below,
in the quoted patch, by
kvm_mmu_notifier_invalidate_page/kvm_mmu_notifier_invalidate_range_start.

The explanation is in commit 4539b35881ae9664b0e2953438dd83f5ee02c0b4.

I'll try to explain it more clearly: the moment you drop the mmu_lock,
pages can be freed. So if you invalidate a spte in any place inside
the KVM code (except the mmu notifier methods, where a reference to the
page is implicitly held by the caller and so the page can't go away
under a mmu notifier method by design), then the below
kvm_mmu_notifier_invalidate_page/kvm_mmu_notifier_invalidate_range_start
won't get their need_tlb_flush set anymore, and they won't run the tlb
flush before freeing the page.

So every other place (except mmu notifier) must flush the secondary
MMU smp tlb _before_ releasing the mmu_lock.

Only mmu notifier is safe to flush the secondary MMU TLB _after_
releasing the mmu_lock.
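
Roughly, the rule in pseudo-code (a sketch only; zap_or_protect_sptes() is a
placeholder for whatever work the path actually does, not a real function):

	/* every place except the mmu notifier methods */
	spin_lock(&kvm->mmu_lock);
	flush = zap_or_protect_sptes(kvm);	/* drop or write-protect sptes */
	if (flush)
		kvm_flush_remote_tlbs(kvm);	/* must happen before unlock */
	spin_unlock(&kvm->mmu_lock);

	/* mmu notifier methods only: the caller still holds a page reference */
	spin_lock(&kvm->mmu_lock);
	flush = kvm_unmap_hva(kvm, address);
	spin_unlock(&kvm->mmu_lock);
	if (flush)
		kvm_flush_remote_tlbs(kvm);	/* safe after unlock here, and only here */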

If we changed the mmu notifier methods to unconditionally flush the
shadow TLB (regardless if a spte was present or not), we might not
need to hold the mmu_lock in every tlb flush outside the context of
the mmu notifier methods. But then the mmu notifier methods would
become more expensive, I didn't evaluate fully what would be the side
effects of such a change. Also note, only the
kvm_mmu_notifier_invalidate_page and
kvm_mmu_notifier_invalidate_range_start would need to do that, because
they're the only two where the page reference gets dropped.

Even shorter: because in the mmu notifier methods an implicit reference on
the page exists and is held by the caller, they can flush outside the
mmu_lock. Every other place in KVM holds an implicit valid
reference on the page only as long as you hold the mmu_lock, or while
a spte is still established.

Well it's not easy logic so it's not surprising it wasn't totally
clear.

It's probably not heavily documented, and the fact you changed it is
still good so we refresh our minds on the exact rules of mmu
notifier locking, thanks!

Andrea

> 
> Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
> ---
>  virt/kvm/kvm_main.c |   19 ++++++++++---------
>  1 files changed, 10 insertions(+), 9 deletions(-)
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 470e305..2b4bc77 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -289,15 +289,15 @@ static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
>  	 */
>  	idx = srcu_read_lock(&kvm->srcu);
>  	spin_lock(&kvm->mmu_lock);
> +
>  	kvm->mmu_notifier_seq++;
>  	need_tlb_flush = kvm_unmap_hva(kvm, address) | kvm->tlbs_dirty;
> -	spin_unlock(&kvm->mmu_lock);
> -	srcu_read_unlock(&kvm->srcu, idx);
> -
>  	/* we've to flush the tlb before the pages can be freed */
>  	if (need_tlb_flush)
>  		kvm_flush_remote_tlbs(kvm);
>  
> +	spin_unlock(&kvm->mmu_lock);
> +	srcu_read_unlock(&kvm->srcu, idx);
>  }
>  
>  static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
> @@ -335,12 +335,12 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>  	for (; start < end; start += PAGE_SIZE)
>  		need_tlb_flush |= kvm_unmap_hva(kvm, start);
>  	need_tlb_flush |= kvm->tlbs_dirty;
> -	spin_unlock(&kvm->mmu_lock);
> -	srcu_read_unlock(&kvm->srcu, idx);
> -
>  	/* we've to flush the tlb before the pages can be freed */
>  	if (need_tlb_flush)
>  		kvm_flush_remote_tlbs(kvm);
> +
> +	spin_unlock(&kvm->mmu_lock);
> +	srcu_read_unlock(&kvm->srcu, idx);
>  }
>  
>  static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> @@ -378,13 +378,14 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
>  
>  	idx = srcu_read_lock(&kvm->srcu);
>  	spin_lock(&kvm->mmu_lock);
> -	young = kvm_age_hva(kvm, address);
> -	spin_unlock(&kvm->mmu_lock);
> -	srcu_read_unlock(&kvm->srcu, idx);
>  
> +	young = kvm_age_hva(kvm, address);
>  	if (young)
>  		kvm_flush_remote_tlbs(kvm);
>  
> +	spin_unlock(&kvm->mmu_lock);
> +	srcu_read_unlock(&kvm->srcu, idx);
> +
>  	return young;
>  }
>  
> -- 
> 1.7.5.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [PATCH 2/2] KVM: MMU: Flush TLBs only once in invlpg() before releasing mmu_lock
  2012-02-14  4:56           ` Takuya Yoshikawa
@ 2012-02-14 17:21             ` Andrea Arcangeli
  0 siblings, 0 replies; 24+ messages in thread
From: Andrea Arcangeli @ 2012-02-14 17:21 UTC (permalink / raw)
  To: Takuya Yoshikawa; +Cc: Xiao Guangrong, avi, mtosatti, kvm

On Tue, Feb 14, 2012 at 01:56:17PM +0900, Takuya Yoshikawa wrote:
> (2012/02/14 13:36), Takuya Yoshikawa wrote:
> 
> > BTW, do you think that "kvm_mmu_flush_tlb()" should be moved inside of the
> > mmu_lock critical section?
> >
> 
> Ah, forget about this.  Trivially no.

Yes, the reason is that it's the local flush, and guest mode isn't
running if we're running that function, so it's ok to run it later.

About the other change you did in this patch 2/2, I can't find the
code you're patching in the 3.2 upstream source; when I added the tlb
flush to invlpg, I immediately used a cumulative need_flush at the end
(before releasing mmu_lock of course):

   if (need_flush)
      kvm_flush_remote_tlbs(vcpu->kvm);
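
I.e. something along these lines (a sketch of the cumulative pattern from
memory, not the exact code, with need_flush accumulated over the loop):

   spin_lock(&vcpu->kvm->mmu_lock);
   for_each_shadow_entry(vcpu, gva, iterator) {
	   /* walk down to the last level spte for gva, as FNAME(invlpg) does */
	   need_flush |= mmu_page_zap_pte(vcpu->kvm, sp, sptep);
   }
   if (need_flush)
	   kvm_flush_remote_tlbs(vcpu->kvm);
   spin_unlock(&vcpu->kvm->mmu_lock);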


* Re: [PATCH 1/2] KVM: mmu_notifier: Flush TLBs before releasing mmu_lock
  2012-02-10  7:52 ` [PATCH 1/2] KVM: mmu_notifier: Flush TLBs " Xiao Guangrong
  2012-02-13  6:00   ` Takuya Yoshikawa
@ 2012-02-14 17:27   ` Andrea Arcangeli
  1 sibling, 0 replies; 24+ messages in thread
From: Andrea Arcangeli @ 2012-02-14 17:27 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Takuya Yoshikawa, avi, mtosatti, kvm

On Fri, Feb 10, 2012 at 03:52:49PM +0800, Xiao Guangrong wrote:
> On 02/10/2012 02:28 PM, Takuya Yoshikawa wrote:
> 
> > Other threads may process the same page in that small window and skip
> > TLB flush and then return before these functions do flush.
> > 
> 
> 
> It is possible that flush tlb in mmu lock only when writeable
> spte is invalided? Sometimes, kvm_flush_remote_tlbs need
> long time to wait.

readonly isn't enough to defer the flush until after mmu_lock is
released... if you do it only for writable sptes, then what can happen
is that the guest may read random data and would crash.

However for this case, the mmu_notifier methods (and only them) are
perfectly safe to flush the shadow MMU TLB after the mmu_lock is
released, because the page reference is guaranteed to be held by the caller
(not the case for any other place where a spte gets dropped in KVM;
all other places dropping sptes can only rely on the mmu notifier blocking
on the mmu_lock in order to have a guarantee of the page not being
freed under them, so in every other place the shadow MMU TLB flush
must happen before releasing the mmu_lock, so the mmu_notifier will
wait and prevent the page from being freed until all other CPUs running in
guest mode have stopped accessing it).


* Re: [PATCH 1/2] KVM: mmu_notifier: Flush TLBs before releasing mmu_lock
  2012-02-14 17:10 ` Andrea Arcangeli
@ 2012-02-14 17:29   ` Marcelo Tosatti
  2012-02-14 18:53     ` Andrea Arcangeli
  0 siblings, 1 reply; 24+ messages in thread
From: Marcelo Tosatti @ 2012-02-14 17:29 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Takuya Yoshikawa, avi, kvm, xiaoguangrong

On Tue, Feb 14, 2012 at 06:10:44PM +0100, Andrea Arcangeli wrote:
> On Fri, Feb 10, 2012 at 03:28:31PM +0900, Takuya Yoshikawa wrote:
> > Other threads may process the same page in that small window and skip
> > TLB flush and then return before these functions do flush.
> 
> It's correct to flush the shadow MMU TLB without the mmu_lock only in
> the context of mmu notifier methods. So the below while won't hurt,
> it's performance regression and shouldn't be applied (and
> it obfuscates the code by not being strict anymore).
> 
> To the contrary every other place that does a shadow/secondary MMU smp
> tlb flush _must_ happen inside the mmu_lock, otherwise the
> serialization isn't correct anymore against the very below mmu_lock in
> the below quoted patch taken by
> kvm_mmu_notifier_invalidate_page/kvm_mmu_notifier_invalidate_range_start.
> 
> The explanation is in commit 4539b35881ae9664b0e2953438dd83f5ee02c0b4.
> 
> I'll try to explain it more clearly: the moment you drop mmu_lock,
> pages can be freed. So if you invalidate a spte in any place inside
> the KVM code (except the mmu notifier methods where a reference of the
> page is implicitly hold by the caller and so the page can't go away
> under a mmu notifier method by design), then the below
> kvm_mmu_notifier_invalidate_page/kvm_mmu_notifier_invalidate_range_start
> won't get their need_tlb_flush set anymore, and they won't run the tlb
> flush before freeing the page.
> 
> So every other place (except mmu notifier) must flush the secondary
> MMU smp tlb _before_ releasing the mmu_lock.
> 
> Only mmu notifier is safe to flush the secondary MMU TLB _after_
> releasing the mmu_lock.
> 
> If we changed the mmu notifier methods to unconditionally flush the
> shadow TLB (regardless if a spte was present or not), we might not
> need to hold the mmu_lock in every tlb flush outside the context of
> the mmu notifier methods. But then the mmu notifier methods would
> become more expensive, I didn't evaluate fully what would be the side
> effects of such a change. Also note, only the
> kvm_mmu_notifier_invalidate_page and
> kvm_mmu_notifier_invalidate_range_start would need to do that, because
> they're the only two where the page reference gets dropped.
> 
> Even shorter: because the mmu notifier a implicit reference on the
> page exists and is hold by the caller, they can flush outside the
> mmu_lock. Every other place in KVM only holds an implicit valid
> reference on the page only as long as you hold the mmu_lock, or while
> a spte is still established.
> 
> Well it's not easy logic so it's not surprising it wasn't totally
> clear.
> 
> It's probably not heavily documented, and the fact you changed it
> still is still good so we refresh our minds on the exact rules of mmu
> notifier locking, thanks!

The problem the patch is fixing is not related to page freeing, but
rmap_write_protect. From 8bf3f6f06fcdfd097b6c6ec51531d8292fa0d81d
(replace "A (get_dirty_log)" with "mmu_notifier_invalidate_page"):


During protecting pages for dirty logging, other threads may also try
to protect a page in mmu_sync_children() or kvm_mmu_get_page().

In such a case, because get_dirty_log releases mmu_lock before flushing
TLB's, the following race condition can happen:

  A (get_dirty_log)     B (another thread)

  lock(mmu_lock)
  clear pte.w
  unlock(mmu_lock)
                        lock(mmu_lock)
                        pte.w is already cleared
                        unlock(mmu_lock)
                        skip TLB flush
                        return
  ...
  TLB flush

Though thread B assumes the page has already been protected when it
returns, the remaining TLB entry will break that assumption.


> 
> Andrea
> 
> > 
> > Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
> > ---
> >  virt/kvm/kvm_main.c |   19 ++++++++++---------
> >  1 files changed, 10 insertions(+), 9 deletions(-)
> > 
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 470e305..2b4bc77 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -289,15 +289,15 @@ static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
> >  	 */
> >  	idx = srcu_read_lock(&kvm->srcu);
> >  	spin_lock(&kvm->mmu_lock);
> > +
> >  	kvm->mmu_notifier_seq++;
> >  	need_tlb_flush = kvm_unmap_hva(kvm, address) | kvm->tlbs_dirty;
> > -	spin_unlock(&kvm->mmu_lock);
> > -	srcu_read_unlock(&kvm->srcu, idx);
> > -
> >  	/* we've to flush the tlb before the pages can be freed */
> >  	if (need_tlb_flush)
> >  		kvm_flush_remote_tlbs(kvm);
> >  
> > +	spin_unlock(&kvm->mmu_lock);
> > +	srcu_read_unlock(&kvm->srcu, idx);
> >  }
> >  
> >  static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
> > @@ -335,12 +335,12 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> >  	for (; start < end; start += PAGE_SIZE)
> >  		need_tlb_flush |= kvm_unmap_hva(kvm, start);
> >  	need_tlb_flush |= kvm->tlbs_dirty;
> > -	spin_unlock(&kvm->mmu_lock);
> > -	srcu_read_unlock(&kvm->srcu, idx);
> > -
> >  	/* we've to flush the tlb before the pages can be freed */
> >  	if (need_tlb_flush)
> >  		kvm_flush_remote_tlbs(kvm);
> > +
> > +	spin_unlock(&kvm->mmu_lock);
> > +	srcu_read_unlock(&kvm->srcu, idx);
> >  }
> >  
> >  static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> > @@ -378,13 +378,14 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
> >  
> >  	idx = srcu_read_lock(&kvm->srcu);
> >  	spin_lock(&kvm->mmu_lock);
> > -	young = kvm_age_hva(kvm, address);
> > -	spin_unlock(&kvm->mmu_lock);
> > -	srcu_read_unlock(&kvm->srcu, idx);
> >  
> > +	young = kvm_age_hva(kvm, address);
> >  	if (young)
> >  		kvm_flush_remote_tlbs(kvm);
> >  
> > +	spin_unlock(&kvm->mmu_lock);
> > +	srcu_read_unlock(&kvm->srcu, idx);
> > +
> >  	return young;
> >  }
> >  
> > -- 
> > 1.7.5.4
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe kvm" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [PATCH 1/2] KVM: mmu_notifier: Flush TLBs before releasing mmu_lock
  2012-02-14 17:29   ` Marcelo Tosatti
@ 2012-02-14 18:53     ` Andrea Arcangeli
  2012-02-14 19:43       ` Marcelo Tosatti
  0 siblings, 1 reply; 24+ messages in thread
From: Andrea Arcangeli @ 2012-02-14 18:53 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Takuya Yoshikawa, avi, kvm, xiaoguangrong

On Tue, Feb 14, 2012 at 03:29:47PM -0200, Marcelo Tosatti wrote:
> The problem the patch is fixing is not related to page freeing, but
> rmap_write_protect. From 8bf3f6f06fcdfd097b6c6ec51531d8292fa0d81d

Can't find the commit on kvm.git.

> (replace "A (get_dirty_log)" with "mmu_notifier_invalidate_page"):
> 
> 
> During protecting pages for dirty logging, other threads may also try
> to protect a page in mmu_sync_children() or kvm_mmu_get_page().
> 
> In such a case, because get_dirty_log releases mmu_lock before flushing
> TLB's, the following race condition can happen:
> 
>   A (get_dirty_log)     B (another thread)
> 
>   lock(mmu_lock)
>   clear pte.w
>   unlock(mmu_lock)
>                         lock(mmu_lock)
>                         pte.w is already cleared
>                         unlock(mmu_lock)
>                         skip TLB flush

Not sure which tree it is, but in kvm and upstream I see an
unconditional tlb flush here, not a skip (both
kvm_mmu_slot_remove_write_access and kvm_mmu_rmap_write_protect). So I
assume this has been updated in your tree to be conditional.

Also note that the flush happens outside of the mmu_lock
in the kvm_mmu_rmap_write_protect case (like in the quoted description),
so two write_protect_slot calls in parallel against each other may not be
ok, but that may be enforced by design if qemu won't ever call that
ioctl from two different userland threads (it doesn't sound security
related so it should be ok to enforce its safety by userland design).

>                         return
>   ...
>   TLB flush
> 
> Though thread B assumes the page has already been protected when it
> returns, the remaining TLB entry will break that assumption.

Now I get the question of why not running the TLB flush inside the
mmu_lock only if the spte was writable :).

kvm_mmu_get_page as long as it only runs in the context of a kvm page
fault is ok, because the page fault would be inhibited by the mmu
notifier invalidates, so maybe it's safe.

mmu_sync_children seems to have a problem instead, in your tree
get_dirty_log also has an issue if it has been updated to skip the
flush on readonly sptes, like I guess.

Interesting how the spte is already non-present and the page is just
being freed shortly later, but yet we still need to trigger write
faults synchronously and prevent other CPUs in guest mode from further
modifying the page, to avoid losing dirty bit updates or updates on
pagetables that map pagetables in the non-NPT/EPT case. If the page
was really only going to be freed it would be ok if the other cpus
would still write to it for a little longer until the page was freed.

Like I wrote in previous email, I was thinking if we'd change the mmu
notifier methods to do an unconditional flush, then every other flush
could also run outside of the mmu_lock. But then I didn't think enough
about this to be sure. My guess is we could move all flushes outside
the mmu_lock if we stop flushing the tlb conditionally on the current
spte values. It'd clearly be slower for an UP guest though :). Large
SMP guests might benefit, if that is feasible at all... It depends how
problematic the mmu_lock is on the large SMP guests and how much we're
saving by doing conditional TLB flushes.


* Re: [PATCH 1/2] KVM: mmu_notifier: Flush TLBs before releasing mmu_lock
  2012-02-14 18:53     ` Andrea Arcangeli
@ 2012-02-14 19:43       ` Marcelo Tosatti
  2012-02-15  9:18         ` Avi Kivity
  0 siblings, 1 reply; 24+ messages in thread
From: Marcelo Tosatti @ 2012-02-14 19:43 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Takuya Yoshikawa, avi, kvm, xiaoguangrong

On Tue, Feb 14, 2012 at 07:53:56PM +0100, Andrea Arcangeli wrote:
> On Tue, Feb 14, 2012 at 03:29:47PM -0200, Marcelo Tosatti wrote:
> > The problem the patch is fixing is not related to page freeing, but
> > rmap_write_protect. From 8bf3f6f06fcdfd097b6c6ec51531d8292fa0d81d
> 
> Can't find the commit on kvm.git.

Sorry, we got kvm.git out of sync. But you can see an equivalent below.

> 
> > (replace "A (get_dirty_log)" with "mmu_notifier_invalidate_page"):
> > 
> > 
> > During protecting pages for dirty logging, other threads may also try
> > to protect a page in mmu_sync_children() or kvm_mmu_get_page().
> > 
> > In such a case, because get_dirty_log releases mmu_lock before flushing
> > TLB's, the following race condition can happen:
> > 
> >   A (get_dirty_log)     B (another thread)
> > 
> >   lock(mmu_lock)
> >   clear pte.w
> >   unlock(mmu_lock)
> >                         lock(mmu_lock)
> >                         pte.w is already cleared
> >                         unlock(mmu_lock)
> >                         skip TLB flush
> 
> Not sure which tree it is, but in kvm and upstream I see an
> unconditional tlb flush here, not skip (both
> kvm_mmu_slot_remove_write_access and kvm_mmu_rmap_write_protect). So I
> assume this has been updated in your tree to eb conditional.

        if (!direct) {
                if (rmap_write_protect(vcpu->kvm, gfn))
                        kvm_flush_remote_tlbs(vcpu->kvm);


> Also note kvm_mmu_rmap_write_protect, flushes outside of the mmu_lock
> in the kvm_mmu_rmap_write_protect case (like in quoted description),
> so two write_protect_slot in parallel against each other may not be
> ok, but that may be enforced by design if qemu won't ever call that
> ioctl from two different userland threads (it doesn't sounds security
> related so it should be ok to enforce its safety by userland design).

Yes, here is the fix:

http://git.kernel.org/?p=virt/kvm/kvm.git;a=commit;h=02b48d00d7f1853bdf8a06da19ca5413ebe334c6

This is an equivalent of 8bf3f6f06fcdfd097b6c6ec51531d8292fa0d81d.

> 
> >                         return
> >   ...
> >   TLB flush
> > 
> > Though thread B assumes the page has already been protected when it
> > returns, the remaining TLB entry will break that assumption.
> 
> Now I get the question of why not running the TLB flush inside the
> mmu_lock only if the spte was writable :).
> 
> kvm_mmu_get_page as long as it only runs in the context of a kvm page
> fault is ok, because the page fault would be inhibited by the mmu
> notifier invalidates, so maybe it's safe.

Ah, perhaps, but this was not taken into account before. Can you confirm
this is the case so we can revert the invalidate_page patch?

> mmu_sync_children seems to have a problem instead, in your tree
> get_dirty_log also has an issue if it has been updated to skip the
> flush on readonly sptes, like I guess.
> 
> Interesting how the spte is already non present, the page is just
> being freed shortly later, but yet we still need to trigger write
> faults synchronously and prevent other CPUs in guest mode to further
> modify the page to avoid losing dirty bits updates or updates on
> pagetables that maps pagetables in the not NPT/EPT case. If the page
> was really only going to be freed it would be ok if the other cpus
> would still write to it for a little longer until the page was freed.
> 
> Like I wrote in previous email, I was thinking if we'd change the mmu
> notifier methods to do an unconditional flush, then every other flush
> could also run outside of the mmu_lock. But then I didn't think enough
> about this to be sure. My guess is we could move all flushes outside
> the mmu_lock if we stop flushling the tlb conditonally to the current
> spte values. It'd clearly be slower for an UP guest though :). Large
> SMP guests might benefit, if that is feasible at all... It depends how
> problematic the mmu_lock is on the large SMP guests and how much we're
> saving by doing conditional TLB flushes.

Also it should not be necessary for these flushes to be inside mmu_lock
on EPT/NPT case (since there is no write protection there). But it would
be awkward to differentiate the unlock position based on EPT/NPT.



* Re: [PATCH 1/2] KVM: mmu_notifier: Flush TLBs before releasing mmu_lock
  2012-02-14 19:43       ` Marcelo Tosatti
@ 2012-02-15  9:18         ` Avi Kivity
  2012-02-15  9:47           ` Avi Kivity
  0 siblings, 1 reply; 24+ messages in thread
From: Avi Kivity @ 2012-02-15  9:18 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Andrea Arcangeli, Takuya Yoshikawa, kvm, xiaoguangrong

On 02/14/2012 09:43 PM, Marcelo Tosatti wrote:
> Also it should not be necessary for these flushes to be inside mmu_lock
> on EPT/NPT case (since there is no write protection there). 

We do write protect with TDP, if nested virt is active.  The question is
whether we have indirect pages or not, not whether TDP is active or not
(even without TDP, if you don't enable paging in the guest, you don't
have to write protect).

> But it would
> be awkward to differentiate the unlock position based on EPT/NPT.
>

I would really like to move the IPI back out of the lock.

How about something like a sequence lock:


    spin_lock(mmu_lock)
    need_flush = write_protect_stuff();
    atomic_add(kvm->want_flush_counter, need_flush);
    spin_unlock(mmu_lock);

    while ((done = atomic_read(kvm->done_flush_counter)) < (want =
atomic_read(kvm->want_flush_counter)) {
          kvm_make_request(flush)
          atomic_cmpxchg(kvm->done_flush_counter, done, want)
    }

This (or maybe a corrected and optimized version) ensures that any
need_flush cannot pass the while () barrier, no matter which thread
encounters it first.  However it violates the "do not invent new locking
techniques" commandment.  Can we map it to some existing method?

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH 1/2] KVM: mmu_notifier: Flush TLBs before releasing mmu_lock
  2012-02-15  9:18         ` Avi Kivity
@ 2012-02-15  9:47           ` Avi Kivity
  2012-02-15 11:37             ` Xiao Guangrong
  0 siblings, 1 reply; 24+ messages in thread
From: Avi Kivity @ 2012-02-15  9:47 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Andrea Arcangeli, Takuya Yoshikawa, kvm, xiaoguangrong

On 02/15/2012 11:18 AM, Avi Kivity wrote:
> On 02/14/2012 09:43 PM, Marcelo Tosatti wrote:
> > Also it should not be necessary for these flushes to be inside mmu_lock
> > on EPT/NPT case (since there is no write protection there). 
>
> We do write protect with TDP, if nested virt is active.  The question is
> whether we have indirect pages or not, not whether TDP is active or not
> (even without TDP, if you don't enable paging in the guest, you don't
> have to write protect).
>
> > But it would
> > be awkward to differentiate the unlock position based on EPT/NPT.
> >
>
> I would really like to move the IPI back out of the lock.
>
> How about something like a sequence lock:
>
>
>     spin_lock(mmu_lock)
>     need_flush = write_protect_stuff();
>     atomic_add(kvm->want_flush_counter, need_flush);
>     spin_unlock(mmu_lock);
>
>     while ((done = atomic_read(kvm->done_flush_counter)) < (want =
> atomic_read(kvm->want_flush_counter)) {
>           kvm_make_request(flush)
>           atomic_cmpxchg(kvm->done_flush_counter, done, want)
>     }
>
> This (or maybe a corrected and optimized version) ensures that any
> need_flush cannot pass the while () barrier, no matter which thread
> encounters it first.  However it violates the "do not invent new locking
> techniques" commandment.  Can we map it to some existing method?

There is no need to advance 'want' in the loop.  So we could do

/* must call with mmu_lock held */
void kvm_mmu_defer_remote_flush(kvm, need_flush)
{
      if (need_flush)
            ++kvm->flush_counter.want;
}

/* may call without mmu_lock */
void kvm_mmu_commit_remote_flush(kvm)
{
      want = ACCESS_ONCE(kvm->flush_counter.want)
      while ((done = atomic_read(kvm->flush_counter.done) < want) {
            kvm_make_request(flush)
            atomic_cmpxchg(kvm->flush_counter.done, done, want)
      }
}
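
A call site would then look something like this (sketch; using
rmap_write_protect() as in the kvm_mmu_get_page path quoted earlier):

      spin_lock(&kvm->mmu_lock);
      need_flush = rmap_write_protect(kvm, gfn);
      kvm_mmu_defer_remote_flush(kvm, need_flush);
      spin_unlock(&kvm->mmu_lock);
      kvm_mmu_commit_remote_flush(kvm);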




-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH 1/2] KVM: mmu_notifier: Flush TLBs before releasing mmu_lock
  2012-02-15  9:47           ` Avi Kivity
@ 2012-02-15 11:37             ` Xiao Guangrong
  2012-02-15 14:07               ` Avi Kivity
  0 siblings, 1 reply; 24+ messages in thread
From: Xiao Guangrong @ 2012-02-15 11:37 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, Andrea Arcangeli, Takuya Yoshikawa, kvm

On 02/15/2012 05:47 PM, Avi Kivity wrote:

> On 02/15/2012 11:18 AM, Avi Kivity wrote:
>> On 02/14/2012 09:43 PM, Marcelo Tosatti wrote:
>>> Also it should not be necessary for these flushes to be inside mmu_lock
>>> on EPT/NPT case (since there is no write protection there). 
>>
>> We do write protect with TDP, if nested virt is active.  The question is
>> whether we have indirect pages or not, not whether TDP is active or not
>> (even without TDP, if you don't enable paging in the guest, you don't
>> have to write protect).
>>
>>> But it would
>>> be awkward to differentiate the unlock position based on EPT/NPT.
>>>
>>
>> I would really like to move the IPI back out of the lock.
>>
>> How about something like a sequence lock:
>>
>>
>>     spin_lock(mmu_lock)
>>     need_flush = write_protect_stuff();
>>     atomic_add(kvm->want_flush_counter, need_flush);
>>     spin_unlock(mmu_lock);
>>
>>     while ((done = atomic_read(kvm->done_flush_counter)) < (want =
>> atomic_read(kvm->want_flush_counter)) {
>>           kvm_make_request(flush)
>>           atomic_cmpxchg(kvm->done_flush_counter, done, want)
>>     }
>>
>> This (or maybe a corrected and optimized version) ensures that any
>> need_flush cannot pass the while () barrier, no matter which thread
>> encounters it first.  However it violates the "do not invent new locking
>> techniques" commandment.  Can we map it to some existing method?
> 
> There is no need to advance 'want' in the loop.  So we could do
> 
> /* must call with mmu_lock held */
> void kvm_mmu_defer_remote_flush(kvm, need_flush)
> {
>       if (need_flush)
>             ++kvm->flush_counter.want;
> }
> 
> /* may call without mmu_lock */
> void kvm_mmu_commit_remote_flush(kvm)
> {
>       want = ACCESS_ONCE(kvm->flush_counter.want)
>       while ((done = atomic_read(kvm->flush_counter.done) < want) {
>             kvm_make_request(flush)
>             atomic_cmpxchg(kvm->flush_counter.done, done, want)
>       }
> }
> 


Hmm, we already have kvm->tlbs_dirty, so, we can do it like this:

#define SPTE_INVALID_UNCLEAN (1 << 63 )

in invalid page path:
lock mmu_lock
if (spte is invalid)
	kvm->tlbs_dirty |= SPTE_INVALID_UNCLEAN;
need_tlb_flush = kvm->tlbs_dirty;
unlock mmu_lock
if (need_tlb_flush)
	kvm_flush_remote_tlbs()

And in page write-protected path:
lock mmu_lock
	if (it has spte change to readonly |
	      kvm->tlbs_dirty & SPTE_INVALID_UNCLEAN)
		kvm_flush_remote_tlbs()
unlock mmu_lock

How about this?



* Re: [PATCH 1/2] KVM: mmu_notifier: Flush TLBs before releasing mmu_lock
  2012-02-15 11:37             ` Xiao Guangrong
@ 2012-02-15 14:07               ` Avi Kivity
  2012-02-15 19:16                 ` Andrea Arcangeli
  2012-02-16  4:50                 ` Xiao Guangrong
  0 siblings, 2 replies; 24+ messages in thread
From: Avi Kivity @ 2012-02-15 14:07 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Marcelo Tosatti, Andrea Arcangeli, Takuya Yoshikawa, kvm

On 02/15/2012 01:37 PM, Xiao Guangrong wrote:
> >>
> >> I would really like to move the IPI back out of the lock.
> >>
> >> How about something like a sequence lock:
> >>
> >>
> >>     spin_lock(mmu_lock)
> >>     need_flush = write_protect_stuff();
> >>     atomic_add(kvm->want_flush_counter, need_flush);
> >>     spin_unlock(mmu_lock);
> >>
> >>     while ((done = atomic_read(kvm->done_flush_counter)) < (want =
> >> atomic_read(kvm->want_flush_counter)) {
> >>           kvm_make_request(flush)
> >>           atomic_cmpxchg(kvm->done_flush_counter, done, want)
> >>     }
> >>
> >> This (or maybe a corrected and optimized version) ensures that any
> >> need_flush cannot pass the while () barrier, no matter which thread
> >> encounters it first.  However it violates the "do not invent new locking
> >> techniques" commandment.  Can we map it to some existing method?
> > 
> > There is no need to advance 'want' in the loop.  So we could do
> > 
> > /* must call with mmu_lock held */
> > void kvm_mmu_defer_remote_flush(kvm, need_flush)
> > {
> >       if (need_flush)
> >             ++kvm->flush_counter.want;
> > }
> > 
> > /* may call without mmu_lock */
> > void kvm_mmu_commit_remote_flush(kvm)
> > {
> >       want = ACCESS_ONCE(kvm->flush_counter.want)
> >       while ((done = atomic_read(kvm->flush_counter.done) < want) {
> >             kvm_make_request(flush)
> >             atomic_cmpxchg(kvm->flush_counter.done, done, want)
> >       }
> > }
> > 
>
>
> Hmm, we already have kvm->tlbs_dirty, so, we can do it like this:
>
> #define SPTE_INVALID_UNCLEAN (1 << 63 )
>
> in invalid page path:
> lock mmu_lock
> if (spte is invalid)
> 	kvm->tlbs_dirty |= SPTE_INVALID_UNCLEAN;
> need_tlb_flush = kvm->tlbs_dirty;
> unlock mmu_lock
> if (need_tlb_flush)
> 	kvm_flush_remote_tlbs()
>
> And in page write-protected path:
> lock mmu_lock
> 	if (it has spte change to readonly |
> 	      kvm->tlbs_dirty & SPTE_INVALID_UNCLEAN)
> 		kvm_flush_remote_tlbs()
> unlock mmu_lock
>
> How about this?

Well, it still has flushes inside the lock.  And it seems to be more
complicated, but maybe that's because I thought of my idea and didn't
fully grok yours yet.

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH 1/2] KVM: mmu_notifier: Flush TLBs before releasing mmu_lock
  2012-02-15 14:07               ` Avi Kivity
@ 2012-02-15 19:16                 ` Andrea Arcangeli
  2012-02-16  4:50                 ` Xiao Guangrong
  1 sibling, 0 replies; 24+ messages in thread
From: Andrea Arcangeli @ 2012-02-15 19:16 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Xiao Guangrong, Marcelo Tosatti, Takuya Yoshikawa, kvm

On Wed, Feb 15, 2012 at 04:07:49PM +0200, Avi Kivity wrote:
> Well, it still has flushes inside the lock.  And it seems to be more
> complicated, but maybe that's because I thought of my idea and didn't
> fully grok yours yet.

If we go more complicated I prefer Avi's suggestion to move them all
outside the lock.

Yesterday I was also thinking about the regular pagetables and how we do
not have similar issues there. On the regular pagetables we just do an
unconditional flush in fork when we make them readonly, and KSM (the
other place that makes ptes readonly that can later COW) uses
ptep_clear_flush, which does an unconditional flush and furthermore
does it inside the PT lock, so generally we don't optimize for those
things on the regular pagetables. But then these events don't happen
as frequently as they can on KVM without EPT/NPT.


* Re: [PATCH 1/2] KVM: mmu_notifier: Flush TLBs before releasing mmu_lock
  2012-02-15 14:07               ` Avi Kivity
  2012-02-15 19:16                 ` Andrea Arcangeli
@ 2012-02-16  4:50                 ` Xiao Guangrong
  2012-02-16 11:57                   ` Avi Kivity
  1 sibling, 1 reply; 24+ messages in thread
From: Xiao Guangrong @ 2012-02-16  4:50 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, Andrea Arcangeli, Takuya Yoshikawa, kvm

On 02/15/2012 10:07 PM, Avi Kivity wrote:

> On 02/15/2012 01:37 PM, Xiao Guangrong wrote:
>>>>
>>>> I would really like to move the IPI back out of the lock.
>>>>
>>>> How about something like a sequence lock:
>>>>
>>>>
>>>>     spin_lock(mmu_lock)
>>>>     need_flush = write_protect_stuff();
>>>>     atomic_add(kvm->want_flush_counter, need_flush);
>>>>     spin_unlock(mmu_lock);
>>>>
>>>>     while ((done = atomic_read(kvm->done_flush_counter)) < (want =
>>>> atomic_read(kvm->want_flush_counter)) {
>>>>           kvm_make_request(flush)
>>>>           atomic_cmpxchg(kvm->done_flush_counter, done, want)
>>>>     }
>>>>
>>>> This (or maybe a corrected and optimized version) ensures that any
>>>> need_flush cannot pass the while () barrier, no matter which thread
>>>> encounters it first.  However it violates the "do not invent new locking
>>>> techniques" commandment.  Can we map it to some existing method?
>>>
>>> There is no need to advance 'want' in the loop.  So we could do
>>>
>>> /* must call with mmu_lock held */
>>> void kvm_mmu_defer_remote_flush(kvm, need_flush)
>>> {
>>>       if (need_flush)
>>>             ++kvm->flush_counter.want;
>>> }
>>>
>>> /* may call without mmu_lock */
>>> void kvm_mmu_commit_remote_flush(kvm)
>>> {
>>>       want = ACCESS_ONCE(kvm->flush_counter.want)
>>>       while ((done = atomic_read(kvm->flush_counter.done) < want) {
>>>             kvm_make_request(flush)
>>>             atomic_cmpxchg(kvm->flush_counter.done, done, want)
>>>       }
>>> }
>>>
>>
>>
>> Hmm, we already have kvm->tlbs_dirty, so, we can do it like this:
>>
>> #define SPTE_INVALID_UNCLEAN (1 << 63 )
>>
>> in invalid page path:
>> lock mmu_lock
>> if (spte is invalid)
>> 	kvm->tlbs_dirty |= SPTE_INVALID_UNCLEAN;
>> need_tlb_flush = kvm->tlbs_dirty;
>> unlock mmu_lock
>> if (need_tlb_flush)
>> 	kvm_flush_remote_tlbs()
>>
>> And in page write-protected path:
>> lock mmu_lock
>> 	if (it has spte change to readonly |
>> 	      kvm->tlbs_dirty & SPTE_INVALID_UNCLEAN)
>> 		kvm_flush_remote_tlbs()
>> unlock mmu_lock
>>
>> How about this?
> 
> Well, it still has flushes inside the lock.  And it seems to be more
> complicated, but maybe that's because I thought of my idea and didn't
> fully grok yours yet.
> 


Oh, I was not thinking of flushing all tlbs outside of mmu-lock, just in the
invalidate page path.

But there are still some paths that need to flush tlbs inside of mmu-lock (like
sync_children, get_page).

In your code:

>>> /* must call with mmu_lock held */
>>> void kvm_mmu_defer_remote_flush(kvm, need_flush)
>>> {
>>>       if (need_flush)
>>>             ++kvm->flush_counter.want;
>>> }
>>>
>>> /* may call without mmu_lock */
>>> void kvm_mmu_commit_remote_flush(kvm)
>>> {
>>>       want = ACCESS_ONCE(kvm->flush_counter.want)
>>>       while ((done = atomic_read(kvm->flush_counter.done) < want) {
>>>             kvm_make_request(flush)
>>>             atomic_cmpxchg(kvm->flush_counter.done, done, want)
>>>       }
>>> }

I think we do not need to handle all tlb-flush requests here, since all of these
requests can be delayed to the point where mmu-lock is released; we can simply
do it:

void kvm_mmu_defer_remote_flush(kvm, need_flush)
{
	if (need_flush)
		++kvm->tlbs_dirty;
}

void kvm_mmu_commit_remote_flush(struct kvm *kvm)
{
	int dirty_count = kvm->tlbs_dirty;

	smp_mb();

	if (!dirty_count)
		return;

	if (make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH))
		++kvm->stat.remote_tlb_flush;
	cmpxchg(&kvm->tlbs_dirty, dirty_count, 0);
}

If this is ok, we only need a small change in the current code, since
kvm_mmu_commit_remote_flush is very similar to kvm_flush_remote_tlbs().



* Re: [PATCH 1/2] KVM: mmu_notifier: Flush TLBs before releasing mmu_lock
  2012-02-16  4:50                 ` Xiao Guangrong
@ 2012-02-16 11:57                   ` Avi Kivity
  2012-02-17  2:36                     ` Xiao Guangrong
  0 siblings, 1 reply; 24+ messages in thread
From: Avi Kivity @ 2012-02-16 11:57 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Marcelo Tosatti, Andrea Arcangeli, Takuya Yoshikawa, kvm

On 02/16/2012 06:50 AM, Xiao Guangrong wrote:
> I think we do not need handle all tlb-flushed request here since all of these
> request can be delayed to the point where mmu-lock is released , we can simply
> do it:
>
> void kvm_mmu_defer_remote_flush(kvm, need_flush)
> {
> 	if (need_flush)
> 		++kvm->tlbs_dirty;
> }
>
> void kvm_mmu_commit_remote_flush(struct kvm *kvm)
> {
> 	int dirty_count = kvm->tlbs_dirty;
>
> 	smp_mb();
>
> 	if (!dirty_count)
> 		return;
>
> 	if (make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH))
> 		++kvm->stat.remote_tlb_flush;

<-- point A

> 	cmpxchg(&kvm->tlbs_dirty, dirty_count, 0);
> }
>
> if this is ok, we only need do small change in the current code, since
> kvm_mmu_commit_remote_flush is very similar with kvm_flush_remote_tlbs().

Suppose at point A another thread executes defer_remote_flush(),
commit_remote_flush(), and defer_remote_flush() again.  This brings the
value of tlbs_dirty back to 1 again, with the tlbs dirty.  The cmpxchg()
then resets tlbs_dirty, leaving the actual tlbs dirty.
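
Spelled out as an interleaving (value of kvm->tlbs_dirty on the right):

  this thread (commit)                 another thread                   tlbs_dirty
  dirty_count = 1                                                       1
  flush request                                                         1
  <point A>                            defer_remote_flush()             2
                                       commit: flush, cmpxchg(2 -> 0)   0
                                       defer_remote_flush()             1   <- dirty again
  cmpxchg(1 -> 0) succeeds                                              0   <- but tlbs still dirty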

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH 1/2] KVM: mmu_notifier: Flush TLBs before releasing mmu_lock
  2012-02-16 11:57                   ` Avi Kivity
@ 2012-02-17  2:36                     ` Xiao Guangrong
  0 siblings, 0 replies; 24+ messages in thread
From: Xiao Guangrong @ 2012-02-17  2:36 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, Andrea Arcangeli, Takuya Yoshikawa, kvm

On 02/16/2012 07:57 PM, Avi Kivity wrote:


> Suppose at point A another thread executes defer_remote_flush(),
> commit_remote_flush(), and defer_remote_flush() again.  This brings the
> balue of tlbs_dirty back to 1 again, with the tlbs dirty.  The cmpxchg()
> then resets tlbs_dirty, leaving the actual tlbs dirty.
> 


Oh, right, sorry for my carelessness!


