public inbox for kvm@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH] Use cmpxchg for pte updates on walk_addr()
@ 2007-12-06 15:04 Marcelo Tosatti
  2007-12-06 15:24 ` Avi Kivity
  0 siblings, 1 reply; 9+ messages in thread
From: Marcelo Tosatti @ 2007-12-06 15:04 UTC (permalink / raw)
  To: Avi Kivity, Izik Eidus; +Cc: kvm-devel


In preparation for multi-threaded guest pte walking, use cmpxchg()
when updating guest pte's. This guarantees that the assignment of the
dirty bit can't be lost if two CPU's are faulting the same address
simultaneously.

In case of pte update via write emulation, no synchronization is
necessary since its the guest responsability.

Signed-off-by: Marcelo Tosatti <mtosatti-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>


diff --git a/drivers/kvm/paging_tmpl.h b/drivers/kvm/paging_tmpl.h
index b24bc7c..593073a 100644
--- a/drivers/kvm/paging_tmpl.h
+++ b/drivers/kvm/paging_tmpl.h
@@ -78,6 +78,23 @@ static gfn_t gpte_to_gfn_pde(pt_element_t gpte)
 	return (gpte & PT_DIR_BASE_ADDR_MASK) >> PAGE_SHIFT;
 }
 
+static inline void FNAME(kvm_cmpxchg_guest_pte)(struct kvm *kvm,
+			 gfn_t table_gfn, unsigned index, pt_element_t new_bits)
+{
+	pt_element_t pte;
+	pt_element_t *table;
+	struct page *page;
+
+	page = gfn_to_page(kvm, table_gfn);
+	table = kmap_atomic(page, KM_USER0);
+
+	do {
+ 		pte = table[index];
+	} while (cmpxchg(&table[index], pte, pte | new_bits) != pte);
+
+	kunmap_atomic(page, KM_USER0);
+}
+
 /*
  * Fetch a guest pte for a guest virtual address
  */
@@ -136,7 +153,8 @@ static int FNAME(walk_addr)(struct guest_walker *walker,
 		if (!(pte & PT_ACCESSED_MASK)) {
 			mark_page_dirty(vcpu->kvm, table_gfn);
 			pte |= PT_ACCESSED_MASK;
-			kvm_write_guest(vcpu->kvm, pte_gpa, &pte, sizeof(pte));
+			FNAME(kvm_cmpxchg_guest_pte)
+				(vcpu->kvm, table_gfn, index, pte);
 		}
 
 		if (walker->level == PT_PAGE_TABLE_LEVEL) {
@@ -161,7 +179,7 @@ static int FNAME(walk_addr)(struct guest_walker *walker,
 	if (write_fault && !is_dirty_pte(pte)) {
 		mark_page_dirty(vcpu->kvm, table_gfn);
 		pte |= PT_DIRTY_MASK;
-		kvm_write_guest(vcpu->kvm, pte_gpa, &pte, sizeof(pte));
+		FNAME(kvm_cmpxchg_guest_pte)(vcpu->kvm, table_gfn, index, pte);
 		kvm_mmu_pte_write(vcpu, pte_gpa, (u8 *)&pte, sizeof(pte));
 	}
 
diff --git a/drivers/kvm/x86.c b/drivers/kvm/x86.c
index c70ac33..edba8ce 100644
--- a/drivers/kvm/x86.c
+++ b/drivers/kvm/x86.c
@@ -1510,6 +1510,9 @@ static int emulator_write_phys(struct kvm_vcpu *vcpu, gpa_t gpa,
 {
 	int ret;
 
+	/* No need for kvm_cmpxchg_guest_pte here, its the guest 
+ 	 * responsability to synchronize pte updates and page faults.
+	 */
 	ret = kvm_write_guest(vcpu->kvm, gpa, val, bytes);
 	if (ret < 0)
 		return 0;

-------------------------------------------------------------------------
SF.Net email is sponsored by: The Future of Linux Business White Paper
from Novell.  From the desktop to the data center, Linux is going
mainstream.  Let it simplify your IT future.
http://altfarm.mediaplex.com/ad/ck/8857-50307-18918-4

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH] Use cmpxchg for pte updates on walk_addr()
  2007-12-06 15:04 [PATCH] Use cmpxchg for pte updates on walk_addr() Marcelo Tosatti
@ 2007-12-06 15:24 ` Avi Kivity
       [not found]   ` <47581418.8000506-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 9+ messages in thread
From: Avi Kivity @ 2007-12-06 15:24 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm-devel

Marcelo Tosatti wrote:
> In preparation for multi-threaded guest pte walking, use cmpxchg()
> when updating guest pte's. This guarantees that the assignment of the
> dirty bit can't be lost if two CPU's are faulting the same address
> simultaneously.
>
> In case of pte update via write emulation, no synchronization is
> necessary since its the guest responsability.
>
> Signed-off-by: Marcelo Tosatti <mtosatti-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>
>
> diff --git a/drivers/kvm/paging_tmpl.h b/drivers/kvm/paging_tmpl.h
> index b24bc7c..593073a 100644
> --- a/drivers/kvm/paging_tmpl.h
> +++ b/drivers/kvm/paging_tmpl.h
> @@ -78,6 +78,23 @@ static gfn_t gpte_to_gfn_pde(pt_element_t gpte)
>  	return (gpte & PT_DIR_BASE_ADDR_MASK) >> PAGE_SHIFT;
>  }
>  
> +static inline void FNAME(kvm_cmpxchg_guest_pte)(struct kvm *kvm,
> +			 gfn_t table_gfn, unsigned index, pt_element_t new_bits)
>   

You can drop the kvm_ prefix since this is static.

> +{
> +	pt_element_t pte;
> +	pt_element_t *table;
> +	struct page *page;
> +
> +	page = gfn_to_page(kvm, table_gfn);
> +	table = kmap_atomic(page, KM_USER0);
> +
> +	do {
> + 		pte = table[index];
> +	} while (cmpxchg(&table[index], pte, pte | new_bits) != pte);
> +
> +	kunmap_atomic(page, KM_USER0);
> +}
>   

I don't think cmpxchg() handles 64-bits on i386; you need  cmpxchg64().

What happens if the guest changed the pte under our feet (say, one vcpu 
accesses the page, the other vcpu swaps it out, setting the pte to 
zero)?  We end up modifying the new pte.  So we need to compare against 
the original pte, and abort the walk if we fail to set the bit.

> diff --git a/drivers/kvm/x86.c b/drivers/kvm/x86.c
> index c70ac33..edba8ce 100644
> --- a/drivers/kvm/x86.c
> +++ b/drivers/kvm/x86.c
> @@ -1510,6 +1510,9 @@ static int emulator_write_phys(struct kvm_vcpu *vcpu, gpa_t gpa,
>  {
>  	int ret;
>  
> +	/* No need for kvm_cmpxchg_guest_pte here, its the guest 
> + 	 * responsability to synchronize pte updates and page faults.
> +	 */
>  	ret = kvm_write_guest(vcpu->kvm, gpa, val, bytes);
>  	if (ret < 0)
>  		return 0;

Hmm.  What if an i386 pae guest carefully uses cmpxchg8b to atomically 
set a pte?  kvm_write_guest() doesn't guarantee atomicity, so an 
intended atomic write can be seen splitted by the guest walker doing a 
concurrent walk.

-- 
error compiling committee.c: too many arguments to function


-------------------------------------------------------------------------
SF.Net email is sponsored by: The Future of Linux Business White Paper
from Novell.  From the desktop to the data center, Linux is going
mainstream.  Let it simplify your IT future.
http://altfarm.mediaplex.com/ad/ck/8857-50307-18918-4

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] Use cmpxchg for pte updates on walk_addr()
       [not found]   ` <47581418.8000506-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2007-12-07  2:32     ` Marcelo Tosatti
  2007-12-07  5:06       ` Avi Kivity
  0 siblings, 1 reply; 9+ messages in thread
From: Marcelo Tosatti @ 2007-12-07  2:32 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, kvm-devel

On Thu, Dec 06, 2007 at 05:24:08PM +0200, Avi Kivity wrote:
> Marcelo Tosatti wrote:
> > In preparation for multi-threaded guest pte walking, use cmpxchg()
> > when updating guest pte's. This guarantees that the assignment of the
> > dirty bit can't be lost if two CPU's are faulting the same address
> > simultaneously.
> >
> > In case of pte update via write emulation, no synchronization is
> > necessary since its the guest responsability.
> >
> > Signed-off-by: Marcelo Tosatti <mtosatti-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> >
> >
> > diff --git a/drivers/kvm/paging_tmpl.h b/drivers/kvm/paging_tmpl.h
> > index b24bc7c..593073a 100644
> > --- a/drivers/kvm/paging_tmpl.h
> > +++ b/drivers/kvm/paging_tmpl.h
> > @@ -78,6 +78,23 @@ static gfn_t gpte_to_gfn_pde(pt_element_t gpte)
> >  	return (gpte & PT_DIR_BASE_ADDR_MASK) >> PAGE_SHIFT;
> >  }
> >  
> > +static inline void FNAME(kvm_cmpxchg_guest_pte)(struct kvm *kvm,
> > +			 gfn_t table_gfn, unsigned index, pt_element_t new_bits)
> >   
> 
> You can drop the kvm_ prefix since this is static.
> 
> > +{
> > +	pt_element_t pte;
> > +	pt_element_t *table;
> > +	struct page *page;
> > +
> > +	page = gfn_to_page(kvm, table_gfn);
> > +	table = kmap_atomic(page, KM_USER0);
> > +
> > +	do {
> > + 		pte = table[index];
> > +	} while (cmpxchg(&table[index], pte, pte | new_bits) != pte);
> > +
> > +	kunmap_atomic(page, KM_USER0);
> > +}
> >   
> 
> I don't think cmpxchg() handles 64-bits on i386; you need  cmpxchg64().
> 
> What happens if the guest changed the pte under our feet (say, one vcpu 
> accesses the page, the other vcpu swaps it out, setting the pte to 
> zero)?  We end up modifying the new pte.  So we need to compare against 
> the original pte, and abort the walk if we fail to set the bit.

Right, patch at end of the message restarts the process if the pte
changes under the walker. The goto is pretty ugly, but I fail to see any
elegant way of doing that. Ideas?

> > diff --git a/drivers/kvm/x86.c b/drivers/kvm/x86.c
> > index c70ac33..edba8ce 100644
> > --- a/drivers/kvm/x86.c
> > +++ b/drivers/kvm/x86.c
> > @@ -1510,6 +1510,9 @@ static int emulator_write_phys(struct kvm_vcpu *vcpu, gpa_t gpa,
> >  {
> >  	int ret;
> >  
> > +	/* No need for kvm_cmpxchg_guest_pte here, its the guest 
> > + 	 * responsability to synchronize pte updates and page faults.
> > +	 */
> >  	ret = kvm_write_guest(vcpu->kvm, gpa, val, bytes);
> >  	if (ret < 0)
> >  		return 0;
> 
> Hmm.  What if an i386 pae guest carefully uses cmpxchg8b to atomically 
> set a pte?  kvm_write_guest() doesn't guarantee atomicity, so an 
> intended atomic write can be seen splitted by the guest walker doing a 
> concurrent walk.

True, an atomic write is needed... a separate patch for that seems more
appropriate.


diff --git a/drivers/kvm/paging_tmpl.h b/drivers/kvm/paging_tmpl.h
index b24bc7c..a39f8a6 100644
--- a/drivers/kvm/paging_tmpl.h
+++ b/drivers/kvm/paging_tmpl.h
@@ -34,7 +34,9 @@
 	#define PT_LEVEL_BITS PT64_LEVEL_BITS
 	#ifdef CONFIG_X86_64
 	#define PT_MAX_FULL_LEVELS 4
+	#define CMPXCHG cmpxchg
 	#else
+	#define CMPXCHG cmpxchg64
 	#define PT_MAX_FULL_LEVELS 2
 	#endif
 #elif PTTYPE == 32
@@ -48,6 +50,7 @@
 	#define PT_LEVEL_MASK(level) PT32_LEVEL_MASK(level)
 	#define PT_LEVEL_BITS PT32_LEVEL_BITS
 	#define PT_MAX_FULL_LEVELS 2
+	#define CMPXCHG cmpxchg
 #else
 	#error Invalid PTTYPE value
 #endif
@@ -78,6 +81,24 @@ static gfn_t gpte_to_gfn_pde(pt_element_t gpte)
 	return (gpte & PT_DIR_BASE_ADDR_MASK) >> PAGE_SHIFT;
 }
 
+static inline bool FNAME(cmpxchg_gpte)(struct kvm *kvm,
+			 gfn_t table_gfn, unsigned index, 
+			 pt_element_t orig_pte, pt_element_t new_pte)
+{
+	pt_element_t ret;
+	pt_element_t *table;
+	struct page *page;
+
+	page = gfn_to_page(kvm, table_gfn);
+	table = kmap_atomic(page, KM_USER0);
+	
+	ret = CMPXCHG(&table[index], orig_pte, new_pte);
+
+	kunmap_atomic(page, KM_USER0);
+
+	return (ret != orig_pte);
+}
+
 /*
  * Fetch a guest pte for a guest virtual address
  */
@@ -91,6 +112,7 @@ static int FNAME(walk_addr)(struct guest_walker *walker,
 	gpa_t pte_gpa;
 
 	pgprintk("%s: addr %lx\n", __FUNCTION__, addr);
+walk:
 	walker->level = vcpu->mmu.root_level;
 	pte = vcpu->cr3;
 #if PTTYPE == 64
@@ -135,8 +157,9 @@ static int FNAME(walk_addr)(struct guest_walker *walker,
 
 		if (!(pte & PT_ACCESSED_MASK)) {
 			mark_page_dirty(vcpu->kvm, table_gfn);
-			pte |= PT_ACCESSED_MASK;
-			kvm_write_guest(vcpu->kvm, pte_gpa, &pte, sizeof(pte));
+			if (FNAME(cmpxchg_gpte)(vcpu->kvm, table_gfn, 
+			    index, pte, pte|PT_ACCESSED_MASK))
+				goto walk;
 		}
 
 		if (walker->level == PT_PAGE_TABLE_LEVEL) {
@@ -159,9 +182,13 @@ static int FNAME(walk_addr)(struct guest_walker *walker,
 	}
 
 	if (write_fault && !is_dirty_pte(pte)) {
+		bool ret;
 		mark_page_dirty(vcpu->kvm, table_gfn);
-		pte |= PT_DIRTY_MASK;
-		kvm_write_guest(vcpu->kvm, pte_gpa, &pte, sizeof(pte));
+		ret = FNAME(cmpxchg_gpte)(vcpu->kvm, table_gfn, index, pte,
+		       	    pte|PT_DIRTY_MASK);
+		if (ret)
+			goto walk;
+			
 		kvm_mmu_pte_write(vcpu, pte_gpa, (u8 *)&pte, sizeof(pte));
 	}
 

-------------------------------------------------------------------------
SF.Net email is sponsored by:
Check out the new SourceForge.net Marketplace.
It's the best place to buy or sell services for
just about anything Open Source.
http://sourceforge.net/services/buy/index.php

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH] Use cmpxchg for pte updates on walk_addr()
  2007-12-07  2:32     ` Marcelo Tosatti
@ 2007-12-07  5:06       ` Avi Kivity
       [not found]         ` <4758D4D2.8090208-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 9+ messages in thread
From: Avi Kivity @ 2007-12-07  5:06 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm-devel

Marcelo Tosatti wrote:
> Right, patch at end of the message restarts the process if the pte
> changes under the walker. The goto is pretty ugly, but I fail to see any
> elegant way of doing that. Ideas?
>
>   

goto is fine for that.  But there's a subtle livelock here: suppose vcpu 
0 is in guest mode with continuously updating a memory location.  vcpu 1 
is faulting with that memory location acting as a pte.  While we're in 
kernel mode, we aren't responding to signals like we should; so we need 
to abort the walk and let the guest retry; that way we go through the 
signal_pending() check.

However, this is an intrusive change, so let's start with the goto and 
drop it later in favor or an abort.

>>> @@ -1510,6 +1510,9 @@ static int emulator_write_phys(struct kvm_vcpu *vcpu, gpa_t gpa,
>>>  {
>>>  	int ret;
>>>  
>>> +	/* No need for kvm_cmpxchg_guest_pte here, its the guest 
>>> + 	 * responsability to synchronize pte updates and page faults.
>>> +	 */
>>>  	ret = kvm_write_guest(vcpu->kvm, gpa, val, bytes);
>>>  	if (ret < 0)
>>>  		return 0;
>>>       
>> Hmm.  What if an i386 pae guest carefully uses cmpxchg8b to atomically 
>> set a pte?  kvm_write_guest() doesn't guarantee atomicity, so an 
>> intended atomic write can be seen splitted by the guest walker doing a 
>> concurrent walk.
>>     
>
> True, an atomic write is needed... a separate patch for that seems more
> appropriate.
>
>
>   

Yes.

> +static inline bool FNAME(cmpxchg_gpte)(struct kvm *kvm,
> +			 gfn_t table_gfn, unsigned index, 
> +			 pt_element_t orig_pte, pt_element_t new_pte)
> +{
> +	pt_element_t ret;
> +	pt_element_t *table;
> +	struct page *page;
> +
> +	page = gfn_to_page(kvm, table_gfn);
> +	table = kmap_atomic(page, KM_USER0);
> +	
> +	ret = CMPXCHG(&table[index], orig_pte, new_pte);
> +
> +	kunmap_atomic(page, KM_USER0);
> +
>   

Missing kvm_release_page_dirty() here.  May also move mark_page_dirty() 
here.

No need to force inlining.

> +	return (ret != orig_pte);
> +}
> +
>  /*
>   * Fetch a guest pte for a guest virtual address
>   */
> @@ -91,6 +112,7 @@ static int FNAME(walk_addr)(struct guest_walker *walker,
>  	gpa_t pte_gpa;
>  
>  	pgprintk("%s: addr %lx\n", __FUNCTION__, addr);
> +walk:
>  	walker->level = vcpu->mmu.root_level;
>  	pte = vcpu->cr3;
>  #if PTTYPE == 64
> @@ -135,8 +157,9 @@ static int FNAME(walk_addr)(struct guest_walker *walker,
>  
>  		if (!(pte & PT_ACCESSED_MASK)) {
>  			mark_page_dirty(vcpu->kvm, table_gfn);
> -			pte |= PT_ACCESSED_MASK;
> -			kvm_write_guest(vcpu->kvm, pte_gpa, &pte, sizeof(pte));
> +			if (FNAME(cmpxchg_gpte)(vcpu->kvm, table_gfn, 
> +			    index, pte, pte|PT_ACCESSED_MASK))
> +				goto walk;
>   

We lose the accessed bit in the local variable pte here.  Not sure if it 
matters but let's play it safe.

>  		}
>  
>  		if (walker->level == PT_PAGE_TABLE_LEVEL) {
> @@ -159,9 +182,13 @@ static int FNAME(walk_addr)(struct guest_walker *walker,
>  	}
>  
>  	if (write_fault && !is_dirty_pte(pte)) {
> +		bool ret;
>  		mark_page_dirty(vcpu->kvm, table_gfn);
> -		pte |= PT_DIRTY_MASK;
> -		kvm_write_guest(vcpu->kvm, pte_gpa, &pte, sizeof(pte));
> +		ret = FNAME(cmpxchg_gpte)(vcpu->kvm, table_gfn, index, pte,
> +		       	    pte|PT_DIRTY_MASK);
> +		if (ret)
> +			goto walk;
> +	

Again we lose a bit in pte.  That ends up in walker->pte and is quite 
important.




-- 
Any sufficiently difficult bug is indistinguishable from a feature.


-------------------------------------------------------------------------
SF.Net email is sponsored by:
Check out the new SourceForge.net Marketplace.
It's the best place to buy or sell services for
just about anything Open Source.
http://sourceforge.net/services/buy/index.php

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] Use cmpxchg for pte updates on walk_addr()
       [not found]         ` <4758D4D2.8090208-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2007-12-07 12:56           ` Marcelo Tosatti
  2007-12-07 17:54             ` Andrea Arcangeli
  2007-12-09  8:38             ` Avi Kivity
  2007-12-07 22:47           ` Marcelo Tosatti
  1 sibling, 2 replies; 9+ messages in thread
From: Marcelo Tosatti @ 2007-12-07 12:56 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm-devel

On Fri, Dec 07, 2007 at 07:06:26AM +0200, Avi Kivity wrote:
> Marcelo Tosatti wrote:
> >Right, patch at end of the message restarts the process if the pte
> >changes under the walker. The goto is pretty ugly, but I fail to see any
> >elegant way of doing that. Ideas?
> >
> >  
> 
> goto is fine for that.  But there's a subtle livelock here: suppose vcpu 
> 0 is in guest mode with continuously updating a memory location.  vcpu 1 
> is faulting with that memory location acting as a pte.  While we're in 
> kernel mode, we aren't responding to signals like we should; so we need 
> to abort the walk and let the guest retry; that way we go through the 
> signal_pending() check.

I see that possibility, but why on earth would the guest be continuously
updating a pagetable entry ?

> 
> However, this is an intrusive change, so let's start with the goto and 
> drop it later in favor or an abort.
> 
> >>>@@ -1510,6 +1510,9 @@ static int emulator_write_phys(struct kvm_vcpu 
> >>>*vcpu, gpa_t gpa,
> >>> {
> >>> 	int ret;
> >>> 
> >>>+	/* No need for kvm_cmpxchg_guest_pte here, its the guest 
> >>>+ 	 * responsability to synchronize pte updates and page faults.
> >>>+	 */
> >>> 	ret = kvm_write_guest(vcpu->kvm, gpa, val, bytes);
> >>> 	if (ret < 0)
> >>> 		return 0;
> >>>      
> >>Hmm.  What if an i386 pae guest carefully uses cmpxchg8b to atomically 
> >>set a pte?  kvm_write_guest() doesn't guarantee atomicity, so an 
> >>intended atomic write can be seen splitted by the guest walker doing a 
> >>concurrent walk.
> >>    
> >
> >True, an atomic write is needed... a separate patch for that seems more
> >appropriate.
> >
> >
> >  
> 
> Yes.
> 
> >+static inline bool FNAME(cmpxchg_gpte)(struct kvm *kvm,
> >+			 gfn_t table_gfn, unsigned index, 
> >+			 pt_element_t orig_pte, pt_element_t new_pte)
> >+{
> >+	pt_element_t ret;
> >+	pt_element_t *table;
> >+	struct page *page;
> >+
> >+	page = gfn_to_page(kvm, table_gfn);
> >+	table = kmap_atomic(page, KM_USER0);
> >+	
> >+	ret = CMPXCHG(&table[index], orig_pte, new_pte);
> >+
> >+	kunmap_atomic(page, KM_USER0);
> >+
> >  
> 
> Missing kvm_release_page_dirty() here.  May also move mark_page_dirty() 
> here.

I prefer leaving that for later when the locking for mark_page_dirty()
gets sorted out...

> No need to force inlining.

Alright.

> >+	return (ret != orig_pte);
> >+}
> >+
> > /*
> >  * Fetch a guest pte for a guest virtual address
> >  */
> >@@ -91,6 +112,7 @@ static int FNAME(walk_addr)(struct guest_walker *walker,
> > 	gpa_t pte_gpa;
> > 
> > 	pgprintk("%s: addr %lx\n", __FUNCTION__, addr);
> >+walk:
> > 	walker->level = vcpu->mmu.root_level;
> > 	pte = vcpu->cr3;
> > #if PTTYPE == 64
> >@@ -135,8 +157,9 @@ static int FNAME(walk_addr)(struct guest_walker 
> >*walker,
> > 
> > 		if (!(pte & PT_ACCESSED_MASK)) {
> > 			mark_page_dirty(vcpu->kvm, table_gfn);
> >-			pte |= PT_ACCESSED_MASK;
> >-			kvm_write_guest(vcpu->kvm, pte_gpa, &pte, 
> >sizeof(pte));
> >+			if (FNAME(cmpxchg_gpte)(vcpu->kvm, table_gfn, 
> >+			    index, pte, pte|PT_ACCESSED_MASK))
> >+				goto walk;
> >  
> 
> We lose the accessed bit in the local variable pte here.  Not sure if it 
> matters but let's play it safe.
>
> > 		}
> > 
> > 		if (walker->level == PT_PAGE_TABLE_LEVEL) {
> >@@ -159,9 +182,13 @@ static int FNAME(walk_addr)(struct guest_walker 
> >*walker,
> > 	}
> > 
> > 	if (write_fault && !is_dirty_pte(pte)) {
> >+		bool ret;
> > 		mark_page_dirty(vcpu->kvm, table_gfn);
> >-		pte |= PT_DIRTY_MASK;
> >-		kvm_write_guest(vcpu->kvm, pte_gpa, &pte, sizeof(pte));
> >+		ret = FNAME(cmpxchg_gpte)(vcpu->kvm, table_gfn, index, pte,
> >+		       	    pte|PT_DIRTY_MASK);
> >+		if (ret)
> >+			goto walk;
> >+	
> 
> Again we lose a bit in pte.  That ends up in walker->pte and is quite 
> important.

Oops!

Updated patch:


diff --git a/drivers/kvm/paging_tmpl.h b/drivers/kvm/paging_tmpl.h
index b24bc7c..2e54b02 100644
--- a/drivers/kvm/paging_tmpl.h
+++ b/drivers/kvm/paging_tmpl.h
@@ -34,7 +34,9 @@
 	#define PT_LEVEL_BITS PT64_LEVEL_BITS
 	#ifdef CONFIG_X86_64
 	#define PT_MAX_FULL_LEVELS 4
+	#define CMPXCHG cmpxchg
 	#else
+	#define CMPXCHG cmpxchg64
 	#define PT_MAX_FULL_LEVELS 2
 	#endif
 #elif PTTYPE == 32
@@ -48,6 +50,7 @@
 	#define PT_LEVEL_MASK(level) PT32_LEVEL_MASK(level)
 	#define PT_LEVEL_BITS PT32_LEVEL_BITS
 	#define PT_MAX_FULL_LEVELS 2
+	#define CMPXCHG cmpxchg
 #else
 	#error Invalid PTTYPE value
 #endif
@@ -78,6 +81,26 @@ static gfn_t gpte_to_gfn_pde(pt_element_t gpte)
 	return (gpte & PT_DIR_BASE_ADDR_MASK) >> PAGE_SHIFT;
 }
 
+static bool FNAME(cmpxchg_gpte)(struct kvm *kvm,
+			 gfn_t table_gfn, unsigned index, 
+			 pt_element_t orig_pte, pt_element_t new_pte)
+{
+	pt_element_t ret;
+	pt_element_t *table;
+	struct page *page;
+
+	page = gfn_to_page(kvm, table_gfn);
+	table = kmap_atomic(page, KM_USER0);
+	
+	ret = CMPXCHG(&table[index], orig_pte, new_pte);
+
+	kunmap_atomic(page, KM_USER0);
+
+	kvm_release_page_dirty(page);
+
+	return (ret != orig_pte);
+}
+
 /*
  * Fetch a guest pte for a guest virtual address
  */
@@ -91,6 +114,7 @@ static int FNAME(walk_addr)(struct guest_walker *walker,
 	gpa_t pte_gpa;
 
 	pgprintk("%s: addr %lx\n", __FUNCTION__, addr);
+walk:
 	walker->level = vcpu->mmu.root_level;
 	pte = vcpu->cr3;
 #if PTTYPE == 64
@@ -135,8 +159,10 @@ static int FNAME(walk_addr)(struct guest_walker *walker,
 
 		if (!(pte & PT_ACCESSED_MASK)) {
 			mark_page_dirty(vcpu->kvm, table_gfn);
+			if (FNAME(cmpxchg_gpte)(vcpu->kvm, table_gfn, 
+			    index, pte, pte|PT_ACCESSED_MASK))
+				goto walk;
 			pte |= PT_ACCESSED_MASK;
-			kvm_write_guest(vcpu->kvm, pte_gpa, &pte, sizeof(pte));
 		}
 
 		if (walker->level == PT_PAGE_TABLE_LEVEL) {
@@ -159,9 +185,13 @@ static int FNAME(walk_addr)(struct guest_walker *walker,
 	}
 
 	if (write_fault && !is_dirty_pte(pte)) {
+		bool ret;
 		mark_page_dirty(vcpu->kvm, table_gfn);
+		ret = FNAME(cmpxchg_gpte)(vcpu->kvm, table_gfn, index, pte,
+		       	    pte|PT_DIRTY_MASK);
+		if (ret)
+			goto walk;
 		pte |= PT_DIRTY_MASK;
-		kvm_write_guest(vcpu->kvm, pte_gpa, &pte, sizeof(pte));
 		kvm_mmu_pte_write(vcpu, pte_gpa, (u8 *)&pte, sizeof(pte));
 	}
 

-------------------------------------------------------------------------
SF.Net email is sponsored by:
Check out the new SourceForge.net Marketplace.
It's the best place to buy or sell services for
just about anything Open Source.
http://sourceforge.net/services/buy/index.php

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH] Use cmpxchg for pte updates on walk_addr()
  2007-12-07 12:56           ` Marcelo Tosatti
@ 2007-12-07 17:54             ` Andrea Arcangeli
  2007-12-09  8:38             ` Avi Kivity
  1 sibling, 0 replies; 9+ messages in thread
From: Andrea Arcangeli @ 2007-12-07 17:54 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm-devel, Avi Kivity

On Fri, Dec 07, 2007 at 07:56:58AM -0500, Marcelo Tosatti wrote:
> I see that possibility, but why on earth would the guest be continuously
> updating a pagetable entry ?

If I understood correctly this would be just to be more robust against
malicious guests that could try to create unkillable tight loops in
host kernel mode with -smp > 1. Not sure if timings would allow it
though. With numa archs if the guest cpu is lucky and has the pte in
its direct connected ram, and the walker cpu instead need to access it
remotely, such starvation could probably materialize.

-------------------------------------------------------------------------
SF.Net email is sponsored by:
Check out the new SourceForge.net Marketplace.
It's the best place to buy or sell services for
just about anything Open Source.
http://sourceforge.net/services/buy/index.php

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] Use cmpxchg for pte updates on walk_addr()
       [not found]         ` <4758D4D2.8090208-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
  2007-12-07 12:56           ` Marcelo Tosatti
@ 2007-12-07 22:47           ` Marcelo Tosatti
  2007-12-09  8:47             ` Avi Kivity
  1 sibling, 1 reply; 9+ messages in thread
From: Marcelo Tosatti @ 2007-12-07 22:47 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm-devel

On Fri, Dec 07, 2007 at 07:06:26AM +0200, Avi Kivity wrote:
> Marcelo Tosatti wrote:
> >Right, patch at end of the message restarts the process if the pte
> >changes under the walker. The goto is pretty ugly, but I fail to see any
> >elegant way of doing that. Ideas?
> >
> >  
> 
> goto is fine for that.  But there's a subtle livelock here: suppose vcpu 
> 0 is in guest mode with continuously updating a memory location.  vcpu 1 
> is faulting with that memory location acting as a pte.  While we're in 
> kernel mode, we aren't responding to signals like we should; so we need 
> to abort the walk and let the guest retry; that way we go through the 
> signal_pending() check.
> 
> However, this is an intrusive change, so let's start with the goto and 
> drop it later in favor or an abort.
> 
> >>>@@ -1510,6 +1510,9 @@ static int emulator_write_phys(struct kvm_vcpu 
> >>>*vcpu, gpa_t gpa,
> >>> {
> >>> 	int ret;
> >>> 
> >>>+	/* No need for kvm_cmpxchg_guest_pte here, its the guest 
> >>>+ 	 * responsability to synchronize pte updates and page faults.
> >>>+	 */
> >>> 	ret = kvm_write_guest(vcpu->kvm, gpa, val, bytes);
> >>> 	if (ret < 0)
> >>> 		return 0;
> >>>      
> >>Hmm.  What if an i386 pae guest carefully uses cmpxchg8b to atomically 
> >>set a pte?  kvm_write_guest() doesn't guarantee atomicity, so an 
> >>intended atomic write can be seen splitted by the guest walker doing a 
> >>concurrent walk.
> >>    
> >
> >True, an atomic write is needed... a separate patch for that seems more
> >appropriate.
> >
> >
> >  
> 
> Yes.

Hi Avi,

How about the following

diff --git a/drivers/kvm/x86.c b/drivers/kvm/x86.c
index c70ac33..8678651 100644
--- a/drivers/kvm/x86.c
+++ b/drivers/kvm/x86.c
@@ -25,6 +25,7 @@
 #include <linux/vmalloc.h>
 #include <linux/module.h>
 #include <linux/mman.h>
+#include <linux/highmem.h>
 
 #include <asm/uaccess.h>
 #include <asm/msr.h>
@@ -1589,6 +1590,34 @@ static int emulator_cmpxchg_emulated(unsigned long addr,
 		reported = 1;
 		printk(KERN_WARNING "kvm: emulating exchange as write\n");
 	}
+
+	/* guests cmpxchg8b have to be emulated atomically */
+	if (bytes == 8) {
+		gpa_t gpa = vcpu->mmu.gva_to_gpa(vcpu, addr);
+		struct page *page;
+		char *addr;
+		u64 *val;
+
+		if (gpa == UNMAPPED_GVA ||
+	    	   (gpa & PAGE_MASK) == APIC_DEFAULT_PHYS_BASE)
+			goto emul_write;
+
+		val = (u64 *)new;
+		page = gfn_to_page(vcpu->kvm, gpa >> PAGE_SHIFT);
+		addr = kmap_atomic(page, KM_USER0);
+		addr += offset_in_page(gpa);
+
+#ifdef CONFIG_X86_64
+		set_64bit((unsigned long *)addr, *val);
+#else
+		set_64bit((unsigned long long *)addr, *val);
+#endif
+		kunmap_atomic(page, KM_USER0);
+		kvm_release_page_dirty(page);
+			
+	}
+
+emul_write:
 	return emulator_write_emulated(addr, new, bytes, vcpu);
 }
 

-------------------------------------------------------------------------
SF.Net email is sponsored by:
Check out the new SourceForge.net Marketplace.
It's the best place to buy or sell services for
just about anything Open Source.
http://sourceforge.net/services/buy/index.php

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH] Use cmpxchg for pte updates on walk_addr()
  2007-12-07 12:56           ` Marcelo Tosatti
  2007-12-07 17:54             ` Andrea Arcangeli
@ 2007-12-09  8:38             ` Avi Kivity
  1 sibling, 0 replies; 9+ messages in thread
From: Avi Kivity @ 2007-12-09  8:38 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm-devel

Marcelo Tosatti wrote:
> Oops!
>
> Updated patch:
>
>
>   

Applied, thanks.

>  	#define PT_LEVEL_BITS PT64_LEVEL_BITS
>  	#ifdef CONFIG_X86_64
>  	#define PT_MAX_FULL_LEVELS 4
> +	#define CMPXCHG cmpxchg
>  	#else
> +	#define CMPXCHG cmpxchg64
>  	#define PT_MAX_FULL_LEVELS 2
>  	#endif
>   

I updated the #undef block at the end of paging_tmpl.h to account for
the new define, otherwise it wouldn't compile cleanly on i386 due to
CMPXCHG being redefined.


I picked up the patch description and signoff from the first version of
this patch, but please keep them with the patch next time.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


-------------------------------------------------------------------------
SF.Net email is sponsored by:
Check out the new SourceForge.net Marketplace.
It's the best place to buy or sell services for
just about anything Open Source.
http://sourceforge.net/services/buy/index.php

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] Use cmpxchg for pte updates on walk_addr()
  2007-12-07 22:47           ` Marcelo Tosatti
@ 2007-12-09  8:47             ` Avi Kivity
  0 siblings, 0 replies; 9+ messages in thread
From: Avi Kivity @ 2007-12-09  8:47 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm-devel

Marcelo Tosatti wrote:
> On Fri, Dec 07, 2007 at 07:06:26AM +0200, Avi Kivity wrote:
>   
>> Marcelo Tosatti wrote:
>>     
>>> Right, patch at end of the message restarts the process if the pte
>>> changes under the walker. The goto is pretty ugly, but I fail to see any
>>> elegant way of doing that. Ideas?
>>>
>>>  
>>>       
>> goto is fine for that.  But there's a subtle livelock here: suppose vcpu 
>> 0 is in guest mode with continuously updating a memory location.  vcpu 1 
>> is faulting with that memory location acting as a pte.  While we're in 
>> kernel mode, we aren't responding to signals like we should; so we need 
>> to abort the walk and let the guest retry; that way we go through the 
>> signal_pending() check.
>>
>> However, this is an intrusive change, so let's start with the goto and 
>> drop it later in favor or an abort.
>>
>>     
>>>>> @@ -1510,6 +1510,9 @@ static int emulator_write_phys(struct kvm_vcpu 
>>>>> *vcpu, gpa_t gpa,
>>>>> {
>>>>> 	int ret;
>>>>>
>>>>> +	/* No need for kvm_cmpxchg_guest_pte here, its the guest 
>>>>> + 	 * responsability to synchronize pte updates and page faults.
>>>>> +	 */
>>>>> 	ret = kvm_write_guest(vcpu->kvm, gpa, val, bytes);
>>>>> 	if (ret < 0)
>>>>> 		return 0;
>>>>>      
>>>>>           
>>>> Hmm.  What if an i386 pae guest carefully uses cmpxchg8b to atomically 
>>>> set a pte?  kvm_write_guest() doesn't guarantee atomicity, so an 
>>>> intended atomic write can be seen splitted by the guest walker doing a 
>>>> concurrent walk.
>>>>    
>>>>         
>>> True, an atomic write is needed... a separate patch for that seems more
>>> appropriate.
>>>
>>>
>>>  
>>>       
>> Yes.
>>     
>
> Hi Avi,
>
> How about the following
>
> diff --git a/drivers/kvm/x86.c b/drivers/kvm/x86.c
> index c70ac33..8678651 100644
> --- a/drivers/kvm/x86.c
> +++ b/drivers/kvm/x86.c
> @@ -25,6 +25,7 @@
>  #include <linux/vmalloc.h>
>  #include <linux/module.h>
>  #include <linux/mman.h>
> +#include <linux/highmem.h>
>  
>  #include <asm/uaccess.h>
>  #include <asm/msr.h>
> @@ -1589,6 +1590,34 @@ static int emulator_cmpxchg_emulated(unsigned long addr,
>  		reported = 1;
>  		printk(KERN_WARNING "kvm: emulating exchange as write\n");
>  	}
> +
> +	/* guests cmpxchg8b have to be emulated atomically */
> +	if (bytes == 8) {
>   

It's only needed on i386, as 8-byte copy_to_user should be atomic on x86_64.


-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


-------------------------------------------------------------------------
SF.Net email is sponsored by:
Check out the new SourceForge.net Marketplace.
It's the best place to buy or sell services for
just about anything Open Source.
http://sourceforge.net/services/buy/index.php

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2007-12-09  8:47 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-12-06 15:04 [PATCH] Use cmpxchg for pte updates on walk_addr() Marcelo Tosatti
2007-12-06 15:24 ` Avi Kivity
     [not found]   ` <47581418.8000506-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
2007-12-07  2:32     ` Marcelo Tosatti
2007-12-07  5:06       ` Avi Kivity
     [not found]         ` <4758D4D2.8090208-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
2007-12-07 12:56           ` Marcelo Tosatti
2007-12-07 17:54             ` Andrea Arcangeli
2007-12-09  8:38             ` Avi Kivity
2007-12-07 22:47           ` Marcelo Tosatti
2007-12-09  8:47             ` Avi Kivity

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox