KVM: MMU: rmap_write_protect() hugepage iteration bug

public inbox for kvm@vger.kernel.org
 help / color / mirror / Atom feed

* KVM: MMU: rmap_write_protect() hugepage iteration bug
@ 2008-06-08  0:27 Marcelo Tosatti
  2008-06-08  1:54 ` Andrea Arcangeli
  0 siblings, 1 reply; 11+ messages in thread
From: Marcelo Tosatti @ 2008-06-08  0:27 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm-devel


rmap_next() expects the "spte" argument to be NULL whenever there's only
one remaining entry in the descriptor. That is, it was not designed to
handle changes in the chain while iterating.

This bug cripples rmap_write_protect() so that it won't nuke all
writable large mappings to a particular gfn.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index aaccc40..e11ff17 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -632,17 +632,19 @@ static void rmap_write_protect(struct kvm *kvm, u64 gfn)
 	rmapp = gfn_to_rmap(kvm, gfn, 1);
 	spte = rmap_next(kvm, rmapp, NULL);
 	while (spte) {
+		u64 *next_spte;
 		BUG_ON(!spte);
 		BUG_ON(!(*spte & PT_PRESENT_MASK));
 		BUG_ON((*spte & (PT_PAGE_SIZE_MASK|PT_PRESENT_MASK)) != (PT_PAGE_SIZE_MASK|PT_PRESENT_MASK));
 		pgprintk("rmap_write_protect(large): spte %p %llx %lld\n", spte, *spte, gfn);
+		next_spte = rmap_next(kvm, rmapp, spte);
 		if (is_writeble_pte(*spte)) {
 			rmap_remove(kvm, spte);
 			--kvm->stat.lpages;
 			set_shadow_pte(spte, shadow_trap_nonpresent_pte);
 			write_protected = 1;
 		}
-		spte = rmap_next(kvm, rmapp, spte);
+		spte = next_spte;
 	}
 
 	if (write_protected)

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: KVM: MMU: rmap_write_protect() hugepage iteration bug
  2008-06-08  0:27 KVM: MMU: rmap_write_protect() hugepage iteration bug Marcelo Tosatti
@ 2008-06-08  1:54 ` Andrea Arcangeli
  2008-06-08  4:48   ` Marcelo Tosatti
  0 siblings, 1 reply; 11+ messages in thread
From: Andrea Arcangeli @ 2008-06-08  1:54 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Avi Kivity, kvm-devel

On Sat, Jun 07, 2008 at 09:27:36PM -0300, Marcelo Tosatti wrote:
> 
> rmap_next() expects the "spte" argument to be NULL whenever there's only
> one remaining entry in the descriptor. That is, it was not designed to
> handle changes in the chain while iterating.
> 
> This bug cripples rmap_write_protect() so that it won't nuke all
> writable large mappings to a particular gfn.

Better than before but the patched version will still not nuke all
writeable sptes (I did exactly the same mistake once).

I found out the trouble the hard way with a trace of ksm kprobe
wp_notifier faults, and that's this reordering here:

	   desc->shadow_ptes[i] = desc->shadow_ptes[j];
	   desc->shadow_ptes[j] = NULL;

In short there's no way to mix rmap_next and rmap_remove in the same
loop as rmap_remove will reorder stuff in the array, so using the
next_spte as pointer will lead to missing the sptes after the
next_spte that have been reordered in place of the spte just before
the next_spte. This was quite subtle issue as it's not immediately
evident the first time you read rmap_remove internals.

rmap_remove invocation requires to restart from scratch (or
alternatively to avoid restarting from scratch, we'd need to extend
the rmap methods to provide that functionality, something like
rmap_next_safe that would be robust against rmap_remove if used in the
same way you did now with rmap_next).

For now this will fix it for real.

Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index aaccc40..9e4622c 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -641,8 +641,9 @@ static void rmap_write_protect(struct kvm *kvm, u64 gfn)
 			--kvm->stat.lpages;
 			set_shadow_pte(spte, shadow_trap_nonpresent_pte);
 			write_protected = 1;
-		}
-		spte = rmap_next(kvm, rmapp, spte);
+			spte = rmap_next(kvm, rmapp, NULL);
+		} else
+			spte = rmap_next(kvm, rmapp, spte);
 	}

 	if (write_protected)

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: KVM: MMU: rmap_write_protect() hugepage iteration bug
  2008-06-08  1:54 ` Andrea Arcangeli
@ 2008-06-08  4:48   ` Marcelo Tosatti
  2008-06-08  7:30     ` Avi Kivity
  2008-06-08 18:31     ` Andrea Arcangeli
  0 siblings, 2 replies; 11+ messages in thread
From: Marcelo Tosatti @ 2008-06-08  4:48 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Avi Kivity, kvm-devel

Hi Andrea,

On Sun, Jun 08, 2008 at 03:54:52AM +0200, Andrea Arcangeli wrote:
> On Sat, Jun 07, 2008 at 09:27:36PM -0300, Marcelo Tosatti wrote:
> > 
> > rmap_next() expects the "spte" argument to be NULL whenever there's only
> > one remaining entry in the descriptor. That is, it was not designed to
> > handle changes in the chain while iterating.
> > 
> > This bug cripples rmap_write_protect() so that it won't nuke all
> > writable large mappings to a particular gfn.
> 
> Better than before but the patched version will still not nuke all
> writeable sptes (I did exactly the same mistake once).

Why not posting it at the time?

> I found out the trouble the hard way with a trace of ksm kprobe
> wp_notifier faults, and that's this reordering here:
> 
> 	   desc->shadow_ptes[i] = desc->shadow_ptes[j];
> 	   desc->shadow_ptes[j] = NULL;

You're right.

> In short there's no way to mix rmap_next and rmap_remove in the same
> loop as rmap_remove will reorder stuff in the array, so using the
> next_spte as pointer will lead to missing the sptes after the
> next_spte that have been reordered in place of the spte just before
> the next_spte. This was quite subtle issue as it's not immediately
> evident the first time you read rmap_remove internals.
> 
> rmap_remove invocation requires to restart from scratch (or
> alternatively to avoid restarting from scratch, we'd need to extend
> the rmap methods to provide that functionality, something like
> rmap_next_safe that would be robust against rmap_remove if used in the
> same way you did now with rmap_next).
> 
> For now this will fix it for real.

This makes no difference since rmap_next() still can't handle the case
where rmap_remove() left a single entry in the array and "spte" has been
passed as non-NULL:

        while (desc) {
                for (i = 0; i < RMAP_EXT && desc->shadow_ptes[i]; ++i) {
                        if (prev_spte == spte)
                                return desc->shadow_ptes[i];
                        prev_spte = desc->shadow_ptes[i];
                }
                desc = desc->more;
        }

Other than the BUG_ON()'s at the beginning of the while() loop.

How bad you think restarting is? Unless there are workloads with very
large hugepage chains most of the data should be cached in L1. Also the
fact that a 4K page has just been shadowed inside a large region makes
it unlikely for the same chain to be examined again soon.

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index aaccc40..d01d098 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -640,6 +640,7 @@ static void rmap_write_protect(struct kvm *kvm, u64
gfn)
                        rmap_remove(kvm, spte);
                        --kvm->stat.lpages;
                        set_shadow_pte(spte,shadow_trap_nonpresent_pte);
+                       spte = NULL;
                        write_protected = 1;
                }
                spte = rmap_next(kvm, rmapp, spte);


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: KVM: MMU: rmap_write_protect() hugepage iteration bug
  2008-06-08  4:48   ` Marcelo Tosatti
@ 2008-06-08  7:30     ` Avi Kivity
  2008-06-08  8:03       ` Avi Kivity
  2008-06-08 18:31     ` Andrea Arcangeli
  1 sibling, 1 reply; 11+ messages in thread
From: Avi Kivity @ 2008-06-08  7:30 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Andrea Arcangeli, kvm-devel

Marcelo Tosatti wrote:
> How bad you think restarting is? Unless there are workloads with very
> large hugepage chains most of the data should be cached in L1. Also the
> fact that a 4K page has just been shadowed inside a large region makes
> it unlikely for the same chain to be examined again soon.
>   

Non-pae guests will have a chain entry for every mm_struct, since the 
large ptes reside in the pgd and that is replicated for every 
mm_struct.  So it's easy to have very long chains.


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: KVM: MMU: rmap_write_protect() hugepage iteration bug
  2008-06-08  7:30     ` Avi Kivity
@ 2008-06-08  8:03       ` Avi Kivity
  2008-06-08  8:04         ` Avi Kivity
  0 siblings, 1 reply; 11+ messages in thread
From: Avi Kivity @ 2008-06-08  8:03 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Andrea Arcangeli, kvm-devel

Avi Kivity wrote:
> Marcelo Tosatti wrote:
>> How bad you think restarting is? Unless there are workloads with very
>> large hugepage chains most of the data should be cached in L1. Also the
>> fact that a 4K page has just been shadowed inside a large region makes
>> it unlikely for the same chain to be examined again soon.
>>   
>
> Non-pae guests will have a chain entry for every mm_struct, since the 
> large ptes reside in the pgd and that is replicated for every 
> mm_struct.  So it's easy to have very long chains.
>
>

Anyway I applied the patch as a stopgap.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: KVM: MMU: rmap_write_protect() hugepage iteration bug
  2008-06-08  8:03       ` Avi Kivity
@ 2008-06-08  8:04         ` Avi Kivity
  0 siblings, 0 replies; 11+ messages in thread
From: Avi Kivity @ 2008-06-08  8:04 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Andrea Arcangeli, kvm-devel

Avi Kivity wrote:
> Avi Kivity wrote:
>> Marcelo Tosatti wrote:
>>> How bad you think restarting is? Unless there are workloads with very
>>> large hugepage chains most of the data should be cached in L1. Also the
>>> fact that a 4K page has just been shadowed inside a large region makes
>>> it unlikely for the same chain to be examined again soon.
>>>   
>>
>> Non-pae guests will have a chain entry for every mm_struct, since the 
>> large ptes reside in the pgd and that is replicated for every 
>> mm_struct.  So it's easy to have very long chains.
>>
>>
>
> Anyway I applied the patch as a stopgap.
>

btw, please send a signoff for 2.6.26.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: KVM: MMU: rmap_write_protect() hugepage iteration bug
  2008-06-08  4:48   ` Marcelo Tosatti
  2008-06-08  7:30     ` Avi Kivity
@ 2008-06-08 18:31     ` Andrea Arcangeli
  2008-06-08 19:52       ` Marcelo Tosatti
  1 sibling, 1 reply; 11+ messages in thread
From: Andrea Arcangeli @ 2008-06-08 18:31 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Avi Kivity, kvm-devel

On Sun, Jun 08, 2008 at 01:48:53AM -0300, Marcelo Tosatti wrote:
> > writeable sptes (I did exactly the same mistake once).
> 
> Why not posting it at the time?

See the comments in the mail I posted a few days ago with subject
"[patch] kvm with mmu notifier v18". I posted the incompatibility with
rmap_next and rmap_remove in the same loop, the day after I fixed it
in ksm.

> This makes no difference since rmap_next() still can't handle the case
> where rmap_remove() left a single entry in the array and "spte" has been
> passed as non-NULL:

I don't see the problem, with the fixed code we never pass spte
not-NULL after any rmap_remove run.

> How bad you think restarting is? Unless there are workloads with very

It probably worth to optimize it with rmap_next_safe in the long run,
but it may not be urgent as it may be lost in the noise even if the
chains are fairly long. To be sure you could try to profile it and see
if rmap_write_protect goes up in the opreport.

> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index aaccc40..d01d098 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -640,6 +640,7 @@ static void rmap_write_protect(struct kvm *kvm, u64
> gfn)
>                         rmap_remove(kvm, spte);
>                         --kvm->stat.lpages;
>                         set_shadow_pte(spte,shadow_trap_nonpresent_pte);
> +                       spte = NULL;
>                         write_protected = 1;
>                 }
>                 spte = rmap_next(kvm, rmapp, spte);

This is shorter and functionally equivalent to my proposed fix, so you
seem to agree this makes a difference by posting this patch, and I
sure agree with the above fix too!

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: KVM: MMU: rmap_write_protect() hugepage iteration bug
  2008-06-08 18:31     ` Andrea Arcangeli
@ 2008-06-08 19:52       ` Marcelo Tosatti
  2008-06-08 20:30         ` Andrea Arcangeli
  0 siblings, 1 reply; 11+ messages in thread
From: Marcelo Tosatti @ 2008-06-08 19:52 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Avi Kivity, kvm-devel

On Sun, Jun 08, 2008 at 08:31:19PM +0200, Andrea Arcangeli wrote:
> On Sun, Jun 08, 2008 at 01:48:53AM -0300, Marcelo Tosatti wrote:
> > > writeable sptes (I did exactly the same mistake once).
> > 
> > Why not posting it at the time?
> 
> See the comments in the mail I posted a few days ago with subject
> "[patch] kvm with mmu notifier v18". I posted the incompatibility with
> rmap_next and rmap_remove in the same loop, the day after I fixed it
> in ksm.
> 
> > This makes no difference since rmap_next() still can't handle the case
> > where rmap_remove() left a single entry in the array and "spte" has been
> > passed as non-NULL:
> 
> I don't see the problem, with the fixed code we never pass spte
> not-NULL after any rmap_remove run.

We do. The case is were you have two entries in the array. rmap_remove
will first remove the entry at index 0, and move the entry at index 1 to
0.

Then we call "rmap_next()" with a non-NULL spte, which will skip the
only remaining entry at index 0. IOW rmap_next() requires the spte
argument to be NULL if the array has one valid entry.

> 
> > How bad you think restarting is? Unless there are workloads with very
> 
> It probably worth to optimize it with rmap_next_safe in the long run,
> but it may not be urgent as it may be lost in the noise even if the
> chains are fairly long. To be sure you could try to profile it and see
> if rmap_write_protect goes up in the opreport.

Will do.

> > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> > index aaccc40..d01d098 100644
> > --- a/arch/x86/kvm/mmu.c
> > +++ b/arch/x86/kvm/mmu.c
> > @@ -640,6 +640,7 @@ static void rmap_write_protect(struct kvm *kvm, u64
> > gfn)
> >                         rmap_remove(kvm, spte);
> >                         --kvm->stat.lpages;
> >                         set_shadow_pte(spte,shadow_trap_nonpresent_pte);
> > +                       spte = NULL;
> >                         write_protected = 1;
> >                 }
> >                 spte = rmap_next(kvm, rmapp, spte);
> 
> This is shorter and functionally equivalent to my proposed fix, so you
> seem to agree this makes a difference by posting this patch, and I
> sure agree with the above fix too!

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: KVM: MMU: rmap_write_protect() hugepage iteration bug
  2008-06-08 19:52       ` Marcelo Tosatti
@ 2008-06-08 20:30         ` Andrea Arcangeli
  2008-06-09  0:20           ` Marcelo Tosatti
  0 siblings, 1 reply; 11+ messages in thread
From: Andrea Arcangeli @ 2008-06-08 20:30 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Avi Kivity, kvm-devel

On Sun, Jun 08, 2008 at 04:52:15PM -0300, Marcelo Tosatti wrote:
> We do. The case is were you have two entries in the array. rmap_remove
> will first remove the entry at index 0, and move the entry at index 1 to
> 0.
> 
> Then we call "rmap_next()" with a non-NULL spte, which will skip the
> only remaining entry at index 0. IOW rmap_next() requires the spte
> argument to be NULL if the array has one valid entry.

How exactly we could call rmap_next with not-NULL spte after any
rmap_remove with your last patch applied?

> > chains are fairly long. To be sure you could try to profile it and see
> > if rmap_write_protect goes up in the opreport.
> 
> Will do.

Cool, thanks!

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: KVM: MMU: rmap_write_protect() hugepage iteration bug
  2008-06-08 20:30         ` Andrea Arcangeli
@ 2008-06-09  0:20           ` Marcelo Tosatti
  2008-06-09  0:55             ` Andrea Arcangeli
  0 siblings, 1 reply; 11+ messages in thread
From: Marcelo Tosatti @ 2008-06-09  0:20 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Avi Kivity, kvm-devel

On Sun, Jun 08, 2008 at 10:30:37PM +0200, Andrea Arcangeli wrote:
> On Sun, Jun 08, 2008 at 04:52:15PM -0300, Marcelo Tosatti wrote:
> > We do. The case is were you have two entries in the array. rmap_remove
> > will first remove the entry at index 0, and move the entry at index 1 to
> > 0.
> > 
> > Then we call "rmap_next()" with a non-NULL spte, which will skip the
> > only remaining entry at index 0. IOW rmap_next() requires the spte
> > argument to be NULL if the array has one valid entry.
> 
> How exactly we could call rmap_next with not-NULL spte after any
> rmap_remove with your last patch applied?

With the original code.



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: KVM: MMU: rmap_write_protect() hugepage iteration bug
  2008-06-09  0:20           ` Marcelo Tosatti
@ 2008-06-09  0:55             ` Andrea Arcangeli
  0 siblings, 0 replies; 11+ messages in thread
From: Andrea Arcangeli @ 2008-06-09  0:55 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Avi Kivity, kvm-devel

On Sun, Jun 08, 2008 at 09:20:41PM -0300, Marcelo Tosatti wrote:
> With the original code.

Ah sure! ;)

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2008-06-09  0:55 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-06-08  0:27 KVM: MMU: rmap_write_protect() hugepage iteration bug Marcelo Tosatti
2008-06-08  1:54 ` Andrea Arcangeli
2008-06-08  4:48   ` Marcelo Tosatti
2008-06-08  7:30     ` Avi Kivity
2008-06-08  8:03       ` Avi Kivity
2008-06-08  8:04         ` Avi Kivity
2008-06-08 18:31     ` Andrea Arcangeli
2008-06-08 19:52       ` Marcelo Tosatti
2008-06-08 20:30         ` Andrea Arcangeli
2008-06-09  0:20           ` Marcelo Tosatti
2008-06-09  0:55             ` Andrea Arcangeli

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox