From mboxrd@z Thu Jan 1 00:00:00 1970
From: Marcelo Tosatti
Subject: Re: KVM: MMU: rmap_write_protect() hugepage iteration bug
Date: Sun, 8 Jun 2008 01:48:53 -0300
Message-ID: <20080608044853.GA1408@dmt.cnet>
References: <20080608002736.GA25582@dmt.cnet> <20080608015452.GC8321@duo.random>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: Avi Kivity, kvm-devel
To: Andrea Arcangeli
Return-path:
Received: from mx1.redhat.com ([66.187.233.31]:39588 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1751483AbYFHEts (ORCPT );
	Sun, 8 Jun 2008 00:49:48 -0400
Content-Disposition: inline
In-Reply-To: <20080608015452.GC8321@duo.random>
Sender: kvm-owner@vger.kernel.org
List-ID:

Hi Andrea,

On Sun, Jun 08, 2008 at 03:54:52AM +0200, Andrea Arcangeli wrote:
> On Sat, Jun 07, 2008 at 09:27:36PM -0300, Marcelo Tosatti wrote:
> >
> > rmap_next() expects the "spte" argument to be NULL whenever there's
> > only one remaining entry in the descriptor. That is, it was not
> > designed to handle changes in the chain while iterating.
> >
> > This bug cripples rmap_write_protect() so that it won't nuke all
> > writable large mappings to a particular gfn.
>
> Better than before, but the patched version will still not nuke all
> writeable sptes (I made exactly the same mistake once).

Why not post it at the time?

> I found out the trouble the hard way with a trace of ksm kprobe
> wp_notifier faults, and it's this reordering here:
>
> 	desc->shadow_ptes[i] = desc->shadow_ptes[j];
> 	desc->shadow_ptes[j] = NULL;

You're right.

> In short, there's no way to mix rmap_next and rmap_remove in the same
> loop, as rmap_remove will reorder stuff in the array, so using the
> next_spte as a pointer will lead to missing the sptes after the
> next_spte that have been reordered in place of the spte just before
> the next_spte. This was quite a subtle issue, as it's not immediately
> evident the first time you read the rmap_remove internals.
>
> A rmap_remove invocation requires restarting from scratch (or,
> to avoid restarting from scratch, we'd need to extend the rmap
> methods to provide that functionality, something like a
> rmap_next_safe that would be robust against rmap_remove if used
> the same way you now use rmap_next).
>
> For now this will fix it for real.

This makes no difference, since rmap_next() still can't handle the case
where rmap_remove() left a single entry in the array and "spte" has
been passed as non-NULL:

	while (desc) {
		for (i = 0; i < RMAP_EXT && desc->shadow_ptes[i]; ++i) {
			if (prev_spte == spte)
				return desc->shadow_ptes[i];
			prev_spte = desc->shadow_ptes[i];
		}
		desc = desc->more;
	}

Other than the BUG_ON()'s at the beginning of the while() loop.

How bad do you think restarting is? Unless there are workloads with
very large hugepage chains, most of the data should be cached in L1.
Also, the fact that a 4K page has just been shadowed inside a large
region makes it unlikely for the same chain to be examined again soon.

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index aaccc40..d01d098 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -640,6 +640,7 @@ static void rmap_write_protect(struct kvm *kvm, u64 gfn)
 			rmap_remove(kvm, spte);
 			--kvm->stat.lpages;
 			set_shadow_pte(spte,shadow_trap_nonpresent_pte);
+			spte = NULL;
 			write_protected = 1;
 		}
 		spte = rmap_next(kvm, rmapp, spte);