From mboxrd@z Thu Jan 1 00:00:00 1970 From: Andrea Arcangeli Subject: Re: KVM: MMU: rmap_write_protect() hugepage iteration bug Date: Sun, 8 Jun 2008 03:54:52 +0200 Message-ID: <20080608015452.GC8321@duo.random> References: <20080608002736.GA25582@dmt.cnet> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Avi Kivity , kvm-devel To: Marcelo Tosatti Return-path: Received: from host36-195-149-62.serverdedicati.aruba.it ([62.149.195.36]:55522 "EHLO mx.cpushare.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757046AbYFHBy6 (ORCPT ); Sat, 7 Jun 2008 21:54:58 -0400 Content-Disposition: inline In-Reply-To: <20080608002736.GA25582@dmt.cnet> Sender: kvm-owner@vger.kernel.org List-ID: On Sat, Jun 07, 2008 at 09:27:36PM -0300, Marcelo Tosatti wrote: > > rmap_next() expects the "spte" argument to be NULL whenever there's only > one remaining entry in the descriptor. That is, it was not designed to > handle changes in the chain while iterating. > > This bug cripples rmap_write_protect() so that it won't nuke all > writable large mappings to a particular gfn. Better than before but the patched version will still not nuke all writeable sptes (I did exactly the same mistake once). I found out the trouble the hard way with a trace of ksm kprobe wp_notifier faults, and that's this reordering here: desc->shadow_ptes[i] = desc->shadow_ptes[j]; desc->shadow_ptes[j] = NULL; In short there's no way to mix rmap_next and rmap_remove in the same loop as rmap_remove will reorder stuff in the array, so using the next_spte as pointer will lead to missing the sptes after the next_spte that have been reordered in place of the spte just before the next_spte. This was quite subtle issue as it's not immediately evident the first time you read rmap_remove internals. rmap_remove invocation requires to restart from scratch (or alternatively to avoid restarting from scratch, we'd need to extend the rmap methods to provide that functionality, something like rmap_next_safe that would be robust against rmap_remove if used in the same way you did now with rmap_next). For now this will fix it for real. Signed-off-by: Andrea Arcangeli diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index aaccc40..9e4622c 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -641,8 +641,9 @@ static void rmap_write_protect(struct kvm *kvm, u64 gfn) --kvm->stat.lpages; set_shadow_pte(spte, shadow_trap_nonpresent_pte); write_protected = 1; - } - spte = rmap_next(kvm, rmapp, spte); + spte = rmap_next(kvm, rmapp, NULL); + } else + spte = rmap_next(kvm, rmapp, spte); } if (write_protected)