From mboxrd@z Thu Jan 1 00:00:00 1970
From: Xiao Guangrong
Subject: Re: [PATCH RFC] KVM: MMU: Don't use RCU for lockless shadow walking
Date: Tue, 24 Apr 2012 14:37:32 +0800
Message-ID: <4F964A2C.7050106@linux.vnet.ibm.com>
References: <1335197812-32064-1-git-send-email-avi@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Cc: Marcelo Tosatti, kvm@vger.kernel.org
To: Avi Kivity
In-Reply-To: <1335197812-32064-1-git-send-email-avi@redhat.com>
Sender: kvm-owner@vger.kernel.org

On 04/24/2012 12:16 AM, Avi Kivity wrote:
> Using RCU for lockless shadow walking can increase the amount of memory
> in use by the system, since RCU grace periods are unpredictable.  We also
> have an unconditional write to a shared variable (reader_counter), which
> isn't good for scaling.
>
> Replace that with a scheme similar to x86's get_user_pages_fast(): disable
> interrupts during lockless shadow walk to force the freer
> (kvm_mmu_commit_zap_page()) to wait for the TLB flush IPI to find the
> processor with interrupts enabled.
>
> We also add a new vcpu->mode, READING_SHADOW_PAGE_TABLES, to prevent
> kvm_flush_remote_tlbs() from avoiding the IPI.
>
> Signed-off-by: Avi Kivity
> ---
>
> Turned out to be simpler than expected.  However, I think there's a problem
> with make_all_cpus_request() possibly reading an incorrect vcpu->cpu.

It seems possible. Can we fix it by reading vcpu->cpu only when the vcpu is in
IN_GUEST_MODE or EXITING_GUEST_MODE (IIRC, interrupts are disabled in these
modes)?

Like:

	if (kvm_vcpu_exiting_guest_mode(vcpu) != OUTSIDE_GUEST_MODE)
		cpumask_set_cpu(vcpu->cpu, cpus);

>
>  arch/x86/include/asm/kvm_host.h |    4 ---
>  arch/x86/kvm/mmu.c              |   61 +++++++++++----------------------------
>  include/linux/kvm_host.h        |    3 +-
>  3 files changed, 19 insertions(+), 49 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index f624ca7..67e66e6 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -237,8 +237,6 @@ struct kvm_mmu_page {
>  #endif
>
>  	int write_flooding_count;
> -
> -	struct rcu_head rcu;
>  };
>
>  struct kvm_pio_request {
> @@ -536,8 +534,6 @@ struct kvm_arch {
>  	u64 hv_guest_os_id;
>  	u64 hv_hypercall;
>
> -	atomic_t reader_counter;
> -
>  #ifdef CONFIG_KVM_MMU_AUDIT
>  	int audit_point;
>  #endif
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 07424cf..903af5e 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -551,19 +551,23 @@ static u64 mmu_spte_get_lockless(u64 *sptep)
>
>  static void walk_shadow_page_lockless_begin(struct kvm_vcpu *vcpu)
>  {
> -	rcu_read_lock();
> -	atomic_inc(&vcpu->kvm->arch.reader_counter);
> -
> -	/* Increase the counter before walking shadow page table */
> -	smp_mb__after_atomic_inc();
> +	/*
> +	 * Prevent page table teardown by making any free-er wait during
> +	 * kvm_flush_remote_tlbs() IPI to all active vcpus.
> +	 */
> +	local_irq_disable();
> +	vcpu->mode = READING_SHADOW_PAGE_TABLES;
> +	/*
> +	 * wmb: advertise vcpu->mode change
> +	 * rmb: make sure we see updated sptes
> +	 */
> +	smp_mb();
>  }
>
>  static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
>  {
> -	/* Decrease the counter after walking shadow page table finished */
> -	smp_mb__before_atomic_dec();
> -	atomic_dec(&vcpu->kvm->arch.reader_counter);
> -	rcu_read_unlock();

Do we need an mb here, so that setting vcpu->mode is not reordered ahead of
the spte reads/writes? (It is safe on x86, but we need a comment at least?)

Otherwise it looks good to me; I will measure it later.
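
To make the ordering concern concrete, here is a minimal userspace sketch
using C11 atomics. It is only a model, not kernel code: struct model_vcpu,
model_walk_begin(), model_walk_end() and model_read_spte() are hypothetical
names, atomic_thread_fence(memory_order_seq_cst) stands in for smp_mb(), and
the interrupt disable/enable is left out because it has no userspace
equivalent. It also assumes that the end path sets vcpu->mode back to
OUTSIDE_GUEST_MODE, which is not visible in the hunk quoted above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Userspace stand-ins for the vcpu->mode values (model only). */
enum model_vcpu_mode {
	OUTSIDE_GUEST_MODE,
	IN_GUEST_MODE,
	EXITING_GUEST_MODE,
	READING_SHADOW_PAGE_TABLES,
};

/* Hypothetical, reduced stand-in for struct kvm_vcpu. */
struct model_vcpu {
	_Atomic int mode;
};

static void model_walk_begin(struct model_vcpu *vcpu)
{
	/* The kernel code disables interrupts first; omitted in this model. */
	atomic_store_explicit(&vcpu->mode, READING_SHADOW_PAGE_TABLES,
			      memory_order_relaxed);
	/* Models smp_mb(): publish the mode before any spte is read. */
	atomic_thread_fence(memory_order_seq_cst);
}

static void model_walk_end(struct model_vcpu *vcpu)
{
	/*
	 * The barrier in question: all spte accesses of the walk must be
	 * ordered before the store that leaves READING_SHADOW_PAGE_TABLES.
	 */
	atomic_thread_fence(memory_order_seq_cst);
	/* Assumption: the end path returns to OUTSIDE_GUEST_MODE. */
	atomic_store_explicit(&vcpu->mode, OUTSIDE_GUEST_MODE,
			      memory_order_relaxed);
}

/* One lockless "walk" that reads a single spte between begin and end. */
static uint64_t model_read_spte(struct model_vcpu *vcpu,
				_Atomic uint64_t *sptep)
{
	uint64_t spte;

	model_walk_begin(vcpu);
	assert(atomic_load(&vcpu->mode) == READING_SHADOW_PAGE_TABLES);
	spte = atomic_load_explicit(sptep, memory_order_relaxed);
	model_walk_end(vcpu);

	return spte;
}
```

Without the fence in model_walk_end(), a weakly ordered CPU could make the
mode store visible before the spte load has completed, so the freer might see
OUTSIDE_GUEST_MODE, skip the IPI and free the page while it is still being
read. On x86 a store is not reordered with earlier loads or stores, so there a
comment alone may be enough.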