From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1756933Ab3CSDHQ (ORCPT ); Mon, 18 Mar 2013 23:07:16 -0400
Received: from e23smtp07.au.ibm.com ([202.81.31.140]:57027 "EHLO
	e23smtp07.au.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1754463Ab3CSDHO (ORCPT );
	Mon, 18 Mar 2013 23:07:14 -0400
Message-ID: <5147D63B.4000400@linux.vnet.ibm.com>
Date: Tue, 19 Mar 2013 11:06:35 +0800
From: Xiao Guangrong
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130110 Thunderbird/17.0.2
MIME-Version: 1.0
To: Marcelo Tosatti
CC: Gleb Natapov, LKML, KVM
Subject: Re: [PATCH 6/6] KVM: MMU: fast zap all shadow pages
References: <514006AC.2020904@linux.vnet.ibm.com> <514007A0.1040400@linux.vnet.ibm.com> <20130318204601.GA16208@amt.cnet>
In-Reply-To: <20130318204601.GA16208@amt.cnet>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
X-Content-Scanned: Fidelis XPS MAILER
x-cbid: 13031902-0260-0000-0000-000002AFBF2C
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On 03/19/2013 04:46 AM, Marcelo Tosatti wrote:
> On Wed, Mar 13, 2013 at 12:59:12PM +0800, Xiao Guangrong wrote:
>> The current kvm_mmu_zap_all is really slow - it is holding mmu-lock to
>> walk and zap all shadow pages one by one, also it need to zap all guest
>> page's rmap and all shadow page's parent spte list. Particularly, things
>> become worse if guest uses more memory or vcpus. It is not good for
>> scalability.
>>
>> Since all shadow page will be zapped, we can directly zap the mmu-cache
>> and rmap so that vcpu will fault on the new mmu-cache, after that, we can
>> directly free the memory used by old mmu-cache.
>>
>> The root shadow page is little especial since they are currently used by
>> vcpus, we can not directly free them. So, we zap the root shadow pages and
>> re-add them into the new mmu-cache.
>>
>> After this patch, kvm_mmu_zap_all can be faster 113% than before
>>
>> Signed-off-by: Xiao Guangrong
>> ---
>>  arch/x86/kvm/mmu.c |   62 ++++++++++++++++++++++++++++++++++++++++++++++-----
>>  1 files changed, 56 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>> index e326099..536d9ce 100644
>> --- a/arch/x86/kvm/mmu.c
>> +++ b/arch/x86/kvm/mmu.c
>> @@ -4186,18 +4186,68 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
>>
>>  void kvm_mmu_zap_all(struct kvm *kvm)
>>  {
>> -	struct kvm_mmu_page *sp, *node;
>> +	LIST_HEAD(root_mmu_pages);
>>  	LIST_HEAD(invalid_list);
>> +	struct list_head pte_list_descs;
>> +	struct kvm_mmu_cache *cache = &kvm->arch.mmu_cache;
>> +	struct kvm_mmu_page *sp, *node;
>> +	struct pte_list_desc *desc, *ndesc;
>> +	int root_sp = 0;
>>
>>  	spin_lock(&kvm->mmu_lock);
>> +
>>  restart:
>> -	list_for_each_entry_safe(sp, node,
>> -	      &kvm->arch.mmu_cache.active_mmu_pages, link)
>> -		if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list))
>> -			goto restart;
>> +	/*
>> +	 * The root shadow pages are being used on vcpus that can not
>> +	 * directly removed, we filter them out and re-add them to the
>> +	 * new mmu cache.
>> +	 */
>> +	list_for_each_entry_safe(sp, node, &cache->active_mmu_pages, link)
>> +		if (sp->root_count) {
>> +			int ret;
>> +
>> +			root_sp++;
>> +			ret = kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
>> +			list_move(&sp->link, &root_mmu_pages);
>> +			if (ret)
>> +				goto restart;
>> +		}
>> +
>> +	list_splice(&cache->active_mmu_pages, &invalid_list);
>> +	list_replace(&cache->pte_list_descs, &pte_list_descs);
>> +
>> +	/*
>> +	 * Reset the mmu cache so that later vcpu will fault on the new
>> +	 * mmu cache.
>> +	 */
>> +	memset(cache, 0, sizeof(*cache));
>> +	kvm_mmu_init(kvm);
>
> Xiao,
>
> I suppose zeroing of kvm_mmu_cache can be avoided, if the links are
> removed at prepare_zap_page.
> So perhaps

The purpose of zeroing kvm_mmu_cache is to reset the hashtable and some
counters. [.n_request_mmu_pages and .n_max_mmu_pages should not be
changed; I will fix this.]

>
> - spin_lock(mmu_lock)
> - for each page
>    - zero sp->spt[], remove page from linked lists

sizeof(mmu_cache) is:
	(1 << 10) * sizeof (hlist_head) + 4 * sizeof(unsigned int) = 2^13 + 16
and it is constant. With your approach, for every sp, we need to zap:
	512 entries + a hash-node = 2^12 + 8
and, more importantly, the total workload depends on the size of guest
memory. Why do you think this way is better?

> - flush remote TLB (batched)
> - spin_unlock(mmu_lock)
> - free data (which is safe because freeing has its own serialization)

We should free the root sp under mmu-lock, as my patch does.

> - spin_lock(mmu_lock)
> - account for the pages freed
> - spin_unlock(mmu_lock)

The counters are still inconsistent if another thread takes mmu-lock
between zeroing the shadow pages and the recount.

Marcelo, I am really confused about what the benefit of this approach is,
but I may have completely misunderstood it.
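
For reference, the size comparison above can be sketched as a small
standalone C fragment. This is only an illustration, not kernel code: the
constants are the figures quoted in this mail with 64-bit sizes assumed
(8-byte pointers and sptes, 4-byte unsigned int), and the names
cache_reset_bytes() and per_page_zap_bytes() are hypothetical helpers.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Figures taken from the mail, with 64-bit sizes assumed and hard-coded
 * so the result does not depend on the build host.
 */
enum {
	HASH_BUCKETS    = 1 << 10, /* hash table entries in the mmu cache */
	HLIST_HEAD_SIZE = 8,       /* assumed: one pointer per hlist_head */
	NR_COUNTERS     = 4,       /* unsigned int counters in the cache */
	UINT_SIZE       = 4,       /* assumed sizeof(unsigned int) */
	SPTES_PER_PAGE  = 512,     /* entries in sp->spt[] */
	SPTE_SIZE       = 8,       /* assumed size of one spte */
	HASH_NODE_SIZE  = 8        /* per-page hash-node figure used in the mail */
};

/* Bytes cleared by a single reset of the whole cache; constant. */
static uint64_t cache_reset_bytes(void)
{
	return (uint64_t)HASH_BUCKETS * HLIST_HEAD_SIZE +
	       (uint64_t)NR_COUNTERS * UINT_SIZE;
}

/* Bytes cleared when every shadow page is zeroed individually. */
static uint64_t per_page_zap_bytes(uint64_t nr_shadow_pages)
{
	uint64_t per_page = (uint64_t)SPTES_PER_PAGE * SPTE_SIZE + HASH_NODE_SIZE;

	/* Guard against overflow of the multiplication below. */
	assert(nr_shadow_pages <= UINT64_MAX / per_page);
	return nr_shadow_pages * per_page;
}
```

Under these assumptions cache_reset_bytes() stays at 2^13 + 16 bytes however
large the guest is, while per_page_zap_bytes(n) grows linearly with the
number of shadow pages and already exceeds it at n = 3.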