From mboxrd@z Thu Jan  1 00:00:00 1970
From: Avi Kivity <avi@qumranet.com>
Subject: Re: [patch 09/13] KVM: MMU: out of sync shadow core
Date: Mon, 08 Sep 2008 17:51:26 +0300
Message-ID: <48C53BEE.4050402@qumranet.com>
References: <20080906184822.560099087@localhost.localdomain> <20080906192431.211131067@localhost.localdomain> <48C3B496.20905@qumranet.com> <20080908071933.GC1014@dmt.cnet>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: kvm@vger.kernel.org
To: Marcelo Tosatti <mtosatti@redhat.com>
Return-path: <kvm-owner@vger.kernel.org>
Received: from il.qumranet.com ([212.179.150.194]:25377 "EHLO il.qumranet.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1751254AbYIHOv2 (ORCPT <rfc822;kvm@vger.kernel.org>);
	Mon, 8 Sep 2008 10:51:28 -0400
In-Reply-To: <20080908071933.GC1014@dmt.cnet>
Sender: kvm-owner@vger.kernel.org
List-ID: <kvm.vger.kernel.org>

Marcelo Tosatti wrote:
>> I'm worried about the complexity this (and the rest) introduces.
>>
>> A possible alternative is:
>>
>> - for non-leaf pages, including roots, add a 'unsync_children' flag.
>> - when marking a page unsync, set the flag recursively on all parents
>> - when switching cr3, recursively descend to locate unsynced leaves,  
>> clearing flags along the way
>> - to speed this up, put a bitmap with 1 bit per pte in the pages (512  
>> bits = 64 bytes)
>> - the bitmap can be externally allocated to save space, or not
>>
>> This means we no longer have to worry about multiple roots, when a page  
>> acquires another root while it is unsynced, etc.
>>     
>
> I thought about that when you first mentioned it, but it seems more
> complex than the current structure. Remember you have to clean the
> unsynced flag on resync, which means walking up the parents verifying if
> this is the last unsynced children.
>   

No, if you have a false positive you can simply ignore it.

> Other than the bitmap space.
>   

The bitmap space could be stored in a separate structure.  
Alternatively, put a few u16s with indexes into the page header.  Would 
be faster to walk as well, though less general.
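
To make the tree-walk idea concrete, here is a rough userspace sketch in C. It is not kernel code and not taken from the patch: struct shadow_page, the MAX_PARENTS cap, and the functions link_child(), mark_unsync() and sync_children() are all made-up names for illustration. It assumes one bitmap bit per pte in each non-leaf page, sets bits upward through every parent when a leaf goes unsync, and clears them on the way down. A set bit that leads to nothing unsynced is a false positive and is simply ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PTES_PER_PAGE 512
#define BITMAP_WORDS  (PTES_PER_PAGE / 64)   /* 512 bits = 64 bytes */
#define MAX_PARENTS   4                      /* arbitrary cap, sketch only */

struct shadow_page;

/* Hypothetical back-pointer: which slot of which parent maps this page. */
struct parent_link {
	struct shadow_page *page;
	unsigned int index;
};

/* Hypothetical stand-in for a shadow page; NOT struct kvm_mmu_page. */
struct shadow_page {
	bool leaf;
	bool unsync;                              /* leaf only */
	uint64_t unsync_children[BITMAP_WORDS];   /* non-leaf only */
	struct shadow_page *child[PTES_PER_PAGE];
	struct parent_link parents[MAX_PARENTS];
	unsigned int nr_parents;
};

/* Set the per-pte bit in every parent, recursively up to all roots. */
static void mark_parents(struct shadow_page *sp)
{
	for (unsigned int i = 0; i < sp->nr_parents; i++) {
		struct shadow_page *p = sp->parents[i].page;
		unsigned int idx = sp->parents[i].index;

		p->unsync_children[idx / 64] |= 1ULL << (idx % 64);
		mark_parents(p);
	}
}

static void mark_unsync(struct shadow_page *leaf)
{
	leaf->unsync = true;
	mark_parents(leaf);
}

static bool needs_sync(const struct shadow_page *sp)
{
	if (sp->unsync)
		return true;
	for (unsigned int w = 0; w < BITMAP_WORDS; w++)
		if (sp->unsync_children[w])
			return true;
	return false;
}

/* Link child into parent; a page that gains a parent while unsynced
 * just propagates its marks to the new parent as well. */
static int link_child(struct shadow_page *parent, unsigned int index,
		      struct shadow_page *child)
{
	assert(index < PTES_PER_PAGE);
	if (child->nr_parents == MAX_PARENTS)
		return -1;
	parent->child[index] = child;
	child->parents[child->nr_parents].page = parent;
	child->parents[child->nr_parents].index = index;
	child->nr_parents++;
	if (needs_sync(child))
		mark_parents(child);
	return 0;
}

/* Descend from a root (e.g. on cr3 switch), clearing bits on the way.
 * Returns the number of leaves that were actually resynced. */
static unsigned int sync_children(struct shadow_page *sp)
{
	unsigned int synced = 0;

	for (unsigned int w = 0; w < BITMAP_WORDS; w++) {
		uint64_t bits = sp->unsync_children[w];

		sp->unsync_children[w] = 0;
		for (unsigned int b = 0; bits; b++, bits >>= 1) {
			struct shadow_page *c;

			if (!(bits & 1))
				continue;
			c = sp->child[w * 64 + b];
			if (!c)
				continue;               /* stale bit, ignore */
			if (!c->leaf) {
				synced += sync_children(c);
			} else if (c->unsync) {
				c->unsync = false;      /* real resync would go here */
				synced++;
			}                               /* else: false positive */
		}
	}
	return synced;
}
```

Note there is no walk back up on resync: if a leaf is reachable from two roots and gets synced through one, the other root keeps a stale bit and its next walk just finds nothing to do.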

> And see comments about multiple roles below.
>
>   
>>> @@ -963,8 +1112,24 @@ static struct kvm_mmu_page *kvm_mmu_get_
>>>  		 gfn, role.word);
>>>  	index = kvm_page_table_hashfn(gfn);
>>>  	bucket = &vcpu->kvm->arch.mmu_page_hash[index];
>>> -	hlist_for_each_entry(sp, node, bucket, hash_link)
>>> -		if (sp->gfn == gfn && sp->role.word == role.word) {
>>> +	hlist_for_each_entry_safe(sp, node, tmp, bucket, hash_link)
>>> +		if (sp->gfn == gfn) {
>>> +			/*
>>> + 			 * If a pagetable becomes referenced by more than one
>>> + 			 * root, or has multiple roles, unsync it and disable
>>> + 			 * oos. For higher level pgtables the entire tree
>>> + 			 * has to be synced.
>>> + 			 */
>>> +			if (sp->root_gfn != root_gfn) {
>>> +				kvm_set_pg_inuse(sp);
>>>   
>>>       
>> What does inuse mean exactly?
>>     
>
> That we're going to access struct kvm_mmu_page, so kvm_sync_page won't
> free it (also used for global->nonglobal resync).
>
>   

Couldn't it be passed as a parameter?

>> I became a little unsynced myself reading the patch.  It's very complex.
>>     
>
> Can you go into detail? Worrying about multiple roots is more about
> code change (passing root_gfn down to mmu_get_page etc) than structural
> complexity I think. It boils down to
>
>                     if (sp->root_gfn != root_gfn) {
>                         kvm_set_pg_inuse(sp);
>                         if (set_shared_mmu_page(vcpu, sp))
>                             tmp = bucket->first;
>                         kvm_clear_pg_inuse(sp);
>                     }
>
> And this also deals with the pagetable with shadows in different
> modes/roles case. You'd still have to deal with that by keeping unsync
> information all the way up to root.
>
>   

I'm worried about the amount of state we add: whether a page is 
single-root or multi-root, whether it's in one mode or multiple modes.  
The problem with a large amount of state is that the number of 
possible state transitions increases rapidly.

So far we treat each page completely independently of other pages (apart 
from the connectivity pointers), so we avoid the combinatorial 
explosion.  The tree walk approach keeps that (at the expense of some 
efficiency, unfortunately).

>> or disallowing a parent to be zapped while any of its  
>> children are alive.
>>     
>
> What is the problem with that? 

It reduces the mmu flexibility.  If we (say) introduce an lru algorithm, 
it is orthogonal to everything else in the mmu.  If we have a root/child 
dependency, the lru has to know.

> And what the alternative would be, 
> to zap all children first?
>   

That has the disadvantage of allowing very bad corner cases if we are 
forced to zap a root.

I'd really like to avoid bad worst cases.

> So more details please, what exactly is annoying you:
>
> - Awareness of multiple roots in the current form ? I agree its
>   not very elegant.
>   

Yes.

> - The fact that hash table bucket and active_mmu_page
>   for_each_entry_safe walks are unsafe because several list
>   entries (the unsynced leafs) can be deleted ?
>
>   

Hadn't even considered that...

What worries me most is that everything is interconnected: multiple 
modes, cr3 switch, out-of-sync, zapping via the inuse flag.  It's very 
difficult for me to understand; what about someone new?

We need to make this fit better.  We need to morph some mmu 
infrastructure into something else, but we can't keep adding complexity.

> Oh, and another argument in favour of atomic resync is that you can
> do it from mmu_get_page (for multiple role case), and from within
> mmu_set_spte (for global->nonglobal change).
>   

I'll be more comfortable with atomic resync if we have snapshots as a 
means of avoiding so many walks in atomic context.

-- 
error compiling committee.c: too many arguments to function