From mboxrd@z Thu Jan  1 00:00:00 1970
From: Marcelo Tosatti <mtosatti@redhat.com>
Subject: Re: [patch 07/13] KVM: MMU: mode specific sync_page
Date: Mon, 8 Sep 2008 03:03:54 -0300
Message-ID: <20080908060354.GA1014@dmt.cnet>
References: <20080906184822.560099087@localhost.localdomain> <20080906192431.043506161@localhost.localdomain> <48C3A455.5080100@qumranet.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: KVM list <kvm@vger.kernel.org>
To: Avi Kivity <avi@qumranet.com>
Return-path: <kvm-owner@vger.kernel.org>
Received: from mx1.redhat.com ([66.187.233.31]:57928 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1752837AbYIHGF0 (ORCPT <rfc822;kvm@vger.kernel.org>);
	Mon, 8 Sep 2008 02:05:26 -0400
Content-Disposition: inline
In-Reply-To: <48C3A455.5080100@qumranet.com>
Sender: kvm-owner@vger.kernel.org
List-ID: <kvm.vger.kernel.org>

On Sun, Sep 07, 2008 at 12:52:21PM +0300, Avi Kivity wrote:
> What if vcpu0 is in mode X, while vcpu1 is in mode Y.  vcpu0 writes to  
> some pagetable, causing both mode X and mode Y shadows to become  
> unsynced, so on the next resync (either by vcpu0 or vcpu1) we need to  
> sync both modes.

Quoting from the oos core patch:

-       hlist_for_each_entry(sp, node, bucket, hash_link)
-               if (sp->gfn == gfn && sp->role.word == role.word) {
+       hlist_for_each_entry_safe(sp, node, tmp, bucket, hash_link)
+               if (sp->gfn == gfn) {
+                       /*
+                        * If a pagetable becomes referenced by more than one
+                        * root, or has multiple roles, unsync it and disable
+                        * oos. For higher level pgtables the entire tree
+                        * has to be synced.
+                        */
+                       if (sp->root_gfn != root_gfn) {
+                               kvm_set_pg_inuse(sp);
+                               if (set_shared_mmu_page(vcpu, sp))
+                                       tmp = bucket->first;
+                               kvm_clear_pg_inuse(sp);
+                               unsyncable = 0;
+                       }

So as soon as a pagetable is shadowed with different modes, it's resynced
and unsyncing is disabled.
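
To illustrate the rule, here is a minimal user-space sketch, not the kernel
code. The names toy_sp and toy_lookup are made up for the example, the hash
bucket is replaced by a plain array, and the resync that
set_shared_mmu_page() would trigger is reduced to clearing a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a shadow page; field names are illustrative only. */
struct toy_sp {
	unsigned long gfn;       /* guest frame shadowed by this page */
	unsigned long root_gfn;  /* root this shadow was created under */
	bool unsync;             /* currently out of sync with the guest */
	bool shared;             /* seen under more than one root */
};

/*
 * Walk all shadows of gfn. Any shadow created under a different root is
 * marked shared and resynced (modelled here as clearing ->unsync).
 * Returns true if the page may still be left unsynced afterwards.
 */
static bool toy_lookup(struct toy_sp *pages, size_t n,
		       unsigned long gfn, unsigned long root_gfn)
{
	bool unsyncable = true;
	size_t i;

	assert(pages != NULL || n == 0);
	for (i = 0; i < n; i++) {
		struct toy_sp *sp = &pages[i];

		if (sp->gfn != gfn)
			continue;
		if (sp->root_gfn != root_gfn) {
			sp->shared = true;
			sp->unsync = false;  /* stands in for the resync */
			unsyncable = false;
		}
	}
	return unsyncable;
}
```

A lookup that finds the same gfn under a second root returns false, so the
caller would not leave that pagetable out of sync.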

> Same problem with kvm_mmu_pte_write(), which right now hacks around it.
>
> Maybe we need a ->ops member.

>> +			if (!is_present_pte(*pt)) {
>> +				rmap_remove(vcpu->kvm, &sp->spt[i]);
>> +				sp->spt[i] = shadow_notrap_nonpresent_pte;
>> +				pt++;
>> +				continue;
>> +			}
>>   
>
> Are we missing a tlb flush?  Or will the caller take care of it?

Yes, there's a local TLB flush missing, which can be collapsed into a
single kvm_x86_ops->tlb_flush in the caller.
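
Roughly what I have in mind, as a user-space sketch only. toy_sync_page,
toy_sync_and_flush and toy_tlb_flush are hypothetical names, and the flush
is just a counter standing in for kvm_x86_ops->tlb_flush: the sync loop
only reports that a flush is needed, and the caller issues it once.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_PRESENT 0x1UL

static int toy_flush_count;  /* counts simulated local TLB flushes */

static void toy_tlb_flush(void)
{
	toy_flush_count++;
}

/*
 * Drop shadow entries whose guest entry is no longer present.
 * Does not flush; reports whether the caller has to.
 */
static bool toy_sync_page(const unsigned long *gpt, unsigned long *spt,
			  size_t n)
{
	bool need_flush = false;
	size_t i;

	assert(gpt != NULL && spt != NULL);
	for (i = 0; i < n; i++) {
		if (!(gpt[i] & TOY_PRESENT) && (spt[i] & TOY_PRESENT)) {
			spt[i] = 0;
			need_flush = true;
		}
	}
	return need_flush;
}

/* Caller: one flush per sync, no matter how many entries were dropped. */
static void toy_sync_and_flush(const unsigned long *gpt, unsigned long *spt,
			       size_t n)
{
	if (toy_sync_page(gpt, spt, n))
		toy_tlb_flush();
}
```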

>> +
>> +			pte_access = sp->role.access & FNAME(gpte_access)(vcpu, *pt);
>> +			/* user */
>> +			if (pte_access & ACC_USER_MASK)
>> +				spte |= shadow_user_mask;
>>   
>
> There are some special cases involving cr0.wp=0 and the user mask.  so  
> spte.u is not correlated exactly with gpte.u.

How come?

>> +			/* guest->shadow accessed sync */
>> +			if (!(*pt & PT_ACCESSED_MASK))
>> +				spte &= ~PT_ACCESSED_MASK;
>>   
>
> spte shouldn't be accessible at all if gpte is not accessed, so we can  
> set gpte.a on the next access (similar to spte not being writeable if  
> gpte is not dirty).

Right. Perhaps accessed bit synchronization to guest could be performed
lazily somehow, so as to avoid a vmexit on every first page access.

>> +			/* shadow->guest accessed sync */
>> +			if (spte & PT_ACCESSED_MASK)
>> +				set_bit(PT_ACCESSED_SHIFT, (unsigned long *)pt);
>>   
>
> host accessed and guest accessed are very different.  We shouldn't set  
> host accessed unless we're sure the guest will access the page very soon.
>
>> +			set_shadow_pte(&sp->spt[i], spte);
>>   
>
> What if permissions are reduced?

Then a local TLB flush is needed. Flushing the TLBs of remote vcpus
should be done by the guest AFAICS.
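
One way the check could look, as a sketch under assumptions: the TOY_* bits
follow the x86 P/RW/US layout, but the names and both helpers are invented
for this example and are not part of the patch. The update returns whether
the new spte took away a permission the old one granted, which is the case
where the local flush is needed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_PRESENT   (1UL << 0)
#define TOY_WRITABLE  (1UL << 1)
#define TOY_USER      (1UL << 2)
#define TOY_PERM_MASK (TOY_PRESENT | TOY_WRITABLE | TOY_USER)

/* True if new_spte lacks a permission bit that old_spte granted. */
static bool toy_perms_reduced(unsigned long old_spte, unsigned long new_spte)
{
	return (old_spte & ~new_spte & TOY_PERM_MASK) != 0;
}

/* Install new_spte and report whether a local TLB flush is required. */
static bool toy_update_spte(unsigned long *sptep, unsigned long new_spte)
{
	bool flush;

	assert(sptep != NULL);
	flush = toy_perms_reduced(*sptep, new_spte);
	*sptep = new_spte;
	return flush;
}
```

Granting extra permissions never requests a flush here; only removing one
does.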

> You can use PT_* instead of shadow_* as this will never be called when  
> ept is active.
>
> I'm worried about the duplication with kvm_mmu_set_pte().  Perhaps that  
> can be refactored instead to be the inner loop.

Will look into that.