From mboxrd@z Thu Jan 1 00:00:00 1970
From: Andrea Arcangeli
Subject: Re: [kvm-devel] performance with guests running 2.4 kernels
 (specifically RHEL3)
Date: Thu, 29 May 2008 16:27:03 +0200
Message-ID: <20080529142703.GJ8086@duo.random>
References: <48318E64.8090706@qumranet.com>
 <4832DDEB.4000100@qumranet.com>
 <4835EEF5.9010600@cisco.com>
 <483D391F.7050007@qumranet.com>
 <483D6898.2050605@cisco.com>
 <20080528144850.GX27375@duo.random>
 <483D7C45.5020300@qumranet.com>
 <483D7D8D.3030309@cisco.com>
 <20080528170410.GC8086@duo.random>
 <483E7EE2.8010508@qumranet.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: "David S. Ahern", kvm@vger.kernel.org
To: Avi Kivity
Return-path:
Received: from host36-195-149-62.serverdedicati.aruba.it
 ([62.149.195.36]:58789 "EHLO mx.cpushare.com" rhost-flags-OK-OK-OK-OK)
 by vger.kernel.org with ESMTP id S1752982AbYE2O1H (ORCPT);
 Thu, 29 May 2008 10:27:07 -0400
Content-Disposition: inline
In-Reply-To: <483E7EE2.8010508@qumranet.com>
Sender: kvm-owner@vger.kernel.org
List-ID:

On Thu, May 29, 2008 at 01:01:06PM +0300, Avi Kivity wrote:
> No, two:
>
> static inline void set_pte(pte_t *ptep, pte_t pte)
> {
> 	ptep->pte_high = pte.pte_high;
> 	smp_wmb();
> 	ptep->pte_low = pte.pte_low;
> }

Right, so that's two writes or one depending on PAE vs non-PAE; other
2.4 enterprise distros with pte-highmem ship non-PAE kernels by
default.

>>>> - if these accesses trigger flooding, we will have to tear down the
>>>> shadow for this page, only to set it up again soon
>>>>
>>
>> So the shadow mapping the fixmap area would be torn down by the
>> flooding.
>>
>
> Before we started patching this, yes.

Ok, so now the one/two writes to the guest fixmap virtual address are
emulated and the spte isn't torn down.

>> Or is it the shadow corresponding to the real user pte pointed to by
>> the fixmap that is unshadowed by the flooding, or both/all?
>>
>
> After we started patching this, no, but with per-page-pte-history, yes
> (correctly).
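To make the write count concrete: under shadow paging, each store to a
write-protected guest page table traps, so the PAE set_pte quoted above
costs two emulations where a non-PAE set_pte costs one. Here is a toy
model of that cost; the types and the trap counter are made up for
illustration, this is neither kernel nor kvm code:

```c
#include <stdint.h>

/* Toy pte layouts: PAE splits the 64-bit pte into two 32-bit halves,
 * non-PAE is a single 32-bit word. */
typedef struct { uint32_t pte_low, pte_high; } pae_pte_t;
typedef struct { uint32_t pte_low; } nonpae_pte_t;

/* Hypothetical counter: each store to a shadowed (write-protected)
 * page table would trap and be emulated. */
static int guest_writes;

static void set_pte_pae(pae_pte_t *ptep, pae_pte_t pte)
{
    ptep->pte_high = pte.pte_high;  /* first trap */
    /* the real kernel has smp_wmb() here: the low (present) half is
     * written last so a racing walker never sees a present pte with a
     * stale high half */
    ptep->pte_low = pte.pte_low;    /* second trap */
    guest_writes += 2;
}

static void set_pte_nonpae(nonpae_pte_t *ptep, nonpae_pte_t pte)
{
    *ptep = pte;                    /* single 32-bit store: one trap */
    guest_writes += 1;
}

/* How many emulated writes does one kmap pte update cost? */
static int traps_per_kmap(int pae)
{
    guest_writes = 0;
    if (pae) {
        pae_pte_t p = {0, 0}, v = { 0x1025, 0x1 };
        set_pte_pae(&p, v);
    } else {
        nonpae_pte_t p = {0}, v = { 0x1025 };
        set_pte_nonpae(&p, v);
    }
    return guest_writes;
}
```

The ordering constraint is what keeps the two PAE halves as separate
stores, so the per-kmap trap count can't be collapsed to one on PAE.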
So with per-page-pte-history, the shadow representing the guest user
pte that is being modified by page_referenced is unshadowed.

>>>> - an access to the pte (emulated)
>>>>
>>
>> Here I count the second write. This isn't done on the fixmap area
>> like the first write above; it's a write to the real user pte,
>> pointed to by the fixmap. So if this is emulated it means the shadow
>> of the user pte pointing to the real data page is still active.
>>
>
> Right. But if we are scanning a page table linearly, it should be
> unshadowed.

I think we're often not scanning page tables linearly with pte_chains,
yet those ptes should still be unshadowed. mmaps won't always bring
memory in linearly, and memory isn't always initialized by memset or
paged in with contiguous virtual accesses. So while the assumption that
following the active list will sometimes return guest ptes mapping
contiguous guest virtual addresses is valid, it only accounts for a
small percentage of the active list, and it largely depends on the
userland apps. Furthermore, even if the active lru initially points to
linear ptes, the list is then split into age buckets depending on the
access patterns at runtime, which further fragments the linearity of
the virtual addresses of the kmapped ptes.

BTW, one thing we didn't account for in previous emails is that there
can be more than one guest user pte modified by page_referenced, if
it's not a direct page. And non-direct pages surely won't provide
linear scans; in fact, for non-direct pages the most common case is
that the pte_t will point to the same virtual address but through a
different pgd_t * (and in turn a different pmd_t).

>>>> - if this access _doesn't_ trigger flooding, we will have 512 unneeded
>>>> emulations. The pte is worthless anyway since the accessed bit is clear
>>>> (so we can't set up a shadow pte for it)
>>>> - this bug was fixed
>>>>
>>
>> You mean the accessed bit on the fixmap pte used by kmap?
>> Or the user pte pointed to by the fixmap pte?
>>
>
> The user pte. After guest code runs test_and_clear_bit(accessed_bit,
> ptep), we can't shadow that pte (all shadowed ptes must have the accessed
> bit set in the corresponding guest pte, similar to how a tlb entry can only
> exist if the accessed bit is set).

Is this a software invariant to ensure that we'll refresh the accessed
bit on the user pte too? I assume this is needed because otherwise, if
we just cleared the accessed bit on the shadow pte as well as on the
user pte, then when the shadow is mapped in the TLB again the accessed
bit would be set on the shadow in hardware, but not on the user pte,
since the accessed bit gets set on the spte without a kvm page fault.

So this means kscand, by clearing the accessed bitflag on them, should
automatically unshadow all the user ptes pointed to by the fixmap pte.
A second test_and_clear_bit on the same user pte will then run through
the fixmap pte established by kmap_atomic without traps.

This also means that when the user program runs again, it'll find the
user pte unshadowed and it'll have to re-instantiate the shadow pte
with a kvm page fault, whose primary objective is to mark the user pte
accessed again (to notify the next kscand pass that the data page
pointed to by the user pte was used in the meanwhile).

If I understand correctly, the establishment of the shadow pte
corresponding to the user pte will have to wrprotect the spte
corresponding to the fixmap pte, because we need to intercept
modifications to shadowed guest ptes, and the spte corresponding to the
fixmap guest pte is now pointing to a shadowed guest pte after the
program returns to running.
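The invariant under discussion (a shadow pte may exist only while the
guest pte's accessed bit is set, and clearing that bit must unshadow)
can be modeled as a toy in C. All names and structures here are
illustrative, not kvm's actual data structures:

```c
#include <stdbool.h>
#include <stdint.h>

#define PTE_ACCESSED 0x20u   /* x86 accessed bit (bit 5) */

/* Toy guest pte plus a flag for whether a shadow pte exists for it. */
struct gpte {
    uint32_t val;
    bool shadowed;
};

/* Shadowing is only allowed when the accessed bit is already set,
 * mirroring how a real tlb entry implies the accessed bit. */
static bool try_shadow(struct gpte *p)
{
    if (!(p->val & PTE_ACCESSED))
        return false;
    p->shadowed = true;
    return true;
}

/* kvm page fault path: mark the guest pte accessed, then shadow it. */
static void kvm_page_fault(struct gpte *p)
{
    p->val |= PTE_ACCESSED;
    try_shadow(p);
}

/* kscand clearing the accessed bit: the clear must also drop the
 * shadow, otherwise later hardware accesses through the spte would
 * never propagate the accessed bit back to the guest pte. */
static bool test_and_clear_accessed(struct gpte *p)
{
    bool was_set = (p->val & PTE_ACCESSED) != 0;
    p->val &= ~PTE_ACCESSED;
    p->shadowed = false;
    return was_set;
}
```

A typical round trip is: page fault shadows the pte and sets the
accessed bit; kscand's clear unshadows it; the pte can't be reshadowed
until the next fault sets the accessed bit again.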
Then when kscand runs again, for the pages that have been faulted in by
the user program, we'll trap the test_and_clear_bit happening through
the readonly spte corresponding to the fixmap guest pte, we'll unshadow
the spte of the guest user pte again, and we'll mark the spte
corresponding to the fixmap pte read-write again, because the
test_and_clear_bit tells us that we have to unshadow instead of
emulating.

>>>> - an access to tear down the kmap
>>>>
>>
>> Yep, pte_clear on the fixmap pte_t followed by an invlpg (if that
>> matters).
>>
>
> Looking at the code, that only happens if CONFIG_HIGHMEM_DEBUG is set.

On 2.4, yes. 2.6 always behaves like 2.4 with CONFIG_HIGHMEM_DEBUG set:
2.4 without HIGHMEM_DEBUG sets the pte and does the invlpg in
kmap_atomic and does nothing in kunmap_atomic, while 2.6 sets the pte
in kmap_atomic and clears it plus invlpg in kunmap_atomic.

>> I think what we should aim for is to quickly reach this condition:
>>
>> 1) always keep the fixmap/kmap pte_t shadowed and emulate the
>> kmap/kunmap access, so the test_and_clear_young done on the user pte
>> doesn't require re-establishing the spte representing the fixmap
>> virtual address. If we don't emulate fixmap accesses we'll have to
>> re-establish the spte during the write to the user pte and tear it
>> down again during kunmap_atomic. So there's not much doubt fixmap
>> access emulation is worth it.
>>
>
> That is what is done by current HEAD.
> 418c6952ba9fd379059ed325ea5a3efe904fb7fd is responsible.

Cool!

> Note that there is an alternative: allow the kmap pte to be
> unshadowed, and instead emulate the access through that pte (i.e.
> emulate the btc instruction). I don't think it's worth it though
> because it hurts other users of the fixmap page.
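The trap/unshadow/restore cycle described at the top of this section
can be sketched as a two-flag state machine; this is purely an
illustrative model with made-up names, not kvm code:

```c
#include <stdbool.h>

/* The spte mapping the fixmap page is write-protected exactly while
 * the guest user pte it points at is shadowed. */
struct kmap_state {
    bool user_pte_shadowed;
    bool fixmap_spte_writable;
};

/* Guest page fault on the user mapping: re-shadow the user pte and
 * write-protect the fixmap spte so pte modifications trap. */
static void shadow_user_pte(struct kmap_state *s)
{
    s->user_pte_shadowed = true;
    s->fixmap_spte_writable = false;
}

/* kscand's test_and_clear_bit arrives through the read-only fixmap
 * spte: unshadow the user pte and make the fixmap spte writable again,
 * so the rest of the scan pass runs natively. */
static void on_accessed_clear_trap(struct kmap_state *s)
{
    s->user_pte_shadowed = false;
    s->fixmap_spte_writable = true;
}

/* Does a write to the user pte through the fixmap trap right now? */
static bool scan_write_traps(const struct kmap_state *s)
{
    return !s->fixmap_spte_writable;
}
```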
>> 2) get rid of the user pte shadow mapping pointing to the user data,
>> so the test_and_clear of the young bitflag on the user pte will not
>> be emulated and it'll run at full CPU speed through the shadow pte
>> mapping corresponding to the fixmap virtual address
>>
>
> That's what per-page-pte-history is supposed to do. The first few
> accesses are emulated, the next ones will be native.

Why not go native immediately when we notice a test_and_clear of the
accessed bit? First, the ptes won't be in contiguous virtual address
order, so if the flooding of the sptes corresponding to the guest user
ptes depends on the gpas of the guest user ptes being contiguous, it
won't work well. But more importantly, we've found a test_and_clear_bit
of the accessed bitflag, so we should unshadow the user pte that is
being marked "old" immediately, without needing to detect any flooding.

> It's still not full speed as the kmap setup has to be emulated
> (twice).

Agreed. The 1/2/3 emulations on writes to the fixmap area during
kmap_atomic (one or two for non-PAE/PAE, plus one further pte_clear on
2.6 or on 2.4 with debug-highmem) seem unavoidable. But the
test_and_clear_bit write-protect fault (when the guest user pte is
shadowed) should just unshadow the guest user pte, mark the spte
representing the fixmap pte writeable, and return immediately to guest
mode to actually run test_and_clear_bit natively, without performing
the write through emulation. Noticing the test_and_clear_bit also
requires a bit of instruction "detection", but once we've detected it
from the eip address, we don't have to write anything to the guest. But
I guess I'm missing something...

> One possible optimization is that if we see the first part of the
> kmap instantiation, we emulate a few more instructions before
> returning to the guest. Xen does this IIRC.

Surely this would avoid one wrprotect fault per kmap_atomic, but I'm
not sure 32bit PAE is important enough to justify it.
Most 32bit enterprise kernels I've worked with aren't compiled with
PAE; only the one called bigsmp is. Also, on 2.6 we could get the same
benefit by making 2.6 at least as optimal as 2.4: never clear the
fixmap pte, and do the invlpg only after setting it to a new value. Xen
can't optimize away that write in kunmap_atomic. 2.6 has the debug
behaviour enabled by default for no good reason, so that would be the
first optimization to do, as it also saves a few cycles per
kunmap_atomic on the host.

> I'm no longer sure the access pattern is sequential, since I see
> kmap_atomic() will not recreate the pte if its value has not changed
> (unless HIGHMEM_DEBUG).

Hmm, kmap_atomic always writes a new value to the fixmap pte, even if
it was mapping the same user pte as before:

	static inline void *kmap_atomic(struct page *page, enum km_type type)
	{
		enum fixed_addresses idx;
		unsigned long vaddr;

		if (page < highmem_start_page)
			return page_address(page);

		idx = type + KM_TYPE_NR*smp_processor_id();
		vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
	#if HIGHMEM_DEBUG
		if (!pte_none(*(kmap_pte-idx)))
			out_of_line_bug();
	#endif
		set_pte(kmap_pte-idx, mk_pte(page, kmap_prot));
		__flush_tlb_one(vaddr);

		return (void*) vaddr;
	}

2.6 does too, because it does the debug pte_clear in kunmap_atomic. In
theory even the host could do pte_same() and avoid an invlpg if the pte
didn't change, but I'm unsure how frequently we remap the same page:
pte loops like mprotect map the 4k page of ptes once and then loop over
it through the fixmap virtual address, so frequent repetitions of
remapping the same page with kmap_atomic sound unlikely.
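For what it's worth, the host-side pte_same() idea could look roughly
like the following. This is only a sketch of the proposal with toy
types and a made-up cost counter, not existing 2.4/2.6 code:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t pte_t;             /* toy pte */

static int invlpg_count;            /* toy cost counter for tlb flushes */

static bool pte_same(pte_t a, pte_t b)
{
    return a == b;
}

/* Proposed optimization: only rewrite the fixmap pte and flush its tlb
 * entry when the new mapping actually differs from the current one. */
static void kmap_atomic_opt(pte_t *kmap_pte, pte_t newpte)
{
    if (pte_same(*kmap_pte, newpte))
        return;                     /* same page mapped: skip the invlpg */
    *kmap_pte = newpte;
    invlpg_count++;                 /* stands in for __flush_tlb_one(vaddr) */
}
```

Whether this pays off depends entirely on how often the same page is
remapped back-to-back in the same fixmap slot, which, as argued above,
is probably rare for pte loops like mprotect.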