From mboxrd@z Thu Jan  1 00:00:00 1970
From: Avi Kivity <avi@qumranet.com>
Subject: Re: [kvm-devel] performance with guests running 2.4 kernels	(specifically
 RHEL3)
Date: Sat, 31 May 2008 10:39:17 +0300
Message-ID: <484100A5.2070704@qumranet.com>
References: <4835EEF5.9010600@cisco.com> <483D391F.7050007@qumranet.com> <483D6898.2050605@cisco.com> <20080528144850.GX27375@duo.random> <483D7C45.5020300@qumranet.com> <483D7D8D.3030309@cisco.com> <20080528170410.GC8086@duo.random> <483E7EE2.8010508@qumranet.com> <20080529142703.GJ8086@duo.random> <483EC8E7.4010501@qumranet.com> <20080530131238.GB3118@duo.random>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: "David S. Ahern" <daahern@cisco.com>, kvm@vger.kernel.org
To: Andrea Arcangeli <andrea@qumranet.com>
Return-path: <kvm-owner@vger.kernel.org>
Received: from bzq-179-150-194.static.bezeqint.net ([212.179.150.194]:31470
	"EHLO il.qumranet.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1750710AbYEaHjS (ORCPT <rfc822;kvm@vger.kernel.org>);
	Sat, 31 May 2008 03:39:18 -0400
In-Reply-To: <20080530131238.GB3118@duo.random>
Sender: kvm-owner@vger.kernel.org
List-ID: <kvm.vger.kernel.org>

Andrea Arcangeli wrote:
> On Thu, May 29, 2008 at 06:16:55PM +0300, Avi Kivity wrote:
>   
>> Yes.  We need a fault in order to set the guest accessed bit.
>>     
>
> So what I'm missing now is how the spte corresponding to the user pte
> that is under test_and_clear to clear the accessed bit, will not be
> zapped immediately. If we don't zap it immediately, how do we set the
> accessed bit again on the user pte, when the user program returned
> running and used that shadow pte to access the program data after the
> kscand pass?
>
>   

The spte is zapped unconditionally in kvm_mmu_pte_write(), and not 
re-established in mmu_pte_write_new_pte() due to the missing accessed bit.

The question is whether to tear down the shadow page it is contained in, 
or not.

> Or am I missing something?
>
>   
>> Unshadowing a page is expensive, both in immediate cost, and in future cost 
>> of reshadowing the page and taking faults.  It's worthwhile to be sure the 
>> guest really doesn't want it as a page table.
>>     
>
> Ok that makes sense, but can we defer the unshadowing while still
> emulating the accessed bit correctly on the user pte?
>
>   

We do, unless there's a bad bug somewhere.

>> If the pages are not scanned linearly, then unshadowing may not help.
>>     
>
> It should help the second time kscand runs, for the user ptes that
> aren't shadowed anymore, the second pass won't require any emulation
> for test_and_bit because the spte of the fixmap area will be
> read-write. The bug that passes the anonymous pages number instead of
> the cache number will lead to many more test_and_clear than needed,
> and not all user ptes may be used in between two different kscand passes.
>
>   

We still need 3 emulations per pte to set the fixmap entry.  Unshadowing 
saves one emulation on the pte itself.


>> Let's see, 1G of highmem is 250,000 pages, mapped by 500 page tables.  
>>     
>
> There are likely 1500 ptes in highmem. (ram isn't the most important factor)
>
>   

I use 'pte' in the Intel manual sense (page table entry), not the Linux 
sense (page table).

I gave these numbers to show the worst-case behavior.

Non-highmem:

   - with unshadow: O(500) accesses to unshadow the page tables, then 
native speed
   - without unshadow: O(250000) accesses to modify the ptes

Highmem:
   - with unshadow: O(250000) accesses to update the fixmap entry
   - without unshadow: O(250000) accesses to update the fixmap entry and to 
modify the ptes
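
As a rough illustration of the four cases above, here is a minimal sketch that plugs in the round numbers from this thread (250,000 ptes, 500 page tables) together with the per-pte counts mentioned earlier in this message (3 emulations for the fixmap entry, 1 for the pte itself). The function and constant names are made up for this example, and charging exactly one access per page table for unshadowing is an assumption; only the orders of magnitude are meaningful.

```python
# Back-of-the-envelope model of trapped accesses per kscand pass.
# All names here are hypothetical; the numbers are the thread's round figures.
PTES = 250_000                  # ptes covering 1G, as quoted in the thread
PAGE_TABLES = 500               # page tables mapping them, as quoted
FIXMAP_EMULATIONS_PER_PTE = 3   # "3 emulations per pte to set the fixmap entry"
PTE_EMULATIONS_PER_PTE = 1      # the emulated write to the pte itself


def trapped_accesses(highmem: bool, unshadow: bool) -> int:
    """Return the worst-case number of trapped accesses for one scan."""
    if not highmem:
        # Assumption: one trapped access per page table is enough to unshadow it.
        return PAGE_TABLES if unshadow else PTES * PTE_EMULATIONS_PER_PTE
    fixmap_cost = PTES * FIXMAP_EMULATIONS_PER_PTE
    if unshadow:
        # The pte writes run natively, but the fixmap updates still trap.
        return fixmap_cost
    return fixmap_cost + PTES * PTE_EMULATIONS_PER_PTE


if __name__ == "__main__":
    for highmem in (False, True):
        for unshadow in (True, False):
            label = ("highmem" if highmem else "non-highmem",
                     "with unshadow" if unshadow else "without unshadow")
            print(label, trapped_accesses(highmem, unshadow))
```

Under these assumptions, unshadowing cuts the non-highmem cost by a factor of 500, while in the highmem case the fixmap updates dominate either way.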
 

>> Well, then after 4000 scans we ought to have unshadowed everything.  So I 
>> guess per-page-pte-history is broken, can't explain it otherwise.
>>     
>
> Yes, we should have unshadowed all user ptes after 4000 scans and then
> the test_and_clear shouldn't require any more emulation, there will be
> only 3 emulations for each kmap_atomic/kunmap_atomic.
>
>   

So we save 25% (3 emulations per pte instead of 4).  It's still bad even 
if everything is working correctly.
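
The 25% figure follows from the per-pte counts given earlier in this message: 3 emulations for the fixmap entry plus 1 for the pte itself when the page table is still shadowed, versus only the 3 fixmap emulations once it is unshadowed. A small sketch of that arithmetic, with a hypothetical helper name:

```python
def emulations_per_pte(unshadowed: bool, fixmap_emulations: int = 3) -> int:
    """Emulated instructions per scanned pte (counts taken from this thread)."""
    pte_emulations = 0 if unshadowed else 1  # the write to the pte itself
    return fixmap_emulations + pte_emulations


before = emulations_per_pte(unshadowed=False)
after = emulations_per_pte(unshadowed=True)
saving = (before - after) / before
print(f"{before} -> {after} emulations per pte, saving {saving:.0%}")
```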

>
> I think it should be clear that by now, we're trying to be
> bug-compatible like the host here, and optimizing for 2.6 kmaps.
>   

Don't understand.


I'm guessing ESX gets its good performance by special-casing something.  
For example, they can keep the fixmap page never shadowed, always 
emulate accesses through the fixmap page, and recompile instructions 
that go through fixmap to issue a hypercall.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.