From mboxrd@z Thu Jan 1 00:00:00 1970
From: Avi Kivity
Subject: Re: [patch] kvm with mmu notifier v18
Date: Fri, 06 Jun 2008 23:09:55 +0300
Message-ID: <48499993.20409@qumranet.com>
References: <20080605002626.GA15502@duo.random>
 <48480C36.6050309@qumranet.com>
 <20080605164717.GH15502@duo.random>
 <4848FA13.6040204@qumranet.com>
 <20080606125019.GN15502@duo.random>
 <484967DC.901@qumranet.com>
 <20080606173752.GA8010@duo.random>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: kvm@vger.kernel.org
To: Andrea Arcangeli
Return-path:
Received: from il.qumranet.com ([212.179.150.194]:42889 "EHLO il.qumranet.com"
 rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756422AbYFFUJv
 (ORCPT ); Fri, 6 Jun 2008 16:09:51 -0400
In-Reply-To: <20080606173752.GA8010@duo.random>
Sender: kvm-owner@vger.kernel.org
List-ID:

Andrea Arcangeli wrote:
>> memslot and set it for slots which do not have aliases. That makes the
>> loop terminate soon again.
>>
>
> So we need to make sure that aliases (gart-like) aren't common, or if
> we have an alias on ram we go back to scanning the whole list all the
> time.
>

They aren't.  The two cases are the VGA window at 0xa0000-0xc0000 (the
motivation for aliasing support; before we had that, Windows would take
several seconds to clear the screen when switching to graphics mode) and
the BIOS at 0xe0000 (which ought to be aliased to 0xfffe0000, but isn't).
Both are very rarely used.

>> Any pointer-based data structure is bound to be much slower than a list
>> with such a small number of elements.
>>
>
> A tree can only be slower than a list if there is the bitflag to signal
> there is no alias, so the list will always break out of the loop at the
> first step. If you remove that bitflag, tree lookup can't be slower than
> walking the entire list, even if there are only 3/4 elements queued.
> Only the no_alias bitflag allows the list to be faster.
>

No.  List cost is C1*N, while tree cost is C2*log(N).  If N is small and
C2/C1 is sufficiently large, the list wins.  Modern processors will
increase C2/C1, since for trees there are data dependencies which reduce
parallelism.  For a tree, you need to chase pointers (and the processor
doesn't know which data to fetch until it loads the pointer), and the
pointer also depends on the result of the comparison, further reducing
speculation.

>> btw, on 64-bit userspace we can arrange the entire physical address
>> space in a contiguous region (some of it unmapped) and have just one
>> slot for the guest.
>>
>
> I thought mmio regions would need to be separated for 64-bit too? I
> mean, what's the point of the memslots in the first place if there's
> only one for the whole physical address space?
>

Right.  Scratch that.

>> Okay.  It's sad, but I don't see any choice.
>>
>> If anyone from Intel is listening, please give us an accessed bit (and
>> a dirty bit, too).
>>
>
> Seconded.
>
> The other thing we could do would be to mark the spte invalid and
> return 1, and then at the second ->clear_test_young, if the spte is
> still invalid, return 0. That way we would limit the accessed-bit
> refresh to a kvm page fault without tearing down the Linux pte (so
> follow_page would be enough then). Whereas if we return 0 and the
> Linux pte is already old, the page will be unmapped and go into the
> swapcache; follow_page won't be enough and get_user_pages will have to
> take a do_swap_page minor fault.
>
> However, to do the above we would need to track non-present sptes with
> rmap, and that would require changes to the kvm rmap logic, so
> initially returning 0 is simpler.
>

This may be a good compromise.  We'll have to measure and see.  Perhaps
we can use a few bits in the spte to keep a counter of accessed-bit
checks, and only return 0 after it has been incremented a few times.

--
Do not meddle in the internals of kernels, for they are subtle and
quick to panic.