From: Avi Kivity
Subject: Re: [patch] kvm with mmu notifier v18
Date: Fri, 06 Jun 2008 19:37:48 +0300
Message-ID: <484967DC.901@qumranet.com>
In-Reply-To: <20080606125019.GN15502@duo.random>
References: <20080605002626.GA15502@duo.random> <48480C36.6050309@qumranet.com> <20080605164717.GH15502@duo.random> <4848FA13.6040204@qumranet.com> <20080606125019.GN15502@duo.random>
To: Andrea Arcangeli
Cc: kvm@vger.kernel.org

Andrea Arcangeli wrote:
> On Fri, Jun 06, 2008 at 11:49:23AM +0300, Avi Kivity wrote:
>
>> Andrea Arcangeli wrote:
>>
>>>> One day we want to sort the slots according to size.  We'll need
>>>> better locking then (rcu, likely).
>>>>
>>> I think it's more interesting to sort them by their start/end gfn
>>> address; a prio_tree would be the optimal structure for me to use in
>>> the mmu notifier invalidates, as I could ask the prio tree to show
>>> me, in O(log(N)) (N = number of slots), all the slots that overlap
>>> with the invalidated start/end range.  However, I'm afraid there
>>> aren't enough slots for this to matter... but OTOH there aren't
>>> frequent modifications either, so it may be a micro-optimization
>>> (if there were frequent modifications with such a small number of
>>> slots, it would likely be slower than a list).
>>>
>> There are only two interesting slots, 1G-end_of_mem and 0-pci_hole.
>> A sorted array gives an average of less than two lookups to find the
>> slot, with the smallest possible constant.  Any other algorithm will
>> give more lookups with a far higher constant.
>>
>> Sometimes linear search is the best algorithm.
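The sorted-array linear lookup argued for above can be sketched as follows.  This is a minimal stand-alone model, not the real KVM code: `struct kvm_memory_slot` is a simplified stand-in, and `gfn_to_slot` is an illustrative name.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long gfn_t;

/* Simplified stand-in for a KVM memslot. */
struct kvm_memory_slot {
    gfn_t base_gfn;
    unsigned long npages;
};

/* Linear scan over the slot array.  With the two large slots
 * (0-pci_hole and 1G-end_of_mem) first in the array, the right slot
 * is found in fewer than two probes on average, with the smallest
 * possible constant per probe. */
static struct kvm_memory_slot *
gfn_to_slot(struct kvm_memory_slot *slots, size_t nslots, gfn_t gfn)
{
    size_t i;

    for (i = 0; i < nslots; ++i)
        if (gfn >= slots[i].base_gfn &&
            gfn < slots[i].base_gfn + slots[i].npages)
            return &slots[i];
    return NULL;
}
```

For the gfn->hva direction the first match is the only match, so the early return is always correct.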
>
> But we can't break the loop after the first match!  I think you're
> not taking into account the aliasing done with overlapping gfns in
> the memslots (which is apparently not forbidden, and which you
> planned to use to obsolete the explicit aliasing logic).

For gfn->hva (the common case) we can break immediately.  For
hva->gfn, in general we cannot, but we can add an "unaliased" flag to
the memslot and set it for slots which do not have aliases.  That lets
the loop terminate early again.

> Only a prio_tree can reduce the number of walks in the most optimal
> way, because of the gfn overlaps and the search by start/end address.
> Sorting by size won't be useful, because we can't break the loop
> after the first found match.  If a prio tree is too heavyweight, the
> next best would be a single-index tree like an rbtree, keyed by start
> address, as it will at least eliminate all the memslots whose start
> address is higher than the end of the invalidated range.  But a
> prio_tree will eliminate them all in O(log(N)), so on paper prio_tree
> is best (though perhaps in practice rbtree is better, dunno).

Any pointer-based data structure is bound to be much slower than a
list with such a small number of elements.

btw, on 64-bit userspace we can arrange the entire physical address
space in a contiguous region (some of it unmapped) and have just one
slot for the guest.

>> I forgot, last round: EPT doesn't have PT_ACCESSED_MASK, so you're
>> reading (and clearing) some random bit.  We can easily qualify the
>> test by looking at shadow_accessed_mask; if it is zero, the hardware
>> doesn't have a young bit.
>>
>> There's a bigger question of what to do in the case of EPT.  We can
>> always return true, or return true and also unmap.  The first means
>> we lie to the vm; the second means we take unnecessary exits to
>> maintain age information.
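The "unaliased" early-exit idea for the hva->gfn direction could look roughly like this.  Everything here is hypothetical, not the real KVM code: the `slot` struct, the `unaliased` field, the `hva_to_gfns` helper, and a 4096-byte page size are all assumptions made for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL  /* assumed page size for this sketch */

/* Simplified stand-in for a memslot; "unaliased" is the hypothetical
 * flag discussed above, set when no other slot maps the same hvas. */
struct slot {
    unsigned long userspace_addr;   /* hva of the slot's start */
    unsigned long npages;
    unsigned long base_gfn;
    bool unaliased;
};

/* hva->gfn: in general every slot must be checked, because aliasing
 * means one hva can back several gfns.  A match on an unaliased slot
 * lets the loop terminate early again.  Returns the number of gfns
 * found. */
static size_t hva_to_gfns(struct slot *slots, size_t n,
                          unsigned long hva,
                          unsigned long *gfns, size_t max)
{
    size_t i, found = 0;

    for (i = 0; i < n && found < max; ++i) {
        unsigned long start = slots[i].userspace_addr;
        unsigned long end = start + slots[i].npages * SKETCH_PAGE_SIZE;

        if (hva >= start && hva < end) {
            gfns[found++] = slots[i].base_gfn +
                            (hva - start) / SKETCH_PAGE_SIZE;
            if (slots[i].unaliased)
                break;          /* no other slot can map this hva */
        }
    }
    return found;
}
```

For a typical guest where only the aliased VGA/BIOS ranges overlap, almost every lookup would hit an unaliased slot and stop at the first match.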
>
> If there's no access bit (and we can't use the guest pte access bit,
> because an EPT entry represents a physical page and not a guest pte
> anymore), we're forced to return 0 without doing anything at all.
> Returning 1 is unsafe, as it would pin the page in host RAM.
>
> If there's no access bit, all we can do is wait for the VM to unmap
> the page, mark the spte not present during the invalidate_page mmu
> notifier, and wait for the guest to trigger a minor fault from
> swapcache.

Okay.  It's sad, but I don't see any choice.

If anyone from Intel is listening, please give us an accessed bit (and
a dirty bit, too).

-- 
Do not meddle in the internals of kernels, for they are subtle and
quick to panic.
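The conclusion above can be condensed into a small model of the young-bit test: qualify it on shadow_accessed_mask, which is zero under EPT.  This is an illustrative sketch of the idea, not the real KVM aging code; `spte_test_and_clear_young` is an invented name.

```c
#include <assert.h>

typedef unsigned long long u64;

/* Set at hardware setup; zero under EPT, which (in this era) has no
 * hardware accessed bit. */
static u64 shadow_accessed_mask;

/* "Has this page been touched since we last asked?"  Under EPT there
 * is no bit to consult, and returning 1 unconditionally would pin the
 * page in host RAM, so we must return 0 and let reclaim proceed. */
static int spte_test_and_clear_young(u64 *sptep)
{
    if (!shadow_accessed_mask)
        return 0;               /* EPT: no hardware young bit */
    if (*sptep & shadow_accessed_mask) {
        *sptep &= ~shadow_accessed_mask;  /* clear so the next call
                                           * sees fresh accesses */
        return 1;
    }
    return 0;
}
```

The swapped-out page then comes back via a minor fault from swapcache after invalidate_page has marked the spte not present, exactly as described in the quoted text.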