Re: [patch] kvm with mmu notifier v18

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Avi Kivity <avi@qumranet.com>
To: Andrea Arcangeli <andrea@qumranet.com>
Cc: kvm@vger.kernel.org
Subject: Re: [patch] kvm with mmu notifier v18
Date: Fri, 06 Jun 2008 19:37:48 +0300	[thread overview]
Message-ID: <484967DC.901@qumranet.com> (raw)
In-Reply-To: <20080606125019.GN15502@duo.random>

Andrea Arcangeli wrote:
> On Fri, Jun 06, 2008 at 11:49:23AM +0300, Avi Kivity wrote:
>   
>> Andrea Arcangeli wrote:
>>     
>>>> One day we want to sort the slots according to size.  We'll need better 
>>>> locking then (rcu, likely).
>>>>     
>>>>         
>>> I think it's more interesting to sort them on their start/end gfn
>>> address, prio_tree would be the optimal structure for me to use in the
>>> mmu notifier invalidates as I could ask the prio tree to show me in
>>> O(log(N)) (N number of slots) all the slots that overlap with the
>>> invalidated start/end range. However I'm afraid there aren't enough
>>> slots for this to matter... but OTOH there aren't frequent
>>> modifications either, so it may be a microoptimization (if there were
>>> frequent modifications with such a small number of slots it likely
>>> would be slower than a list).
>>>   
>>>       
>> There are only two interesting slots, 1G-end_of_mem and 0-pci_hole.  A 
>> sorted array gives an average of less than two lookups to find the slot, 
>> with the smallest possible constant.  Any other algorithm will give more 
>> lookups with a far higher constant.
>>
>> Sometimes linear search is the best algorithm.
>>     
>
> But we can't break the loop after the first match! I think you're not
> taking into account the aliasing done with overlapping gfn in the
> memslots (which is not forbidden apparently and you planned to
> obsolete the explicit aliasing logic with it)

For the gfn->hva (the common case) we can break immediately.  For the 
hva->gfn, in general we cannot, but we can add an "unaliased" flag to 
the memslot and set it for slots which do not have aliases.  That makes 
the loop terminate soon again.

> . Only prio_tree can
> reduce the number of walks in the most optimal way because of the gfn
> overlaps and search by start-end address. Sorting by size won't be
> useful because we can't break the loop after the first found match. If
> prio tree is too heavyweight the next best would be sorting with a
> single-index tree like rbtree by start address as it'll at least
> eliminate all the memslots that have a start address higher than the
> end of the invalidate range. But prio_tree will eliminate them all in
> O(log(N)) so on the paper prio_tree is best (but perhaps in practice
> rbtree is better, dunno).
>   

Any pointer-based data structure is bound to be much slower than a list 
with such a small number of elements.

btw, on 64-bit userspace we can arrange the entire physical address 
space in a contiguous region (some of it unmapped) and have just one 
slot for the guest.

>> I forgot, last round:  EPT doesn't have PT_ACCESSED_MASK, so you're reading 
>> (and clearing) some random bit.  We can easily qualify the test by looking 
>> at shadow_accessed_mask; if zero, the hardware doesn't have a young bit.
>>
>> There's a bigger question of what to do in the case of EPT.  We can always 
>> return true, or return true and also unmap.  The first means we lie to the 
>> vm, the second means we take unnecessary exits to maintain age information.
>>     
>
> If there's no access bit (and we can't use the guest pte access bit
> because EPT represent a physical page and not a guest pte anymore)
> we're forced to return 0 without doing anything at all. Returning 1 is
> unsafe as it'd pin the page in host ram.
>
> If there's no access bit all we can do is to wait the VM to unmap the
> page, mark the spte not present during the invalidate_page mmu
> notifier, and wait the guest to trigger a minor fault from swapcache.
>   

Okay.  It's sad, but I don't see any choice.

If anyone from Intel is listening, please give us an accessed bit (and a 
dirty bit, too).

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

next prev parent reply	other threads:[~2008-06-06 16:37 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2008-06-05  0:26 [patch] kvm with mmu notifier v18 Andrea Arcangeli
2008-06-05 15:54 ` Avi Kivity
2008-06-05 16:47   ` Andrea Arcangeli
2008-06-06  8:49     ` Avi Kivity
2008-06-06 12:50       ` Andrea Arcangeli
2008-06-06 16:37         ` Avi Kivity [this message]
2008-06-06 17:37           ` Andrea Arcangeli
2008-06-06 20:09             ` Avi Kivity
2008-06-10 20:41     ` Marcelo Tosatti
2008-06-12  1:33       ` Andrea Arcangeli

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=484967DC.901@qumranet.com \
    --to=avi@qumranet.com \
    --cc=andrea@qumranet.com \
    --cc=kvm@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.