From: Avi Kivity
Subject: Re: [patch] kvm with mmu notifier v18
Date: Fri, 06 Jun 2008 19:37:48 +0300
Message-ID: <484967DC.901@qumranet.com>
In-Reply-To: <20080606125019.GN15502@duo.random>
References: <20080605002626.GA15502@duo.random> <48480C36.6050309@qumranet.com> <20080605164717.GH15502@duo.random> <4848FA13.6040204@qumranet.com> <20080606125019.GN15502@duo.random>
To: Andrea Arcangeli
Cc: kvm@vger.kernel.org

Andrea Arcangeli wrote:
> On Fri, Jun 06, 2008 at 11:49:23AM +0300, Avi Kivity wrote:
>
>> Andrea Arcangeli wrote:
>>
>>>> One day we want to sort the slots according to size.  We'll need
>>>> better locking then (rcu, likely).
>>>>
>>> I think it's more interesting to sort them by their start/end gfn
>>> address; a prio_tree would be the optimal structure for me to use in
>>> the mmu notifier invalidates, as I could ask the prio tree to show
>>> me, in O(log(N)) (N = number of slots), all the slots that overlap
>>> with the invalidated start/end range.  However, I'm afraid there
>>> aren't enough slots for this to matter... but OTOH there aren't
>>> frequent modifications either, so it may be a micro-optimization
>>> (if there were frequent modifications with such a small number of
>>> slots, it would likely be slower than a list).
>>>
>> There are only two interesting slots, 1G-end_of_mem and 0-pci_hole.
>> A sorted array gives an average of less than two lookups to find the
>> slot, with the smallest possible constant.  Any other algorithm will
>> give more lookups with a far higher constant.
>>
>> Sometimes linear search is the best algorithm.
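The sorted-array linear lookup argued for above can be sketched as follows.  This is a minimal stand-alone model, not the real KVM code: `struct kvm_memory_slot` is a simplified stand-in, and `gfn_to_slot` is an illustrative name.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long gfn_t;

/* Simplified stand-in for a KVM memslot. */
struct kvm_memory_slot {
    gfn_t base_gfn;
    unsigned long npages;
};

/* Linear scan over the slot array.  With the two large slots
 * (0-pci_hole and 1G-end_of_mem) first in the array, the right slot
 * is found in fewer than two probes on average, with the smallest
 * possible constant per probe. */
static struct kvm_memory_slot *
gfn_to_slot(struct kvm_memory_slot *slots, size_t nslots, gfn_t gfn)
{
    size_t i;

    for (i = 0; i < nslots; ++i)
        if (gfn >= slots[i].base_gfn &&
            gfn < slots[i].base_gfn + slots[i].npages)
            return &slots[i];
    return NULL;
}
```

For the gfn->hva direction the first match is the only match, so the early return is always correct.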
>
> But we can't break the loop after the first match!  I think you're
> not taking into account the aliasing done with overlapping gfns in
> the memslots (which is apparently not forbidden, and which you
> planned to use to obsolete the explicit aliasing logic).

For gfn->hva (the common case) we can break immediately.  For
hva->gfn, in general we cannot, but we can add an "unaliased" flag to
the memslot and set it for slots which do not have aliases.  That lets
the loop terminate early again.

> Only a prio_tree can reduce the number of walks in the most optimal
> way, because of the gfn overlaps and the search by start/end address.
> Sorting by size won't be useful, because we can't break the loop
> after the first found match.  If a prio tree is too heavyweight, the
> next best would be a single-index tree like an rbtree, keyed by start
> address, as it will at least eliminate all the memslots whose start
> address is higher than the end of the invalidated range.  But a
> prio_tree will eliminate them all in O(log(N)), so on paper prio_tree
> is best (though perhaps in practice rbtree is better, dunno).

Any pointer-based data structure is bound to be much slower than a
list with such a small number of elements.

btw, on 64-bit userspace we can arrange the entire physical address
space in a contiguous region (some of it unmapped) and have just one
slot for the guest.

>> I forgot, last round: EPT doesn't have PT_ACCESSED_MASK, so you're
>> reading (and clearing) some random bit.  We can easily qualify the
>> test by looking at shadow_accessed_mask; if it is zero, the hardware
>> doesn't have a young bit.
>>
>> There's a bigger question of what to do in the case of EPT.  We can
>> always return true, or return true and also unmap.  The first means
>> we lie to the vm; the second means we take unnecessary exits to
>> maintain age information.
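The "unaliased" early-exit idea for the hva->gfn direction could look roughly like this.  Everything here is hypothetical, not the real KVM code: the `slot` struct, the `unaliased` field, the `hva_to_gfns` helper, and a 4096-byte page size are all assumptions made for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL  /* assumed page size for this sketch */

/* Simplified stand-in for a memslot; "unaliased" is the hypothetical
 * flag discussed above, set when no other slot maps the same hvas. */
struct slot {
    unsigned long userspace_addr;   /* hva of the slot's start */
    unsigned long npages;
    unsigned long base_gfn;
    bool unaliased;
};

/* hva->gfn: in general every slot must be checked, because aliasing
 * means one hva can back several gfns.  A match on an unaliased slot
 * lets the loop terminate early again.  Returns the number of gfns
 * found. */
static size_t hva_to_gfns(struct slot *slots, size_t n,
                          unsigned long hva,
                          unsigned long *gfns, size_t max)
{
    size_t i, found = 0;

    for (i = 0; i < n && found < max; ++i) {
        unsigned long start = slots[i].userspace_addr;
        unsigned long end = start + slots[i].npages * SKETCH_PAGE_SIZE;

        if (hva >= start && hva < end) {
            gfns[found++] = slots[i].base_gfn +
                            (hva - start) / SKETCH_PAGE_SIZE;
            if (slots[i].unaliased)
                break;          /* no other slot can map this hva */
        }
    }
    return found;
}
```

For a typical guest where only the aliased VGA/BIOS ranges overlap, almost every lookup would hit an unaliased slot and stop at the first match.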
>
> If there's no access bit (and we can't use the guest pte access bit,
> because an EPT entry represents a physical page and not a guest pte
> anymore), we're forced to return 0 without doing anything at all.
> Returning 1 is unsafe, as it would pin the page in host RAM.
>
> If there's no access bit, all we can do is wait for the VM to unmap
> the page, mark the spte not present during the invalidate_page mmu
> notifier, and wait for the guest to trigger a minor fault from
> swapcache.

Okay.  It's sad, but I don't see any choice.

If anyone from Intel is listening, please give us an accessed bit (and
a dirty bit, too).

-- 
Do not meddle in the internals of kernels, for they are subtle and
quick to panic.
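The conclusion above can be condensed into a small model of the young-bit test: qualify it on shadow_accessed_mask, which is zero under EPT.  This is an illustrative sketch of the idea, not the real KVM aging code; `spte_test_and_clear_young` is an invented name.

```c
#include <assert.h>

typedef unsigned long long u64;

/* Set at hardware setup; zero under EPT, which (in this era) has no
 * hardware accessed bit. */
static u64 shadow_accessed_mask;

/* "Has this page been touched since we last asked?"  Under EPT there
 * is no bit to consult, and returning 1 unconditionally would pin the
 * page in host RAM, so we must return 0 and let reclaim proceed. */
static int spte_test_and_clear_young(u64 *sptep)
{
    if (!shadow_accessed_mask)
        return 0;               /* EPT: no hardware young bit */
    if (*sptep & shadow_accessed_mask) {
        *sptep &= ~shadow_accessed_mask;  /* clear so the next call
                                           * sees fresh accesses */
        return 1;
    }
    return 0;
}
```

The swapped-out page then comes back via a minor fault from swapcache after invalidate_page has marked the spte not present, exactly as described in the quoted text.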