From mboxrd@z Thu Jan 1 00:00:00 1970
From: Avi Kivity
Subject: Re: [RFC PATCH 0/3] Weight-balanced binary tree + KVM growable
 memory slots using wbtree
Date: Wed, 02 Mar 2011 15:31:32 +0200
Message-ID: <4D6E46B4.7030909@redhat.com>
References: <1298386481.5764.60.camel@x201>
 <20110222183822.22026.62832.stgit@s20.home> <4D6507C9.1000906@redhat.com>
 <1298484395.18387.28.camel@x201> <1298489332.18387.56.camel@x201>
 <4D662DBF.2020706@redhat.com> <1298568944.6140.21.camel@x201>
 <4D6A1F55.7080804@redhat.com> <1298934271.4177.19.camel@x201>
 <4D6D0ADF.1050107@redhat.com> <1299003632.4177.66.camel@x201>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Cc: linux-kernel@vger.kernel.org, kvm@vger.kernel.org, mtosatti@redhat.com,
 xiaoguangrong@cn.fujitsu.com
To: Alex Williamson
Return-path:
In-Reply-To: <1299003632.4177.66.camel@x201>
Sender: linux-kernel-owner@vger.kernel.org
List-Id: kvm.vger.kernel.org

On 03/01/2011 08:20 PM, Alex Williamson wrote:
> > > It seems like we need a good mixed workload benchmark.  So far we've
> > > only tested worst case, with a pure emulated I/O test, and best case,
> > > with a pure memory test.  Ordering an array only helps the latter,
> > > and only barely beats the tree, so I suspect overall performance
> > > would be better with a tree.
> >
> > But if we cache the missed-all-memslots result in the spte, we
> > eliminate the worst case, and are left with just the best case.
>
> There's potentially a lot of entries between best case and worst case.

The mid case is where we have a lot of small slots which are continuously
flushed.  That would be (ept=0 && new mappings continuously established)
|| (lots of small mappings && lots of host paging activity).

I don't know of any guests that continuously reestablish BAR mappings;
and host paging activity doesn't apply to device assignment.  What are
we left with?

> > The problem here is that all workloads will cache all memslots very
> > quickly into sptes and all lookups will be misses.  There are two
> > cases where we have lookups that hit the memslots structure: ept=0,
> > and host swap.  Neither are things we want to optimize too heavily.
>
> Which seems to suggest that:
>
>    A. making those misses fast = win
>    B. making those misses fast + caching misses = win++
>    C. we don't care if the sorted array is subtly faster for ept=0
>
> Sound right?  So is the question whether cached misses alone gets us
> 99% of the improvement since hits are already getting cached in sptes
> for cases we care about?

Yes, that's my feeling.  Caching those misses is a lot more important
than speeding them up, since the cache will stay valid for long periods,
and since the hit rate will be very high.

Cache+anything  = O(1)
no-cache+tree   = O(log(n))
no-cache+array  = O(n)

-- 
error compiling committee.c: too many arguments to function
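
To make the complexity comparison above concrete, here is a minimal,
self-contained sketch.  This is not the in-tree KVM code: the struct
layouts, the nslots/last_hit fields, and the gfn_to_memslot() shown here
are illustrative assumptions only, standing in for whatever search
structure (array or tree) sits behind the cache.

/*
 * Hypothetical sketch: gfn -> memslot lookup with a one-entry
 * "last hit" cache in front of a linear scan.  A cache hit costs
 * O(1) no matter how the slots are organized; only a miss pays
 * the O(n) array walk (or O(log(n)) with a tree).
 */
#include <stddef.h>

typedef unsigned long gfn_t;

struct memslot {
	gfn_t base_gfn;
	unsigned long npages;
};

struct memslots {
	int nslots;
	struct memslot *slots;		/* sorted or not; irrelevant on a hit */
	struct memslot *last_hit;	/* hypothetical one-entry cache */
};

static int slot_contains(const struct memslot *s, gfn_t gfn)
{
	return s && gfn >= s->base_gfn && gfn - s->base_gfn < s->npages;
}

struct memslot *gfn_to_memslot(struct memslots *m, gfn_t gfn)
{
	int i;

	if (slot_contains(m->last_hit, gfn))
		return m->last_hit;		/* O(1) cached hit */

	for (i = 0; i < m->nslots; i++) {	/* O(n) miss path */
		if (slot_contains(&m->slots[i], gfn)) {
			m->last_hit = &m->slots[i];
			return m->last_hit;
		}
	}
	/* missed all memslots -- the result worth caching in the
	   spte, per the discussion above */
	return NULL;
}

A hit on the one-entry cache never touches the search structure at all,
which is why the array-versus-tree choice only matters on the miss path;
caching the missed-all-memslots result in the spte, as discussed above,
takes even that walk off the hot path.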