From: Dan Magenheimer
Subject: Re: Proposed new "memory capacity claim" hypercall/feature
Date: Wed, 31 Oct 2012 09:04:47 -0700 (PDT)
Message-ID: <83bb902d-8e49-41cf-ad1e-c07c62d6e5f8@default>
In-Reply-To: <5090EBFE02000078000A59DD@nat28.tlf.novell.com>
References: <60d00f38-98a3-4ec2-acbd-b49dafaada56@default>
 <20121029223555.GA24388@ocelot.phlegethon.org>
 <508F9DE902000078000A5565@nat28.tlf.novell.com>
 <509008A502000078000A584E@nat28.tlf.novell.com>
 <5090EBFE02000078000A59DD@nat28.tlf.novell.com>
To: Jan Beulich, Keir (Xen.org)
Cc: Tim Deegan, Olaf Hering, Ian Campbell, Konrad Wilk, George Dunlap,
 Ian Jackson, George Shuklin, xen-devel@lists.xen.org, Dario Faggioli,
 Kurt Hackel, Zhigang Wang
List-Id: xen-devel@lists.xenproject.org

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Subject: RE: Proposed new "memory capacity claim" hypercall/feature
>
> >>> On 30.10.12 at 18:13, Dan Magenheimer wrote:
> >> From: Jan Beulich [mailto:JBeulich@suse.com]

(NOTE TO KEIR: Input from you requested in first stanza below.)

Hi Jan --

Thanks for the continued feedback!  I've slightly re-ordered the
email to focus on the problem (moved tmem-specific discussion to
the end).

> As long as the allocation times can get brought down to an
> acceptable level, I continue to not see a need for the extra
> "claim" approach you're proposing.  So working on that one (or
> showing that without unreasonable effort this cannot be
> further improved) would be a higher priority thing from my pov
> (without anyone arguing about its usefulness).

Fair enough.  I will do some measurement and analysis of this code.

However, let me ask something of you and Keir as well: please
estimate how long (in usec) you think it is acceptable to hold the
heap_lock.  If your limit is very small (as I expect), doing anything
"N" times in a loop with the lock held (N==2^26 for a 256GB domain
of 4KB pages) may make the analysis moot.

> But yes, with all the factors you mention brought in, there is
> certainly some improvement needed (whether your "claim"
> proposal is the right thing is another question, not to mention
> that I currently don't see how this would get implemented in
> a consistent way taking several orders of magnitude less time
> to carry out).

OK, I will start on the next step... proof-of-concept.  I'm
envisioning simple arithmetic, but maybe you are right and
arithmetic will not be sufficient.

> > Suppose you have a huge 256GB machine and you have already launched
> > a 64GB tmem guest "A".  The guest is idle for now, so slowly
> > selfballoons down to maybe 4GB.  You start to launch another 64GB
> > guest "B" which, as we know, is going to take some time to complete.
> > In the middle of launching "B", "A" suddenly gets very active and
> > needs to balloon up as quickly as possible, but it can't balloon fast
> > enough (or at all if "frozen" as suggested) so it starts swapping (and,
> > thanks to Linux frontswap, the swapping tries to go to hypervisor/tmem
> > memory).  But ballooning and tmem are both blocked, and so the
> > guest swaps its poor little butt off even though there's >100GB
> > of free physical memory available.
>
> That's only one side of the overcommit situation you're striving
> to get work right here: That same self-ballooning guest, after
> sufficiently many more guests got started so that the rest of the
> memory got absorbed by them, would suffer the very same problems
> in the described situation, so it has to be prepared for this case
> anyway.

The tmem design does ensure the guest is prepared for this case
anyway... the guest swaps.  And, unlike page-sharing, the guest
determines which pages to swap, not the host, and there is no
possibility of double-paging.

In your scenario, the host memory is truly oversubscribed.  That
scenario is ultimately a weakness of virtualization in general:
trying to statistically share an oversubscribed fixed resource
among a number of guests will sometimes cause a performance
degradation, whether the resource is CPU or LAN bandwidth or, in
this case, physical memory.  That very generic problem is, I think,
not one any of us can solve.  Toolstacks need to be able to
recognize the problem (whether CPU, LAN, or memory) and act
accordingly (report, or auto-migrate).

In my scenario, guest performance is hammered only because of the
unfortunate deficiency in the existing hypervisor memory allocation
mechanisms, namely that small allocations must be artificially
"frozen" until a large allocation can complete.  That specific
problem is one I am trying to solve.

BTW, with tmem, some future toolstack might monitor various
available tmem statistics and predict/avoid your scenario.

Dan
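
P.S. To make "simple arithmetic" a bit more concrete, here is a rough
sketch of the kind of constant-time check I have in mind for a claim,
done while holding the heap_lock (assuming it sits next to the
allocator's free-page accounting in page_alloc.c).  The names
(claim_pages(), total_claimed_pages, d->claimed_pages) are
placeholders for illustration only, not existing hypervisor symbols,
and a real patch would also have to adjust the claim as a claimed
domain's memory is actually populated or freed:

/* Sketch only: total_claimed_pages, claim_pages() and d->claimed_pages
 * are placeholder names, not existing Xen symbols.  The point is that
 * granting a claim is O(1) arithmetic under the heap_lock, instead of
 * a ~2^26-iteration allocation loop for a 256GB domain of 4KB pages.
 */
static unsigned long total_claimed_pages;  /* sum of outstanding claims */

int claim_pages(struct domain *d, unsigned long nr_pages)
{
    int rc = -ENOMEM;

    spin_lock(&heap_lock);

    /* Grant the claim only if unclaimed free memory can cover it. */
    if ( total_avail_pages - total_claimed_pages >= nr_pages )
    {
        total_claimed_pages += nr_pages;
        d->claimed_pages = nr_pages;   /* placeholder per-domain field */
        rc = 0;
    }

    spin_unlock(&heap_lock);

    return rc;
}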