From: Dan Magenheimer
Subject: Re: domain creation vs querying free memory (xend and xl)
Date: Thu, 4 Oct 2012 10:18:29 -0700 (PDT)
To: Andres Lagar-Cavilla
Cc: Olaf Hering, Keir Fraser, Konrad Wilk, George Dunlap, Kurt Hackel, Tim Deegan, xen-devel@lists.xen.org, George Shuklin, Dario Faggioli, Ian Jackson

> From: Andres Lagar-Cavilla [mailto:andreslc@gridcentric.ca]
> Subject: Re: [Xen-devel] domain creation vs querying free memory (xend and xl)
>
> On Oct 4, 2012, at 12:59 PM, Dan Magenheimer wrote:
>
> >> From: Andres Lagar-Cavilla [mailto:andreslc@gridcentric.ca]
> >> Subject: Re: [Xen-devel] domain creation vs querying free memory (xend and xl)
> >>
> >> On Oct 4, 2012, at 6:06 AM, Tim Deegan wrote:
> >>
> >>> At 14:56 -0700 on 02 Oct (1349189817), Dan Magenheimer wrote:
> >>>> Tmem argues that doing "memory capacity transfers" at a page granularity
> >>>> can only be done efficiently in the hypervisor.  This is true for
> >>>> page-sharing when it breaks a "share" also... it can't go ask the
> >>>> toolstack to approve allocation of a new page every time a write to a
> >>>> shared page occurs.
> >>>>
> >>>> Does that make sense?
> >>>
> >>> Yes.  The page-sharing version can be handled by having a pool of
> >>> dedicated memory for breaking shares, and the toolstack asynchronously
> >>> replenish that, rather than allowing CoW to use up all memory in the
> >>> system.
> >>
> >> That is doable.  One benefit is that it would minimize the chance of a VM
> >> hitting a CoW ENOMEM.  I don't see how it would altogether avoid it.
> >
> > Agreed, so it doesn't really solve the problem.  (See longer reply
> > to Tim.)
> >
> >> If the objective is trying to put a cap on the unpredictable growth of
> >> memory allocations via CoW unsharing, two observations: (1) it will never
> >> grow past the nominal VM footprint; (2) one can put a cap today by tweaking
> >> d->max_pages -- CoW will fail, the faulting vcpu will sleep, and things
> >> can be kicked back into action at a later point.
> >
> > But IIRC isn't it (2) that has given VMware memory overcommit a bad name?
> > Any significant memory pressure due to overcommit leads to double-swapping,
> > which leads to horrible performance?
>
> The little that I've been able to read from their published results is that
> a "lot" of CPU is consumed scanning memory and fingerprinting, which leads
> to a massive assault on micro-architectural caches.
>
> I don't know if that equates to a "bad name", but I don't think that is a
> productive discussion either.

Sorry, I wasn't intending that to be snarky, but on re-read I guess it did
sound snarky.  What I meant is: is this just a manual version of what VMware
does automatically?  Or is there something I am misunderstanding?  (I think
you answered that below.)

> (2) doesn't mean swapping.  Note that d->max_pages can be set artificially
> low by an admin, raised again, etc.  It's just a mechanism to keep a VM at
> bay while corrective measures of any kind are taken.  It's really up to a
> higher-level controller whether you accept allocations and later reach a
> point of thrashing.
>
> I understand this is partly where your discussion is headed, but certainly
> fixing the primary issue of nominal vanilla allocations preempting each
> other looks fairly critical to begin with.

OK.  I _think_ the design I proposed also helps in systems that are using
page-sharing/host-swapping... I assume share-breaking just calls the normal
hypervisor allocator interface to allocate a new page (if available)?

If you could review and comment on the design from a
page-sharing/host-swapping perspective, I would appreciate it.

Thanks,
Dan
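
P.S.  In case it helps pin down what I mean by "the normal hypervisor
allocator interface", here is a rough sketch of the allocation step I'm
picturing when a write breaks a share.  The function name is made up and
this is only my guess at the path, not the actual mem_sharing code -- the
point is just that unsharing ends up in the same allocator and the same
max_pages accounting as an ordinary allocation, so it competes for the
same free memory:

    /* Hypothetical sketch, not the real unshare path. */
    #include <xen/sched.h>   /* struct domain, tot_pages, max_pages */
    #include <xen/mm.h>      /* alloc_domheap_page() */
    #include <xen/errno.h>

    static int unshare_and_alloc(struct domain *d)
    {
        struct page_info *pg;

        /* Same capacity check a vanilla allocation would hit; if the
         * domain is at its cap, the faulting vcpu has to wait. */
        if ( d->tot_pages >= d->max_pages )
            return -ENOMEM;

        pg = alloc_domheap_page(d, 0);
        if ( pg == NULL )
            return -ENOMEM;   /* host genuinely out of free pages */

        /* ... copy the shared page's contents into pg, update the p2m,
         * and drop the reference on the shared copy ... */
        return 0;
    }

If the real path differs materially from this, that's exactly the kind of
comment I'm hoping for.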