From: Dan Magenheimer
Subject: Re: Proposed new "memory capacity claim" hypercall/feature
Date: Wed, 31 Oct 2012 09:04:47 -0700 (PDT)
Message-ID: <83bb902d-8e49-41cf-ad1e-c07c62d6e5f8@default>
In-Reply-To: <5090EBFE02000078000A59DD@nat28.tlf.novell.com>
References: <60d00f38-98a3-4ec2-acbd-b49dafaada56@default>
 <20121029223555.GA24388@ocelot.phlegethon.org>
 <508F9DE902000078000A5565@nat28.tlf.novell.com>
 <509008A502000078000A584E@nat28.tlf.novell.com>
 <5090EBFE02000078000A59DD@nat28.tlf.novell.com>
To: Jan Beulich, Keir (Xen.org)
Cc: Tim Deegan, Olaf Hering, Ian Campbell, Konrad Wilk, George Dunlap,
 Ian Jackson, George Shuklin, xen-devel@lists.xen.org, Dario Faggioli,
 Kurt Hackel, Zhigang Wang
List-Id: xen-devel@lists.xenproject.org

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Subject: RE: Proposed new "memory capacity claim" hypercall/feature
>
> >>> On 30.10.12 at 18:13, Dan Magenheimer wrote:
> >> From: Jan Beulich [mailto:JBeulich@suse.com]

(NOTE TO KEIR: Input from you requested in first stanza below.)

Hi Jan --

Thanks for the continued feedback!  I've slightly re-ordered the
email to focus on the problem (moved tmem-specific discussion to
the end).

> As long as the allocation times can get brought down to an
> acceptable level, I continue to not see a need for the extra
> "claim" approach you're proposing.  So working on that one (or
> showing that without unreasonable effort this cannot be
> further improved) would be a higher priority thing from my pov
> (without anyone arguing about its usefulness).

Fair enough.  I will do some measurement and analysis of this code.

However, let me ask something of you and Keir as well: please
estimate how long (in usec) you think it is acceptable to hold the
heap_lock.  If your limit is very small (as I expect), doing anything
"N" times in a loop with the lock held (N==2^26 for a 256GB domain
of 4KB pages) may make the analysis moot.

> But yes, with all the factors you mention brought in, there is
> certainly some improvement needed (whether your "claim"
> proposal is the right thing is another question, not to mention
> that I currently don't see how this would get implemented in
> a consistent way taking several orders of magnitude less time
> to carry out).

OK, I will start on the next step... proof-of-concept.  I'm
envisioning simple arithmetic, but maybe you are right and
arithmetic will not be sufficient.

> > Suppose you have a huge 256GB machine and you have already launched
> > a 64GB tmem guest "A".  The guest is idle for now, so slowly
> > selfballoons down to maybe 4GB.  You start to launch another 64GB
> > guest "B" which, as we know, is going to take some time to complete.
> > In the middle of launching "B", "A" suddenly gets very active and
> > needs to balloon up as quickly as possible, but it can't balloon fast
> > enough (or at all if "frozen" as suggested) so it starts swapping (and,
> > thanks to Linux frontswap, the swapping tries to go to hypervisor/tmem
> > memory).  But ballooning and tmem are both blocked, and so the
> > guest swaps its poor little butt off even though there's >100GB
> > of free physical memory available.
>
> That's only one side of the overcommit situation you're striving
> to get work right here: That same self-ballooning guest, after
> sufficiently many more guests got started so that the rest of the
> memory got absorbed by them, would suffer the very same problems
> in the described situation, so it has to be prepared for this case
> anyway.

The tmem design does ensure the guest is prepared for this case
anyway... the guest swaps.  And, unlike page-sharing, the guest
determines which pages to swap, not the host, and there is no
possibility of double-paging.

In your scenario, the host memory is truly oversubscribed.  That
scenario is ultimately a weakness of virtualization in general:
trying to statistically share an oversubscribed fixed resource
among a number of guests will sometimes cause a performance
degradation, whether the resource is CPU or LAN bandwidth or, in
this case, physical memory.  That very generic problem is, I think,
not one any of us can solve.  Toolstacks need to be able to
recognize the problem (whether CPU, LAN, or memory) and act
accordingly (report, or auto-migrate).

In my scenario, guest performance is hammered only because of the
unfortunate deficiency in the existing hypervisor memory allocation
mechanisms, namely that small allocations must be artificially
"frozen" until a large allocation can complete.  That specific
problem is one I am trying to solve.

BTW, with tmem, some future toolstack might monitor various
available tmem statistics and predict/avoid your scenario.

Dan
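
P.S. To make "simple arithmetic" a bit more concrete, here is a rough
sketch of the kind of constant-time check I have in mind for a claim,
done while holding the heap_lock (assuming it sits next to the
allocator's free-page accounting in page_alloc.c).  The names
(claim_pages(), total_claimed_pages, d->claimed_pages) are
placeholders for illustration only, not existing hypervisor symbols,
and a real patch would also have to adjust the claim as a claimed
domain's memory is actually populated or freed:

/* Sketch only: total_claimed_pages, claim_pages() and d->claimed_pages
 * are placeholder names, not existing Xen symbols.  The point is that
 * granting a claim is O(1) arithmetic under the heap_lock, instead of
 * a ~2^26-iteration allocation loop for a 256GB domain of 4KB pages.
 */
static unsigned long total_claimed_pages;  /* sum of outstanding claims */

int claim_pages(struct domain *d, unsigned long nr_pages)
{
    int rc = -ENOMEM;

    spin_lock(&heap_lock);

    /* Grant the claim only if unclaimed free memory can cover it. */
    if ( total_avail_pages - total_claimed_pages >= nr_pages )
    {
        total_claimed_pages += nr_pages;
        d->claimed_pages = nr_pages;   /* placeholder per-domain field */
        rc = 0;
    }

    spin_unlock(&heap_lock);

    return rc;
}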