From: Dan Magenheimer <dan.magenheimer@oracle.com>
To: Jan Beulich <JBeulich@suse.com>, "Keir (Xen.org)" <keir@xen.org>
Cc: Tim Deegan <tim@xen.org>, Olaf Hering <olaf@aepfle.de>,
	Ian Campbell <Ian.Campbell@citrix.com>,
	Konrad Wilk <konrad.wilk@oracle.com>,
	George Dunlap <George.Dunlap@eu.citrix.com>,
	Ian Jackson <Ian.Jackson@eu.citrix.com>,
	George Shuklin <george.shuklin@gmail.com>,
	xen-devel@lists.xen.org, Dario Faggioli <raistlin@linux.it>,
	Kurt Hackel <kurt.hackel@oracle.com>,
	Zhigang Wang <zhigang.x.wang@oracle.com>
Subject: Re: Proposed new "memory capacity claim" hypercall/feature
Date: Wed, 31 Oct 2012 09:04:47 -0700 (PDT)
Message-ID: <83bb902d-8e49-41cf-ad1e-c07c62d6e5f8@default>
In-Reply-To: <5090EBFE02000078000A59DD@nat28.tlf.novell.com>

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Subject: RE: Proposed new "memory capacity claim" hypercall/feature
> 
> >>> On 30.10.12 at 18:13, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
> >>  From: Jan Beulich [mailto:JBeulich@suse.com]

(NOTE TO KEIR: Input from you requested in first stanza below.)

Hi Jan --

Thanks for the continued feedback!

I've slightly re-ordered the email to focus on the problem
(moved tmem-specific discussion to the end).

> As long as the allocation times can get brought down to an
> acceptable level, I continue to not see a need for the extra
> "claim" approach you're proposing. So working on that one (or
> showing that without unreasonable effort this cannot be
> further improved) would be a higher priority thing from my pov
> (without anyone arguing about its usefulness).

Fair enough.  I will do some measurement and analysis of this
code.  However, let me ask something of you and Keir as well:
Please estimate how long (in usec) you think it is acceptable
to hold the heap_lock.  If your limit is very small (as I expect),
doing anything "N" times in a loop with the lock held (for N==2^26,
i.e. the number of 4KB pages in a 256GB domain) may make the
analysis moot.
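
To make the concern concrete, here is a back-of-envelope estimate
(the per-page costs below are pure assumptions for illustration,
not measurements):

  N = 2^26 pages (256GB / 4KB per page) =~ 67 million iterations
  at an assumed ~100ns of work per page with the lock held:
      2^26 * 100ns =~ 6.7 seconds of heap_lock hold time
  even at an optimistic ~10ns per page, that is still ~0.67 seconds,
  which is many orders of magnitude beyond any usec-scale limit.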

> But yes, with all the factors you mention brought in, there is
> certainly some improvement needed (whether your "claim"
> proposal is the right thing is another question, not to mention
> that I currently don't see how this would get implemented in
> a consistent way taking several orders of magnitude less time
> to carry out).

OK, I will start on the next step... a proof-of-concept.
I'm envisioning simple arithmetic, but maybe you are right
and arithmetic alone will not be sufficient.
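
To sketch what I mean by "simple arithmetic" (every name below is
hypothetical and purely illustrative -- this is not a patch, just the
shape of a proof-of-concept): staking a claim becomes an O(1)
adjustment of a couple of counters under the heap_lock, independent
of domain size, rather than touching the heap N times:

/* Illustrative sketch only -- not actual Xen code. */
static unsigned long total_free_pages;    /* maintained by the allocator */
static unsigned long outstanding_claims;  /* sum of all domains' claims */

int domain_claim_pages(struct domain *d, unsigned long pages)
{
    int rc = -ENOMEM;

    spin_lock(&heap_lock);
    /* Grant the claim only if unclaimed free memory can cover it. */
    if ( pages <= total_free_pages - outstanding_claims )
    {
        outstanding_claims += pages;
        d->claimed_pages = pages;    /* hypothetical per-domain field */
        rc = 0;
    }
    spin_unlock(&heap_lock);

    return rc;
}

As the domain's memory is actually allocated, d->claimed_pages and
outstanding_claims would be decremented in step, so the claim simply
drains away as the build proceeds and disappears when it completes.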

> > Suppose you have a huge 256GB machine and you have already launched
> > a 64GB tmem guest "A".  The guest is idle for now, so it slowly
> > selfballoons down to maybe 4GB.  You start to launch another 64GB
> > guest "B" which, as we know, is going to take some time to complete.
> > In the middle of launching "B", "A" suddenly gets very active and
> > needs to balloon up as quickly as possible, but it can't balloon fast
> > enough (or at all, if "frozen" as suggested), so it starts swapping
> > (and, thanks to Linux frontswap, the swapping tries to go to
> > hypervisor/tmem memory).  But ballooning and tmem are both blocked,
> > and so the guest swaps its poor little butt off even though there's
> > more than 100GB of free physical memory available.
> 
> That's only one side of the overcommit situation you're striving
> to get to work right here: That same selfballooning guest, after
> sufficiently many more guests got started so that the rest of the
> memory got absorbed by them, would suffer the very same problems
> in the described situation, so it has to be prepared for this case
> anyway.

The tmem design does ensure the guest is prepared for this case
anyway... the guest swaps.  And, unlike page-sharing, the guest
determines which pages to swap, not the host, and there is no
possibility of double-paging.

In your scenario, the host memory is truly oversubscribed.  This
scenario is ultimately a weakness of virtualization in general:
trying to statistically share an oversubscribed, fixed resource
among a number of guests will sometimes cause performance
degradation, whether the resource is CPU, LAN bandwidth, or, as
in this case, physical memory.  That very generic problem is, I
think, not one any of us can solve.  Toolstacks need to be able
to recognize the problem (whether CPU, LAN, or memory) and act
accordingly (report it, or auto-migrate).

In my scenario, guest performance is hammered only because of
the unfortunate deficiency in the existing hypervisor memory
allocation mechanisms, namely that small allocations must
be artificially "frozen" until a large allocation can complete.
That specific problem is one I am trying to solve.
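
To be concrete about why small allocations would then no longer need
to be frozen (again, an illustrative sketch with made-up names, not
real allocator code): the normal allocation path only has to refuse
requests that would dip into memory claimed but not yet allocated by
someone else, which is one extra comparison under the lock it already
holds:

/* Hypothetical fragment of the allocation path -- illustrative only.
 * claim_cover is the part of this request already covered by the
 * requesting domain's own claim (zero for domains with no claim). */
spin_lock(&heap_lock);
if ( request_pages - claim_cover > total_free_pages - outstanding_claims )
{
    /* Granting this would eat into memory claimed by another domain. */
    spin_unlock(&heap_lock);
    return NULL;
}
/* ... otherwise allocate as usual, and decrement outstanding_claims
 * by claim_cover so the claim drains as its pages materialize ... */

So a small ballooning or tmem allocation from "A" proceeds immediately
as long as it fits in the unclaimed portion of free memory, instead of
waiting for "B"'s entire 64GB allocation to finish.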

BTW, with tmem, some future toolstack might monitor various
available tmem statistics and predict/avoid your scenario.

Dan
