From: George Dunlap <george.dunlap@eu.citrix.com>
To: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Dan Magenheimer <dan.magenheimer@oracle.com>,
	"Keir (Xen.org)" <keir@xen.org>,
	Ian Campbell <Ian.Campbell@citrix.com>,
	Andres Lagar-Cavilla <andreslc@gridcentric.ca>,
	"Tim (Xen.org)" <tim@xen.org>,
	"xen-devel@lists.xen.org" <xen-devel@lists.xen.org>,
	Konrad Rzeszutek Wilk <konrad@kernel.org>,
	Jan Beulich <JBeulich@suse.com>,
	Ian Jackson <Ian.Jackson@eu.citrix.com>
Subject: Re: Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions
Date: Mon, 14 Jan 2013 18:28:48 +0000
Message-ID: <50F44E60.4090904@eu.citrix.com>
In-Reply-To: <20130102215901.GA16093@phenom.dumpdata.com>

On 02/01/13 21:59, Konrad Rzeszutek Wilk wrote:
> Thanks for the clarification. I am not that fluent in the OCaml code.

I'm not fluent in OCaml either; I'm mainly going from memory based on 
the discussions I had with the author when it was being designed, as 
well as discussions with the xapi team when dealing with bugs at later 
points.

>> When a request comes in for a certain amount of memory, it will go
>> and set each VM's max_pages, and the max tmem pool size.  It can
>> then check whether there is enough free memory to complete the
>> allocation or not (since there's a race between checking how much
>> memory a guest is using and setting max_pages).  If that succeeds,
>> it can return "success".  If, while that VM is being built, another
>> request comes in, it can again go around and set the max sizes
>> lower.  It has to know how much of the memory is "reserved" for the
>> first guest being built, but if there's enough left after that, it
>> can return "success" and allow the second VM to start being built.
>>
>> After the VMs are built, the toolstack can remove the limits again
>> if it wants, again allowing the free flow of memory.
> This sounds to me like what Xapi does?

No, AFAIK xapi always sets the max_pages to what it wants the guest to 
be using at any given time.  I talked about removing the limits (and 
about operating without limits in the normal case) because it seems like 
something that Oracle wants (having to do with tmem).

>> Do you see any problems with this scheme?  All it requires is for
>> the toolstack to be able to temporarily set limits on both guests
>> ballooning up and on tmem allocating more than a certain amount of
>> memory.  We already have mechanisms for the first, so if we had a
>> "max_pages" for tmem, then you'd have all the tools you need to
>> implement it.
> Off the top of my head, the things that come to mind are:
>   - The 'lock' over the memory usage (so the tmem freeze + maxpages set)
>     looks to solve the launching in parallel of guests.
>     It will allow us to launch multiple guests - but it will also mean
>     suppressing the tmem asynchronous calls and having to balloon up/down
>     the guests. The claim hypercall does not do any of those and
>     gives a definite 'yes' or 'no'.

So when you say, "tmem freeze", are you specifically talking about not 
allowing tmem to allocate more memory (what I called a "max_pages" for 
tmem)?  Or is there more to it?

Secondly, just to clarify: when a guest is using memory from the tmem 
pool, is that added to tot_pages?

I'm not sure what "gives a definite yes or no" is supposed to mean -- 
the scheme I described also gives a definite yes or no.

In any case, your point about ballooning is taken: if we set max_pages 
for a VM and just leave it there while VMs are being built, then VMs 
cannot balloon up, even if there is "free" memory (i.e., memory that 
will not be used for the currently-building VM), and memory cannot be 
moved *between* VMs either (i.e., by ballooning down one and ballooning 
the other up).  Both of these can be done by extending the toolstack with 
a memory model (see below), but that adds an extra level of complication.

>   - Complex code that has to keep track of this in the user-space.
>     It also has to know of the extra 'reserved' space that is associated
>     with a guest. I am not entirely sure how that would couple with
>     PCI passthrough. The claim hypercall is fairly simple - albeit
>     having it extended to do Super pages and 32-bit guests could make this
>     longer.

What do you mean by the extra 'reserved' space?  And what potential 
issues are there with PCI passthrough?

To be accepted, the reservation hypercall will certainly have to be 
extended to do superpages and 32-bit guests, so that's the case we 
should be considering.

>   - I am not sure whether the toolstack can manage all the memory
>     allocation. It sounds like it could but I am just wondering if there
>     are some extra corners that we hadn't thought of.

Wouldn't the same argument apply to the reservation hypercall? Suppose 
that there was enough domain memory but not enough Xen heap memory, or 
not enough of some other resource -- the hypercall might succeed, but 
the domain build could still fail at some later point when the other 
resource allocation failed.

>   - Latency. With the locks being placed on the pools of memory the
>     existing workload can be negatively affected. Say that this means we
>     need to balloon down a couple hundred guests, then launch the new
>     guest. This process of 'lower all of them by X', lets check the
>     'free amount'. Oh nope - not enough - lets do this again. That would
>     delay the creation process.
>
>     The claim hypercall will avoid all of that by just declaring:
>     "This is how much you will get." without having to balloon the rest
>     of the guests.
>
>     Here is how I see what your toolstack would do:
>
>       [serial]
> 	1). Figure out how much memory we need for X guests.
> 	2). round-robin existing guests to decrease their memory
> 	    consumption (if they can be ballooned down). Or this
> 	    can be executed in parallel for the guests.
> 	3). check if the amount of free memory is at least X
> 	    [this check has to be done in serial]
>       [parallel]
> 	4). launch multiple guests at the same time.
>
>     The claim hypercall would avoid the '3' part b/c it is inherently
>     part of the Xen's MM bureaucracy. It would allow:
>
>       [parallel]
> 	1). claim hypercall for X guest.
> 	2). if any of the claim's return 0 (so success), then launch guest
> 	3). if the errno was -ENOMEM then:
>       [serial]
>          3a). round-robin existing guests to decrease their memory
>               consumption if allowed. Goto 1).
>
>     So the 'error-case' only has to run in the slow-serial case.

Hmm, I don't think what you wrote about mine is quite right.  Here's 
what I had in mind for mine (let me call it "limit-and-check"):

[serial]
1) Set limits on all guests, and tmem, and see how much memory is left.
2) Read free memory
[parallel]
2a) Claim memory for each guest from freshly-calculated pool of free memory.
3) For each claim that can be satisfied, launch a guest
4) If there are guests that can't be satisfied with the current free 
memory, then:
[serial]
4a) round-robin existing guests to decrease their memory consumption if 
allowed. Goto 2.
5) Remove limits on guests.

Note that 1 would only be done for the first such "request", and 5 would 
only be done after all such requests have succeeded or failed.  Also 
note that steps 1 and 5 are only necessary if you want to go without 
such limits -- xapi doesn't do them, because it always keeps max_pages 
set to what it wants the guest to be using.

Also, note that the "claiming" (2a for mine above and 1 for yours) has 
to be serialized with other "claims" in both cases (in the reservation 
hypercall case, with a lock inside the hypervisor), but that the 
building can begin in parallel with the "claiming" in both cases.

But I think I do see what you're getting at.  The "free memory" 
measurement has to be taken when the system is in a "quiescent" state -- 
or at least a "grow only" state -- otherwise it's meaningless.  So #4a 
should really be:

4a) Round-robin existing guests to decrease their memory consumption if 
allowed.
4b) Wait for currently-building guests to finish building (if any), then 
go to #2.
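
The revised steps can be sketched as a small simulation. This is only an illustration: the `Host` class, its fields and the function name are hypothetical and do not correspond to any real Xen or toolstack interface, and sizes are in GiB for readability. Each pass of the loop reads free memory only after the previous batch has finished building, which is the "quiescent" requirement described above.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """Toy stand-in for a host; every name here is hypothetical."""
    free: int                                     # free memory, read while quiescent
    guests: dict = field(default_factory=dict)    # running guest -> current size
    floors: dict = field(default_factory=dict)    # running guest -> smallest allowed size

    def balloon_down_all(self):
        """Step 4a: shrink every guest to its floor; return the memory freed."""
        freed = 0
        for name, size in self.guests.items():
            floor = self.floors.get(name, size)   # no floor recorded: cannot shrink
            freed += size - floor
            self.guests[name] = floor
        self.free += freed
        return freed


def limit_and_check(host, requests):
    """Return (rounds, rejected): guests started per round, and those never started."""
    pending = dict(requests)
    rounds = []
    while pending:
        available = host.free                     # step 2: read free memory
        started = []
        for name, need in list(pending.items()):
            if need <= available:                 # step 2a: claim from the measured pool
                available -= need
                started.append(name)
                del pending[name]
        for name in started:                      # steps 3 and 4b: build, then wait
            host.guests[name] = requests[name]
            host.free -= requests[name]
        if started:
            rounds.append(started)
        if not pending:
            break
        if host.balloon_down_all() == 0 and not started:
            break                                 # nothing started, nothing left to free
    return rounds, sorted(pending)
```

With 64 GiB free, one existing 32 GiB guest that may shrink to 16 GiB, and requests for 16, 32 and 32 GiB, the first two guests start in the first round and the third has to wait for the second round, after ballooning and after the first batch has finished building.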

So suppose the following cases, in which several requests for guest 
creation come in over a short period of time (not necessarily all at once):
A. There is enough memory for all requested VMs to be built without 
ballooning / something else
B. There is enough for some, but not all of the VMs to be built without 
ballooning / something else

In case A, I think "limit-and-check" and "reservation hypercall" 
should perform the same.  For each new request that comes in, the 
toolstack can say, "Well, when I checked I had 64GiB free; then I 
started to build a 16GiB VM.  So I should have 48GiB left, enough to 
build this 32GiB VM."  "Well, when I checked I had 64GiB free; then I 
started to build a 16GiB VM and a 32GiB VM, so I should have 16GiB left, 
enough to be able to build this 16GiB VM."
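
The bookkeeping in that example amounts to one measurement plus running subtraction. A minimal sketch, assuming a hypothetical `FreeMemoryLedger` kept inside the toolstack (this is not an existing xl or xapi interface), with sizes in GiB. The lock reflects the point made earlier that claims have to be serialised with each other even though the builds can proceed in parallel.

```python
import threading


class FreeMemoryLedger:
    """Tracks memory promised to in-progress builds since the last free-memory check."""

    def __init__(self, free_at_check):
        self.free_at_check = free_at_check   # value read while the system was quiescent
        self.promised = 0                    # total handed to guests still being built
        self._lock = threading.Lock()        # serialises claims against each other

    def remaining(self):
        """Memory believed to be left after all builds started so far."""
        return self.free_at_check - self.promised

    def try_claim(self, size):
        """Reserve `size` if it fits; the build itself can then run in parallel."""
        with self._lock:
            if size > self.free_at_check - self.promised:
                return False
            self.promised += size
            return True
```

Starting from `FreeMemoryLedger(64)`, claiming 16 leaves 48, claiming 32 leaves 16, and a final claim of 16 leaves 0; any further claim is refused until a new measurement is taken.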

The main difference comes in case B.  The "reservation hypercall" method 
will not have to wait until all existing guests have finished building 
to be able to start subsequent guests; but "limit-and-check" would have 
to wait until the currently-building guests are finished before doing 
another check.

This limitation doesn't apply to xapi, because it doesn't use the 
hypervisor's free memory as a measure of the memory it has available to 
it.  Instead, it keeps an internal model of the free memory the 
hypervisor has available.  This is based on MAX(current_target, 
tot_pages) of each guest (where "current_target" for a domain in the 
process of being built is the amount of memory it will have 
eventually).  We might call this the "model" approach.
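
As an illustration of that estimate, here is a rough sketch. It is not xapi's code: the `Domain` record, its field layout and the `host_total` parameter are assumptions made for the example, and subtracting the charges from a fixed host total is a simplification. Each guest is charged the larger of its target and its current allocation, so a guest still being built counts at its final size and a guest that has not yet ballooned down counts at what it still holds.

```python
from dataclasses import dataclass


@dataclass
class Domain:
    name: str
    current_target: int   # pages the toolstack wants the guest to end up with
    tot_pages: int        # pages the guest currently holds


def modelled_free(host_total, domains):
    """Estimate free pages without asking the hypervisor for its free count."""
    charged = sum(max(d.current_target, d.tot_pages) for d in domains)
    return host_total - charged
```

For a host of 10000 pages with a guest being built (target 4096, holding 1024), a guest ballooning down (target 2048, holding 3072) and a steady guest at 1000, the model charges 4096 + 3072 + 1000 and reports 1832 pages free, regardless of how far the build or the ballooning has progressed.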

We could extend "limit-and-check" to "limit-check-and-model" (i.e., 
estimate how much memory is really free after ballooning based on the 
guests' tot_pages), or "limit-model" (basically, fully switch 
to a xapi-style "model" approach while you're doing domain creation).  
That would be significantly more complicated.  On the other hand, a lot 
of the work has already been done by the XenServer team, and (I believe) 
the code in question is all GPL'ed, so Oracle could just take the 
algorithms and adapt them with just a bit of tweaking (and a bit of code 
translation).  It seems to me that the "model" approach brings a lot of 
other benefits as well.

But at any rate -- without debating the value or cost of the "model" 
approach, would you agree with my analysis and conclusions?  Namely:

a. "limit-and-check" and "reservation hypercall" are similar wrt guest 
creation when there is enough memory currently free to build all 
requested guests
b. "limit-and-check" may be slower if some guests can succeed in being 
built but others must wait for memory to be freed up, since the "check" 
has to wait for current guests to finish building
c. (From further back) One downside of a pure "limit-and-check" approach 
is that while VMs are being built, VMs cannot increase in size, even if 
there is "free" memory (not being used to build the currently-building 
domain(s)) or if another VM can be ballooned down.
d. "model"-based approaches can mitigate b and c, at the cost of a more 
complicated algorithm

>   - This still has the race issue - how much memory you see vs the
>     moment you launch it. Granted you can avoid it by having a "fudge"
>     factor (so when a guest says it wants 1G you know it actually
>     needs an extra 100MB on top of the 1GB or so). The claim hypercall
>     would count all of that for you so you don't have to race.

I'm sorry, what race / fudge factor are you talking about?

  -George
