From: Dan Magenheimer <dan.magenheimer@oracle.com>
To: George Dunlap <george.dunlap@eu.citrix.com>
Cc: "Tim (Xen.org)" <tim@xen.org>, Olaf Hering <olaf@aepfle.de>,
	"Keir (Xen.org)" <keir@xen.org>,
	Ian Campbell <Ian.Campbell@citrix.com>,
	Konrad Wilk <konrad.wilk@oracle.com>,
	Ian Jackson <Ian.Jackson@eu.citrix.com>,
	George Shuklin <george.shuklin@gmail.com>,
	xen-devel@lists.xen.org, Dario Faggioli <raistlin@linux.it>,
	Jan Beulich <JBeulich@suse.com>,
	Kurt Hackel <kurt.hackel@oracle.com>,
	Zhigang Wang <zhigang.x.wang@oracle.com>
Subject: Re: Proposed new "memory capacity claim" hypercall/feature
Date: Mon, 5 Nov 2012 10:21:31 -0800 (PST)
Message-ID: <ab378e95-ecd4-423e-95e4-e1c8b8eee88f@default>
In-Reply-To: <5097F3E9.1060404@eu.citrix.com>

> From: George Dunlap [mailto:george.dunlap@eu.citrix.com]
> Subject: Re: Proposed new "memory capacity claim" hypercall/feature
> 
> On 30/10/12 15:43, Dan Magenheimer wrote:
> > a) Truly free memory (each free page is on the hypervisor free list)
> > b) Freeable memory ("ephemeral" memory managed by tmem)
> > c) Owned memory (pages allocated by the hypervisor or for a domain)
> >
> > The sum of these three is always a constant: The total number of
> > RAM pages in the system.  However, when tmem is active, the values
> > of all _three_ of these change constantly.  So if at the start of a
> > domain launch, the sum of free+freeable exceeds the intended size
> > of the domain, the domain allocation/launch can start.

> (And please don't start another rant about the bold new world of peace
> and love.  Give me a freaking *technical* answer.)

<grin> /Me removes seventies-style tie-dye t-shirt with peace logo
and sadly withdraws single daisy previously extended to George.

> Why free+freeable, rather than just "free"?

A free page is a page that is not used for anything at all.
It is on the hypervisor's free list.  A freeable page contains tmem
ephemeral data stored on behalf of a domain (or, if dedup'ing
is enabled, on behalf of one or more domains).  More specifically,
for a tmem-enabled Linux guest, a freeable page contains a clean
page cache page that the Linux guest OS has asked the hypervisor
(via the tmem ABI) to hold if it can, for as long as it can.
The specific clean page cache pages are chosen, and the call is
made, on the Linux side via "cleancache".
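
To make the shape of that concrete, here's a minimal C sketch of the
idea; the names (tmem_ephemeral_put, on_clean_page_evict) are invented
for illustration and are not the real Linux cleancache or Xen tmem
interfaces:

    /* Simplified, self-contained sketch of the cleancache idea; all
     * names are hypothetical stand-ins for the real interfaces. */
    struct page;                      /* opaque guest page */

    /* In a real guest this would be a hypercall into a tmem ephemeral
     * pool; stubbed here so the sketch stands alone.  The hypervisor
     * is free to keep the copy or silently drop it. */
    static int tmem_ephemeral_put(int pool_id, unsigned long object,
                                  unsigned long index, struct page *pg)
    {
        (void)pool_id; (void)object; (void)index; (void)pg;
        return 0;
    }

    /* Called when the guest evicts a clean page cache page: offer the
     * copy to the hypervisor.  Losing it later only costs a disk read. */
    static void on_clean_page_evict(int pool_id, unsigned long inode,
                                    unsigned long index, struct page *pg)
    {
        (void)tmem_ephemeral_put(pool_id, inode, index, pg);
    }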

So, when tmem is working optimally, there are few or no free
pages and many many freeable pages (perhaps half of physical
RAM or more).

Freeable pages across all tmem-enabled guests are kept in a single
LRU queue.  When a request is made to the hypervisor allocator for
a free page and its free list is empty, the allocator will force
tmem to relinquish an ephemeral page (in LRU order).  Because
this is entirely up to the hypervisor and can happen at any
time, freeable pages are not counted as "owned" by any domain,
but they still have some value to the domain whose data they hold.
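
Roughly, the allocation path looks like the sketch below; every name
in it is a made-up simplification of the real Xen allocator and tmem
code:

    #include <stddef.h>

    struct page_info { int unused; };

    /* Stub: pop a page from the hypervisor free list, NULL if empty. */
    static struct page_info *free_list_pop(void) { return NULL; }

    /* Stub: ask tmem to relinquish its least-recently-used ephemeral
     * page back to the free list; 0 on success, -1 if none are left. */
    static int tmem_relinquish_lru_ephemeral(void) { return -1; }

    static struct page_info *alloc_domheap_page_sketch(void)
    {
        struct page_info *pg = free_list_pop();

        /* Truly-free memory exhausted: force tmem to drop freeable
         * (ephemeral) pages, oldest first, and retry the free list. */
        while (pg == NULL && tmem_relinquish_lru_ephemeral() == 0)
            pg = free_list_pop();

        return pg;  /* NULL only once free and freeable are both gone */
    }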

So, in essence, a "free" page has zero value and a "freeable"
page has a small but non-zero value that decays over time.
So it's useful for a toolstack to know both quantities.
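
So the capacity check quoted at the top amounts to something like
this on the toolstack side (a hypothetical helper; the two counters
would come from a physinfo-style query of the hypervisor):

    #include <stdbool.h>
    #include <stdint.h>

    static bool domain_launch_may_start(uint64_t free_pages,
                                        uint64_t freeable_pages,
                                        uint64_t domain_pages)
    {
        /* Freeable (tmem ephemeral) pages can be reclaimed by the
         * hypervisor at any time, so they count toward capacity... */
        if (free_pages + freeable_pages < domain_pages)
            return false;

        /* ...but "owned" can still grow while the build is in flight,
         * so passing this check doesn't guarantee the launch succeeds. */
        return true;
    }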

(And, since this thread has gone in many directions, let me
reiterate that all of this has been working in the hypervisor
since 4.0 in 2009, and cleancache in Linux since mid-2011.)
 
> >   But then
> > if "owned" increases enough, there may no longer be enough memory
> > and the domain launch will fail.
> 
> Again, "owned" would not increase at all if the guest weren't handing
> memory back to Xen.  Why is that necessary, or even helpful?

The guest _is_ handing memory back to Xen.  This is the other half
of tmem's functionality: persistent pages.

Answering your second question is going to require a little more
background.

Since nobody, not even the guest kernel, can guess the future
needs of its workload, there are two choices: (1) allocate enough
RAM so that the supply always exceeds max-demand, or (2) aggressively
reduce RAM to a reasonable guess for a target and prepare for the
probability that, sometimes, available RAM won't be enough.  Tmem takes
choice #2; selfballooning aggressively drives RAM (or "current memory"
as the hypervisor sees it) to a target level: in Linux, to Committed_AS
modified by a formula similar to the one Novell derived for a minimum
ballooning safety level.  The target level changes constantly, but the
selfballooning code samples and adjusts only periodically.  If, during
the time interval between samples, memory demand spikes, Linux
has a memory shortage and responds as it must, namely by swapping.
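
For illustration only (the floor formula and the constants below are
invented for this sketch, not the actual selfballooning code), the
periodic target computation is in the spirit of:

    #include <stdint.h>

    static uint64_t selfballoon_target_pages(uint64_t committed_as_pages,
                                             uint64_t total_ram_pages)
    {
        /* Hypothetical safety floor in the spirit of the Novell-derived
         * minimum: a fixed fraction of RAM plus a little slack. */
        uint64_t floor = total_ram_pages / 50 + 1024;
        uint64_t target = committed_as_pages;

        if (target < floor)
            target = floor;
        if (target > total_ram_pages)
            target = total_ram_pages;

        /* The guest then balloons toward 'target'; demand spikes between
         * samples are absorbed by swapping (see frontswap below). */
        return target;
    }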

The frontswap code in Linux "intercepts" this swapping so that,
in most cases, it goes to a Xen tmem persistent pool instead of
to a (virtual or physical) swap disk.  Data in a persistent pool,
unlike in an ephemeral pool, is guaranteed to be retained by the
hypervisor until the guest invalidates it or the guest dies.
As a result, pages allocated for persistent pools increase the count
of pages "owned" by the domain that requested them, until the guest
explicitly invalidates them (or dies).  The accounting also ensures
that a malicious domain can't absorb memory beyond its
toolstack-specified limit ("maxmem").

Note that, if compression is enabled, a domain _may_ "logically"
exceed maxmem, as long as it does not physically exceed it.
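
In other words, the accounting charges the physical pages the
hypervisor actually consumes, not the logical pages stored; extending
the hypothetical struct domain_sketch from the sketch above:

    /* With compression, charge physical pages consumed, not logical
     * pages stored: several logical guest pages may share one physical
     * page, so a domain can "logically" exceed maxmem without
     * physically exceeding it. */
    static void account_compressed_put(struct domain_sketch *d,
                                       uint64_t physical_pages_used)
    {
        d->owned_pages += physical_pages_used;
    }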

(And, again, all of this too has been in Xen since 4.0 in 2009,
and selfballooning has been in Linux since mid-2011, but frontswap
finally was accepted into Linux earlier in 2012.)

Ok, George, does that answer your questions, _technically_?  I'll
be happy to answer any others.

Thanks,
Dan
