Analysis of using balloon page compaction in Xen balloon driver


* Analysis of using balloon page compaction in Xen balloon driver
@ 2014-10-16 17:12 Wei Liu
  2014-10-16 17:46 ` David Vrabel
  2014-10-17 12:30 ` Andrew Cooper
  0 siblings, 2 replies; 7+ messages in thread
From: Wei Liu @ 2014-10-16 17:12 UTC
  To: xen-devel
  Cc: wei.liu2, Ian Campbell, Andrew Cooper, Stefano Stabellini,
	David Vrabel, Boris Ostrovsky

This document analyses the impact of using balloon compaction
infrastructure in Xen balloon driver.

## Motives

1. Balloon pages fragment the guest physical address space.
2. The balloon compaction infrastructure can migrate ballooned pages from
   the start of a zone to the end of the zone, hence creating contiguous
   guest physical address space.
3. Having contiguous guest physical address space enables some options to
   improve performance.

## Benefit for auto-translated guest

HVM/PVH/ARM guest can have contiguous guest physical address space
after balloon pages are compacted, which potentially improves memory
performance provided guest makes use of huge pages, either via
Hugetlbfs or Transparent Huge Page (THP).

Consider the memory access pattern of these guests: one access to a guest
physical address involves several accesses to machine memory. The
total number of memory accesses can be represented as:

> X = H1 * G1 + H2 * G2 + ... + Hn * Gn + 1

Hx denotes second stage page table walk levels and Gx denotes guest
page table walk levels.

By having contiguous guest physical address space, the guest can make use
of huge pages. This can reduce the number of G's in the formula.

Reducing number of H's is another project for hypervisor side
improvement and should be decoupled from Linux side changes.

## Design and implementation

The use of balloon compaction doesn't require introducing new
interfaces between the Xen balloon driver and the rest of the system. Most
changes are internal to the Xen balloon driver.

Currently, the Xen balloon driver gets its pages directly from the page
allocator. To enable balloon page migration, those pages now need to
be allocated from the core balloon driver. Pages allocated from the core
balloon driver are subject to balloon page compaction.

The Xen balloon driver will also need to provide a callback to migrate
a balloon page. In essence the callback function receives "old page", which
is an already ballooned-out page, and "new page", which is a page to be
ballooned out, then it inflates "old page" and deflates "new page".

The core of the migration callback is the XENMEM\_exchange hypercall. This
makes sure that inflation of old page and deflation of new page are
done atomically, so even if a domain is beyond its memory target and
that target is being enforced, it can still compact memory.

## HAP table fragmentation is not made worse

*Assumption*: guest physical address space is already heavily
fragmented by balloon pages when balloon page compaction is required.

For a typical test case like ballooning up and down when doing kernel
compilation, there's usually only a handful of huge pages left in the
end. So the observation matches the assumption. On the other hand, if
guest physical address space is not heavily fragmented, it's not
likely balloon page compaction will be triggered automatically.

In practice, balloon page compaction is not likely to make things
worse. Here is the analysis based on the above assumption.

Note that the HAP table is already shattered by balloon pages. When a
guest page is ballooned out, the underlying HAP entry needs to be
split if that entry points to a huge page.

XENMEM\_exchange works as follows, where "old page" is the guest page about
to get inflated and "new page" is the guest page about to get
deflated:

1. Steal old page from domain.
2. Allocate a heap page from domheap.
3. Release new page back to Xen.
4. Update guest physmap: old page points to heap page, new page points
   to INVALID\_MFN.

The end result is that HAP entry for "old page" now points to a valid
MFN instead of having INVALID\_MFN; HAP entry for "new page" now points
to INVALID\_MFN.
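
The end result described above can be illustrated with a toy model. This is
only a sketch: the dictionary `p2m`, the list `free_mfns` standing in for the
domheap, and the function name `exchange` are all assumptions made for
illustration, and the model follows the end state of step 4 rather than Xen's
internal ordering or locking.

```python
INVALID_MFN = None  # stand-in for Xen's INVALID_MFN in this toy model


def exchange(p2m, free_mfns, old_gfn, new_gfn):
    """Toy model of the physmap update described above.

    p2m maps gfn -> mfn (or INVALID_MFN); free_mfns stands in for the
    domheap. Both are illustrative, not Xen data structures.
    """
    if p2m[old_gfn] is not INVALID_MFN:
        raise ValueError("old page must currently be ballooned out")
    if p2m[new_gfn] is INVALID_MFN:
        raise ValueError("new page must currently be populated")
    if not free_mfns:
        raise MemoryError("no heap page available")

    heap_mfn = free_mfns.pop()        # allocate a heap page
    free_mfns.append(p2m[new_gfn])    # release new page's frame back
    p2m[old_gfn] = heap_mfn           # old page now has a valid MFN
    p2m[new_gfn] = INVALID_MFN        # new page is now the hole


p2m = {0: 100, 1: INVALID_MFN, 2: 102, 3: 103}
heap = [200]
populated_before = sum(m is not INVALID_MFN for m in p2m.values())
exchange(p2m, heap, old_gfn=1, new_gfn=3)
populated_after = sum(m is not INVALID_MFN for m in p2m.values())

print(p2m)                                  # {0: 100, 1: 200, 2: 102, 3: None}
print(populated_before == populated_after)  # True
```

The number of populated entries is the same before and after, which is why
the exchange can proceed even when the domain is held at its memory target.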

So for old page we're in the same position as before. HAP table is
fragmented, however it's not more fragmented than before.

For new page, the risk is that if the targeted guest new page is part
of a huge page, we need to split the HAP entry, hence fragmenting the HAP
table. This is a valid concern. However in practice, guest address space
is already fragmented by ballooning. It's not likely we need to break
up any more huge pages, because there aren't that many left. So we're
in a position no worse than before.

Another downside is that when Xen is exchanging a page, it's possible
that Xen may need to break up a huge page to get a 4K page, which
fragments the Xen domheap. However we're not getting any worse than
before, as ballooning already fragments the domheap.

## Beyond Linux balloon compaction infrastructure

Currently there's no mechanism in Xen to coalesce HAP table
entries. To coalesce HAP entries we would need to make sure all
discrete entries belong to one huge page, are in correct order and
correct state.

By introducing necessary infrastructure(s) inside hypervisor (page
migration etc.), we might eventually be able to coalesce HAP entries,
hence reducing the number of H's in the aforementioned formula. This,
combined with the work on guest side, can help guest achieve best
possible performance.


* Re: Analysis of using balloon page compaction in Xen balloon driver
  2014-10-16 17:12 Analysis of using balloon page compaction in Xen balloon driver Wei Liu
@ 2014-10-16 17:46 ` David Vrabel
  2014-10-17  8:20   ` Ian Campbell
  2014-10-17 12:30 ` Andrew Cooper
  1 sibling, 1 reply; 7+ messages in thread
From: David Vrabel @ 2014-10-16 17:46 UTC
  To: Wei Liu, xen-devel
  Cc: Andrew Cooper, Boris Ostrovsky, Stefano Stabellini, Ian Campbell

On 16/10/14 18:12, Wei Liu wrote:
> This document analyses the impact of using balloon compaction
> infrastructure in Xen balloon driver.

Thanks for writing this.  This is an excellent starting point for a
productive design discussion.

> ## Benefit for auto-translated guest
> 
> HVM/PVH/ARM guest can have contiguous guest physical address space
> after balloon pages are compacted, which potentially improves memory
> performance provided guest makes use of huge pages, either via
> Hugetlbfs or Transparent Huge Page (THP).
> 
> Consider memory access pattern of these guests, one access to guest
> physical address involves several accesses to machine memory. The
> total number of memory accesses can be represented as:
> 
>> X = H1 * G1 + H2 * G2 + ... + Hn * Gn + 1
> 
> Hx denotes second stage page table walk levels and Gx denotes guest
> page table walk levels.
> 
> By having contiguous guest physical address, guest can make use of
> huge pages. This can reduce the number of G's in formula.
> 
> Reducing number of H's is another project for hypervisor side
> improvement and should be decoupled from Linux side changes.

Whilst this analysis is fine, I don't think this is the real benefit of
using superpages, which is reducing TLB usage and reducing the number of
TLB misses.

With fragmented stage 2 tables I don't think you will see much
improvement in TLB usage.

> ## HAP table fragmentation is not made worse

This reasoning looks reasonable to me.  But it suggests that the balloon
compaction isn't doing its job properly.  It seems like it should be
much more proactive in resolving fragmentation.

> ## Beyond Linux balloon compaction infrastructure
> 
> Currently there's no mechanism in Xen to coalesce HAP table
> entries. To coalesce HAP entries we would need to make sure all
> discrete entries belong to one huge page, are in correct order and
> correct state.

I would like to see a more detailed description of the Xen-side
solution.  So we can be sure the Linux half is compatible with it.

Before accepting any series I would also need to see real world
performance improvements, not just theoretical ones.

David


* Re: Analysis of using balloon page compaction in Xen balloon driver
  2014-10-16 17:46 ` David Vrabel
@ 2014-10-17  8:20   ` Ian Campbell
  2014-10-17 10:07     ` David Vrabel
  0 siblings, 1 reply; 7+ messages in thread
From: Ian Campbell @ 2014-10-17  8:20 UTC
  To: David Vrabel
  Cc: Wei Liu, Andrew Cooper, Stefano Stabellini, xen-devel,
	Boris Ostrovsky

On Thu, 2014-10-16 at 18:46 +0100, David Vrabel wrote:
> On 16/10/14 18:12, Wei Liu wrote:
> > This document analyses the impact of using balloon compaction
> > infrastructure in Xen balloon driver.
> 
> Thanks for writing this.   This is a excellent starting point for a
> productive design discussion.
> 
> > ## Benefit for auto-translated guest
> > 
> > HVM/PVH/ARM guest can have contiguous guest physical address space
> > after balloon pages are compacted, which potentially improves memory
> > performance provided guest makes use of huge pages, either via
> > Hugetlbfs or Transparent Huge Page (THP).
> > 
> > Consider memory access pattern of these guests, one access to guest
> > physical address involves several accesses to machine memory. The
> > total number of memory accesses can be represented as:
> > 
> >> X = H1 * G1 + H2 * G2 + ... + Hn * Gn + 1
> > 
> > Hx denotes second stage page table walk levels and Gx denotes guest
> > page table walk levels.
> > 
> > By having contiguous guest physical address, guest can make use of
> > huge pages. This can reduce the number of G's in formula.
> > 
> > Reducing number of H's is another project for hypervisor side
> > improvement and should be decoupled from Linux side changes.
> 
> Whilst this analysis is fine, I don't think this is the real benefit of
> using superpages which is reducing TLB usage and reducing the number of
> TLB misses.

It depends a bit on whether the TLB caches partial walks etc, but more
importantly using super pages reduces the cost of a TLB miss, by
requiring fewer memory accesses on the walk.

> With fragmented stage 2 tables I don't think you will see see much
> improvement in TLB usage.

I think you will save some: by cutting off a level of the stage one
page tables you avoid the need for doing a stage 2 walk at that level,
which might be 3-4 levels of lookup.

I expect it is not as significant a benefit as stage 2 superpages (which
saves you accesses at every level of the stage 1 walk), but it will be
there.

> > ## HAP table fragmentation is not made worse
> 
> This reasoning looks reasonable to me.  But it suggests that the balloon
> compaction isn't doing it's job properly.  It seems like it should be
> much more proactive in resolving fragmentation.

Speaking to some KVM folks here at plumbers it seems they find
compaction to be working pretty well for them, but there are some proc
knobs one has to twiddle to make it more aggressive. (sadly there are
none of them here right now so I can't ask for more pointers to said
knobs).

> > ## Beyond Linux balloon compaction infrastructure
> > 
> > Currently there's no mechanism in Xen to coalesce HAP table
> > entries. To coalesce HAP entries we would need to make sure all
> > discrete entries belong to one huge page, are in correct order and
> > correct state.
> 
> I would like to see a more detailed description of the Xen-side
> solution.  So we can be sure the Linux half is compatible with it.

I believe Dario has hacked up some prototypes (on x86) at some point.
But I don't believe there will be terribly much linkage between the
guest and hypervisor half beyond having the guest side arrange for as
many 2MB slots as possible to be completely populated such that the
hypervisor has opportunities to do compaction.

The compaction (both guest and Xen side) is not the first order issue
here though.

The first thing we should be doing is to be trying to balloon up and
down in 2M increments in the first place wherever possible. That
includes things like alloc_xenballooned_pages operating in 2M increments
under the hood such that things like grant maps which require 4K p2m
entries are condensed into the least number of 2M regions possible, IOW
having been forced to fragment a region use it for as many other 4K
mappings as possible.

However we know that we are not always going to be able to allocate a 2M
page to balloon out, and we need to be prepared to mitigate this, which
is where compaction comes in.
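
As a rough illustration of "2M first, 4K as a fallback", the selection
policy could be sketched as below. This is not the balloon driver's code:
`plan_balloon_out`, the `free_gfns` set standing in for the page allocator,
and the 512-pages-per-2M constant (4K base pages) are assumptions made for
the example.

```python
PAGES_PER_2M = 512  # 4K pages in one 2M region


def plan_balloon_out(free_gfns, nr_pages):
    """Pick pages to balloon out, preferring whole aligned 2M regions.

    free_gfns is the set of guest frames the guest could give up; it
    stands in for the page allocator and is purely illustrative.
    Returns (bases of 2M regions, list of individual 4K gfns).
    """
    free = set(free_gfns)
    regions, singles = [], []
    remaining = nr_pages
    # First pass: aligned 2M regions that are entirely free.
    for base in sorted({g - g % PAGES_PER_2M for g in free}):
        if remaining < PAGES_PER_2M:
            break
        chunk = range(base, base + PAGES_PER_2M)
        if all(g in free for g in chunk):
            regions.append(base)
            free.difference_update(chunk)
            remaining -= PAGES_PER_2M
    # Fallback: single 4K pages, to meet the target when no 2M region is left.
    for g in sorted(free):
        if remaining == 0:
            break
        singles.append(g)
        remaining -= 1
    return regions, singles


free = set(range(512, 1024)) | {3, 7, 2000}
print(plan_balloon_out(free, 514))  # ([512], [3, 7])
```

Every 4K page that ends up in the second list is a candidate for later
compaction, which is the mitigation role described above.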

Compaction on the guest side serves two purposes immediately even
without hypervisor side compaction: Firstly it increases the chances of
being able to allocate a 2M page when required to balloon one out,
either right now or at some point in the future, IOW it helps towards
the goal of doing as much ballooning as possible in 2M chunks.

Secondly it means that we will end up with contiguous 2M holes which
will give the opportunity for future balloon operations to end up with 2M
mappings; this is useful in its own right even if it is neutral wrt the
fragmentation of the populated 2M regions right now (and we know it
can't make things worse in that regard).

I think it is important to realise that this is an independently useful
change which is also a prerequisite for some interesting future work. I
think blocking this work now pending the completion of that future
interesting work is unreasonable.

> Before accepting any series I would also need to see real world
> performance improvements, not just theoretical ones.

I think the interesting statistics here will be:

      * The numbers of 4K and 2M mappings used by the domain's p2m
        (since it is well established that 2M mappings improve
        performance in multiple workloads on multiple architectures,
        there is no need to reproduce that result yet again IMHO).
      * The numbers of completely depopulated 2M regions, which
        represent the potential for improved mappings when ballooning
        back up.
      * The numbers of completely populated 2M regions in the p2m, which
        represent opportunities for the hypervisor to make further
        improvements *in the future*.
      * The numbers of 2M regions which consist either solely of 4K
        mappings of RAM + holes or solely "special" 4K mappings (grant
        mappings etc) + holes.

Those last three are somewhat complementary I think.

The ARM p2m tracks the numbers of each size of mapping for a given
domain. The number of holes/full regions is not tracked but a debug
hypercall or console keyhandler could quite easily scan for them, based
on looking at the p2m type associated with each entry.
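
A scan of that kind might look roughly like the following sketch. It runs
over a toy p2m (a dictionary from gfn to a type string), not over Xen's real
p2m; the type names "ram" and "grant", the category names, and the function
`classify_regions` are all hypothetical.

```python
from collections import Counter

PAGES_PER_2M = 512  # 4K pages in one 2M region


def classify_regions(p2m, nr_gfns):
    """Classify each 2M-aligned region of a toy p2m.

    p2m maps gfn -> type string ("ram", "grant", ...); a missing gfn is a
    hole. The type names are hypothetical, not Xen's p2m types.
    """
    stats = Counter()
    for base in range(0, nr_gfns, PAGES_PER_2M):
        end = min(base + PAGES_PER_2M, nr_gfns)
        types = [p2m.get(g) for g in range(base, end)]
        present = {t for t in types if t is not None}
        if not present:
            stats["depopulated"] += 1        # room for a 2M mapping later
        elif None not in types and present == {"ram"}:
            stats["fully_populated"] += 1    # candidate for coalescing
        elif present == {"ram"}:
            stats["ram_with_holes"] += 1
        elif "ram" not in present:
            stats["special_only"] += 1       # e.g. grant mappings + holes
        else:
            stats["mixed"] += 1              # RAM and special mappings
    return stats


p2m = {g: "ram" for g in range(512)}               # region 0: full
p2m.update({g: "ram" for g in range(1024, 1535)})  # region 2: one hole
p2m[1536] = "grant"                                # region 3: special only
print(sorted(classify_regions(p2m, 2048).items()))
# [('depopulated', 1), ('fully_populated', 1), ('ram_with_holes', 1), ('special_only', 1)]
```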

Ian.


* Re: Analysis of using balloon page compaction in Xen balloon driver
  2014-10-17  8:20   ` Ian Campbell
@ 2014-10-17 10:07     ` David Vrabel
  0 siblings, 0 replies; 7+ messages in thread
From: David Vrabel @ 2014-10-17 10:07 UTC
  To: Ian Campbell
  Cc: Wei Liu, Andrew Cooper, Stefano Stabellini, xen-devel,
	Boris Ostrovsky

On 17/10/14 09:20, Ian Campbell wrote:
> 
> I think it is important to realise that this is an independently useful
> change which is also a prerequisite for some interesting future work. I
> think blocking this work now pending the completion of that future
> interesting work is unreasonable.

I don't think requiring some initial design work for the complete
solution (including the hypervisor side compaction) is unreasonable.

This should even be straightforward given the prototyping that Dario has
already done.

David


* Re: Analysis of using balloon page compaction in Xen balloon driver
  2014-10-16 17:12 Analysis of using balloon page compaction in Xen balloon driver Wei Liu
  2014-10-16 17:46 ` David Vrabel
@ 2014-10-17 12:30 ` Andrew Cooper
  2014-10-17 12:44   ` Ian Campbell
  2014-10-17 13:27   ` Wei Liu
  1 sibling, 2 replies; 7+ messages in thread
From: Andrew Cooper @ 2014-10-17 12:30 UTC
  To: Wei Liu, xen-devel
  Cc: Boris Ostrovsky, Ian Campbell, Stefano Stabellini, David Vrabel

On 16/10/2014 18:12, Wei Liu wrote:
> This document analyses the impact of using balloon compaction
> infrastructure in Xen balloon driver.

This is a fantastic start (and I actively recommend similar documents
from others for future work).

> ## Motives
>
> 1. Balloon pages fragments guest physical address space.
> 2. Balloon compaction infrastructure can migrate ballooned pages from
>    start of zone to end of zone, hence creating contiguous guest physical
>    address space.
> 3. Having contiguous guest physical address enables some options to
>    improve performance.
>
> ## Benefit for auto-translated guest
>
> HVM/PVH/ARM guest can have contiguous guest physical address space
> after balloon pages are compacted, which potentially improves memory
> performance provided guest makes use of huge pages, either via
> Hugetlbfs or Transparent Huge Page (THP).
>
> Consider memory access pattern of these guests, one access to guest
> physical address involves several accesses to machine memory. The
> total number of memory accesses can be represented as:
>
>> X = H1 * G1 + H2 * G2 + ... + Hn * Gn + 1
> Hx denotes second stage page table walk levels and Gx denotes guest
> page table walk levels.

I don't think this expresses what you intend to express (or I don't
understand how you are trying to convey it).

Consider a single memory access in the guest, with no pagefaults, and
ignoring for now any TLB effects.  This description is based on my
knowledge of x86, but I assume ARM functions in a similar way.

Both the guest, G, and host, H, are 64bit, so using 4-level
translations.  Consider first, the worst case where all mappings are 4K
pages.

gcr3 needs following to find gl4.  This involves a complete host
pagetable walk (4 translations)
gl4 needs following to find gl3.  This involves a complete host
pagetable walk (4 translations)
gl3 to gl2 ...
gl2 to gl1 ...
gl1 to gpa ...

In the worst case, it takes 20 translations for a single guest memory
access.

Altering the guest to use a 2MB superpage would alleviate 4 translations.

Altering the host to use 2MB superpages would alleviate 5 translations.
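
The arithmetic above can be reproduced with a small model. It is a
simplification that follows the counting used in this message: only host
translations are counted, each guest physical address touched during the
guest walk (gcr3, every guest table entry, and the final gpa) costs one full
host walk, and TLB effects are ignored. The function name and parameters are
made up for the example.

```python
def host_translations(guest_levels=4, host_levels=4,
                      guest_superpage=False, host_superpages=False):
    """Count host (second-stage) translations for one guest memory access."""
    # A 2MB guest superpage removes the last guest level (gl1).
    g = guest_levels - 1 if guest_superpage else guest_levels
    # 2MB host superpages everywhere shorten every host walk by one level.
    h = host_levels - 1 if host_superpages else host_levels
    # g guest table lookups plus the starting gcr3 -> (g + 1) host walks.
    return (g + 1) * h


worst = host_translations()
print(worst)                                            # 20
print(worst - host_translations(guest_superpage=True))  # 4
print(worst - host_translations(host_superpages=True))  # 5
```

The host figure of 5 assumes every one of the five host walks lands on a 2MB
mapping.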

It should be noted that the host pagetable walks are distinct, so a
single 2MB superpage could turn only a single host walk from 4
translations to 3 translations.  A guest can help itself substantially
by allocating its pagetables contiguously, which would cause multiple
guest translations to be contained within the same host translation,
getting much better temporal locality of reference from the TLB.

It should also be noted that Xen is in a far better position to make
easy use of 2MB and 1GB superpages, than a guest running normal
workloads is.  However, the ideal case is to make use of both host and
guest superpages.

> By having contiguous guest physical address, guest can make use of
> huge pages. This can reduce the number of G's in formula.
>
> Reducing number of H's is another project for hypervisor side
> improvement and should be decoupled from Linux side changes.
>
> ## Design and implementation
>
> The use of balloon compaction doesn't require introducign new
> interfaces between Xen balloon driver and the rest of the system. Most
> changes are internal to Xen balloon driver.
>
> Currently, Xen balloon driver gets its page directly from page
> allocator. To enable balloon page migration, those pages now need to
> be allocated from core balloon driver. Pages allocated from core
> balloon driver are subject to balloon page compaction.
>
> Xen balloon driver will also need to provide a callback to migrate
> balloon page. In essence callback function receives "old page", which
> is a already ballooned out page, and "new page", which is a page to be
> ballooned out, then it inflates "old page" and deflates "new page".
>
> The core of migration callback is XENMEM\_exchange hypercall. This
> makes sure that inflation of old page and deflation of new page is
> done atomically, so even if a domain is beyond its memory target and
> being enforced, it can still compact memory.
>
> ## HAP table fragmentation is not made worse
>
> *Assumption*: guest physical address space is already heavily
> fragmented by balloon pages when balloon page compaction is required.
>
> For a typical test case like ballooning up and down when doing kernel
> compilation, there's usually only a handful huge pages left in the
> end. So the observation matches the assumption. On the other hand, if
> guest physical address space is not heavily fragmented, it's not
> likely balloon page compaction will be triggered automatically.
>
> In practice, balloon page compaction is not likely to make things
> worse. Here is the analysis based on the above assumption.
>
> Note that HAP table is already shattered by balloon pages. When a
> guest page is ballooned out, the underlying HAP entry needs to be
> split should that entry pointed to a huge page.
>
> XENMEM\_exchange works as followed, "old page" is the guest page about
> to get inflated and "new page" is the guest page about to get
> deflated. It works like this:
>
> 1. Steal old page from domain.
> 2. Allocate a heap page from domheap
> 3. Release new page back to Xen
> 4. Update guest physmap, old page points to heap page, new page points
>    to INVALID\_MFN.
>
> The end result is that HAP entry for "old page" now points to a valid
> MFN instead of having INVALID\_MFN; HAP entry for "new page" now points
> to INVALID\_MFN.
>
> So for old page we're in the same position as before. HAP table is
> fragmented, however it's not more fragmented than before.
>
> For new page, the risk is that if the targeting guest new page is part
> of a huge page, we need to split HAP entry, hence fragmenting HAP
> table. This is valid concern. However in practice, guest address space
> is already fragmented by ballooning. It's not likely we need to break
> up any more huge pages, because there aren't that many left. So we're
> in a position no worse than before.
>
> Another downside is that when Xen is exchanging a page, it's possible
> that Xen may need to break up a huge page to get a 4K page. Xen
> domheap is fragmented. However we're not getting any worse than before
> as ballooning already fragments domheap.
>
> ## Beyond Linux balloon compaction infrastructure
>
> Currently there's no mechanism in Xen to coalesce HAP table
> entries. To coalesce HAP entries we would need to make sure all
> discrete entries belong to one huge page, are in correct order and
> correct state.
>
> By introducing necessary infrastructure(s) inside hypervisor (page
> migration etc.), we might eventually be able to coalesce HAP entries,
> hence reducing the number of H's in the aforementioned formula. This,
> combined with the work on guest side, can help guest achieve best
> possible performance.

If I understand your proposal correctly, a VM lifetime would look like this:

1 Xen allocates pages (hopefully 2MB where possible)
2 Guest starts up, and shatters both guest and host superpages by
blindly ballooning random gfns
3a During runtime, guest spends time copying pages around in an attempt
to coalesce
3b (optionally, given Xen support) Xen spends time copying pages around
in an attempt to coalesce
4 Guest reshatters guest and host pages by more ballooning.
5 goto 3

Step 3 is expensive (especially for Xen, which has far more important
things to be doing with its time) and is self-perpetuating because
of the balloon driver reshattering pages.

Several factors contribute to shattering host pages.  The ones which
come to mind are:
* Differing cacheability from MTRRs
* Mapping a foreign grant into one's own physical address space
* Releasing pages back to Xen via the decrease_reservation hypercall

In addition, other factors complicate Xen's ability to move pages.
* Mappings from other domains (Qemu, PV backends, etc) will pin mfns in
place
* Any IOMMU mappings will pin all (mapped) mfns in place.

As a result, by far the most efficient way of preventing superpage
fragmentation is to not shatter them in the first place.  This can be
done by changing the balloon driver in the guest to co-locate all pages
it decides to balloon, rather than taking individual pages at random
from the main memory pools.
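
To give a feel for the difference, the sketch below counts how many 2M
regions are touched when the same number of 4K pages is ballooned from
random gfns versus from one contiguous, aligned run. The guest size, the
balloon size and the helper name `regions_touched` are arbitrary choices
for illustration, not measurements.

```python
import random

PAGES_PER_2M = 512  # 4K pages in one 2M region


def regions_touched(ballooned_gfns):
    """Number of 2M regions containing at least one ballooned 4K page."""
    return len({g // PAGES_PER_2M for g in ballooned_gfns})


nr_gfns = 64 * PAGES_PER_2M      # toy guest: 64 regions of 2M (128MB)
nr_balloon = 4 * PAGES_PER_2M    # balloon out 8MB worth of 4K pages

rng = random.Random(0)           # fixed seed, for repeatability only
scattered = rng.sample(range(nr_gfns), nr_balloon)
colocated = range(nr_balloon)    # one contiguous, aligned run

print(regions_touched(colocated))   # 4
print(regions_touched(scattered))   # somewhere between 4 and 64
```

In the co-located case the four regions are emptied completely, so no
remaining superpage has to be split.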

~Andrew


* Re: Analysis of using balloon page compaction in Xen balloon driver
  2014-10-17 12:30 ` Andrew Cooper
@ 2014-10-17 12:44   ` Ian Campbell
  2014-10-17 13:27   ` Wei Liu
  1 sibling, 0 replies; 7+ messages in thread
From: Ian Campbell @ 2014-10-17 12:44 UTC
  To: Andrew Cooper
  Cc: Wei Liu, Stefano Stabellini, xen-devel, David Vrabel,
	Boris Ostrovsky

On Fri, 2014-10-17 at 13:30 +0100, Andrew Cooper wrote:
> Step 3 is an expensive (especially for Xen, which has far more important
> things to be doing with its time) and which is self-perpetuating because
> of the balloon driver reshattering pages.
> 
> Several factors contribute to shattering host pages.  The ones which
> come to mind are:
> * Differing cacheability from MTRRs
> * Mapping a foreign grant into ones own physical address space
> * Releasing pages back to Xen via the decrease_reservation hypercall
> 
> In addition, other factors complicate Xens ability to move pages.
> * Mappings from other domains (Qemu, PV backends, etc) will pin mfns in
> place
> * Any IOMMU mappings will pin all (mapped) mfns in place.
> 
> As a result, by far the most efficient way of prevent superpage
> fragmentation is to not shattering them in the first place.  This can be
> done by changing the balloon driver in the guest to co-locate all pages
> it decides to balloon, rather than taking individual pages at random
> from the main memory pools.

Right, a necessary part of this work is to try and balloon (both up and
down) in correctly aligned 2M increments, by allocating 2M pages to free
back to Xen.

All the coalescing stuff is a bit secondary (but still necessary) and is
there to mitigate the case when a guest can't find a 2M page to allocate
(so has done some 4K ops instead, in order to meet the targets) and/or
to increase the probability it will be able to allocate one.

Ian.


* Re: Analysis of using balloon page compaction in Xen balloon driver
  2014-10-17 12:30 ` Andrew Cooper
  2014-10-17 12:44   ` Ian Campbell
@ 2014-10-17 13:27   ` Wei Liu
  1 sibling, 0 replies; 7+ messages in thread
From: Wei Liu @ 2014-10-17 13:27 UTC
  To: Andrew Cooper
  Cc: Wei Liu, Ian Campbell, Stefano Stabellini, xen-devel,
	David Vrabel, Boris Ostrovsky

On Fri, Oct 17, 2014 at 01:30:21PM +0100, Andrew Cooper wrote:
> On 16/10/2014 18:12, Wei Liu wrote:
> > This document analyses the impact of using balloon compaction
> > infrastructure in Xen balloon driver.
> 
> This is a fantastic start (and I actively recommend similar documents
> from others for future work).
> 
> > ## Motives
> >
> > 1. Balloon pages fragments guest physical address space.
> > 2. Balloon compaction infrastructure can migrate ballooned pages from
> >    start of zone to end of zone, hence creating contiguous guest physical
> >    address space.
> > 3. Having contiguous guest physical address enables some options to
> >    improve performance.
> >
> > ## Benefit for auto-translated guest
> >
> > HVM/PVH/ARM guest can have contiguous guest physical address space
> > after balloon pages are compacted, which potentially improves memory
> > performance provided guest makes use of huge pages, either via
> > Hugetlbfs or Transparent Huge Page (THP).
> >
> > Consider memory access pattern of these guests, one access to guest
> > physical address involves several accesses to machine memory. The
> > total number of memory accesses can be represented as:
> >
> >> X = H1 * G1 + H2 * G2 + ... + Hn * Gn + 1
> > Hx denotes second stage page table walk levels and Gx denotes guest
> > page table walk levels.
> 
> I don't think this expresses what you intend to express (or I don't
> understand how you are trying to convey it).
> 
> Consider a single memory access in the guest, with no pagefaults, and
> ignoring for now any TLB effects.  This description is based on my
> knowledge of x86, but I assume ARM functions in a similar way.
> 
> Both the guest, G, and host, H, are 64bit, so using 4-level
> translations.  Consider first, the worst case where all mappings are 4K
> pages.
> 
> gcr3 needs following to find gl4.  This involves a complete host
> pagetable walk (4 translations)
> gl4 needs following to find gl3.  This involves a complete host
> pagetable walk (4 translations)
> gl3 to gl2 ...
> gl2 to gl1 ...
> gl1 to gpa ...
> 
> In the worst case, it takes 20 translations for a single guest memory
> access.
> 
> Altering the guest to use a 2MB superpage would alleviate 4 traslations
> 
> Altering the host to use 2MB superpages would alleviate 5 translations.
> 

That's basically what I was trying to express in that formula, only that
I missed gcr3 lookup. I think I wrote the wrong definition of Hx. Hx
should be the number of host page table walks (not levels).  And also
"reduce the number of H's" should be "make individual H smaller".

So we're still on the same page here.

> It should be noted that the host pagetable walks are distinct, so a
> single 2MB superpage could turn only a single host walk from 4
> translations to 3 translations.  A guest can help itself substantially
> by allocating its pagetables contiguously, which would cause multiple
> guest translations to be contained within the same host translation,
> getting much better temporal locality of reference from the TLB.
> 

Yes.

> It should also be noted that Xen is in a far better position to make
> easy use of 2MB and 1GB superpages, than a guest running normal
> workloads is. 

I don't think there's / should be linkage between Xen's knowledge and
guest's knowledge, is there? Guest only sees its own address space, it
can not control what types of pages are backing its address space. So
this is a moot point IMHO.

> However, the ideal case is to make use of both host and
> guest superpages.
> 

Yes. This work is to enable guest to more easily make use of huge pages.

Host part is not included and should be solved separately.

> > By having contiguous guest physical address, guest can make use of
> > huge pages. This can reduce the number of G's in formula.
> >
> > Reducing number of H's is another project for hypervisor side
> > improvement and should be decoupled from Linux side changes.
> >
> > ## Design and implementation
> >
> > The use of balloon compaction doesn't require introducign new
> > interfaces between Xen balloon driver and the rest of the system. Most
> > changes are internal to Xen balloon driver.
> >
> > Currently, Xen balloon driver gets its page directly from page
> > allocator. To enable balloon page migration, those pages now need to
> > be allocated from core balloon driver. Pages allocated from core
> > balloon driver are subject to balloon page compaction.
> >
> > Xen balloon driver will also need to provide a callback to migrate
> > balloon page. In essence callback function receives "old page", which
> > is a already ballooned out page, and "new page", which is a page to be
> > ballooned out, then it inflates "old page" and deflates "new page".
> >
> > The core of migration callback is XENMEM\_exchange hypercall. This
> > makes sure that inflation of old page and deflation of new page is
> > done atomically, so even if a domain is beyond its memory target and
> > being enforced, it can still compact memory.
> >
> > ## HAP table fragmentation is not made worse
> >
> > *Assumption*: guest physical address space is already heavily
> > fragmented by balloon pages when balloon page compaction is required.
> >
> > For a typical test case like ballooning up and down when doing kernel
> > compilation, there's usually only a handful of huge pages left in the
> > end. So the observation matches the assumption. On the other hand, if
> > guest physical address space is not heavily fragmented, it's not
> > likely balloon page compaction will be triggered automatically.
> >
> > In practice, balloon page compaction is not likely to make things
> > worse. Here is the analysis based on the above assumption.
> >
> > Note that HAP table is already shattered by balloon pages. When a
> > guest page is ballooned out, the underlying HAP entry needs to be
> > split should that entry pointed to a huge page.
> >
> > XENMEM\_exchange works as follows, where "old page" is the guest page
> > about to get inflated and "new page" is the guest page about to get
> > deflated:
> >
> > 1. Steal old page from domain.
> > 2. Allocate a heap page from domheap
> > 3. Release new page back to Xen
> > 4. Update guest physmap, old page points to heap page, new page points
> >    to INVALID\_MFN.
> >
> > The end result is that HAP entry for "old page" now points to a valid
> > MFN instead of having INVALID\_MFN; HAP entry for "new page" now points
> > to INVALID\_MFN.
> >
> > So for old page we're in the same position as before. HAP table is
> > fragmented, however it's not more fragmented than before.
> >
> > For new page, the risk is that if the targeted guest new page is part
> > of a huge page, we need to split HAP entry, hence fragmenting HAP
> > table. This is a valid concern. However, in practice, guest address space
> > is already fragmented by ballooning. It's not likely we need to break
> > up any more huge pages, because there aren't that many left. So we're
> > in a position no worse than before.
> >
> > Another downside is that when Xen is exchanging a page, it's possible
> > that Xen may need to break up a huge page to get a 4K page. Xen
> > domheap is fragmented. However we're not getting any worse than before
> > as ballooning already fragments domheap.
> >
> > ## Beyond Linux balloon compaction infrastructure
> >
> > Currently there's no mechanism in Xen to coalesce HAP table
> > entries. To coalesce HAP entries we would need to make sure all
> > discrete entries belong to one huge page, are in correct order and
> > correct state.
> >
> > By introducing necessary infrastructure(s) inside hypervisor (page
> > migration etc.), we might eventually be able to coalesce HAP entries,
> > hence reducing the number of H's in the aforementioned formula. This,
> > combined with the work on guest side, can help guest achieve best
> > possible performance.
> 
> If I understand your proposal correctly, a VM lifetime would look like this:
> 
> 1 Xen allocates pages (hopefully 2MB where possible)
> 2 Guest starts up, and shatters both guest and host superpages by
> blindly ballooning random gfns
> 3a During runtime, guest spends time copying pages around in an attempt
> to coalesce
> 3b (optionally, given Xen support) Xen spends time copying pages around
> in an attempt to coalesce
> 4 Guest reshatters guest and host pages by more ballooning.
> 5 goto 3
> 

Correct.

> Step 3 is expensive (especially for Xen, which has far more important
> things to be doing with its time) and is self-perpetuating because
> of the balloon driver reshattering pages.
> 
> Several factors contribute to shattering host pages.  The ones which
> come to mind are:
> * Differing cacheability from MTRRs
> * Mapping a foreign grant into one's own physical address space

I'm not quite sure about the first one, but the second one is not affected
by balloon compaction: those pages, even when mapped with balloon pages,
are handled separately.

> * Releasing pages back to Xen via the decrease_reservation hypercall
> 

This will affect the Xen heap; however, we're not making it worse. See my
analysis above.
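
To make that analysis concrete, here is a toy Python model of the exchange.
It is only a sketch and not Xen code: `exchange`, `holes`, the dict-based
physmap and the MFN numbers are all made up for illustration. It shows that
an exchange moves a hole from one gfn to another without changing the
number of holes in the guest physmap:

```python
INVALID_MFN = None  # stand-in for Xen's INVALID_MFN in this toy model


def exchange(physmap, old_gfn, new_gfn, free_mfns):
    """Toy model of the exchange: populate old_gfn, balloon out new_gfn.

    physmap   -- dict mapping gfn -> mfn (or INVALID_MFN), hypothetical
    free_mfns -- list standing in for free pages in Xen's domheap
    """
    if physmap[old_gfn] is not INVALID_MFN:
        raise ValueError("old page must already be ballooned out")
    if physmap[new_gfn] is INVALID_MFN:
        raise ValueError("new page must currently be populated")
    heap_page = free_mfns.pop()         # allocate a heap page
    free_mfns.append(physmap[new_gfn])  # release new page back to Xen
    physmap[old_gfn] = heap_page        # old page now points to a valid MFN
    physmap[new_gfn] = INVALID_MFN      # new page now points to INVALID_MFN


def holes(physmap):
    """Number of gfns that currently have no backing MFN."""
    return sum(1 for mfn in physmap.values() if mfn is INVALID_MFN)


# Example: gfn 0 and gfn 3 are ballooned out, gfn 1 and gfn 2 are populated.
physmap = {0: INVALID_MFN, 1: 101, 2: 102, 3: INVALID_MFN}
free_mfns = [200]
before = holes(physmap)
exchange(physmap, old_gfn=0, new_gfn=2, free_mfns=free_mfns)
after = holes(physmap)  # same as before: the hole moved from gfn 0 to gfn 2
```

The model deliberately ignores huge page entries; it only tracks which gfns
are populated, which is the part the "no worse than before" argument relies
on.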

> In addition, other factors complicate Xen's ability to move pages.
> * Mappings from other domains (Qemu, PV backends, etc) will pin mfns in
> place
> * Any IOMMU mappings will pin all (mapped) mfns in place.
> 

Right. But balloon compaction is not making things any worse either.

> As a result, by far the most efficient way of preventing superpage
> fragmentation is to not shatter them in the first place.  This can be
> done by changing the balloon driver in the guest to co-locate all pages
> it decides to balloon, rather than taking individual pages at random
> from the main memory pools.
> 

I understand your concern, and I also understand that the best solution so
far is to avoid shattering in the first place.  However, these points don't
invalidate this work: balloon compaction is not making things any worse,
and with a few tricks inside the Xen balloon driver (as you proposed) it
can make things better.

For your proposal of "changing the balloon driver in the guest to co-locate
all pages it decides to balloon", I can see two upstreamable solutions
at a quick glance:

1. Allocate / release huge pages from / to the hypervisor in the first place.
2. Allocate normal pages, and occasionally swap them for a huge page if
   resources (both in Xen and in the guest) permit.

#1 is not practical in a busy system; it also won't work against a "bad
neighbor".  (I'd be very happy to just change "order" in every page
allocation call if that did what we want.)

#2 requires balloon page compaction, which is exactly what this series is
doing.
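
As a rough illustration of why compaction matters for #2, the toy Python
model below counts how many aligned huge-page-sized regions of guest
physical address space contain no ballooned page, before and after the
ballooned pages are moved to the end of the range. Everything in it is
hypothetical (the function names, the 16-page address space, the 4-page
"huge page"); it says nothing about how the kernel actually migrates pages:

```python
def intact_regions(ballooned, total_pages, pages_per_region):
    """Count aligned regions that contain no ballooned page.

    Assumes total_pages is a multiple of pages_per_region.
    """
    count = 0
    for start in range(0, total_pages, pages_per_region):
        region = range(start, start + pages_per_region)
        if not any(gfn in ballooned for gfn in region):
            count += 1
    return count


def compact(ballooned, total_pages):
    """Model compaction: all ballooned pages end up at the end of the range."""
    n = len(ballooned)
    return set(range(total_pages - n, total_pages))


# Four ballooned pages scattered over a 16-page address space,
# with a 4-page stand-in for a huge page.
scattered = {1, 6, 9, 14}
usable_before = intact_regions(scattered, 16, 4)             # 0 regions
usable_after = intact_regions(compact(scattered, 16), 16, 4)  # 3 regions
```

With the same number of ballooned pages, the scattered layout leaves no
region a huge page could occupy, while the compacted layout leaves three.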

Wei.

> ~Andrew


