* Analysis of using balloon page compaction in Xen balloon driver
@ 2014-10-16 17:12 Wei Liu
2014-10-16 17:46 ` David Vrabel
2014-10-17 12:30 ` Andrew Cooper
0 siblings, 2 replies; 7+ messages in thread
From: Wei Liu @ 2014-10-16 17:12 UTC (permalink / raw)
To: xen-devel
Cc: wei.liu2, Ian Campbell, Andrew Cooper, Stefano Stabellini,
David Vrabel, Boris Ostrovsky
This document analyses the impact of using balloon compaction
infrastructure in Xen balloon driver.
## Motives
1. Balloon pages fragment guest physical address space.
2. Balloon compaction infrastructure can migrate ballooned pages from
start of zone to end of zone, hence creating contiguous guest physical
address space.
3. Having contiguous guest physical address enables some options to
improve performance.
## Benefit for auto-translated guest
HVM/PVH/ARM guest can have contiguous guest physical address space
after balloon pages are compacted, which potentially improves memory
performance provided guest makes use of huge pages, either via
Hugetlbfs or Transparent Huge Page (THP).
Consider memory access pattern of these guests, one access to guest
physical address involves several accesses to machine memory. The
total number of memory accesses can be represented as:
> X = H1 * G1 + H2 * G2 + ... + Hn * Gn + 1
Hx denotes second stage page table walk levels and Gx denotes guest
page table walk levels.
By having contiguous guest physical address, guest can make use of
huge pages. This can reduce the number of G's in formula.
Reducing number of H's is another project for hypervisor side
improvement and should be decoupled from Linux side changes.
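As a rough illustration of the counting, here is a minimal Python sketch. It assumes 4-level page tables on both the guest and the host side, no TLB hits, and that every guest table pointer plus the final guest physical address needs a full second stage walk. The function name is made up for this example, and it counts only second stage translation steps, which is a simplification rather than a restatement of the formula above.

```python
def second_stage_translations(guest_levels: int, host_levels: int) -> int:
    """Count second stage translation steps for one guest memory access.

    Assumption: the guest root pointer, each lower guest table pointer and
    the final guest physical address each trigger one full host walk, and
    nothing is served from the TLB.
    """
    # One host walk per guest level, plus one for the final address.
    host_walks = guest_levels + 1
    return host_walks * host_levels


all_4k = second_stage_translations(guest_levels=4, host_levels=4)
guest_2m = second_stage_translations(guest_levels=3, host_levels=4)
host_2m = second_stage_translations(guest_levels=4, host_levels=3)
```

Under these assumptions a guest 2MB mapping removes one whole host walk, while host 2MB mappings shorten every host walk by one step.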
## Design and implementation
The use of balloon compaction doesn't require introducing new
interfaces between Xen balloon driver and the rest of the system. Most
changes are internal to Xen balloon driver.
Currently, Xen balloon driver gets its page directly from page
allocator. To enable balloon page migration, those pages now need to
be allocated from core balloon driver. Pages allocated from core
balloon driver are subject to balloon page compaction.
Xen balloon driver will also need to provide a callback to migrate
balloon page. In essence the callback function receives "old page", which
is an already ballooned out page, and "new page", which is a page to be
ballooned out, then it inflates "old page" and deflates "new page".
The core of migration callback is XENMEM\_exchange hypercall. This
makes sure that inflation of old page and deflation of new page is
done atomically, so even if a domain is beyond its memory target and
being enforced, it can still compact memory.
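To make the point about atomicity concrete, here is a toy Python model. It is only a sketch: `ToyDomain`, its methods and the page limit are hypothetical and are not the Linux or Xen interfaces. It assumes the limit is enforced on every separate populate request, and shows that a combined exchange keeps the page count constant where a populate-then-release sequence would be refused.

```python
class ToyDomain:
    """Toy model of a domain's populated pages under an enforced limit."""

    def __init__(self, populated, limit):
        self.populated = set(populated)  # gfns currently backed by memory
        self.limit = limit               # enforced maximum number of pages

    def populate(self, gfn):
        # Separate populate step: refused when the domain is at its limit.
        if len(self.populated) >= self.limit:
            raise MemoryError("domain is at its enforced limit")
        self.populated.add(gfn)

    def exchange(self, old_gfn, new_gfn):
        # Combined step: old_gfn gains backing and new_gfn loses it together,
        # so the page count never rises above its starting value.
        if old_gfn in self.populated or new_gfn not in self.populated:
            raise ValueError("old page must be ballooned out, new page populated")
        self.populated.remove(new_gfn)
        self.populated.add(old_gfn)


dom = ToyDomain(populated={1, 2, 3}, limit=3)  # gfn 0 is ballooned out
try:
    dom.populate(0)            # two-step approach: populate first
    two_step_ok = True
except MemoryError:
    two_step_ok = False
dom.exchange(old_gfn=0, new_gfn=3)  # one-step approach still works
```

The balloon ends up at gfn 3 instead of gfn 0 and the number of populated pages is unchanged throughout.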
## HAP table fragmentation is not made worse
*Assumption*: guest physical address space is already heavily
fragmented by balloon pages when balloon page compaction is required.
For a typical test case like ballooning up and down when doing kernel
compilation, there's usually only a handful of huge pages left in the
end. So the observation matches the assumption. On the other hand, if
guest physical address space is not heavily fragmented, it's not
likely balloon page compaction will be triggered automatically.
In practice, balloon page compaction is not likely to make things
worse. Here is the analysis based on the above assumption.
Note that HAP table is already shattered by balloon pages. When a
guest page is ballooned out, the underlying HAP entry needs to be
split should that entry point to a huge page.
XENMEM\_exchange works as follows, where "old page" is the guest page
about to get inflated and "new page" is the guest page about to get
deflated:
1. Steal old page from domain.
2. Allocate a heap page from domheap
3. Release new page back to Xen
4. Update guest physmap, old page points to heap page, new page points
to INVALID\_MFN.
The end result is that HAP entry for "old page" now points to a valid
MFN instead of having INVALID\_MFN; HAP entry for "new page" now points
to INVALID\_MFN.
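The end state described above can be sketched with a toy physmap in Python. This is an illustration only: `toy_exchange`, the dictionary used as a p2m and the list used as a domheap are assumptions for the example and do not reflect the hypervisor's data structures. It tracks just the allocation, release and physmap update.

```python
INVALID_MFN = -1  # stand-in value for this sketch only


def toy_exchange(p2m, domheap, old_gfn, new_gfn):
    """Swap backing between a ballooned-out gfn and a populated gfn.

    p2m maps gfn -> mfn (or INVALID_MFN); domheap is a list of free mfns.
    """
    if p2m[old_gfn] != INVALID_MFN or p2m[new_gfn] == INVALID_MFN:
        raise ValueError("old page must be unbacked and new page backed")
    heap_mfn = domheap.pop()      # allocate a heap page from the free pool
    domheap.append(p2m[new_gfn])  # release the new page's frame to the pool
    p2m[old_gfn] = heap_mfn       # old page now points to a valid mfn
    p2m[new_gfn] = INVALID_MFN    # new page becomes the balloon page


p2m = {0: INVALID_MFN, 1: 100, 2: 101}
domheap = [200]
toy_exchange(p2m, domheap, old_gfn=0, new_gfn=2)
```

The number of valid entries in the toy physmap is the same before and after; only their positions change.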
So for old page we're in the same position as before. HAP table is
fragmented, however it's not more fragmented than before.
For new page, the risk is that if the targeted guest new page is part
of a huge page, we need to split HAP entry, hence fragmenting HAP
table. This is a valid concern. However in practice, guest address space
is already fragmented by ballooning. It's not likely we need to break
up any more huge pages, because there aren't that many left. So we're
in a position no worse than before.
Another downside is that when Xen is exchanging a page, it's possible
that Xen may need to break up a huge page to get a 4K page. Xen
domheap is fragmented. However we're not getting any worse than before
as ballooning already fragments domheap.
## Beyond Linux balloon compaction infrastructure
Currently there's no mechanism in Xen to coalesce HAP table
entries. To coalesce HAP entries we would need to make sure all
discrete entries belong to one huge page, are in correct order and
correct state.
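A minimal Python sketch of such a check follows, assuming a 2M huge page made of 512 4K entries. The `(mfn, state)` tuple layout, the `"ram"` state label and the function name are hypothetical and are not taken from Xen.

```python
ENTRIES_PER_2M = 512  # number of 4K entries covered by one 2M mapping


def can_coalesce(entries):
    """Return True if 4K entries of one aligned 2M region could be merged.

    entries is a list of (mfn, state) tuples in guest address order, with
    mfn set to None for a hole.
    """
    if len(entries) != ENTRIES_PER_2M:
        return False
    base_mfn, base_state = entries[0]
    # The backing memory must itself start on a 2M boundary.
    if base_mfn is None or base_mfn % ENTRIES_PER_2M != 0:
        return False
    # Every entry must be present, in order, and in the same state.
    return all(
        mfn == base_mfn + i and state == base_state
        for i, (mfn, state) in enumerate(entries)
    )


good = [(1024 + i, "ram") for i in range(ENTRIES_PER_2M)]
holed = list(good)
holed[7] = (None, "ram")                          # one ballooned-out page
swapped = list(good)
swapped[1], swapped[2] = swapped[2], swapped[1]   # wrong order
```

A single hole or a single out-of-order entry is enough to block the merge, which is why ballooned pages scattered across the address space matter.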
By introducing necessary infrastructure(s) inside hypervisor (page
migration etc.), we might eventually be able to coalesce HAP entries,
hence reducing the number of H's in the aforementioned formula. This,
combined with the work on guest side, can help guest achieve best
possible performance.
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Analysis of using balloon page compaction in Xen balloon driver
2014-10-16 17:12 Analysis of using balloon page compaction in Xen balloon driver Wei Liu
@ 2014-10-16 17:46 ` David Vrabel
2014-10-17 8:20 ` Ian Campbell
2014-10-17 12:30 ` Andrew Cooper
1 sibling, 1 reply; 7+ messages in thread
From: David Vrabel @ 2014-10-16 17:46 UTC (permalink / raw)
To: Wei Liu, xen-devel
Cc: Andrew Cooper, Boris Ostrovsky, Stefano Stabellini, Ian Campbell
On 16/10/14 18:12, Wei Liu wrote:
> This document analyses the impact of using balloon compaction
> infrastructure in Xen balloon driver.

Thanks for writing this. This is an excellent starting point for a
productive design discussion.

> ## Benefit for auto-translated guest
>
> HVM/PVH/ARM guest can have contiguous guest physical address space
> after balloon pages are compacted, which potentially improves memory
> performance provided guest makes use of huge pages, either via
> Hugetlbfs or Transparent Huge Page (THP).
>
> Consider memory access pattern of these guests, one access to guest
> physical address involves several accesses to machine memory. The
> total number of memory accesses can be represented as:
>
>> X = H1 * G1 + H2 * G2 + ... + Hn * Gn + 1
>
> Hx denotes second stage page table walk levels and Gx denotes guest
> page table walk levels.
>
> By having contiguous guest physical address, guest can make use of
> huge pages. This can reduce the number of G's in formula.
>
> Reducing number of H's is another project for hypervisor side
> improvement and should be decoupled from Linux side changes.

Whilst this analysis is fine, I don't think this is the real benefit of
using superpages, which is reducing TLB usage and reducing the number of
TLB misses.

With fragmented stage 2 tables I don't think you will see much
improvement in TLB usage.

> ## HAP table fragmentation is not made worse

This reasoning looks reasonable to me. But it suggests that the balloon
compaction isn't doing its job properly. It seems like it should be
much more proactive in resolving fragmentation.

> ## Beyond Linux balloon compaction infrastructure
>
> Currently there's no mechanism in Xen to coalesce HAP table
> entries. To coalesce HAP entries we would need to make sure all
> discrete entries belong to one huge page, are in correct order and
> correct state.

I would like to see a more detailed description of the Xen-side
solution. So we can be sure the Linux half is compatible with it.

Before accepting any series I would also need to see real world
performance improvements, not just theoretical ones.

David
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Analysis of using balloon page compaction in Xen balloon driver
2014-10-16 17:46 ` David Vrabel
@ 2014-10-17 8:20 ` Ian Campbell
2014-10-17 10:07 ` David Vrabel
0 siblings, 1 reply; 7+ messages in thread
From: Ian Campbell @ 2014-10-17 8:20 UTC (permalink / raw)
To: David Vrabel
Cc: Wei Liu, Andrew Cooper, Stefano Stabellini, xen-devel,
Boris Ostrovsky
On Thu, 2014-10-16 at 18:46 +0100, David Vrabel wrote:
> On 16/10/14 18:12, Wei Liu wrote:
> > This document analyses the impact of using balloon compaction
> > infrastructure in Xen balloon driver.
>
> Thanks for writing this. This is a excellent starting point for a
> productive design discussion.
>
> > ## Benefit for auto-translated guest
> >
> > HVM/PVH/ARM guest can have contiguous guest physical address space
> > after balloon pages are compacted, which potentially improves memory
> > performance provided guest makes use of huge pages, either via
> > Hugetlbfs or Transparent Huge Page (THP).
> >
> > Consider memory access pattern of these guests, one access to guest
> > physical address involves several accesses to machine memory. The
> > total number of memory accesses can be represented as:
> >
> >> X = H1 * G1 + H2 * G2 + ... + Hn * Gn + 1
> >
> > Hx denotes second stage page table walk levels and Gx denotes guest
> > page table walk levels.
> >
> > By having contiguous guest physical address, guest can make use of
> > huge pages. This can reduce the number of G's in formula.
> >
> > Reducing number of H's is another project for hypervisor side
> > improvement and should be decoupled from Linux side changes.
>
> Whilst this analysis is fine, I don't think this is the real benefit of
> using superpages which is reducing TLB usage and reducing the number of
> TLB misses.

It depends a bit on whether the TLB caches partial walks etc, but more
importantly using super pages reduces the cost of a TLB miss, by
requiring fewer memory accesses on the walk.

> With fragmented stage 2 tables I don't think you will see see much
> improvement in TLB usage.

I think you will save some, by cutting off a level of the stage one
pages you avoid the need for doing a stage 2 walk at that level, which
might be 3-4 levels of lookup. I expect it is not as significant a
benefit as stage 2 superpages (which saves you accesses at every level
of the stage 1 walk), but it will be there.

> > ## HAP table fragmentation is not made worse
>
> This reasoning looks reasonable to me. But it suggests that the balloon
> compaction isn't doing it's job properly. It seems like it should be
> much more proactive in resolving fragmentation.

Speaking to some KVM folks here at plumbers it seems they find
compaction to be working pretty well for them, but there are some proc
knobs one has to twiddle to make it more aggressive. (sadly there are
none of them here right now so I can't ask for more pointers to said
knobs).

> > ## Beyond Linux balloon compaction infrastructure
> >
> > Currently there's no mechanism in Xen to coalesce HAP table
> > entries. To coalesce HAP entries we would need to make sure all
> > discrete entries belong to one huge page, are in correct order and
> > correct state.
>
> I would like to see a more detailed description of the Xen-side
> solution. So we can be sure the Linux half is compatible with it.

I believe Dario has hacked up some prototypes (on x86) at some point.
But I don't believe there will be terribly much linkage between the
guest and hypervisor half beyond having the guest side arrange for as
many 2MB slots as possible to be completely populated such that the
hypervisor has opportunities to do compaction.

The compaction (both guest and Xen side) is not the first order issue
here though. The first thing we should be doing is to be trying to
balloon up and down in 2M increments in the first place wherever
possible. That includes things like alloc_xenballooned_pages operating
in 2M increments under the hood such that things like grant maps which
require 4K p2m entries are condensed into the least number of 2M
regions possible, IOW having been forced to fragment a region use it
for as many other 4K mappings as possible.

However we know that we are not always going to be able to allocate a
2M page to balloon out, and we need to be prepared to mitigate this,
which is where compaction comes in.

Compaction on the guest side serves two purposes immediately even
without hypervisor side compaction:

Firstly it increases the chances of being able to allocate a 2M page
when required to balloon one out, either right now or at some point in
the future, IOW it helps towards the goal of doing as much ballooning
as possible in 2M chunks.

Secondly it means that we will end up with contiguous 2M holes which
will give the opportunity for future balloon operations to end up with
2M mappings, this is useful in its own right even if it is neutral wrt
the fragmentation of the populated 2M regions right now (and we know it
can't make things worse in that regard).

I think it is important to realise that this is an independently useful
change which is also a prerequisite for some interesting future work. I
think blocking this work now pending the completion of that future
interesting work is unreasonable.

> Before accepting any series I would also need to see real world
> performance improvements, not just theoretical ones.

I think the interesting statistics here will be:
* The numbers of 4K and 2M mappings used by the domain's p2m (since
  it is well established that 2M mappings improve performance in
  multiple workloads on multiple architectures, there is no need to
  reproduce that result yet again IMHO).
* The numbers of completely depopulated 2M regions, which represent
  the potential for improved mappings when ballooning back up.
* The numbers of completely populated 2M regions in the p2m, which
  represent opportunities for the hypervisor to make further
  improvements *in the future*.
* The numbers of 2M regions which consist either solely of 4K
  mappings of RAM + holes or solely "special" 4K mappings (grant
  mappings etc) + holes.

Those last three are somewhat complementary I think.

The ARM p2m tracks the numbers of each size of mapping for a given
domain. The number of holes/full regions is not tracked but a debug
hypercall or console keyhandler could quite easily scan for them, based
on looking at the p2m type associated with each entry.

Ian.
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Analysis of using balloon page compaction in Xen balloon driver
2014-10-17 8:20 ` Ian Campbell
@ 2014-10-17 10:07 ` David Vrabel
0 siblings, 0 replies; 7+ messages in thread
From: David Vrabel @ 2014-10-17 10:07 UTC (permalink / raw)
To: Ian Campbell
Cc: Wei Liu, Andrew Cooper, Stefano Stabellini, xen-devel,
Boris Ostrovsky
On 17/10/14 09:20, Ian Campbell wrote:
>
> I think it is important to realise that this is an independently useful
> change which is also a prerequisite for some interesting future work. I
> think blocking this work now pending the completion of that future
> interesting work is unreasonable.

I don't think requiring some initial design work for the complete
solution (including the hypervisor side compaction) is unreasonable.
This should even be straightforward given the prototyping that Dario
has already done.

David
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Analysis of using balloon page compaction in Xen balloon driver
2014-10-16 17:12 Analysis of using balloon page compaction in Xen balloon driver Wei Liu
2014-10-16 17:46 ` David Vrabel
@ 2014-10-17 12:30 ` Andrew Cooper
2014-10-17 12:44 ` Ian Campbell
2014-10-17 13:27 ` Wei Liu
1 sibling, 2 replies; 7+ messages in thread
From: Andrew Cooper @ 2014-10-17 12:30 UTC (permalink / raw)
To: Wei Liu, xen-devel
Cc: Boris Ostrovsky, Ian Campbell, Stefano Stabellini, David Vrabel
On 16/10/2014 18:12, Wei Liu wrote:
> This document analyses the impact of using balloon compaction
> infrastructure in Xen balloon driver.

This is a fantastic start (and I actively recommend similar documents
from others for future work).

> ## Motives
>
> 1. Balloon pages fragments guest physical address space.
> 2. Balloon compaction infrastructure can migrate ballooned pages from
> start of zone to end of zone, hence creating contiguous guest physical
> address space.
> 3. Having contiguous guest physical address enables some options to
> improve performance.
>
> ## Benefit for auto-translated guest
>
> HVM/PVH/ARM guest can have contiguous guest physical address space
> after balloon pages are compacted, which potentially improves memory
> performance provided guest makes use of huge pages, either via
> Hugetlbfs or Transparent Huge Page (THP).
>
> Consider memory access pattern of these guests, one access to guest
> physical address involves several accesses to machine memory. The
> total number of memory accesses can be represented as:
>
>> X = H1 * G1 + H2 * G2 + ... + Hn * Gn + 1
> Hx denotes second stage page table walk levels and Gx denotes guest
> page table walk levels.

I don't think this expresses what you intend to express (or I don't
understand how you are trying to convey it).

Consider a single memory access in the guest, with no pagefaults, and
ignoring for now any TLB effects. This description is based on my
knowledge of x86, but I assume ARM functions in a similar way.

Both the guest, G, and host, H, are 64bit, so using 4-level
translations. Consider first, the worst case where all mappings are 4K
pages.

gcr3 needs following to find gl4. This involves a complete host
pagetable walk (4 translations)
gl4 needs following to find gl3. This involves a complete host
pagetable walk (4 translations)
gl3 to gl2 ...
gl2 to gl1 ...
gl1 to gpa ...

In the worst case, it takes 20 translations for a single guest memory
access.

Altering the guest to use a 2MB superpage would alleviate 4 translations.

Altering the host to use 2MB superpages would alleviate 5 translations.

It should be noted that the host pagetable walks are distinct, so a
single 2MB superpage could turn only a single host walk from 4
translations to 3 translations. A guest can help itself substantially
by allocating its pagetables contiguously, which would cause multiple
guest translations to be contained within the same host translation,
getting much better temporal locality of reference from the TLB.

It should also be noted that Xen is in a far better position to make
easy use of 2MB and 1GB superpages, than a guest running normal
workloads is. However, the ideal case is to make use of both host and
guest superpages.

> By having contiguous guest physical address, guest can make use of
> huge pages. This can reduce the number of G's in formula.
>
> Reducing number of H's is another project for hypervisor side
> improvement and should be decoupled from Linux side changes.
>
> ## Design and implementation
>
> The use of balloon compaction doesn't require introducign new
> interfaces between Xen balloon driver and the rest of the system. Most
> changes are internal to Xen balloon driver.
>
> Currently, Xen balloon driver gets its page directly from page
> allocator. To enable balloon page migration, those pages now need to
> be allocated from core balloon driver. Pages allocated from core
> balloon driver are subject to balloon page compaction.
>
> Xen balloon driver will also need to provide a callback to migrate
> balloon page. In essence callback function receives "old page", which
> is a already ballooned out page, and "new page", which is a page to be
> ballooned out, then it inflates "old page" and deflates "new page".
>
> The core of migration callback is XENMEM\_exchange hypercall. This
> makes sure that inflation of old page and deflation of new page is
> done atomically, so even if a domain is beyond its memory target and
> being enforced, it can still compact memory.
>
> ## HAP table fragmentation is not made worse
>
> *Assumption*: guest physical address space is already heavily
> fragmented by balloon pages when balloon page compaction is required.
>
> For a typical test case like ballooning up and down when doing kernel
> compilation, there's usually only a handful huge pages left in the
> end. So the observation matches the assumption. On the other hand, if
> guest physical address space is not heavily fragmented, it's not
> likely balloon page compaction will be triggered automatically.
>
> In practice, balloon page compaction is not likely to make things
> worse. Here is the analysis based on the above assumption.
>
> Note that HAP table is already shattered by balloon pages. When a
> guest page is ballooned out, the underlying HAP entry needs to be
> split should that entry pointed to a huge page.
>
> XENMEM\_exchange works as followed, "old page" is the guest page about
> to get inflated and "new page" is the guest page about to get
> deflated. It works like this:
>
> 1. Steal old page from domain.
> 2. Allocate a heap page from domheap
> 3. Release new page back to Xen
> 4. Update guest physmap, old page points to heap page, new page points
> to INVALID\_MFN.
>
> The end result is that HAP entry for "old page" now points to a valid
> MFN instead of having INVALID\_MFN; HAP entry for "new page" now points
> to INVALID\_MFN.
>
> So for old page we're in the same position as before. HAP table is
> fragmented, however it's not more fragmented than before.
>
> For new page, the risk is that if the targeting guest new page is part
> of a huge page, we need to split HAP entry, hence fragmenting HAP
> table. This is valid concern. However in practice, guest address space
> is already fragmented by ballooning. It's not likely we need to break
> up any more huge pages, because there aren't that many left. So we're
> in a position no worse than before.
>
> Another downside is that when Xen is exchanging a page, it's possible
> that Xen may need to break up a huge page to get a 4K page. Xen
> domheap is fragmented. However we're not getting any worse than before
> as ballooning already fragments domheap.
>
> ## Beyond Linux balloon compaction infrastructure
>
> Currently there's no mechanism in Xen to coalesce HAP table
> entries. To coalesce HAP entries we would need to make sure all
> discrete entries belong to one huge page, are in correct order and
> correct state.
>
> By introducing necessary infrastructure(s) inside hypervisor (page
> migration etc.), we might eventually be able to coalesce HAP entries,
> hence reducing the number of H's in the aforementioned formula. This,
> combined with the work on guest side, can help guest achieve best
> possible performance.

If I understand your proposal correctly, a VM lifetime would look like this:

1 Xen allocates pages (hopefully 2MB where possible)
2 Guest starts up, and shatters both guest and host superpages by
  blindly ballooning random gfns
3a During runtime, guest spends time copying pages around in an attempt
  to coalesce
3b (optionally, given Xen support) Xen spends time copying pages around
  in an attempt to coalesce
4 Guest reshatters guest and host pages by more ballooning.
5 goto 3

Step 3 is expensive (especially for Xen, which has far more important
things to be doing with its time) and is self-perpetuating because of
the balloon driver reshattering pages.

Several factors contribute to shattering host pages. The ones which
come to mind are:
* Differing cacheability from MTRRs
* Mapping a foreign grant into one's own physical address space
* Releasing pages back to Xen via the decrease_reservation hypercall

In addition, other factors complicate Xen's ability to move pages.
* Mappings from other domains (Qemu, PV backends, etc) will pin mfns in
  place
* Any IOMMU mappings will pin all (mapped) mfns in place.

As a result, by far the most efficient way of preventing superpage
fragmentation is to not shatter them in the first place. This can be
done by changing the balloon driver in the guest to co-locate all pages
it decides to balloon, rather than taking individual pages at random
from the main memory pools.

~Andrew
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Analysis of using balloon page compaction in Xen balloon driver
2014-10-17 12:30 ` Andrew Cooper
@ 2014-10-17 12:44 ` Ian Campbell
2014-10-17 13:27 ` Wei Liu
1 sibling, 0 replies; 7+ messages in thread
From: Ian Campbell @ 2014-10-17 12:44 UTC (permalink / raw)
To: Andrew Cooper
Cc: Wei Liu, Stefano Stabellini, xen-devel, David Vrabel,
Boris Ostrovsky
On Fri, 2014-10-17 at 13:30 +0100, Andrew Cooper wrote:
> Step 3 is an expensive (especially for Xen, which has far more important
> things to be doing with its time) and which is self-perpetuating because
> of the balloon driver reshattering pages.
>
> Several factors contribute to shattering host pages. The ones which
> come to mind are:
> * Differing cacheability from MTRRs
> * Mapping a foreign grant into ones own physical address space
> * Releasing pages back to Xen via the decrease_reservation hypercall
>
> In addition, other factors complicate Xens ability to move pages.
> * Mappings from other domains (Qemu, PV backends, etc) will pin mfns in
> place
> * Any IOMMU mappings will pin all (mapped) mfns in place.
>
> As a result, by far the most efficient way of prevent superpage
> fragmentation is to not shattering them in the first place. This can be
> done by changing the balloon driver in the guest to co-locate all pages
> it decides to balloon, rather than taking individual pages at random
> from the main memory pools.

Right, a necessary part of this work is to try and balloon (both up and
down) in correctly aligned 2M increments, by allocating 2M pages to
free back to Xen.

All the coalescing stuff is a bit secondary (but still necessary) and
is there to mitigate the case when a guest can't find a 2M page to
allocate (so has done some 4K ops instead, in order to meet the
targets) and/or to increase the probability it will be able to allocate
one.

Ian.
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Analysis of using balloon page compaction in Xen balloon driver 2014-10-17 12:30 ` Andrew Cooper 2014-10-17 12:44 ` Ian Campbell @ 2014-10-17 13:27 ` Wei Liu 1 sibling, 0 replies; 7+ messages in thread From: Wei Liu @ 2014-10-17 13:27 UTC (permalink / raw) To: Andrew Cooper Cc: Wei Liu, Ian Campbell, Stefano Stabellini, xen-devel, David Vrabel, Boris Ostrovsky On Fri, Oct 17, 2014 at 01:30:21PM +0100, Andrew Cooper wrote: > On 16/10/2014 18:12, Wei Liu wrote: > > This document analyses the impact of using balloon compaction > > infrastructure in Xen balloon driver. > > This is a fantastic start (and I actively recommend similar documents > from others for future work). > > > ## Motives > > > > 1. Balloon pages fragments guest physical address space. > > 2. Balloon compaction infrastructure can migrate ballooned pages from > > start of zone to end of zone, hence creating contiguous guest physical > > address space. > > 3. Having contiguous guest physical address enables some options to > > improve performance. > > > > ## Benefit for auto-translated guest > > > > HVM/PVH/ARM guest can have contiguous guest physical address space > > after balloon pages are compacted, which potentially improves memory > > performance provided guest makes use of huge pages, either via > > Hugetlbfs or Transparent Huge Page (THP). > > > > Consider memory access pattern of these guests, one access to guest > > physical address involves several accesses to machine memory. The > > total number of memory accesses can be represented as: > > > >> X = H1 * G1 + H2 * G2 + ... + Hn * Gn + 1 > > Hx denotes second stage page table walk levels and Gx denotes guest > > page table walk levels. > > I don't think this expresses what you intend to express (or I don't > understand how you are trying to convey it). > > Consider a single memory access in the guest, with no pagefaults, and > ignoring for now any TLB effects. 
This description is based on my > knowledge of x86, but I assume ARM functions in a similar way. > > Both the guest, G, and host, H, are 64bit, so using 4-level > translations. Consider first, the worst case where all mappings are 4K > pages. > > gcr3 needs following to find gl4. This involves a complete host > pagetable walk (4 translations) > gl4 needs following to find gl3. This involves a complete host > pagetable walk (4 translations) > gl3 to gl2 ... > gl2 to gl1 ... > gl1 to gpa ... > > In the worst case, it takes 20 translations for a single guest memory > access. > > Altering the guest to use a 2MB superpage would alleviate 4 traslations > > Altering the host to use 2MB superpages would alleviate 5 translations. > That's basically what I was trying to express in that formula, only that I missed gcr3 lookup. I think I wrote the wrong definition of Hx. Hx should be the number of host page table walks (not levels). And also "reduce the number of H's" should be "make individual H smaller". So we're still on the same page here. > It should be noted that the host pagetable walks are distinct, so a > single 2MB superpage could turn only a single host walk from 4 > translations to 3 translations. A guest can help itself substantially > by allocating its pagetables contiguously, which would cause multiple > guest translations to be contained within the same host translation, > getting much better temporal locality of reference from the TLB. > Yes. > It should also be noted that Xen is in a far better position to make > easy use of 2MB and 1GB superpages, than a guest running normal > workloads is. I don't think there's / should be linkage between Xen's knowledge and guest's knowledge, is there? Guest only sees its own address space, it can not control what types of pages are backing its address space. So this is a moot point IMHO. > However, the ideal case is to make use of both host and > guest superpages. > Yes. 
This work is to enable guest to more easily make use of huge pages. Host part is not included and should be solved separately. > > By having contiguous guest physical address, guest can make use of > > huge pages. This can reduce the number of G's in formula. > > > > Reducing number of H's is another project for hypervisor side > > improvement and should be decoupled from Linux side changes. > > > > ## Design and implementation > > > > The use of balloon compaction doesn't require introducign new > > interfaces between Xen balloon driver and the rest of the system. Most > > changes are internal to Xen balloon driver. > > > > Currently, Xen balloon driver gets its page directly from page > > allocator. To enable balloon page migration, those pages now need to > > be allocated from core balloon driver. Pages allocated from core > > balloon driver are subject to balloon page compaction. > > > > Xen balloon driver will also need to provide a callback to migrate > > balloon page. In essence callback function receives "old page", which > > is a already ballooned out page, and "new page", which is a page to be > > ballooned out, then it inflates "old page" and deflates "new page". > > > > The core of migration callback is XENMEM\_exchange hypercall. This > > makes sure that inflation of old page and deflation of new page is > > done atomically, so even if a domain is beyond its memory target and > > being enforced, it can still compact memory. > > > > ## HAP table fragmentation is not made worse > > > > *Assumption*: guest physical address space is already heavily > > fragmented by balloon pages when balloon page compaction is required. > > > > For a typical test case like ballooning up and down when doing kernel > > compilation, there's usually only a handful huge pages left in the > > end. So the observation matches the assumption. 
On the other hand, if > > guest physical address space is not heavily fragmented, it's not > > likely balloon page compaction will be triggered automatically. > > > > In practice, balloon page compaction is not likely to make things > > worse. Here is the analysis based on the above assumption. > > > > Note that HAP table is already shattered by balloon pages. When a > > guest page is ballooned out, the underlying HAP entry needs to be > > split should that entry pointed to a huge page. > > > > XENMEM\_exchange works as followed, "old page" is the guest page about > > to get inflated and "new page" is the guest page about to get > > deflated. It works like this: > > > > 1. Steal old page from domain. > > 2. Allocate a heap page from domheap > > 3. Release new page back to Xen > > 4. Update guest physmap, old page points to heap page, new page points > > to INVALID\_MFN. > > > > The end result is that HAP entry for "old page" now points to a valid > > MFN instead of having INVALID\_MFN; HAP entry for "new page" now points > > to INVALID\_MFN. > > > > So for old page we're in the same position as before. HAP table is > > fragmented, however it's not more fragmented than before. > > > > For new page, the risk is that if the targeting guest new page is part > > of a huge page, we need to split HAP entry, hence fragmenting HAP > > table. This is valid concern. However in practice, guest address space > > is already fragmented by ballooning. It's not likely we need to break > > up any more huge pages, because there aren't that many left. So we're > > in a position no worse than before. > > > > Another downside is that when Xen is exchanging a page, it's possible > > that Xen may need to break up a huge page to get a 4K page. Xen > > domheap is fragmented. However we're not getting any worse than before > > as ballooning already fragments domheap. 
> >
> > ## Beyond Linux balloon compaction infrastructure
> >
> > Currently there's no mechanism in Xen to coalesce HAP table
> > entries. To coalesce HAP entries we would need to make sure all
> > discrete entries belong to one huge page, are in correct order and
> > correct state.
> >
> > By introducing necessary infrastructure(s) inside hypervisor (page
> > migration etc.), we might eventually be able to coalesce HAP entries,
> > hence reducing the number of H's in the aforementioned formula. This,
> > combined with the work on guest side, can help guest achieve best
> > possible performance.
>
> If I understand your proposal correctly, a VM lifetime would look like this:
>
> 1 Xen allocates pages (hopefully 2MB where possible)
> 2 Guest starts up, and shatters both guest and host superpages by
>   blindly ballooning random gfns
> 3a During runtime, guest spends time copying pages around in an attempt
>    to coalesce
> 3b (optionally, given Xen support) Xen spends time copying pages around
>    in an attempt to coalesce
> 4 Guest reshatters guest and host pages by more ballooning.
> 5 goto 3
>

Correct.

> Step 3 is expensive (especially for Xen, which has far more important
> things to be doing with its time) and is self-perpetuating because
> of the balloon driver reshattering pages.
>
> Several factors contribute to shattering host pages. The ones which
> come to mind are:
> * Differing cacheability from MTRRs
> * Mapping a foreign grant into one's own physical address space

Not quite sure about the first one but the second one is not affected by
balloon compaction as those pages, even mapped with balloon pages, are
handled separately.

> * Releasing pages back to Xen via the decrease_reservation hypercall
>

This will affect Xen heap, however we're not making it worse. See my
analysis above.

> In addition, other factors complicate Xen's ability to move pages.
> * Mappings from other domains (Qemu, PV backends, etc) will pin mfns in
>   place
> * Any IOMMU mappings will pin all (mapped) mfns in place.
>

Right. But balloon compaction is not making things any worse either.

> As a result, by far the most efficient way of preventing superpage
> fragmentation is to not shatter them in the first place. This can be
> done by changing the balloon driver in the guest to co-locate all pages
> it decides to balloon, rather than taking individual pages at random
> from the main memory pools.
>

I understand your concern, I also understand the best solution so far is
to avoid shattering in the first place. However these points won't
invalidate this work, because balloon compaction is not making things
any worse, and with a few tricks inside Xen balloon driver (as you
proposed) can make things better.

For your proposal of "changing the balloon driver in the guest to
co-locate all pages it decides to balloon", I can see two upstreamable
solutions at a quick glance:

1. Allocate / release huge page from / to HV in the first place.
2. Allocate normal page, and occasionally swap them with huge page if
   resources (both in Xen and guest) permitted.

#1 is not practical in a busy system, it also won't work against "bad
neighbor". (I'm very happy to just change "order" in every page
allocation call if that would do what we want)

#2 requires balloon page compaction, which is exactly what the series is
doing.

Wei.

> ~Andrew
Thread overview: 7+ messages
2014-10-16 17:12 Analysis of using balloon page compaction in Xen balloon driver Wei Liu
2014-10-16 17:46 ` David Vrabel
2014-10-17 8:20 ` Ian Campbell
2014-10-17 10:07 ` David Vrabel
2014-10-17 12:30 ` Andrew Cooper
2014-10-17 12:44 ` Ian Campbell
2014-10-17 13:27 ` Wei Liu