Xen balloon driver improvement (version 1)

* Xen balloon driver improvement (version 1)
@ 2014-10-22 16:29 Wei Liu
  2014-10-22 17:32 ` Andrew Cooper
                   ` (4 more replies)
  0 siblings, 5 replies; 24+ messages in thread
From: Wei Liu @ 2014-10-22 16:29 UTC (permalink / raw)
  To: xen-devel
  Cc: wei.liu2, Ian Campbell, Stefano Stabellini, Andrew Cooper,
	David Vrabel, Boris Ostrovsky

Hi all

This is my initial design to improve the Xen balloon driver.

PDF version with graphs can be found at

http://xenbits.xen.org/people/liuw/xen-balloon-driver-improvement.pdf

% Xen Balloon Driver Improvement
% Wei Liu <<wei.liu2@citrix.com>>

-------------------------------------------
Version     Date         Changes
-------     ----         ------------------
  1         22/10/2014   Initial version.
-------------------------------------------

## Motives

1. Balloon pages fragment the guest physical address space.
2. The balloon compaction infrastructure can migrate ballooned pages from
   the start of a zone to the end of the zone, hence creating contiguous
   guest physical address space.
3. Having a contiguous guest physical address space enables some options
   to improve performance.

## Goal of improvement

The balloon driver makes use of as many huge pages as possible,
defragmenting both the guest address space and Xen pages. This should be
achieved without any particular hypervisor-side feature.

## Design and implementation

When the balloon driver is asked to increase / decrease reservation, it
will always start with a huge page. However, due to resource
availability in both hypervisor and guest, it's not always possible to
get hold of a huge page. In that case the driver will fall back to using
normal size pages. The balloon driver will later try to coalesce small
size pages into huge pages. As time goes by, both Xen and the guest
should use more and more huge pages.
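
The fallback described above can be sketched as a toy model. This is
only an illustration, not driver code: `plan_reservation`, its
`huge_available` parameter, and the 4 KB / 2 MB sizes are assumptions
made for the example.

```python
# Toy model of "try huge pages first, fall back to normal pages".
# Sizes are assumptions for the example: 4 KB normal, 2 MB huge.
NORMAL_KB = 4
HUGE_KB = 2048
PAGES_PER_HUGE = HUGE_KB // NORMAL_KB  # 512


def plan_reservation(nr_normal_pages, huge_available):
    """Split a request (counted in 4 KB pages) into huge and normal pages.

    huge_available is a hypothetical cap on how many huge pages the
    guest or hypervisor can provide right now.
    Returns (huge_pages, normal_pages).
    """
    wanted_huge = nr_normal_pages // PAGES_PER_HUGE
    huge = min(wanted_huge, huge_available)
    # Whatever could not be satisfied with huge pages falls back to 4 KB.
    normal = nr_normal_pages - huge * PAGES_PER_HUGE
    return huge, normal
```

In this model a request for 1024 normal pages with only one huge page
available is served as one huge page plus 512 normal pages, which the
driver could later try to coalesce.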

To achieve the said goal, several changes will be made:

1. Make use of balloon page compaction.
2. Maintain multiple queues for pages of different sizes and purposes.
3. Periodically exchange normal size pages with huge pages.

### Make use of balloon page compaction

Balloon page migration moves balloon pages from the start of a zone to
the end of the zone, making the guest physical address space contiguous.
This gives the worker thread a chance to allocate huge pages in order to
coalesce small pages.

Currently, the Xen balloon driver gets its pages directly from the page
allocator. To enable balloon page migration, those pages now need to
be allocated from the core balloon driver. Pages allocated from the core
balloon driver are subject to balloon page compaction.

The use of balloon compaction doesn't require introducing new
interfaces between the Xen balloon driver and the rest of the system.
Most changes are internal to the Xen balloon driver.

The Xen balloon driver will also need to provide a callback to migrate
a balloon page. In essence, the callback function receives an "old page",
which is an already ballooned out page, and a "new page", which is a page
to be ballooned out, then it inflates the "old page" and deflates the
"new page".

The core of the migration callback is the XENMEM\_exchange hypercall.
This makes sure that inflation of the old page and deflation of the new
page are done atomically, so even if a domain is beyond its memory target
and the target is being enforced, it can still compact memory.
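
To illustrate why an atomic swap keeps the domain within its target, the
callback can be modelled on a set of populated page frame numbers. This
is a toy model under stated assumptions, not the XENMEM\_exchange ABI:
`migrate_balloon_page` and the set representation are hypothetical.

```python
# Toy model of the migration callback: the guest's populated frames are a
# set of page frame numbers (PFNs). This is not the XENMEM_exchange ABI.

def migrate_balloon_page(populated, old_pfn, new_pfn):
    """Populate old_pfn and depopulate new_pfn in one step.

    old_pfn: already ballooned out (not populated).
    new_pfn: currently populated, about to be ballooned out.
    The number of populated pages is the same before and after, which
    is why the swap can work even when the memory target is enforced.
    """
    if old_pfn in populated or new_pfn not in populated:
        raise ValueError("unexpected page state")
    # Build the result in one expression so the caller never observes a
    # state with one page more or one page fewer than before.
    return (populated - {new_pfn}) | {old_pfn}
```

Doing the two steps as separate populate and depopulate calls would
briefly change the populated page count, which is what an enforced
target could reject.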

### Maintain multiple queues for pages of different sizes and purposes

We maintain multiple queues for pages of different sizes inside the Xen
balloon driver, so that the Xen balloon worker thread can coalesce
smaller size pages into one larger size page. Queues for special purpose
pages, such as balloon pages used to map foreign pages, are also
maintained. These special purpose pages are not subject to migration
and page coalescence.

For instance, the balloon driver can maintain three queues:

1. queue for 2 MB pages
2. queue for 4 KB pages (delegated to core balloon driver)
3. queue for pages used to map pages from other domains

More queues can be added when necessary, but for now one queue for
normal pages and one queue for huge pages should be enough.
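
A minimal sketch of the three queues listed above, as a plain data
structure. The class name `BalloonQueues`, its field names, and the
`coalescable` helper are hypothetical and exist only for this example.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class BalloonQueues:
    """Hypothetical container for the three example queues."""
    huge: deque = field(default_factory=deque)     # 2 MB pages
    normal: deque = field(default_factory=deque)   # 4 KB pages
    foreign: deque = field(default_factory=deque)  # pages for foreign mappings

    def coalescable(self):
        """Count the pages that are candidates for coalescing.

        Only 4 KB pages qualify; pages on the foreign-mapping queue are
        never migrated or coalesced.
        """
        return len(self.normal)
```

Keeping the special purpose pages on their own queue means the
coalescing logic never has to inspect them.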

### Worker thread to coalesce small size pages

The worker thread wakes up periodically to check whether there are
enough pages in the normal size page queue to coalesce into a huge page.
If so, it will try to exchange a huge page for that number of normal size
pages with the XENMEM\_exchange hypercall.
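
One pass of the worker can be sketched as follows. This is a toy model
only: `coalesce_once` and the `alloc_huge` callable are hypothetical, and
the exchange itself is modelled as moving entries between queues rather
than as a hypercall.

```python
from collections import deque

PAGES_PER_HUGE = 512  # 2 MB / 4 KB, the sizes assumed in this document


def coalesce_once(normal_queue, huge_queue, alloc_huge):
    """Run one pass of the worker.

    alloc_huge is a hypothetical callable returning the first PFN of a
    populated huge page, or None if none is available.
    Returns the list of 4 KB PFNs handed back to the guest.
    """
    if len(normal_queue) < PAGES_PER_HUGE:
        return []
    huge_pfn = alloc_huge()
    if huge_pfn is None:
        return []  # no huge page right now; try again on the next wakeup
    # Model of the exchange: the huge page becomes ballooned, and 512
    # previously ballooned 4 KB pages become populated again.
    released = [normal_queue.popleft() for _ in range(PAGES_PER_HUGE)]
    huge_queue.append(huge_pfn)
    return released
```

The pass is a no-op when either there are too few normal size pages or
no huge page can be allocated, so it is safe to run periodically.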

## Flowcharts

These flowcharts assume normal page size is 4K and huge page size is
2M.  They show how two queues are maintained.

![Increase Reservation](increase-reservation.png)

![Decrease Reservation](decrease-reservation.png)

![Exchange Pages](exchange-pages.png)


* Re: Xen balloon driver improvement (version 1)
  2014-10-22 16:29 Xen balloon driver improvement (version 1) Wei Liu
@ 2014-10-22 17:32 ` Andrew Cooper
  2014-10-22 18:29   ` Wei Liu
  2014-10-23 10:09 ` David Vrabel
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 24+ messages in thread
From: Andrew Cooper @ 2014-10-22 17:32 UTC (permalink / raw)
  To: Wei Liu, xen-devel
  Cc: Boris Ostrovsky, Stefano Stabellini, David Vrabel, Ian Campbell

On 22/10/14 17:29, Wei Liu wrote:
> Hi all
>
> This is my initial design to improve Xen balloon driver.
>
> PDF version with graphs can be found at
>
> http://xenbits.xen.org/people/liuw/xen-balloon-driver-improvement.pdf

+1 pandoc  (makes far nicer pdfs)

I have a few style recommendations, and some grammar issues

>
> % Xen Balloon Driver Improvement
> % Wei Liu <<wei.liu2@citrix.com>>

% Draft $N

Helps here.

>
> -------------------------------------------
> Version     Date         Changes
> -------     ----         ------------------
>   1         22/10/2014   Initial version.
> -------------------------------------------
>
> ## Motives
>
> 1. Balloon pages fragments guest physical address space.

[style] For the benefit of people reading the text rather than the pdf,
properly numbering lists like these is nice.

> 1. Balloon compaction infrastructure can migrate ballooned pages from
>    start of zone to end of zone, hence creating contiguous guest physical
>    address space.

What does start of zone and end of zone mean?  While this document is
labelled "Xen balloon driver improvements", it applies mainly to the
Linux balloon driver, and should be noted as such.

> 1. Having contiguous guest physical address enables some options to
>    improve performance.
>
> ## Goal of improvement
>
> Balloon driver makes use of as many huge pages as possible,

"The balloon driver"

> defragmenting both guest address space and Xen pages. This should be

Defragmentation (i.e. actively undoing fragmentation) is different to
"preventing fragmentation".

In this case, the balloon driver defragmenting the guest physical
address space permits superpage ballooning which helps prevent host
fragmentation.

I think you should also note that both guest and host address
fragmentation cause performance issues, and this design covers
specifically guest fragmentation.

> achieved without any particular hypervisor side feature.
>
> ## Design and implementation
>
> When balloon driver is asked to increase / decrease reservation, it

"When the balloon driver"

> will always start with huge page. However, due to resource

"always start with a huge page"

> availability in both hypervisor and guest, it's not always possible to
> get hold of a huge page. In that case the driver will fall back to use
> normal size page. Balloon driver later will try to coalesce small size
> pages into huge page. As time goes by, both Xen and guest should use
> more and more huge pages.
>
> To achieve the said goal, several changes will be made:
>
> 1. Make use of balloon page compaction.
> 1. Maintain multiple queues for pages of different sizes and purposes.
> 1. Periodically exchange normal size pages with huge pages.
>
> ### Make use of balloon page compaction
>
> Balloon page migration moves balloon pages from start of zone to end
> of zone, making guest physical address space contiguous. This gives
> worker thread to allocate huge pages in order to coalesce small pages.

I can't parse this final sentence.

>
> Currently, Xen balloon driver gets its page directly from page
> allocator. To enable balloon page migration, those pages now need to
> be allocated from core balloon driver. Pages allocated from core
> balloon driver are subject to balloon page compaction.
>
> The use of balloon compaction doesn't require introducign new
> interfaces between Xen balloon driver and the rest of the system. Most
> changes are internal to Xen balloon driver.

The Linux driver.  The principle applies to each OS, but interaction
with the core memory allocator is very certainly system specific.

>
> Xen balloon driver will also need to provide a callback to migrate
> balloon page. In essence callback function receives "old page", which
> is a already ballooned out page, and "new page", which is a page to be
> ballooned out, then it inflates "old page" and deflates "new page".
>
> The core of migration callback is XENMEM\_exchange hypercall. This
> makes sure that inflation of old page and deflation of new page is
> done atomically, so even if a domain is beyond its memory target and
> being enforced, it can still compact memory.

"and the target is being enforced"

>
> ### Maintain multiple queues for pages of different sizes and purposes
>
> We maintain multiple queues for pages of different sizes inside Xen
> balloon driver, so that Xen balloon worker thread can coalesce smaller
> size pages into one larger size page. Queues for special purposed
> pages, such as balloon pages used to map foreign pages, are also
> maintained. These special purposed pages are not subject to migration
> and page coalescence.
>
> For instance, balloon driver can maintain three queues:
>
> 1. queue for 2 MB pages
> 1. queue for 4 KB pages (delegated to core balloon driver)
> 1. queue for pages used to mapped pages from other domain

What about 1GB pages?

>
> More queues can be added when necessary, but for now one queue for
> normal pages and one queue for huge page should be enough.
>
> ### Worker thread to coalesce small size pages
>
> Worker thread wakes up periodically to check if there's enough pages

"there are enough pages"

> in normal size page queue to coalesce into a huge page. If so, it will
> try to exchange that huge page into a number of normal size pages with
> XENMEM\_exchange hypercall.

Your diagram says "exchange 2M and 4K pages".  Exchange how, because you
cannot exchange a set of scattered 4K pages for a 2M contiguous one in
an HVM guest.

~Andrew


* Re: Xen balloon driver improvement (version 1)
  2014-10-22 17:32 ` Andrew Cooper
@ 2014-10-22 18:29   ` Wei Liu
  2014-10-23 11:00     ` Ian Campbell
  0 siblings, 1 reply; 24+ messages in thread
From: Wei Liu @ 2014-10-22 18:29 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Wei Liu, Ian Campbell, Stefano Stabellini, xen-devel,
	David Vrabel, Boris Ostrovsky

On Wed, Oct 22, 2014 at 06:32:57PM +0100, Andrew Cooper wrote:
> On 22/10/14 17:29, Wei Liu wrote:
> > Hi all
> >
> > This is my initial design to improve Xen balloon driver.
> >
> > PDF version with graphs can be found at
> >
> > http://xenbits.xen.org/people/liuw/xen-balloon-driver-improvement.pdf
> 
> +1 pandoc  (makes far nicer pdfs)
> 
> I have a few style recommendations, and some grammar issues
> 

I will fix them.

[...]
> > 1. Balloon compaction infrastructure can migrate ballooned pages from
> >    start of zone to end of zone, hence creating contiguous guest physical
> >    address space.
> 
> What does start of zone and end of zone mean?  While this document is

It's a Linux concept. Zone is just ... memory zone. :-)

> labelled "Xen balloon driver improvements", it applies mainly to the
> Linux balloon driver, and should be noted as such.
> 

Sure.

> > 1. Having contiguous guest physical address enables some options to
> >    improve performance.
> >
> > ## Goal of improvement
> >
[...]
> > Balloon page migration moves balloon pages from start of zone to end
> > of zone, making guest physical address space contiguous. This gives
> > worker thread to allocate huge pages in order to coalesce small pages.
> 
> I can't parse this final sentence.
> 

This gives worker thread *a chance* to allocate huge pages in order to
coalesce small pages.

> >
> > Currently, Xen balloon driver gets its page directly from page
[...]
> > ### Maintain multiple queues for pages of different sizes and purposes
> >
> > We maintain multiple queues for pages of different sizes inside Xen
> > balloon driver, so that Xen balloon worker thread can coalesce smaller
> > size pages into one larger size page. Queues for special purposed
> > pages, such as balloon pages used to map foreign pages, are also
> > maintained. These special purposed pages are not subject to migration
> > and page coalescence.
> >
> > For instance, balloon driver can maintain three queues:
> >
> > 1. queue for 2 MB pages
> > 1. queue for 4 KB pages (delegated to core balloon driver)
> > 1. queue for pages used to mapped pages from other domain
> 
> What about 1GB pages?
> 

I wouldn't bother with 1GB pages here.

It would require too much work to coalesce 4KB pages to 1GB pages. 

And probably due to resource limitation, 1GB allocation is likely to
fail. If you have a guest with hundreds of GB ram you wouldn't bother
using balloon compaction in the first place.

So having a 1GB queue seems too much effort with too little gain.

> >
> > More queues can be added when necessary, but for now one queue for
> > normal pages and one queue for huge page should be enough.
> >
> > ### Worker thread to coalesce small size pages
> >
> > Worker thread wakes up periodically to check if there's enough pages
> 
> "there are enough pages"
> 
> > in normal size page queue to coalesce into a huge page. If so, it will
> > try to exchange that huge page into a number of normal size pages with
> > XENMEM\_exchange hypercall.
> 
> Your diagram says "exchange 2M and 4K pages".  Exchange how, because you
> cannot exchange a set of scattered 4K pages for a 2M contiguous one in
> an HVM guest.
> 

2M page is populated, while 4K pages are not.

So for the exchange hypercall, 2M is exchange.in and the list of 4K
pages is the output. In the end, 2M page becomes unpopulated, 4K pages
are populated.

I read the implementation of XENMEM_exchange, it looks OK for
me to do so. Did I overlook something?

Wei.

> ~Andrew


* Re: Xen balloon driver improvement (version 1)
  2014-10-22 18:29   ` Wei Liu
@ 2014-10-23 11:00     ` Ian Campbell
  2014-10-23 11:05       ` Wei Liu
  2014-10-23 11:42       ` Andrew Cooper
  0 siblings, 2 replies; 24+ messages in thread
From: Ian Campbell @ 2014-10-23 11:00 UTC (permalink / raw)
  To: Wei Liu
  Cc: Stefano Stabellini, Andrew Cooper, xen-devel, David Vrabel,
	Boris Ostrovsky

On Wed, 2014-10-22 at 19:29 +0100, Wei Liu wrote:

> > > For instance, balloon driver can maintain three queues:
> > >
> > > 1. queue for 2 MB pages
> > > 1. queue for 4 KB pages (delegated to core balloon driver)
> > > 1. queue for pages used to mapped pages from other domain
> > 
> > What about 1GB pages?
> > 
> 
> I wouldn't bother with 1GB pages here.

Guests which don't have special privileges are limited to 2M contiguous
allocations anyway, to stop them from consuming "precious" higher order
mappings.

I'm not sure that's still worthwhile (e.g. is it valid on ARM or
shadow/HAP x86? I'm not sure).

> It would require too much work to coalesce 4KB pages to 1GB pages. 

FWIW, in practice you'd probably coalesce 4K pages into 2M and then 2M
into 1G.


* Re: Xen balloon driver improvement (version 1)
  2014-10-23 11:00     ` Ian Campbell
@ 2014-10-23 11:05       ` Wei Liu
  2014-10-23 11:42       ` Andrew Cooper
  1 sibling, 0 replies; 24+ messages in thread
From: Wei Liu @ 2014-10-23 11:05 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Wei Liu, Stefano Stabellini, Andrew Cooper, xen-devel,
	David Vrabel, Boris Ostrovsky

On Thu, Oct 23, 2014 at 12:00:40PM +0100, Ian Campbell wrote:
> On Wed, 2014-10-22 at 19:29 +0100, Wei Liu wrote:
> 
> > > > For instance, balloon driver can maintain three queues:
> > > >
> > > > 1. queue for 2 MB pages
> > > > 1. queue for 4 KB pages (delegated to core balloon driver)
> > > > 1. queue for pages used to mapped pages from other domain
> > > 
> > > What about 1GB pages?
> > > 
> > 
> > I wouldn't bother with 1GB pages here.
> 
> Guests which don't have special privileges are limited to 2M contiguous
> allocations anyway, to stop them from consuming "precious" higher order
> mappings.
> 
> I'm not sure that's still worthwhile (e.g. is it valid on ARM or
> shadow/HAP x86? I'm not sure).
> 
> > It would require too much work to coalesce 4KB pages to 1GB pages. 
> 
> FWIW in practice You'd probably coalesce 4K pages into 2M  and then 2M
> into 1G.
> 

That's what I meant. Still, that's a very big number.

Wei.


* Re: Xen balloon driver improvement (version 1)
  2014-10-23 11:00     ` Ian Campbell
  2014-10-23 11:05       ` Wei Liu
@ 2014-10-23 11:42       ` Andrew Cooper
  2014-10-23 11:44         ` David Vrabel
  1 sibling, 1 reply; 24+ messages in thread
From: Andrew Cooper @ 2014-10-23 11:42 UTC (permalink / raw)
  To: Ian Campbell, Wei Liu
  Cc: Boris Ostrovsky, Stefano Stabellini, David Vrabel, xen-devel

On 23/10/14 12:00, Ian Campbell wrote:
> On Wed, 2014-10-22 at 19:29 +0100, Wei Liu wrote:
>
>>>> For instance, balloon driver can maintain three queues:
>>>>
>>>> 1. queue for 2 MB pages
>>>> 1. queue for 4 KB pages (delegated to core balloon driver)
>>>> 1. queue for pages used to mapped pages from other domain
>>> What about 1GB pages?
>>>
>> I wouldn't bother with 1GB pages here.
> Guests which don't have special privileges are limited to 2M contiguous
> allocations anyway, to stop them from consuming "precious" higher order
> mappings.
>
> I'm not sure that's still worthwhile (e.g. is it valid on ARM or
> shadow/HAP x86? I'm not sure).
>
>> It would require too much work to coalesce 4KB pages to 1GB pages. 
> FWIW in practice You'd probably coalesce 4K pages into 2M  and then 2M
> into 1G.
>
>

Wouldn't it be wonderful to be able to run a 1GB PVH guest on a 1GB HAP
mapping, with the PVH guest making use of 2MB mapping where possible?

PVH ought to be able to do away with the MTRR caching issues, the
legacy IO regions, and so long as the guest doesn't balloon pages out or
map a foreign grant, it won't shatter the host superpage.

But in principle, I agree that making better use of 2MB pages is more
important than considering 1GB pages at the moment.

~Andrew


* Re: Xen balloon driver improvement (version 1)
  2014-10-23 11:42       ` Andrew Cooper
@ 2014-10-23 11:44         ` David Vrabel
  0 siblings, 0 replies; 24+ messages in thread
From: David Vrabel @ 2014-10-23 11:44 UTC (permalink / raw)
  To: Andrew Cooper, Ian Campbell, Wei Liu
  Cc: Boris Ostrovsky, Stefano Stabellini, xen-devel

On 23/10/14 12:42, Andrew Cooper wrote:
> On 23/10/14 12:00, Ian Campbell wrote:
>> On Wed, 2014-10-22 at 19:29 +0100, Wei Liu wrote:
>>
>>>>> For instance, balloon driver can maintain three queues:
>>>>>
>>>>> 1. queue for 2 MB pages
>>>>> 1. queue for 4 KB pages (delegated to core balloon driver)
>>>>> 1. queue for pages used to mapped pages from other domain
>>>> What about 1GB pages?
>>>>
>>> I wouldn't bother with 1GB pages here.
>> Guests which don't have special privileges are limited to 2M contiguous
>> allocations anyway, to stop them from consuming "precious" higher order
>> mappings.
>>
>> I'm not sure that's still worthwhile (e.g. is it valid on ARM or
>> shadow/HAP x86? I'm not sure).
>>
>>> It would require too much work to coalesce 4KB pages to 1GB pages. 
>> FWIW in practice You'd probably coalesce 4K pages into 2M  and then 2M
>> into 1G.
>>
>>
> 
> Wouldn't it be wonderful to be able to run a 1GB PVH guest on a 1GB HAP
> mapping, with the PVH guest making use of 2MB mapping where possible.
> 
> PVH (ought) to be able to do away with the MTRR caching issues, the
> legacy IO regions, and so long as the guest doesn't balloon pages out or
> map a foreign grant, it won't shatter the host superpage.
> 
> But in principle, I agree that making better use of 2MB pages is more
> important than considering 1GB pages at the moment.

IMO, if you're that concerned about eking out the last bit of
performance out of a guest you probably won't be ballooning.

David


* Re: Xen balloon driver improvement (version 1)
  2014-10-22 16:29 Xen balloon driver improvement (version 1) Wei Liu
  2014-10-22 17:32 ` Andrew Cooper
@ 2014-10-23 10:09 ` David Vrabel
  2014-10-23 10:52   ` Stefano Stabellini
                     ` (2 more replies)
  2014-10-23 11:59 ` Ian Campbell
                   ` (2 subsequent siblings)
  4 siblings, 3 replies; 24+ messages in thread
From: David Vrabel @ 2014-10-23 10:09 UTC (permalink / raw)
  To: Wei Liu, xen-devel
  Cc: Andrew Cooper, Boris Ostrovsky, David Vrabel, Ian Campbell,
	Stefano Stabellini

On 22/10/14 17:29, Wei Liu wrote:
> 
> ### Make use of balloon page compaction
[...]
> The core of migration callback is XENMEM\_exchange hypercall. This
> makes sure that inflation of old page and deflation of new page is
> done atomically, so even if a domain is beyond its memory target and
> being enforced, it can still compact memory.

XENMEM_exchange doesn't really have the behaviour that is needed here.

Page migration splits the memory map into two parts, the populated area
at the bottom and the balloon area.  The populated area is fragmented by
ballooned pages, and the balloon area is fragmented by populated pages.

Consider a single ballooned page in the middle of an otherwise intact
superframe.  Page migration wants to populate this page and depopulate a
different page from the balloon area.

A hypercall that can do an atomic populate and depopulate will allow xen
to easily recreate the superframe (if the missing frame is free).
XENMEM_exchange will leave the superframe fragmented.

XENMEM_exchange would be an acceptable fallback when this new hypercall
is not available.

> ### Maintain multiple queues for pages of different sizes and purposes
> 
> We maintain multiple queues for pages of different sizes inside Xen
> balloon driver, so that Xen balloon worker thread can coalesce smaller
> size pages into one larger size page. Queues for special purposed
> pages, such as balloon pages used to map foreign pages, are also
> maintained. These special purposed pages are not subject to migration
> and page coalescence.
> 
> For instance, balloon driver can maintain three queues:
> 
> 1. queue for 2 MB pages
> 1. queue for 4 KB pages (delegated to core balloon driver)
> 1. queue for pages used to mapped pages from other domain
> 
> More queues can be added when necessary, but for now one queue for
> normal pages and one queue for huge page should be enough.

Can you explain why this is specific to Xen and why other hypervisors
wouldn't want to make use of all this huge page infrastructure?

> ### Worker thread to coalesce small size pages
> 
> Worker thread wakes up periodically to check if there's enough pages
> in normal size page queue to coalesce into a huge page. If so, it will
> try to exchange that huge page into a number of normal size pages with
> XENMEM\_exchange hypercall.

I don't think you need a new worker thread for this; the existing page
migration is already trying to keep the ballooned zone contiguous so
after migrating pages you need only try and move contiguous ballooned 4k
pages to the 2M list.

> ## Flowcharts
> 
> These flowcharts assume normal page size is 4K and huge page size is
> 2M.  They show how two queues are maintained.

Having to break 2M pages into 4k ones to meet a target suggests that the
toolstack should allocate a domain with 2M multiples and should set the
target in 2M multiples only.  The autoballoon driver will also need to
set the target in 2M multiples.

David


* Re: Xen balloon driver improvement (version 1)
  2014-10-23 10:09 ` David Vrabel
@ 2014-10-23 10:52   ` Stefano Stabellini
  2014-10-23 10:58     ` David Vrabel
  2014-10-23 11:04   ` Wei Liu
  2014-10-27 11:29   ` Wei Liu
  2 siblings, 1 reply; 24+ messages in thread
From: Stefano Stabellini @ 2014-10-23 10:52 UTC (permalink / raw)
  To: David Vrabel
  Cc: Wei Liu, Ian Campbell, Stefano Stabellini, Andrew Cooper,
	xen-devel, Boris Ostrovsky

On Thu, 23 Oct 2014, David Vrabel wrote:
> On 22/10/14 17:29, Wei Liu wrote:
> > 
> > ### Make use of balloon page compaction
> [...]
> > The core of migration callback is XENMEM\_exchange hypercall. This
> > makes sure that inflation of old page and deflation of new page is
> > done atomically, so even if a domain is beyond its memory target and
> > being enforced, it can still compact memory.
> 
> XENMEM_exchange doesn't really have the behaviour that is needed here.
> 
> Page migration splits the memory map into two parts, the populated area
> at the bottom and the balloon area.  The populated area is fragmented by
> ballooned pages, and the balloon area is fragmented by populated pages.
> 
> Consider a single ballooned page in the middle of an otherwise intact
> superframe.  Page migration wants to populate this page and depopulate a
> different page from the balloon area.
> 
> A hypercall that can do an atomic populate and depopulate will allow xen
> to easily recreate the superframe (if the missing frame is free).
> XENMEM_exchange will leave the superframe fragmented.
 
XENMEM_exchange should be capable of doing that. If it is not today, it
could be fixed. Am I missing something? What is the problem at the
interface level with XENMEM_exchange?



> XENMEM_exchange would be an acceptable fallback when this new hypercall
> is not available.
>
> > ## Flowcharts
> > 
> > These flowcharts assume normal page size is 4K and huge page size is
> > 2M.  They show how two queues are maintained.
> 
> Having to break 2M pages into 4k ones to meet a target suggests that the
> toolstack should allocate a domain with 2M multiples and should set the
> target in 2M multiples only.  The autoballoon driver will also need to
> set the target in 2M multiples.

That's good low-hanging fruit to pick.


* Re: Xen balloon driver improvement (version 1)
  2014-10-23 10:52   ` Stefano Stabellini
@ 2014-10-23 10:58     ` David Vrabel
  0 siblings, 0 replies; 24+ messages in thread
From: David Vrabel @ 2014-10-23 10:58 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Andrew Cooper, Boris Ostrovsky, Wei Liu, Ian Campbell, xen-devel

On 23/10/14 11:52, Stefano Stabellini wrote:
> On Thu, 23 Oct 2014, David Vrabel wrote:
>> On 22/10/14 17:29, Wei Liu wrote:
>>>
>>> ### Make use of balloon page compaction
>> [...]
>>> The core of migration callback is XENMEM\_exchange hypercall. This
>>> makes sure that inflation of old page and deflation of new page is
>>> done atomically, so even if a domain is beyond its memory target and
>>> being enforced, it can still compact memory.
>>
>> XENMEM_exchange doesn't really have the behaviour that is needed here.
>>
>> Page migration splits the memory map into two parts, the populated area
>> at the bottom and the balloon area.  The populated area is fragmented by
>> ballooned pages, and the balloon area is fragmented by populated pages.
>>
>> Consider a single ballooned page in the middle of an otherwise intact
>> superframe.  Page migration wants to populate this page and depopulate a
>> different page from the balloon area.
>>
>> A hypercall that can do an atomic populate and depopulate will allow xen
>> to easily recreate the superframe (if the missing frame is free).
>> XENMEM_exchange will leave the superframe fragmented.
>  
> XENMEM_exchange should be capable of doing that. If it is not today, it
> could be fixed. Am I missing something? What the problem at the
> interface level with XENMEM_exchange?

I'm probably not understanding what XENMEM_exchange does.  If it has the
correct behaviour already then that's fine.

David


* Re: Xen balloon driver improvement (version 1)
  2014-10-23 10:09 ` David Vrabel
  2014-10-23 10:52   ` Stefano Stabellini
@ 2014-10-23 11:04   ` Wei Liu
  2014-10-27 11:29   ` Wei Liu
  2 siblings, 0 replies; 24+ messages in thread
From: Wei Liu @ 2014-10-23 11:04 UTC (permalink / raw)
  To: David Vrabel
  Cc: Wei Liu, Ian Campbell, Stefano Stabellini, Andrew Cooper,
	xen-devel, Boris Ostrovsky

On Thu, Oct 23, 2014 at 11:09:19AM +0100, David Vrabel wrote:
> On 22/10/14 17:29, Wei Liu wrote:
> > 
> > ### Make use of balloon page compaction
> [...]
> > The core of migration callback is XENMEM\_exchange hypercall. This
> > makes sure that inflation of old page and deflation of new page is
> > done atomically, so even if a domain is beyond its memory target and
> > being enforced, it can still compact memory.
> 
> XENMEM_exchange doesn't really have the behaviour that is needed here.
> 

Atomicity is guaranteed, isn't it?

> Page migration splits the memory map into two parts, the populated area
> at the bottom and the balloon area.  The populated area is fragmented by
> ballooned pages, and the balloon area is fragmented by populated pages.
> 
> Consider a single ballooned page in the middle of an otherwise intact
> superframe.  Page migration wants to populate this page and depopulate a
> different page from the balloon area.
> 
> A hypercall that can do an atomic populate and depopulate will allow xen
> to easily recreate the superframe (if the missing frame is free).
> XENMEM_exchange will leave the superframe fragmented.
> 

It's true that host superframe is fragmented, but how is it worse than
before? Balloon page compaction is meant to defragment guest address
space.  I think it's acceptable as long as it doesn't make host frame
fragmentation worse.

> XENMEM_exchange would be an acceptable fallback when this new hypercall
> is not available.
> 

What I'm trying to do here is to build a cycle of balloon compaction /
page coalescence that can converge on both host and guest defragmenting
their address space.

Adding a new hypercall is orthogonal to this approach. It might be
more efficient, but it also means that guests using it are tied to a
new hypervisor.

Furthermore, we can probably consider changing XENMEM_exchange to
achieve the functionality you need under the hood without guest
intervention.

> > ### Maintain multiple queues for pages of different sizes and purposes
> > 
> > We maintain multiple queues for pages of different sizes inside Xen
> > balloon driver, so that Xen balloon worker thread can coalesce smaller
> > size pages into one larger size page. Queues for special purposed
> > pages, such as balloon pages used to map foreign pages, are also
> > maintained. These special purposed pages are not subject to migration
> > and page coalescence.
> > 
> > For instance, balloon driver can maintain three queues:
> > 
> > 1. queue for 2 MB pages
> > 1. queue for 4 KB pages (delegated to core balloon driver)
> > 1. queue for pages used to mapped pages from other domain
> > 
> > More queues can be added when necessary, but for now one queue for
> > normal pages and one queue for huge page should be enough.
> 
> Can you explain why is this specific to Xen and why other hypervisors
> wouldn't want to make use of all this huge page infrastructure?
> 

Linux as hypervisor can use huge page infrastructure and page migration.

I think you're talking about balloon page compaction in guest?  As a
guest, it uses balloon compaction. However, the host is capable of doing
page migration all by itself, and if configured, uses THP to back guest
address space, so the balloon driver in guest has less burden, which
means it doesn't have to actively ask hypervisor to back its address
space with huge pages. Xen is less capable in this area.

> > ### Worker thread to coalesce small size pages
> > 
> > Worker thread wakes up periodically to check if there's enough pages
> > in normal size page queue to coalesce into a huge page. If so, it will
> > try to exchange that huge page into a number of normal size pages with
> > XENMEM\_exchange hypercall.
> 
> I don't think you need a new worker thread for this,  the existing page
> migration is already trying to keep the ballooned zone contiguous so
> after migrating pages you need only try and move contiguous ballooned 4k
> pages to the 2M list.
> 

That's an idea. I will take this into consideration.

> > ## Flowcharts
> > 
> > These flowcharts assume normal page size is 4K and huge page size is
> > 2M.  They show how two queues are maintained.
> 
> Having to break 2M pages into 4k ones to meet a target suggests that the
> toolstack should allocate a domain with 2M multiples and should set the
> target in 2M multiples only.  The autoballoon driver will also need to
> set the target in 2M multiples.
> 

This only happens if the amount is a multiple of 2M. Otherwise it just
works as before -- using 4K pages.

Wei.

> David

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Xen balloon driver improvement (version 1)
  2014-10-23 10:09 ` David Vrabel
  2014-10-23 10:52   ` Stefano Stabellini
  2014-10-23 11:04   ` Wei Liu
@ 2014-10-27 11:29   ` Wei Liu
  2 siblings, 0 replies; 24+ messages in thread
From: Wei Liu @ 2014-10-27 11:29 UTC (permalink / raw)
  To: David Vrabel
  Cc: Wei Liu, Ian Campbell, Stefano Stabellini, Andrew Cooper,
	xen-devel, Boris Ostrovsky

On Thu, Oct 23, 2014 at 11:09:19AM +0100, David Vrabel wrote:
[...]
> > ### Worker thread to coalesce small size pages
> > 
> > Worker thread wakes up periodically to check if there's enough pages
> > in normal size page queue to coalesce into a huge page. If so, it will
> > try to exchange that huge page into a number of normal size pages with
> > XENMEM\_exchange hypercall.
> 
> I don't think you need a new worker thread for this,  the existing page
> migration is already trying to keep the ballooned zone contiguous so
> after migrating pages you need only try and move contiguous ballooned 4k
> pages to the 2M list.
> 

After some more thought on this, a new worker thread is not needed. The
current balloon thread can do both the ballooning work and the
coalescing work: they are mutually exclusive workloads, so one thread
should be enough.

As for moving contiguous ballooned pages from 4K list to 2M list,
unfortunately I see a problem with this proposal: The 4K pages list is
not sorted. Sorting it requires hooking into core balloon driver -- that
is, to grab multiple locks to avoid racing with page migration thread,
which is prone to error.
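
To make the sorting problem concrete, here is a stand-alone sketch in
plain user-space C (not kernel code, and not how the core balloon driver
stores its pages). It assumes the ballooned 4K pages are available as an
unsorted array of PFNs with no duplicates, and that a 2M region is 512
consecutive PFNs starting at a multiple of 512. The function name
count_full_huge_regions is made up for this example, and all locking is
ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define PAGES_PER_HUGE 512UL /* 2M / 4K, an assumption of this sketch */

/* qsort comparator for unsigned long PFNs, ascending. */
static int cmp_pfn(const void *a, const void *b)
{
    unsigned long x = *(const unsigned long *)a;
    unsigned long y = *(const unsigned long *)b;

    return (x > y) - (x < y);
}

/*
 * Hypothetical helper: sort pfns[] in place, then count how many
 * 2M-aligned regions are completely covered by ballooned 4K pages.
 * Assumes pfns[] contains no duplicates.
 */
static size_t count_full_huge_regions(unsigned long *pfns, size_t n)
{
    size_t i = 0, regions = 0;

    assert(pfns != NULL || n == 0);
    qsort(pfns, n, sizeof(*pfns), cmp_pfn);

    while (i < n) {
        unsigned long start = pfns[i];
        size_t run = 1;
        unsigned long first_aligned, end;

        /* Extend the run while the PFNs stay consecutive. */
        while (i + run < n && pfns[i + run] == start + run)
            run++;

        /* Count the aligned 512-page blocks that fit inside the run. */
        first_aligned = (start + PAGES_PER_HUGE - 1) / PAGES_PER_HUGE
                        * PAGES_PER_HUGE;
        end = start + run; /* exclusive */
        if (first_aligned + PAGES_PER_HUGE <= end)
            regions += (end - first_aligned) / PAGES_PER_HUGE;

        i += run;
    }

    return regions;
}
```

The sort is the step that would need the extra locking in the real
driver, since the list can change underneath it while page migration is
running.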

Wei.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Xen balloon driver improvement (version 1)
  2014-10-22 16:29 Xen balloon driver improvement (version 1) Wei Liu
  2014-10-22 17:32 ` Andrew Cooper
  2014-10-23 10:09 ` David Vrabel
@ 2014-10-23 11:59 ` Ian Campbell
  2014-10-23 12:17   ` Wei Liu
  2014-10-23 14:30 ` Roger Pau Monné
  2014-10-24 13:54 ` Dario Faggioli
  4 siblings, 1 reply; 24+ messages in thread
From: Ian Campbell @ 2014-10-23 11:59 UTC (permalink / raw)
  To: Wei Liu
  Cc: Stefano Stabellini, Andrew Cooper, xen-devel, David Vrabel,
	Boris Ostrovsky

On Wed, 2014-10-22 at 17:29 +0100, Wei Liu wrote:

> For instance, balloon driver can maintain three queues:
> 
> 1. queue for 2 MB pages
> 1. queue for 4 KB pages (delegated to core balloon driver)
> 1. queue for pages used to mapped pages from other domain

I think I'd describe this last one as "pages used to provide empty
address ranges to drivers" or something like that. Yes, they will
probably be used for mappings, but I don't think that is the only use of
the alloc_xenballooned_pages interface.

On that subject, how do you handle alloc_xenballooned_pages calls of
non-2M alignment? Would it be best to do a 2M balloon and queue the rest
for use on future similar allocations?

If so then I'm wondering if it might make sense to keep the spare 4K
pages from doing this on a separate queue to the normal 4K queue, in
order to keep these sorts of pages isolated into 2M regions -- because I
expect that they cannot be compacted without cooperation with the driver
which allocated them (which I expect won't even be possible in many
cases).

> These flowcharts assume normal page size is 4K and huge page size is
> 2M.  They show how two queues are maintained.
> 
> ![Increase Reservation](increase-reservation.png)

There are a few implicit "requeue on failure" arcs missing on some of
these. I think adding them would make the picture hard to follow, but
perhaps a footnote?

On both here and decrease there are attempts to allocate (from either
the queue or the kernel, depending) 2M which could fail. I wonder if it
is worth inserting "kick THP and give it a chance"? It's a question of
tradeoffs in the latency of a ballooning operation vs the efficiency
with which we can use 2M allocations.

> ![Decrease Reservation](decrease-reservation.png)
> 
> ![Exchange Pages](exchange-pages.png)
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Xen balloon driver improvement (version 1)
  2014-10-23 11:59 ` Ian Campbell
@ 2014-10-23 12:17   ` Wei Liu
  2014-10-23 12:27     ` Ian Campbell
  0 siblings, 1 reply; 24+ messages in thread
From: Wei Liu @ 2014-10-23 12:17 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Wei Liu, Stefano Stabellini, Andrew Cooper, xen-devel,
	David Vrabel, Boris Ostrovsky

On Thu, Oct 23, 2014 at 12:59:17PM +0100, Ian Campbell wrote:
> On Wed, 2014-10-22 at 17:29 +0100, Wei Liu wrote:
> 
> > For instance, balloon driver can maintain three queues:
> > 
> > 1. queue for 2 MB pages
> > 1. queue for 4 KB pages (delegated to core balloon driver)
> > 1. queue for pages used to mapped pages from other domain
> 
> I think I'd describe this last one as "pages used to provide empty
> address ranges to drivers" or something like that. Yes, they will
> probably be used for mappings, but I don't think that is the only use of
> the alloc_xenballooned_pages interface.
> 

OK. That's more sensible.

> On that subject, how do you handle alloc_xenballooned_pages calls of
> non-2M alignment? Would it be best to do a 2M balloon and queue the rest
> for use on future similar allocations?
> 
> If so then I'm wondering if it might make sense to keep the spare 4K
> pages from doing this on a separate queue to the normal 4K queue, in
> order to keep these sorts pages isolated into 2M regions -- because I
> expect that they cannot be compacted without cooperation with the driver
> which allocated them (which I expect won't even be possible in many
> cases).
> 

Yes, it requires cooperation from the driver, and I don't think it's a
good idea because that would mean drivers need to do weird things which
hinder performance and increase complexity. 

I intend to not touch them, just leave them in a separate queue.

> > These flowcharts assume normal page size is 4K and huge page size is
> > 2M.  They show how two queues are maintained.
> > 
> > ![Increase Reservation](increase-reservation.png)
> 
> There's a few implicit "requeue on failure" arcs missing on some of
> these, I think adding them would make the picture hard to follow, but
> perhaps a footnote?
>  

Correct. I omitted "requeue on failure". I will add a footnote.

> On both here and decrease there are attempts to allocate (from either
> the queue or the kernel, depending) 2M which could fail. I wonder if it
> is worth inserting "kick THP and give it a chance"? It's a question of
> tradeoffs in the latency of a ballooning operation vs the efficiency
> with which we can use 2M allocations.
> 

No, THP is not involved. I think you mean balloon page compaction.

I'm not sure if it's a good idea because we're now introducing latency
that we can't control. At the very least, if guest admin really cares he
/ she should be able to kick off compaction voluntarily.

Wei.

> > ![Decrease Reservation](decrease-reservation.png)
> > 
> > ![Exchange Pages](exchange-pages.png)
> > 
> 
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Xen balloon driver improvement (version 1)
  2014-10-23 12:17   ` Wei Liu
@ 2014-10-23 12:27     ` Ian Campbell
  2014-10-23 13:00       ` Wei Liu
  0 siblings, 1 reply; 24+ messages in thread
From: Ian Campbell @ 2014-10-23 12:27 UTC (permalink / raw)
  To: Wei Liu
  Cc: Stefano Stabellini, Andrew Cooper, xen-devel, David Vrabel,
	Boris Ostrovsky

On Thu, 2014-10-23 at 13:17 +0100, Wei Liu wrote:
> On Thu, Oct 23, 2014 at 12:59:17PM +0100, Ian Campbell wrote:
> > On that subject, how do you handle alloc_xenballooned_pages calls of
> > non-2M alignment? Would it be best to do a 2M balloon and queue the rest
> > for use on future similar allocations?
> > 
> > If so then I'm wondering if it might make sense to keep the spare 4K
> > pages from doing this on a separate queue to the normal 4K queue, in
> > order to keep these sorts pages isolated into 2M regions -- because I
> > expect that they cannot be compacted without cooperation with the driver
> > which allocated them (which I expect won't even be possible in many
> > cases).
> > 
> 
> Yes, it requires cooperation from the driver, and I don't think it's a
> good idea because that would mean drivers need to do weird things which
> hinder performance and increase complexity. 

I have a feeling it may even be impossible in some cases.

> I intend to not touch them, just leave them in separate queue.

i.e. a separate one from the "unused ballooned 4k"?

> > On both here and decrease there are attempts to allocate (from either
> > the queue or the kernel, depending) 2M which could fail. I wonder if it
> > is worth inserting "kick THP and give it a chance"? It's a question of
> > tradeoffs in the latency of a ballooning operation vs the efficiency
> > with which we can use 2M allocations.
> > 
> 
> No, THP is not involved. I think you mean balloon page compaction.

I meant whichever global kernel facility exists to try and increase the
supply of 2M pages for allocations, which certainly includes balloon
page compaction, but also includes all the other types of compaction
which can be done, no? Maybe THP isn't the right name, or perhaps I
think the kernel has more functionality than it does in reality?

> I'm not sure if it's a good idea because we're now introducing latency
> that we can't control.

By "give it a chance" I meant "wait some period which we control", if it
can't succeed in that timescale then give up. So the latency would be
bound by us (or our supplied control knob, or whatever) not by the "THP"
side of things.
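
Purely as an illustration of the bounded wait described above, here is
a sketch in plain user-space C (not balloon driver code). It uses a
fixed number of attempts as a stand-in for "some period which we
control". The names alloc_huge_bounded, try_alloc_fn and demo_try_huge
are hypothetical, and the callback stands in for whatever the real 2M
allocation attempt would be.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback: returns true if a 2M allocation succeeded. */
typedef bool (*try_alloc_fn)(void *ctx);

enum alloc_result {
    ALLOC_HUGE,           /* got a 2M page within the budget */
    ALLOC_FALLBACK_SMALL  /* budget exhausted, use 4K pages instead */
};

/*
 * Try the 2M allocation at most max_attempts times, then give up.
 * The latency is bounded by the caller's budget, not by compaction.
 */
static enum alloc_result alloc_huge_bounded(try_alloc_fn try_huge,
                                            void *ctx,
                                            unsigned int max_attempts)
{
    assert(try_huge != NULL);

    for (unsigned int i = 0; i < max_attempts; i++) {
        if (try_huge(ctx))
            return ALLOC_HUGE;
        /* A real driver would sleep or let compaction run here. */
    }

    return ALLOC_FALLBACK_SMALL;
}

/* Demo stand-in for an allocator: fails until the counter reaches 0. */
static bool demo_try_huge(void *ctx)
{
    unsigned int *remaining_failures = ctx;

    if (*remaining_failures == 0)
        return true;
    (*remaining_failures)--;
    return false;
}
```

With a budget of five attempts and an allocator that fails twice, this
returns ALLOC_HUGE; with a budget of three and an allocator that keeps
failing, it falls back to 4K pages after the third attempt.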

>  At the very least, if guest admin really cares he
> / she should be able to kick off compaction voluntarily.

If the kernel can compact e.g. normal user pages then it would be better
to do that than to balloon and compact those later, I think? (which is
contingent on my not having overestimated what the kernel is capable
of...)

Ian.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Xen balloon driver improvement (version 1)
  2014-10-23 12:27     ` Ian Campbell
@ 2014-10-23 13:00       ` Wei Liu
  2014-10-23 14:29         ` Ian Campbell
  0 siblings, 1 reply; 24+ messages in thread
From: Wei Liu @ 2014-10-23 13:00 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Wei Liu, Stefano Stabellini, Andrew Cooper, xen-devel,
	David Vrabel, Boris Ostrovsky

On Thu, Oct 23, 2014 at 01:27:45PM +0100, Ian Campbell wrote:
> On Thu, 2014-10-23 at 13:17 +0100, Wei Liu wrote:
> > On Thu, Oct 23, 2014 at 12:59:17PM +0100, Ian Campbell wrote:
> > > On that subject, how do you handle alloc_xenballooned_pages calls of
> > > non-2M alignment? Would it be best to do a 2M balloon and queue the rest
> > > for use on future similar allocations?
> > > 
> > > If so then I'm wondering if it might make sense to keep the spare 4K
> > > pages from doing this on a separate queue to the normal 4K queue, in
> > > order to keep these sorts pages isolated into 2M regions -- because I
> > > expect that they cannot be compacted without cooperation with the driver
> > > which allocated them (which I expect won't even be possible in many
> > > cases).
> > > 
> > 
> > Yes, it requires cooperation from the driver, and I don't think it's a
> > good idea because that would mean drivers need to do weird things which
> > hinder performance and increase complexity. 
> 
> I have a feeling it may even be impossible in some cases.
> 
> > I intend to not touch them, just leave them in separate queue.
> 
> i.e. a separate one from the "unusued ballooned 4k"?
> 

Yes. A separate one -- if the "unused ballooned 4k" queue refers to the
queue that holds balloon pages which are subject to balloon page
compaction.
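
A rough sketch of the queue split discussed here, in plain user-space C
rather than kernel code. It assumes three queues tracked only by page
count, with only the ordinary 4K balloon queue eligible for coalescing.
All identifiers (balloon_queue, BQ_*, balloon_enqueue and so on) are
invented for this example and do not exist in the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGES_PER_HUGE 512UL /* 2M / 4K, an assumption of this sketch */

/* Hypothetical queue identifiers. */
enum balloon_queue {
    BQ_HUGE_2M,    /* ballooned 2M pages */
    BQ_BALLOON_4K, /* ballooned 4K pages, subject to compaction */
    BQ_DRIVER_4K,  /* 4K pages handed to drivers, never touched */
    BQ_NR
};

/* Only page counts are tracked here; a real driver would keep lists. */
struct balloon_queues {
    unsigned long nr_pages[BQ_NR];
};

/* Only the ordinary 4K balloon queue may be coalesced into 2M pages. */
static bool queue_may_coalesce(enum balloon_queue q)
{
    return q == BQ_BALLOON_4K;
}

static void balloon_enqueue(struct balloon_queues *bq, enum balloon_queue q)
{
    assert(bq != NULL && q < BQ_NR);
    bq->nr_pages[q]++;
}

/* Upper bound on the 2M pages the coalescable queue could yield. */
static unsigned long coalescable_huge_pages(const struct balloon_queues *bq)
{
    assert(bq != NULL);
    return bq->nr_pages[BQ_BALLOON_4K] / PAGES_PER_HUGE;
}
```

Keeping the driver-owned pages in their own queue means they never
enter the count used for coalescing, no matter how many there are.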

> > > On both here and decrease there are attempts to allocate (from either
> > > the queue or the kernel, depending) 2M which could fail. I wonder if it
> > > is worth inserting "kick THP and give it a chance"? It's a question of
> > > tradeoffs in the latency of a ballooning operation vs the efficiency
> > > with which we can use 2M allocations.
> > > 
> > 
> > No, THP is not involved. I think you mean balloon page compaction.
> 
> I meant whichever global kernel facility exists to try and increase the
> supply of 2M pages for allocations, which certainly includes balloon
> page compaction, but also includes all the other types of compaction
> which can be done, no? Maybe THP isn't the right name, or perhaps I
> think the kernel has more functionality than it does in reality?
> 
> > I'm not sure if it's a good idea because we're now introducing latency
> > that we can't control.
> 
> By "give it a chance" I meant "wait some period which we control", if it
> can't succeed in that timescale then give up. So the latency would be
> bound by us (or our supplied control knob, or whatever) not by the "THP"
> side of things.
> 
> >  At the very least, if guest admin really cares he
> > / she should be able to kick off compaction voluntarily.
> 
> If the kernel can compact e.g. normal user pages then it would be better
> to do that than to balloon and compact those later, I think? (which is
> contingent on my not having overestimated what the kernel is capable
> of...)
> 

OK, I think there's some misunderstanding here. When the kernel tries to
allocate a high-order page, it already kicks off compaction (including
normal page and balloon page compaction) when the fast path fails.

So I was thinking about something like calling compact_zone or rolling
our own implementation when I saw your reply. That's a very time
consuming operation and time varies depending on kernel parameters and
the status of memory fragmentation.

Wei.

> Ian.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Xen balloon driver improvement (version 1)
  2014-10-23 13:00       ` Wei Liu
@ 2014-10-23 14:29         ` Ian Campbell
  0 siblings, 0 replies; 24+ messages in thread
From: Ian Campbell @ 2014-10-23 14:29 UTC (permalink / raw)
  To: Wei Liu
  Cc: Stefano Stabellini, Andrew Cooper, xen-devel, David Vrabel,
	Boris Ostrovsky

On Thu, 2014-10-23 at 14:00 +0100, Wei Liu wrote:
> On Thu, Oct 23, 2014 at 01:27:45PM +0100, Ian Campbell wrote:
> > On Thu, 2014-10-23 at 13:17 +0100, Wei Liu wrote:
> > > On Thu, Oct 23, 2014 at 12:59:17PM +0100, Ian Campbell wrote:
> > > > On that subject, how do you handle alloc_xenballooned_pages calls of
> > > > non-2M alignment? Would it be best to do a 2M balloon and queue the rest
> > > > for use on future similar allocations?
> > > > 
> > > > If so then I'm wondering if it might make sense to keep the spare 4K
> > > > pages from doing this on a separate queue to the normal 4K queue, in
> > > > order to keep these sorts pages isolated into 2M regions -- because I
> > > > expect that they cannot be compacted without cooperation with the driver
> > > > which allocated them (which I expect won't even be possible in many
> > > > cases).
> > > > 
> > > 
> > > Yes, it requires cooperation from the driver, and I don't think it's a
> > > good idea because that would mean drivers need to do weird things which
> > > hinder performance and increase complexity. 
> > 
> > I have a feeling it may even be impossible in some cases.
> > 
> > > I intend to not touch them, just leave them in separate queue.
> > 
> > i.e. a separate one from the "unusued ballooned 4k"?
> > 
> 
> Yes. A separate one -- if the "unused ballooned 4k" queue refers to the
> queue that holds balloon pages which are subject to balloon page
> compaction.

Correct.

> OK, I think there's some misunderstanding here. When kernel tries to
> allocate high order page, it already kicks of compaction (including
> normal page and balloon page compaction) when fast path fails.

Aha, that was what I missed, thanks!

> So I was thinking about something like calling compact_zone or rolling
> our own implementation when I saw your reply. That's a very time
> consuming operation and time varies depending on kernel parameters and
> the status of memory fragmentation.

Right, that doesn't sound desirable.

Ian.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Xen balloon driver improvement (version 1)
  2014-10-22 16:29 Xen balloon driver improvement (version 1) Wei Liu
                   ` (2 preceding siblings ...)
  2014-10-23 11:59 ` Ian Campbell
@ 2014-10-23 14:30 ` Roger Pau Monné
  2014-10-23 15:23   ` Wei Liu
  2014-10-24 13:54 ` Dario Faggioli
  4 siblings, 1 reply; 24+ messages in thread
From: Roger Pau Monné @ 2014-10-23 14:30 UTC (permalink / raw)
  To: Wei Liu, xen-devel
  Cc: Andrew Cooper, Boris Ostrovsky, David Vrabel, Ian Campbell,
	Stefano Stabellini

On 22/10/14 at 18:29, Wei Liu wrote:
> Hi all
> 
> This is my initial design to improve Xen balloon driver.
> 
> PDF version with graphs can be found at
> 
> http://xenbits.xen.org/people/liuw/xen-balloon-driver-improvement.pdf
> 
> % Xen Balloon Driver Improvement
> % Wei Liu <<wei.liu2@citrix.com>>
> 
> -------------------------------------------
> Version     Date         Changes
> -------     ----         ------------------
>   1         22/10/2014   Initial version.
> -------------------------------------------
> 
> ## Motives
> 
> 1. Balloon pages fragments guest physical address space.
> 1. Balloon compaction infrastructure can migrate ballooned pages from
>    start of zone to end of zone, hence creating contiguous guest physical
>    address space.
> 1. Having contiguous guest physical address enables some options to
>    improve performance.
> 
> ## Goal of improvement
> 
> Balloon driver makes use of as many huge pages as possible,
> defragmenting both guest address space and Xen pages. This should be
> achieved without any particular hypervisor side feature.
> 
> ## Design and implementation
> 
> When balloon driver is asked to increase / decrease reservation, it
> will always start with huge page. However, due to resource
> availability in both hypervisor and guest, it's not always possible to
> get hold of a huge page. In that case the driver will fall back to use
> normal size page. Balloon driver later will try to coalesce small size
> pages into huge page. As time goes by, both Xen and guest should use
> more and more huge pages.

All this looks quite complicated IMHO, it's adding a lot of logic to the
balloon driver. Can't you just ask the memory subsystem to allocate a
page (or pages) from a specific physical range, and force it to page
out/move what's there at allocation time?

For example I know FreeBSD has contigmalloc(9)[1] which I think could be
used to achieve this. You could start asking for pages starting at
maxpfn and go down from there, keeping fragmentation at a minimum.

[1]
https://www.freebsd.org/cgi/man.cgi?query=contigmalloc&apropos=0&sektion=0&manpath=FreeBSD+10.1-RELEASE&arch=default&format=html
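
As a very small sketch of the "start at maxpfn and go down" idea, in
plain user-space C: it only shows the top-down search and does not page
out or move anything. The in_use array and the name highest_free_pfn are
assumptions made for this example, not an existing kernel interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: scan frames [0, max_pfn) from the top and
 * return the highest frame that is not in use, or -1 if none is free.
 */
static long highest_free_pfn(const bool *in_use, size_t max_pfn)
{
    assert(in_use != NULL || max_pfn == 0);

    for (size_t pfn = max_pfn; pfn-- > 0; ) {
        if (!in_use[pfn])
            return (long)pfn;
    }

    return -1;
}
```

Always taking the highest free frame keeps the ballooned frames
clustered at the top of the address range, which is the property that
limits fragmentation.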

Roger.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Xen balloon driver improvement (version 1)
  2014-10-23 14:30 ` Roger Pau Monné
@ 2014-10-23 15:23   ` Wei Liu
  2014-10-23 15:57     ` Roger Pau Monné
  0 siblings, 1 reply; 24+ messages in thread
From: Wei Liu @ 2014-10-23 15:23 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Wei Liu, Ian Campbell, Stefano Stabellini, Andrew Cooper,
	xen-devel, David Vrabel, Boris Ostrovsky

On Thu, Oct 23, 2014 at 04:30:24PM +0200, Roger Pau Monné wrote:
> On 22/10/14 at 18:29, Wei Liu wrote:
> > Hi all
> > 
> > This is my initial design to improve Xen balloon driver.
> > 
> > PDF version with graphs can be found at
> > 
> > http://xenbits.xen.org/people/liuw/xen-balloon-driver-improvement.pdf
> > 
> > % Xen Balloon Driver Improvement
> > % Wei Liu <<wei.liu2@citrix.com>>
> > 
> > -------------------------------------------
> > Version     Date         Changes
> > -------     ----         ------------------
> >   1         22/10/2014   Initial version.
> > -------------------------------------------
> > 
> > ## Motives
> > 
> > 1. Balloon pages fragments guest physical address space.
> > 1. Balloon compaction infrastructure can migrate ballooned pages from
> >    start of zone to end of zone, hence creating contiguous guest physical
> >    address space.
> > 1. Having contiguous guest physical address enables some options to
> >    improve performance.
> > 
> > ## Goal of improvement
> > 
> > Balloon driver makes use of as many huge pages as possible,
> > defragmenting both guest address space and Xen pages. This should be
> > achieved without any particular hypervisor side feature.
> > 
> > ## Design and implementation
> > 
> > When balloon driver is asked to increase / decrease reservation, it
> > will always start with huge page. However, due to resource
> > availability in both hypervisor and guest, it's not always possible to
> > get hold of a huge page. In that case the driver will fall back to use
> > normal size page. Balloon driver later will try to coalesce small size
> > pages into huge page. As time goes by, both Xen and guest should use
> > more and more huge pages.
> 
> All this looks quite complicated IMHO, it's adding a lot of logic to the
> balloon driver. Can't you just ask the memory subsystem to allocate a
> page (or pages) from a specific physical range, and force it to page
> out/move what's there at allocation time?
> 
> For example I know FreeBSD has contigmalloc(9)[1] which I think could be
> used to achieve this. You could start asking for pages starting at
> maxpfn and go down from there, keeping fragmentation at a minimum.
> 
> [1]
> https://www.freebsd.org/cgi/man.cgi?query=contigmalloc&apropos=0&sektion=0&manpath=FreeBSD+10.1-RELEASE&arch=default&format=html
> 

Good point, but Linux doesn't have a counterpart, not that I know of.
The memblock infrastructure looks similar, but it's supposed to be used
when initialising the kernel.

Even if Linux had a similar API, it would still be less desirable: to
satisfy a contiguous PA allocation, the system needs to be relatively
quiet (if NO_WAIT / ATOMIC is set), or the API needs to sleep for an
indefinite period (waiting for the memory subsystem to squeeze out
pages).

Wei.

> Roger.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Xen balloon driver improvement (version 1)
  2014-10-23 15:23   ` Wei Liu
@ 2014-10-23 15:57     ` Roger Pau Monné
  2014-10-23 16:04       ` Ian Campbell
  0 siblings, 1 reply; 24+ messages in thread
From: Roger Pau Monné @ 2014-10-23 15:57 UTC (permalink / raw)
  To: Wei Liu
  Cc: Ian Campbell, Stefano Stabellini, Andrew Cooper, xen-devel,
	David Vrabel, Boris Ostrovsky

On 23/10/14 at 17:23, Wei Liu wrote:
> On Thu, Oct 23, 2014 at 04:30:24PM +0200, Roger Pau Monné wrote:
>> On 22/10/14 at 18:29, Wei Liu wrote:
>>> Hi all
>>>
>>> This is my initial design to improve Xen balloon driver.
>>>
>>> PDF version with graphs can be found at
>>>
>>> http://xenbits.xen.org/people/liuw/xen-balloon-driver-improvement.pdf
>>>
>>> % Xen Balloon Driver Improvement
>>> % Wei Liu <<wei.liu2@citrix.com>>
>>>
>>> -------------------------------------------
>>> Version     Date         Changes
>>> -------     ----         ------------------
>>>   1         22/10/2014   Initial version.
>>> -------------------------------------------
>>>
>>> ## Motives
>>>
>>> 1. Balloon pages fragments guest physical address space.
>>> 1. Balloon compaction infrastructure can migrate ballooned pages from
>>>    start of zone to end of zone, hence creating contiguous guest physical
>>>    address space.
>>> 1. Having contiguous guest physical address enables some options to
>>>    improve performance.
>>>
>>> ## Goal of improvement
>>>
>>> Balloon driver makes use of as many huge pages as possible,
>>> defragmenting both guest address space and Xen pages. This should be
>>> achieved without any particular hypervisor side feature.
>>>
>>> ## Design and implementation
>>>
>>> When balloon driver is asked to increase / decrease reservation, it
>>> will always start with huge page. However, due to resource
>>> availability in both hypervisor and guest, it's not always possible to
>>> get hold of a huge page. In that case the driver will fall back to use
>>> normal size page. Balloon driver later will try to coalesce small size
>>> pages into huge page. As time goes by, both Xen and guest should use
>>> more and more huge pages.
>>
>> All this looks quite complicated IMHO, it's adding a lot of logic to the
>> balloon driver. Can't you just ask the memory subsystem to allocate a
>> page (or pages) from a specific physical range, and force it to page
>> out/move what's there at allocation time?
>>
>> For example I know FreeBSD has contigmalloc(9)[1] which I think could be
>> used to achieve this. You could start asking for pages starting at
>> maxpfn and go down from there, keeping fragmentation at a minimum.
>>
>> [1]
>> https://www.freebsd.org/cgi/man.cgi?query=contigmalloc&apropos=0&sektion=0&manpath=FreeBSD+10.1-RELEASE&arch=default&format=html
>>
> 
> Good point. Just that Linux doesn't have a counterpart, not that I know
> of. Memblock infrastructure looks similar but it's supposed to be used
> when initialising kernel.
> 
> Even if Linux has similar API, it's still less desirable because to
> satisfy a contiguous PA allocation, the system needs to be relative
> quiet (if NO_WAIT / ATOMIC is set), or the API needs to sleep for
> indefinite period (wait for memory subsystem to squeeze out pages).

There's no restriction on the time it might take for a guest to balloon
out. IMHO I would rather add a new interface to the Linux VM subsystem
that tries to accomplish this rather than adding a bunch of logic
specific to the balloon driver.

In general you should be able to reclaim memory quite fast (by either
moving it to another region or swapping it to disk). In case of finding
a page that's wired I would just leave it as is, since I guess this
would not be quite common, and maybe retry after a certain period.

Roger.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Xen balloon driver improvement (version 1)
  2014-10-23 15:57     ` Roger Pau Monné
@ 2014-10-23 16:04       ` Ian Campbell
  2014-10-23 16:12         ` Wei Liu
  0 siblings, 1 reply; 24+ messages in thread
From: Ian Campbell @ 2014-10-23 16:04 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Wei Liu, Stefano Stabellini, Andrew Cooper, xen-devel,
	David Vrabel, Boris Ostrovsky

On Thu, 2014-10-23 at 17:57 +0200, Roger Pau Monné wrote:
> On 23/10/14 at 17:23, Wei Liu wrote:
> > On Thu, Oct 23, 2014 at 04:30:24PM +0200, Roger Pau Monné wrote:
> >> On 22/10/14 at 18:29, Wei Liu wrote:
> >>> Hi all
> >>>
> >>> This is my initial design to improve Xen balloon driver.
> >>>
> >>> PDF version with graphs can be found at
> >>>
> >>> http://xenbits.xen.org/people/liuw/xen-balloon-driver-improvement.pdf
> >>>
> >>> % Xen Balloon Driver Improvement
> >>> % Wei Liu <<wei.liu2@citrix.com>>
> >>>
> >>> -------------------------------------------
> >>> Version     Date         Changes
> >>> -------     ----         ------------------
> >>>   1         22/10/2014   Initial version.
> >>> -------------------------------------------
> >>>
> >>> ## Motives
> >>>
> >>> 1. Balloon pages fragments guest physical address space.
> >>> 1. Balloon compaction infrastructure can migrate ballooned pages from
> >>>    start of zone to end of zone, hence creating contiguous guest physical
> >>>    address space.
> >>> 1. Having contiguous guest physical address enables some options to
> >>>    improve performance.
> >>>
> >>> ## Goal of improvement
> >>>
> >>> Balloon driver makes use of as many huge pages as possible,
> >>> defragmenting both guest address space and Xen pages. This should be
> >>> achieved without any particular hypervisor side feature.
> >>>
> >>> ## Design and implementation
> >>>
> >>> When balloon driver is asked to increase / decrease reservation, it
> >>> will always start with huge page. However, due to resource
> >>> availability in both hypervisor and guest, it's not always possible to
> >>> get hold of a huge page. In that case the driver will fall back to use
> >>> normal size page. Balloon driver later will try to coalesce small size
> >>> pages into huge page. As time goes by, both Xen and guest should use
> >>> more and more huge pages.
> >>
> >> All this looks quite complicated IMHO, it's adding a lot of logic to the
> >> balloon driver. Can't you just ask the memory subsystem to allocate a
> >> page (or pages) from a specific physical range, and force it to page
> >> out/move what's there at allocation time?
> >>
> >> For example I know FreeBSD has contigmalloc(9)[1] which I think could be
> >> used to achieve this. You could start asking for pages starting at
> >> maxpfn and go down from there, keeping fragmentation at a minimum.
> >>
> >> [1]
> >> https://www.freebsd.org/cgi/man.cgi?query=contigmalloc&apropos=0&sektion=0&manpath=FreeBSD+10.1-RELEASE&arch=default&format=html
> >>
> > 
> > Good point. Just that Linux doesn't have a counterpart, not that I know
> > of. Memblock infrastructure looks similar but it's supposed to be used
> > when initialising kernel.
> > 
> > Even if Linux has similar API, it's still less desirable because to
> > satisfy a contiguous PA allocation, the system needs to be relative
> > quiet (if NO_WAIT / ATOMIC is set), or the API needs to sleep for
> > indefinite period (wait for memory subsystem to squeeze out pages).
> 
> There's no restriction on the time it might take for a guest to balloon
> out. IMHO I would rather add a new interface to the Linux VM subsystem
> that tries to accomplish this rather than adding a bunch of logic
> specific to the balloon driver.

AIUI this already exists and Wei is simply hooking up the necessary
callbacks into the balloon driver.

Ian.



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Xen balloon driver improvement (version 1)
  2014-10-23 16:04       ` Ian Campbell
@ 2014-10-23 16:12         ` Wei Liu
  0 siblings, 0 replies; 24+ messages in thread
From: Wei Liu @ 2014-10-23 16:12 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Wei Liu, Stefano Stabellini, Andrew Cooper, xen-devel,
	David Vrabel, Boris Ostrovsky, Roger Pau Monné

On Thu, Oct 23, 2014 at 05:04:50PM +0100, Ian Campbell wrote:
[...]
> > >> For example I know FreeBSD has contigmalloc(9)[1] which I think could be
> > >> used to achieve this. You could start asking for pages starting at
> > >> maxpfn and go down from there, keeping fragmentation at a minimum.
> > >>
> > >> [1]
> > >> https://www.freebsd.org/cgi/man.cgi?query=contigmalloc&apropos=0&sektion=0&manpath=FreeBSD+10.1-RELEASE&arch=default&format=html
> > >>
> > > 
> > > Good point, but Linux doesn't have a counterpart, at least not that I
> > > know of. The memblock infrastructure looks similar, but it's meant to
> > > be used while the kernel is initialising.
> > > 
> > > Even if Linux had a similar API, it would still be less desirable: to
> > > satisfy a contiguous PA allocation, either the system needs to be
> > > relatively quiet (if NO_WAIT / ATOMIC is set), or the API needs to
> > > sleep for an indefinite period (waiting for the memory subsystem to
> > > squeeze out pages).
> > 
> > There's no restriction on the time it might take for a guest to balloon
> > out. IMHO I would rather add a new interface to the Linux VM subsystem
> > that tries to accomplish this rather than adding a bunch of logic
> > specific to the balloon driver.
> 
> AIUI this already exists and Wei is simply hooking up the necessary
> callbacks into the balloon driver.
> 

That's right.

> Ian.

* Re: Xen balloon driver improvement (version 1)
  2014-10-22 16:29 Xen balloon driver improvement (version 1) Wei Liu
                   ` (3 preceding siblings ...)
  2014-10-23 14:30 ` Roger Pau Monné
@ 2014-10-24 13:54 ` Dario Faggioli
  2014-10-24 14:04   ` Wei Liu
  4 siblings, 1 reply; 24+ messages in thread
From: Dario Faggioli @ 2014-10-24 13:54 UTC (permalink / raw)
  To: Wei Liu
  Cc: Ian Campbell, Stefano Stabellini, Andrew Cooper, xen-devel,
	David Vrabel, Boris Ostrovsky


On Wed, 2014-10-22 at 17:29 +0100, Wei Liu wrote:
> Hi all
> 
> This is my initial design to improve Xen balloon driver.
> 
> PDF version with graphs can be found at
> 
> http://xenbits.xen.org/people/liuw/xen-balloon-driver-improvement.pdf
> 
Just FYI, there's a "NUMA angle" of the whole balloon driver improvement
process, the bulk of which is how to make the balloon driver (v-)NUMA
aware.

Some (quite interesting, actually) discussion on that has happened
already; it can be found in this message and in the subthread it
generated:

  http://lists.xen.org/archives/html/xen-devel/2013-08/msg01691.html

It's probably almost completely orthogonal to what's discussed here and,
more importantly, vNUMA topology support in the guest is a prerequisite
for it. But, especially considering you've been involved in vNUMA too, I
thought I would at least mention it. :-)

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


* Re: Xen balloon driver improvement (version 1)
  2014-10-24 13:54 ` Dario Faggioli
@ 2014-10-24 14:04   ` Wei Liu
  0 siblings, 0 replies; 24+ messages in thread
From: Wei Liu @ 2014-10-24 14:04 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Wei Liu, Ian Campbell, Stefano Stabellini, Andrew Cooper,
	xen-devel, David Vrabel, Boris Ostrovsky

On Fri, Oct 24, 2014 at 03:54:22PM +0200, Dario Faggioli wrote:
> On Wed, 2014-10-22 at 17:29 +0100, Wei Liu wrote:
> > Hi all
> > 
> > This is my initial design to improve Xen balloon driver.
> > 
> > PDF version with graphs can be found at
> > 
> > http://xenbits.xen.org/people/liuw/xen-balloon-driver-improvement.pdf
> > 
> Just FYI, there's a "NUMA angle" of the whole balloon driver improvement
> process, the bulk of which is how to make the balloon driver (v-)NUMA
> aware.
> 
> Some (quite interesting, actually) discussion on that has happened
> already; it can be found in this message and in the subthread it
> generated:
> 
>   http://lists.xen.org/archives/html/xen-devel/2013-08/msg01691.html
> 
> It's probably almost completely orthogonal to what's discussed here and,
> more importantly, vNUMA topology support in the guest is a prerequisite
> for it. But, especially considering you've been involved in vNUMA too, I
> thought I would at least mention it. :-)
> 

Thanks for the pointer.

This is indeed orthogonal to vNUMA. The only connection I can think of
is that the exchange operation should take the node mapping into account
as well. This should be easy once the necessary information is in place.
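
As a rough illustration of that node-mapping point, here is a minimal toy
sketch in C. Nothing in it is a real Xen or Linux interface:
`struct candidate`, `pick_exchange_target()` and the cross-node fallback
policy are all hypothetical, assumed only for this example. It simply
prefers a frame on the same virtual node as the page being exchanged.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a frame that could take part in an exchange. */
struct candidate {
    unsigned long pfn;  /* guest frame number */
    int node;           /* virtual NUMA node the frame is mapped to */
};

/*
 * Return the index of the candidate to exchange with, or -1 if the
 * list is empty. A frame on want_node is preferred so that the node
 * mapping is preserved. Falling back to the first frame on another
 * node is an assumed policy, not something the design specifies.
 */
static int pick_exchange_target(const struct candidate *c, size_t n,
                                int want_node)
{
    int fallback = -1;

    assert(c != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (c[i].node == want_node)
            return (int)i;      /* same node: mapping stays intact */
        if (fallback < 0)
            fallback = (int)i;  /* remember first cross-node frame */
    }
    return fallback;
}
```

A stricter variant could return -1 instead of falling back, so that an
exchange never moves memory across nodes.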

I will mention vNUMA in a later version.

Wei.

> Regards,
> Dario
> 
> -- 
> <<This happens because I choose it to happen!>> (Raistlin Majere)
> -----------------------------------------------------------------
> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
> 
