Linux Xen Balloon Driver Improvement (Draft 2)


* Linux Xen Balloon Driver Improvement (Draft 2)
@ 2014-10-27 12:33 Wei Liu
  2014-10-27 14:23 ` David Vrabel
  2014-12-15 10:52 ` David Vrabel
  0 siblings, 2 replies; 11+ messages in thread
From: Wei Liu @ 2014-10-27 12:33 UTC
  To: xen-devel
  Cc: wei.liu2, Ian Campbell, Stefano Stabellini, Andrew Cooper,
	Dario Faggioli, David Vrabel, Boris Ostrovsky

Hi all

This is draft 2 of the design.

A PDF version can be found at

http://xenbits.xen.org/people/liuw/xen-balloon-driver-improvement.pdf

Changes in this version:

1. Style, grammar and typo fixes.
2. Make this document Linux centric.
3. Add a new section for NUMA-aware ballooning.

% Linux Xen Balloon Driver Improvement
% Wei Liu <<wei.liu2@citrix.com>>
% Draft 2

----------------------------------------------------------
Version     Date         Changes
-------     ---------    ---------------------------------
2           24/10/2014   Style fixes, more clarifications.

1           22/10/2014   Initial version.
----------------------------------------------------------

## Introduction

This document describes a design to improve the Xen balloon driver in Linux.

## Motives

1. Balloon pages fragment the guest physical address space.
2. The balloon compaction infrastructure can migrate ballooned pages from
   the start of a Linux memory zone to the end of the zone, hence creating
   contiguous guest physical address space.
3. Having contiguous guest physical address space enables some options to
   improve performance.

## Goal of improvement

The balloon driver makes use of as many huge pages as possible,
defragmenting guest address space. Contiguous guest address space
permits huge page ballooning which helps prevent host address space
fragmentation.

This should be achieved without any particular hypervisor side
feature.

## Design and implementation

When the balloon driver is asked to increase / decrease reservation,
it will always start with a huge page. However, due to resource
availability in both hypervisor and guest, it's not always possible to
get hold of a huge page. In that case the driver will fall back to
using normal size pages. The balloon driver will later try to coalesce
small size pages into huge pages. As time goes by, both Xen and the
guest should use more and more huge pages.
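
The huge-page-first policy with fallback can be sketched as below. This is only an illustration under assumptions: `plan_reservation` and `huge_available` are hypothetical names, and the 512 ratio assumes 4K normal pages and 2M huge pages as in the flowcharts.

```python
PAGES_PER_HUGE = 512  # 2M / 4K, the sizes assumed in the flowcharts


def plan_reservation(nr_pages, huge_available):
    """Split a request of nr_pages 4K pages into (nr_huge, nr_small).

    huge_available is a hypothetical count of 2M pages that both the
    guest and the hypervisor can still provide. Huge pages are used
    first; whatever cannot be covered falls back to 4K pages.
    """
    nr_huge = min(nr_pages // PAGES_PER_HUGE, huge_available)
    nr_small = nr_pages - nr_huge * PAGES_PER_HUGE
    return nr_huge, nr_small
```

With plenty of huge pages a request for 1300 pages uses two huge pages plus 276 normal pages; with only one huge page available, the rest of the request falls back to normal pages.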

To achieve the said goal, several changes will be made:

1. Make use of balloon page compaction.
2. Maintain multiple queues for pages of different sizes and purposes.
3. Periodically exchange normal size pages with huge pages.

### Make use of balloon page compaction

Balloon page migration moves balloon pages from the start of a zone to
the end of the zone, making guest physical address space contiguous. This
gives the balloon driver a chance to allocate huge pages in order to
coalesce small pages.

Currently, the Xen balloon driver gets its pages directly from the page
allocator. To enable balloon page migration, those pages now need to
be allocated from the core balloon driver. Pages allocated from the core
balloon driver are subject to balloon page compaction.

The use of Linux balloon page compaction doesn't require introducing
new interfaces between Xen balloon driver and the rest of the
system. Most changes are internal to Xen balloon driver.

The Xen balloon driver will also need to provide a callback to migrate
a balloon page. In essence the callback function receives "old page", which
is an already ballooned out page, and "new page", which is a page to be
ballooned out, then it inflates "old page" and deflates "new page".

The core of migration callback is XENMEM\_exchange hypercall. This
makes sure that inflation of old page and deflation of new page is
done atomically, so even if a domain is beyond its memory target and
the target is being enforced, it can still compact memory.
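
A toy model may help show why doing both halves in one operation matters. It is not the Xen interface: `ToyDomain`, `populate` and `exchange` are hypothetical names, and the model only tracks which guest frames are populated. The point illustrated is that an exchange leaves the number of populated pages unchanged, so an enforced target does not block it.

```python
class ToyDomain:
    """Hypothetical model of a domain with an enforced memory target."""

    def __init__(self, populated, target):
        self.populated = set(populated)  # guest frames backed by memory
        self.target = target             # enforced limit on populated frames

    def populate(self, pfn):
        """Plain increase of reservation: refused at or above the target."""
        if len(self.populated) >= self.target:
            return False
        self.populated.add(pfn)
        return True

    def exchange(self, give_up, take):
        """Swap backing from give_up to take in one step.

        The populated count is the same before and after, so the target
        check that blocks populate() never comes into play here.
        """
        if give_up not in self.populated or take in self.populated:
            return False
        self.populated.remove(give_up)
        self.populated.add(take)
        return True
```

In this model a domain that is already beyond its target cannot populate a new frame, yet it can still move backing memory from one frame to another, which is what compaction needs.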

### Maintain multiple queues for pages of different sizes and purposes

We maintain multiple queues for pages of different sizes inside Xen
balloon driver, so that Xen balloon worker thread can coalesce smaller
size pages into one larger size page. Queues for special-purpose
pages, such as balloon pages used to map foreign pages, are also
maintained. These special-purpose pages are not subject to migration
and page coalescence.

For instance, balloon driver can maintain three queues:

1. queue for 2 MB pages
2. queue for 4 KB pages (delegated to core balloon driver)
3. queue for pages used to map pages from other domains

More queues can be added when necessary, but for now one queue for
normal pages and one queue for huge pages should be enough.
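
A minimal sketch of the three-queue layout, assuming plain in-memory queues. `BalloonQueues` and `coalescing_candidates` are hypothetical names chosen for illustration; in the design the 4 KB queue is delegated to the core balloon driver rather than held like this.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class BalloonQueues:
    """Hypothetical container for the three queues listed above."""

    huge: deque = field(default_factory=deque)     # 2 MB pages
    small: deque = field(default_factory=deque)    # 4 KB pages
    foreign: deque = field(default_factory=deque)  # pages mapping other domains

    def coalescing_candidates(self):
        """Only 4 KB pages may be coalesced; foreign-mapping pages are
        never touched by migration or coalescing."""
        return list(self.small)
```

Keeping the special-purpose pages in their own queue makes the exclusion structural: code that walks the small-page queue cannot pick them up by accident.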

### Periodically exchange normal size pages with huge pages

The worker thread wakes up periodically to check if there are enough pages
in the normal size page queue to coalesce into a huge page. If so, it will
try to exchange that huge page for a number of normal size pages with
the XENMEM\_exchange hypercall.
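
One pass of the worker could look roughly like the sketch below. It rests on assumptions: `coalesce_once` is a hypothetical name, and `exchange` is a caller-supplied stand-in for the XENMEM\_exchange step that returns a huge page on success or `None` on failure. The requeue on failure mirrors the behaviour that the flowcharts mention but do not draw.

```python
from collections import deque

PAGES_PER_HUGE = 512  # 2M / 4K


def coalesce_once(small_queue, huge_queue, exchange):
    """Try to trade 512 queued 4K pages for one huge page.

    Returns True if a huge page was queued, False otherwise.
    """
    if len(small_queue) < PAGES_PER_HUGE:
        return False  # not enough small pages yet
    batch = [small_queue.popleft() for _ in range(PAGES_PER_HUGE)]
    huge = exchange(batch)
    if huge is None:
        # Requeue on failure, restoring the original order.
        small_queue.extendleft(reversed(batch))
        return False
    huge_queue.append(huge)
    return True
```

The worker would call this each time it wakes up; a failed exchange leaves both queues exactly as they were.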

## Relationship with NUMA-aware ballooning

Another orthogonal improvement to Linux balloon driver is NUMA-aware
ballooning.

The use of balloon page compaction will not interfere with NUMA-aware
ballooning because balloon compaction, which is part of Linux's memory
subsystem, is already NUMA-aware.

All the changes proposed in this design can be made NUMA-aware
provided virtual NUMA topology information is in place.

## Flowcharts

These flowcharts assume normal page size is 4K and huge page size is
2M.  They show how two queues are maintained. Please note that
"requeue on failure" is not drawn on the flowcharts to make the
flowcharts easier to reason about.

![Increase Reservation](increase-reservation.png)

![Decrease Reservation](decrease-reservation.png)

![Exchange Pages](exchange-pages.png)


* Re: Linux Xen Balloon Driver Improvement (Draft 2)
  2014-10-27 12:33 Linux Xen Balloon Driver Improvement (Draft 2) Wei Liu
@ 2014-10-27 14:23 ` David Vrabel
  2014-10-27 16:29   ` Wei Liu
  2014-12-15 10:52 ` David Vrabel
  1 sibling, 1 reply; 11+ messages in thread
From: David Vrabel @ 2014-10-27 14:23 UTC
  To: Wei Liu, xen-devel
  Cc: Ian Campbell, Stefano Stabellini, Andrew Cooper, Dario Faggioli,
	David Vrabel, Boris Ostrovsky

On 27/10/14 12:33, Wei Liu wrote:
> 
> Changes in this version:
> 
> 1. Style, grammar and typo fixes.
> 2. Make this document Linux centric.
> 3. Add a new section for NUMA-aware ballooning.

You've not included the required changes to the toolstack and
autoballoon driver to always use 2M multiples when creating VMs and
setting targets.

> ## Introduction
> 
> This document describe a design to improve Xen balloon driver in Linux.

"Linux balloon driver for Xen guests"?

> ## Goal of improvement
> 
> The balloon driver makes use of as many huge pages as possible,
> defragmenting guest address space. Contiguous guest address space
> permits huge page ballooning which helps prevent host address space
> fragmentation.
> 
> This should be achieved without any particular hypervisor side
> feature.

I really think you need to be taking whole-system view and not focusing
on just the guest balloon driver.

> ### Make use of balloon page compaction
> 
> The core of migration callback is XENMEM\_exchange hypercall. This
> makes sure that inflation of old page and deflation of new page is
> done atomically, so even if a domain is beyond its memory target and
> the target is being enforced, it can still compact memory.

Having looked at what XENMEM_exchange actually does, I can't see how
you're using it to give this behaviour.

IMO, XENMEM_exchange should probably be renamed XENMEM_repopulate or
something.

> ### Periodically exchange normal size pages with huge pages
> 
> Worker thread wakes up periodically to check if there are enough pages
> in normal size page queue to coalesce into a huge page. If so, it will
> try to exchange that huge page into a number of normal size pages with
> XENMEM\_exchange hypercall.

I don't see what this is supposed to achieve.  This is going to take a
(potentially) non-fragmented superpage and fragment it.

Your set of 512 4k ballooned pages needs to be ordered, contiguous and
superpage aligned, for this to be any use.
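
The condition can be made concrete with a small check over a set of ballooned guest frame numbers. This is a sketch under assumptions: `aligned_superpage_runs` is a hypothetical helper, and frames are plain integers with 512 frames per 2M superpage.

```python
PAGES_PER_HUGE = 512  # 4K frames per 2M superpage


def aligned_superpage_runs(ballooned_pfns):
    """Return the sorted base frames of every 2M-aligned block whose
    512 frames are all present in ballooned_pfns."""
    pfns = set(ballooned_pfns)
    # Each frame belongs to exactly one aligned block; collect the bases.
    bases = {pfn - pfn % PAGES_PER_HUGE for pfn in pfns}
    return sorted(
        base
        for base in bases
        if all(base + i in pfns for i in range(PAGES_PER_HUGE))
    )
```

A run of 512 contiguous frames that starts at an unaligned frame spans two aligned blocks and fills neither, so it yields no usable superpage.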

> ## Relationship with NUMA-aware ballooning
> 
> Another orthogonal improvement to Linux balloon driver is NUMA-aware
> ballooning.
> 
> The use of balloon page compaction will not interfere with NUMA-ware
> ballooning because balloon compaction, which is part of Linux's memory
> subsystem, is already NUMA-aware.
> 
> All the changes proposed in this design can be made NUMA-aware
> provided virtual NUMA topology information is in place.

How?

David


* Re: Linux Xen Balloon Driver Improvement (Draft 2)
  2014-10-27 14:23 ` David Vrabel
@ 2014-10-27 16:29   ` Wei Liu
  2014-10-27 17:29     ` David Vrabel
  0 siblings, 1 reply; 11+ messages in thread
From: Wei Liu @ 2014-10-27 16:29 UTC
  To: David Vrabel
  Cc: Wei Liu, Ian Campbell, Stefano Stabellini, Andrew Cooper,
	Dario Faggioli, xen-devel, Boris Ostrovsky

On Mon, Oct 27, 2014 at 02:23:22PM +0000, David Vrabel wrote:
> On 27/10/14 12:33, Wei Liu wrote:
> > 
> > Changes in this version:
> > 
> > 1. Style, grammar and typo fixes.
> > 2. Make this document Linux centric.
> > 3. Add a new section for NUMA-aware ballooning.
> 
> You've not included the required changes to the toolstack and
> autoballoon driver to always use 2M multiples when creating VMs and
> setting targets.
> 

When creating VM, toolstack already tries to use as many huge pages as
possible.

Setting target doesn't use 2M multiples.  But I don't think this is
necessary. To balloon in / out X MB memory

  nr_2m = X / 2M
  nr_4k = (X % 2M) / 4k

The remainder just goes to 4K queue.
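
As a worked example of that split, reading it as whole 2M pages first with the remainder in 4K pages (which is what "the remainder just goes to 4K queue" describes). `split_request` is a hypothetical helper and sizes are in bytes.

```python
SZ_4K = 4 * 1024
SZ_2M = 2 * 1024 * 1024


def split_request(nr_bytes):
    """Return (nr_2m, nr_4k) for a balloon request of nr_bytes."""
    nr_2m = nr_bytes // SZ_2M            # whole 2M pages
    nr_4k = (nr_bytes % SZ_2M) // SZ_4K  # remainder in 4K pages
    return nr_2m, nr_4k
```

For a 5 MB request this gives two 2M pages and 256 4K pages for the remaining 1 MB.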

And what do you mean by "autoballoon" driver? Do you mean functionality
of xl? In the end the request is still fulfilled by Xen balloon driver
in kernel. So if dom0 is using the new balloon driver proposed here, it
should balloon down in 2M multiples automatically.

> > ## Introduction
> > 
> > This document describe a design to improve Xen balloon driver in Linux.
> 
> "Linux balloon driver for Xen guests"?
> 

Sure.

> > ## Goal of improvement
> > 
> > The balloon driver makes use of as many huge pages as possible,
> > defragmenting guest address space. Contiguous guest address space
> > permits huge page ballooning which helps prevent host address space
> > fragmentation.
> > 
> > This should be achieved without any particular hypervisor side
> > feature.
> 
> I really think you need to be taking whole-system view and not focusing
> on just the guest balloon driver.
> 

I don't think there's terribly tight linkage between hypervisor side
change and guest side change. This design doesn't involve new hypervisor
interface and I intend it to remain so.

The idea is to have the guest automatically defragment its address space
while at the same time helping prevent hypervisor memory from
fragmenting (at least this is what the design aims for; how it works in
practice needs to be prototyped and benchmarked).

The above reasoning is good enough to justify this change, isn't it?

I think Ian Campbell explained better than me in another email. To quote
him verbatim:

<QUOTE>
Compaction on the guest side serves two purposes immediately even
without hypervisor side compaction: Firstly it increases the chances of
being able to allocate a 2M page when required to balloon one out,
either right now or at some point in the future, IOW it helps towards
the goal of doing as much ballooning as possible in 2M chunks.

Secondly it means that we will end up with contiguous 2M holes which
will give the opportunity for future balloon operations to up with 2M
mappings, this is useful in its own right even if it is neutral wrt the
fragmentation of the populated 2M regions right now (and we know it
can't make things worse in that regard).
</QUOTE>

If you have any concrete concerns we can talk about them case by case. If
you have any concerns about linkage between guest and hypervisor we can
also analyse them further.

> > ### Make use of balloon page compaction
> > 
> > The core of migration callback is XENMEM\_exchange hypercall. This
> > makes sure that inflation of old page and deflation of new page is
> > done atomically, so even if a domain is beyond its memory target and
> > the target is being enforced, it can still compact memory.
> 
> Having looked at what XENMEM_exchange actually does, I can't see how
> you're using it to give this behaviour.
> 

??

Doesn't it guarantee atomicity (a single hypercall)? Isn't it able to
exchange pages even if target is enforced? (Note the MEMF_no_refcount
when calling steal_page / assign_pages).

So which aspect do you think doesn't work? Can you make this clearer
so that I can answer your question better?

> IMO, XENMEM_exchange should probably be renamed XENMEM_repopulate or
> something.
> 

I will leave it to hypervisor maintainer.  TBH I don't think
XENMEM_repopulate reflects the nature of this hypercall either.

> > ### Periodically exchange normal size pages with huge pages
> > 
> > Worker thread wakes up periodically to check if there are enough pages
> > in normal size page queue to coalesce into a huge page. If so, it will
> > try to exchange that huge page into a number of normal size pages with
> > XENMEM\_exchange hypercall.
> 
> I don't see what this is supposed to achieve.  This is going to take a
> (potentially) non-fragmented superpage and fragment it.
> 

Let's look at this from start of day.

Guest always tries to balloon in / out as many 2M pages as possible. So
if we have a long list of 4K pages, it means the underlying host super
frames are fragmented already.

So if 1) there are enough 4K pages in ballooned out list, 2) there is a
spare 2M page, it means that the 2M page comes from the result of
balloon page compaction, which means the underlying host super frame is
fragmented.

What this tries to achieve is that we build up a cycle to create chances
to balloon in / out 2M pages. As you're releasing a 2M page backed by
512 4K pages then balloon it back, that 2M page can be backed by a 2M
host frame.

> Your set of 512 4k ballooned pages needs to be ordered, contiguous and
> superpage aligned, for this to be any use.
> 

The idea to promote sorted aligned pages from 4K list to 2M list is of
course achievable and probably easier to reason about. But it won't help
prevent hypervisor side fragmentation, as it doesn't involve
exchanging memory when doing promotion. However in the end it might still be
able to build up a cycle to help prevent host fragmentation.

I plan to prototype both and choose the one that works better.  In any
case, this is implementation detail.

> > ## Relationship with NUMA-aware ballooning
> > 
> > Another orthogonal improvement to Linux balloon driver is NUMA-aware
> > ballooning.
> > 
> > The use of balloon page compaction will not interfere with NUMA-ware
> > ballooning because balloon compaction, which is part of Linux's memory
> > subsystem, is already NUMA-aware.
> > 
> > All the changes proposed in this design can be made NUMA-aware
> > provided virtual NUMA topology information is in place.
> 
> How?

The exchange hypercall accepts node information. So it's potentially the
same level of work as to make balloon driver NUMA-aware (the increase /
decrease hypercall).

Wei.

> 
> David


* Re: Linux Xen Balloon Driver Improvement (Draft 2)
  2014-10-27 16:29   ` Wei Liu
@ 2014-10-27 17:29     ` David Vrabel
  2014-10-27 19:10       ` Wei Liu
  2014-10-27 19:14       ` Wei Liu
  0 siblings, 2 replies; 11+ messages in thread
From: David Vrabel @ 2014-10-27 17:29 UTC
  To: Wei Liu, David Vrabel
  Cc: Ian Campbell, Stefano Stabellini, Andrew Cooper, Dario Faggioli,
	xen-devel, Boris Ostrovsky

On 27/10/14 16:29, Wei Liu wrote:
> On Mon, Oct 27, 2014 at 02:23:22PM +0000, David Vrabel wrote:
>> On 27/10/14 12:33, Wei Liu wrote:
>>>
>>> Changes in this version:
>>>
>>> 1. Style, grammar and typo fixes.
>>> 2. Make this document Linux centric.
>>> 3. Add a new section for NUMA-aware ballooning.
>>
>> You've not included the required changes to the toolstack and
>> autoballoon driver to always use 2M multiples when creating VMs and
>> setting targets.
>>
> 
> When creating VM, toolstack already tries to use as many huge pages as
> possible.
> 
> Setting target doesn't use 2M multiples.  But I don't think this is
> necessary. To balloon in / out X MB memory
> 
>   nr_2m = X / 2M
>   nr_4k = (X % 2M) / 4k
> 
> The remainder just goes to 4K queue.

I understand that it will work with 4k multiples but it is not /optimal/
to do so since it will result in more fragmentation.

> And what do you mean by "autoballoon" driver? Do you mean functionality
> of xl? In the end the request is still fulfilled by Xen balloon driver
> in kernel. So if dom0 is using the new balloon driver proposed here, it
> should balloon down in 2M multiples automatically.

Both xl and the auto-balloon driver in the kernel should only set the
target in multiples of 2M.

>>> ## Goal of improvement
>>>
>>> The balloon driver makes use of as many huge pages as possible,
>>> defragmenting guest address space. Contiguous guest address space
>>> permits huge page ballooning which helps prevent host address space
>>> fragmentation.
>>>
>>> This should be achieved without any particular hypervisor side
>>> feature.
>>
>> I really think you need to be taking whole-system view and not focusing
>> on just the guest balloon driver.
>>
> 
> I don't think there's terribly tight linkage between hypervisor side
> change and guest side change.

I don't see how you can think this unless you also have a design for the
hypervisor side.

I do not want a situation where effective and efficient host
defragmentation requires balloon driver changes to avoid a regression.

> To have guest automatically defragmenting it's address space while at
> the same time helps prevent hypervisor memory from fragmenting (at least
> this is what the design aims for, as for how it works in practice, it
> needs to be prototyped and benchmarked).
> 
> The above reasoning is good enough to justify this change, isn't it?

Having a whole system design does not mean that it must be all
implemented.  If one part has benefits independently from the rest then
it can be implemented and merged.

>>> ### Make use of balloon page compaction
>>>
>>> The core of migration callback is XENMEM\_exchange hypercall. This
>>> makes sure that inflation of old page and deflation of new page is
>>> done atomically, so even if a domain is beyond its memory target and
>>> the target is being enforced, it can still compact memory.
>>
>> Having looked at what XENMEM_exchange actually does, I can't see how
>> you're using it to give this behaviour.

Never mind.  I misread the docs.

>>> ### Periodically exchange normal size pages with huge pages
>>>
>>> Worker thread wakes up periodically to check if there are enough pages
>>> in normal size page queue to coalesce into a huge page. If so, it will
>>> try to exchange that huge page into a number of normal size pages with
>>> XENMEM\_exchange hypercall.
>>
>> I don't see what this is supposed to achieve.  This is going to take a
>> (potentially) non-fragmented superpage and fragment it.
>>
> 
> Let's look at this from start of day.
> 
> Guest always tries to balloon in / out as many 2M pages as possible. So
> if we have a long list of 4K pages, it means the underlying host super
> frames are fragmented already.
> 
> So if 1) there are enough 4K pages in ballooned out list, 2) there is a
> spare 2M page, it means that the 2M page comes from the result of
> balloon page compaction, which means the underlying host super frame is
> fragmented.

This assumption is only true because your page migration isn't trying
hard enough to defragment super frames, and it is assuming that Xen does
nothing to address host super frame fragmentation.  This highlights the
importance of looking at designs at a system level, IMO.

David


* Re: Linux Xen Balloon Driver Improvement (Draft 2)
  2014-10-27 17:29     ` David Vrabel
@ 2014-10-27 19:10       ` Wei Liu
  2014-10-27 19:42         ` Stefano Stabellini
  2014-10-27 19:14       ` Wei Liu
  1 sibling, 1 reply; 11+ messages in thread
From: Wei Liu @ 2014-10-27 19:10 UTC
  Cc: Wei Liu, Ian Campbell, Stefano Stabellini, Andrew Cooper,
	Dario Faggioli, xen-devel, Boris Ostrovsky

On Mon, Oct 27, 2014 at 05:29:16PM +0000, David Vrabel wrote:
> On 27/10/14 16:29, Wei Liu wrote:
> > On Mon, Oct 27, 2014 at 02:23:22PM +0000, David Vrabel wrote:
> >> On 27/10/14 12:33, Wei Liu wrote:
> >>>
> >>> Changes in this version:
> >>>
> >>> 1. Style, grammar and typo fixes.
> >>> 2. Make this document Linux centric.
> >>> 3. Add a new section for NUMA-aware ballooning.
> >>
> >> You've not included the required changes to the toolstack and
> >> autoballoon driver to always use 2M multiples when creating VMs and
> >> setting targets.
> >>
> > 
> > When creating VM, toolstack already tries to use as many huge pages as
> > possible.
> > 
> > Setting target doesn't use 2M multiples.  But I don't think this is
> > necessary. To balloon in / out X MB memory
> > 
> >   nr_2m = X / 2M
> >   nr_4k = (X % 2M) / 4k
> > 
> > The remainder just goes to 4K queue.
> 
> I understand that it will work with 4k multiples but it is not /optimal/
> to do so since it will result in more fragmentation.
> 

The fragmentation should be less than 2M, right? Is that terrible?

> > And what do you mean by "autoballoon" driver? Do you mean functionality
> > of xl? In the end the request is still fulfilled by Xen balloon driver
> > in kernel. So if dom0 is using the new balloon driver proposed here, it
> > should balloon down in 2M multiples automatically.
> 
> Both xl and the auto-balloon driver in the kernel should only set the
> target in multiples of 2M.
> 

This is easy to achieve. I can always round up to 2M multiples in
balloon driver.  Change to toolstack is simple as well. It's just that
a newer kernel may well run on an older toolstack, or even on
homebrew toolstacks that don't set target to 2M multiples. So after all
there are always suboptimal situations, be it 1) ballooning out a bit more
memory than requested or 2) a little bit of fragmentation.

I don't have very strong opinion on this. I will round up to 2M
multiples in balloon driver. Toolstack change will be introduced
separately.
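
The rounding itself is a one-liner. A sketch, assuming the amount to balloon is counted in 4K pages; `round_up_to_2m` is a hypothetical helper name.

```python
PAGES_PER_HUGE = 512  # 4K pages per 2M


def round_up_to_2m(nr_4k_pages):
    """Round a count of 4K pages up to the next multiple of 512."""
    return (nr_4k_pages + PAGES_PER_HUGE - 1) // PAGES_PER_HUGE * PAGES_PER_HUGE
```

Rounding up is what produces suboptimal case 1) above: a request for 513 pages becomes 1024, so up to 511 extra pages may be ballooned out.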

> >>> ## Goal of improvement
> >>>
> >>> The balloon driver makes use of as many huge pages as possible,
> >>> defragmenting guest address space. Contiguous guest address space
> >>> permits huge page ballooning which helps prevent host address space
> >>> fragmentation.
> >>>
> >>> This should be achieved without any particular hypervisor side
> >>> feature.
> >>
> >> I really think you need to be taking whole-system view and not focusing
> >> on just the guest balloon driver.
> >>
> > 
> > I don't think there's terribly tight linkage between hypervisor side
> > change and guest side change.
> 
> I don't see how you can think this unless you also have a design for the
> hypervisor side.
> 

Because the basic requirement for this design is to not rely on
hypervisor side features, so that it works on older
hypervisors as well. And so far the proposed design seems to stick to
that principle well.

> I do not want a situation were effective and efficient host
> defragmentation requires balloon driver changes to avoid a regression.
> 

Fair enough. I think that should be classified as a bug in hypervisor.
We should not change guest side for that reason. And more reasoning
coming near the end of this mail...

> > To have guest automatically defragmenting it's address space while at
> > the same time helps prevent hypervisor memory from fragmenting (at least
> > this is what the design aims for, as for how it works in practice, it
> > needs to be prototyped and benchmarked).
> > 
> > The above reasoning is good enough to justify this change, isn't it?
> 
> Having a whole system design does not mean that it must be all
> implemented.  If one part has benefits independently from the rest then
> it can be implemented and merged.
> 

So are you worried that this change in the guest makes the corresponding
feature in the hypervisor harder to implement? Do you see harm (whether to
the guest itself or to the hypervisor) in this guest side change? Are we
any worse than before, at least from a theoretical point of view? Of
course in practice we would still need to see how this goes.

[...]
> 
> >>> ### Periodically exchange normal size pages with huge pages
> >>>
> >>> Worker thread wakes up periodically to check if there are enough pages
> >>> in normal size page queue to coalesce into a huge page. If so, it will
> >>> try to exchange that huge page into a number of normal size pages with
> >>> XENMEM\_exchange hypercall.
> >>
> >> I don't see what this is supposed to achieve.  This is going to take a
> >> (potentially) non-fragmented superpage and fragment it.
> >>
> > 
> > Let's look at this from start of day.
> > 
> > Guest always tries to balloon in / out as many 2M pages as possible. So
> > if we have a long list of 4K pages, it means the underlying host super
> > frames are fragmented already.
> > 
> > So if 1) there are enough 4K pages in ballooned out list, 2) there is a
> > spare 2M page, it means that the 2M page comes from the result of
> > balloon page compaction, which means the underlying host super frame is
> > fragmented.
> 
> This assumption is only true because your page migration isn't trying
> hard enough to defragment super frames,

However hard it tries, if the hypervisor is not defragmenting, this
assumption still stands. As long as you get the 2M page as a result of
balloon compaction, the underlying host frame is fragmented. Note, we're
not worse than before.

> and it is assuming that Xen does
> nothing to address host super frame fragmentation.  This highlights the
> importance of looking at a system-level for designs, IMO.
> 

What would make this design different when Xen knows how to defragment
frames?

We end up ballooning out a 2M host frame if the underlying huge frame is
defragmented (instead of a bunch of 4K frames). We're giving huge frame
back to Xen, so it's OK; then we exchange in 512 4K consecutive pages
(or a 2M page if we merge them) with 2M frame backing them. Xen is not
harmed; guest now has got a huge frame. It's only making things better
if Xen knows how to defragment.

In any case, I will need to prototype different approaches to see which
works best. I think figures of Xen heap fragmentation and guest P2M
entry counts grouped by page order will be interesting.

Does it require change to guest balloon driver if we're to implement Xen
side feature? From the guest's point of view I don't see one. To do any
work with regard to changing guest P2M we would surely need to get hold
of domain lock and p2m lock, in which case a guest is blocked from
issuing any memory hypercall anyway.

If contention is a problem, how are we worse off than what we have now?
Ballooning in / out certainly causes contention too. All we can do from
guest side is avoid trying too hard. But if the guest tries too hard
it's harming itself anyway.

Wei.

> David


* Re: Linux Xen Balloon Driver Improvement (Draft 2)
  2014-10-27 17:29     ` David Vrabel
  2014-10-27 19:10       ` Wei Liu
@ 2014-10-27 19:14       ` Wei Liu
  2014-10-28 10:51         ` David Vrabel
  1 sibling, 1 reply; 11+ messages in thread
From: Wei Liu @ 2014-10-27 19:14 UTC
  To: David Vrabel
  Cc: Wei Liu, Ian Campbell, Stefano Stabellini, Andrew Cooper,
	Dario Faggioli, xen-devel, Boris Ostrovsky

On Mon, Oct 27, 2014 at 05:29:16PM +0000, David Vrabel wrote:
> On 27/10/14 16:29, Wei Liu wrote:
> > On Mon, Oct 27, 2014 at 02:23:22PM +0000, David Vrabel wrote:
> >> On 27/10/14 12:33, Wei Liu wrote:
> >>>
> >>> Changes in this version:
> >>>
> >>> 1. Style, grammar and typo fixes.
> >>> 2. Make this document Linux centric.
> >>> 3. Add a new section for NUMA-aware ballooning.
> >>
> >> You've not included the required changes to the toolstack and
> >> autoballoon driver to always use 2M multiples when creating VMs and
> >> setting targets.
> >>
> > 
> > When creating VM, toolstack already tries to use as many huge pages as
> > possible.
> > 
> > Setting target doesn't use 2M multiples.  But I don't think this is
> > necessary. To balloon in / out X MB memory
> > 
> >   nr_2m = X / 2M
> >   nr_4k = (X % 2M) / 4k
> > 
> > The remainder just goes to 4K queue.
> 
> I understand that it will work with 4k multiples but it is not /optimal/
> to do so since it will result in more fragmentation.
> 

The fragmentation should be less than 2M, right? Is that terrible?

> > And what do you mean by "autoballoon" driver? Do you mean functionality
> > of xl? In the end the request is still fulfilled by Xen balloon driver
> > in kernel. So if dom0 is using the new balloon driver proposed here, it
> > should balloon down in 2M multiples automatically.
> 
> Both xl and the auto-balloon driver in the kernel should only set the
> target in multiples of 2M.
> 

This is easy to achieve. I can always round up to 2M multiples in
balloon driver.  Change to toolstack is simple as well. It's just that
a newer kernel may well run on an older toolstack, or even on
homebrew toolstacks that don't set target to 2M multiples. So after all
there are always suboptimal situations, be it 1) ballooning out a bit more
memory than requested or 2) a little bit of fragmentation.

I don't have very strong opinion on this. I will round up to 2M
multiples in balloon driver. Toolstack change will be introduced
separately.

> >>> ## Goal of improvement
> >>>
> >>> The balloon driver makes use of as many huge pages as possible,
> >>> defragmenting guest address space. Contiguous guest address space
> >>> permits huge page ballooning which helps prevent host address space
> >>> fragmentation.
> >>>
> >>> This should be achieved without any particular hypervisor side
> >>> feature.
> >>
> >> I really think you need to be taking whole-system view and not focusing
> >> on just the guest balloon driver.
> >>
> > 
> > I don't think there's terribly tight linkage between hypervisor side
> > change and guest side change.
> 
> I don't see how you can think this unless you also have a design for the
> hypervisor side.
> 

Because the basic requirement for this design is to not rely on
hypervisor side features, so that it works on older
hypervisors as well. And so far the proposed design seems to stick to
that principle well.

> I do not want a situation were effective and efficient host
> defragmentation requires balloon driver changes to avoid a regression.
> 

Fair enough. I think that should be classified as a bug in hypervisor.
We should not change guest side for that reason. And more reasoning
coming near the end of this mail...

> > To have guest automatically defragmenting it's address space while at
> > the same time helps prevent hypervisor memory from fragmenting (at least
> > this is what the design aims for, as for how it works in practice, it
> > needs to be prototyped and benchmarked).
> > 
> > The above reasoning is good enough to justify this change, isn't it?
> 
> Having a whole system design does not mean that it must be all
> implemented.  If one part has benefits independently from the rest then
> it can be implemented and merged.
> 

So are you worried that this change in the guest makes the corresponding
feature in the hypervisor harder to implement? Do you see harm (whether to
the guest itself or to the hypervisor) in this guest side change? Are we
any worse than before, at least from a theoretical point of view? Of
course in practice we would still need to see how this goes.

[...]
> 
> >>> ### Periodically exchange normal size pages with huge pages
> >>>
> >>> Worker thread wakes up periodically to check if there are enough pages
> >>> in normal size page queue to coalesce into a huge page. If so, it will
> >>> try to exchange that huge page into a number of normal size pages with
> >>> XENMEM\_exchange hypercall.
> >>
> >> I don't see what this is supposed to achieve.  This is going to take a
> >> (potentially) non-fragmented superpage and fragment it.
> >>
> > 
> > Let's look at this from start of day.
> > 
> > Guest always tries to balloon in / out as many 2M pages as possible. So
> > if we have a long list of 4K pages, it means the underlying host super
> > frames are fragmented already.
> > 
> > So if 1) there are enough 4K pages in ballooned out list, 2) there is a
> > spare 2M page, it means that the 2M page comes from the result of
> > balloon page compaction, which means the underlying host super frame is
> > fragmented.
> 
> This assumption is only true because your page migration isn't trying
> hard enough to defragment super frames,

However hard it tries, if the hypervisor is not defragmenting, this
assumption still stands. As long as you get the 2M page as a result of
balloon compaction, the underlying host frame is fragmented. Note, we're
not worse than before.

> and it is assuming that Xen does
> nothing to address host super frame fragmentation.  This highlights the
> importance of looking at a system-level for designs, IMO.
> 

What would make this design different when Xen knows how to defragment
frames?

We end up ballooning out a 2M host frame if the underlying huge frame is
defragmented (instead of a bunch of 4K frames). We're giving a huge frame
back to Xen, so it's OK; then we exchange in 512 consecutive 4K pages
(or a 2M page if we merge them) with a 2M frame backing them. Xen is not
harmed; the guest now has a huge frame. It only makes things better
if Xen knows how to defragment.

In any case, I will need to prototype different approaches to see which
works best. I think figures of Xen heap fragmentation and guest P2M
entry counts grouped by page order will be interesting.

Does implementing a Xen side feature require changes to the guest
balloon driver? From the guest's point of view I don't see any. To do any
work with regard to changing the guest P2M we would surely need to get
hold of the domain lock and the p2m lock, in which case a guest is
blocked from issuing any memory hypercall anyway.

If contention is a problem, how are we worse off than what we have now?
Ballooning in / out certainly causes contention too. All we can do from
the guest side is avoid trying too hard. But if the guest tries too hard
it's harming itself anyway.

Wei.

> David


* Re: Linux Xen Balloon Driver Improvement (Draft 2)
  2014-10-27 19:10       ` Wei Liu
@ 2014-10-27 19:42         ` Stefano Stabellini
  0 siblings, 0 replies; 11+ messages in thread
From: Stefano Stabellini @ 2014-10-27 19:42 UTC (permalink / raw)
  To: Wei Liu
  Cc: Ian Campbell, Stefano Stabellini, Andrew Cooper, Dario Faggioli,
	xen-devel, Boris Ostrovsky

On Mon, 27 Oct 2014, Wei Liu wrote:
> On Mon, Oct 27, 2014 at 05:29:16PM +0000, David Vrabel wrote:
> > > I don't think there's terribly tight linkage between hypervisor side
> > > change and guest side change.
> > 
> > I don't see how you can think this unless you also have a design for the
> > hypervisor side.
> > 
> 
> Because the basic requirement for this design is to not rely on any
> hypervisor side feature, so that it works on older hypervisors as
> well. So far the proposed design seems to stick to that principle
> well.

Wei, you do have a design for the hypervisor side changes: the design is
to make no hypervisor side changes. Maybe you should add:

## Hypervisor design for Linux Balloon Driver Compaction

No hypervisor changes required. The guest is going to make use of the
existing XENMEM_exchange interface. The Linux feature should work on any
hypervisor since Xen XXX.

> > I do not want a situation where effective and efficient host
> > defragmentation requires balloon driver changes to avoid a regression.

As you know there are non-Linux domU and dom0 out there that people use.
Older Linux and Xen versions mix and match quite well too. I think that
Linux side ballooning and Xen side defragmentation should be required to
be independent. Specifying that we don't need any Xen side changes to
implement Linux compaction is a good way to do that.

Also it is not common in the Linux community to request design
documents.  I don't think we should make them mandatory now. If we
really want to go down that path within the Xen community, we should
talk about it at the next Hackathon/Meeting and decide with the support
of the majority of the maintainers.

Personally, I am not in favor.


* Re: Linux Xen Balloon Driver Improvement (Draft 2)
  2014-10-27 19:14       ` Wei Liu
@ 2014-10-28 10:51         ` David Vrabel
  2014-10-29 11:01           ` Wei Liu
  0 siblings, 1 reply; 11+ messages in thread
From: David Vrabel @ 2014-10-28 10:51 UTC (permalink / raw)
  To: Wei Liu, David Vrabel
  Cc: Ian Campbell, Stefano Stabellini, Andrew Cooper, Dario Faggioli,
	xen-devel, Boris Ostrovsky

On 27/10/14 19:14, Wei Liu wrote:
> On Mon, Oct 27, 2014 at 05:29:16PM +0000, David Vrabel wrote:
>> On 27/10/14 16:29, Wei Liu wrote:
>>> On Mon, Oct 27, 2014 at 02:23:22PM +0000, David Vrabel wrote:
>>>> On 27/10/14 12:33, Wei Liu wrote:
>>>>>
>>>>> Changes in this version:
>>>>>
>>>>> 1. Style, grammar and typo fixes.
>>>>> 2. Make this document Linux centric.
>>>>> 3. Add a new section for NUMA-aware ballooning.
>>>>
>>>> You've not included the required changes to the toolstack and
>>>> autoballoon driver to always use 2M multiples when creating VMs and
>>>> setting targets.
>>>>
>>>
>>> When creating VM, toolstack already tries to use as many huge pages as
>>> possible.
>>>
>>> Setting target doesn't use 2M multiples.  But I don't think this is
>>> necessary. To balloon in / out X MB memory
>>>
>>>   nr_2m = X / 2M
>>>   nr_4k = (X % 2M) / 4k
>>>
>>> The remainder just goes to 4K queue.
>>
>> I understand that it will work with 4k multiples but it is not /optimal/
>> to do so since it will result in more fragmentation.
>>
> 
> The fragmentation should be less than 2M right? Is that terrible?

I think it will increase fragmentation every time the target is set. Or
perhaps more correctly, I can't prove that it does not increase
fragmentation each time.
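
For reference, the split discussed above (whole 2M pages first, with the
remainder going to the 4K queue) can be sketched in C as below. This is
only an illustration: `split_target`, `struct balloon_split` and the
size macros are hypothetical names, not existing driver code, and the
target is assumed to be a byte count that is a multiple of 4K.

```c
#include <stdint.h>

#define SZ_4K 4096ULL
#define SZ_2M (2ULL * 1024 * 1024)

/* Hypothetical result type: how a ballooning request is split up. */
struct balloon_split {
	uint64_t nr_2m;	/* whole 2M pages */
	uint64_t nr_4k;	/* 4K pages covering the sub-2M remainder */
};

/* Split a request of 'bytes' into 2M pages plus a 4K remainder. */
static struct balloon_split split_target(uint64_t bytes)
{
	struct balloon_split s;

	s.nr_2m = bytes / SZ_2M;
	s.nr_4k = (bytes % SZ_2M) / SZ_4K;
	return s;
}
```

With this split the 4K remainder is always fewer than 512 pages, i.e.
less than 2M per request. Whether repeated target changes accumulate
such remainders is exactly the open question here.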

> Because the basic requirement for this design is to not rely on any
> hypervisor side feature, so that it works on older hypervisors as
> well. So far the proposed design seems to stick to that principle
> well.

Even with the requirement for no hypervisor changes, I do not see how
you can produce a good design without considering the hypervisor
behaviour (both current and possible future changes).

Having said this, after thinking some more, in this case it is
sufficient to show that every step in the guest balloon driver always
reduces fragmentation, regardless of the underlying hypervisor behaviour.

>>>>> ### Periodically exchange normal size pages with huge pages
>>>>>
>>>>> Worker thread wakes up periodically to check if there are enough pages
>>>>> in normal size page queue to coalesce into a huge page. If so, it will
>>>>> try to exchange that huge page into a number of normal size pages with
>>>>> XENMEM\_exchange hypercall.
>>>>
>>>> I don't see what this is supposed to achieve.  This is going to take a
>>>> (potentially) non-fragmented superpage and fragment it.
>>>>
>>>
>>> Let's look at this from start of day.
>>>
>>> Guest always tries to balloon in / out as many 2M pages as possible. So
>>> if we have a long list of 4K pages, it means the underlying host super
>>> frames are fragmented already.
>>>
>>> So if 1) there are enough 4K pages in ballooned out list, 2) there is a
>>> spare 2M page, it means that the 2M page comes from the result of
>>> balloon page compaction, which means the underlying host super frame is
>>> fragmented.
>>
>> This assumption is only true because your page migration isn't trying
>> hard enough to defragment super frames,
> 
> However hard it tries, if the hypervisor is not defragmenting, this
> assumption still stands. As long as you get the 2M page as a result of
> balloon compaction, the underlying host frame is fragmented. Note, we're
> not worse than before.
> 
>> and it is assuming that Xen does
>> nothing to address host super frame fragmentation.  This highlights the
>> importance of looking at a system-level for designs, IMO.
>>
> 
> What would make this design different when Xen knows how to defragment
> frames?
> 
> We end up ballooning out a 2M host frame if the underlying huge frame is
> defragmented (instead of a bunch of 4K frames). We're giving a huge frame
> back to Xen, so it's OK; then we exchange in 512 consecutive 4K pages
> (or a 2M page if we merge them) with a 2M frame backing them. Xen is not
> harmed; the guest now has a huge frame. It only makes things better
> if Xen knows how to defragment.

Um.  Yes, of course if you use a contiguous and aligned set of 4K
ballooned pages then it will work well -- this is exactly the point I
am making. But this is not what your design says.

Also, to quote you from the thread on draft 1:

"As for moving contiguous ballooned pages from 4K list to 2M list,
unfortunately I see a problem with this proposal: The 4K pages list is
not sorted. Sorting it requires hooking into core balloon driver -- that
is, to grab multiple locks to avoid racing with page migration thread,
which is prone to error."

Although I don't understand the comments about multiple locks etc. since
I think the core balloon driver should be responsible for maintaining
the order of the list.

David


* Re: Linux Xen Balloon Driver Improvement (Draft 2)
  2014-10-28 10:51         ` David Vrabel
@ 2014-10-29 11:01           ` Wei Liu
  0 siblings, 0 replies; 11+ messages in thread
From: Wei Liu @ 2014-10-29 11:01 UTC (permalink / raw)
  To: David Vrabel
  Cc: Wei Liu, Ian Campbell, Stefano Stabellini, Andrew Cooper,
	Dario Faggioli, xen-devel, Boris Ostrovsky

On Tue, Oct 28, 2014 at 10:51:34AM +0000, David Vrabel wrote:
[...]
> >>
> > 
> > What would make this design different when Xen knows how to defragment
> > frames?
> > 
> > We end up ballooning out a 2M host frame if the underlying huge frame is
> > defragmented (instead of a bunch of 4K frames). We're giving a huge frame
> > back to Xen, so it's OK; then we exchange in 512 consecutive 4K pages
> > (or a 2M page if we merge them) with a 2M frame backing them. Xen is not
> > harmed; the guest now has a huge frame. It only makes things better
> > if Xen knows how to defragment.
> 
> Um.  Yes, of course if you use a contiguous and aligned set of 4K
> ballooned pages then it will work well -- this is exactly the point I
> am making. But this is not what your design says.
> 

Well, I've said I will prototype different approaches to see which one
works better in practice (with maintenance burden taken into account).

What these drafts say is by no means final, because to me memory
subsystem optimisation is very empirical.

> Also, to quote you from the thread on draft 1:
> 
> "As for moving contiguous ballooned pages from 4K list to 2M list,
> unfortunately I see a problem with this proposal: The 4K pages list is
> not sorted. Sorting it requires hooking into core balloon driver -- that
> is, to grab multiple locks to avoid racing with page migration thread,
> which is prone to error."
> 
> Although I don't understand the comments about multiple locks etc. since
> I think the core balloon driver should be responsible for maintaining
> the order of the list.
> 

"Prone to error" is a bit overstated, I admit.

The core balloon driver doesn't sort pages at the moment. To sort pages
we can either 1) sort them in the core driver, or 2) dequeue as many as
possible into the Xen balloon driver and then sort them.

#1 is less desirable because upstream is not likely to accept this
change. The core balloon driver is basically a vault to keep those
pages. When I replied I was thinking of having the Xen balloon driver
worker hook into the core balloon driver. In that case I need to
carefully fiddle with some page locks to keep pages safe from the page
migration thread, which is tricky.

#2 Dequeueing a balloon page may fail due to contention with compaction.
When under memory pressure we might not be able to dequeue all pages to
effectively coalesce them. However this approach is still worth trying
to see how it works.
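
To make #2 a bit more concrete, here is a rough userspace sketch of the
"dequeue, then sort" step. It assumes the dequeued 4K pages are
available as a plain array of distinct PFNs; `count_coalescable_2m` and
`cmp_pfn` are made-up names for illustration, and none of the locking or
the dequeue itself is shown. After sorting, a 2M candidate is a run of
512 consecutive PFNs that starts on a 512-page boundary.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGES_PER_2M 512u

/* qsort comparator for PFNs, written to avoid overflow on subtraction. */
static int cmp_pfn(const void *a, const void *b)
{
	uint64_t x = *(const uint64_t *)a;
	uint64_t y = *(const uint64_t *)b;

	return (x > y) - (x < y);
}

/*
 * Sort pfns[] in place and count the 2M-aligned runs of 512 consecutive
 * PFNs. Assumes pfns[] contains no duplicates.
 */
static size_t count_coalescable_2m(uint64_t *pfns, size_t n)
{
	size_t i = 0, found = 0;

	qsort(pfns, n, sizeof(*pfns), cmp_pfn);

	while (i + PAGES_PER_2M <= n) {
		/* A candidate run must start on a 2M boundary. */
		if (pfns[i] % PAGES_PER_2M != 0) {
			i++;
			continue;
		}
		/*
		 * Sorted and distinct: the run is complete exactly when
		 * the 512th entry equals start + 511.
		 */
		if (pfns[i + PAGES_PER_2M - 1] == pfns[i] + PAGES_PER_2M - 1) {
			found++;
			i += PAGES_PER_2M;
		} else {
			i++;
		}
	}
	return found;
}
```

If dequeueing fails for some pages under memory pressure, the array just
has holes and fewer runs are found, so the approach degrades rather than
breaks.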

Wei.

> David


* Re: Linux Xen Balloon Driver Improvement (Draft 2)
  2014-10-27 12:33 Linux Xen Balloon Driver Improvement (Draft 2) Wei Liu
  2014-10-27 14:23 ` David Vrabel
@ 2014-12-15 10:52 ` David Vrabel
  2014-12-15 10:58   ` Wei Liu
  1 sibling, 1 reply; 11+ messages in thread
From: David Vrabel @ 2014-12-15 10:52 UTC (permalink / raw)
  To: Wei Liu, xen-devel
  Cc: Ian Campbell, Stefano Stabellini, Andrew Cooper, Dario Faggioli,
	David Vrabel, Boris Ostrovsky

On 27/10/14 12:33, Wei Liu wrote:
> 
> ### Periodically exchange normal size pages with huge pages
> 
> Worker thread wakes up periodically to check if there are enough pages
> in normal size page queue to coalesce into a huge page. If so, it will
> try to exchange that huge page into a number of normal size pages with
> XENMEM\_exchange hypercall.

As Andrew recently pointed out[1], changes to a guest's p2m are not
properly handled during migration.  This would have to be fixed before
we can have any sort of background memory exchange process.

David

[1] http://lists.xen.org/archives/html/xen-devel/2014-11/msg01954.html


* Re: Linux Xen Balloon Driver Improvement (Draft 2)
  2014-12-15 10:52 ` David Vrabel
@ 2014-12-15 10:58   ` Wei Liu
  0 siblings, 0 replies; 11+ messages in thread
From: Wei Liu @ 2014-12-15 10:58 UTC (permalink / raw)
  To: David Vrabel
  Cc: Wei Liu, Ian Campbell, Stefano Stabellini, Andrew Cooper,
	Dario Faggioli, xen-devel, Boris Ostrovsky

On Mon, Dec 15, 2014 at 10:52:40AM +0000, David Vrabel wrote:
> On 27/10/14 12:33, Wei Liu wrote:
> > 
> > ### Periodically exchange normal size pages with huge pages
> > 
> > Worker thread wakes up periodically to check if there are enough pages
> > in normal size page queue to coalesce into a huge page. If so, it will
> > try to exchange that huge page into a number of normal size pages with
> > XENMEM\_exchange hypercall.
> 
> As Andrew recently pointed out[1], changes to a guest's p2m are not
> properly handled during migration.  This would have to be fixed before
> we can have any sort of background memory exchange process.
> 

Thanks for the pointer. I read that thread already.

Wei.

> David
> 
> [1] http://lists.xen.org/archives/html/xen-devel/2014-11/msg01954.html
