Re: [RFC] Xen NUMA strategy

From: Akio Takebe <takebe_akio@jp.fujitsu.com>
To: Andre Przywara <andre.przywara@amd.com>, xen-devel@lists.xensource.com
Cc: "Xu, Anthony" <anthony.xu@intel.com>
Subject: Re: [RFC] Xen NUMA strategy
Date: Tue, 18 Sep 2007 15:08:12 +0900
Message-ID: <54C7F9BA4B1341takebe_akio@jp.fujitsu.com>
In-Reply-To: <46EA7906.2010504@amd.com>

Hi, Andre, Anthony and all

>Anthony Xu and I have had some fruitful discussion about the further 
>direction of the NUMA support in Xen, I wanted to share the results with 
>the Xen community and start a discussion:
Thank you for sharing this.

>We came up with two different approaches for better NUMA support in Xen:
>1.) Guest NUMA support: spread a guest's resources (CPUs and memory) 
>over several nodes and propagate the appropriate topology to the guest.
>The first part of this is in the patches I sent recently to the list (PV 
>support is following, bells and whistles like automatic placement will 
>follow, too.).
>	***Advantages***:
>- The guest OS has better means to deal with the NUMA setup, it can more 
>easily migrate _processes_ among the nodes (Xen-HV can only migrate 
>whole domains).
>- Changes to Xen are relatively small.
>- There is no limit for the guest resources, since they can use more 
>resources than there are on one node.
>- If guests are well spread over the nodes, the system is more balanced 
>even if guests are destroyed and created later.
>	***Disadvantages***:
>- The guest has to support NUMA. This is not true for older guests 
>(Win2K, older Linux).
>- The guest's workload has to fit NUMA. If the guest's tasks are merely 
>parallelizable or use much shared memory, they cannot take advantage of 
>NUMA and will degrade in performance. This includes all single task 
>problems.
>
We may need a way to describe guest NUMA in the guest configuration file.
For example:
vnode = <number of guest nodes>
vcpu = [<vcpus pinned to the node>:<machine node#>, ...]
memory = [<amount of memory for the node>:<machine node#>, ...]

e.g.
vnode = 2
vcpu = [0-1:0, 2-3:1]
memory = [128:0, 128:1]

If we set vnode=1, old OSes should work fine.
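
The proposed syntax above could be parsed roughly as follows. This is only
a sketch under assumptions: the entries are written as quoted strings (xm
configuration files are evaluated as Python, so a bare 0-1:0 would not be
valid there), and the names parse_vcpu_spec, parse_memory_spec and
check_guest_numa are hypothetical, not existing Xen tools code.

```python
def parse_vcpu_spec(entries):
    """Map each vcpu number to a machine node, e.g. ["0-1:0", "2-3:1"]."""
    placement = {}
    for entry in entries:
        vcpus, node = entry.split(":")
        # "0-1" is a vcpu range; a single number such as "2" is also accepted.
        first, _, last = vcpus.partition("-")
        last = last or first
        for vcpu in range(int(first), int(last) + 1):
            if vcpu in placement:
                raise ValueError(f"vcpu {vcpu} is listed twice")
            placement[vcpu] = int(node)
    return placement


def parse_memory_spec(entries):
    """Map each machine node to its memory amount, e.g. ["128:0", "128:1"]."""
    per_node = {}
    for entry in entries:
        amount, node = entry.split(":")
        per_node[int(node)] = per_node.get(int(node), 0) + int(amount)
    return per_node


def check_guest_numa(vnode, vcpu, memory):
    """Assumed rule: one vcpu entry and one memory entry per guest node."""
    if len(vcpu) != vnode or len(memory) != vnode:
        raise ValueError("need one vcpu and one memory entry per guest node")
    return parse_vcpu_spec(vcpu), parse_memory_spec(memory)
```

With the example values, check_guest_numa(2, ["0-1:0", "2-3:1"],
["128:0", "128:1"]) places vcpus 0-1 on machine node 0 and vcpus 2-3 on
machine node 1. A vnode=1 guest would pass a single entry for each list.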

Also, most OSes read the NUMA configuration only at boot and on CPU/memory
hotplug. So if Xen migrates a vcpu to another node, Xen has to raise a
hotplug event, which is costly. Pinning vcpus to a node may therefore be
a good idea.

In this case, we may also need something for cap/weight.

>In general this approach seems to fit better with smaller NUMA nodes and 
>larger guests.
>
>2.) Dynamic load balancing and page migration: create guests within one 
>NUMA node and distribute all guests across the nodes. If the system 
>becomes imbalanced, migrate guests to other nodes and copy (at least 
>part of) their memory pages to the other node's local memory.
>	***Advantages***:
>- No guest NUMA support necessary. Older as well as recent guests should 
>run fine.
>- Smaller guests don't have to cope with NUMA and will have 'flat' 
>memory available.
>- Guests running on separate nodes usually don't disturb each other and 
>can benefit from the higher distributed memory bandwidth.
>	***Disadvantages***:
>- Guests are limited to the resources available on one node. This 
>applies for both the number of CPUs and the amount of memory.
>- Costly migration of guests. In a simple implementation we'd use live 
>migration, which requires the whole guest's memory to be copied before 
>the guest starts to run on the other node. If this whole move proves to 
>be unnecessary a few minutes later, all this was in vain. A more 
>advanced implementation would do the page migration in the background 
>and thus can avoid this problem, if only the hot pages are migrated first.
>- Integration into Xen seems to be more complicated (at least for the 
>more ungifted hackers among us).
>
>This approach seems to be more reasonable if you have larger nodes (for 
>instance 16 cores) and smaller guests (the more usual case nowadays?)
If Xen migrates a guest, does the system need to have twice the guest's
memory available?

I think pinning a guest to a node is basically good.
If the system becomes imbalanced and we absolutely want
to migrate a guest, then Xen could temporarily migrate only the vcpus,
and we accept the performance loss during that time.
What do you think?
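
To make the idea concrete, here is a toy model of "migrate only the vcpus".
It is a sketch under assumptions, not Xen scheduler code: Guest, node_load
and rebalance are hypothetical names, load is simply the vcpu count per
node, and the guest's memory never leaves its home node.

```python
from dataclasses import dataclass


@dataclass
class Guest:
    name: str
    vcpus: int
    home_node: int  # node that holds the guest's memory; never changes here
    cpu_node: int   # node whose CPUs currently run the guest's vcpus

    @property
    def is_remote(self):
        # True while the guest runs away from its memory (slower access).
        return self.cpu_node != self.home_node


def node_load(guests, num_nodes):
    """Count the vcpus currently running on each node."""
    load = [0] * num_nodes
    for g in guests:
        load[g.cpu_node] += g.vcpus
    return load


def rebalance(guests, num_nodes, cpus_per_node):
    """Move vcpus only; memory stays on home_node."""
    # First bring displaced guests home again if their home node has room.
    for g in guests:
        load = node_load(guests, num_nodes)
        if g.is_remote and load[g.home_node] + g.vcpus <= cpus_per_node:
            g.cpu_node = g.home_node
    # Then relieve overcommitted nodes, moving the smallest guest first.
    for node in range(num_nodes):
        while True:
            load = node_load(guests, num_nodes)
            target = min(range(num_nodes), key=load.__getitem__)
            movable = sorted((g for g in guests if g.cpu_node == node),
                             key=lambda g: g.vcpus)
            if load[node] <= cpus_per_node or not movable:
                break
            g = movable[0]
            if load[target] + g.vcpus >= load[node]:
                break  # moving would not improve the balance
            g.cpu_node = target
```

A displaced guest keeps is_remote set until a later rebalance() call finds
room on its home node, which matches the "temporarily" in the proposal.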

Best Regards,

Akio Takebe
