From mboxrd@z Thu Jan 1 00:00:00 1970
From: Andrew Cooper
Subject: Re: NUMA TODO-list for xen-devel
Date: Wed, 1 Aug 2012 17:30:54 +0100
Message-ID: <501959BE.60801@citrix.com>
References: <1343837796.4958.32.camel@Solace>
In-Reply-To: <1343837796.4958.32.camel@Solace>
Sender: xen-devel-bounces@lists.xen.org
To: Dario Faggioli
Cc: Andre Przywara, Anil Madhavapeddy, George Dunlap, xen-devel, Jan Beulich, "Zhang, Yang Z"
List-Id: xen-devel@lists.xenproject.org

On 01/08/12 17:16, Dario Faggioli wrote:
> Hi everyone,
>
> With automatic placement finally landing in xen-unstable, I started
> thinking about what I could work on next, still in the field of
> improving Xen's NUMA support. Well, it turned out that running out of
> things to do is not an option! :-O
>
> In fact, I can think of quite a few open issues in that area, which I'm
> just braindumping here. If anyone has thoughts, ideas, feedback or
> whatever, I'd be happy to serve as a collector of them. I've already
> created a Wiki page to help with the tracking. You can see it here
> (for now it basically replicates this e-mail):
>
> http://wiki.xen.org/wiki/Xen_NUMA_Roadmap
>
> I'm putting a [D] (standing for Dario) near the points I've started
> working on or looking at, and again, I'd be happy to try tracking this
> too, i.e., keeping the list of "who-is-doing-what" updated, in order to
> ease collaboration.
>
> So, let's cut the talking:
>
>     - Automatic placement at guest creation time. Basics are there and
>       will be shipping with 4.2. However, a lot of other things are
>       missing and/or can be improved, for instance:
> [D]    * automated verification and testing of the placement;
>        * benchmarks and improvements of the placement heuristic;
> [D]    * choosing/building up some measure of node load (more accurate
>          than just counting vcpus) onto which to rely during placement;
>        * consider IONUMA during placement;
>        * automatic placement of Dom0, if possible (my current series is
>          only affecting DomU);
>        * having internal Xen data structures honour the placement (e.g.,
>          I've been told that right now vcpu stacks are always allocated
>          on node 0... Andrew?).
>
> [D] - NUMA-aware scheduling in Xen. Don't pin vcpus on nodes' pcpus,
>       just have them _prefer_ running on the nodes where their memory
>       is.
>
> [D] - Dynamic memory migration between different nodes of the host, as
>       the counterpart of the NUMA-aware scheduler.
>
>     - Virtual NUMA topology exposure to guests (a.k.a. guest-numa). If a
>       guest ends up on more than one node, make sure it knows it's
>       running on a NUMA platform (smaller than the actual host, but
>       still NUMA). This interacts with some of the above points:
>        * consider this during automatic placement for
>          resuming/migrating domains (if they have a virtual topology,
>          better not to change it);
>        * consider this during memory migration (it can change the
>          actual topology; should we update it on-line or disable memory
>          migration?).
>
>     - NUMA and ballooning and memory sharing. In some more detail:
>        * page sharing on NUMA boxes: it's probably sane to make it
>          possible to disable sharing pages across nodes;
>        * ballooning and its interaction with placement (races, amount of
>          memory needed and reported being different at different times,
>          etc.).
>
>     - Inter-VM dependencies and communication issues. If a workload is
>       made up of more than one VM and they all share the same (NUMA)
>       host, it might be best to have them share nodes as much as
>       possible, or perhaps do just the opposite, depending on the
>       specific characteristics of the workload itself, and this might be
>       considered during placement, memory migration and perhaps
>       scheduling.
>
>     - Benchmarking and performance evaluation in general. Meaning both
>       agreeing on a (set of) relevant workload(s) and on how to extract
>       meaningful performance data from there (and maybe how to do that
>       automatically?).
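
On the "prefer, don't pin" scheduling item above: just to make the idea
concrete, here is a rough sketch of what a two-pass pcpu choice could look
like. Everything in it (struct pcpu, pick_pcpu(), the notion of a single
home node per vcpu) is made up for illustration and is not how the credit
scheduler is actually structured.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a pcpu; not a real Xen structure. */
struct pcpu {
    int  node;   /* NUMA node this pcpu belongs to */
    bool idle;   /* true if nothing is currently running on it */
};

/*
 * Pick a pcpu for a vcpu whose memory lives on home_node.
 *
 * Pass 1: look for an idle pcpu on the home node (the preference).
 * Pass 2: fall back to any idle pcpu, so the vcpu is never left
 *         waiting just because its home node is busy.
 *
 * Returns the pcpu index, or -1 if no pcpu is idle at all.
 */
static int pick_pcpu(const struct pcpu *pcpus, size_t nr, int home_node)
{
    assert(pcpus != NULL);

    for (size_t i = 0; i < nr; i++)
        if (pcpus[i].idle && pcpus[i].node == home_node)
            return (int)i;

    for (size_t i = 0; i < nr; i++)
        if (pcpus[i].idle)
            return (int)i;

    return -1;
}
```

The difference from pinning is entirely in the second pass: with hard
pinning it would not exist, and a busy home node would mean the vcpu waits
even while pcpus on other nodes sit idle.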


- Xen NUMA internals.  Placing items such as the per-cpu stacks and data
  area on the local NUMA node, rather than unconditionally on node 0 as
  happens at the moment.  As part of this, there will be changes to
  alloc_{dom,xen}heap_page() to allow specifying which node(s) to
  allocate memory from.
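
To illustrate the sort of interface I mean, here is a toy sketch of an
allocator that takes a set of acceptable nodes, plus a caller that tries
the local node first and only then falls back.  The names (toy_heap,
toy_alloc_page(), toy_alloc_stack_page()) and the bitmask convention are
assumptions for the example only; they are not the real
alloc_{dom,xen}heap_page() signatures, and the final interface may well
look different.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 4   /* hypothetical upper bound for this sketch */

/* Toy heap: just a count of free pages per node.  Not Xen's heap. */
struct toy_heap {
    unsigned long free_pages[MAX_NODES];
};

/*
 * Take one page from the lowest-numbered node in node_mask (bit n set
 * means node n is acceptable) that still has free pages.
 * Returns the node the page came from, or -1 if every acceptable node
 * is exhausted.  Widening the mask and retrying is left to the caller.
 */
static int toy_alloc_page(struct toy_heap *heap, unsigned int node_mask)
{
    assert(heap != NULL);

    for (int node = 0; node < MAX_NODES; node++) {
        if (!(node_mask & (1u << node)))
            continue;
        if (heap->free_pages[node] > 0) {
            heap->free_pages[node]--;
            return node;
        }
    }
    return -1;
}

/* Allocate a per-cpu stack page: local node first, then any node. */
static int toy_alloc_stack_page(struct toy_heap *heap, int local_node)
{
    assert(local_node >= 0 && local_node < MAX_NODES);

    int node = toy_alloc_page(heap, 1u << local_node);
    if (node < 0)
        node = toy_alloc_page(heap, (1u << MAX_NODES) - 1);
    return node;
}
```

Keeping the fallback policy in the caller rather than in the allocator
means a caller that really needs node-local memory can simply treat -1 as
a failure instead of silently getting a remote page.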

~Andrew

>
>
> So, what do you think?
>
> Thanks and Regards,
> Dario
>


--
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel