From mboxrd@z Thu Jan  1 00:00:00 1970
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: Re: NUMA TODO-list for xen-devel
Date: Wed, 1 Aug 2012 17:30:54 +0100
Message-ID: <501959BE.60801@citrix.com>
References: <1343837796.4958.32.camel@Solace>
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="===============7985365601985832137=="
Return-path: <xen-devel-bounces@lists.xen.org>
In-Reply-To: <1343837796.4958.32.camel@Solace>
List-Unsubscribe: <http://lists.xen.org/cgi-bin/mailman/options/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xen.org>
List-Help: <mailto:xen-devel-request@lists.xen.org?subject=help>
List-Subscribe: <http://lists.xen.org/cgi-bin/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=subscribe>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Dario Faggioli <raistlin@linux.it>
Cc: Andre Przywara <andre.przywara@amd.com>, Anil Madhavapeddy <anil@recoil.org>, George Dunlap <dunlapg@gmail.com>, xen-devel <xen-devel@lists.xen.org>, Jan Beulich <JBeulich@suse.com>, "Zhang, Yang Z" <yang.z.zhang@intel.com>
List-Id: xen-devel@lists.xenproject.org

--===============7985365601985832137==
Content-Type: multipart/alternative;
	boundary="------------090302080202010206030406"

--------------090302080202010206030406
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 7bit


On 01/08/12 17:16, Dario Faggioli wrote:
> Hi everyone,
>
> With automatic placement finally landing in xen-unstable, I started
> thinking about what I could work on next, still in the field of
> improving Xen's NUMA support. Well, it turned out that running out of
> things to do is not an option! :-O
>
> In fact, I can think of quite a few open issues in that area, which
> I'm just braindumping here. If anyone has thoughts, ideas, feedback or
> whatever, I'd be happy to serve as a collector of them. I've already
> created a Wiki page to help with the tracking. You can see it here
> (for now it basically replicates this e-mail):
>
> http://wiki.xen.org/wiki/Xen_NUMA_Roadmap
>
> I'm putting a [D] (standing for Dario) near the points I've started
> working on or looking at, and again, I'd be happy to try tracking this
> too, i.e., keeping the list of "who-is-doing-what" updated, in order to
> ease collaboration.
>
> So, let's cut the talking:
>
> - Automatic placement at guest creation time. Basics are there and
> will be shipping with 4.2. However, a lot of other things are
> missing and/or can be improved, for instance:
> [D] * automated verification and testing of the placement;
> * benchmarks and improvements of the placement heuristic;
> [D] * choosing/building up some measure of node load (more accurate
> than just counting vcpus) to rely on during placement;
> * consider IONUMA during placement;
> * automatic placement of Dom0, if possible (my current series
> only affects DomU);
> * having internal Xen data structures honour the placement (e.g.,
> I've been told that right now vcpu stacks are always allocated
> on node 0... Andrew?).
>
> [D] - NUMA-aware scheduling in Xen. Don't pin vcpus on nodes' pcpus,
> just have them _prefer_ running on the nodes where their memory
> is.
>
> [D] - Dynamic memory migration between different nodes of the host, as
> the counterpart of the NUMA-aware scheduler.
>
> - Virtual NUMA topology exposure to guests (a.k.a. guest-numa). If a
> guest ends up on more than one node, make sure it knows it's
> running on a NUMA platform (smaller than the actual host, but
> still NUMA). This interacts with some of the above points:
> * consider this during automatic placement for
> resuming/migrating domains (if they have a virtual topology,
> better not to change it);
> * consider this during memory migration (it can change the
> actual topology; should we update it on-line or disable memory
> migration?)
>
> - NUMA, ballooning and memory sharing. In more detail:
> * page sharing on NUMA boxes: it's probably sane to make it
> possible to disable page sharing across nodes;
> * ballooning and its interaction with placement (races, amount of
> memory needed and reported being different at different times,
> etc.).
>
> - Inter-VM dependencies and communication issues. If a workload is
> made up of more than one VM and they all share the same (NUMA)
> host, it might be best to have them share nodes as much as
> possible, or perhaps to do exactly the opposite, depending on the
> specific characteristics of the workload itself. This might be
> considered during placement, memory migration and perhaps
> scheduling.
>
> - Benchmarking and performance evaluation in general. This means both
> agreeing on a (set of) relevant workload(s) and on how to extract
> meaningful performance data from them (and maybe how to do that
> automatically?).

- Xen NUMA internals.  Placing items such as the per-cpu stacks and data
area on the local NUMA node, rather than unconditionally on node 0 as
happens at the moment.  As part of this, there will be changes to
alloc_{dom,xen}heap_page() to allow the caller to specify which node(s)
to allocate memory from.
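
To make the idea concrete, here is a minimal, self-contained toy sketch in
C of what "allocate from a given node, fall back otherwise" could look
like.  It is not Xen code: the names alloc_page_on_node(),
alloc_cpu_stack(), NODE_ANY and the fixed-size per-node pools are all
hypothetical, and the fallback simply walks nodes in index order, where a
real allocator would presumably take node distance into account.

```c
#include <assert.h>
#include <stddef.h>

#define NR_NODES        4   /* hypothetical host with four NUMA nodes */
#define PAGES_PER_NODE  8   /* tiny free pool per node, for illustration */
#define NODE_ANY      (-1)  /* caller has no node preference */

struct page { int node; };  /* records which node the page lives on */

static struct page pool[NR_NODES][PAGES_PER_NODE];
static int next_free[NR_NODES];  /* pages already handed out, per node */

/* Take one page from a specific node, or return NULL if it is exhausted. */
static struct page *take_from_node(int node)
{
    struct page *pg;

    if (next_free[node] >= PAGES_PER_NODE)
        return NULL;
    pg = &pool[node][next_free[node]++];
    pg->node = node;
    return pg;
}

/*
 * Hypothetical node-aware allocation: try the requested node first, then
 * fall back to the other nodes in index order rather than failing.
 */
struct page *alloc_page_on_node(int node)
{
    struct page *pg;
    int n;

    assert(node == NODE_ANY || (node >= 0 && node < NR_NODES));

    if (node != NODE_ANY) {
        pg = take_from_node(node);
        if (pg)
            return pg;
    }
    for (n = 0; n < NR_NODES; n++) {
        pg = take_from_node(n);
        if (pg)
            return pg;
    }
    return NULL;  /* every node is out of pages */
}

/* Allocate a per-CPU stack page on the node that the CPU belongs to. */
struct page *alloc_cpu_stack(int cpu, const int *cpu_to_node)
{
    return alloc_page_on_node(cpu_to_node[cpu]);
}
```

With a cpu_to_node table of {0, 0, 1, 1}, alloc_cpu_stack(2, cpu_to_node)
returns a page on node 1 instead of node 0, and only spills over to
another node once node 1 has no free pages left.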

~Andrew

>
>
> So, what do you think?
>
> Thanks and Regards,
> Dario
>

-- 
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com


--------------090302080202010206030406--


--===============7985365601985832137==--