From: Andre Przywara
Subject: Re: NUMA TODO-list for xen-devel
Date: Fri, 3 Aug 2012 12:02:40 +0200
To: Dario Faggioli
Cc: Anil Madhavapeddy, George Dunlap, xen-devel, Jan Beulich, Andrew Cooper, "Zhang, Yang Z"

On 08/01/2012 06:16 PM, Dario Faggioli wrote:
> Hi everyone,
>
> With automatic placement finally landing in xen-unstable, I started
> thinking about what I could work on next, still in the field of
> improving Xen's NUMA support. Well, it turned out that running out of
> things to do is not an option! :-O
>
> In fact, I can think of quite a few open issues in that area, which I'm
> just braindumping here.
> ...
>
> * automatic placement of Dom0, if possible (my current series
>   only affects DomU)

I think Dom0 NUMA awareness should be one of the top priorities. If I
boot my 8-node box with Xen, I end up with a NUMA-clueless Dom0 which
actually has memory from all 8 nodes but thinks its memory is flat.
There are some tricks to confine it to node 0 (dom0_mem=, dom0_vcpus=,
dom0_vcpus_pin), but this requires intimate knowledge of the system's
parameters and is error-prone (an example boot line is sketched at the
end of this mail). Also, this does not work well with ballooning.
Actually, we could use ballooning to improve NUMA placement: by asking
Dom0 explicitly for memory from a certain node.

> * having internal Xen data structures honour the placement (e.g.,
>   I've been told that right now vcpu stacks are always allocated
>   on node 0... Andrew?).
>
> [D] - NUMA-aware scheduling in Xen. Don't pin vcpus on nodes' pcpus,
>       just have them _prefer_ running on the nodes where their memory
>       is.

This would be really cool. I once thought about something like a
home node: we start with placement to allocate memory from one node,
then we relax the VCPU pinning but mark that node as special for this
guest, so that it preferably gets run there. In times of CPU pressure,
however, we are happy to let it run on other nodes: CPU starvation is
much worse than the NUMA penalty. (A small sketch of this heuristic is
at the end of this mail.)

> [D] - Dynamic memory migration between different nodes of the host, as
>       the counterpart of the NUMA-aware scheduler.

I once read about a VMware feature: bandwidth-limited migration in the
background, hot pages first. That way we get flexibility and avoid CPU
starvation, but still don't hog the system with memory copying. Sounds
quite ambitious, though.

Regards,
Andre.

--
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
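
Example boot line referenced above. The memory size and vCPU count are
made-up values for a box whose node 0 has 4 GB and 8 cores;
dom0_max_vcpus is the spelled-out form of the dom0_vcpus= shorthand:

  xen.gz dom0_mem=4096M dom0_max_vcpus=8 dom0_vcpus_pin

One has to know beforehand how much memory and how many cores node 0
has, which is exactly the "intimate knowledge" problem mentioned above.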
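
Sketch of the home-node heuristic referenced above. This is not Xen
code, just a standalone toy model: the node count, CPU count and helper
names are made up, and a real scheduler would of course also have to
consider priorities, fairness and migration cost.

/* Toy model of the "home node" idea: a guest remembers the node its
 * memory was allocated on; the scheduler prefers an idle pcpu on that
 * node, but falls back to any idle pcpu rather than letting the guest
 * starve. */
#include <stdio.h>
#include <stdbool.h>

#define NR_NODES 2
#define NR_CPUS  8               /* 4 pcpus per node in this toy setup */

static bool cpu_idle[NR_CPUS];

static int node_of_cpu(int cpu)
{
    return cpu / (NR_CPUS / NR_NODES);
}

/* Prefer an idle pcpu on the guest's home node, else any idle pcpu. */
static int pick_cpu(int home_node)
{
    int fallback = -1;

    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        if (!cpu_idle[cpu])
            continue;
        if (node_of_cpu(cpu) == home_node)
            return cpu;          /* best case: memory is local */
        if (fallback < 0)
            fallback = cpu;      /* running remotely beats starving */
    }
    return fallback;             /* -1: nothing idle, stay queued */
}

int main(void)
{
    /* Only pcpus 5 and 6 (node 1) are idle; the guest's memory is on
     * node 0, so it runs remotely rather than waiting. */
    cpu_idle[5] = cpu_idle[6] = true;
    printf("home node 0 guest runs on pcpu %d\n", pick_cpu(0));

    /* As soon as a node-0 pcpu becomes idle, the preference kicks in. */
    cpu_idle[2] = true;
    printf("now it runs on pcpu %d\n", pick_cpu(0));

    return 0;
}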