From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from eggs.gnu.org ([140.186.70.92]:52317)
	by lists.gnu.org with esmtp (Exim 4.71) (envelope-from )
	id 1RSVdf-0001ns-Kn for qemu-devel@nongnu.org;
	Mon, 21 Nov 2011 10:18:41 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from ) id 1RSVdb-0008Hs-Ex for qemu-devel@nongnu.org;
	Mon, 21 Nov 2011 10:18:31 -0500
Received: from e28smtp08.in.ibm.com ([122.248.162.8]:44352)
	by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from )
	id 1RSVda-0008H1-Oa for qemu-devel@nongnu.org;
	Mon, 21 Nov 2011 10:18:27 -0500
Received: from /spool/local by e28smtp08.in.ibm.com
	with IBM ESMTP SMTP Gateway: Authorized Use Only!
	Violators will be prosecuted for from ;
	Mon, 21 Nov 2011 20:48:20 +0530
Received: from d28av04.in.ibm.com (d28av04.in.ibm.com [9.184.220.66])
	by d28relay03.in.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP
	id pALFICjv2760850 for ; Mon, 21 Nov 2011 20:48:13 +0530
Received: from d28av04.in.ibm.com (loopback [127.0.0.1])
	by d28av04.in.ibm.com (8.14.4/8.13.1/NCO v10.0 AVout) with ESMTP
	id pALFICTl017940 for ; Tue, 22 Nov 2011 02:18:12 +1100
Date: Mon, 21 Nov 2011 20:48:07 +0530
From: Bharata B Rao
Message-ID: <20111121150054.GA3602@in.ibm.com>
References: <20111029184502.GH11038@in.ibm.com>
	<7816C401-9BE5-48A9-8BA9-4CDAD1B39FC8@suse.de>
	<20111108173304.GA14486@sequoia.sous-sol.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20111108173304.GA14486@sequoia.sous-sol.org>
Subject: Re: [Qemu-devel] [RFC PATCH] Exporting Guest RAM information for
	NUMA binding
Reply-To: bharata@linux.vnet.ibm.com
To: Chris Wright
Cc: Andrea Arcangeli, Peter Zijlstra, kvm list, qemu-devel Developers,
	Alexander Graf, dipankar@in.ibm.com, Vaidyanathan S

On Tue, Nov 08, 2011 at 09:33:04AM -0800, Chris Wright wrote:
> * Alexander Graf (agraf@suse.de) wrote:
> > On 29.10.2011, at 20:45, Bharata B Rao wrote:
> > > As guests become NUMA aware, it becomes important for the guests to
> > > have correct NUMA policies when they run on NUMA aware hosts.
> > > Currently limited support for NUMA binding is available via libvirt
> > > where it is possible to apply a NUMA policy to the guest as a whole.
> > > However multinode guests would benefit if guest memory belonging to
> > > different guest nodes are mapped appropriately to different host NUMA nodes.
> > >
> > > To achieve this we would need QEMU to expose information about
> > > guest RAM ranges (Guest Physical Address - GPA) and their host virtual
> > > address mappings (Host Virtual Address - HVA). Using GPA and HVA, any external
> > > tool like libvirt would be able to divide the guest RAM as per the guest NUMA
> > > node geometry and bind guest memory nodes to corresponding host memory nodes
> > > using HVA. This needs both QEMU (and libvirt) changes as well as changes
> > > in the kernel.
> >
> > Ok, let's take a step back here. You are basically growing libvirt
> > into a memory resource manager that knows how much memory is
> > available on which nodes and how these nodes would possibly fit into
> > the host's memory layout.
> >
> > Shouldn't that be the kernel's job? It seems to me that
> > architecturally the kernel is the place I would want my memory
> > resource controls to be in.
>
> I think that both Peter and Andrea are looking at this. Before we commit
> an API to QEMU that has a different semantic than a possible new kernel
> interface (that perhaps QEMU could use directly to inform kernel of the
> binding/relationship between vcpu thread and its memory at VM startup)
> it would be useful to see what these guys are working on...

I looked at Peter's recent work in this area
(https://lkml.org/lkml/2011/11/17/204).

It introduces two interfaces:

1. ms_tbind() to bind a thread to a memsched(*) group
2. ms_mbind() to bind a memory region to a memsched group

I assume the 2nd interface could be used by QEMU to create a memsched
group for each of the guest NUMA node memory regions.

In the past, Anthony has said that NUMA binding should be done from
outside of QEMU
(http://www.kerneltrap.org/mailarchive/linux-kvm/2010/8/31/6267041).
Though that was in a different context, maybe we should revisit that
and see if QEMU still sticks to it.

I know it's a bit early, but if needed we should ask Peter to consider
extending ms_mbind() to take a tid parameter too, instead of working on
the current task by default.

(*) memsched: An abstraction for representing the coupling of threads
with virtual address ranges. Threads and virtual address ranges of a
memsched group are guaranteed (?) to be located on the same node.

Regards,
Bharata.
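
For illustration only, the GPA/HVA splitting discussed in this thread can be
sketched as below. It shows how an external tool might turn a list of guest
RAM regions into per-guest-node HVA ranges. Everything here is hypothetical:
RamRegion, split_by_guest_node() and the sample addresses are made up for
the sketch and are not a QEMU, libvirt or kernel interface. It also assumes
that guest nodes occupy guest RAM contiguously in GPA order, node 0 first.

```python
from dataclasses import dataclass

GIB = 1 << 30


@dataclass(frozen=True)
class RamRegion:
    """One guest RAM region (hypothetical export format, not a QEMU API)."""
    gpa: int   # guest physical start address
    hva: int   # host virtual address backing that GPA
    size: int  # length in bytes


def split_by_guest_node(regions, node_sizes):
    """Map each guest node to the HVA ranges that back its memory.

    Assumption: guest nodes are laid out back to back over guest RAM in
    ascending GPA order, and node_sizes[i] is the RAM size of guest node i.
    Returns {node: [(hva_start, length), ...]}.
    """
    result = {node: [] for node in range(len(node_sizes))}
    node = 0
    remaining = node_sizes[0] if node_sizes else 0
    for region in sorted(regions, key=lambda r: r.gpa):
        offset = 0
        while offset < region.size:
            # Move on to the next guest node once the current one is full.
            while node < len(node_sizes) and remaining == 0:
                node += 1
                if node < len(node_sizes):
                    remaining = node_sizes[node]
            if node >= len(node_sizes):
                raise ValueError("guest RAM exceeds the sum of node sizes")
            chunk = min(region.size - offset, remaining)
            result[node].append((region.hva + offset, chunk))
            offset += chunk
            remaining -= chunk
    return result


if __name__ == "__main__":
    # Made-up layout: 3 GiB below a hole, 1 GiB above it, two 2 GiB nodes.
    regions = [
        RamRegion(gpa=0, hva=0x7F0000000000, size=3 * GIB),
        RamRegion(gpa=4 * GIB, hva=0x7F0100000000, size=1 * GIB),
    ]
    for node, ranges in split_by_guest_node(regions, [2 * GIB, 2 * GIB]).items():
        for hva, length in ranges:
            print(f"guest node {node}: hva={hva:#x} len={length:#x}")
```

Each (hva, length) pair is the kind of range a binding call would need,
whether that ends up being an existing memory policy interface or something
like the proposed ms_mbind(). Which interface is used, and who issues the
call, is exactly the open question in this thread.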