From: Bharata B Rao
Date: Wed, 10 Jun 2015 18:20:53 +0530
Subject: Re: [Qemu-devel] [RFC PATCH v0] numa: API to lookup NUMA node by address
Message-ID: <20150610125053.GH9812@in.ibm.com>
In-Reply-To: <20150610114319.6d97bedc@nial.brq.redhat.com>
Reply-To: bharata@linux.vnet.ibm.com
To: Igor Mammedov
Cc: Paolo Bonzini, david@gibson.dropbear.id.au, Eduardo Habkost, qemu-devel@nongnu.org

On Wed, Jun 10, 2015 at 11:43:19AM +0200, Igor Mammedov wrote:
> On Tue, 9 Jun 2015 09:40:54 -0300
> Eduardo Habkost wrote:
>
> > On Tue, Jun 09, 2015 at 11:23:19AM +0200, Igor Mammedov wrote:
> > > On Mon, 8 Jun 2015 12:51:39 -0300
> > > Eduardo Habkost wrote:
> > >
> > > > On Mon, Jun 08, 2015 at 11:51:03AM +0200, Igor Mammedov wrote:
> > > > > On Mon, 8 Jun 2015 11:28:18 +0530
> > > > > Bharata B Rao wrote:
> > > > >
> > > > > > On Mon, May 25, 2015 at 02:42:40PM -0300, Eduardo Habkost wrote:
> > > > > > > On Mon, May 25, 2015 at 01:17:57PM +0530, Bharata B Rao wrote:
> > > > > > > > On Thu, May 14, 2015 at 11:39:06AM +0200, Paolo Bonzini wrote:
> > > > > > > > > On 13/05/2015 20:06, Eduardo Habkost wrote:
> > > > > > > > > > Also, this introduces a circular dependency between pc-dimm.c and
> > > > > > > > > > numa.c. Instead of that, pc-dimm could simply notify us when a new
> > > > > > > > > > device is realized (with just (addr, end, node) as arguments), so we can
> > > > > > > > > > save the list of memory ranges inside struct node_info.
> > > > > > > > > >
> > > > > > > > > > I wonder if the memory API already provides something that would help
> > > > > > > > > > us.
> > > > > > > > > > Paolo, do you see a way we could simply use a MemoryRegion as input
> > > > > > > > > > to lookup the NUMA node?
> > > > > > > > >
> > > > > > > > > No, but I guess you could add a numa_get/set_memory_region_node_id API
> > > > > > > > > that uses a hash table. That's a variant of the "pc-dimm could simply
> > > > > > > > > notify" numa.c that you propose above.
> > > > > > > >
> > > > > > > > While you say we can't use MemoryRegion as input to lookup the NUMA node,
> > > > > > > > you suggest that we add numa_get/set_memory_region_node_id. Does this API
> > > > > > > > get/set the NUMA node id for the given MemoryRegion?
> > > > > > >
> > > > > > > I was going to suggest that, but it would require changing the
> > > > > > > non-memdev code path to create a MemoryRegion for each node, too. So
> > > > > > > having a numa_set_mem_node_id(start_addr, end_addr, node_id) API would
> > > > > > > be simpler.
> > > > > >
> > > > > > In order to save the list of memory ranges inside node_info, I tried this
> > > > > > approach where I call
> > > > > > numa_set_mem_node_id(dimm.addr, dimm.size, dimm.node) from
> > > > > > pc_dimm_realize(), but the value of dimm.addr is finalized only later
> > > > > > in ->plug().
> > > > > >
> > > > > > So we would have to call this API from arch code like pc_dimm_plug().
> > > > > > Is that acceptable?
> > > >
> > > > It looks acceptable to me, as pc.c already has all the rest of the
> > > > NUMA-specific code for PC. I believe it would be interesting to keep all
> > > > numa.o dependencies contained inside machine code.
> > > >
> > > > > Could you query pc_dimms' numa property each time you need the mapping
> > > > > instead of additionally storing that mapping elsewhere?
> > > >
> > > > The original patch did that, but I suggested the
> > > > numa_set_mem_node_id() API for two reasons: 1) not requiring special
> > > > cases for hotplug inside numa_get_node(); 2) not introducing a circular
> > > > dependency between pc-dimm.c and numa.c.
> > > What circular dependency would doing foreach(pc-dimm) introduce?
> > > So far pc-dimm is independent of numa.c and is a regular device with no
> > > dependencies (except on the backend memdev) and has its own 'numa' property
> > > for providing that information to interested users. I'd rather keep
> > > it separate from the legacy numa.c:-numa handling.
> >
> > pc-dimm.c already depends on numa.c because it checks nb_numa_nodes
> > inside pc_dimm_realize().
> The check should be in pc_dimm_plug() instead of realize, but I guess it
> saves duplication when pc-dimm is reused with other targets; anyway, we
> could move it out into a generic common function.
>
> > I don't understand what you mean by "legacy numa.c:-numa handling".
> > Unless there's a way to query pc-dimm (or other code) for all
> > (address -> numa_node) mappings without a special case for memory
> > hotplug[1], I wouldn't call node_info[].node_mem "legacy".
> >
> > [1] And this is exactly what I want to provide with
> > numa_set_mem_node_id(): an API that doesn't require special cases
> > for memory hotplug.
> For x86, board makers usually define the address range -> node mapping
> statically, so node_info[].node_mem & numa_set_mem_node_id() make sense.
> And sPAPR could probably do the same, no need to scan for pc-dimm devices.

I thought numa_info[i].node_mem maintains memory size information only for
the node memories that are defined at boot time. If I have

  -m 8G,slots=16,maxmem=16G -numa node,mem=4G -numa node,mem=4G

there will be numa_info[0] and numa_info[1] for two nodes of 4G each. However,
the remaining 8G of hotpluggable memory isn't covered by this, and chunks of
memory from this range can be hotplugged into any node using the node=
property of the pc-dimm device. With this understanding, I think
numa_set_mem_node_id(), when called from pc_dimm_plug(), could record the
address range of the hotplugged pc-dimm in the corresponding numa_info[i].
Later this information can be used to look up the node by address.
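Roughly what I have in mind, as a standalone sketch (made-up storage and
types, not actual QEMU code; the real version would record the ranges in
numa_info[] and use ram_addr_t and QEMU's list helpers):

/* Standalone sketch, not QEMU code: record (addr, size) -> node mappings
 * as they are registered, and look a node up by address later. */
#include <stdint.h>
#include <stdio.h>

#define MAX_RANGES 64

struct mem_range {
    uint64_t start;   /* first byte of the range */
    uint64_t end;     /* last byte of the range  */
    int node;         /* NUMA node owning it     */
};

static struct mem_range ranges[MAX_RANGES];
static int nr_ranges;

/* Record an address range -> node mapping: once per node for boot memory,
 * and again from the board's plug handler for every pc-dimm (only at plug
 * time is the dimm's address final). */
static void numa_set_mem_node_id(uint64_t addr, uint64_t size, int node)
{
    if (!size || nr_ranges == MAX_RANGES) {
        return;
    }
    ranges[nr_ranges].start = addr;
    ranges[nr_ranges].end = addr + size - 1;
    ranges[nr_ranges].node = node;
    nr_ranges++;
}

/* Look up which node owns 'addr'; -1 if nothing is mapped there. */
static int numa_get_node(uint64_t addr)
{
    for (int i = 0; i < nr_ranges; i++) {
        if (addr >= ranges[i].start && addr <= ranges[i].end) {
            return ranges[i].node;
        }
    }
    return -1;
}

int main(void)
{
    /* Boot memory as in the example above: 4G on node 0, 4G on node 1. */
    numa_set_mem_node_id(0,          4ULL << 30, 0);
    numa_set_mem_node_id(4ULL << 30, 4ULL << 30, 1);
    /* A 1G pc-dimm later hotplugged into node 1 at 8G. */
    numa_set_mem_node_id(8ULL << 30, 1ULL << 30, 1);

    printf("0x%llx is on node %d\n",
           (unsigned long long)((8ULL << 30) + 0x1000),
           numa_get_node((8ULL << 30) + 0x1000));
    return 0;
}

The registration side would run once per boot-time node and once per plugged
dimm; the lookup side is the kind of query a caller such as
spapr_populate_drconf_memory() would make per address.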
> > > [...]
> > > What for one needs to pull dimm properties into numa.c?
> >
> > I am not sure I parsed the question correctly, but:
> >
> > The original commit message explains why numa_get_node(addr) is needed:
> >
> > > > This is needed by sPAPR PowerPC to support the
> > > > ibm,dynamic-reconfiguration-memory device tree node which is needed
> > > > for memory hotplug.
> >
> > And to make this work, it needs to be aware of NUMA information for
> > hotplugged memory too.
> I've checked spapr_populate_drconf_memory() from the original series;
> it needs to be aware at startup of the address range -> node mapping,
> including the partitioning of the whole hotplug memory range
> (i.e. not just the actually hotplugged memory).
> -numa node_mem & numa_set_mem_node_id() are sufficient for this purpose.

spapr_populate_drconf_memory() needs to know about node information for
boot-time memory as well as the hotplugged pc-dimm memory. Since chunks of
the hotplug memory range can be plugged into any node, we need to be able
to locate the node id for any such memory range. This is where the
numa_set_mem_node_id() call for each realized dimm will help.

Regards,
Bharata.
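PS: To make the consumer side concrete, a rough standalone illustration
(hypothetical names and chunk size, not the real sPAPR code): walk the
hotpluggable area in fixed-size chunks and ask which node owns each chunk,
which is the kind of query the drconf memory construction needs.

/* Standalone illustration only; the chunk size and the lookup stub are
 * assumptions for the example, not QEMU interfaces. */
#include <stdint.h>
#include <stdio.h>

#define LMB_SIZE (256ULL << 20)   /* assumed chunk size for the example */

/* Stand-in for a lookup over the ranges recorded via
 * numa_set_mem_node_id(); here node 1 owns a single 1G dimm at 8G. */
static int lookup_node(uint64_t addr)
{
    if (addr >= (8ULL << 30) && addr < (9ULL << 30)) {
        return 1;
    }
    return -1;   /* no memory plugged at this address yet */
}

int main(void)
{
    uint64_t hotplug_base = 8ULL << 30;   /* start of hotpluggable area */
    uint64_t hotplug_size = 8ULL << 30;   /* maxmem minus boot memory   */

    for (uint64_t off = 0; off < hotplug_size; off += LMB_SIZE) {
        int node = lookup_node(hotplug_base + off);
        printf("chunk at 0x%llx -> node %d\n",
               (unsigned long long)(hotplug_base + off), node);
    }
    return 0;
}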