From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <54F98FA9.1010900@linux.vnet.ibm.com>
Date: Fri, 06 Mar 2015 16:59:45 +0530
From: Raghavendra K T
MIME-Version: 1.0
To: Nishanth Aravamudan
Subject: Re: [PATCH v2] powerpc/numa: set node_possible_map to only
 node_online_map during boot
References: <20150305180549.GA29601@linux.vnet.ibm.com>
 <20150305231555.GB30570@linux.vnet.ibm.com>
 <20150306052750.GA9576@linux.vnet.ibm.com>
In-Reply-To: <20150306052750.GA9576@linux.vnet.ibm.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Cc: Tejun Heo, linuxppc-dev@lists.ozlabs.org, Paul Mackerras,
 Anton Blanchard, David Rientjes
List-Id: Linux on PowerPC Developers Mail List

On 03/06/2015 10:57 AM, Nishanth Aravamudan wrote:
> On 05.03.2015 [15:29:00 -0800], David Rientjes wrote:
>> On Thu, 5 Mar 2015, Nishanth Aravamudan wrote:
>>
>>> So if we compare to x86:
>>>
>>> arch/x86/mm/numa.c::numa_init():
>>>
>>>         nodes_clear(numa_nodes_parsed);
>>>         nodes_clear(node_possible_map);
>>>         nodes_clear(node_online_map);
>>>         ...
>>>         numa_register_memblks(...);
>>>
>>> arch/x86/mm/numa.c::numa_register_memblks():
>>>
>>>         node_possible_map = numa_nodes_parsed;
>>>
>>> Basically, it looks like x86 NUMA init clears out the possible map
>>> and the online map, probably for a reason similar to the one I gave
>>> in the changelog: by default, the possible map seems to be based on
>>> MAX_NUMNODES, rather than nr_node_ids or anything dynamic.
>>>
>>> My patch was an attempt to emulate the same thing on powerpc. You are
>>> right that with my patch there is a window in which node_possible_map
>>> and node_online_map are out of sync. It seems like it shouldn't
>>> matter given how early in boot we are, but perhaps the following
>>> would have been clearer:
>>>
>>> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
>>> index 0257a7d659ef..1a118b08fad2 100644
>>> --- a/arch/powerpc/mm/numa.c
>>> +++ b/arch/powerpc/mm/numa.c
>>> @@ -958,6 +958,13 @@ void __init initmem_init(void)
>>>
>>>  	memblock_dump_all();
>>>
>>> +	/*
>>> +	 * Reduce the possible NUMA nodes to the online NUMA nodes,
>>> +	 * since we do not support node hotplug. This ensures that we
>>> +	 * lower the maximum NUMA node ID to what is actually present.
>>> +	 */
>>> +	nodes_and(node_possible_map, node_possible_map, node_online_map);
>>
>> If you don't support node hotplug, then a node should always be
>> possible if it's online, unless there are other tricks powerpc plays
>> with node_possible_map. Shouldn't this just be
>> node_possible_map = node_online_map?
>
> Yeah, but I was too dumb to think of that before sending :)
>
> Updated version follows...
>
> -Nish
>
> ---8<---
>
> Raghu noticed an issue with excessive memory allocation on power with a
> simple cgroup test: specifically, mem_cgroup_css_alloc ->
> for_each_node -> alloc_mem_cgroup_per_zone_info() ends up blowing up
> the kmalloc-2048 slab (to the order of 200MB for 400 cgroup
> directories).

Should we also add that after this patch it is reduced to around 2MB?

> The underlying issue is that NODES_SHIFT on power is 8 (256 NUMA nodes
> possible), which defines node_possible_map, which in turn defines the
> value of nr_node_ids in setup_nr_node_ids and the iteration of
> for_each_node.
>
> In practice, we never see a system with 256 NUMA nodes, and in fact we
> do not support node hotplug on power in the first place, so the nodes
> that are online when we come up are the nodes that will be present for
> the lifetime of this kernel. So let's, at least, drop the NUMA possible
> map down to the online map at runtime. This is similar to what x86 does
> in its initialization routines.
>
> mem_cgroup_css_alloc should also be fixed to only iterate over
> memory-populated nodes and handle hotplug, but that is a separate
> change.

Maybe we could formally add:

Reported-by: Raghavendra K T

> Signed-off-by: Nishanth Aravamudan
> To: Michael Ellerman
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: Tejun Heo
> Cc: David Rientjes
> Cc: Benjamin Herrenschmidt
> Cc: Paul Mackerras
> Cc: Anton Blanchard
> Cc: Raghavendra K T
>
> ---
> v1 -> v2:
>   Rather than clear node_possible_map and set it nid-by-nid, just
>   directly assign node_online_map to it, as suggested by Michael
>   Ellerman and Tejun Heo.
>
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index 0257a7d659ef..0c1716cd271f 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -958,6 +958,13 @@ void __init initmem_init(void)
>
>  	memblock_dump_all();
>
> +	/*
> +	 * Reduce the possible NUMA nodes to the online NUMA nodes,
> +	 * since we do not support node hotplug. This ensures that we
> +	 * lower the maximum NUMA node ID to what is actually present.
> +	 */

Hope we remember this change when we add hotplug :)

> +	node_possible_map = node_online_map;
> +
>  	for_each_online_node(nid) {
>  		unsigned long start_pfn, end_pfn;
>
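For illustration, the nr_node_ids chain the changelog describes (NODES_SHIFT
-> node_possible_map -> setup_nr_node_ids -> for_each_node) can be modeled
outside the kernel. The following is a minimal user-space C sketch, not
kernel code: the nodemask_t, node_set(), node_isset(), and nr_node_ids()
here are simplified stand-ins for the kernel's nodemask API. It shows how
assigning the online map to the possible map drops the derived nr_node_ids
from MAX_NUMNODES down to the number of nodes actually present, which is
what shrinks every subsequent for_each_node() walk:

    #include <stdio.h>
    #include <string.h>

    #define MAX_NUMNODES 256  /* 1 << NODES_SHIFT, NODES_SHIFT == 8 on power */

    /* Simplified stand-in for the kernel's nodemask_t bitmap. */
    typedef struct { unsigned char bits[MAX_NUMNODES / 8]; } nodemask_t;

    static void node_set(int nid, nodemask_t *m)
    {
    	m->bits[nid / 8] |= 1u << (nid % 8);
    }

    static int node_isset(int nid, const nodemask_t *m)
    {
    	return (m->bits[nid / 8] >> (nid % 8)) & 1;
    }

    /* Highest possible node ID + 1: what setup_nr_node_ids() derives. */
    static int nr_node_ids(const nodemask_t *m)
    {
    	int nid;

    	for (nid = MAX_NUMNODES - 1; nid >= 0; nid--)
    		if (node_isset(nid, m))
    			return nid + 1;
    	return 1;
    }

    int main(void)
    {
    	nodemask_t possible, online;

    	/* Default possible map: all MAX_NUMNODES nodes marked possible. */
    	memset(possible.bits, 0xff, sizeof(possible.bits));

    	/* A typical two-node machine. */
    	memset(online.bits, 0, sizeof(online.bits));
    	node_set(0, &online);
    	node_set(1, &online);

    	printf("before: nr_node_ids = %d\n", nr_node_ids(&possible));

    	/* The v2 patch: node_possible_map = node_online_map. */
    	possible = online;

    	printf("after:  nr_node_ids = %d\n", nr_node_ids(&possible));
    	return 0;
    }

Built with a stock C compiler, this prints nr_node_ids = 256 before the
assignment and 2 after; per-node loops such as the one in
mem_cgroup_css_alloc scale with that value, which is where the allocation
savings come from.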