* [PATCH 0/3] powerpc: support memoryless nodes @ 2014-02-19 23:16 Nishanth Aravamudan 2014-02-19 23:17 ` [PATCH 1/3] mm: return NUMA_NO_NODE in local_memory_node if zonelists are not setup Nishanth Aravamudan 0 siblings, 1 reply; 12+ messages in thread From: Nishanth Aravamudan @ 2014-02-19 23:16 UTC (permalink / raw) To: Nish Aravamudan Cc: Michal Hocko, linux-mm, Mel Gorman, David Rientjes, Christoph Lameter, linuxppc-dev, Joonsoo Kim, Anton Blanchard We have seen several issues recently on powerpc LPARs with memoryless node NUMA configurations, e.g. (an extreme case): numactl --hardware available: 2 nodes (0,3) node 0 cpus: node 0 size: 0 MB node 0 free: 0 MB node 3 cpus: 0 1 2 3 node 3 size: 8142 MB node 3 free: 7765 MB node distances: node 0 3 0: 10 20 3: 20 10 powerpc doesn't set CONFIG_HAVE_MEMORYLESS_NODES, so we are missing out on a lot of the core-kernel support necessary. This series attempts to fix this by enabling the config option, which requires a few other changes as well. 1/3: mm: return NUMA_NO_NODE in local_memory_node if zonelists are not setup 2/3: powerpc: enable CONFIG_HAVE_PERCPU_NUMA_NODE_ID 3/3: powerpc: enable CONFIG_HAVE_MEMORYLESS_NODES I have tested this series with Christoph's patch (currently being discussed): http://www.spinics.net/lists/linux-mm/msg69452.html Thanks, Nish ^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH 1/3] mm: return NUMA_NO_NODE in local_memory_node if zonelists are not setup 2014-02-19 23:16 [PATCH 0/3] powerpc: support memoryless nodes Nishanth Aravamudan @ 2014-02-19 23:17 ` Nishanth Aravamudan 2014-02-19 23:18 ` [PATCH 2/3] powerpc: enable CONFIG_HAVE_PERCPU_NUMA_NODE_ID Nishanth Aravamudan ` (2 more replies) 0 siblings, 3 replies; 12+ messages in thread From: Nishanth Aravamudan @ 2014-02-19 23:17 UTC (permalink / raw) To: Michal Hocko, Mel Gorman, linux-mm, Christoph Lameter, David Rientjes, Joonsoo Kim, Ben Herrenschmidt, Anton Blanchard, linuxppc-dev We can call local_memory_node() before the zonelists are setup. In that case, first_zones_zonelist() will not set zone and the reference to zone->node will Oops. Catch this case, and, since we presumably running very early, just return that any node will do. Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Ben Herrenschmidt <benh@kernel.crashing.org> Cc: Anton Blanchard <anton@samba.org> diff --git a/mm/page_alloc.c b/mm/page_alloc.c index e3758a0..5de4337 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3650,6 +3650,8 @@ int local_memory_node(int node) gfp_zone(GFP_KERNEL), NULL, &zone); + if (!zone) + return NUMA_NO_NODE; return zone->node; } #endif ^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH 2/3] powerpc: enable CONFIG_HAVE_PERCPU_NUMA_NODE_ID 2014-02-19 23:17 ` [PATCH 1/3] mm: return NUMA_NO_NODE in local_memory_node if zonelists are not setup Nishanth Aravamudan @ 2014-02-19 23:18 ` Nishanth Aravamudan 2014-02-19 23:22 ` [PATCH 2/3 v2] " Nishanth Aravamudan 2014-02-19 23:32 ` [PATCH 1/3] mm: return NUMA_NO_NODE in local_memory_node if zonelists are not setup Nishanth Aravamudan 2014-02-20 16:05 ` Christoph Lameter 2 siblings, 1 reply; 12+ messages in thread From: Nishanth Aravamudan @ 2014-02-19 23:18 UTC (permalink / raw) To: Michal Hocko, Mel Gorman, linux-mm, Christoph Lameter, David Rientjes, Joonsoo Kim, Ben Herrenschmidt, Anton Blanchard, linuxppc-dev In order to enable CONFIG_HAVE_MEMORYLESS_NODES, it is necessary to have somewhere to store the cpu <-> local-memory-node mapping. We could create another powerpc-specific lookup table, but the generic functions in include/linux/topology.h (protected by HAVE_PERCPU_NUMA_NODE_ID) are sufficient. This also allows us to remove the existing powerpc-specific cpu <-> node lookup table. Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> diff --git a/arch/powerpc/include/asm/mmzone.h b/arch/powerpc/include/asm/mmzone.h index 7b58917..c8fbd1c 100644 --- a/arch/powerpc/include/asm/mmzone.h +++ b/arch/powerpc/include/asm/mmzone.h @@ -29,7 +29,6 @@ extern struct pglist_data *node_data[]; * Following are specific to this numa platform. */ -extern int numa_cpu_lookup_table[]; extern cpumask_var_t node_to_cpumask_map[]; #ifdef CONFIG_MEMORY_HOTPLUG extern unsigned long max_pfn; diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h index d0b5fca..8bbe8cc 100644 --- a/arch/powerpc/include/asm/topology.h +++ b/arch/powerpc/include/asm/topology.h @@ -20,19 +20,6 @@ struct device_node; #include <asm/mmzone.h> -static inline int cpu_to_node(int cpu) -{ - int nid; - - nid = numa_cpu_lookup_table[cpu]; - - /* - * During early boot, the numa-cpu lookup table might not have been - * setup for all CPUs yet. In such cases, default to node 0. - */ - return (nid < 0) ? 0 : nid; -} - #define parent_node(node) (node) #define cpumask_of_node(node) ((node) == -1 ? \ diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index ac2621a..f45e68d 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -739,6 +739,9 @@ void start_secondary(void *unused) } traverse_core_siblings(cpu, true); + set_cpu_numa_node(cpu, cpu_to_node(cpu)); + set_cpu_numa_mem(cpu, local_memory_node(cpu_to_node(cpu))); + smp_wmb(); notify_cpu_starting(cpu); set_cpu_online(cpu, true); diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index 30a42e2..57e2809 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -46,11 +46,9 @@ static char *cmdline __initdata; static int numa_debug; #define dbg(args...) if (numa_debug) { printk(KERN_INFO args); } -int numa_cpu_lookup_table[NR_CPUS]; cpumask_var_t node_to_cpumask_map[MAX_NUMNODES]; struct pglist_data *node_data[MAX_NUMNODES]; -EXPORT_SYMBOL(numa_cpu_lookup_table); EXPORT_SYMBOL(node_to_cpumask_map); EXPORT_SYMBOL(node_data); @@ -154,22 +152,25 @@ static void __init get_node_active_region(unsigned long pfn, } } -static void reset_numa_cpu_lookup_table(void) +static void reset_numa_cpu_node(void) { unsigned int cpu; - for_each_possible_cpu(cpu) - numa_cpu_lookup_table[cpu] = -1; + for_each_possible_cpu(cpu) { + set_cpu_numa_node(cpu, -1); + set_cpu_numa_mem(cpu, -1); + } } -static void update_numa_cpu_lookup_table(unsigned int cpu, int node) +static void update_numa_cpu_node(unsigned int cpu, int node) { - numa_cpu_lookup_table[cpu] = node; + set_cpu_numa_node(cpu, node); + set_cpu_numa_mem(cpu, local_memory_node(node)); } static void map_cpu_to_node(int cpu, int node) { - update_numa_cpu_lookup_table(cpu, node); + update_numa_cpu_node(cpu, node); dbg("adding cpu %d to node %d\n", cpu, node); @@ -180,7 +181,7 @@ static void map_cpu_to_node(int cpu, int node) #if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_PPC_SPLPAR) static void unmap_cpu_from_node(unsigned long cpu) { - int node = numa_cpu_lookup_table[cpu]; + int node = cpu_to_node(cpu); dbg("removing cpu %lu from node %d\n", cpu, node); @@ -545,7 +546,7 @@ static int numa_setup_cpu(unsigned long lcpu) * directly instead of querying the firmware, since it represents * the most recent mapping notified to us by the platform (eg: VPHN). */ - if ((nid = numa_cpu_lookup_table[lcpu]) >= 0) { + if ((nid = cpu_to_node(lcpu)) >= 0) { map_cpu_to_node(lcpu, nid); return nid; } @@ -1119,7 +1120,7 @@ void __init do_init_bootmem(void) */ setup_node_to_cpumask_map(); - reset_numa_cpu_lookup_table(); + reset_numa_cpu_node(); register_cpu_notifier(&ppc64_numa_nb); cpu_numa_callback(&ppc64_numa_nb, CPU_UP_PREPARE, (void *)(unsigned long)boot_cpuid); @@ -1518,7 +1519,7 @@ static int update_lookup_table(void *data) base = cpu_first_thread_sibling(update->cpu); for (j = 0; j < threads_per_core; j++) { - update_numa_cpu_lookup_table(base + j, nid); + update_numa_cpu_node(base + j, nid); } } @@ -1571,7 +1572,7 @@ int arch_update_cpu_topology(void) if (new_nid < 0 || !node_online(new_nid)) new_nid = first_online_node; - if (new_nid == numa_cpu_lookup_table[cpu]) { + if (new_nid == cpu_to_node(cpu)) { cpumask_andnot(&cpu_associativity_changes_mask, &cpu_associativity_changes_mask, cpu_sibling_mask(cpu)); @@ -1583,7 +1584,7 @@ int arch_update_cpu_topology(void) ud = &updates[i++]; ud->cpu = sibling; ud->new_nid = new_nid; - ud->old_nid = numa_cpu_lookup_table[sibling]; + ud->old_nid = cpu_to_node(sibling); cpumask_set_cpu(sibling, &updated_cpus); if (i < weight) ud->next = &updates[i]; ^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH 2/3 v2] powerpc: enable CONFIG_HAVE_PERCPU_NUMA_NODE_ID 2014-02-19 23:18 ` [PATCH 2/3] powerpc: enable CONFIG_HAVE_PERCPU_NUMA_NODE_ID Nishanth Aravamudan @ 2014-02-19 23:22 ` Nishanth Aravamudan 2014-02-19 23:23 ` [PATCH 3/3] powerpc: enable CONFIG_HAVE_MEMORYLESS_NODES Nishanth Aravamudan 0 siblings, 1 reply; 12+ messages in thread From: Nishanth Aravamudan @ 2014-02-19 23:22 UTC (permalink / raw) To: Michal Hocko, Mel Gorman, linux-mm, Christoph Lameter, David Rientjes, Joonsoo Kim, Ben Herrenschmidt, Anton Blanchard, linuxppc-dev [Apologies, I sent a stale version of this patch a moment ago...] In order to enable CONFIG_HAVE_MEMORYLESS_NODES, it is necessary to have somewhere to store the cpu <-> local-memory-node mapping. We could create another powerpc-specific lookup table, but the generic functions in include/linux/topology.h (protected by HAVE_PERCPU_NUMA_NODE_ID) are sufficient. This also allows us to remove the existing powerpc-specific cpu <-> node lookup table. Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Cc: Ben Herrenschmidt <benh@kernel.crashing.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Anton Blanchard <anton@samba.org> Cc: linuxppc-dev@lists.ozlabs.org diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 957bf34..a84816c 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -449,6 +449,10 @@ config NODES_SHIFT default "4" depends on NEED_MULTIPLE_NODES +config USE_PERCPU_NUMA_NODE_ID + def_bool y + depends on NUMA + config ARCH_SELECT_MEMORY_MODEL def_bool y depends on PPC64 diff --git a/arch/powerpc/include/asm/mmzone.h b/arch/powerpc/include/asm/mmzone.h index 7b58917..c8fbd1c 100644 --- a/arch/powerpc/include/asm/mmzone.h +++ b/arch/powerpc/include/asm/mmzone.h @@ -29,7 +29,6 @@ extern struct pglist_data *node_data[]; * Following are specific to this numa platform. */ -extern int numa_cpu_lookup_table[]; extern cpumask_var_t node_to_cpumask_map[]; #ifdef CONFIG_MEMORY_HOTPLUG extern unsigned long max_pfn; diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h index d0b5fca..8bbe8cc 100644 --- a/arch/powerpc/include/asm/topology.h +++ b/arch/powerpc/include/asm/topology.h @@ -20,19 +20,6 @@ struct device_node; #include <asm/mmzone.h> -static inline int cpu_to_node(int cpu) -{ - int nid; - - nid = numa_cpu_lookup_table[cpu]; - - /* - * During early boot, the numa-cpu lookup table might not have been - * setup for all CPUs yet. In such cases, default to node 0. - */ - return (nid < 0) ? 0 : nid; -} - #define parent_node(node) (node) #define cpumask_of_node(node) ((node) == -1 ? \ diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index ac2621a..f45e68d 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -739,6 +739,9 @@ void start_secondary(void *unused) } traverse_core_siblings(cpu, true); + set_cpu_numa_node(cpu, cpu_to_node(cpu)); + set_cpu_numa_mem(cpu, local_memory_node(cpu_to_node(cpu))); + smp_wmb(); notify_cpu_starting(cpu); set_cpu_online(cpu, true); diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index 30a42e2..57e2809 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -46,11 +46,9 @@ static char *cmdline __initdata; static int numa_debug; #define dbg(args...) if (numa_debug) { printk(KERN_INFO args); } -int numa_cpu_lookup_table[NR_CPUS]; cpumask_var_t node_to_cpumask_map[MAX_NUMNODES]; struct pglist_data *node_data[MAX_NUMNODES]; -EXPORT_SYMBOL(numa_cpu_lookup_table); EXPORT_SYMBOL(node_to_cpumask_map); EXPORT_SYMBOL(node_data); @@ -154,22 +152,25 @@ static void __init get_node_active_region(unsigned long pfn, } } -static void reset_numa_cpu_lookup_table(void) +static void reset_numa_cpu_node(void) { unsigned int cpu; - for_each_possible_cpu(cpu) - numa_cpu_lookup_table[cpu] = -1; + for_each_possible_cpu(cpu) { + set_cpu_numa_node(cpu, -1); + set_cpu_numa_mem(cpu, -1); + } } -static void update_numa_cpu_lookup_table(unsigned int cpu, int node) +static void update_numa_cpu_node(unsigned int cpu, int node) { - numa_cpu_lookup_table[cpu] = node; + set_cpu_numa_node(cpu, node); + set_cpu_numa_mem(cpu, local_memory_node(node)); } static void map_cpu_to_node(int cpu, int node) { - update_numa_cpu_lookup_table(cpu, node); + update_numa_cpu_node(cpu, node); dbg("adding cpu %d to node %d\n", cpu, node); @@ -180,7 +181,7 @@ static void map_cpu_to_node(int cpu, int node) #if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_PPC_SPLPAR) static void unmap_cpu_from_node(unsigned long cpu) { - int node = numa_cpu_lookup_table[cpu]; + int node = cpu_to_node(cpu); dbg("removing cpu %lu from node %d\n", cpu, node); @@ -545,7 +546,7 @@ static int numa_setup_cpu(unsigned long lcpu) * directly instead of querying the firmware, since it represents * the most recent mapping notified to us by the platform (eg: VPHN). */ - if ((nid = numa_cpu_lookup_table[lcpu]) >= 0) { + if ((nid = cpu_to_node(lcpu)) >= 0) { map_cpu_to_node(lcpu, nid); return nid; } @@ -1119,7 +1120,7 @@ void __init do_init_bootmem(void) */ setup_node_to_cpumask_map(); - reset_numa_cpu_lookup_table(); + reset_numa_cpu_node(); register_cpu_notifier(&ppc64_numa_nb); cpu_numa_callback(&ppc64_numa_nb, CPU_UP_PREPARE, (void *)(unsigned long)boot_cpuid); @@ -1518,7 +1519,7 @@ static int update_lookup_table(void *data) base = cpu_first_thread_sibling(update->cpu); for (j = 0; j < threads_per_core; j++) { - update_numa_cpu_lookup_table(base + j, nid); + update_numa_cpu_node(base + j, nid); } } @@ -1571,7 +1572,7 @@ int arch_update_cpu_topology(void) if (new_nid < 0 || !node_online(new_nid)) new_nid = first_online_node; - if (new_nid == numa_cpu_lookup_table[cpu]) { + if (new_nid == cpu_to_node(cpu)) { cpumask_andnot(&cpu_associativity_changes_mask, &cpu_associativity_changes_mask, cpu_sibling_mask(cpu)); @@ -1583,7 +1584,7 @@ int arch_update_cpu_topology(void) ud = &updates[i++]; ud->cpu = sibling; ud->new_nid = new_nid; - ud->old_nid = numa_cpu_lookup_table[sibling]; + ud->old_nid = cpu_to_node(sibling); cpumask_set_cpu(sibling, &updated_cpus); if (i < weight) ud->next = &updates[i]; ^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH 3/3] powerpc: enable CONFIG_HAVE_MEMORYLESS_NODES 2014-02-19 23:22 ` [PATCH 2/3 v2] " Nishanth Aravamudan @ 2014-02-19 23:23 ` Nishanth Aravamudan 0 siblings, 0 replies; 12+ messages in thread From: Nishanth Aravamudan @ 2014-02-19 23:23 UTC (permalink / raw) To: Michal Hocko, Mel Gorman, linux-mm, Christoph Lameter, David Rientjes, Joonsoo Kim, Ben Herrenschmidt, Anton Blanchard, linuxppc-dev Anton Blanchard found an issue with an LPAR that had no memory in Node 0. Christoph Lameter recommended, as one possible solution, to use numa_mem_id() for locality of the nearest memory node-wise. However, numa_mem_id() [and the other related APIs] are only useful if CONFIG_HAVE_MEMORYLESS_NODES is set. This is only the case for ia64 currently, but clearly we can have memoryless nodes on ppc64. Add the Kconfig option and define it to be the same value as CONFIG_NUMA. On the LPAR in question, which was very inefficiently using slabs, this took the slab consumption at boot from roughly 7GB to roughly 4GB. Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Reviewed-by: Christoph Lameter <cl@linux.com> Cc: Ben Herrenschmidt <benh@kernel.crashing.org> Cc: Anton Blanchard <anton@samba.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: linuxppc-dev@lists.ozlabs.org diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index a84816c..0f5cd68 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -449,6 +449,9 @@ config NODES_SHIFT default "4" depends on NEED_MULTIPLE_NODES +config HAVE_MEMORYLESS_NODES + def_bool NUMA + config USE_PERCPU_NUMA_NODE_ID def_bool y depends on NUMA ^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH 1/3] mm: return NUMA_NO_NODE in local_memory_node if zonelists are not setup 2014-02-19 23:17 ` [PATCH 1/3] mm: return NUMA_NO_NODE in local_memory_node if zonelists are not setup Nishanth Aravamudan 2014-02-19 23:18 ` [PATCH 2/3] powerpc: enable CONFIG_HAVE_PERCPU_NUMA_NODE_ID Nishanth Aravamudan @ 2014-02-19 23:32 ` Nishanth Aravamudan 2014-02-20 16:05 ` Christoph Lameter 2 siblings, 0 replies; 12+ messages in thread From: Nishanth Aravamudan @ 2014-02-19 23:32 UTC (permalink / raw) To: Michal Hocko, Mel Gorman, linux-mm, Christoph Lameter, David Rientjes, Joonsoo Kim, Ben Herrenschmidt, Anton Blanchard, linuxppc-dev, akpm [ Grr, sorry for not including you originally Andrew, if this ends up being ok with others, it will probably need to go through your tree. ] On 19.02.2014 [15:17:14 -0800], Nishanth Aravamudan wrote: > We can call local_memory_node() before the zonelists are setup. In that > case, first_zones_zonelist() will not set zone and the reference to > zone->node will Oops. Catch this case, and, since we presumably running > very early, just return that any node will do. > > Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: Michal Hocko <mhocko@suse.cz> > Cc: Mel Gorman <mgorman@suse.de> > Cc: linux-mm@kvack.org > Cc: linux-kernel@vger.kernel.org > Cc: Christoph Lameter <cl@linux.com> > Cc: David Rientjes <rientjes@google.com> > Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> > Cc: Ben Herrenschmidt <benh@kernel.crashing.org> > Cc: Anton Blanchard <anton@samba.org> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index e3758a0..5de4337 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -3650,6 +3650,8 @@ int local_memory_node(int node) > gfp_zone(GFP_KERNEL), > NULL, > &zone); > + if (!zone) > + return NUMA_NO_NODE; > return zone->node; > } > #endif ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 1/3] mm: return NUMA_NO_NODE in local_memory_node if zonelists are not setup 2014-02-19 23:17 ` [PATCH 1/3] mm: return NUMA_NO_NODE in local_memory_node if zonelists are not setup Nishanth Aravamudan 2014-02-19 23:18 ` [PATCH 2/3] powerpc: enable CONFIG_HAVE_PERCPU_NUMA_NODE_ID Nishanth Aravamudan 2014-02-19 23:32 ` [PATCH 1/3] mm: return NUMA_NO_NODE in local_memory_node if zonelists are not setup Nishanth Aravamudan @ 2014-02-20 16:05 ` Christoph Lameter 2014-02-20 18:28 ` Nishanth Aravamudan 2 siblings, 1 reply; 12+ messages in thread From: Christoph Lameter @ 2014-02-20 16:05 UTC (permalink / raw) To: Nishanth Aravamudan Cc: Michal Hocko, linux-mm, Mel Gorman, David Rientjes, Joonsoo Kim, linuxppc-dev, Anton Blanchard On Wed, 19 Feb 2014, Nishanth Aravamudan wrote: > We can call local_memory_node() before the zonelists are setup. In that > case, first_zones_zonelist() will not set zone and the reference to > zone->node will Oops. Catch this case, and, since we presumably running > very early, just return that any node will do. Really? Isnt there some way to avoid this call if zonelists are not setup yet? ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 1/3] mm: return NUMA_NO_NODE in local_memory_node if zonelists are not setup 2014-02-20 16:05 ` Christoph Lameter @ 2014-02-20 18:28 ` Nishanth Aravamudan 2014-02-21 22:42 ` Andrew Morton 0 siblings, 1 reply; 12+ messages in thread From: Nishanth Aravamudan @ 2014-02-20 18:28 UTC (permalink / raw) To: Christoph Lameter Cc: Michal Hocko, linux-mm, Mel Gorman, David Rientjes, Joonsoo Kim, linuxppc-dev, Anton Blanchard On 20.02.2014 [10:05:39 -0600], Christoph Lameter wrote: > On Wed, 19 Feb 2014, Nishanth Aravamudan wrote: > > > We can call local_memory_node() before the zonelists are setup. In that > > case, first_zones_zonelist() will not set zone and the reference to > > zone->node will Oops. Catch this case, and, since we presumably running > > very early, just return that any node will do. > > Really? Isnt there some way to avoid this call if zonelists are not setup > yet? How do I best determine if zonelists aren't setup yet? The call-path in question (after my series is applied) is: arch/powerpc/kernel/setup_64.c::setup_arch -> arch/powerpc/mm/numa.c::do_init_bootmem() -> cpu_numa_callback() -> numa_setup_cpu() -> map_cpu_to_node() -> update_numa_cpu_node() -> set_cpu_numa_mem() and setup_arch() is called before build_all_zonelists(NULL, NULL) in start_kernel(). This seemed like the most reasonable path, as it's used on hotplug as well. I'm open to suggestsions! Thanks, Nish ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 1/3] mm: return NUMA_NO_NODE in local_memory_node if zonelists are not setup 2014-02-20 18:28 ` Nishanth Aravamudan @ 2014-02-21 22:42 ` Andrew Morton 2014-02-21 23:56 ` Nishanth Aravamudan 0 siblings, 1 reply; 12+ messages in thread From: Andrew Morton @ 2014-02-21 22:42 UTC (permalink / raw) To: Nishanth Aravamudan Cc: Michal Hocko, linux-mm, Mel Gorman, David Rientjes, Christoph Lameter, linuxppc-dev, Joonsoo Kim, Anton Blanchard On Thu, 20 Feb 2014 10:28:47 -0800 Nishanth Aravamudan <nacc@linux.vnet.ibm.com> wrote: > On 20.02.2014 [10:05:39 -0600], Christoph Lameter wrote: > > On Wed, 19 Feb 2014, Nishanth Aravamudan wrote: > > > > > We can call local_memory_node() before the zonelists are setup. In that > > > case, first_zones_zonelist() will not set zone and the reference to > > > zone->node will Oops. Catch this case, and, since we presumably running > > > very early, just return that any node will do. > > > > Really? Isnt there some way to avoid this call if zonelists are not setup > > yet? > > How do I best determine if zonelists aren't setup yet? > > The call-path in question (after my series is applied) is: > > arch/powerpc/kernel/setup_64.c::setup_arch -> > arch/powerpc/mm/numa.c::do_init_bootmem() -> > cpu_numa_callback() -> > numa_setup_cpu() -> > map_cpu_to_node() -> > update_numa_cpu_node() -> > set_cpu_numa_mem() > > and setup_arch() is called before build_all_zonelists(NULL, NULL) in > start_kernel(). This seemed like the most reasonable path, as it's used > on hotplug as well. > But the call to local_memory_node() you added was in start_secondary(), which isn't in that trace. I do agree that calling local_memory_node() too early then trying to fudge around the consequences seems rather wrong. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 1/3] mm: return NUMA_NO_NODE in local_memory_node if zonelists are not setup 2014-02-21 22:42 ` Andrew Morton @ 2014-02-21 23:56 ` Nishanth Aravamudan 2014-02-24 19:43 ` Christoph Lameter 0 siblings, 1 reply; 12+ messages in thread From: Nishanth Aravamudan @ 2014-02-21 23:56 UTC (permalink / raw) To: Andrew Morton Cc: Michal Hocko, linux-mm, Mel Gorman, David Rientjes, Christoph Lameter, linuxppc-dev, Joonsoo Kim, Anton Blanchard On 21.02.2014 [14:42:03 -0800], Andrew Morton wrote: > On Thu, 20 Feb 2014 10:28:47 -0800 Nishanth Aravamudan <nacc@linux.vnet.ibm.com> wrote: > > > On 20.02.2014 [10:05:39 -0600], Christoph Lameter wrote: > > > On Wed, 19 Feb 2014, Nishanth Aravamudan wrote: > > > > > > > We can call local_memory_node() before the zonelists are setup. In that > > > > case, first_zones_zonelist() will not set zone and the reference to > > > > zone->node will Oops. Catch this case, and, since we presumably running > > > > very early, just return that any node will do. > > > > > > Really? Isnt there some way to avoid this call if zonelists are not setup > > > yet? > > > > How do I best determine if zonelists aren't setup yet? > > > > The call-path in question (after my series is applied) is: > > > > arch/powerpc/kernel/setup_64.c::setup_arch -> > > arch/powerpc/mm/numa.c::do_init_bootmem() -> > > cpu_numa_callback() -> > > numa_setup_cpu() -> > > map_cpu_to_node() -> > > update_numa_cpu_node() -> > > set_cpu_numa_mem() > > > > and setup_arch() is called before build_all_zonelists(NULL, NULL) in > > start_kernel(). This seemed like the most reasonable path, as it's used > > on hotplug as well. > > > > But the call to local_memory_node() you added was in start_secondary(), > which isn't in that trace. I added two calls to local_memory_node(), I *think* both are necessary, but am willing to be corrected. One is in map_cpu_to_node() and one is in start_secondary(). The start_secondary() path is fine, AFAICT, as we are up & running at that point. But in [the renamed function] update_numa_cpu_node() which is used by hotplug, we get called from do_init_bootmem(), which is before the zonelists are setup. I think both calls are necessary because I believe the arch_update_cpu_topology() is used for supporting firmware-driven home-noding, which does not invoke start_secondary() again (the processor is already running, we're just updating the topology in that situation). Then again, I could special-case the do_init_bootmem callpath, which is only called at kernel init time? > I do agree that calling local_memory_node() too early then trying to > fudge around the consequences seems rather wrong. If the answer is to simply not call local_memory_node() early, I'll submit a patch to at least add a comment, as there's nothing in the code itself to prevent this from happening and is guaranteed to oops. Thanks, Nish ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 1/3] mm: return NUMA_NO_NODE in local_memory_node if zonelists are not setup 2014-02-21 23:56 ` Nishanth Aravamudan @ 2014-02-24 19:43 ` Christoph Lameter 2014-02-25 2:34 ` Nishanth Aravamudan 0 siblings, 1 reply; 12+ messages in thread From: Christoph Lameter @ 2014-02-24 19:43 UTC (permalink / raw) To: Nishanth Aravamudan Cc: Michal Hocko, linux-mm, Mel Gorman, David Rientjes, Andrew Morton, linuxppc-dev, Joonsoo Kim, Anton Blanchard On Fri, 21 Feb 2014, Nishanth Aravamudan wrote: > I added two calls to local_memory_node(), I *think* both are necessary, > but am willing to be corrected. > > One is in map_cpu_to_node() and one is in start_secondary(). The > start_secondary() path is fine, AFAICT, as we are up & running at that > point. But in [the renamed function] update_numa_cpu_node() which is > used by hotplug, we get called from do_init_bootmem(), which is before > the zonelists are setup. > > I think both calls are necessary because I believe the > arch_update_cpu_topology() is used for supporting firmware-driven > home-noding, which does not invoke start_secondary() again (the > processor is already running, we're just updating the topology in that > situation). > > Then again, I could special-case the do_init_bootmem callpath, which is > only called at kernel init time? Well taht looks to be simpler. > > I do agree that calling local_memory_node() too early then trying to > > fudge around the consequences seems rather wrong. > > If the answer is to simply not call local_memory_node() early, I'll > submit a patch to at least add a comment, as there's nothing in the code > itself to prevent this from happening and is guaranteed to oops. Ok. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 1/3] mm: return NUMA_NO_NODE in local_memory_node if zonelists are not setup 2014-02-24 19:43 ` Christoph Lameter @ 2014-02-25 2:34 ` Nishanth Aravamudan 0 siblings, 0 replies; 12+ messages in thread From: Nishanth Aravamudan @ 2014-02-25 2:34 UTC (permalink / raw) To: Christoph Lameter Cc: Michal Hocko, linux-mm, Mel Gorman, David Rientjes, Andrew Morton, linuxppc-dev, Joonsoo Kim, Anton Blanchard On 24.02.2014 [13:43:31 -0600], Christoph Lameter wrote: > On Fri, 21 Feb 2014, Nishanth Aravamudan wrote: > > > I added two calls to local_memory_node(), I *think* both are necessary, > > but am willing to be corrected. > > > > One is in map_cpu_to_node() and one is in start_secondary(). The > > start_secondary() path is fine, AFAICT, as we are up & running at that > > point. But in [the renamed function] update_numa_cpu_node() which is > > used by hotplug, we get called from do_init_bootmem(), which is before > > the zonelists are setup. > > > > I think both calls are necessary because I believe the > > arch_update_cpu_topology() is used for supporting firmware-driven > > home-noding, which does not invoke start_secondary() again (the > > processor is already running, we're just updating the topology in that > > situation). > > > > Then again, I could special-case the do_init_bootmem callpath, which is > > only called at kernel init time? > > Well taht looks to be simpler. Ok, I'll work on this. > > > I do agree that calling local_memory_node() too early then trying to > > > fudge around the consequences seems rather wrong. > > > > If the answer is to simply not call local_memory_node() early, I'll > > submit a patch to at least add a comment, as there's nothing in the code > > itself to prevent this from happening and is guaranteed to oops. > > Ok. Thanks! -Nish ^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2014-02-25 2:34 UTC | newest] Thread overview: 12+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2014-02-19 23:16 [PATCH 0/3] powerpc: support memoryless nodes Nishanth Aravamudan 2014-02-19 23:17 ` [PATCH 1/3] mm: return NUMA_NO_NODE in local_memory_node if zonelists are not setup Nishanth Aravamudan 2014-02-19 23:18 ` [PATCH 2/3] powerpc: enable CONFIG_HAVE_PERCPU_NUMA_NODE_ID Nishanth Aravamudan 2014-02-19 23:22 ` [PATCH 2/3 v2] " Nishanth Aravamudan 2014-02-19 23:23 ` [PATCH 3/3] powerpc: enable CONFIG_HAVE_MEMORYLESS_NODES Nishanth Aravamudan 2014-02-19 23:32 ` [PATCH 1/3] mm: return NUMA_NO_NODE in local_memory_node if zonelists are not setup Nishanth Aravamudan 2014-02-20 16:05 ` Christoph Lameter 2014-02-20 18:28 ` Nishanth Aravamudan 2014-02-21 22:42 ` Andrew Morton 2014-02-21 23:56 ` Nishanth Aravamudan 2014-02-24 19:43 ` Christoph Lameter 2014-02-25 2:34 ` Nishanth Aravamudan
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).