From mboxrd@z Thu Jan 1 00:00:00 1970 From: Andrew Morton Subject: Re: [PATCH 1/8] numa: add generic percpu var numa_node_id() implementation Date: Fri, 16 Apr 2010 13:33:24 -0700 Message-ID: <20100416133324.fcb1c168.akpm@linux-foundation.org> References: <20100415172950.8801.60358.sendpatchset@localhost.localdomain> <20100415172956.8801.18133.sendpatchset@localhost.localdomain> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <20100415172956.8801.18133.sendpatchset@localhost.localdomain> Sender: owner-linux-mm@kvack.org To: Lee Schermerhorn Cc: linux-mm@kvack.org, linux-numa@vger.kernel.org, Tejun Heo , Mel Gorman , andi@firstfloor.org, Christoph Lameter , Nick Piggin , David Rientjes , eric.whitney@hp.com, KAMEZAWA Hiroyuki , linux-arch@vger.kernel.org List-Id: linux-arch.vger.kernel.org On Thu, 15 Apr 2010 13:29:56 -0400 Lee Schermerhorn wrote: > Rework the generic version of the numa_node_id() function to use the > new generic percpu variable infrastructure. > > Guard the new implementation with a new config option: > > CONFIG_USE_PERCPU_NUMA_NODE_ID. > > Archs which support this new implemention will default this option > to 'y' when NUMA is configured. This config option could be removed > if/when all archs switch over to the generic percpu implementation > of numa_node_id(). Arch support involves: > > 1) converting any existing per cpu variable implementations to use > this implementation. x86_64 is an instance of such an arch. > 2) archs that don't use a per cpu variable for numa_node_id() will > need to initialize the new per cpu variable "numa_node" as cpus > are brought on-line. ia64 is an example. > 3) Defining USE_PERCPU_NUMA_NODE_ID in arch dependent Kconfig--e.g., > when NUMA is configured. This is required because I have > retained the old implementation by default to allow archs to > be modified incrementally, as desired. > > Subsequent patches will convert x86_64 and ia64 to use this > implemenation. So which arches _aren't_ converted? powerpc, sparc and alpha? Is there sufficient info here for the maintainers to be able to perform the conversion with minimal head-scratching? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Lee Schermerhorn Subject: Re: [PATCH 1/8] numa: add generic percpu var numa_node_id() implementation Date: Mon, 19 Apr 2010 09:22:31 -0400 Message-ID: <1271683352.10937.34.camel@useless.americas.hpqcorp.net> References: <20100415172950.8801.60358.sendpatchset@localhost.localdomain> <20100415172956.8801.18133.sendpatchset@localhost.localdomain> <20100416133324.fcb1c168.akpm@linux-foundation.org> Mime-Version: 1.0 Content-Type: text/plain Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <20100416133324.fcb1c168.akpm@linux-foundation.org> Sender: linux-numa-owner@vger.kernel.org To: Andrew Morton Cc: linux-mm@kvack.org, linux-numa@vger.kernel.org, Tejun Heo , Mel Gorman , andi@firstfloor.org, Christoph Lameter , Nick Piggin , David Rientjes , eric.whitney@hp.com, KAMEZAWA Hiroyuki , linux-arch@vger.kernel.org List-Id: linux-arch.vger.kernel.org On Fri, 2010-04-16 at 13:33 -0700, Andrew Morton wrote: > On Thu, 15 Apr 2010 13:29:56 -0400 > Lee Schermerhorn wrote: > > > Rework the generic version of the numa_node_id() function to use the > > new generic percpu variable infrastructure. > > > > Guard the new implementation with a new config option: > > > > CONFIG_USE_PERCPU_NUMA_NODE_ID. > > > > Archs which support this new implemention will default this option > > to 'y' when NUMA is configured. This config option could be removed > > if/when all archs switch over to the generic percpu implementation > > of numa_node_id(). Arch support involves: > > > > 1) converting any existing per cpu variable implementations to use > > this implementation. x86_64 is an instance of such an arch. > > 2) archs that don't use a per cpu variable for numa_node_id() will > > need to initialize the new per cpu variable "numa_node" as cpus > > are brought on-line. ia64 is an example. > > 3) Defining USE_PERCPU_NUMA_NODE_ID in arch dependent Kconfig--e.g., > > when NUMA is configured. This is required because I have > > retained the old implementation by default to allow archs to > > be modified incrementally, as desired. > > > > Subsequent patches will convert x86_64 and ia64 to use this > > implemenation. > > So which arches _aren't_ converted? powerpc, sparc and alpha? Right. Plus ARM, mips, ... I could take a cut at other archs, but can't test them. I'm hoping that this patch doesn't break the existing implementation for them. It should be a no-op until the new support is enabled via Kconfig. The fact that both x86_64 and ia64 build with just this patch gives me some hope but not a lot of confidence. I see that you've merged the series with into -mm. We'll see what happens. Of course, no reports of errors could just mean no testing. > > Is there sufficient info here for the maintainers to be able to > perform the conversion with minimal head-scratching? Arch maintainers will need to chime in on that. I'd hoped that the list above and the examples of x86_64 and ia64 in the subsequent patches would suffice. Lee From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from smtp1.linux-foundation.org ([140.211.169.13]:48067 "EHLO smtp1.linux-foundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932396Ab0DPUfA (ORCPT ); Fri, 16 Apr 2010 16:35:00 -0400 Date: Fri, 16 Apr 2010 13:33:24 -0700 From: Andrew Morton Subject: Re: [PATCH 1/8] numa: add generic percpu var numa_node_id() implementation Message-ID: <20100416133324.fcb1c168.akpm@linux-foundation.org> In-Reply-To: <20100415172956.8801.18133.sendpatchset@localhost.localdomain> References: <20100415172950.8801.60358.sendpatchset@localhost.localdomain> <20100415172956.8801.18133.sendpatchset@localhost.localdomain> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-arch-owner@vger.kernel.org List-ID: To: Lee Schermerhorn Cc: linux-mm@kvack.org, linux-numa@vger.kernel.org, Tejun Heo , Mel Gorman , andi@firstfloor.org, Christoph Lameter , Nick Piggin , David Rientjes , eric.whitney@hp.com, KAMEZAWA Hiroyuki , linux-arch@vger.kernel.org Message-ID: <20100416203324.2XKtVNPomPxY9tilcbkHcvM-GxRtGZ6PFuZy1W3TlRw@z> On Thu, 15 Apr 2010 13:29:56 -0400 Lee Schermerhorn wrote: > Rework the generic version of the numa_node_id() function to use the > new generic percpu variable infrastructure. > > Guard the new implementation with a new config option: > > CONFIG_USE_PERCPU_NUMA_NODE_ID. > > Archs which support this new implemention will default this option > to 'y' when NUMA is configured. This config option could be removed > if/when all archs switch over to the generic percpu implementation > of numa_node_id(). Arch support involves: > > 1) converting any existing per cpu variable implementations to use > this implementation. x86_64 is an instance of such an arch. > 2) archs that don't use a per cpu variable for numa_node_id() will > need to initialize the new per cpu variable "numa_node" as cpus > are brought on-line. ia64 is an example. > 3) Defining USE_PERCPU_NUMA_NODE_ID in arch dependent Kconfig--e.g., > when NUMA is configured. This is required because I have > retained the old implementation by default to allow archs to > be modified incrementally, as desired. > > Subsequent patches will convert x86_64 and ia64 to use this > implemenation. So which arches _aren't_ converted? powerpc, sparc and alpha? Is there sufficient info here for the maintainers to be able to perform the conversion with minimal head-scratching? From mboxrd@z Thu Jan 1 00:00:00 1970 From: Lee Schermerhorn Subject: [PATCH 0/8] Numa: Use Generic Per-cpu Variables for numa_*_id() Date: Thu, 15 Apr 2010 13:29:50 -0400 Message-ID: <20100415172950.8801.60358.sendpatchset@localhost.localdomain> Return-path: Sender: linux-numa-owner@vger.kernel.org List-ID: MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: linux-mm@kvack.org, linux-numa@vger.kernel.org Cc: Tejun Heo , Mel Gorman , Andi@domain.invalid, Kleen@domain.invalid, andi@firstfloor.org, Christoph Lameter , Nick Piggin , David Rientjes , eric.whitney@hp.com, Andrew Morton , KAMEZAWA Hiroyuki Use Generic Per cpu infrastructure for numa_*_id() V4 Series Against: 2.6.34-rc3-mmotm-100405-1609 Background: V1 of this series resolved a fairly serious performance problem on our ia64 platforms with memoryless nodes because SLAB cannot cache object from a remote node, even tho' that node is the effective "local memory node" for a given cpu. V1 caused no regression in x86_64 [a slight improvement even] for the admittedly few tests that I ran. Christoph Lameter suggested the approach implemented in V2 and later: define a new function--numa_mem_id()--that returns the "local memory node" for cpus attached to memoryless nodes. Christoph also suggested that, while at it, I could modify the implementation of numa_node_id() [and the related cpu_to_node()] to use the generic percpu variable implementation. While implementing V2, I encountered a circular header dependency between: topology.h -> percpu.h -> slab.h -> gfp.h -> topology.h I resolved this by moving the generic percpu functions to include/asm-generic/percpu.h so that various arch asm/percpu.h could include that, and topology.h could include asm/percpu.h to avoid including slab.h, breaking the circular dependency. Reviewers didn't like that. Matthew Willcox suggested that I uninline percpu_alloc()/free() for the !SMP config and remove slab.h from percpu.h. I tried that. I broke the build of a LOT of files. Tejun Heo mentioned that percpu-defs.h would be a better place for the generic function definitions. V3 implemented that suggestion. Later, Tejun decided to jump in and remove slab.h from percpu.h and semi- automagically fix up all of the affected modules. V4 is implemented atop Tejun's series now in mmotm. Again, this solves the slab performance problem on our servers configured with memoryless nodes, and shows no regression with hackbench on x86_64. Of course, more performance testing would be welcome. The slab changes in patch 6 of the series need review w/rt to node hot plug that could change the effective "local memory node" for a memoryless node by inserting a "nearer" node in the zonelists. An additional patch may be required to address this. Lee From mboxrd@z Thu Jan 1 00:00:00 1970 From: Lee Schermerhorn Subject: [PATCH 1/8] numa: add generic percpu var numa_node_id() implementation Date: Thu, 15 Apr 2010 13:29:56 -0400 Message-ID: <20100415172956.8801.18133.sendpatchset@localhost.localdomain> References: <20100415172950.8801.60358.sendpatchset@localhost.localdomain> Return-path: In-Reply-To: <20100415172950.8801.60358.sendpatchset@localhost.localdomain> Sender: linux-numa-owner@vger.kernel.org List-ID: MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: linux-mm@kvack.org, linux-numa@vger.kernel.org Cc: Tejun Heo , Mel Gorman , Andi@domain.invalid, Kleen@domain.invalid, andi@firstfloor.org, Christoph Lameter , Nick Piggin , David Rientjes , eric.whitney@hp.com, Andrew Morton , KAMEZAWA Hiroyuki Against: 2.6.34-rc3-mmotm-100405-1609 Rework the generic version of the numa_node_id() function to use the new generic percpu variable infrastructure. Guard the new implementation with a new config option: CONFIG_USE_PERCPU_NUMA_NODE_ID. Archs which support this new implemention will default this option to 'y' when NUMA is configured. This config option could be removed if/when all archs switch over to the generic percpu implementation of numa_node_id(). Arch support involves: 1) converting any existing per cpu variable implementations to use this implementation. x86_64 is an instance of such an arch. 2) archs that don't use a per cpu variable for numa_node_id() will need to initialize the new per cpu variable "numa_node" as cpus are brought on-line. ia64 is an example. 3) Defining USE_PERCPU_NUMA_NODE_ID in arch dependent Kconfig--e.g., when NUMA is configured. This is required because I have retained the old implementation by default to allow archs to be modified incrementally, as desired. Subsequent patches will convert x86_64 and ia64 to use this implemenation. Signed-off-by: Lee Schermerhorn --- V0: # From cl@linux-foundation.org Wed Nov 4 10:36:12 2009 # Date: Wed, 4 Nov 2009 12:35:14 -0500 (EST) # From: Christoph Lameter # To: Lee Schermerhorn # Subject: Re: [PATCH/RFC] slab: handle memoryless nodes efficiently # # I have a very early form of a draft of a patch here that genericizes # numa_node_id(). Uses the new generic this_cpu_xxx stuff. # # Not complete. V1: + split out x86 specific changes to subsequent patch + split out "numa_mem_id()" and related changes to separate patch + moved generic definitions of __this_cpu_xxx from linux/percpu.h to asm-generic/percpu.h where asm/percpu.h and other asm hdrs can use them. + export new percpu symbol 'numa_node' in mm/percpu.h + include in for use by new numa_node_id(). V2: + add back the #ifndef/#endif guard around numa_node_id() so that archs can override generic definition + add generic stub for set_numa_node() + use generic percpu numa_node_id() only if enabled by CONFIG_USE_PERCPU_NUMA_NODE_ID to allow incremental per arch support. This option could be removed when/if all archs that support NUMA support this option. V3: + separated the rework of linux/percpu.h into another [preceding] patch. + moved definition of the numa_node percpu variable from mm/percpu.c to mm/page-alloc.c + moved premature definition of cpu_to_mem() to later patch. V4: + topology.h: include rather than Requires Tejun Heo's percpu.h/slab.h cleanup series include/linux/topology.h | 33 ++++++++++++++++++++++++++++----- mm/page_alloc.c | 5 +++++ 2 files changed, 33 insertions(+), 5 deletions(-) Index: linux-2.6.34-rc3-mmotm-100405-1609/mm/page_alloc.c =================================================================== --- linux-2.6.34-rc3-mmotm-100405-1609.orig/mm/page_alloc.c 2010-04-07 10:04:04.000000000 -0400 +++ linux-2.6.34-rc3-mmotm-100405-1609/mm/page_alloc.c 2010-04-07 10:10:23.000000000 -0400 @@ -56,6 +56,11 @@ #include #include "internal.h" +#ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID +DEFINE_PER_CPU(int, numa_node); +EXPORT_PER_CPU_SYMBOL(numa_node); +#endif + /* * Array of node states. */ Index: linux-2.6.34-rc3-mmotm-100405-1609/include/linux/topology.h =================================================================== --- linux-2.6.34-rc3-mmotm-100405-1609.orig/include/linux/topology.h 2010-04-07 09:49:13.000000000 -0400 +++ linux-2.6.34-rc3-mmotm-100405-1609/include/linux/topology.h 2010-04-07 10:10:23.000000000 -0400 @@ -31,6 +31,7 @@ #include #include #include +#include #include #ifndef node_has_online_mem @@ -203,8 +204,35 @@ int arch_update_cpu_topology(void); #ifndef SD_NODE_INIT #error Please define an appropriate SD_NODE_INIT in include/asm/topology.h!!! #endif + #endif /* CONFIG_NUMA */ +#ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID +DECLARE_PER_CPU(int, numa_node); + +#ifndef numa_node_id +/* Returns the number of the current Node. */ +#define numa_node_id() __this_cpu_read(numa_node) +#endif + +#ifndef cpu_to_node +#define cpu_to_node(__cpu) per_cpu(numa_node, (__cpu)) +#endif + +#ifndef set_numa_node +#define set_numa_node(__node) percpu_write(numa_node, __node) +#endif + +#else /* !CONFIG_USE_PERCPU_NUMA_NODE_ID */ + +/* Returns the number of the current Node. */ +#ifndef numa_node_id +#define numa_node_id() (cpu_to_node(raw_smp_processor_id())) + +#endif + +#endif /* [!]CONFIG_USE_PERCPU_NUMA_NODE_ID */ + #ifndef topology_physical_package_id #define topology_physical_package_id(cpu) ((void)(cpu), -1) #endif @@ -218,9 +246,4 @@ int arch_update_cpu_topology(void); #define topology_core_cpumask(cpu) cpumask_of(cpu) #endif -/* Returns the number of the current Node. */ -#ifndef numa_node_id -#define numa_node_id() (cpu_to_node(raw_smp_processor_id())) -#endif - #endif /* _LINUX_TOPOLOGY_H */ From mboxrd@z Thu Jan 1 00:00:00 1970 From: Lee Schermerhorn Subject: [PATCH 2/8] numa: x86_64: use generic percpu var numa_node_id() implementation Date: Thu, 15 Apr 2010 13:30:03 -0400 Message-ID: <20100415173003.8801.48519.sendpatchset@localhost.localdomain> References: <20100415172950.8801.60358.sendpatchset@localhost.localdomain> Return-path: In-Reply-To: <20100415172950.8801.60358.sendpatchset@localhost.localdomain> Sender: owner-linux-mm@kvack.org List-Id: MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: linux-mm@kvack.org, linux-numa@vger.kernel.org Cc: Tejun Heo , Mel Gorman , Andi@domain.invalid, Kleen@domain.invalid, andi@firstfloor.org, Christoph Lameter , Nick Piggin , David Rientjes , eric.whitney@hp.com, Andrew Morton , KAMEZAWA Hiroyuki Against: 2.6.34-rc3-mmotm-100405-1609 x86 arch specific changes to use generic numa_node_id() based on generic percpu variable infrastructure. Back out x86's custom version of numa_node_id() Signed-off-by: Lee Schermerhorn [Christoph's signoff here?] --- V0: based on: # From cl@linux-foundation.org Wed Nov 4 10:36:12 2009 # Date: Wed, 4 Nov 2009 12:35:14 -0500 (EST) # From: Christoph Lameter # To: Lee Schermerhorn # Subject: Re: [PATCH/RFC] slab: handle memoryless nodes efficiently # # I have a very early form of a draft of a patch here that genericizes # numa_node_id(). Uses the new generic this_cpu_xxx stuff. # # Not complete. V1: + split out x86-specific changes from generic. + change 'node_number' => 'numa_node' in x86 arch code + define __this_cpu_read in x86 asm/percpu.h + change x86/kernel/setup_percpu.c to use early_cpu_to_node() to setup 'numa_node' as cpu_to_node() now depends on the per cpu var. [I think! What about cpu_to_node() func in x86/mm/numa_64.c ???] V2: + cpu_to_node() => early_cpu_to_node(); incomplete change in V01 + x86 arch define USE_PERCPU_NUMA_NODE_ID. V4: + remove '__this_cpu_{read|write}() from arch/x86/include/asm/percpu.h. + rename cpu_to_node() to __cpu_to_node() in arch/x86/mm/numa_64.c and override generic percpu implementation of cpu_to_node() in arch/x86/include/asm/topology.h under CONFIG_DEBUG_PER_CPU_MAPS to fix build breakage. [Don't know why we couldn't use the percpu version for debugging cpu maps.] arch/x86/Kconfig | 4 ++++ arch/x86/include/asm/topology.h | 20 +++++++------------- arch/x86/kernel/cpu/common.c | 6 +++--- arch/x86/kernel/setup_percpu.c | 4 ++-- arch/x86/mm/numa_64.c | 9 +++------ 5 files changed, 19 insertions(+), 24 deletions(-) Index: linux-2.6.34-rc3-mmotm-100405-1609/arch/x86/include/asm/topology.h =================================================================== --- linux-2.6.34-rc3-mmotm-100405-1609.orig/arch/x86/include/asm/topology.h 2010-04-07 09:49:13.000000000 -0400 +++ linux-2.6.34-rc3-mmotm-100405-1609/arch/x86/include/asm/topology.h 2010-04-07 10:10:25.000000000 -0400 @@ -53,33 +53,27 @@ extern int cpu_to_node_map[]; /* Returns the number of the node containing CPU 'cpu' */ -static inline int cpu_to_node(int cpu) +static inline int early_cpu_to_node(int cpu) { return cpu_to_node_map[cpu]; } -#define early_cpu_to_node(cpu) cpu_to_node(cpu) #else /* CONFIG_X86_64 */ /* Mappings between logical cpu number and node number */ DECLARE_EARLY_PER_CPU(int, x86_cpu_to_node_map); -/* Returns the number of the current Node. */ -DECLARE_PER_CPU(int, node_number); -#define numa_node_id() percpu_read(node_number) - #ifdef CONFIG_DEBUG_PER_CPU_MAPS -extern int cpu_to_node(int cpu); +/* + * override generic percpu implementation of cpu_to_node + */ +extern int __cpu_to_node(int cpu); +#define cpu_to_node __cpu_to_node + extern int early_cpu_to_node(int cpu); #else /* !CONFIG_DEBUG_PER_CPU_MAPS */ -/* Returns the number of the node containing CPU 'cpu' */ -static inline int cpu_to_node(int cpu) -{ - return per_cpu(x86_cpu_to_node_map, cpu); -} - /* Same function but used if called before per_cpu areas are setup */ static inline int early_cpu_to_node(int cpu) { Index: linux-2.6.34-rc3-mmotm-100405-1609/arch/x86/mm/numa_64.c =================================================================== --- linux-2.6.34-rc3-mmotm-100405-1609.orig/arch/x86/mm/numa_64.c 2010-04-07 10:03:41.000000000 -0400 +++ linux-2.6.34-rc3-mmotm-100405-1609/arch/x86/mm/numa_64.c 2010-04-07 10:10:25.000000000 -0400 @@ -33,9 +33,6 @@ int numa_off __initdata; static unsigned long __initdata nodemap_addr; static unsigned long __initdata nodemap_size; -DEFINE_PER_CPU(int, node_number) = 0; -EXPORT_PER_CPU_SYMBOL(node_number); - /* * Map cpu index to node index */ @@ -809,7 +806,7 @@ void __cpuinit numa_set_node(int cpu, in per_cpu(x86_cpu_to_node_map, cpu) = node; if (node != NUMA_NO_NODE) - per_cpu(node_number, cpu) = node; + per_cpu(numa_node, cpu) = node; } void __cpuinit numa_clear_node(int cpu) @@ -867,7 +864,7 @@ void __cpuinit numa_remove_cpu(int cpu) numa_set_cpumask(cpu, 0); } -int cpu_to_node(int cpu) +int __cpu_to_node(int cpu) { if (early_per_cpu_ptr(x86_cpu_to_node_map)) { printk(KERN_WARNING @@ -877,7 +874,7 @@ int cpu_to_node(int cpu) } return per_cpu(x86_cpu_to_node_map, cpu); } -EXPORT_SYMBOL(cpu_to_node); +EXPORT_SYMBOL(__cpu_to_node); /* * Same function as cpu_to_node() but used if called before the Index: linux-2.6.34-rc3-mmotm-100405-1609/arch/x86/kernel/cpu/common.c =================================================================== --- linux-2.6.34-rc3-mmotm-100405-1609.orig/arch/x86/kernel/cpu/common.c 2010-04-07 10:03:49.000000000 -0400 +++ linux-2.6.34-rc3-mmotm-100405-1609/arch/x86/kernel/cpu/common.c 2010-04-07 10:10:25.000000000 -0400 @@ -1121,9 +1121,9 @@ void __cpuinit cpu_init(void) oist = &per_cpu(orig_ist, cpu); #ifdef CONFIG_NUMA - if (cpu != 0 && percpu_read(node_number) == 0 && - cpu_to_node(cpu) != NUMA_NO_NODE) - percpu_write(node_number, cpu_to_node(cpu)); + if (cpu != 0 && percpu_read(numa_node) == 0 && + early_cpu_to_node(cpu) != NUMA_NO_NODE) + set_numa_node(early_cpu_to_node(cpu)); #endif me = current; Index: linux-2.6.34-rc3-mmotm-100405-1609/arch/x86/kernel/setup_percpu.c =================================================================== --- linux-2.6.34-rc3-mmotm-100405-1609.orig/arch/x86/kernel/setup_percpu.c 2010-04-07 10:03:49.000000000 -0400 +++ linux-2.6.34-rc3-mmotm-100405-1609/arch/x86/kernel/setup_percpu.c 2010-04-07 10:10:25.000000000 -0400 @@ -265,10 +265,10 @@ void __init setup_per_cpu_areas(void) #if defined(CONFIG_X86_64) && defined(CONFIG_NUMA) /* - * make sure boot cpu node_number is right, when boot cpu is on the + * make sure boot cpu numa_node is right, when boot cpu is on the * node that doesn't have mem installed */ - per_cpu(node_number, boot_cpu_id) = cpu_to_node(boot_cpu_id); + per_cpu(numa_node, boot_cpu_id) = early_cpu_to_node(boot_cpu_id); #endif /* Setup node to cpumask map */ Index: linux-2.6.34-rc3-mmotm-100405-1609/arch/x86/Kconfig =================================================================== --- linux-2.6.34-rc3-mmotm-100405-1609.orig/arch/x86/Kconfig 2010-04-07 10:10:20.000000000 -0400 +++ linux-2.6.34-rc3-mmotm-100405-1609/arch/x86/Kconfig 2010-04-07 10:10:25.000000000 -0400 @@ -1715,6 +1715,10 @@ config HAVE_ARCH_EARLY_PFN_TO_NID def_bool X86_64 depends on NUMA +config USE_PERCPU_NUMA_NODE_ID + def_bool y + depends on NUMA + menu "Power management and ACPI options" config ARCH_HIBERNATION_HEADER -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Lee Schermerhorn Subject: [PATCH 3/8] numa: ia64: use generic percpu var numa_node_id() implementation Date: Thu, 15 Apr 2010 13:30:09 -0400 Message-ID: <20100415173009.8801.67345.sendpatchset@localhost.localdomain> References: <20100415172950.8801.60358.sendpatchset@localhost.localdomain> Return-path: In-Reply-To: <20100415172950.8801.60358.sendpatchset@localhost.localdomain> Sender: linux-numa-owner@vger.kernel.org List-ID: MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: linux-mm@kvack.org, linux-numa@vger.kernel.org Cc: Tejun Heo , Mel Gorman , Andi@domain.invalid, Kleen@domain.invalid, andi@firstfloor.org, Christoph Lameter , Nick Piggin , David Rientjes , eric.whitney@hp.com, Andrew Morton , KAMEZAWA Hiroyuki Against: 2.6.34-rc3-mmotm-100405-1609 ia64: Use generic percpu implementation of numa_node_id() + intialize per cpu 'numa_node' + remove ia64 cpu_to_node() macro; use generic + define CONFIG_USE_PERCPU_NUMA_NODE_ID when NUMA configured Signed-off-by: Lee Schermerhorn Reviewed-by: Christoph Lameter --- New in V2 V3, V4: no change arch/ia64/Kconfig | 4 ++++ arch/ia64/include/asm/topology.h | 5 ----- arch/ia64/kernel/smpboot.c | 6 ++++++ 3 files changed, 10 insertions(+), 5 deletions(-) Index: linux-2.6.34-rc3-mmotm-100405-1609/arch/ia64/kernel/smpboot.c =================================================================== --- linux-2.6.34-rc3-mmotm-100405-1609.orig/arch/ia64/kernel/smpboot.c 2010-04-07 10:03:38.000000000 -0400 +++ linux-2.6.34-rc3-mmotm-100405-1609/arch/ia64/kernel/smpboot.c 2010-04-07 10:10:27.000000000 -0400 @@ -390,6 +390,11 @@ smp_callin (void) fix_b0_for_bsp(); + /* + * numa_node_id() works after this. + */ + set_numa_node(cpu_to_node_map[cpuid]); + ipi_call_lock_irq(); spin_lock(&vector_lock); /* Setup the per cpu irq handling data structures */ @@ -632,6 +637,7 @@ void __devinit smp_prepare_boot_cpu(void { cpu_set(smp_processor_id(), cpu_online_map); cpu_set(smp_processor_id(), cpu_callin_map); + set_numa_node(cpu_to_node_map[smp_processor_id()]); per_cpu(cpu_state, smp_processor_id()) = CPU_ONLINE; paravirt_post_smp_prepare_boot_cpu(); } Index: linux-2.6.34-rc3-mmotm-100405-1609/arch/ia64/include/asm/topology.h =================================================================== --- linux-2.6.34-rc3-mmotm-100405-1609.orig/arch/ia64/include/asm/topology.h 2010-04-07 09:49:13.000000000 -0400 +++ linux-2.6.34-rc3-mmotm-100405-1609/arch/ia64/include/asm/topology.h 2010-04-07 10:10:27.000000000 -0400 @@ -26,11 +26,6 @@ #define RECLAIM_DISTANCE 15 /* - * Returns the number of the node containing CPU 'cpu' - */ -#define cpu_to_node(cpu) (int)(cpu_to_node_map[cpu]) - -/* * Returns a bitmask of CPUs on Node 'node'. */ #define cpumask_of_node(node) ((node) == -1 ? \ Index: linux-2.6.34-rc3-mmotm-100405-1609/arch/ia64/Kconfig =================================================================== --- linux-2.6.34-rc3-mmotm-100405-1609.orig/arch/ia64/Kconfig 2010-04-07 10:04:03.000000000 -0400 +++ linux-2.6.34-rc3-mmotm-100405-1609/arch/ia64/Kconfig 2010-04-07 10:10:27.000000000 -0400 @@ -497,6 +497,10 @@ config HAVE_ARCH_NODEDATA_EXTENSION def_bool y depends on NUMA +config USE_PERCPU_NUMA_NODE_ID + def_bool y + depends on NUMA + config ARCH_PROC_KCORE_TEXT def_bool y depends on PROC_KCORE From mboxrd@z Thu Jan 1 00:00:00 1970 From: Lee Schermerhorn Subject: [PATCH 4/8] numa: Introduce numa_mem_id()- effective local memory node id Date: Thu, 15 Apr 2010 13:30:16 -0400 Message-ID: <20100415173016.8801.34970.sendpatchset@localhost.localdomain> References: <20100415172950.8801.60358.sendpatchset@localhost.localdomain> Return-path: In-Reply-To: <20100415172950.8801.60358.sendpatchset@localhost.localdomain> Sender: linux-numa-owner@vger.kernel.org List-ID: MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: linux-mm@kvack.org, linux-numa@vger.kernel.org Cc: Tejun Heo , Mel Gorman , Andi@domain.invalid, Kleen@domain.invalid, andi@firstfloor.org, Christoph Lameter , Nick Piggin , David Rientjes , eric.whitney@hp.com, Andrew Morton , KAMEZAWA Hiroyuki Against: 2.6.34-rc3-mmotm-100405-1609 Introduce numa_mem_id(), based on generic percpu variable infrastructure to track "nearest node with memory" for archs that support memoryless nodes. Define API in when CONFIG_HAVE_MEMORYLESS_NODES defined, else stubs. Architectures will define HAVE_MEMORYLESS_NODES if/when they support them. Archs can override definitions of: numa_mem_id() - returns node number of "local memory" node set_numa_mem() - initialize [this cpus'] per cpu variable 'numa_mem' cpu_to_mem() - return numa_mem for specified cpu; may be used as lvalue Generic initialization of 'numa_mem' occurs in __build_all_zonelists(). This will initialize the boot cpu at boot time, and all cpus on change of numa_zonelist_order, or when node or memory hot-plug requires zonelist rebuild. Archs that support memoryless nodes will need to initialize 'numa_mem' for secondary cpus as they're brought on-line. Signed-off-by: Lee Schermerhorn Signed-off-by: Christoph Lameter --- V2: + split this out of Christoph's incomplete "starter patch" + flesh out the definition V3,V4: no change include/asm-generic/topology.h | 3 +++ include/linux/mmzone.h | 6 ++++++ include/linux/topology.h | 24 ++++++++++++++++++++++++ mm/page_alloc.c | 39 ++++++++++++++++++++++++++++++++++++++- 4 files changed, 71 insertions(+), 1 deletion(-) Index: linux-2.6.34-rc3-mmotm-100405-1609/include/linux/topology.h =================================================================== --- linux-2.6.34-rc3-mmotm-100405-1609.orig/include/linux/topology.h 2010-04-07 10:10:23.000000000 -0400 +++ linux-2.6.34-rc3-mmotm-100405-1609/include/linux/topology.h 2010-04-07 10:10:28.000000000 -0400 @@ -233,6 +233,30 @@ DECLARE_PER_CPU(int, numa_node); #endif /* [!]CONFIG_USE_PERCPU_NUMA_NODE_ID */ +#ifdef CONFIG_HAVE_MEMORYLESS_NODES + +DECLARE_PER_CPU(int, numa_mem); + +#ifndef set_numa_mem +#define set_numa_mem(__node) percpu_write(numa_mem, __node) +#endif + +#else /* !CONFIG_HAVE_MEMORYLESS_NODES */ + +#define numa_mem numa_node +static inline void set_numa_mem(int node) {} + +#endif /* [!]CONFIG_HAVE_MEMORYLESS_NODES */ + +#ifndef numa_mem_id +/* Returns the number of the nearest Node with memory */ +#define numa_mem_id() __this_cpu_read(numa_mem) +#endif + +#ifndef cpu_to_mem +#define cpu_to_mem(__cpu) per_cpu(numa_mem, (__cpu)) +#endif + #ifndef topology_physical_package_id #define topology_physical_package_id(cpu) ((void)(cpu), -1) #endif Index: linux-2.6.34-rc3-mmotm-100405-1609/mm/page_alloc.c =================================================================== --- linux-2.6.34-rc3-mmotm-100405-1609.orig/mm/page_alloc.c 2010-04-07 10:10:23.000000000 -0400 +++ linux-2.6.34-rc3-mmotm-100405-1609/mm/page_alloc.c 2010-04-07 10:10:28.000000000 -0400 @@ -61,6 +61,11 @@ DEFINE_PER_CPU(int, numa_node); EXPORT_PER_CPU_SYMBOL(numa_node); #endif +#ifdef CONFIG_HAVE_MEMORYLESS_NODES +DEFINE_PER_CPU(int, numa_mem); /* Kernel "local memory" node */ +EXPORT_PER_CPU_SYMBOL(numa_mem); +#endif + /* * Array of node states. */ @@ -2752,6 +2757,24 @@ static void build_zonelist_cache(pg_data zlc->z_to_n[z - zonelist->_zonerefs] = zonelist_node_idx(z); } +#ifdef CONFIG_HAVE_MEMORYLESS_NODES +/* + * Return node id of node used for "local" allocations. + * I.e., first node id of first zone in arg node's generic zonelist. + * Used for initializing percpu 'numa_mem', which is used primarily + * for kernel allocations, so use GFP_KERNEL flags to locate zonelist. + */ +int local_memory_node(int node) +{ + struct zone *zone; + + (void)first_zones_zonelist(node_zonelist(node, GFP_KERNEL), + gfp_zone(GFP_KERNEL), + NULL, + &zone); + return zone->node; +} +#endif #else /* CONFIG_NUMA */ @@ -2851,9 +2874,23 @@ static int __build_all_zonelists(void *d * needs the percpu allocator in order to allocate its pagesets * (a chicken-egg dilemma). */ - for_each_possible_cpu(cpu) + for_each_possible_cpu(cpu) { setup_pageset(&per_cpu(boot_pageset, cpu), 0); +#ifdef CONFIG_HAVE_MEMORYLESS_NODES + /* + * We now know the "local memory node" for each node-- + * i.e., the node of the first zone in the generic zonelist. + * Set up numa_mem percpu variable for on-line cpus. During + * boot, only the boot cpu should be on-line; we'll init the + * secondary cpus' numa_mem as they come on-line. During + * node/memory hotplug, we'll fixup all on-line cpus. + */ + if (cpu_online(cpu)) + cpu_to_mem(cpu) = local_memory_node(cpu_to_node(cpu)); +#endif + } + return 0; } Index: linux-2.6.34-rc3-mmotm-100405-1609/include/linux/mmzone.h =================================================================== --- linux-2.6.34-rc3-mmotm-100405-1609.orig/include/linux/mmzone.h 2010-04-07 10:03:46.000000000 -0400 +++ linux-2.6.34-rc3-mmotm-100405-1609/include/linux/mmzone.h 2010-04-07 10:10:28.000000000 -0400 @@ -661,6 +661,12 @@ void memory_present(int nid, unsigned lo static inline void memory_present(int nid, unsigned long start, unsigned long end) {} #endif +#ifdef CONFIG_HAVE_MEMORYLESS_NODES +int local_memory_node(int node_id); +#else +static inline int local_memory_node(int node_id) { return node_id; }; +#endif + #ifdef CONFIG_NEED_NODE_MEMMAP_SIZE unsigned long __init node_memmap_size_bytes(int, unsigned long, unsigned long); #endif Index: linux-2.6.34-rc3-mmotm-100405-1609/include/asm-generic/topology.h =================================================================== --- linux-2.6.34-rc3-mmotm-100405-1609.orig/include/asm-generic/topology.h 2010-04-07 09:49:13.000000000 -0400 +++ linux-2.6.34-rc3-mmotm-100405-1609/include/asm-generic/topology.h 2010-04-07 10:10:28.000000000 -0400 @@ -34,6 +34,9 @@ #ifndef cpu_to_node #define cpu_to_node(cpu) ((void)(cpu),0) #endif +#ifndef cpu_to_mem +#define cpu_to_mem(cpu) (void)(cpu),0) +#endif #ifndef parent_node #define parent_node(node) ((void)(node),0) #endif From mboxrd@z Thu Jan 1 00:00:00 1970 From: Lee Schermerhorn Subject: [PATCH 5/8] numa: ia64: support numa_mem_id() for memoryless nodes Date: Thu, 15 Apr 2010 13:30:24 -0400 Message-ID: <20100415173024.8801.36840.sendpatchset@localhost.localdomain> References: <20100415172950.8801.60358.sendpatchset@localhost.localdomain> Return-path: In-Reply-To: <20100415172950.8801.60358.sendpatchset@localhost.localdomain> Sender: linux-numa-owner@vger.kernel.org List-ID: MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: linux-mm@kvack.org, linux-numa@vger.kernel.org Cc: Tejun Heo , Mel Gorman , Andi@domain.invalid, Kleen@domain.invalid, andi@firstfloor.org, Christoph Lameter , Nick Piggin , David Rientjes , eric.whitney@hp.com, Andrew Morton , KAMEZAWA Hiroyuki Against: 2.6.34-rc3-mmotm-100405-1609 IA64: Support memoryless nodes Enable 'HAVE_MEMORYLESS_NODES' by default when NUMA configured on ia64. Initialize percpu 'numa_mem' variable when starting secondary cpus. Generic initialization will handle the boot cpu. Nothing uses 'numa_mem_id()' yet. Subsequent patch with modify slab to use this. Signed-off-by: Lee Schermerhorn --- New in V2 V3, V4: no change arch/ia64/Kconfig | 4 ++++ arch/ia64/kernel/smpboot.c | 1 + 2 files changed, 5 insertions(+) Index: linux-2.6.34-rc3-mmotm-100405-1609/arch/ia64/Kconfig =================================================================== --- linux-2.6.34-rc3-mmotm-100405-1609.orig/arch/ia64/Kconfig 2010-04-07 10:10:27.000000000 -0400 +++ linux-2.6.34-rc3-mmotm-100405-1609/arch/ia64/Kconfig 2010-04-07 10:10:30.000000000 -0400 @@ -501,6 +501,10 @@ config USE_PERCPU_NUMA_NODE_ID def_bool y depends on NUMA +config HAVE_MEMORYLESS_NODES + def_bool y + depends on NUMA + config ARCH_PROC_KCORE_TEXT def_bool y depends on PROC_KCORE Index: linux-2.6.34-rc3-mmotm-100405-1609/arch/ia64/kernel/smpboot.c =================================================================== --- linux-2.6.34-rc3-mmotm-100405-1609.orig/arch/ia64/kernel/smpboot.c 2010-04-07 10:10:27.000000000 -0400 +++ linux-2.6.34-rc3-mmotm-100405-1609/arch/ia64/kernel/smpboot.c 2010-04-07 10:10:30.000000000 -0400 @@ -394,6 +394,7 @@ smp_callin (void) * numa_node_id() works after this. */ set_numa_node(cpu_to_node_map[cpuid]); + set_numa_mem(local_memory_node(cpu_to_node_map[cpuid])); ipi_call_lock_irq(); spin_lock(&vector_lock); From mboxrd@z Thu Jan 1 00:00:00 1970 From: Lee Schermerhorn Subject: [PATCH 6/8] numa: slab: use numa_mem_id() for slab local memory node Date: Thu, 15 Apr 2010 13:30:30 -0400 Message-ID: <20100415173030.8801.84836.sendpatchset@localhost.localdomain> References: <20100415172950.8801.60358.sendpatchset@localhost.localdomain> Return-path: In-Reply-To: <20100415172950.8801.60358.sendpatchset@localhost.localdomain> Sender: owner-linux-mm@kvack.org List-Id: MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: linux-mm@kvack.org, linux-numa@vger.kernel.org Cc: Tejun Heo , Mel Gorman , Andi@domain.invalid, Kleen@domain.invalid, andi@firstfloor.org, Christoph Lameter , Nick Piggin , David Rientjes , eric.whitney@hp.com, Andrew Morton , KAMEZAWA Hiroyuki Against: 2.6.34-rc3-mmotm-100405-1609 Example usage of generic "numa_mem_id()": The mainline slab code, since ~ 2.6.19, does not handle memoryless nodes well. Specifically, the "fast path"--____cache_alloc()--will never succeed as slab doesn't cache offnode object on the per cpu queues, and for memoryless nodes, all memory will be "off node" relative to numa_node_id(). This adds significant overhead to all kmem cache allocations, incurring a significant regression relative to earlier kernels [from before slab.c was reorganized]. This patch uses the generic topology function "numa_mem_id()" to return the "effective local memory node" for the calling context. This is the first node in the local node's generic fallback zonelist-- the same node that "local" mempolicy-based allocations would use. This lets slab cache these "local" allocations and avoid fallback/refill on every allocation. N.B.: Slab will need to handle node and memory hotplug events that could change the value returned by numa_mem_id() for any given node if recent changes to address memory hotplug don't already address this. E.g., flush all per cpu slab queues before rebuilding the zonelists while the "machine" is held in the stopped state. Performance impact on "hackbench 400 process 200" 2.6.34-rc3-mmotm-100405-1609 no-patch this-patch ia64 no memoryless nodes [avg of 10]: 11.713 11.637 ~0.65 diff ia64 cpus all on memless nodes [10]: 228.259 26.484 ~8.6x speedup The slowdown of the patched kernel from ~12 sec to ~28 seconds when configured with memoryless nodes is the result of all cpus allocating from a single node's mm pagepool. The cache lines of the single node are distributed/interleaved over the memory of the real physical nodes, but the zone lock, list heads, ... of the single node with memory still each live in a single cache line that is accessed from all processors. x86_64 [8x6 AMD] [avg of 40]: 2.883 2.845 Signed-off-by: Lee Schermerhorn --- V4: no change to code. rebased patch and updated test results in description. mm/slab.c | 43 ++++++++++++++++++++++--------------------- 1 files changed, 22 insertions(+), 21 deletions(-) Index: linux-2.6.34-rc3-mmotm-100405-1609/mm/slab.c =================================================================== --- linux-2.6.34-rc3-mmotm-100405-1609.orig/mm/slab.c 2010-04-07 10:04:02.000000000 -0400 +++ linux-2.6.34-rc3-mmotm-100405-1609/mm/slab.c 2010-04-07 10:11:34.000000000 -0400 @@ -844,7 +844,7 @@ static void init_reap_node(int cpu) { int node; - node = next_node(cpu_to_node(cpu), node_online_map); + node = next_node(cpu_to_mem(cpu), node_online_map); if (node == MAX_NUMNODES) node = first_node(node_online_map); @@ -1073,7 +1073,7 @@ static inline int cache_free_alien(struc struct array_cache *alien = NULL; int node; - node = numa_node_id(); + node = numa_mem_id(); /* * Make sure we are not freeing a object from another node to the array @@ -1106,7 +1106,7 @@ static void __cpuinit cpuup_canceled(lon { struct kmem_cache *cachep; struct kmem_list3 *l3 = NULL; - int node = cpu_to_node(cpu); + int node = cpu_to_mem(cpu); const struct cpumask *mask = cpumask_of_node(node); list_for_each_entry(cachep, &cache_chain, next) { @@ -1171,7 +1171,7 @@ static int __cpuinit cpuup_prepare(long { struct kmem_cache *cachep; struct kmem_list3 *l3 = NULL; - int node = cpu_to_node(cpu); + int node = cpu_to_mem(cpu); const int memsize = sizeof(struct kmem_list3); /* @@ -1418,7 +1418,7 @@ void __init kmem_cache_init(void) * 6) Resize the head arrays of the kmalloc caches to their final sizes. */ - node = numa_node_id(); + node = numa_mem_id(); /* 1) create the cache_cache */ INIT_LIST_HEAD(&cache_chain); @@ -2052,7 +2052,7 @@ static int __init_refok setup_cpu_cache( } } } - cachep->nodelists[numa_node_id()]->next_reap = + cachep->nodelists[numa_mem_id()]->next_reap = jiffies + REAPTIMEOUT_LIST3 + ((unsigned long)cachep) % REAPTIMEOUT_LIST3; @@ -2383,7 +2383,7 @@ static void check_spinlock_acquired(stru { #ifdef CONFIG_SMP check_irq_off(); - assert_spin_locked(&cachep->nodelists[numa_node_id()]->list_lock); + assert_spin_locked(&cachep->nodelists[numa_mem_id()]->list_lock); #endif } @@ -2410,7 +2410,7 @@ static void do_drain(void *arg) { struct kmem_cache *cachep = arg; struct array_cache *ac; - int node = numa_node_id(); + int node = numa_mem_id(); check_irq_off(); ac = cpu_cache_get(cachep); @@ -2943,7 +2943,7 @@ static void *cache_alloc_refill(struct k retry: check_irq_off(); - node = numa_node_id(); + node = numa_mem_id(); ac = cpu_cache_get(cachep); batchcount = ac->batchcount; if (!ac->touched && batchcount > BATCHREFILL_LIMIT) { @@ -3147,7 +3147,7 @@ static void *alternate_node_alloc(struct if (in_interrupt() || (flags & __GFP_THISNODE)) return NULL; - nid_alloc = nid_here = numa_node_id(); + nid_alloc = nid_here = numa_mem_id(); if (cpuset_do_slab_mem_spread() && (cachep->flags & SLAB_MEM_SPREAD)) nid_alloc = cpuset_mem_spread_node(); else if (current->mempolicy) @@ -3209,7 +3209,7 @@ retry: if (local_flags & __GFP_WAIT) local_irq_enable(); kmem_flagcheck(cache, flags); - obj = kmem_getpages(cache, local_flags, numa_node_id()); + obj = kmem_getpages(cache, local_flags, numa_mem_id()); if (local_flags & __GFP_WAIT) local_irq_disable(); if (obj) { @@ -3316,6 +3316,7 @@ __cache_alloc_node(struct kmem_cache *ca { unsigned long save_flags; void *ptr; + int slab_node = numa_mem_id(); flags &= gfp_allowed_mask; @@ -3328,7 +3329,7 @@ __cache_alloc_node(struct kmem_cache *ca local_irq_save(save_flags); if (nodeid == -1) - nodeid = numa_node_id(); + nodeid = slab_node; if (unlikely(!cachep->nodelists[nodeid])) { /* Node not bootstrapped yet */ @@ -3336,7 +3337,7 @@ __cache_alloc_node(struct kmem_cache *ca goto out; } - if (nodeid == numa_node_id()) { + if (nodeid == slab_node) { /* * Use the locally cached objects if possible. * However ____cache_alloc does not allow fallback @@ -3380,8 +3381,8 @@ __do_cache_alloc(struct kmem_cache *cach * We may just have run out of memory on the local node. * ____cache_alloc_node() knows how to locate memory on other nodes */ - if (!objp) - objp = ____cache_alloc_node(cache, flags, numa_node_id()); + if (!objp) + objp = ____cache_alloc_node(cache, flags, numa_mem_id()); out: return objp; @@ -3478,7 +3479,7 @@ static void cache_flusharray(struct kmem { int batchcount; struct kmem_list3 *l3; - int node = numa_node_id(); + int node = numa_mem_id(); batchcount = ac->batchcount; #if DEBUG @@ -3923,7 +3924,7 @@ static int do_tune_cpucache(struct kmem_ return -ENOMEM; for_each_online_cpu(i) { - new->new[i] = alloc_arraycache(cpu_to_node(i), limit, + new->new[i] = alloc_arraycache(cpu_to_mem(i), limit, batchcount, gfp); if (!new->new[i]) { for (i--; i >= 0; i--) @@ -3945,9 +3946,9 @@ static int do_tune_cpucache(struct kmem_ struct array_cache *ccold = new->new[i]; if (!ccold) continue; - spin_lock_irq(&cachep->nodelists[cpu_to_node(i)]->list_lock); - free_block(cachep, ccold->entry, ccold->avail, cpu_to_node(i)); - spin_unlock_irq(&cachep->nodelists[cpu_to_node(i)]->list_lock); + spin_lock_irq(&cachep->nodelists[cpu_to_mem(i)]->list_lock); + free_block(cachep, ccold->entry, ccold->avail, cpu_to_mem(i)); + spin_unlock_irq(&cachep->nodelists[cpu_to_mem(i)]->list_lock); kfree(ccold); } kfree(new); @@ -4053,7 +4054,7 @@ static void cache_reap(struct work_struc { struct kmem_cache *searchp; struct kmem_list3 *l3; - int node = numa_node_id(); + int node = numa_mem_id(); struct delayed_work *work = to_delayed_work(w); if (!mutex_trylock(&cache_chain_mutex)) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Lee Schermerhorn Subject: [PATCH 7/8] numa: in-kernel profiling: use cpu_to_mem() for per cpu allocations Date: Thu, 15 Apr 2010 13:30:36 -0400 Message-ID: <20100415173036.8801.29768.sendpatchset@localhost.localdomain> References: <20100415172950.8801.60358.sendpatchset@localhost.localdomain> Return-path: In-Reply-To: <20100415172950.8801.60358.sendpatchset@localhost.localdomain> Sender: owner-linux-mm@kvack.org List-Id: MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: linux-mm@kvack.org, linux-numa@vger.kernel.org Cc: Tejun Heo , Mel Gorman , Andi@domain.invalid, Kleen@domain.invalid, andi@firstfloor.org, Christoph Lameter , Nick Piggin , David Rientjes , eric.whitney@hp.com, Andrew Morton , KAMEZAWA Hiroyuki Against: 2.6.34-rc3-mmotm-100405-1609 Patch: in-kernel profiling -- support memoryless nodes. In kernel profiling requires that we be able to allocate "local" memory for each cpu. Use "cpu_to_mem()" instead of "cpu_to_node()" to support memoryless nodes. Depends on the "numa_mem_id()" patch. Signed-off-by: Lee Schermerhorn --- New in V3. V4: No change kernel/profile.c | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) Index: linux-2.6.34-rc3-mmotm-100405-1609/kernel/profile.c =================================================================== --- linux-2.6.34-rc3-mmotm-100405-1609.orig/kernel/profile.c 2010-04-07 10:04:02.000000000 -0400 +++ linux-2.6.34-rc3-mmotm-100405-1609/kernel/profile.c 2010-04-07 10:11:38.000000000 -0400 @@ -363,7 +363,7 @@ static int __cpuinit profile_cpu_callbac switch (action) { case CPU_UP_PREPARE: case CPU_UP_PREPARE_FROZEN: - node = cpu_to_node(cpu); + node = cpu_to_mem(cpu); per_cpu(cpu_profile_flip, cpu) = 0; if (!per_cpu(cpu_profile_hits, cpu)[1]) { page = alloc_pages_exact_node(node, @@ -565,7 +565,7 @@ static int create_hash_tables(void) int cpu; for_each_online_cpu(cpu) { - int node = cpu_to_node(cpu); + int node = cpu_to_mem(cpu); struct page *page; page = alloc_pages_exact_node(node, -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Lee Schermerhorn Subject: [PATCH 8/8] numa: update Documentation/vm/numa, add memoryless node info Date: Thu, 15 Apr 2010 13:30:42 -0400 Message-ID: <20100415173042.8801.17049.sendpatchset@localhost.localdomain> References: <20100415172950.8801.60358.sendpatchset@localhost.localdomain> Return-path: In-Reply-To: <20100415172950.8801.60358.sendpatchset@localhost.localdomain> Sender: linux-numa-owner@vger.kernel.org List-ID: MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: linux-mm@kvack.org, linux-numa@vger.kernel.org Cc: Tejun Heo , Mel Gorman , Andi@domain.invalid, Kleen@domain.invalid, andi@firstfloor.org, Christoph Lameter , Nick Piggin , David Rientjes , eric.whitney@hp.com, Andrew Morton , KAMEZAWA Hiroyuki Against: 2.6.34-rc3-mmotm-100405-1609 Kamezawa Hiroyuki requested documentation for the numa_mem_id() and slab related changes. He suggested Documentation/vm/numa for this documentation. Looking at this file, it seems to me to be hopelessly out of date relative to current Linux NUMA support. At the risk of going down a rathole, I have made an attempt to rewrite the doc at a slightly higher level [I think] and provide pointers to other in-tree documents and out-of-tree man pages that cover the details. Let the games begin. Signed-off-by: Lee Schermerhorn --- New in V4. Documentation/vm/numa | 184 +++++++++++++++++++++++++++++++++++++++----------- 1 files changed, 146 insertions(+), 38 deletions(-) Index: linux-2.6.34-rc3-mmotm-100405-1609/Documentation/vm/numa =================================================================== --- linux-2.6.34-rc3-mmotm-100405-1609.orig/Documentation/vm/numa 2010-04-07 09:49:13.000000000 -0400 +++ linux-2.6.34-rc3-mmotm-100405-1609/Documentation/vm/numa 2010-04-07 10:11:40.000000000 -0400 @@ -1,41 +1,149 @@ Started Nov 1999 by Kanoj Sarcar -The intent of this file is to have an uptodate, running commentary -from different people about NUMA specific code in the Linux vm. +What is NUMA? -What is NUMA? It is an architecture where the memory access times -for different regions of memory from a given processor varies -according to the "distance" of the memory region from the processor. -Each region of memory to which access times are the same from any -cpu, is called a node. On such architectures, it is beneficial if -the kernel tries to minimize inter node communications. Schemes -for this range from kernel text and read-only data replication -across nodes, and trying to house all the data structures that -key components of the kernel need on memory on that node. - -Currently, all the numa support is to provide efficient handling -of widely discontiguous physical memory, so architectures which -are not NUMA but can have huge holes in the physical address space -can use the same code. All this code is bracketed by CONFIG_DISCONTIGMEM. - -The initial port includes NUMAizing the bootmem allocator code by -encapsulating all the pieces of information into a bootmem_data_t -structure. Node specific calls have been added to the allocator. -In theory, any platform which uses the bootmem allocator should -be able to put the bootmem and mem_map data structures anywhere -it deems best. - -Each node's page allocation data structures have also been encapsulated -into a pg_data_t. The bootmem_data_t is just one part of this. To -make the code look uniform between NUMA and regular UMA platforms, -UMA platforms have a statically allocated pg_data_t too (contig_page_data). -For the sake of uniformity, the function num_online_nodes() is also defined -for all platforms. As we run benchmarks, we might decide to NUMAize -more variables like low_on_memory, nr_free_pages etc into the pg_data_t. - -The NUMA aware page allocation code currently tries to allocate pages -from different nodes in a round robin manner. This will be changed to -do concentratic circle search, starting from current node, once the -NUMA port achieves more maturity. The call alloc_pages_node has been -added, so that drivers can make the call and not worry about whether -it is running on a NUMA or UMA platform. +This question can be answered from a couple of perspectives: the +hardware view and the Linux software view. + +From the hardware perspective, a NUMA system is a computer platform that +comprises multiple components or assemblies each of which may contain 0 +or more cpus, local memory, and/or IO buses. For brevity and to +disambiguate the hardware view of these physical components/assemblies +from the software abstraction thereof, we'll call the components/assemblies +'cells' in this document. + +Each of the 'cells' may be viewed as an SMP [symmetric multi-processor] subset +of the system--although some components necessary for a stand-alone SMP system +may not be populated on any given cell. The cells of the NUMA system are +connected together with some sort of system interconnect--e.g., a crossbar or +point-to-point link are common types of NUMA system interconnects. Both of +these types of interconnects can be aggregated to create NUMA platforms with +cells at multiple distances from other cells. + +For Linux, the NUMA platforms of interest are primarily what is known as Cache +Coherent NUMA or CCNuma systems. With CCNUMA systems, all memory is visible +to and accessible from any cpu attached to any cell and cache coherency +is handled in hardware by the processor caches and/or the system interconnect. + +Memory access time and effective memory bandwidth varies depending on how far +away the cell containing the cpu or io bus making the memory access is from the +cell containing the target memory. For example, access to memory by cpus +attached to the same cell will experience faster access times and higher +bandwidths than accesses to memory on other, remote cells. NUMA platforms +can have cells at multiple remote distances from any given cell. + +Platform vendors don't build NUMA systems just to make software developers' +lives interesting. Rather, this architecture is a means to provide scalable +memory bandwidth. However, to achieve scalable memory bandwidth, system and +application software must arrange for a large majority of the memory references +[cache misses] to be to "local" memory--memory on the same cell, if any--or +to the closest cell with memory. + +This leads to the Linux software view of a NUMA system: + +Linux divides the system's hardware resources into multiple software +abstractions called "nodes". Linux maps the nodes onto the physical cells +of the hardware platform, abstracting away some of the details for some +architectures. As with physical cells, software nodes may contain 0 or more +cpus, memory and/or IO buses. And, again, memory access times to memory on +"closer" nodes [nodes that map to closer cells] will generally experience +faster access times and higher effective bandwidth than accesses to more +remote cells. + +For some architectures, such as x86, Linux will "hide" any node representing a +physical cell that has no memory attached, and reassign any cpus attached to +that cell to a node representing a cell that does have memory. Thus, on +these architectures, one cannot assume that all cpus that Linux associates with +a given node will see the same local memory access times and bandwidth. + +In addition, for some architectures, again x86 is an example, Linux supports +the emulation of additional nodes. For NUMA emulation, linux will carve up +the existing nodes--or the system memory for non-NUMA platforms--into multiple +nodes. Each emulated node will manage a fraction of the underlying cells' +physical memory. Numa emluation is useful for testing NUMA kernel and +application features on non-NUMA platforms, and as a sort of memory resource +management mechanism when used together with cpusets. +[See Documentation/cgroups/cpusets.txt] + +For each node with memory, Linux constructs an independent memory management +subsystem, complete with its own free page lists, in-use page lists, usage +statistics and locks to mediate access. In addition, Linux constructs for +each memory zone [one or more of DMA, DMA32, NORMAL, HIGH_MEMORY, MOVABLE], +an ordered "zonelist". A zonelist specifies the zones/nodes to visit when a +selected zone/node cannot satisfy the allocation request. This situation, +when a zone's has no available memory to satisfy a request, is called +'overflow" or "fallback". + +Because some nodes contain multiple zones containing different types of +memory, Linux must decide whether to order the zonelists such that allocations +fall back to the same zone type on a different node, or to a different zone +type on the same node. This is an important consideration because some zones, +such as DMA or DMA32, represent relatively scarce resources. Linux chooses +a default zonelist order based on the sizes of the various zone types relative +to the total memory of the node and the total memory of the system. The +default zonelist order may be overridden using the numa_zonelist_order kernel +boot parameter or sysctl. [See Documentation/kernel-parameters.txt and +Documentation/sysctl/vm.txt] + +By default, Linux will attempt to satisfy memory allocation requests from the +node to which the cpu that executes the request is assigned. Specifically, +Linux will attempt to allocate from the first node in the appropriate zonelist +for the node where the request originates. This is called "local allocation." +If the "local" node cannot satisfy the request, the kernel will examine other +nodes' zones in the selected zonelist looking for the first zone in the list +that can satisfy the request. + +Local allocation will tend to keep subsequent access to the allocated memory +"local" to the underlying physical resources and off the system interconnect-- +as long as the task on whose behalf the kernel allocated some memory does not +later migrate away from that memory. The Linux scheduler is aware of the +NUMA topology of the platform--embodied in the "scheduling domains" data +structures [See Documentation/scheduler/sched-domains.txt]--and the scheduler +attempts to minimize task migration to distant scheduling domains. However, +the scheduler does not take a task's NUMA footprint into account directly. +Thus, under sufficient imbalance, tasks can migrate between nodes, remote +from their initial node and kernel data structures. + +System administrators and application designers can restrict a tasks migration +to improve NUMA locality using various cpu affinity command line interfaces, +such as taskset(1) and numactl(1), and program interfaces such as +sched_setaffinity(2). Further, one can modify the kernel's default local +allocation behavior using Linux NUMA memory policy. +[See Documentation/vm/numa_memory_policy.] + +System administrators can restrict the cpus and nodes' memories that a non- +privileged user can specify in the scheduling or NUMA commands and functions +using control groups and cpusets. [See Documentation/cgroups/cpusets.txt] + +On architectures that do not hide memoryless nodes, Linux will include only +zones [nodes] with memory in the zonelists. This means that for a memoryless +node the "local memory node"--the node of the first zone in cpu's node's +zonelist--will not be the node itself. Rather, it will be the node that the +kernel selected as the nearest node with memory when it built the zonelists. +So, default, local allocations will succeed with the kernel supplying the +closest available memory. This is a consequence of the same mechanism that +allows such allocations to fallback to other nearby nodes when a node that +does contain memory overflows. + +Some kernel allocations do not want or cannot tolerate this allocation fallback +behavior. Rather they want to be sure they get memory from the specified node +or get notified that the node has no free memory. This is usually the case when +a subsystem allocates per cpu memory resources, for example. + +A typical model for making such an allocation is to obtain the node id of the +node to which the "current cpu" is attached using one of the kernel's +numa_node_id() or cpu_to_node() functions and then request memory from only +the node id returned. When such an allocation fails, the requesting subsystem +may revert to its own fallback path. The slab kernel memory allocator is an +example of this. Or, the subsystem may chose to disable or not to enable +itself on allocation failure. The kernel profiling subsystem is an example of +this. + +If the architecture supports [does not hide] memoryless nodes, then cpus +attached to memoryless nodes would always incur the fallback path overhead +or some subsystems would fail to initialize if they attempted to allocated +memory exclusively from the a node without memory. To support such +architectures transparently, kernel subsystems can use the numa_mem_id() +or cpu_to_mem() function to locate the "local memory node" for the calling or +specified cpu. Again, this is the same node from which default, local page +allocations will be attempted. From mboxrd@z Thu Jan 1 00:00:00 1970 From: Randy Dunlap Subject: Re: [PATCH 8/8] numa: update Documentation/vm/numa, add memoryless node info Date: Thu, 15 Apr 2010 11:00:04 -0700 Message-ID: <20100415110004.917dae3c.randy.dunlap@oracle.com> References: <20100415172950.8801.60358.sendpatchset@localhost.localdomain> <20100415173042.8801.17049.sendpatchset@localhost.localdomain> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <20100415173042.8801.17049.sendpatchset@localhost.localdomain> Sender: owner-linux-mm@kvack.org List-Id: Content-Type: text/plain; charset="us-ascii" To: Lee Schermerhorn Cc: linux-mm@kvack.org, linux-numa@vger.kernel.org, Tejun Heo , Mel Gorman , Andi@domain.invalid, Kleen@domain.invalid, andi@firstfloor.org, Christoph Lameter , Nick Piggin , David Rientjes , eric.whitney@hp.com, Andrew Morton , KAMEZAWA Hiroyuki On Thu, 15 Apr 2010 13:30:42 -0400 Lee Schermerhorn wrote: > Against: 2.6.34-rc3-mmotm-100405-1609 > > Kamezawa Hiroyuki requested documentation for the numa_mem_id() > and slab related changes. He suggested Documentation/vm/numa for > this documentation. Looking at this file, it seems to me to be > hopelessly out of date relative to current Linux NUMA support. > At the risk of going down a rathole, I have made an attempt to > rewrite the doc at a slightly higher level [I think] and provide > pointers to other in-tree documents and out-of-tree man pages that > cover the details. > > Let the games begin. OK. > Signed-off-by: Lee Schermerhorn > > --- > > New in V4. > > Documentation/vm/numa | 184 +++++++++++++++++++++++++++++++++++++++----------- > 1 files changed, 146 insertions(+), 38 deletions(-) > > Index: linux-2.6.34-rc3-mmotm-100405-1609/Documentation/vm/numa > =================================================================== > --- linux-2.6.34-rc3-mmotm-100405-1609.orig/Documentation/vm/numa 2010-04-07 09:49:13.000000000 -0400 > +++ linux-2.6.34-rc3-mmotm-100405-1609/Documentation/vm/numa 2010-04-07 10:11:40.000000000 -0400 > @@ -1,41 +1,149 @@ > Started Nov 1999 by Kanoj Sarcar > > -The intent of this file is to have an uptodate, running commentary > -from different people about NUMA specific code in the Linux vm. > +What is NUMA? > ... > +This question can be answered from a couple of perspectives: the > +hardware view and the Linux software view. > + > +From the hardware perspective, a NUMA system is a computer platform that > +comprises multiple components or assemblies each of which may contain 0 > +or more cpus, local memory, and/or IO buses. For brevity and to > +disambiguate the hardware view of these physical components/assemblies > +from the software abstraction thereof, we'll call the components/assemblies > +'cells' in this document. > + > +Each of the 'cells' may be viewed as an SMP [symmetric multi-processor] subset > +of the system--although some components necessary for a stand-alone SMP system > +may not be populated on any given cell. The cells of the NUMA system are > +connected together with some sort of system interconnect--e.g., a crossbar or > +point-to-point link are common types of NUMA system interconnects. Both of > +these types of interconnects can be aggregated to create NUMA platforms with > +cells at multiple distances from other cells. > + > +For Linux, the NUMA platforms of interest are primarily what is known as Cache > +Coherent NUMA or CCNuma systems. With CCNUMA systems, all memory is visible > +to and accessible from any cpu attached to any cell and cache coherency > +is handled in hardware by the processor caches and/or the system interconnect. > + CCNuma or CCNUMA ? Please spell "cpu" as "CPU" (or plural: CPUs). and "io" as "IO". > +Memory access time and effective memory bandwidth varies depending on how far > +away the cell containing the cpu or io bus making the memory access is from the > +cell containing the target memory. For example, access to memory by cpus > +attached to the same cell will experience faster access times and higher > +bandwidths than accesses to memory on other, remote cells. NUMA platforms > +can have cells at multiple remote distances from any given cell. > + > +Platform vendors don't build NUMA systems just to make software developers' > +lives interesting. Rather, this architecture is a means to provide scalable > +memory bandwidth. However, to achieve scalable memory bandwidth, system and > +application software must arrange for a large majority of the memory references > +[cache misses] to be to "local" memory--memory on the same cell, if any--or > +to the closest cell with memory. > + > +This leads to the Linux software view of a NUMA system: > + > +Linux divides the system's hardware resources into multiple software > +abstractions called "nodes". Linux maps the nodes onto the physical cells > +of the hardware platform, abstracting away some of the details for some > +architectures. As with physical cells, software nodes may contain 0 or more > +cpus, memory and/or IO buses. And, again, memory access times to memory on > +"closer" nodes [nodes that map to closer cells] will generally experience > +faster access times and higher effective bandwidth than accesses to more > +remote cells. > + > +For some architectures, such as x86, Linux will "hide" any node representing a > +physical cell that has no memory attached, and reassign any cpus attached to > +that cell to a node representing a cell that does have memory. Thus, on > +these architectures, one cannot assume that all cpus that Linux associates with > +a given node will see the same local memory access times and bandwidth. > + > +In addition, for some architectures, again x86 is an example, Linux supports > +the emulation of additional nodes. For NUMA emulation, linux will carve up > +the existing nodes--or the system memory for non-NUMA platforms--into multiple > +nodes. Each emulated node will manage a fraction of the underlying cells' > +physical memory. Numa emluation is useful for testing NUMA kernel and NUMA > +application features on non-NUMA platforms, and as a sort of memory resource > +management mechanism when used together with cpusets. > +[See Documentation/cgroups/cpusets.txt] > + > +For each node with memory, Linux constructs an independent memory management > +subsystem, complete with its own free page lists, in-use page lists, usage > +statistics and locks to mediate access. In addition, Linux constructs for > +each memory zone [one or more of DMA, DMA32, NORMAL, HIGH_MEMORY, MOVABLE], > +an ordered "zonelist". A zonelist specifies the zones/nodes to visit when a > +selected zone/node cannot satisfy the allocation request. This situation, > +when a zone's has no available memory to satisfy a request, is called zone > +'overflow" or "fallback". "overflow" > + > +Because some nodes contain multiple zones containing different types of > +memory, Linux must decide whether to order the zonelists such that allocations > +fall back to the same zone type on a different node, or to a different zone > +type on the same node. This is an important consideration because some zones, > +such as DMA or DMA32, represent relatively scarce resources. Linux chooses > +a default zonelist order based on the sizes of the various zone types relative > +to the total memory of the node and the total memory of the system. The > +default zonelist order may be overridden using the numa_zonelist_order kernel > +boot parameter or sysctl. [See Documentation/kernel-parameters.txt and > +Documentation/sysctl/vm.txt] > + > +By default, Linux will attempt to satisfy memory allocation requests from the > +node to which the cpu that executes the request is assigned. Specifically, > +Linux will attempt to allocate from the first node in the appropriate zonelist > +for the node where the request originates. This is called "local allocation." > +If the "local" node cannot satisfy the request, the kernel will examine other > +nodes' zones in the selected zonelist looking for the first zone in the list > +that can satisfy the request. > + > +Local allocation will tend to keep subsequent access to the allocated memory > +"local" to the underlying physical resources and off the system interconnect-- > +as long as the task on whose behalf the kernel allocated some memory does not > +later migrate away from that memory. The Linux scheduler is aware of the > +NUMA topology of the platform--embodied in the "scheduling domains" data > +structures [See Documentation/scheduler/sched-domains.txt]--and the scheduler see > +attempts to minimize task migration to distant scheduling domains. However, > +the scheduler does not take a task's NUMA footprint into account directly. > +Thus, under sufficient imbalance, tasks can migrate between nodes, remote > +from their initial node and kernel data structures. > + > +System administrators and application designers can restrict a tasks migration task's > +to improve NUMA locality using various cpu affinity command line interfaces, > +such as taskset(1) and numactl(1), and program interfaces such as > +sched_setaffinity(2). Further, one can modify the kernel's default local > +allocation behavior using Linux NUMA memory policy. > +[See Documentation/vm/numa_memory_policy.] > + > +System administrators can restrict the cpus and nodes' memories that a non- > +privileged user can specify in the scheduling or NUMA commands and functions > +using control groups and cpusets. [See Documentation/cgroups/cpusets.txt] > + > +On architectures that do not hide memoryless nodes, Linux will include only > +zones [nodes] with memory in the zonelists. This means that for a memoryless > +node the "local memory node"--the node of the first zone in cpu's node's > +zonelist--will not be the node itself. Rather, it will be the node that the > +kernel selected as the nearest node with memory when it built the zonelists. > +So, default, local allocations will succeed with the kernel supplying the > +closest available memory. This is a consequence of the same mechanism that > +allows such allocations to fallback to other nearby nodes when a node that > +does contain memory overflows. > + > +Some kernel allocations do not want or cannot tolerate this allocation fallback > +behavior. Rather they want to be sure they get memory from the specified node > +or get notified that the node has no free memory. This is usually the case when > +a subsystem allocates per cpu memory resources, for example. > + > +A typical model for making such an allocation is to obtain the node id of the > +node to which the "current cpu" is attached using one of the kernel's > +numa_node_id() or cpu_to_node() functions and then request memory from only > +the node id returned. When such an allocation fails, the requesting subsystem > +may revert to its own fallback path. The slab kernel memory allocator is an > +example of this. Or, the subsystem may chose to disable or not to enable choose > +itself on allocation failure. The kernel profiling subsystem is an example of > +this. > + > +If the architecture supports [does not hide] memoryless nodes, then cpus > +attached to memoryless nodes would always incur the fallback path overhead > +or some subsystems would fail to initialize if they attempted to allocated > +memory exclusively from the a node without memory. To support such > +architectures transparently, kernel subsystems can use the numa_mem_id() > +or cpu_to_mem() function to locate the "local memory node" for the calling or > +specified cpu. Again, this is the same node from which default, local page > +allocations will be attempted. > > -- Nice update, thanks. --- ~Randy -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: KAMEZAWA Hiroyuki Subject: Re: [PATCH 8/8] numa: update Documentation/vm/numa, add memoryless node info Date: Fri, 16 Apr 2010 09:50:45 +0900 Message-ID: <20100416095045.46ab6552.kamezawa.hiroyu@jp.fujitsu.com> References: <20100415172950.8801.60358.sendpatchset@localhost.localdomain> <20100415173042.8801.17049.sendpatchset@localhost.localdomain> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <20100415173042.8801.17049.sendpatchset@localhost.localdomain> Sender: linux-numa-owner@vger.kernel.org List-ID: Content-Type: text/plain; charset="us-ascii" To: Lee Schermerhorn Cc: linux-mm@kvack.org, linux-numa@vger.kernel.org, Tejun Heo , Mel Gorman , andi@firstfloor.org, Christoph Lameter , Nick Piggin , David Rientjes , eric.whitney@hp.com, Andrew Morton On Thu, 15 Apr 2010 13:30:42 -0400 Lee Schermerhorn wrote: > Against: 2.6.34-rc3-mmotm-100405-1609 > > Kamezawa Hiroyuki requested documentation for the numa_mem_id() > and slab related changes. He suggested Documentation/vm/numa for > this documentation. Looking at this file, it seems to me to be > hopelessly out of date relative to current Linux NUMA support. > At the risk of going down a rathole, I have made an attempt to > rewrite the doc at a slightly higher level [I think] and provide > pointers to other in-tree documents and out-of-tree man pages that > cover the details. > > Let the games begin. > > Signed-off-by: Lee Schermerhorn > Thank you, seems very nice and covers almost all range we have to explain to new comers. My eye can't check details enough but...;) Reviewed-by: KAMEZAWA Hiroyuki I think this patch itself is very good. Being more greedy... Hmm, from user's view, I feel quick guide of /sys/devices/system/node/ and /sys/devices/system/node/node0/numastat can be added somewhere. (Documentation/numastat.txt is not under /vm :( ) And one more important? thing. [kamezawa@firextal Documentation]$ cat /sys/bus/pci/devices/0000\:00\:01.0/numa_node -1 PCI device (and other??) has numa_node_id in it, if it has locality information. I hear some guy had to be aware locality of NIC to do high-throuput network transaction. Then, "how to get device's locality via sysfs" is worth to be written. And mentioning what "nid = -1" means may help new comer. Thanks, -Kame From mboxrd@z Thu Jan 1 00:00:00 1970 From: Christoph Lameter Subject: Re: [PATCH 1/8] numa: add generic percpu var numa_node_id() implementation Date: Fri, 16 Apr 2010 11:43:14 -0500 (CDT) Message-ID: References: <20100415172950.8801.60358.sendpatchset@localhost.localdomain> <20100415172956.8801.18133.sendpatchset@localhost.localdomain> Mime-Version: 1.0 Return-path: In-Reply-To: <20100415172956.8801.18133.sendpatchset@localhost.localdomain> Sender: owner-linux-mm@kvack.org List-Id: Content-Type: TEXT/PLAIN; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Lee Schermerhorn Cc: linux-mm@kvack.org, linux-numa@vger.kernel.org, Tejun Heo , Mel Gorman , Andi@domain.invalid, Kleen@domain.invalid, andi@firstfloor.org, Nick Piggin , David Rientjes , eric.whitney@hp.com, Andrew Morton , KAMEZAWA Hiroyuki Reviewed-by: Christoph Lameter -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Christoph Lameter Subject: Re: [PATCH 2/8] numa: x86_64: use generic percpu var numa_node_id() implementation Date: Fri, 16 Apr 2010 11:46:51 -0500 (CDT) Message-ID: References: <20100415172950.8801.60358.sendpatchset@localhost.localdomain> <20100415173003.8801.48519.sendpatchset@localhost.localdomain> Mime-Version: 1.0 Return-path: In-Reply-To: <20100415173003.8801.48519.sendpatchset@localhost.localdomain> Sender: owner-linux-mm@kvack.org List-Id: Content-Type: TEXT/PLAIN; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Lee Schermerhorn Cc: linux-mm@kvack.org, linux-numa@vger.kernel.org, Tejun Heo , Mel Gorman , Andi@domain.invalid, Kleen@domain.invalid, andi@firstfloor.org, Nick Piggin , David Rientjes , eric.whitney@hp.com, Andrew Morton , KAMEZAWA Hiroyuki On Thu, 15 Apr 2010, Lee Schermerhorn wrote: > x86 arch specific changes to use generic numa_node_id() based on > generic percpu variable infrastructure. Back out x86's custom > version of numa_node_id() > > Signed-off-by: Lee Schermerhorn > [Christoph's signoff here?] Hmmm. Its mostly your work now. Maybe Reviewed-by will be ok? > @@ -809,7 +806,7 @@ void __cpuinit numa_set_node(int cpu, in > per_cpu(x86_cpu_to_node_map, cpu) = node; > > if (node != NUMA_NO_NODE) > - per_cpu(node_number, cpu) = node; > + per_cpu(numa_node, cpu) = node; > } Maybe provide a generic function to set the node for cpu X? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Tejun Heo Subject: Re: [PATCH 5/8] numa: ia64: support numa_mem_id() for memoryless nodes Date: Sun, 18 Apr 2010 12:14:42 +0900 Message-ID: <4BCA7922.4070900@kernel.org> References: <20100415172950.8801.60358.sendpatchset@localhost.localdomain> <20100415173024.8801.36840.sendpatchset@localhost.localdomain> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <20100415173024.8801.36840.sendpatchset@localhost.localdomain> Sender: owner-linux-mm@kvack.org List-Id: Content-Type: text/plain; charset="us-ascii" To: Lee Schermerhorn Cc: linux-mm@kvack.org, linux-numa@vger.kernel.org, Mel Gorman , Andi@domain.invalid, Kleen@domain.invalid, andi@firstfloor.org, Christoph Lameter , Nick Piggin , David Rientjes , eric.whitney@hp.com, Andrew Morton , KAMEZAWA Hiroyuki Hello, On 04/16/2010 02:30 AM, Lee Schermerhorn wrote: > Against: 2.6.34-rc3-mmotm-100405-1609 > > IA64: Support memoryless nodes > > Enable 'HAVE_MEMORYLESS_NODES' by default when NUMA configured ^is > on ia64. Initialize percpu 'numa_mem' variable when starting > secondary cpus. Generic initialization will handle the boot > cpu. > > Nothing uses 'numa_mem_id()' yet. Subsequent patch with modify will > slab to use this. Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Tejun Heo Subject: Re: [PATCH 2/8] numa: x86_64: use generic percpu var numa_node_id() implementation Date: Sun, 18 Apr 2010 11:56:24 +0900 Message-ID: <4BCA74D8.3030503@kernel.org> References: <20100415172950.8801.60358.sendpatchset@localhost.localdomain> <20100415173003.8801.48519.sendpatchset@localhost.localdomain> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: Sender: owner-linux-mm@kvack.org List-Id: Content-Type: text/plain; charset="us-ascii" To: Christoph Lameter Cc: Lee Schermerhorn , linux-mm@kvack.org, linux-numa@vger.kernel.org, Mel Gorman , Andi@domain.invalid, Kleen@domain.invalid, andi@firstfloor.org, Nick Piggin , David Rientjes , eric.whitney@hp.com, Andrew Morton , KAMEZAWA Hiroyuki On 04/17/2010 01:46 AM, Christoph Lameter wrote: > Maybe provide a generic function to set the node for cpu X? Yeap, seconded. Also, why not use numa_node_id() in common.c::cpu_init()? Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Tejun Heo Subject: Re: [PATCH 0/8] Numa: Use Generic Per-cpu Variables for numa_*_id() Date: Sun, 18 Apr 2010 12:19:02 +0900 Message-ID: <4BCA7A26.9040208@kernel.org> References: <20100415172950.8801.60358.sendpatchset@localhost.localdomain> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <20100415172950.8801.60358.sendpatchset@localhost.localdomain> Sender: owner-linux-mm@kvack.org List-Id: Content-Type: text/plain; charset="us-ascii" To: Lee Schermerhorn Cc: linux-mm@kvack.org, linux-numa@vger.kernel.org, Mel Gorman , Andi@domain.invalid, Kleen@domain.invalid, andi@firstfloor.org, Christoph Lameter , Nick Piggin , David Rientjes , eric.whitney@hp.com, Andrew Morton , KAMEZAWA Hiroyuki On 04/16/2010 02:29 AM, Lee Schermerhorn wrote: > Use Generic Per cpu infrastructure for numa_*_id() V4 > > Series Against: 2.6.34-rc3-mmotm-100405-1609 Other than the minor nitpicks, the patchset looks great to me. Through which tree should this be routed? If no one else is gonna take it, I can route it through percpu after patchset refresh. Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Tejun Heo Subject: Re: [PATCH 4/8] numa: Introduce numa_mem_id()- effective local memory node id Date: Sun, 18 Apr 2010 12:13:02 +0900 Message-ID: <4BCA78BE.9000904@kernel.org> References: <20100415172950.8801.60358.sendpatchset@localhost.localdomain> <20100415173016.8801.34970.sendpatchset@localhost.localdomain> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <20100415173016.8801.34970.sendpatchset@localhost.localdomain> Sender: owner-linux-mm@kvack.org List-Id: Content-Type: text/plain; charset="us-ascii" To: Lee Schermerhorn Cc: linux-mm@kvack.org, linux-numa@vger.kernel.org, Mel Gorman , Andi@domain.invalid, Kleen@domain.invalid, andi@firstfloor.org, Christoph Lameter , Nick Piggin , David Rientjes , eric.whitney@hp.com, Andrew Morton , KAMEZAWA Hiroyuki On 04/16/2010 02:30 AM, Lee Schermerhorn wrote: > +#ifdef CONFIG_HAVE_MEMORYLESS_NODES > + > +DECLARE_PER_CPU(int, numa_mem); > + > +#ifndef set_numa_mem > +#define set_numa_mem(__node) percpu_write(numa_mem, __node) > +#endif > + > +#else /* !CONFIG_HAVE_MEMORYLESS_NODES */ > + > +#define numa_mem numa_node Please make it a macro which takes arguments or an inline function. Name substitutions like this can easily lead to pretty strange problems when they end up substituting local variable names. > +static inline void set_numa_mem(int node) {} and maybe it's a good idea to make the above one emit warning if the given node id doesn't match the cpu's numa node id? Also, in general, setting numa id (cpu or mem) isn't a hot path and it would be better to take both cpu and the node id arguments. ie, set_numa_mem(unsigned int cpu, int node). > +#endif /* [!]CONFIG_HAVE_MEMORYLESS_NODES */ > + > +#ifndef numa_mem_id > +/* Returns the number of the nearest Node with memory */ > +#define numa_mem_id() __this_cpu_read(numa_mem) > +#endif > + > +#ifndef cpu_to_mem > +#define cpu_to_mem(__cpu) per_cpu(numa_mem, (__cpu)) > +#endif Isn't cpu_to_mem() too generic? Maybe it's a good idea to put 'numa' or 'node' in the name? > +#ifdef CONFIG_HAVE_MEMORYLESS_NODES > + /* > + * We now know the "local memory node" for each node-- > + * i.e., the node of the first zone in the generic zonelist. > + * Set up numa_mem percpu variable for on-line cpus. During > + * boot, only the boot cpu should be on-line; we'll init the > + * secondary cpus' numa_mem as they come on-line. During > + * node/memory hotplug, we'll fixup all on-line cpus. > + */ > + if (cpu_online(cpu)) > + cpu_to_mem(cpu) = local_memory_node(cpu_to_node(cpu)); Please make cpu_to_node() evaluate to a rvalue and use set_numa_mem() to set node. The above is a bit too easy to get wrong when archs override the macro. > +#ifdef CONFIG_HAVE_MEMORYLESS_NODES > +int local_memory_node(int node_id); > +#else > +static inline int local_memory_node(int node_id) { return node_id; }; > +#endif Hmmm... can there be local_memory_node() users when MEMORYLESS_NODES is not enabled? Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: KAMEZAWA Hiroyuki Subject: Re: [PATCH 1/8] numa: add generic percpu var numa_node_id() implementation Date: Mon, 19 Apr 2010 11:32:47 +0900 Message-ID: <20100419113247.27fd0ea0.kamezawa.hiroyu@jp.fujitsu.com> References: <20100415172950.8801.60358.sendpatchset@localhost.localdomain> <20100415172956.8801.18133.sendpatchset@localhost.localdomain> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <20100415172956.8801.18133.sendpatchset@localhost.localdomain> Sender: linux-numa-owner@vger.kernel.org List-ID: Content-Type: text/plain; charset="us-ascii" To: Lee Schermerhorn Cc: linux-mm@kvack.org, linux-numa@vger.kernel.org, Tejun Heo , Mel Gorman , andi@firstfloor.org, Christoph Lameter , Nick Piggin , David Rientjes , eric.whitney@hp.com, Andrew Morton On Thu, 15 Apr 2010 13:29:56 -0400 Lee Schermerhorn wrote: > Against: 2.6.34-rc3-mmotm-100405-1609 > > Rework the generic version of the numa_node_id() function to use the > new generic percpu variable infrastructure. > > Guard the new implementation with a new config option: > > CONFIG_USE_PERCPU_NUMA_NODE_ID. > > Archs which support this new implemention will default this option > to 'y' when NUMA is configured. This config option could be removed > if/when all archs switch over to the generic percpu implementation > of numa_node_id(). Arch support involves: > > 1) converting any existing per cpu variable implementations to use > this implementation. x86_64 is an instance of such an arch. > 2) archs that don't use a per cpu variable for numa_node_id() will > need to initialize the new per cpu variable "numa_node" as cpus > are brought on-line. ia64 is an example. > 3) Defining USE_PERCPU_NUMA_NODE_ID in arch dependent Kconfig--e.g., > when NUMA is configured. This is required because I have > retained the old implementation by default to allow archs to > be modified incrementally, as desired. > > Subsequent patches will convert x86_64 and ia64 to use this > implemenation. > > Signed-off-by: Lee Schermerhorn Reviewed-by: KAMEZAWA Hiroyuki > > --- > > V0: > # From cl@linux-foundation.org Wed Nov 4 10:36:12 2009 > # Date: Wed, 4 Nov 2009 12:35:14 -0500 (EST) > # From: Christoph Lameter > # To: Lee Schermerhorn > # Subject: Re: [PATCH/RFC] slab: handle memoryless nodes efficiently > # > # I have a very early form of a draft of a patch here that genericizes > # numa_node_id(). Uses the new generic this_cpu_xxx stuff. > # > # Not complete. > > V1: > + split out x86 specific changes to subsequent patch > + split out "numa_mem_id()" and related changes to separate patch > + moved generic definitions of __this_cpu_xxx from linux/percpu.h > to asm-generic/percpu.h where asm/percpu.h and other asm hdrs > can use them. > + export new percpu symbol 'numa_node' in mm/percpu.h > + include in for use by new > numa_node_id(). > > V2: > + add back the #ifndef/#endif guard around numa_node_id() so that archs > can override generic definition > + add generic stub for set_numa_node() > + use generic percpu numa_node_id() only if enabled by > CONFIG_USE_PERCPU_NUMA_NODE_ID > to allow incremental per arch support. This option could be removed when/if > all archs that support NUMA support this option. > > V3: > + separated the rework of linux/percpu.h into another [preceding] patch. > + moved definition of the numa_node percpu variable from mm/percpu.c to > mm/page-alloc.c > + moved premature definition of cpu_to_mem() to later patch. > > V4: > + topology.h: include rather than > Requires Tejun Heo's percpu.h/slab.h cleanup series > > include/linux/topology.h | 33 ++++++++++++++++++++++++++++----- > mm/page_alloc.c | 5 +++++ > 2 files changed, 33 insertions(+), 5 deletions(-) > > Index: linux-2.6.34-rc3-mmotm-100405-1609/mm/page_alloc.c > =================================================================== > --- linux-2.6.34-rc3-mmotm-100405-1609.orig/mm/page_alloc.c 2010-04-07 10:04:04.000000000 -0400 > +++ linux-2.6.34-rc3-mmotm-100405-1609/mm/page_alloc.c 2010-04-07 10:10:23.000000000 -0400 > @@ -56,6 +56,11 @@ > #include > #include "internal.h" > > +#ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID > +DEFINE_PER_CPU(int, numa_node); > +EXPORT_PER_CPU_SYMBOL(numa_node); > +#endif > + > /* > * Array of node states. > */ > Index: linux-2.6.34-rc3-mmotm-100405-1609/include/linux/topology.h > =================================================================== > --- linux-2.6.34-rc3-mmotm-100405-1609.orig/include/linux/topology.h 2010-04-07 09:49:13.000000000 -0400 > +++ linux-2.6.34-rc3-mmotm-100405-1609/include/linux/topology.h 2010-04-07 10:10:23.000000000 -0400 > @@ -31,6 +31,7 @@ > #include > #include > #include > +#include > #include > > #ifndef node_has_online_mem > @@ -203,8 +204,35 @@ int arch_update_cpu_topology(void); > #ifndef SD_NODE_INIT > #error Please define an appropriate SD_NODE_INIT in include/asm/topology.h!!! > #endif > + > #endif /* CONFIG_NUMA */ > > +#ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID > +DECLARE_PER_CPU(int, numa_node); > + > +#ifndef numa_node_id > +/* Returns the number of the current Node. */ > +#define numa_node_id() __this_cpu_read(numa_node) > +#endif > + > +#ifndef cpu_to_node > +#define cpu_to_node(__cpu) per_cpu(numa_node, (__cpu)) > +#endif > + > +#ifndef set_numa_node > +#define set_numa_node(__node) percpu_write(numa_node, __node) > +#endif > + > +#else /* !CONFIG_USE_PERCPU_NUMA_NODE_ID */ > + > +/* Returns the number of the current Node. */ > +#ifndef numa_node_id > +#define numa_node_id() (cpu_to_node(raw_smp_processor_id())) > + > +#endif > + > +#endif /* [!]CONFIG_USE_PERCPU_NUMA_NODE_ID */ > + > #ifndef topology_physical_package_id > #define topology_physical_package_id(cpu) ((void)(cpu), -1) > #endif > @@ -218,9 +246,4 @@ int arch_update_cpu_topology(void); > #define topology_core_cpumask(cpu) cpumask_of(cpu) > #endif > > -/* Returns the number of the current Node. */ > -#ifndef numa_node_id > -#define numa_node_id() (cpu_to_node(raw_smp_processor_id())) > -#endif > - > #endif /* _LINUX_TOPOLOGY_H */ > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org > From mboxrd@z Thu Jan 1 00:00:00 1970 From: KAMEZAWA Hiroyuki Subject: Re: [PATCH 3/8] numa: ia64: use generic percpu var numa_node_id() implementation Date: Mon, 19 Apr 2010 11:51:34 +0900 Message-ID: <20100419115134.bd756fdb.kamezawa.hiroyu@jp.fujitsu.com> References: <20100415172950.8801.60358.sendpatchset@localhost.localdomain> <20100415173009.8801.67345.sendpatchset@localhost.localdomain> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <20100415173009.8801.67345.sendpatchset@localhost.localdomain> Sender: linux-numa-owner@vger.kernel.org List-ID: Content-Type: text/plain; charset="us-ascii" To: Lee Schermerhorn Cc: linux-mm@kvack.org, linux-numa@vger.kernel.org, Tejun Heo , Mel Gorman , andi@firstfloor.org, Christoph Lameter , Nick Piggin , David Rientjes , eric.whitney@hp.com, Andrew Morton On Thu, 15 Apr 2010 13:30:09 -0400 Lee Schermerhorn wrote: > Against: 2.6.34-rc3-mmotm-100405-1609 > > ia64: Use generic percpu implementation of numa_node_id() > + intialize per cpu 'numa_node' > + remove ia64 cpu_to_node() macro; use generic > + define CONFIG_USE_PERCPU_NUMA_NODE_ID when NUMA configured > > Signed-off-by: Lee Schermerhorn > Reviewed-by: Christoph Lameter > Reviewd-by: KAMEZAWA Hiroyuki BTW, Could add some explanation about "when numa_node_id() turns to be available" ? IIUC, - BOOT cpu ... after smp_prepare_boot_cpu() - Other cpu .. after smp_init() (i.e. always.) Right ? I'm sorry if it's well-known. Thanks, -Kame > --- > > New in V2 > > V3, V4: no change > > arch/ia64/Kconfig | 4 ++++ > arch/ia64/include/asm/topology.h | 5 ----- > arch/ia64/kernel/smpboot.c | 6 ++++++ > 3 files changed, 10 insertions(+), 5 deletions(-) > > Index: linux-2.6.34-rc3-mmotm-100405-1609/arch/ia64/kernel/smpboot.c > =================================================================== > --- linux-2.6.34-rc3-mmotm-100405-1609.orig/arch/ia64/kernel/smpboot.c 2010-04-07 10:03:38.000000000 -0400 > +++ linux-2.6.34-rc3-mmotm-100405-1609/arch/ia64/kernel/smpboot.c 2010-04-07 10:10:27.000000000 -0400 > @@ -390,6 +390,11 @@ smp_callin (void) > > fix_b0_for_bsp(); > > + /* > + * numa_node_id() works after this. > + */ > + set_numa_node(cpu_to_node_map[cpuid]); > + > ipi_call_lock_irq(); > spin_lock(&vector_lock); > /* Setup the per cpu irq handling data structures */ > @@ -632,6 +637,7 @@ void __devinit smp_prepare_boot_cpu(void > { > cpu_set(smp_processor_id(), cpu_online_map); > cpu_set(smp_processor_id(), cpu_callin_map); > + set_numa_node(cpu_to_node_map[smp_processor_id()]); > per_cpu(cpu_state, smp_processor_id()) = CPU_ONLINE; > paravirt_post_smp_prepare_boot_cpu(); > } > Index: linux-2.6.34-rc3-mmotm-100405-1609/arch/ia64/include/asm/topology.h > =================================================================== > --- linux-2.6.34-rc3-mmotm-100405-1609.orig/arch/ia64/include/asm/topology.h 2010-04-07 09:49:13.000000000 -0400 > +++ linux-2.6.34-rc3-mmotm-100405-1609/arch/ia64/include/asm/topology.h 2010-04-07 10:10:27.000000000 -0400 > @@ -26,11 +26,6 @@ > #define RECLAIM_DISTANCE 15 > > /* > - * Returns the number of the node containing CPU 'cpu' > - */ > -#define cpu_to_node(cpu) (int)(cpu_to_node_map[cpu]) > - > -/* > * Returns a bitmask of CPUs on Node 'node'. > */ > #define cpumask_of_node(node) ((node) == -1 ? \ > Index: linux-2.6.34-rc3-mmotm-100405-1609/arch/ia64/Kconfig > =================================================================== > --- linux-2.6.34-rc3-mmotm-100405-1609.orig/arch/ia64/Kconfig 2010-04-07 10:04:03.000000000 -0400 > +++ linux-2.6.34-rc3-mmotm-100405-1609/arch/ia64/Kconfig 2010-04-07 10:10:27.000000000 -0400 > @@ -497,6 +497,10 @@ config HAVE_ARCH_NODEDATA_EXTENSION > def_bool y > depends on NUMA > > +config USE_PERCPU_NUMA_NODE_ID > + def_bool y > + depends on NUMA > + > config ARCH_PROC_KCORE_TEXT > def_bool y > depends on PROC_KCORE > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org > From mboxrd@z Thu Jan 1 00:00:00 1970 From: Lee Schermerhorn Subject: Re: [PATCH 0/8] Numa: Use Generic Per-cpu Variables for numa_*_id() Date: Mon, 19 Apr 2010 09:29:20 -0400 Message-ID: <1271683760.10937.35.camel@useless.americas.hpqcorp.net> References: <20100415172950.8801.60358.sendpatchset@localhost.localdomain> <4BCA7A26.9040208@kernel.org> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <4BCA7A26.9040208@kernel.org> Sender: linux-numa-owner@vger.kernel.org List-ID: Content-Type: text/plain; charset="us-ascii" To: Tejun Heo Cc: linux-mm@kvack.org, linux-numa@vger.kernel.org, Mel Gorman , andi@firstfloor.org, Christoph Lameter , Nick Piggin , David Rientjes , eric.whitney@hp.com, Andrew Morton , KAMEZAWA Hiroyuki On Sun, 2010-04-18 at 12:19 +0900, Tejun Heo wrote: > On 04/16/2010 02:29 AM, Lee Schermerhorn wrote: > > Use Generic Per cpu infrastructure for numa_*_id() V4 > > > > Series Against: 2.6.34-rc3-mmotm-100405-1609 > > Other than the minor nitpicks, the patchset looks great to me. > Through which tree should this be routed? If no one else is gonna > take it, I can route it through percpu after patchset refresh. Andrew has merged this set into the -mm tree. I think that's fine and will proceed to address all of the comments there as incremental patches. I have comments/requests from yourself: 2/8: seconding Christoph's suggestion re: generic function to add generic function to set per cpu node id; plus suggestion to use numa_node_id() in common.c::cpu_init(). 4/8: lose the "#define numa_mem numa_node". I'll need to rework this. Currently, one can access the per cpu variable 'numa_node' directly as such. I added 'numa_mem' [actually got it from Christoph's starter patch] as an analog to numa_node. I/Christoph wanted to eliminate the redundant variable when it wasn't needed, but not break code that directly accesses it. Maybe better to not provide it at all? 5/8: wording error in patch description. Randy D and Kamezawa-san: comments on documentation patch Kame-san: request for clarification in 3/8 Thanks, Lee From mboxrd@z Thu Jan 1 00:00:00 1970 From: Lee Schermerhorn Subject: Re: [PATCH 2/8] numa: x86_64: use generic percpu var numa_node_id() implementation Date: Thu, 29 Apr 2010 12:56:48 -0400 Message-ID: <1272560208.4927.39.camel@useless.americas.hpqcorp.net> References: <20100415172950.8801.60358.sendpatchset@localhost.localdomain> <20100415173003.8801.48519.sendpatchset@localhost.localdomain> <4BCA74D8.3030503@kernel.org> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <4BCA74D8.3030503@kernel.org> Sender: owner-linux-mm@kvack.org List-Id: Content-Type: text/plain; charset="us-ascii" To: Tejun Heo Cc: Christoph Lameter , linux-mm@kvack.org, linux-numa@vger.kernel.org, Mel Gorman , andi@firstfloor.org, Nick Piggin , David Rientjes , eric.whitney@hp.com, Andrew Morton , KAMEZAWA Hiroyuki On Sun, 2010-04-18 at 11:56 +0900, Tejun Heo wrote: > On 04/17/2010 01:46 AM, Christoph Lameter wrote: > > Maybe provide a generic function to set the node for cpu X? > > Yeap, seconded. Also, why not use numa_node_id() in > common.c::cpu_init()? Tejun: do you mean: #ifdef CONFIG_NUMA if (cpu != 0 && percpu_read(numa_node) == 0 && ........................^ here? early_cpu_to_node(cpu) != NUMA_NO_NODE) set_numa_node(early_cpu_to_node(cpu)); #endif Looks like 'numa_node_id()' would work there. But, I wonder what the "cpu != 0 && percpu_read(numa_node) == 0" is trying to do? E.g., is "cpu != 0" testing "cpu != boot_cpu_id"? Is there an implicit assumption that the boot cpu is zero? Or just a non-zero cpuid is obviously initialized? And the "percpu_read(numa_node) == 0" is testing that this cpu's 'numa_node' MAY not be initialized? 0 is a valid node id for !0 cpu ids. But it's OK to reinitialize numa_node in that case. Just trying to grok the intent. Maybe someone will chime in. Anyway, if the intent is to test the percpu 'numa_node' for initialization, using numa_node_id() might obscure this even more. Lee -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Tejun Heo Subject: Re: [PATCH 2/8] numa: x86_64: use generic percpu var numa_node_id() implementation Date: Fri, 30 Apr 2010 06:58:10 +0200 Message-ID: <4BDA6362.4030505@kernel.org> References: <20100415172950.8801.60358.sendpatchset@localhost.localdomain> <20100415173003.8801.48519.sendpatchset@localhost.localdomain> <4BCA74D8.3030503@kernel.org> <1272560208.4927.39.camel@useless.americas.hpqcorp.net> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <1272560208.4927.39.camel@useless.americas.hpqcorp.net> Sender: owner-linux-mm@kvack.org List-Id: Content-Type: text/plain; charset="us-ascii" To: Lee Schermerhorn Cc: Christoph Lameter , linux-mm@kvack.org, linux-numa@vger.kernel.org, Mel Gorman , andi@firstfloor.org, Nick Piggin , David Rientjes , eric.whitney@hp.com, Andrew Morton , KAMEZAWA Hiroyuki Hello, On 04/29/2010 06:56 PM, Lee Schermerhorn wrote: > Tejun: do you mean: > > #ifdef CONFIG_NUMA > if (cpu != 0 && percpu_read(numa_node) == 0 && > ........................^ here? > early_cpu_to_node(cpu) != NUMA_NO_NODE) > set_numa_node(early_cpu_to_node(cpu)); > #endif > > Looks like 'numa_node_id()' would work there. Yeah, it just looked weird to use raw variable when an access wrapper is there. > But, I wonder what the "cpu != 0 && percpu_read(numa_node) == 0" is > trying to do? That I have don't have any clue about. :-) > Just trying to grok the intent. Maybe someone will chime in. Christoph? Mel? Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Christoph Lameter Subject: Re: [PATCH 2/8] numa: x86_64: use generic percpu var numa_node_id() implementation Date: Sat, 1 May 2010 20:49:41 -0500 (CDT) Message-ID: References: <20100415172950.8801.60358.sendpatchset@localhost.localdomain> <20100415173003.8801.48519.sendpatchset@localhost.localdomain> <4BCA74D8.3030503@kernel.org> <1272560208.4927.39.camel@useless.americas.hpqcorp.net> <4BDA6362.4030505@kernel.org> Mime-Version: 1.0 Return-path: In-Reply-To: <4BDA6362.4030505@kernel.org> Sender: owner-linux-mm@kvack.org List-Id: Content-Type: TEXT/PLAIN; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Tejun Heo Cc: Lee Schermerhorn , linux-mm@kvack.org, linux-numa@vger.kernel.org, Mel Gorman , andi@firstfloor.org, Nick Piggin , David Rientjes , eric.whitney@hp.com, Andrew Morton , KAMEZAWA Hiroyuki On Fri, 30 Apr 2010, Tejun Heo wrote: > Hello, > > On 04/29/2010 06:56 PM, Lee Schermerhorn wrote: > > Tejun: do you mean: > > > > #ifdef CONFIG_NUMA > > if (cpu != 0 && percpu_read(numa_node) == 0 && > > ........................^ here? > > early_cpu_to_node(cpu) != NUMA_NO_NODE) > > set_numa_node(early_cpu_to_node(cpu)); > > #endif > > > > Looks like 'numa_node_id()' would work there. > > Yeah, it just looked weird to use raw variable when an access wrapper > is there. > > > But, I wonder what the "cpu != 0 && percpu_read(numa_node) == 0" is > > trying to do? > > That I have don't have any clue about. :-) I guess that cpu 0 is used for booting and its initialized early when certain functionality is not available yet. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Andrew Morton Subject: Re: [PATCH 6/8] numa: slab: use numa_mem_id() for slab local memory node Date: Wed, 12 May 2010 11:49:00 -0700 Message-ID: <20100512114900.a12c4b35.akpm@linux-foundation.org> References: <20100415172950.8801.60358.sendpatchset@localhost.localdomain> <20100415173030.8801.84836.sendpatchset@localhost.localdomain> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <20100415173030.8801.84836.sendpatchset@localhost.localdomain> Sender: owner-linux-mm@kvack.org List-Id: Content-Type: text/plain; charset="us-ascii" To: Lee Schermerhorn Cc: linux-mm@kvack.org, linux-numa@vger.kernel.org, Tejun Heo , Mel Gorman , Andi@domain.invalid, Kleen@domain.invalid, andi@firstfloor.org, Christoph Lameter , Nick Piggin , David Rientjes , eric.whitney@hp.com, KAMEZAWA Hiroyuki I have a note here that this patch "breaks slab.c". But I don't recall what the problem was and I don't see a fix against this patch in your recently-sent fixup series? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Lee Schermerhorn Subject: Re: [PATCH 6/8] numa: slab: use numa_mem_id() for slab local memory node Date: Wed, 12 May 2010 15:11:43 -0400 Message-ID: <1273691503.6985.142.camel@useless.americas.hpqcorp.net> References: <20100415172950.8801.60358.sendpatchset@localhost.localdomain> <20100415173030.8801.84836.sendpatchset@localhost.localdomain> <20100512114900.a12c4b35.akpm@linux-foundation.org> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <20100512114900.a12c4b35.akpm@linux-foundation.org> Sender: linux-numa-owner@vger.kernel.org List-ID: Content-Type: text/plain; charset="us-ascii" To: Andrew Morton Cc: linux-mm@kvack.org, linux-numa@vger.kernel.org, Tejun Heo , Mel Gorman , Andi Kleen , Christoph Lameter , Nick Piggin , David Rientjes , eric.whitney@hp.com, KAMEZAWA Hiroyuki , Valdis.Kletnieks@vt.edu On Wed, 2010-05-12 at 11:49 -0700, Andrew Morton wrote: > I have a note here that this patch "breaks slab.c". But I don't recall what > the problem was and I don't see a fix against this patch in your recently-sent > fixup series? Is that Valdis Kletnieks' issue? That was an i386 build. Happened because the earlier patches didn't properly default numa_mem_id() to numa_node_id() for the i386 build. The rework to those patches has fixed that. I have successfully built mmotm with the rework patches for i386+!NUMA. Valdis tested the series and confirmed that it fixed the problem. Lee From mboxrd@z Thu Jan 1 00:00:00 1970 From: Valdis.Kletnieks@vt.edu Subject: Re: [PATCH 6/8] numa: slab: use numa_mem_id() for slab local memory node Date: Wed, 12 May 2010 15:25:51 -0400 Message-ID: <4170.1273692351@localhost> References: <20100415172950.8801.60358.sendpatchset@localhost.localdomain> <20100415173030.8801.84836.sendpatchset@localhost.localdomain> <20100512114900.a12c4b35.akpm@linux-foundation.org> <1273691503.6985.142.camel@useless.americas.hpqcorp.net> Mime-Version: 1.0 Content-Type: multipart/signed; boundary="==_Exmh_1273692351_3904P"; micalg=pgp-sha1; protocol="application/pgp-signature" Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: Your message of "Wed, 12 May 2010 15:11:43 EDT." <1273691503.6985.142.camel@useless.americas.hpqcorp.net> Sender: owner-linux-mm@kvack.org List-Id: To: Lee Schermerhorn Cc: Andrew Morton , linux-mm@kvack.org, linux-numa@vger.kernel.org, Tejun Heo , Mel Gorman , Andi Kleen , Christoph Lameter , Nick Piggin , David Rientjes , eric.whitney@hp.com, KAMEZAWA Hiroyuki --==_Exmh_1273692351_3904P Content-Type: text/plain; charset=us-ascii On Wed, 12 May 2010 15:11:43 EDT, Lee Schermerhorn said: > On Wed, 2010-05-12 at 11:49 -0700, Andrew Morton wrote: > > I have a note here that this patch "breaks slab.c". But I don't recall what > > the problem was and I don't see a fix against this patch in your recently-sent > > fixup series? > > Is that Valdis Kletnieks' issue? That was an i386 build. Happened > because the earlier patches didn't properly default numa_mem_id() to > numa_node_id() for the i386 build. The rework to those patches has > fixed that. I have successfully built mmotm with the rework patches > for i386+!NUMA. Valdis tested the series and confirmed that it fixed > the problem. I thought the problem was common to both i386 and X86_64 non-NUMA (which is where I hit the problem). In any case, builds OK for me now. --==_Exmh_1273692351_3904P Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (GNU/Linux) Comment: Exmh version 2.5 07/13/2001 iD8DBQFL6wC/cC3lWbTT17ARAhAZAJ0Xr2Psa71AVoIG2Y3OnnggsC3CTwCg9X8e X57rbf1qSyZEJI6d9Jl0OuY= =kSyj -----END PGP SIGNATURE----- --==_Exmh_1273692351_3904P-- -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Lee Schermerhorn Subject: Re: [PATCH 6/8] numa: slab: use numa_mem_id() for slab local memory node Date: Wed, 12 May 2010 16:03:35 -0400 Message-ID: <1273694615.6985.153.camel@useless.americas.hpqcorp.net> References: <20100415172950.8801.60358.sendpatchset@localhost.localdomain> <20100415173030.8801.84836.sendpatchset@localhost.localdomain> <20100512114900.a12c4b35.akpm@linux-foundation.org> <1273691503.6985.142.camel@useless.americas.hpqcorp.net> <4170.1273692351@localhost> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <4170.1273692351@localhost> Sender: linux-numa-owner@vger.kernel.org List-ID: Content-Type: text/plain; charset="us-ascii" To: Valdis.Kletnieks@vt.edu Cc: Andrew Morton , linux-mm@kvack.org, linux-numa@vger.kernel.org, Tejun Heo , Mel Gorman , Andi Kleen , Christoph Lameter , Nick Piggin , David Rientjes , eric.whitney@hp.com, KAMEZAWA Hiroyuki On Wed, 2010-05-12 at 15:25 -0400, Valdis.Kletnieks@vt.edu wrote: > On Wed, 12 May 2010 15:11:43 EDT, Lee Schermerhorn said: > > On Wed, 2010-05-12 at 11:49 -0700, Andrew Morton wrote: > > > I have a note here that this patch "breaks slab.c". But I don't recall what > > > the problem was and I don't see a fix against this patch in your recently-sent > > > fixup series? > > > > Is that Valdis Kletnieks' issue? That was an i386 build. Happened > > because the earlier patches didn't properly default numa_mem_id() to > > numa_node_id() for the i386 build. The rework to those patches has > > fixed that. I have successfully built mmotm with the rework patches > > for i386+!NUMA. Valdis tested the series and confirmed that it fixed > > the problem. > > I thought the problem was common to both i386 and X86_64 non-NUMA (which is > where I hit the problem). In any case, builds OK for me now. The x86_64 !NUMA issue was another one I introduced in the rework -- patch 1/7 first version you tested. Fixed in the current version. Happened because x86_64 defines it's own fallback for numa_node_id(). See the description of patch 1/7. Turns out x86_64 builds fine with NUMA or !NUMA if I just remove the !NUMA numa_node_id() definition. I'll submit that patch shortly. Lee From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with ESMTP id DF4AC6B020C for ; Thu, 15 Apr 2010 13:30:25 -0400 (EDT) From: Lee Schermerhorn Date: Thu, 15 Apr 2010 13:29:50 -0400 Message-Id: <20100415172950.8801.60358.sendpatchset@localhost.localdomain> Subject: [PATCH 0/8] Numa: Use Generic Per-cpu Variables for numa_*_id() Sender: owner-linux-mm@kvack.org To: linux-mm@kvack.org, linux-numa@vger.kernel.org Cc: Tejun Heo , Mel Gorman , Andi@domain.invalid, Kleen@domain.invalid, andi@firstfloor.org, Christoph Lameter , Nick Piggin , David Rientjes , eric.whitney@hp.com, Andrew Morton , KAMEZAWA Hiroyuki List-ID: Use Generic Per cpu infrastructure for numa_*_id() V4 Series Against: 2.6.34-rc3-mmotm-100405-1609 Background: V1 of this series resolved a fairly serious performance problem on our ia64 platforms with memoryless nodes because SLAB cannot cache object from a remote node, even tho' that node is the effective "local memory node" for a given cpu. V1 caused no regression in x86_64 [a slight improvement even] for the admittedly few tests that I ran. Christoph Lameter suggested the approach implemented in V2 and later: define a new function--numa_mem_id()--that returns the "local memory node" for cpus attached to memoryless nodes. Christoph also suggested that, while at it, I could modify the implementation of numa_node_id() [and the related cpu_to_node()] to use the generic percpu variable implementation. While implementing V2, I encountered a circular header dependency between: topology.h -> percpu.h -> slab.h -> gfp.h -> topology.h I resolved this by moving the generic percpu functions to include/asm-generic/percpu.h so that various arch asm/percpu.h could include that, and topology.h could include asm/percpu.h to avoid including slab.h, breaking the circular dependency. Reviewers didn't like that. Matthew Willcox suggested that I uninline percpu_alloc()/free() for the !SMP config and remove slab.h from percpu.h. I tried that. I broke the build of a LOT of files. Tejun Heo mentioned that percpu-defs.h would be a better place for the generic function definitions. V3 implemented that suggestion. Later, Tejun decided to jump in and remove slab.h from percpu.h and semi- automagically fix up all of the affected modules. V4 is implemented atop Tejun's series now in mmotm. Again, this solves the slab performance problem on our servers configured with memoryless nodes, and shows no regression with hackbench on x86_64. Of course, more performance testing would be welcome. The slab changes in patch 6 of the series need review w/rt to node hot plug that could change the effective "local memory node" for a memoryless node by inserting a "nearer" node in the zonelists. An additional patch may be required to address this. Lee -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with ESMTP id 65DDF6B020D for ; Thu, 15 Apr 2010 13:30:36 -0400 (EDT) From: Lee Schermerhorn Date: Thu, 15 Apr 2010 13:29:56 -0400 Message-Id: <20100415172956.8801.18133.sendpatchset@localhost.localdomain> In-Reply-To: <20100415172950.8801.60358.sendpatchset@localhost.localdomain> References: <20100415172950.8801.60358.sendpatchset@localhost.localdomain> Subject: [PATCH 1/8] numa: add generic percpu var numa_node_id() implementation Sender: owner-linux-mm@kvack.org To: linux-mm@kvack.org, linux-numa@vger.kernel.org Cc: Tejun Heo , Mel Gorman , Andi@domain.invalid, Kleen@domain.invalid, andi@firstfloor.org, Christoph Lameter , Nick Piggin , David Rientjes , eric.whitney@hp.com, Andrew Morton , KAMEZAWA Hiroyuki List-ID: Against: 2.6.34-rc3-mmotm-100405-1609 Rework the generic version of the numa_node_id() function to use the new generic percpu variable infrastructure. Guard the new implementation with a new config option: CONFIG_USE_PERCPU_NUMA_NODE_ID. Archs which support this new implemention will default this option to 'y' when NUMA is configured. This config option could be removed if/when all archs switch over to the generic percpu implementation of numa_node_id(). Arch support involves: 1) converting any existing per cpu variable implementations to use this implementation. x86_64 is an instance of such an arch. 2) archs that don't use a per cpu variable for numa_node_id() will need to initialize the new per cpu variable "numa_node" as cpus are brought on-line. ia64 is an example. 3) Defining USE_PERCPU_NUMA_NODE_ID in arch dependent Kconfig--e.g., when NUMA is configured. This is required because I have retained the old implementation by default to allow archs to be modified incrementally, as desired. Subsequent patches will convert x86_64 and ia64 to use this implemenation. Signed-off-by: Lee Schermerhorn --- V0: # From cl@linux-foundation.org Wed Nov 4 10:36:12 2009 # Date: Wed, 4 Nov 2009 12:35:14 -0500 (EST) # From: Christoph Lameter # To: Lee Schermerhorn # Subject: Re: [PATCH/RFC] slab: handle memoryless nodes efficiently # # I have a very early form of a draft of a patch here that genericizes # numa_node_id(). Uses the new generic this_cpu_xxx stuff. # # Not complete. V1: + split out x86 specific changes to subsequent patch + split out "numa_mem_id()" and related changes to separate patch + moved generic definitions of __this_cpu_xxx from linux/percpu.h to asm-generic/percpu.h where asm/percpu.h and other asm hdrs can use them. + export new percpu symbol 'numa_node' in mm/percpu.h + include in for use by new numa_node_id(). V2: + add back the #ifndef/#endif guard around numa_node_id() so that archs can override generic definition + add generic stub for set_numa_node() + use generic percpu numa_node_id() only if enabled by CONFIG_USE_PERCPU_NUMA_NODE_ID to allow incremental per arch support. This option could be removed when/if all archs that support NUMA support this option. V3: + separated the rework of linux/percpu.h into another [preceding] patch. + moved definition of the numa_node percpu variable from mm/percpu.c to mm/page-alloc.c + moved premature definition of cpu_to_mem() to later patch. V4: + topology.h: include rather than Requires Tejun Heo's percpu.h/slab.h cleanup series include/linux/topology.h | 33 ++++++++++++++++++++++++++++----- mm/page_alloc.c | 5 +++++ 2 files changed, 33 insertions(+), 5 deletions(-) Index: linux-2.6.34-rc3-mmotm-100405-1609/mm/page_alloc.c =================================================================== --- linux-2.6.34-rc3-mmotm-100405-1609.orig/mm/page_alloc.c 2010-04-07 10:04:04.000000000 -0400 +++ linux-2.6.34-rc3-mmotm-100405-1609/mm/page_alloc.c 2010-04-07 10:10:23.000000000 -0400 @@ -56,6 +56,11 @@ #include #include "internal.h" +#ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID +DEFINE_PER_CPU(int, numa_node); +EXPORT_PER_CPU_SYMBOL(numa_node); +#endif + /* * Array of node states. */ Index: linux-2.6.34-rc3-mmotm-100405-1609/include/linux/topology.h =================================================================== --- linux-2.6.34-rc3-mmotm-100405-1609.orig/include/linux/topology.h 2010-04-07 09:49:13.000000000 -0400 +++ linux-2.6.34-rc3-mmotm-100405-1609/include/linux/topology.h 2010-04-07 10:10:23.000000000 -0400 @@ -31,6 +31,7 @@ #include #include #include +#include #include #ifndef node_has_online_mem @@ -203,8 +204,35 @@ int arch_update_cpu_topology(void); #ifndef SD_NODE_INIT #error Please define an appropriate SD_NODE_INIT in include/asm/topology.h!!! #endif + #endif /* CONFIG_NUMA */ +#ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID +DECLARE_PER_CPU(int, numa_node); + +#ifndef numa_node_id +/* Returns the number of the current Node. */ +#define numa_node_id() __this_cpu_read(numa_node) +#endif + +#ifndef cpu_to_node +#define cpu_to_node(__cpu) per_cpu(numa_node, (__cpu)) +#endif + +#ifndef set_numa_node +#define set_numa_node(__node) percpu_write(numa_node, __node) +#endif + +#else /* !CONFIG_USE_PERCPU_NUMA_NODE_ID */ + +/* Returns the number of the current Node. */ +#ifndef numa_node_id +#define numa_node_id() (cpu_to_node(raw_smp_processor_id())) + +#endif + +#endif /* [!]CONFIG_USE_PERCPU_NUMA_NODE_ID */ + #ifndef topology_physical_package_id #define topology_physical_package_id(cpu) ((void)(cpu), -1) #endif @@ -218,9 +246,4 @@ int arch_update_cpu_topology(void); #define topology_core_cpumask(cpu) cpumask_of(cpu) #endif -/* Returns the number of the current Node. */ -#ifndef numa_node_id -#define numa_node_id() (cpu_to_node(raw_smp_processor_id())) -#endif - #endif /* _LINUX_TOPOLOGY_H */ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with ESMTP id 1EB6E6B0210 for ; Thu, 15 Apr 2010 13:30:44 -0400 (EDT) From: Lee Schermerhorn Date: Thu, 15 Apr 2010 13:30:09 -0400 Message-Id: <20100415173009.8801.67345.sendpatchset@localhost.localdomain> In-Reply-To: <20100415172950.8801.60358.sendpatchset@localhost.localdomain> References: <20100415172950.8801.60358.sendpatchset@localhost.localdomain> Subject: [PATCH 3/8] numa: ia64: use generic percpu var numa_node_id() implementation Sender: owner-linux-mm@kvack.org To: linux-mm@kvack.org, linux-numa@vger.kernel.org Cc: Tejun Heo , Mel Gorman , Andi@domain.invalid, Kleen@domain.invalid, andi@firstfloor.org, Christoph Lameter , Nick Piggin , David Rientjes , eric.whitney@hp.com, Andrew Morton , KAMEZAWA Hiroyuki List-ID: Against: 2.6.34-rc3-mmotm-100405-1609 ia64: Use generic percpu implementation of numa_node_id() + intialize per cpu 'numa_node' + remove ia64 cpu_to_node() macro; use generic + define CONFIG_USE_PERCPU_NUMA_NODE_ID when NUMA configured Signed-off-by: Lee Schermerhorn Reviewed-by: Christoph Lameter --- New in V2 V3, V4: no change arch/ia64/Kconfig | 4 ++++ arch/ia64/include/asm/topology.h | 5 ----- arch/ia64/kernel/smpboot.c | 6 ++++++ 3 files changed, 10 insertions(+), 5 deletions(-) Index: linux-2.6.34-rc3-mmotm-100405-1609/arch/ia64/kernel/smpboot.c =================================================================== --- linux-2.6.34-rc3-mmotm-100405-1609.orig/arch/ia64/kernel/smpboot.c 2010-04-07 10:03:38.000000000 -0400 +++ linux-2.6.34-rc3-mmotm-100405-1609/arch/ia64/kernel/smpboot.c 2010-04-07 10:10:27.000000000 -0400 @@ -390,6 +390,11 @@ smp_callin (void) fix_b0_for_bsp(); + /* + * numa_node_id() works after this. + */ + set_numa_node(cpu_to_node_map[cpuid]); + ipi_call_lock_irq(); spin_lock(&vector_lock); /* Setup the per cpu irq handling data structures */ @@ -632,6 +637,7 @@ void __devinit smp_prepare_boot_cpu(void { cpu_set(smp_processor_id(), cpu_online_map); cpu_set(smp_processor_id(), cpu_callin_map); + set_numa_node(cpu_to_node_map[smp_processor_id()]); per_cpu(cpu_state, smp_processor_id()) = CPU_ONLINE; paravirt_post_smp_prepare_boot_cpu(); } Index: linux-2.6.34-rc3-mmotm-100405-1609/arch/ia64/include/asm/topology.h =================================================================== --- linux-2.6.34-rc3-mmotm-100405-1609.orig/arch/ia64/include/asm/topology.h 2010-04-07 09:49:13.000000000 -0400 +++ linux-2.6.34-rc3-mmotm-100405-1609/arch/ia64/include/asm/topology.h 2010-04-07 10:10:27.000000000 -0400 @@ -26,11 +26,6 @@ #define RECLAIM_DISTANCE 15 /* - * Returns the number of the node containing CPU 'cpu' - */ -#define cpu_to_node(cpu) (int)(cpu_to_node_map[cpu]) - -/* * Returns a bitmask of CPUs on Node 'node'. */ #define cpumask_of_node(node) ((node) == -1 ? \ Index: linux-2.6.34-rc3-mmotm-100405-1609/arch/ia64/Kconfig =================================================================== --- linux-2.6.34-rc3-mmotm-100405-1609.orig/arch/ia64/Kconfig 2010-04-07 10:04:03.000000000 -0400 +++ linux-2.6.34-rc3-mmotm-100405-1609/arch/ia64/Kconfig 2010-04-07 10:10:27.000000000 -0400 @@ -497,6 +497,10 @@ config HAVE_ARCH_NODEDATA_EXTENSION def_bool y depends on NUMA +config USE_PERCPU_NUMA_NODE_ID + def_bool y + depends on NUMA + config ARCH_PROC_KCORE_TEXT def_bool y depends on PROC_KCORE -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with ESMTP id 1A1986B0214 for ; Thu, 15 Apr 2010 13:30:59 -0400 (EDT) From: Lee Schermerhorn Date: Thu, 15 Apr 2010 13:30:24 -0400 Message-Id: <20100415173024.8801.36840.sendpatchset@localhost.localdomain> In-Reply-To: <20100415172950.8801.60358.sendpatchset@localhost.localdomain> References: <20100415172950.8801.60358.sendpatchset@localhost.localdomain> Subject: [PATCH 5/8] numa: ia64: support numa_mem_id() for memoryless nodes Sender: owner-linux-mm@kvack.org To: linux-mm@kvack.org, linux-numa@vger.kernel.org Cc: Tejun Heo , Mel Gorman , Andi@domain.invalid, Kleen@domain.invalid, andi@firstfloor.org, Christoph Lameter , Nick Piggin , David Rientjes , eric.whitney@hp.com, Andrew Morton , KAMEZAWA Hiroyuki List-ID: Against: 2.6.34-rc3-mmotm-100405-1609 IA64: Support memoryless nodes Enable 'HAVE_MEMORYLESS_NODES' by default when NUMA configured on ia64. Initialize percpu 'numa_mem' variable when starting secondary cpus. Generic initialization will handle the boot cpu. Nothing uses 'numa_mem_id()' yet. Subsequent patch with modify slab to use this. Signed-off-by: Lee Schermerhorn --- New in V2 V3, V4: no change arch/ia64/Kconfig | 4 ++++ arch/ia64/kernel/smpboot.c | 1 + 2 files changed, 5 insertions(+) Index: linux-2.6.34-rc3-mmotm-100405-1609/arch/ia64/Kconfig =================================================================== --- linux-2.6.34-rc3-mmotm-100405-1609.orig/arch/ia64/Kconfig 2010-04-07 10:10:27.000000000 -0400 +++ linux-2.6.34-rc3-mmotm-100405-1609/arch/ia64/Kconfig 2010-04-07 10:10:30.000000000 -0400 @@ -501,6 +501,10 @@ config USE_PERCPU_NUMA_NODE_ID def_bool y depends on NUMA +config HAVE_MEMORYLESS_NODES + def_bool y + depends on NUMA + config ARCH_PROC_KCORE_TEXT def_bool y depends on PROC_KCORE Index: linux-2.6.34-rc3-mmotm-100405-1609/arch/ia64/kernel/smpboot.c =================================================================== --- linux-2.6.34-rc3-mmotm-100405-1609.orig/arch/ia64/kernel/smpboot.c 2010-04-07 10:10:27.000000000 -0400 +++ linux-2.6.34-rc3-mmotm-100405-1609/arch/ia64/kernel/smpboot.c 2010-04-07 10:10:30.000000000 -0400 @@ -394,6 +394,7 @@ smp_callin (void) * numa_node_id() works after this. */ set_numa_node(cpu_to_node_map[cpuid]); + set_numa_mem(local_memory_node(cpu_to_node_map[cpuid])); ipi_call_lock_irq(); spin_lock(&vector_lock); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with ESMTP id 60618600375 for ; Thu, 15 Apr 2010 13:31:06 -0400 (EDT) From: Lee Schermerhorn Date: Thu, 15 Apr 2010 13:30:16 -0400 Message-Id: <20100415173016.8801.34970.sendpatchset@localhost.localdomain> In-Reply-To: <20100415172950.8801.60358.sendpatchset@localhost.localdomain> References: <20100415172950.8801.60358.sendpatchset@localhost.localdomain> Subject: [PATCH 4/8] numa: Introduce numa_mem_id()- effective local memory node id Sender: owner-linux-mm@kvack.org To: linux-mm@kvack.org, linux-numa@vger.kernel.org Cc: Tejun Heo , Mel Gorman , Andi@domain.invalid, Kleen@domain.invalid, andi@firstfloor.org, Christoph Lameter , Nick Piggin , David Rientjes , eric.whitney@hp.com, Andrew Morton , KAMEZAWA Hiroyuki List-ID: Against: 2.6.34-rc3-mmotm-100405-1609 Introduce numa_mem_id(), based on generic percpu variable infrastructure to track "nearest node with memory" for archs that support memoryless nodes. Define API in when CONFIG_HAVE_MEMORYLESS_NODES defined, else stubs. Architectures will define HAVE_MEMORYLESS_NODES if/when they support them. Archs can override definitions of: numa_mem_id() - returns node number of "local memory" node set_numa_mem() - initialize [this cpus'] per cpu variable 'numa_mem' cpu_to_mem() - return numa_mem for specified cpu; may be used as lvalue Generic initialization of 'numa_mem' occurs in __build_all_zonelists(). This will initialize the boot cpu at boot time, and all cpus on change of numa_zonelist_order, or when node or memory hot-plug requires zonelist rebuild. Archs that support memoryless nodes will need to initialize 'numa_mem' for secondary cpus as they're brought on-line. Signed-off-by: Lee Schermerhorn Signed-off-by: Christoph Lameter --- V2: + split this out of Christoph's incomplete "starter patch" + flesh out the definition V3,V4: no change include/asm-generic/topology.h | 3 +++ include/linux/mmzone.h | 6 ++++++ include/linux/topology.h | 24 ++++++++++++++++++++++++ mm/page_alloc.c | 39 ++++++++++++++++++++++++++++++++++++++- 4 files changed, 71 insertions(+), 1 deletion(-) Index: linux-2.6.34-rc3-mmotm-100405-1609/include/linux/topology.h =================================================================== --- linux-2.6.34-rc3-mmotm-100405-1609.orig/include/linux/topology.h 2010-04-07 10:10:23.000000000 -0400 +++ linux-2.6.34-rc3-mmotm-100405-1609/include/linux/topology.h 2010-04-07 10:10:28.000000000 -0400 @@ -233,6 +233,30 @@ DECLARE_PER_CPU(int, numa_node); #endif /* [!]CONFIG_USE_PERCPU_NUMA_NODE_ID */ +#ifdef CONFIG_HAVE_MEMORYLESS_NODES + +DECLARE_PER_CPU(int, numa_mem); + +#ifndef set_numa_mem +#define set_numa_mem(__node) percpu_write(numa_mem, __node) +#endif + +#else /* !CONFIG_HAVE_MEMORYLESS_NODES */ + +#define numa_mem numa_node +static inline void set_numa_mem(int node) {} + +#endif /* [!]CONFIG_HAVE_MEMORYLESS_NODES */ + +#ifndef numa_mem_id +/* Returns the number of the nearest Node with memory */ +#define numa_mem_id() __this_cpu_read(numa_mem) +#endif + +#ifndef cpu_to_mem +#define cpu_to_mem(__cpu) per_cpu(numa_mem, (__cpu)) +#endif + #ifndef topology_physical_package_id #define topology_physical_package_id(cpu) ((void)(cpu), -1) #endif Index: linux-2.6.34-rc3-mmotm-100405-1609/mm/page_alloc.c =================================================================== --- linux-2.6.34-rc3-mmotm-100405-1609.orig/mm/page_alloc.c 2010-04-07 10:10:23.000000000 -0400 +++ linux-2.6.34-rc3-mmotm-100405-1609/mm/page_alloc.c 2010-04-07 10:10:28.000000000 -0400 @@ -61,6 +61,11 @@ DEFINE_PER_CPU(int, numa_node); EXPORT_PER_CPU_SYMBOL(numa_node); #endif +#ifdef CONFIG_HAVE_MEMORYLESS_NODES +DEFINE_PER_CPU(int, numa_mem); /* Kernel "local memory" node */ +EXPORT_PER_CPU_SYMBOL(numa_mem); +#endif + /* * Array of node states. */ @@ -2752,6 +2757,24 @@ static void build_zonelist_cache(pg_data zlc->z_to_n[z - zonelist->_zonerefs] = zonelist_node_idx(z); } +#ifdef CONFIG_HAVE_MEMORYLESS_NODES +/* + * Return node id of node used for "local" allocations. + * I.e., first node id of first zone in arg node's generic zonelist. + * Used for initializing percpu 'numa_mem', which is used primarily + * for kernel allocations, so use GFP_KERNEL flags to locate zonelist. + */ +int local_memory_node(int node) +{ + struct zone *zone; + + (void)first_zones_zonelist(node_zonelist(node, GFP_KERNEL), + gfp_zone(GFP_KERNEL), + NULL, + &zone); + return zone->node; +} +#endif #else /* CONFIG_NUMA */ @@ -2851,9 +2874,23 @@ static int __build_all_zonelists(void *d * needs the percpu allocator in order to allocate its pagesets * (a chicken-egg dilemma). */ - for_each_possible_cpu(cpu) + for_each_possible_cpu(cpu) { setup_pageset(&per_cpu(boot_pageset, cpu), 0); +#ifdef CONFIG_HAVE_MEMORYLESS_NODES + /* + * We now know the "local memory node" for each node-- + * i.e., the node of the first zone in the generic zonelist. + * Set up numa_mem percpu variable for on-line cpus. During + * boot, only the boot cpu should be on-line; we'll init the + * secondary cpus' numa_mem as they come on-line. During + * node/memory hotplug, we'll fixup all on-line cpus. + */ + if (cpu_online(cpu)) + cpu_to_mem(cpu) = local_memory_node(cpu_to_node(cpu)); +#endif + } + return 0; } Index: linux-2.6.34-rc3-mmotm-100405-1609/include/linux/mmzone.h =================================================================== --- linux-2.6.34-rc3-mmotm-100405-1609.orig/include/linux/mmzone.h 2010-04-07 10:03:46.000000000 -0400 +++ linux-2.6.34-rc3-mmotm-100405-1609/include/linux/mmzone.h 2010-04-07 10:10:28.000000000 -0400 @@ -661,6 +661,12 @@ void memory_present(int nid, unsigned lo static inline void memory_present(int nid, unsigned long start, unsigned long end) {} #endif +#ifdef CONFIG_HAVE_MEMORYLESS_NODES +int local_memory_node(int node_id); +#else +static inline int local_memory_node(int node_id) { return node_id; }; +#endif + #ifdef CONFIG_NEED_NODE_MEMMAP_SIZE unsigned long __init node_memmap_size_bytes(int, unsigned long, unsigned long); #endif Index: linux-2.6.34-rc3-mmotm-100405-1609/include/asm-generic/topology.h =================================================================== --- linux-2.6.34-rc3-mmotm-100405-1609.orig/include/asm-generic/topology.h 2010-04-07 09:49:13.000000000 -0400 +++ linux-2.6.34-rc3-mmotm-100405-1609/include/asm-generic/topology.h 2010-04-07 10:10:28.000000000 -0400 @@ -34,6 +34,9 @@ #ifndef cpu_to_node #define cpu_to_node(cpu) ((void)(cpu),0) #endif +#ifndef cpu_to_mem +#define cpu_to_mem(cpu) (void)(cpu),0) +#endif #ifndef parent_node #define parent_node(node) ((void)(node),0) #endif -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with ESMTP id 9E464600375 for ; Thu, 15 Apr 2010 13:31:27 -0400 (EDT) From: Lee Schermerhorn Date: Thu, 15 Apr 2010 13:30:42 -0400 Message-Id: <20100415173042.8801.17049.sendpatchset@localhost.localdomain> In-Reply-To: <20100415172950.8801.60358.sendpatchset@localhost.localdomain> References: <20100415172950.8801.60358.sendpatchset@localhost.localdomain> Subject: [PATCH 8/8] numa: update Documentation/vm/numa, add memoryless node info Sender: owner-linux-mm@kvack.org To: linux-mm@kvack.org, linux-numa@vger.kernel.org Cc: Tejun Heo , Mel Gorman , Andi@domain.invalid, Kleen@domain.invalid, andi@firstfloor.org, Christoph Lameter , Nick Piggin , David Rientjes , eric.whitney@hp.com, Andrew Morton , KAMEZAWA Hiroyuki List-ID: Against: 2.6.34-rc3-mmotm-100405-1609 Kamezawa Hiroyuki requested documentation for the numa_mem_id() and slab related changes. He suggested Documentation/vm/numa for this documentation. Looking at this file, it seems to me to be hopelessly out of date relative to current Linux NUMA support. At the risk of going down a rathole, I have made an attempt to rewrite the doc at a slightly higher level [I think] and provide pointers to other in-tree documents and out-of-tree man pages that cover the details. Let the games begin. Signed-off-by: Lee Schermerhorn --- New in V4. Documentation/vm/numa | 184 +++++++++++++++++++++++++++++++++++++++----------- 1 files changed, 146 insertions(+), 38 deletions(-) Index: linux-2.6.34-rc3-mmotm-100405-1609/Documentation/vm/numa =================================================================== --- linux-2.6.34-rc3-mmotm-100405-1609.orig/Documentation/vm/numa 2010-04-07 09:49:13.000000000 -0400 +++ linux-2.6.34-rc3-mmotm-100405-1609/Documentation/vm/numa 2010-04-07 10:11:40.000000000 -0400 @@ -1,41 +1,149 @@ Started Nov 1999 by Kanoj Sarcar -The intent of this file is to have an uptodate, running commentary -from different people about NUMA specific code in the Linux vm. +What is NUMA? -What is NUMA? It is an architecture where the memory access times -for different regions of memory from a given processor varies -according to the "distance" of the memory region from the processor. -Each region of memory to which access times are the same from any -cpu, is called a node. On such architectures, it is beneficial if -the kernel tries to minimize inter node communications. Schemes -for this range from kernel text and read-only data replication -across nodes, and trying to house all the data structures that -key components of the kernel need on memory on that node. - -Currently, all the numa support is to provide efficient handling -of widely discontiguous physical memory, so architectures which -are not NUMA but can have huge holes in the physical address space -can use the same code. All this code is bracketed by CONFIG_DISCONTIGMEM. - -The initial port includes NUMAizing the bootmem allocator code by -encapsulating all the pieces of information into a bootmem_data_t -structure. Node specific calls have been added to the allocator. -In theory, any platform which uses the bootmem allocator should -be able to put the bootmem and mem_map data structures anywhere -it deems best. - -Each node's page allocation data structures have also been encapsulated -into a pg_data_t. The bootmem_data_t is just one part of this. To -make the code look uniform between NUMA and regular UMA platforms, -UMA platforms have a statically allocated pg_data_t too (contig_page_data). -For the sake of uniformity, the function num_online_nodes() is also defined -for all platforms. As we run benchmarks, we might decide to NUMAize -more variables like low_on_memory, nr_free_pages etc into the pg_data_t. - -The NUMA aware page allocation code currently tries to allocate pages -from different nodes in a round robin manner. This will be changed to -do concentratic circle search, starting from current node, once the -NUMA port achieves more maturity. The call alloc_pages_node has been -added, so that drivers can make the call and not worry about whether -it is running on a NUMA or UMA platform. +This question can be answered from a couple of perspectives: the +hardware view and the Linux software view. + +From the hardware perspective, a NUMA system is a computer platform that +comprises multiple components or assemblies each of which may contain 0 +or more cpus, local memory, and/or IO buses. For brevity and to +disambiguate the hardware view of these physical components/assemblies +from the software abstraction thereof, we'll call the components/assemblies +'cells' in this document. + +Each of the 'cells' may be viewed as an SMP [symmetric multi-processor] subset +of the system--although some components necessary for a stand-alone SMP system +may not be populated on any given cell. The cells of the NUMA system are +connected together with some sort of system interconnect--e.g., a crossbar or +point-to-point link are common types of NUMA system interconnects. Both of +these types of interconnects can be aggregated to create NUMA platforms with +cells at multiple distances from other cells. + +For Linux, the NUMA platforms of interest are primarily what is known as Cache +Coherent NUMA or CCNuma systems. With CCNUMA systems, all memory is visible +to and accessible from any cpu attached to any cell and cache coherency +is handled in hardware by the processor caches and/or the system interconnect. + +Memory access time and effective memory bandwidth varies depending on how far +away the cell containing the cpu or io bus making the memory access is from the +cell containing the target memory. For example, access to memory by cpus +attached to the same cell will experience faster access times and higher +bandwidths than accesses to memory on other, remote cells. NUMA platforms +can have cells at multiple remote distances from any given cell. + +Platform vendors don't build NUMA systems just to make software developers' +lives interesting. Rather, this architecture is a means to provide scalable +memory bandwidth. However, to achieve scalable memory bandwidth, system and +application software must arrange for a large majority of the memory references +[cache misses] to be to "local" memory--memory on the same cell, if any--or +to the closest cell with memory. + +This leads to the Linux software view of a NUMA system: + +Linux divides the system's hardware resources into multiple software +abstractions called "nodes". Linux maps the nodes onto the physical cells +of the hardware platform, abstracting away some of the details for some +architectures. As with physical cells, software nodes may contain 0 or more +cpus, memory and/or IO buses. And, again, memory access times to memory on +"closer" nodes [nodes that map to closer cells] will generally experience +faster access times and higher effective bandwidth than accesses to more +remote cells. + +For some architectures, such as x86, Linux will "hide" any node representing a +physical cell that has no memory attached, and reassign any cpus attached to +that cell to a node representing a cell that does have memory. Thus, on +these architectures, one cannot assume that all cpus that Linux associates with +a given node will see the same local memory access times and bandwidth. + +In addition, for some architectures, again x86 is an example, Linux supports +the emulation of additional nodes. For NUMA emulation, linux will carve up +the existing nodes--or the system memory for non-NUMA platforms--into multiple +nodes. Each emulated node will manage a fraction of the underlying cells' +physical memory. Numa emluation is useful for testing NUMA kernel and +application features on non-NUMA platforms, and as a sort of memory resource +management mechanism when used together with cpusets. +[See Documentation/cgroups/cpusets.txt] + +For each node with memory, Linux constructs an independent memory management +subsystem, complete with its own free page lists, in-use page lists, usage +statistics and locks to mediate access. In addition, Linux constructs for +each memory zone [one or more of DMA, DMA32, NORMAL, HIGH_MEMORY, MOVABLE], +an ordered "zonelist". A zonelist specifies the zones/nodes to visit when a +selected zone/node cannot satisfy the allocation request. This situation, +when a zone's has no available memory to satisfy a request, is called +'overflow" or "fallback". + +Because some nodes contain multiple zones containing different types of +memory, Linux must decide whether to order the zonelists such that allocations +fall back to the same zone type on a different node, or to a different zone +type on the same node. This is an important consideration because some zones, +such as DMA or DMA32, represent relatively scarce resources. Linux chooses +a default zonelist order based on the sizes of the various zone types relative +to the total memory of the node and the total memory of the system. The +default zonelist order may be overridden using the numa_zonelist_order kernel +boot parameter or sysctl. [See Documentation/kernel-parameters.txt and +Documentation/sysctl/vm.txt] + +By default, Linux will attempt to satisfy memory allocation requests from the +node to which the cpu that executes the request is assigned. Specifically, +Linux will attempt to allocate from the first node in the appropriate zonelist +for the node where the request originates. This is called "local allocation." +If the "local" node cannot satisfy the request, the kernel will examine other +nodes' zones in the selected zonelist looking for the first zone in the list +that can satisfy the request. + +Local allocation will tend to keep subsequent access to the allocated memory +"local" to the underlying physical resources and off the system interconnect-- +as long as the task on whose behalf the kernel allocated some memory does not +later migrate away from that memory. The Linux scheduler is aware of the +NUMA topology of the platform--embodied in the "scheduling domains" data +structures [See Documentation/scheduler/sched-domains.txt]--and the scheduler +attempts to minimize task migration to distant scheduling domains. However, +the scheduler does not take a task's NUMA footprint into account directly. +Thus, under sufficient imbalance, tasks can migrate between nodes, remote +from their initial node and kernel data structures. + +System administrators and application designers can restrict a tasks migration +to improve NUMA locality using various cpu affinity command line interfaces, +such as taskset(1) and numactl(1), and program interfaces such as +sched_setaffinity(2). Further, one can modify the kernel's default local +allocation behavior using Linux NUMA memory policy. +[See Documentation/vm/numa_memory_policy.] + +System administrators can restrict the cpus and nodes' memories that a non- +privileged user can specify in the scheduling or NUMA commands and functions +using control groups and cpusets. [See Documentation/cgroups/cpusets.txt] + +On architectures that do not hide memoryless nodes, Linux will include only +zones [nodes] with memory in the zonelists. This means that for a memoryless +node the "local memory node"--the node of the first zone in cpu's node's +zonelist--will not be the node itself. Rather, it will be the node that the +kernel selected as the nearest node with memory when it built the zonelists. +So, default, local allocations will succeed with the kernel supplying the +closest available memory. This is a consequence of the same mechanism that +allows such allocations to fallback to other nearby nodes when a node that +does contain memory overflows. + +Some kernel allocations do not want or cannot tolerate this allocation fallback +behavior. Rather they want to be sure they get memory from the specified node +or get notified that the node has no free memory. This is usually the case when +a subsystem allocates per cpu memory resources, for example. + +A typical model for making such an allocation is to obtain the node id of the +node to which the "current cpu" is attached using one of the kernel's +numa_node_id() or cpu_to_node() functions and then request memory from only +the node id returned. When such an allocation fails, the requesting subsystem +may revert to its own fallback path. The slab kernel memory allocator is an +example of this. Or, the subsystem may chose to disable or not to enable +itself on allocation failure. The kernel profiling subsystem is an example of +this. + +If the architecture supports [does not hide] memoryless nodes, then cpus +attached to memoryless nodes would always incur the fallback path overhead +or some subsystems would fail to initialize if they attempted to allocated +memory exclusively from the a node without memory. To support such +architectures transparently, kernel subsystems can use the numa_mem_id() +or cpu_to_mem() function to locate the "local memory node" for the calling or +specified cpu. Again, this is the same node from which default, local page +allocations will be attempted. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with SMTP id 804196B021C for ; Thu, 15 Apr 2010 20:54:44 -0400 (EDT) Received: from m1.gw.fujitsu.co.jp ([10.0.50.71]) by fgwmail5.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o3G0sfZf009293 for (envelope-from kamezawa.hiroyu@jp.fujitsu.com); Fri, 16 Apr 2010 09:54:41 +0900 Received: from smail (m1 [127.0.0.1]) by outgoing.m1.gw.fujitsu.co.jp (Postfix) with ESMTP id 49D6845DE52 for ; Fri, 16 Apr 2010 09:54:41 +0900 (JST) Received: from s1.gw.fujitsu.co.jp (s1.gw.fujitsu.co.jp [10.0.50.91]) by m1.gw.fujitsu.co.jp (Postfix) with ESMTP id 0F0A045DE50 for ; Fri, 16 Apr 2010 09:54:41 +0900 (JST) Received: from s1.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s1.gw.fujitsu.co.jp (Postfix) with ESMTP id CF62FE08009 for ; Fri, 16 Apr 2010 09:54:40 +0900 (JST) Received: from ml13.s.css.fujitsu.com (ml13.s.css.fujitsu.com [10.249.87.103]) by s1.gw.fujitsu.co.jp (Postfix) with ESMTP id 7E5EDE08008 for ; Fri, 16 Apr 2010 09:54:40 +0900 (JST) Date: Fri, 16 Apr 2010 09:50:45 +0900 From: KAMEZAWA Hiroyuki Subject: Re: [PATCH 8/8] numa: update Documentation/vm/numa, add memoryless node info Message-Id: <20100416095045.46ab6552.kamezawa.hiroyu@jp.fujitsu.com> In-Reply-To: <20100415173042.8801.17049.sendpatchset@localhost.localdomain> References: <20100415172950.8801.60358.sendpatchset@localhost.localdomain> <20100415173042.8801.17049.sendpatchset@localhost.localdomain> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Lee Schermerhorn Cc: linux-mm@kvack.org, linux-numa@vger.kernel.org, Tejun Heo , Mel Gorman , andi@firstfloor.org, Christoph Lameter , Nick Piggin , David Rientjes , eric.whitney@hp.com, Andrew Morton List-ID: On Thu, 15 Apr 2010 13:30:42 -0400 Lee Schermerhorn wrote: > Against: 2.6.34-rc3-mmotm-100405-1609 > > Kamezawa Hiroyuki requested documentation for the numa_mem_id() > and slab related changes. He suggested Documentation/vm/numa for > this documentation. Looking at this file, it seems to me to be > hopelessly out of date relative to current Linux NUMA support. > At the risk of going down a rathole, I have made an attempt to > rewrite the doc at a slightly higher level [I think] and provide > pointers to other in-tree documents and out-of-tree man pages that > cover the details. > > Let the games begin. > > Signed-off-by: Lee Schermerhorn > Thank you, seems very nice and covers almost all range we have to explain to new comers. My eye can't check details enough but...;) Reviewed-by: KAMEZAWA Hiroyuki I think this patch itself is very good. Being more greedy... Hmm, from user's view, I feel quick guide of /sys/devices/system/node/ and /sys/devices/system/node/node0/numastat can be added somewhere. (Documentation/numastat.txt is not under /vm :( ) And one more important? thing. [kamezawa@firextal Documentation]$ cat /sys/bus/pci/devices/0000\:00\:01.0/numa_node -1 PCI device (and other??) has numa_node_id in it, if it has locality information. I hear some guy had to be aware locality of NIC to do high-throuput network transaction. Then, "how to get device's locality via sysfs" is worth to be written. And mentioning what "nid = -1" means may help new comer. Thanks, -Kame -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with SMTP id E1CE86B01EF for ; Sun, 18 Apr 2010 22:36:49 -0400 (EDT) Received: from m1.gw.fujitsu.co.jp ([10.0.50.71]) by fgwmail7.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o3J2ajAl021970 for (envelope-from kamezawa.hiroyu@jp.fujitsu.com); Mon, 19 Apr 2010 11:36:45 +0900 Received: from smail (m1 [127.0.0.1]) by outgoing.m1.gw.fujitsu.co.jp (Postfix) with ESMTP id 9131A45DE4D for ; Mon, 19 Apr 2010 11:36:45 +0900 (JST) Received: from s1.gw.fujitsu.co.jp (s1.gw.fujitsu.co.jp [10.0.50.91]) by m1.gw.fujitsu.co.jp (Postfix) with ESMTP id 5B59045DE50 for ; Mon, 19 Apr 2010 11:36:45 +0900 (JST) Received: from s1.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s1.gw.fujitsu.co.jp (Postfix) with ESMTP id 3ABB61DB804D for ; Mon, 19 Apr 2010 11:36:45 +0900 (JST) Received: from m108.s.css.fujitsu.com (m108.s.css.fujitsu.com [10.249.87.108]) by s1.gw.fujitsu.co.jp (Postfix) with ESMTP id CD5521DB8040 for ; Mon, 19 Apr 2010 11:36:44 +0900 (JST) Date: Mon, 19 Apr 2010 11:32:47 +0900 From: KAMEZAWA Hiroyuki Subject: Re: [PATCH 1/8] numa: add generic percpu var numa_node_id() implementation Message-Id: <20100419113247.27fd0ea0.kamezawa.hiroyu@jp.fujitsu.com> In-Reply-To: <20100415172956.8801.18133.sendpatchset@localhost.localdomain> References: <20100415172950.8801.60358.sendpatchset@localhost.localdomain> <20100415172956.8801.18133.sendpatchset@localhost.localdomain> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Lee Schermerhorn Cc: linux-mm@kvack.org, linux-numa@vger.kernel.org, Tejun Heo , Mel Gorman , andi@firstfloor.org, Christoph Lameter , Nick Piggin , David Rientjes , eric.whitney@hp.com, Andrew Morton List-ID: On Thu, 15 Apr 2010 13:29:56 -0400 Lee Schermerhorn wrote: > Against: 2.6.34-rc3-mmotm-100405-1609 > > Rework the generic version of the numa_node_id() function to use the > new generic percpu variable infrastructure. > > Guard the new implementation with a new config option: > > CONFIG_USE_PERCPU_NUMA_NODE_ID. > > Archs which support this new implemention will default this option > to 'y' when NUMA is configured. This config option could be removed > if/when all archs switch over to the generic percpu implementation > of numa_node_id(). Arch support involves: > > 1) converting any existing per cpu variable implementations to use > this implementation. x86_64 is an instance of such an arch. > 2) archs that don't use a per cpu variable for numa_node_id() will > need to initialize the new per cpu variable "numa_node" as cpus > are brought on-line. ia64 is an example. > 3) Defining USE_PERCPU_NUMA_NODE_ID in arch dependent Kconfig--e.g., > when NUMA is configured. This is required because I have > retained the old implementation by default to allow archs to > be modified incrementally, as desired. > > Subsequent patches will convert x86_64 and ia64 to use this > implemenation. > > Signed-off-by: Lee Schermerhorn Reviewed-by: KAMEZAWA Hiroyuki > > --- > > V0: > # From cl@linux-foundation.org Wed Nov 4 10:36:12 2009 > # Date: Wed, 4 Nov 2009 12:35:14 -0500 (EST) > # From: Christoph Lameter > # To: Lee Schermerhorn > # Subject: Re: [PATCH/RFC] slab: handle memoryless nodes efficiently > # > # I have a very early form of a draft of a patch here that genericizes > # numa_node_id(). Uses the new generic this_cpu_xxx stuff. > # > # Not complete. > > V1: > + split out x86 specific changes to subsequent patch > + split out "numa_mem_id()" and related changes to separate patch > + moved generic definitions of __this_cpu_xxx from linux/percpu.h > to asm-generic/percpu.h where asm/percpu.h and other asm hdrs > can use them. > + export new percpu symbol 'numa_node' in mm/percpu.h > + include in for use by new > numa_node_id(). > > V2: > + add back the #ifndef/#endif guard around numa_node_id() so that archs > can override generic definition > + add generic stub for set_numa_node() > + use generic percpu numa_node_id() only if enabled by > CONFIG_USE_PERCPU_NUMA_NODE_ID > to allow incremental per arch support. This option could be removed when/if > all archs that support NUMA support this option. > > V3: > + separated the rework of linux/percpu.h into another [preceding] patch. > + moved definition of the numa_node percpu variable from mm/percpu.c to > mm/page-alloc.c > + moved premature definition of cpu_to_mem() to later patch. > > V4: > + topology.h: include rather than > Requires Tejun Heo's percpu.h/slab.h cleanup series > > include/linux/topology.h | 33 ++++++++++++++++++++++++++++----- > mm/page_alloc.c | 5 +++++ > 2 files changed, 33 insertions(+), 5 deletions(-) > > Index: linux-2.6.34-rc3-mmotm-100405-1609/mm/page_alloc.c > =================================================================== > --- linux-2.6.34-rc3-mmotm-100405-1609.orig/mm/page_alloc.c 2010-04-07 10:04:04.000000000 -0400 > +++ linux-2.6.34-rc3-mmotm-100405-1609/mm/page_alloc.c 2010-04-07 10:10:23.000000000 -0400 > @@ -56,6 +56,11 @@ > #include > #include "internal.h" > > +#ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID > +DEFINE_PER_CPU(int, numa_node); > +EXPORT_PER_CPU_SYMBOL(numa_node); > +#endif > + > /* > * Array of node states. > */ > Index: linux-2.6.34-rc3-mmotm-100405-1609/include/linux/topology.h > =================================================================== > --- linux-2.6.34-rc3-mmotm-100405-1609.orig/include/linux/topology.h 2010-04-07 09:49:13.000000000 -0400 > +++ linux-2.6.34-rc3-mmotm-100405-1609/include/linux/topology.h 2010-04-07 10:10:23.000000000 -0400 > @@ -31,6 +31,7 @@ > #include > #include > #include > +#include > #include > > #ifndef node_has_online_mem > @@ -203,8 +204,35 @@ int arch_update_cpu_topology(void); > #ifndef SD_NODE_INIT > #error Please define an appropriate SD_NODE_INIT in include/asm/topology.h!!! > #endif > + > #endif /* CONFIG_NUMA */ > > +#ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID > +DECLARE_PER_CPU(int, numa_node); > + > +#ifndef numa_node_id > +/* Returns the number of the current Node. */ > +#define numa_node_id() __this_cpu_read(numa_node) > +#endif > + > +#ifndef cpu_to_node > +#define cpu_to_node(__cpu) per_cpu(numa_node, (__cpu)) > +#endif > + > +#ifndef set_numa_node > +#define set_numa_node(__node) percpu_write(numa_node, __node) > +#endif > + > +#else /* !CONFIG_USE_PERCPU_NUMA_NODE_ID */ > + > +/* Returns the number of the current Node. */ > +#ifndef numa_node_id > +#define numa_node_id() (cpu_to_node(raw_smp_processor_id())) > + > +#endif > + > +#endif /* [!]CONFIG_USE_PERCPU_NUMA_NODE_ID */ > + > #ifndef topology_physical_package_id > #define topology_physical_package_id(cpu) ((void)(cpu), -1) > #endif > @@ -218,9 +246,4 @@ int arch_update_cpu_topology(void); > #define topology_core_cpumask(cpu) cpumask_of(cpu) > #endif > > -/* Returns the number of the current Node. */ > -#ifndef numa_node_id > -#define numa_node_id() (cpu_to_node(raw_smp_processor_id())) > -#endif > - > #endif /* _LINUX_TOPOLOGY_H */ > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with SMTP id DD6C86B01EF for ; Sun, 18 Apr 2010 22:55:35 -0400 (EDT) Received: from m5.gw.fujitsu.co.jp ([10.0.50.75]) by fgwmail5.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o3J2tXKl024626 for (envelope-from kamezawa.hiroyu@jp.fujitsu.com); Mon, 19 Apr 2010 11:55:33 +0900 Received: from smail (m5 [127.0.0.1]) by outgoing.m5.gw.fujitsu.co.jp (Postfix) with ESMTP id 00F8D45DE4E for ; Mon, 19 Apr 2010 11:55:33 +0900 (JST) Received: from s5.gw.fujitsu.co.jp (s5.gw.fujitsu.co.jp [10.0.50.95]) by m5.gw.fujitsu.co.jp (Postfix) with ESMTP id D532945DE51 for ; Mon, 19 Apr 2010 11:55:32 +0900 (JST) Received: from s5.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s5.gw.fujitsu.co.jp (Postfix) with ESMTP id B2D8A1DB805B for ; Mon, 19 Apr 2010 11:55:32 +0900 (JST) Received: from ml14.s.css.fujitsu.com (ml14.s.css.fujitsu.com [10.249.87.104]) by s5.gw.fujitsu.co.jp (Postfix) with ESMTP id 23B191DB8040 for ; Mon, 19 Apr 2010 11:55:29 +0900 (JST) Date: Mon, 19 Apr 2010 11:51:34 +0900 From: KAMEZAWA Hiroyuki Subject: Re: [PATCH 3/8] numa: ia64: use generic percpu var numa_node_id() implementation Message-Id: <20100419115134.bd756fdb.kamezawa.hiroyu@jp.fujitsu.com> In-Reply-To: <20100415173009.8801.67345.sendpatchset@localhost.localdomain> References: <20100415172950.8801.60358.sendpatchset@localhost.localdomain> <20100415173009.8801.67345.sendpatchset@localhost.localdomain> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Lee Schermerhorn Cc: linux-mm@kvack.org, linux-numa@vger.kernel.org, Tejun Heo , Mel Gorman , andi@firstfloor.org, Christoph Lameter , Nick Piggin , David Rientjes , eric.whitney@hp.com, Andrew Morton List-ID: On Thu, 15 Apr 2010 13:30:09 -0400 Lee Schermerhorn wrote: > Against: 2.6.34-rc3-mmotm-100405-1609 > > ia64: Use generic percpu implementation of numa_node_id() > + intialize per cpu 'numa_node' > + remove ia64 cpu_to_node() macro; use generic > + define CONFIG_USE_PERCPU_NUMA_NODE_ID when NUMA configured > > Signed-off-by: Lee Schermerhorn > Reviewed-by: Christoph Lameter > Reviewd-by: KAMEZAWA Hiroyuki BTW, Could add some explanation about "when numa_node_id() turns to be available" ? IIUC, - BOOT cpu ... after smp_prepare_boot_cpu() - Other cpu .. after smp_init() (i.e. always.) Right ? I'm sorry if it's well-known. Thanks, -Kame > --- > > New in V2 > > V3, V4: no change > > arch/ia64/Kconfig | 4 ++++ > arch/ia64/include/asm/topology.h | 5 ----- > arch/ia64/kernel/smpboot.c | 6 ++++++ > 3 files changed, 10 insertions(+), 5 deletions(-) > > Index: linux-2.6.34-rc3-mmotm-100405-1609/arch/ia64/kernel/smpboot.c > =================================================================== > --- linux-2.6.34-rc3-mmotm-100405-1609.orig/arch/ia64/kernel/smpboot.c 2010-04-07 10:03:38.000000000 -0400 > +++ linux-2.6.34-rc3-mmotm-100405-1609/arch/ia64/kernel/smpboot.c 2010-04-07 10:10:27.000000000 -0400 > @@ -390,6 +390,11 @@ smp_callin (void) > > fix_b0_for_bsp(); > > + /* > + * numa_node_id() works after this. > + */ > + set_numa_node(cpu_to_node_map[cpuid]); > + > ipi_call_lock_irq(); > spin_lock(&vector_lock); > /* Setup the per cpu irq handling data structures */ > @@ -632,6 +637,7 @@ void __devinit smp_prepare_boot_cpu(void > { > cpu_set(smp_processor_id(), cpu_online_map); > cpu_set(smp_processor_id(), cpu_callin_map); > + set_numa_node(cpu_to_node_map[smp_processor_id()]); > per_cpu(cpu_state, smp_processor_id()) = CPU_ONLINE; > paravirt_post_smp_prepare_boot_cpu(); > } > Index: linux-2.6.34-rc3-mmotm-100405-1609/arch/ia64/include/asm/topology.h > =================================================================== > --- linux-2.6.34-rc3-mmotm-100405-1609.orig/arch/ia64/include/asm/topology.h 2010-04-07 09:49:13.000000000 -0400 > +++ linux-2.6.34-rc3-mmotm-100405-1609/arch/ia64/include/asm/topology.h 2010-04-07 10:10:27.000000000 -0400 > @@ -26,11 +26,6 @@ > #define RECLAIM_DISTANCE 15 > > /* > - * Returns the number of the node containing CPU 'cpu' > - */ > -#define cpu_to_node(cpu) (int)(cpu_to_node_map[cpu]) > - > -/* > * Returns a bitmask of CPUs on Node 'node'. > */ > #define cpumask_of_node(node) ((node) == -1 ? \ > Index: linux-2.6.34-rc3-mmotm-100405-1609/arch/ia64/Kconfig > =================================================================== > --- linux-2.6.34-rc3-mmotm-100405-1609.orig/arch/ia64/Kconfig 2010-04-07 10:04:03.000000000 -0400 > +++ linux-2.6.34-rc3-mmotm-100405-1609/arch/ia64/Kconfig 2010-04-07 10:10:27.000000000 -0400 > @@ -497,6 +497,10 @@ config HAVE_ARCH_NODEDATA_EXTENSION > def_bool y > depends on NUMA > > +config USE_PERCPU_NUMA_NODE_ID > + def_bool y > + depends on NUMA > + > config ARCH_PROC_KCORE_TEXT > def_bool y > depends on PROC_KCORE > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with ESMTP id A3C666B01F1 for ; Mon, 19 Apr 2010 09:22:48 -0400 (EDT) Subject: Re: [PATCH 1/8] numa: add generic percpu var numa_node_id() implementation From: Lee Schermerhorn In-Reply-To: <20100416133324.fcb1c168.akpm@linux-foundation.org> References: <20100415172950.8801.60358.sendpatchset@localhost.localdomain> <20100415172956.8801.18133.sendpatchset@localhost.localdomain> <20100416133324.fcb1c168.akpm@linux-foundation.org> Content-Type: text/plain Date: Mon, 19 Apr 2010 09:22:31 -0400 Message-Id: <1271683352.10937.34.camel@useless.americas.hpqcorp.net> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: linux-mm@kvack.org, linux-numa@vger.kernel.org, Tejun Heo , Mel Gorman , andi@firstfloor.org, Christoph Lameter , Nick Piggin , David Rientjes , eric.whitney@hp.com, KAMEZAWA Hiroyuki , linux-arch@vger.kernel.org List-ID: On Fri, 2010-04-16 at 13:33 -0700, Andrew Morton wrote: > On Thu, 15 Apr 2010 13:29:56 -0400 > Lee Schermerhorn wrote: > > > Rework the generic version of the numa_node_id() function to use the > > new generic percpu variable infrastructure. > > > > Guard the new implementation with a new config option: > > > > CONFIG_USE_PERCPU_NUMA_NODE_ID. > > > > Archs which support this new implemention will default this option > > to 'y' when NUMA is configured. This config option could be removed > > if/when all archs switch over to the generic percpu implementation > > of numa_node_id(). Arch support involves: > > > > 1) converting any existing per cpu variable implementations to use > > this implementation. x86_64 is an instance of such an arch. > > 2) archs that don't use a per cpu variable for numa_node_id() will > > need to initialize the new per cpu variable "numa_node" as cpus > > are brought on-line. ia64 is an example. > > 3) Defining USE_PERCPU_NUMA_NODE_ID in arch dependent Kconfig--e.g., > > when NUMA is configured. This is required because I have > > retained the old implementation by default to allow archs to > > be modified incrementally, as desired. > > > > Subsequent patches will convert x86_64 and ia64 to use this > > implemenation. > > So which arches _aren't_ converted? powerpc, sparc and alpha? Right. Plus ARM, mips, ... I could take a cut at other archs, but can't test them. I'm hoping that this patch doesn't break the existing implementation for them. It should be a no-op until the new support is enabled via Kconfig. The fact that both x86_64 and ia64 build with just this patch gives me some hope but not a lot of confidence. I see that you've merged the series with into -mm. We'll see what happens. Of course, no reports of errors could just mean no testing. > > Is there sufficient info here for the maintainers to be able to > perform the conversion with minimal head-scratching? Arch maintainers will need to chime in on that. I'd hoped that the list above and the examples of x86_64 and ia64 in the subsequent patches would suffice. Lee -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with ESMTP id 1BDF36B01EF for ; Mon, 19 Apr 2010 09:29:28 -0400 (EDT) Subject: Re: [PATCH 0/8] Numa: Use Generic Per-cpu Variables for numa_*_id() From: Lee Schermerhorn In-Reply-To: <4BCA7A26.9040208@kernel.org> References: <20100415172950.8801.60358.sendpatchset@localhost.localdomain> <4BCA7A26.9040208@kernel.org> Content-Type: text/plain Date: Mon, 19 Apr 2010 09:29:20 -0400 Message-Id: <1271683760.10937.35.camel@useless.americas.hpqcorp.net> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Tejun Heo Cc: linux-mm@kvack.org, linux-numa@vger.kernel.org, Mel Gorman , andi@firstfloor.org, Christoph Lameter , Nick Piggin , David Rientjes , eric.whitney@hp.com, Andrew Morton , KAMEZAWA Hiroyuki List-ID: On Sun, 2010-04-18 at 12:19 +0900, Tejun Heo wrote: > On 04/16/2010 02:29 AM, Lee Schermerhorn wrote: > > Use Generic Per cpu infrastructure for numa_*_id() V4 > > > > Series Against: 2.6.34-rc3-mmotm-100405-1609 > > Other than the minor nitpicks, the patchset looks great to me. > Through which tree should this be routed? If no one else is gonna > take it, I can route it through percpu after patchset refresh. Andrew has merged this set into the -mm tree. I think that's fine and will proceed to address all of the comments there as incremental patches. I have comments/requests from yourself: 2/8: seconding Christoph's suggestion re: generic function to add generic function to set per cpu node id; plus suggestion to use numa_node_id() in common.c::cpu_init(). 4/8: lose the "#define numa_mem numa_node". I'll need to rework this. Currently, one can access the per cpu variable 'numa_node' directly as such. I added 'numa_mem' [actually got it from Christoph's starter patch] as an analog to numa_node. I/Christoph wanted to eliminate the redundant variable when it wasn't needed, but not break code that directly accesses it. Maybe better to not provide it at all? 5/8: wording error in patch description. Randy D and Kamezawa-san: comments on documentation patch Kame-san: request for clarification in 3/8 Thanks, Lee -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with ESMTP id EB0DF6B01EF for ; Wed, 12 May 2010 15:12:57 -0400 (EDT) Subject: Re: [PATCH 6/8] numa: slab: use numa_mem_id() for slab local memory node From: Lee Schermerhorn In-Reply-To: <20100512114900.a12c4b35.akpm@linux-foundation.org> References: <20100415172950.8801.60358.sendpatchset@localhost.localdomain> <20100415173030.8801.84836.sendpatchset@localhost.localdomain> <20100512114900.a12c4b35.akpm@linux-foundation.org> Content-Type: text/plain Date: Wed, 12 May 2010 15:11:43 -0400 Message-Id: <1273691503.6985.142.camel@useless.americas.hpqcorp.net> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: linux-mm@kvack.org, linux-numa@vger.kernel.org, Tejun Heo , Mel Gorman , Andi Kleen , Christoph Lameter , Nick Piggin , David Rientjes , eric.whitney@hp.com, KAMEZAWA Hiroyuki , Valdis.Kletnieks@vt.edu List-ID: On Wed, 2010-05-12 at 11:49 -0700, Andrew Morton wrote: > I have a note here that this patch "breaks slab.c". But I don't recall what > the problem was and I don't see a fix against this patch in your recently-sent > fixup series? Is that Valdis Kletnieks' issue? That was an i386 build. Happened because the earlier patches didn't properly default numa_mem_id() to numa_node_id() for the i386 build. The rework to those patches has fixed that. I have successfully built mmotm with the rework patches for i386+!NUMA. Valdis tested the series and confirmed that it fixed the problem. Lee -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with ESMTP id 88C6C6B01E3 for ; Wed, 12 May 2010 16:03:47 -0400 (EDT) Subject: Re: [PATCH 6/8] numa: slab: use numa_mem_id() for slab local memory node From: Lee Schermerhorn In-Reply-To: <4170.1273692351@localhost> References: <20100415172950.8801.60358.sendpatchset@localhost.localdomain> <20100415173030.8801.84836.sendpatchset@localhost.localdomain> <20100512114900.a12c4b35.akpm@linux-foundation.org> <1273691503.6985.142.camel@useless.americas.hpqcorp.net> <4170.1273692351@localhost> Content-Type: text/plain Date: Wed, 12 May 2010 16:03:35 -0400 Message-Id: <1273694615.6985.153.camel@useless.americas.hpqcorp.net> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Valdis.Kletnieks@vt.edu Cc: Andrew Morton , linux-mm@kvack.org, linux-numa@vger.kernel.org, Tejun Heo , Mel Gorman , Andi Kleen , Christoph Lameter , Nick Piggin , David Rientjes , eric.whitney@hp.com, KAMEZAWA Hiroyuki List-ID: On Wed, 2010-05-12 at 15:25 -0400, Valdis.Kletnieks@vt.edu wrote: > On Wed, 12 May 2010 15:11:43 EDT, Lee Schermerhorn said: > > On Wed, 2010-05-12 at 11:49 -0700, Andrew Morton wrote: > > > I have a note here that this patch "breaks slab.c". But I don't recall what > > > the problem was and I don't see a fix against this patch in your recently-sent > > > fixup series? > > > > Is that Valdis Kletnieks' issue? That was an i386 build. Happened > > because the earlier patches didn't properly default numa_mem_id() to > > numa_node_id() for the i386 build. The rework to those patches has > > fixed that. I have successfully built mmotm with the rework patches > > for i386+!NUMA. Valdis tested the series and confirmed that it fixed > > the problem. > > I thought the problem was common to both i386 and X86_64 non-NUMA (which is > where I hit the problem). In any case, builds OK for me now. The x86_64 !NUMA issue was another one I introduced in the rework -- patch 1/7 first version you tested. Fixed in the current version. Happened because x86_64 defines it's own fallback for numa_node_id(). See the description of patch 1/7. Turns out x86_64 builds fine with NUMA or !NUMA if I just remove the !NUMA numa_node_id() definition. I'll submit that patch shortly. Lee -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org