* [PATCH 1/3] powerpc: Set a smaller value for RECLAIM_DISTANCE to enable zone reclaim
From: Anton Blanchard @ 2010-04-30 4:33 UTC
To: benh; +Cc: linuxppc-dev

I noticed /proc/sys/vm/zone_reclaim_mode was 0 on a ppc64 NUMA box. It gets
enabled via this:

	/*
	 * If another node is sufficiently far away then it is better
	 * to reclaim pages in a zone before going off node.
	 */
	if (distance > RECLAIM_DISTANCE)
		zone_reclaim_mode = 1;

Since we use the default value of 20 for REMOTE_DISTANCE and 20 for
RECLAIM_DISTANCE it never kicks in.

The local to remote bandwidth ratios can be quite large on System p
machines so it makes sense for us to reclaim clean pagecache locally before
going off node.

The patch below sets a smaller value for RECLAIM_DISTANCE and thus enables
zone reclaim.

Signed-off-by: Anton Blanchard <anton@samba.org>
---

Index: powerpc.git/arch/powerpc/include/asm/topology.h
===================================================================
--- powerpc.git.orig/arch/powerpc/include/asm/topology.h	2010-02-18 14:26:45.736821967 +1100
+++ powerpc.git/arch/powerpc/include/asm/topology.h	2010-02-18 14:51:24.793071748 +1100
@@ -8,6 +8,16 @@ struct device_node;
 
 #ifdef CONFIG_NUMA
 
+/*
+ * Before going off node we want the VM to try and reclaim from the local
+ * node. It does this if the remote distance is larger than RECLAIM_DISTANCE.
+ * With the default REMOTE_DISTANCE of 20 and the default RECLAIM_DISTANCE of
+ * 20, we never reclaim and go off node straight away.
+ *
+ * To fix this we choose a smaller value of RECLAIM_DISTANCE.
+ */
+#define RECLAIM_DISTANCE 10
+
 #include <asm/mmzone.h>
 
 static inline int cpu_to_node(int cpu)
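
The threshold comparison described in this message can be looked at in isolation. The following is a minimal sketch, not kernel code: `zone_reclaim_enabled`, `DEFAULT_RECLAIM_DISTANCE` and `POWERPC_RECLAIM_DISTANCE` are hypothetical names introduced for illustration, and the numeric values are the ones quoted in the message (20 for the defaults, 10 for the new powerpc setting).

```c
#include <stdbool.h>

/* Values quoted in the message above, treated here as plain constants. */
enum {
	REMOTE_DISTANCE = 20,		/* default distance to a remote node */
	DEFAULT_RECLAIM_DISTANCE = 20,	/* hypothetical name: generic default */
	POWERPC_RECLAIM_DISTANCE = 10,	/* hypothetical name: value set by the patch */
};

/*
 * Hypothetical helper (not a kernel function): mirrors the quoted test and
 * reports whether zone reclaim would be switched on for a node pair that is
 * "distance" apart, given the configured reclaim threshold.
 */
static bool zone_reclaim_enabled(int distance, int reclaim_distance)
{
	/* Strictly greater: a distance equal to the threshold does not count. */
	return distance > reclaim_distance;
}
```

With both values at 20 the strict comparison fails, so `zone_reclaim_enabled(REMOTE_DISTANCE, DEFAULT_RECLAIM_DISTANCE)` is false; with the threshold lowered to 10, `zone_reclaim_enabled(REMOTE_DISTANCE, POWERPC_RECLAIM_DISTANCE)` is true.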
* [PATCH 2/3] powerpc: Add form 1 NUMA affinity
From: Anton Blanchard @ 2010-04-30 4:34 UTC
To: benh; +Cc: linuxppc-dev

Firmware changed the way it represents memory and cpu affinity on POWER7.
Unfortunately the old method now caps the topology to work around issues
with legacy operating systems. For Linux to get the correct topology we
need to use the new form 1 affinity information.

We set the form 1 field in the client architecture, and if we see "1" in the
ibm,associativity-form property firmware supports form 1 affinity and
we should look at the first field in the ibm,associativity-reference-points
array. If not we use the second field as we always have.

Signed-off-by: Anton Blanchard <anton@samba.org>
---

Index: linux-2.6.34-rc3/arch/powerpc/kernel/prom_init.c
===================================================================
--- linux-2.6.34-rc3.orig/arch/powerpc/kernel/prom_init.c	2010-04-02 07:52:10.000000000 -0500
+++ linux-2.6.34-rc3/arch/powerpc/kernel/prom_init.c	2010-04-07 07:06:45.000000000 -0500
@@ -653,6 +653,7 @@
 #else
 #define OV5_CMO			0x00
 #endif
+#define OV5_TYPE1_AFFINITY	0x80	/* Type 1 NUMA affinity */
 
 /* Option Vector 6: IBM PAPR hints */
 #define OV6_LINUX		0x02	/* Linux is our OS */
@@ -706,7 +707,7 @@
 	OV5_DONATE_DEDICATE_CPU | OV5_MSI,
 	0,
 	OV5_CMO,
-	0,
+	OV5_TYPE1_AFFINITY,
 	0,
 	0,
 	0,
Index: linux-2.6.34-rc3/arch/powerpc/mm/numa.c
===================================================================
--- linux-2.6.34-rc3.orig/arch/powerpc/mm/numa.c	2010-04-07 07:06:32.000000000 -0500
+++ linux-2.6.34-rc3/arch/powerpc/mm/numa.c	2010-04-07 09:43:48.000000000 -0500
@@ -242,10 +243,11 @@
  */
 static int __init find_min_common_depth(void)
 {
-	int depth;
+	int depth, index;
 	const unsigned int *ref_points;
 	struct device_node *rtas_root;
 	unsigned int len;
+	struct device_node *options;
 
 	rtas_root = of_find_node_by_path("/rtas");
 
@@ -258,11 +260,23 @@
 	 * configuration (should be all 0's) and the second is for a normal
 	 * NUMA configuration.
 	 */
+	index = 1;
 	ref_points = of_get_property(rtas_root,
 			"ibm,associativity-reference-points", &len);
 
+	/*
+	 * For type 1 affinity information we want the first field
+	 */
+	options = of_find_node_by_path("/options");
+	if (options) {
+		const char *str;
+		str = of_get_property(options, "ibm,associativity-form", NULL);
+		if (str && !strcmp(str, "1"))
+			index = 0;
+	}
+
 	if ((len >= 2 * sizeof(unsigned int)) && ref_points) {
-		depth = ref_points[1];
+		depth = ref_points[index];
 	} else {
 		dbg("NUMA: ibm,associativity-reference-points not found.\n");
 		depth = -1;
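
The field selection this patch describes can be read apart from the device-tree plumbing. The sketch below is a stand-alone illustration under stated assumptions: `pick_min_common_depth` is a hypothetical helper, the reference points are passed in as an ordinary array with an entry count (the patch works with a byte length), and the form string is NULL when the ibm,associativity-form property is absent.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-alone version of the selection logic (not the kernel
 * function): returns the associativity depth to use, or -1 when the
 * reference-points data is missing or too short.
 */
static int pick_min_common_depth(const unsigned int *ref_points,
				 size_t count, const char *form)
{
	size_t index = 1;	/* form 0: use the second field, as before */

	if (form && strcmp(form, "1") == 0)
		index = 0;	/* form 1: use the first field */

	/* As in the patch, at least two entries are required either way. */
	if (!ref_points || count < 2)
		return -1;

	return (int)ref_points[index];
}
```

For a reference-points array of {4, 2}, this returns 4 when the form string is "1" and 2 when the property is absent or holds any other value.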
* [PATCH 3/3] powerpc: Use form 1 affinity to setup node distance
From: Anton Blanchard @ 2010-04-30 4:43 UTC
To: benh; +Cc: linuxppc-dev

Form 1 affinity allows multiple entries in ibm,associativity-reference-points
which represent affinity domains in decreasing order of importance. The
Linux concept of a node is always the first entry, but using the other
values as an input to node_distance() allows the memory allocator to make
better decisions on which node to go first when local memory has been
exhausted.

We keep things simple and create an array indexed by NUMA node, capped at
4 entries. Each time we lookup an associativity property we initialise
the array which is overkill, but since we should only hit this path during
boot it didn't seem worth adding a per node valid bit.

Signed-off-by: Anton Blanchard <anton@samba.org>
---

Index: linux-2.6/arch/powerpc/include/asm/topology.h
===================================================================
--- linux-2.6.orig/arch/powerpc/include/asm/topology.h	2010-04-29 15:58:58.000000000 +1000
+++ linux-2.6/arch/powerpc/include/asm/topology.h	2010-04-29 15:59:00.000000000 +1000
@@ -77,6 +77,9 @@ static inline int pcibus_to_node(struct
 	.balance_interval	= 1,					\
 }
 
+extern int __node_distance(int, int);
+#define node_distance(a, b) __node_distance(a, b)
+
 extern void __init dump_numa_cpu_topology(void);
 
 extern int sysfs_add_device_to_node(struct sys_device *dev, int nid);
Index: linux-2.6/arch/powerpc/mm/numa.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/numa.c	2010-04-29 15:58:59.000000000 +1000
+++ linux-2.6/arch/powerpc/mm/numa.c	2010-04-29 22:05:24.000000000 +1000
@@ -42,6 +42,12 @@ EXPORT_SYMBOL(node_data);
 
 static int min_common_depth;
 static int n_mem_addr_cells, n_mem_size_cells;
+static int form1_affinity;
+
+#define MAX_DISTANCE_REF_POINTS 4
+static int distance_ref_points_depth;
+static const unsigned int *distance_ref_points;
+static int distance_lookup_table[MAX_NUMNODES][MAX_DISTANCE_REF_POINTS];
 
 static int __cpuinit fake_numa_create_new_node(unsigned long end_pfn,
 						unsigned int *nid)
@@ -179,6 +185,39 @@ static const u32 *of_get_usable_memory(s
 	return prop;
 }
 
+int __node_distance(int a, int b)
+{
+	int i;
+	int distance = LOCAL_DISTANCE;
+
+	if (!form1_affinity)
+		return distance;
+
+	for (i = 0; i < distance_ref_points_depth; i++) {
+		if (distance_lookup_table[a][i] == distance_lookup_table[b][i])
+			break;
+
+		/* Double the distance for each NUMA level */
+		distance *= 2;
+	}
+
+	return distance;
+}
+
+static void initialize_distance_lookup_table(int nid,
+		const unsigned int *associativity)
+{
+	int i;
+
+	if (!form1_affinity)
+		return;
+
+	for (i = 0; i < distance_ref_points_depth; i++) {
+		distance_lookup_table[nid][i] =
+			associativity[distance_ref_points[i]];
+	}
+}
+
 /* Returns nid in the range [0..MAX_NUMNODES-1], or -1 if no useful numa
  * info is found.
  */
@@ -200,6 +239,10 @@ static int of_node_to_nid_single(struct
 	/* POWER4 LPAR uses 0xffff as invalid node */
 	if (nid == 0xffff || nid >= MAX_NUMNODES)
 		nid = -1;
+
+	if (nid > 0 && tmp[0] >= distance_ref_points_depth)
+		initialize_distance_lookup_table(nid, tmp);
+
 out:
 	return nid;
 }
@@ -226,26 +269,10 @@ int of_node_to_nid(struct device_node *d
 }
 EXPORT_SYMBOL_GPL(of_node_to_nid);
 
-/*
- * In theory, the "ibm,associativity" property may contain multiple
- * associativity lists because a resource may be multiply connected
- * into the machine. This resource then has different associativity
- * characteristics relative to its multiple connections. We ignore
- * this for now. We also assume that all cpu and memory sets have
- * their distances represented at a common level. This won't be
- * true for hierarchical NUMA.
- *
- * In any case the ibm,associativity-reference-points should give
- * the correct depth for a normal NUMA system.
- *
- * - Dave Hansen <haveblue@us.ibm.com>
- */
 static int __init find_min_common_depth(void)
 {
-	int depth, index;
-	const unsigned int *ref_points;
+	int depth;
 	struct device_node *rtas_root;
-	unsigned int len;
 	struct device_node *options;
 
 	rtas_root = of_find_node_by_path("/rtas");
@@ -254,35 +281,62 @@ static int __init find_min_common_depth(
 		return -1;
 
 	/*
-	 * this property is 2 32-bit integers, each representing a level of
-	 * depth in the associativity nodes. The first is for an SMP
-	 * configuration (should be all 0's) and the second is for a normal
-	 * NUMA configuration.
+	 * This property is a set of 32-bit integers, each representing
+	 * an index into the ibm,associativity nodes.
+	 *
+	 * With form 0 affinity the first integer is for an SMP configuration
+	 * (should be all 0's) and the second is for a normal NUMA
+	 * configuration. We have only one level of NUMA.
+	 *
+	 * With form 1 affinity the first integer is the most significant
+	 * NUMA boundary and the following are progressively less significant
+	 * boundaries. There can be more than one level of NUMA.
 	 */
-	index = 1;
-	ref_points = of_get_property(rtas_root,
-			"ibm,associativity-reference-points", &len);
+	distance_ref_points = of_get_property(rtas_root,
+					"ibm,associativity-reference-points",
+					&distance_ref_points_depth);
+
+	if (!distance_ref_points)
+		goto err;
+
+	distance_ref_points_depth /= sizeof(int);
 
-	/*
-	 * For type 1 affinity information we want the first field
-	 */
 	options = of_find_node_by_path("/options");
 	if (options) {
 		const char *str;
 		str = of_get_property(options, "ibm,associativity-form", NULL);
 		if (str && !strcmp(str, "1"))
-			index = 0;
+			form1_affinity = 1;
 	}
 
-	if ((len >= 2 * sizeof(unsigned int)) && ref_points) {
-		depth = ref_points[index];
+	if (form1_affinity) {
+		depth = distance_ref_points[0];
 	} else {
-		dbg("NUMA: ibm,associativity-reference-points not found.\n");
-		depth = -1;
+		if (distance_ref_points_depth < 2)
+			goto err;
+
+		depth = distance_ref_points[1];
 	}
+
+	/*
+	 * Warn and cap if the hardware supports more than
+	 * MAX_DISTANCE_REF_POINTS domains.
+	 */
+	if (distance_ref_points_depth > MAX_DISTANCE_REF_POINTS) {
+		printk(KERN_WARNING
+			"NUMA: distance array capped at %d entries\n",
+			MAX_DISTANCE_REF_POINTS);
+		distance_ref_points_depth = MAX_DISTANCE_REF_POINTS;
+	}
+
 	of_node_put(rtas_root);
 
 	return depth;
+
+err:
+	dbg("NUMA: ibm,associativity-reference-points not found.\n");
+	of_node_put(rtas_root);
+	return -1;
 }
 
 static void __init get_n_mem_cells(int *n_addr_cells, int *n_size_cells)
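
To make the distance calculation in this patch easier to follow, here is a minimal user-space sketch of the same two steps: filling one row of the lookup table from an associativity array, and doubling the distance for each reference point at which two nodes differ. The helper names `fill_lookup_row` and `sketch_node_distance` are hypothetical, the rows are passed as parameters instead of living in a global table, and the associativity values in the usage note are made up. LOCAL_DISTANCE is assumed to be the kernel's usual value of 10.

```c
#define LOCAL_DISTANCE 10	/* assumed: the kernel's distance from a node to itself */

/*
 * Hypothetical helper: copy the domain ID found at each reference point
 * into this node's row, the way initialize_distance_lookup_table() does.
 */
static void fill_lookup_row(unsigned int *row,
			    const unsigned int *associativity,
			    const unsigned int *ref_points, int depth)
{
	for (int i = 0; i < depth; i++)
		row[i] = associativity[ref_points[i]];
}

/*
 * Hypothetical helper: same loop as __node_distance() in the patch, for
 * the form 1 case. Entry 0 is the node-level domain; each further entry
 * is a less significant boundary.
 */
static int sketch_node_distance(const unsigned int *row_a,
				const unsigned int *row_b, int depth)
{
	int distance = LOCAL_DISTANCE;

	for (int i = 0; i < depth; i++) {
		if (row_a[i] == row_b[i])
			break;

		/* Double the distance for each NUMA level that differs. */
		distance *= 2;
	}

	return distance;
}
```

With reference points {4, 2} and made-up associativity arrays {4, 0, 1, 0, 0}, {4, 0, 1, 0, 1} and {4, 0, 2, 0, 2}, the rows become {0, 1}, {1, 1} and {2, 2}. A node compared with itself gives 10, the first two nodes (different node, same outer domain) give 20, and the first and third (different at both levels) give 40.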
* Re: [PATCH 3/3] powerpc: Use form 1 affinity to setup node distance
From: Benjamin Herrenschmidt @ 2010-05-06 6:50 UTC
To: Anton Blanchard; +Cc: linuxppc-dev

On Fri, 2010-04-30 at 14:43 +1000, Anton Blanchard wrote:
> Form 1 affinity allows multiple entries in ibm,associativity-reference-points
> which represent affinity domains in decreasing order of importance. The
> Linux concept of a node is always the first entry, but using the other
> values as an input to node_distance() allows the memory allocator to make
> better decisions on which node to go first when local memory has been
> exhausted.
>
> We keep things simple and create an array indexed by NUMA node, capped at
> 4 entries. Each time we lookup an associativity property we initialise
> the array which is overkill, but since we should only hit this path during
> boot it didn't seem worth adding a per node valid bit.

Ok, so please double check my -next branch (I'm pushing a new one out today
hopefully) and respin :-) 1 and 2 seem to be already there and 3 doesn't
apply (non-trivial).

Thanks!

Cheers,
Ben.
* Re: [PATCH 2/3] powerpc: Add form 1 NUMA affinity
From: Benjamin Herrenschmidt @ 2010-04-30 7:24 UTC
To: Anton Blanchard; +Cc: linuxppc-dev

On Fri, 2010-04-30 at 14:34 +1000, Anton Blanchard wrote:
> Firmware changed the way it represents memory and cpu affinity on POWER7.
> Unfortunately the old method now caps the topology to work around issues
> with legacy operating systems. For Linux to get the correct topology we
> need to use the new form 1 affinity information.
>
> We set the form 1 field in the client architecture, and if we see "1" in the
> ibm,associativity-form property firmware supports form 1 affinity and
> we should look at the first field in the ibm,associativity-reference-points
> array. If not we use the second field as we always have.
>
> Signed-off-by: Anton Blanchard <anton@samba.org>
> ---

I sent your previous version of that one to Linus, it's already up. Can
you check it's all right?

Cheers,
Ben.
* Re: [PATCH 2/3] powerpc: Add form 1 NUMA affinity
From: Anton Blanchard @ 2010-04-30 7:33 UTC
To: Benjamin Herrenschmidt; +Cc: linuxppc-dev

Hi Ben,

> I sent your previous version of that one to Linus, it's already up. Can
> you check it's all right?

No change to this patch, but I thought I would send them as a series
since they build on each other. I'll double check mainline looks good.

Anton