public inbox for linux-ia64@vger.kernel.org
 help / color / mirror / Atom feed
* fix zonelist ordering for NUMA
@ 2004-02-24  9:20 j-nomura
  2004-02-24 17:13 ` Jesse Barnes
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: j-nomura @ 2004-02-24  9:20 UTC (permalink / raw)
  To: linux-ia64

[-- Attachment #1: Type: Text/Plain, Size: 669 bytes --]

Hello,

The attached patch makes use of arch-dependent info for building zonelist.
The patch uses ACPI SLIT for ia64.
Other arch may have their own method to determine the order.

This kind of ordering is very important for the NUMA system in which
the distance between nodes is not uniform.

The patch doing this was posted by Jesse Barnes in linux-ia64:
http://marc.theaimsgroup.com/?t=106383477500001&r=1&w=2
however, I couldn't find it in current tree...

The sorting can be extended to, for example, more fine grained round-robin
like Erich suggested. But let's start from the simple one.

Any comments?

Best regards.
--
NOMURA, Jun'ichi <j-nomura@ce.jp.nec.com>

[-- Attachment #2: ia64-numa-zoneordering.diff --]
[-- Type: Text/Plain, Size: 3396 bytes --]

--- linux/mm/page_alloc.c	2004/02/18 07:25:09	1.1.1.25
+++ linux/mm/page_alloc.c	2004/02/24 09:02:29
@@ -1074,6 +1074,13 @@ static int __init build_zonelists_node(p
 	return j;
 }
 
+#ifndef HAVE_ARCH_SORTED_NODE_DATA
+/*
+ * By default, the order of node data is unchanged.
+ */
+#define SORTED_NODE_DATA(base, idx) NODE_DATA((base+idx)%numnodes)
+#endif
+
 static void __init build_zonelists(pg_data_t *pgdat)
 {
 	int i, j, k, node, local_node;
@@ -1100,12 +1107,12 @@ static void __init build_zonelists(pg_da
  		 * building the zones for node N, we make sure that the
  		 * zones coming right after the local ones are those from
  		 * node N+1 (modulo N)
+ 		 * Multi-level NUMA system can use arch-dependent node data
+		 * list. (e.g. sorted by distance)
  		 */
- 		for (node = local_node + 1; node < numnodes; node++)
- 			j = build_zonelists_node(NODE_DATA(node), zonelist, j, k);
- 		for (node = 0; node < local_node; node++)
- 			j = build_zonelists_node(NODE_DATA(node), zonelist, j, k);
+ 		for (node = 1; node < numnodes; node++)
+ 			j = build_zonelists_node(SORTED_NODE_DATA(local_node, node), zonelist, j, k);
  
 		zonelist->zones[j++] = NULL;
 	} 
--- linux/include/asm-ia64/numa.h	2004/02/18 07:21:42	1.1.1.8
+++ linux/include/asm-ia64/numa.h	2004/02/24 09:02:29
@@ -65,7 +65,11 @@ extern int paddr_to_nid(unsigned long pa
 
 #define local_nodeid (cpu_to_node_map[smp_processor_id()])
 
+#define HAVE_ARCH_SORTED_NODE_DATA
+#define SORTED_NODE_DATA(base, idx) NODE_DATA(nodes_by_distance[base][idx])
+extern int __initdata nodes_by_distance[MAX_NUMNODES][MAX_NUMNODES];
+
 #else /* !CONFIG_NUMA */
 
 #define paddr_to_nid(addr)	0
--- linux/arch/ia64/mm/discontig.c	2004/02/18 07:23:08	1.1.1.8
+++ linux/arch/ia64/mm/discontig.c	2004/02/24 09:02:29
@@ -47,6 +47,53 @@ static struct early_node_data mem_data[N
 #define NODEDATA_ALIGN(addr, node)						\
 	((((addr) + 1024*1024-1) & ~(1024*1024-1)) + (node)*PERCPU_PAGE_SIZE)
 
+/*
+ * node list sorted by distance
+ *
+ * For example, if the SLIT looks like below:
+ *     10 30 20
+ *     20 10 30
+ *     30 20 10
+ *
+ * nodes_by_distance[][] will be:
+ *      0  2  1
+ *      1  0  2
+ *      2  1  0
+ */
+int __initdata nodes_by_distance[MAX_NUMNODES][MAX_NUMNODES];
+
+/**
+ * build_sorted_node_list - build nodes_by_distance matrix from ACPI SLIT
+ *
+ * Called in early stage to create matrix for SORTED_NODE_DATA().
+ * The function depends on node_distance (=numa_slit) and numnodes.
+ */ 
+static void __init build_sorted_node_list(void)
+{
+	int i, j, k, n;
+	int dist, min, next_min;
+
+	for(i = 0; i < numnodes; i++) {
+		/* index 0 always points to self */
+		nodes_by_distance[i][0] = i;
+		/* sorting for node i */
+		for(j = 1, min = 0; j < numnodes; min = next_min) {
+			/* slit entry is u8 */
+			next_min = INT_MAX;
+			for(k = 0; k < numnodes; k++) {
+				n = (i+k)%numnodes; /* permutation */
+				dist = node_distance(i,n);
+				if (dist == min && i != n)
+					nodes_by_distance[i][j++] = n;
+				else if (dist > min && dist < next_min)
+					next_min = dist;
+			}
+			if (next_min == INT_MAX)
+				break;
+		}
+	}
+}
+
 /**
  * build_node_maps - callback to setup bootmem structs for each node
  * @start: physical start of range
@@ -333,6 +380,7 @@ void __init find_memory(void)
 
 	reserve_pernode_space();
 	initialize_pernode_data();
+	build_sorted_node_list();
 
 	max_pfn = max_low_pfn;
 

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: fix zonelist ordering for NUMA
  2004-02-24  9:20 fix zonelist ordering for NUMA j-nomura
@ 2004-02-24 17:13 ` Jesse Barnes
  2004-02-25  5:01 ` j-nomura
  2004-02-25 16:54 ` Jesse Barnes
  2 siblings, 0 replies; 4+ messages in thread
From: Jesse Barnes @ 2004-02-24 17:13 UTC (permalink / raw)
  To: linux-ia64

On Tue, Feb 24, 2004 at 06:20:28PM +0900, j-nomura@ce.jp.nec.com wrote:
> The attached patch makes use of arch-dependent info for building zonelist.
> The patch uses ACPI SLIT for ia64.
> Other arch may have their own method to determine the order.
> 
> This kind of ordering is very important for the NUMA system in which
> the distance between nodes is not uniform.
> 
> The patch doing this was posted by Jesse Barnes in linux-ia64:
> http://marc.theaimsgroup.com/?t\x106383477500001&r=1&w=2
> however, I couldn't find it in current tree...

Yeah, I haven't pushed it yet (I didn't think it was ready yet and I
haven't done a good version for 2.6 yet).

> The sorting can be extended to, for example, more fine grained round-robin
> like Erich suggested. But let's start from the simple one.
> 
> Any comments?

Yeah, it looks ok.  What I was hoping to do in the patch that ultimately
gets in:

  1) make it arch independent
     this means having arch code populate a SLIT-like table for use by
     the generic zonelist building code
  2) handle the cases that Erich talked about a bit better
  3) some systems have pgdats w/o any CPUs associated with them, they
     need to be dealt with differently than regular nodes, maybe as
     extensions to an existing node

The final routine might look something like (many thanks to pj for
hitting me with a cluebat about this):


/**
 * find_next_best_node - find the next node that should appear in a given
 *    node's fallback list
 * @node: node whose fallback list we're appending
 *
 * We use a number of factors to determine which is the next node that should
 * appear on a given node's fallback list.  The node should not have appeared
 * already in @node's fallback list, and it should be the next closest node
 * according to the distance array (which contains arbitrary distance values
 * from each node to each node in the system), and should also prefer nodes
 * with no CPUs, since presumably they'll have very little allocation pressure
 * on them otherwise.
 */
int find_next_best_node(int node)
{
	int i, val, min_val, best_node;

	for (i = 0; i < numnodes; i++) {
		/* Don't want a node to appear more than once */
		if (node_present(node, i))
		    continue;

		/* Use the distance array to find the distance */
		val = node_distance(node, i);

		/* Give preference to headless and unused nodes */
		val += nid_enabled_cpu_count[i] * 255;
		val += node_load[i];

		if (val < min_val) {
			min_val = val;
			best_node = i;
		}
	}

	return best_node;
}

Jesse

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: fix zonelist ordering for NUMA
  2004-02-24  9:20 fix zonelist ordering for NUMA j-nomura
  2004-02-24 17:13 ` Jesse Barnes
@ 2004-02-25  5:01 ` j-nomura
  2004-02-25 16:54 ` Jesse Barnes
  2 siblings, 0 replies; 4+ messages in thread
From: j-nomura @ 2004-02-25  5:01 UTC (permalink / raw)
  To: linux-ia64

Hi,

> > The sorting can be extended to, for example, more fine grained round-robin
> > like Erich suggested. But let's start from the simple one.
> > 
> > Any comments?
> 
> Yeah, it looks ok.

Thank you.

> What I was hoping to do in the patch that ultimately
> gets in:
> 
>   1) make it arch independent
>      this means having arch code populate a SLIT-like table for use by
>      the generic zonelist building code

I would like to hear the comments from people on other arch.
If the same ordering rule can be applicable for others, it's nice.

>   2) handle the cases that Erich talked about a bit better
>   3) some systems have pgdats w/o any CPUs associated with them, they
>      need to be dealt with differently than regular nodes, maybe as
>      extensions to an existing node
> 
> The final routine might look something like (many thanks to pj for
> hitting me with a cluebat about this):

It looks cleaner than mine. I'll try it.

Best regards.
--
NOMURA, Jun'ichi <j-nomura@ce.jp.nec.com>

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: fix zonelist ordering for NUMA
  2004-02-24  9:20 fix zonelist ordering for NUMA j-nomura
  2004-02-24 17:13 ` Jesse Barnes
  2004-02-25  5:01 ` j-nomura
@ 2004-02-25 16:54 ` Jesse Barnes
  2 siblings, 0 replies; 4+ messages in thread
From: Jesse Barnes @ 2004-02-25 16:54 UTC (permalink / raw)
  To: linux-ia64

On Wed, Feb 25, 2004 at 02:01:16PM +0900, j-nomura@ce.jp.nec.com wrote:
> >   1) make it arch independent
> >      this means having arch code populate a SLIT-like table for use by
> >      the generic zonelist building code
> 
> I would like to hear the comments from people on other arch.
> If the same ordering rule can be applicable for others, it's nice.

Martin, does a scheme like this sound ok with you?  Arch specific code
would populate a node distance table, which would be used to build each
pgdat->zonelist in a smarter way than we do currently.

> >   2) handle the cases that Erich talked about a bit better
> >   3) some systems have pgdats w/o any CPUs associated with them, they
> >      need to be dealt with differently than regular nodes, maybe as
> >      extensions to an existing node
> > 
> > The final routine might look something like (many thanks to pj for
> > hitting me with a cluebat about this):
> 
> It looks cleaner than mine. I'll try it.

Cool, thanks.

Jesse

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2004-02-25 16:54 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2004-02-24  9:20 fix zonelist ordering for NUMA j-nomura
2004-02-24 17:13 ` Jesse Barnes
2004-02-25  5:01 ` j-nomura
2004-02-25 16:54 ` Jesse Barnes

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox