public inbox for linux-ia64@vger.kernel.org
 help / color / mirror / Atom feed
* Re: [Lse-tech] fix zonelist ordering for NUMA
@ 2004-02-24 19:49 Matthew Dobson
  2004-02-25  5:02 ` j-nomura
                   ` (7 more replies)
  0 siblings, 8 replies; 9+ messages in thread
From: Matthew Dobson @ 2004-02-24 19:49 UTC (permalink / raw)
  To: linux-ia64

On Tue, 2004-02-24 at 01:20, j-nomura@ce.jp.nec.com wrote:
> - 		for (node = local_node + 1; node < numnodes; node++)
> - 			j = build_zonelists_node(NODE_DATA(node), zonelist, j, k);
> - 		for (node = 0; node < local_node; node++)
> - 			j = build_zonelists_node(NODE_DATA(node), zonelist, j, k);
> + 		for (node = 1; node < numnodes; node++)
> + 			j = build_zonelists_node(SORTED_NODE_DATA(local_node, node), zonelist, j, k);

Shouldn't that for loop be:
	for (node = 0; node < numnodes; node++)
		j = ....

or we'll be leaving a node out of the zonelists, right?

-Matt


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Lse-tech] fix zonelist ordering for NUMA
  2004-02-24 19:49 [Lse-tech] fix zonelist ordering for NUMA Matthew Dobson
@ 2004-02-25  5:02 ` j-nomura
  2004-02-25 10:59 ` j-nomura
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: j-nomura @ 2004-02-25  5:02 UTC (permalink / raw)
  To: linux-ia64

Hi,

> > + 		for (node = 1; node < numnodes; node++)
> > + 			j = build_zonelists_node(SORTED_NODE_DATA(local_node, node), zonelist, j, k);
> 
> Shouldn't that for loop be:
> 	for (node = 0; node < numnodes; node++)
> 		j = ....
> 
> or we'll be leaving a node out of the zonelists, right?

No. The zonelist for the local_node is built before the code above.

The whole code looks like:

> local_node = pgdat->node_id;
> ...
> j = build_zonelists_node(pgdat, zonelist, j, k);
> ...
> for (node = 1; node < numnodes; node++)
>     j = build_zonelists_node(SORTED_NODE_DATA(local_node,node), zonelist, j, k);

Actually, we might be able to clean it up like:

> for (idx = 0; idx < numnodes; idx++)
>     j = build_zonelists_node(SORTED_NODE_DATA(local_node,idx), zonelist, j, k);

Best regards.
--
NOMURA, Jun'ichi <j-nomura@ce.jp.nec.com>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Lse-tech] fix zonelist ordering for NUMA
  2004-02-24 19:49 [Lse-tech] fix zonelist ordering for NUMA Matthew Dobson
  2004-02-25  5:02 ` j-nomura
@ 2004-02-25 10:59 ` j-nomura
  2004-02-25 11:28 ` Christoph Hellwig
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: j-nomura @ 2004-02-25 10:59 UTC (permalink / raw)
  To: linux-ia64

[-- Attachment #1: Type: Text/Plain, Size: 669 bytes --]

I cleaned up the patch based on the comments from Jesse and Matthew.

>   1) make it arch independent
>      this means having arch code populate a SLIT-like table for use by
>      the generic zonelist building code

I moved the whole function to mm/page_alloc.c.

>   3) some systems have pgdats w/o any CPUs associated with them, they
>      need to be dealt with differently than regular nodes, maybe as
>      extensions to an existing node

Headless node is prefered over the nodes with same distance.

>   2) handle the cases that Erich talked about a bit better

Any idea for doing it in generic way?

Best regards.
--
NOMURA, Jun'ichi <j-nomura@ce.jp.nec.com>

[-- Attachment #2: numa-zonelist.diff --]
[-- Type: Text/Plain, Size: 2892 bytes --]

--- linux/mm/page_alloc.c	2004/02/18 07:25:09	1.1.1.25
+++ linux/mm/page_alloc.c	2004/02/25 10:28:35
@@ -1074,9 +1074,63 @@ static int __init build_zonelists_node(p
 	return j;
 }
 
+/**
+ * find_next_best_node - find the next node that should appear in a given
+ *    node's fallback list
+ * @node: node whose fallback list we're appending
+ * @used_node_mask: pointer to the bitmap of already used nodes
+ *
+ * We use a number of factors to determine which is the next node that should
+ * appear on a given node's fallback list.  The node should not have appeared
+ * already in @node's fallback list, and it should be the next closest node
+ * according to the distance array (which contains arbitrary distance values
+ * from each node to each node in the system), and should also prefer nodes
+ * with no CPUs, since presumably they'll have very little allocation pressure
+ * on them otherwise.
+ * It returns -1 if no node is found.
+ */
+#ifndef node_distance
+#define node_distance(from,to) (1)
+#endif
+#define PENALTY_FOR_NODE_WITH_CPUS  (1)
+
+static int __init find_next_best_node(int node, void *used_node_mask)
+{
+	int i, n, val;
+	int min_val = INT_MAX;
+	int best_node = -1;
+
+	for(i = 0; i < numnodes; i++) {
+		/* Start from local node */
+		n = (node+i)%numnodes;
+
+		/* Don't want a node to appear more than once */
+		if (test_bit(n, used_node_mask))
+			continue;
+
+		/* Use the distance array to find the distance */
+		val = node_distance(node, n);
+
+		/* Give preference to headless and unused nodes */
+		if (node_to_cpumask(n))
+			val += PENALTY_FOR_NODE_WITH_CPUS;
+
+		if (val < min_val) {
+			min_val = val;
+			best_node = n;
+		}
+	}
+
+	if (best_node >= 0)
+		set_bit(best_node, used_node_mask);
+
+	return best_node;
+}
+
 static void __init build_zonelists(pg_data_t *pgdat)
 {
 	int i, j, k, node, local_node;
+	DECLARE_BITMAP(used_mask, MAX_NUMNODES);
 
 	local_node = pgdat->node_id;
 	for (i = 0; i < MAX_NR_ZONES; i++) {
@@ -1092,19 +1146,9 @@ static void __init build_zonelists(pg_da
 		if (i & __GFP_DMA)
 			k = ZONE_DMA;
 
- 		j = build_zonelists_node(pgdat, zonelist, j, k);
- 		/*
- 		 * Now we build the zonelist so that it contains the zones
- 		 * of all the other nodes.
- 		 * We don't want to pressure a particular node, so when
- 		 * building the zones for node N, we make sure that the
- 		 * zones coming right after the local ones are those from
- 		 * node N+1 (modulo N)
- 		 */
- 		for (node = local_node + 1; node < numnodes; node++)
- 			j = build_zonelists_node(NODE_DATA(node), zonelist, j, k);
- 		for (node = 0; node < local_node; node++)
- 			j = build_zonelists_node(NODE_DATA(node), zonelist, j, k);
+		CLEAR_BITMAP(used_mask, MAX_NUMNODES);
+		while((node = find_next_best_node(local_node, used_mask)) >= 0)
+	 		j = build_zonelists_node(NODE_DATA(node), zonelist, j, k);
  
 		zonelist->zones[j++] = NULL;
 	} 

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Lse-tech] fix zonelist ordering for NUMA
  2004-02-24 19:49 [Lse-tech] fix zonelist ordering for NUMA Matthew Dobson
  2004-02-25  5:02 ` j-nomura
  2004-02-25 10:59 ` j-nomura
@ 2004-02-25 11:28 ` Christoph Hellwig
  2004-02-25 16:52 ` Jesse Barnes
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Christoph Hellwig @ 2004-02-25 11:28 UTC (permalink / raw)
  To: linux-ia64

On Wed, Feb 25, 2004 at 07:59:33PM +0900, j-nomura@ce.jp.nec.com wrote:
> +#ifndef node_distance
> +#define node_distance(from,to) (1)
> +#endif

please move this to toplogy.h - it'll be usefull in a few other places.

> +#define PENALTY_FOR_NODE_WITH_CPUS  (1)

should probably also go to topology.h to allow architectures to change it.

> +	for(i = 0; i < numnodes; i++) {

minor  nitpick: please leave a space between the for and the opening brace


> +		while((node = find_next_best_node(local_node, used_mask)) >= 0)

same here.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Lse-tech] fix zonelist ordering for NUMA
  2004-02-24 19:49 [Lse-tech] fix zonelist ordering for NUMA Matthew Dobson
                   ` (2 preceding siblings ...)
  2004-02-25 11:28 ` Christoph Hellwig
@ 2004-02-25 16:52 ` Jesse Barnes
  2004-02-27  5:31 ` j-nomura
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Jesse Barnes @ 2004-02-25 16:52 UTC (permalink / raw)
  To: linux-ia64

On Wed, Feb 25, 2004 at 07:59:33PM +0900, j-nomura@ce.jp.nec.com wrote:
> I cleaned up the patch based on the comments from Jesse and Matthew.
> 
> >   1) make it arch independent
> >      this means having arch code populate a SLIT-like table for use by
> >      the generic zonelist building code
> 
> I moved the whole function to mm/page_alloc.c.

Looks even better, that was fast! :)

> >   3) some systems have pgdats w/o any CPUs associated with them, they
> >      need to be dealt with differently than regular nodes, maybe as
> >      extensions to an existing node
> 
> Headless node is prefered over the nodes with same distance.

I'd be curious to hear about others with similar configurations.  On
sn2, we may have multiple headless nodes for each node with CPUs.  In
such a configuration, it seems best to have each node with CPUs 'own' a
set of headless nodes, and allocate from them even if they're further
away than other nodes with CPUs.  I don't think we have to worry about
that too much now though, since the algorithm below could be tweaked to
do just that easier than the simple sort code I did awhile back.

> >   2) handle the cases that Erich talked about a bit better
> 
> Any idea for doing it in generic way?

We could adjust 'val' below based on an array that weights each node as
it's added to a zonelist.  I think that would be up to the caller of
find_next_best_node() to adjust, but would be used in the routine below.
Doing it that way would allow the balancing that Erich was talking about
as well as the headless node stuff we want for sn2.

Thanks,
Jesse

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Lse-tech] fix zonelist ordering for NUMA
  2004-02-24 19:49 [Lse-tech] fix zonelist ordering for NUMA Matthew Dobson
                   ` (3 preceding siblings ...)
  2004-02-25 16:52 ` Jesse Barnes
@ 2004-02-27  5:31 ` j-nomura
  2004-02-27 12:19 ` j-nomura
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: j-nomura @ 2004-02-27  5:31 UTC (permalink / raw)
  To: linux-ia64

Hi,

> > >   2) handle the cases that Erich talked about a bit better
> > 
> > Any idea for doing it in generic way?
> 
> We could adjust 'val' below based on an array that weights each node as
> it's added to a zonelist.  I think that would be up to the caller of
> find_next_best_node() to adjust, but would be used in the routine below.
> Doing it that way would allow the balancing that Erich was talking about
> as well as the headless node stuff we want for sn2.

Needs consideration for this.
I think the find_next_best_node-loop must be outer to for-each-zonelist loop
to do this kind of weighing.

Best regards.
--
NOMURA, Jun'ichi <j-nomura@ce.jp.nec.com>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Lse-tech] fix zonelist ordering for NUMA
  2004-02-24 19:49 [Lse-tech] fix zonelist ordering for NUMA Matthew Dobson
                   ` (4 preceding siblings ...)
  2004-02-27  5:31 ` j-nomura
@ 2004-02-27 12:19 ` j-nomura
  2004-02-27 17:33 ` Jesse Barnes
  2004-03-01 14:08 ` j-nomura
  7 siblings, 0 replies; 9+ messages in thread
From: j-nomura @ 2004-02-27 12:19 UTC (permalink / raw)
  To: linux-ia64

[-- Attachment #1: Type: Text/Plain, Size: 1269 bytes --]

I updated the patch.
This version of build_zonelists() is NUMA-specific one.
It will handle the situation Erich pointed out better.

For example, when the node_distance(i,j) looks like below: 

   10 15 15 15 20 20 20 20 
   15 10 15 15 20 20 20 20 
   15 15 10 15 20 20 20 20 
   15 15 15 10 20 20 20 20 
   20 20 20 20 10 15 15 15 
   20 20 20 20 15 10 15 15 
   20 20 20 20 15 15 10 15 
   20 20 20 20 15 15 15 10 

The previous patch generates zonelist in the following order:
   Node#00  0   1   2   3   4   5   6   7 
   Node#01  1   2   3   0   4   5   6   7 
   Node#02  2   3   0   1   4   5   6   7 
   Node#03  3   0   1   2   4   5   6   7 
   Node#04  4   5   6   7   0   1   2   3 
   Node#05  5   6   7   4   0   1   2   3 
   Node#06  6   7   4   5   0   1   2   3 
   Node#07  7   4   5   6   0   1   2   3 

With this patch, the order looks like:
   Node#00  0   1   2   3   4   5   6   7 
   Node#01  1   2   3   0   5   6   7   4 
   Node#02  2   3   0   1   6   7   4   5 
   Node#03  3   0   1   2   7   4   5   6 
   Node#04  4   5   6   7   0   1   2   3 
   Node#05  5   6   7   4   1   2   3   0 
   Node#06  6   7   4   5   2   3   0   1 
   Node#07  7   4   5   6   3   0   1   2 

Best regards.
--
NOMURA, Jun'ichi <j-nomura@ce.jp.nec.com>

[-- Attachment #2: numa-zonelist2.diff --]
[-- Type: Text/Plain, Size: 4525 bytes --]

--- linux/mm/page_alloc.c	2004/02/18 07:25:09	1.1.1.25
+++ linux/mm/page_alloc.c	2004/02/27 10:55:52
@@ -1074,6 +1074,106 @@ static int __init build_zonelists_node(p
 	return j;
 }
 
+#ifdef CONFIG_NUMA
+#define MAX_NODE_LOAD (numnodes)
+static int __initdata node_load[MAX_NUMNODES];
+/**
+ * find_next_best_node - find the next node that should appear in a given
+ *    node's fallback list
+ * @node: node whose fallback list we're appending
+ * @used_node_mask: pointer to the bitmap of already used nodes
+ *
+ * We use a number of factors to determine which is the next node that should
+ * appear on a given node's fallback list.  The node should not have appeared
+ * already in @node's fallback list, and it should be the next closest node
+ * according to the distance array (which contains arbitrary distance values
+ * from each node to each node in the system), and should also prefer nodes
+ * with no CPUs, since presumably they'll have very little allocation pressure
+ * on them otherwise.
+ * It returns -1 if no node is found.
+ */
+static int __init find_next_best_node(int node, void *used_node_mask)
+{
+	int i, n, val;
+	int min_val = INT_MAX;
+	int best_node = -1;
+
+	for (i = 0; i < numnodes; i++) {
+		/* Start from local node */
+		n = (node+i)%numnodes;
+
+		/* Don't want a node to appear more than once */
+		if (test_bit(n, used_node_mask))
+			continue;
+
+		/* Use the distance array to find the distance */
+		val = node_distance(node, n);
+
+		/* Give preference to headless and unused nodes */
+		if (node_to_cpumask(n))
+			val += PENALTY_FOR_NODE_WITH_CPUS;
+
+		/* Slight preference for less loaded node */
+		val *= (MAX_NODE_LOAD*MAX_NUMNODES);
+		val += node_load[n];
+
+		if (val < min_val) {
+			min_val = val;
+			best_node = n;
+		}
+	}
+
+	if (best_node >= 0)
+		set_bit(best_node, used_node_mask);
+
+	return best_node;
+}
+
+static void __init build_zonelists(pg_data_t *pgdat)
+{
+	int i, j, k, node, local_node;
+	int prev_node, load;
+	struct zonelist *zonelist;
+	DECLARE_BITMAP(used_mask, MAX_NUMNODES);
+
+	/* initialize zonelists */
+	for (i = 0; i < MAX_NR_ZONES; i++) {
+		zonelist = pgdat->node_zonelists + i;
+		memset(zonelist, 0, sizeof(*zonelist));
+		zonelist->zones[0] = NULL;
+	}
+
+	/* NUMA-aware ordering of nodes */
+	local_node = pgdat->node_id;
+	load = numnodes;
+	prev_node = local_node;
+	CLEAR_BITMAP(used_mask, MAX_NUMNODES);
+	while ((node = find_next_best_node(local_node, used_mask)) >= 0) {
+		/*
+		 * We don't want to pressure a particular node.
+		 * So adding penalty to the first node in same
+		 * distance group to make it round-robin.
+		 */
+		if (node_distance(local_node, node) != node_distance(local_node, prev_node))
+			node_load[node] += load;
+		prev_node = node;
+		load--;
+		for (i = 0; i < MAX_NR_ZONES; i++) {
+			zonelist = pgdat->node_zonelists + i;
+			for (j = 0; zonelist->zones[j] != NULL; j++);
+			
+			k = ZONE_NORMAL;
+			if (i & __GFP_HIGHMEM)
+				k = ZONE_HIGHMEM;
+			if (i & __GFP_DMA)
+				k = ZONE_DMA;
+
+	 		j = build_zonelists_node(NODE_DATA(node), zonelist, j, k);
+			zonelist->zones[j] = NULL;
+		}
+	}
+}
+#else
 static void __init build_zonelists(pg_data_t *pgdat)
 {
 	int i, j, k, node, local_node;
@@ -1109,7 +1209,8 @@ static void __init build_zonelists(pg_da
 		zonelist->zones[j++] = NULL;
 	} 
 }
+#endif
 
 void __init build_all_zonelists(void)
 {
--- linux/include/asm-generic/topology.h	2004/02/18 07:21:59	1.1.1.5
+++ linux/include/asm-generic/topology.h	2004/02/27 10:55:52
@@ -44,6 +44,12 @@
 #ifndef pcibus_to_cpumask
 #define pcibus_to_cpumask(bus)	(cpu_online_map)
 #endif
+#ifndef node_distance
+#define node_distance(from,to)	(from != to)
+#endif
+#ifndef PENALTY_FOR_NODE_WITH_CPUS
+#define PENALTY_FOR_NODE_WITH_CPUS	(1)
+#endif
 
 /* Cross-node load balancing interval. */
 #ifndef NODE_BALANCE_RATE
Index: linux/include/asm-i386/topology.h
===================================================================
RCS file: /home/cvsadm/cvsroot/linux2.5/linux/include/asm-i386/topology.h,v
retrieving revision 1.1.1.6
diff -u -p -u -p -r1.1.1.6 topology.h
--- linux/include/asm-i386/topology.h	2004/02/18 07:22:14	1.1.1.6
+++ linux/include/asm-i386/topology.h	2004/02/27 10:55:52
@@ -66,6 +66,12 @@ static inline cpumask_t pcibus_to_cpumas
 	return node_to_cpumask(mp_bus_id_to_node[bus]);
 }
 
+/* Node-to-Node distance */
+static inline int node_distance(int from, int to)
+{
+	return (from != to);
+}
+
 /* Cross-node load balancing interval. */
 #define NODE_BALANCE_RATE 100
 

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Lse-tech] fix zonelist ordering for NUMA
  2004-02-24 19:49 [Lse-tech] fix zonelist ordering for NUMA Matthew Dobson
                   ` (5 preceding siblings ...)
  2004-02-27 12:19 ` j-nomura
@ 2004-02-27 17:33 ` Jesse Barnes
  2004-03-01 14:08 ` j-nomura
  7 siblings, 0 replies; 9+ messages in thread
From: Jesse Barnes @ 2004-02-27 17:33 UTC (permalink / raw)
  To: linux-ia64

On Fri, Feb 27, 2004 at 09:19:58PM +0900, j-nomura@ce.jp.nec.com wrote:
> I updated the patch.
> This version of build_zonelists() is NUMA-specific one.
> It will handle the situation Erich pointed out better.

This patch looks great to me.  Care to send it on to Andrew for
inclusion into his tree?

Thanks,
Jesse

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Lse-tech] fix zonelist ordering for NUMA
  2004-02-24 19:49 [Lse-tech] fix zonelist ordering for NUMA Matthew Dobson
                   ` (6 preceding siblings ...)
  2004-02-27 17:33 ` Jesse Barnes
@ 2004-03-01 14:08 ` j-nomura
  7 siblings, 0 replies; 9+ messages in thread
From: j-nomura @ 2004-03-01 14:08 UTC (permalink / raw)
  To: linux-ia64

> > I updated the patch.
> > This version of build_zonelists() is NUMA-specific one.
> > It will handle the situation Erich pointed out better.
> 
> This patch looks great to me.  Care to send it on to Andrew for
> inclusion into his tree?

Thank you. I'll send it to Andrew.

Best regards.
--
NOMURA, Jun'ichi <j-nomura@ce.jp.nec.com>

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2004-03-01 14:08 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2004-02-24 19:49 [Lse-tech] fix zonelist ordering for NUMA Matthew Dobson
2004-02-25  5:02 ` j-nomura
2004-02-25 10:59 ` j-nomura
2004-02-25 11:28 ` Christoph Hellwig
2004-02-25 16:52 ` Jesse Barnes
2004-02-27  5:31 ` j-nomura
2004-02-27 12:19 ` j-nomura
2004-02-27 17:33 ` Jesse Barnes
2004-03-01 14:08 ` j-nomura

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox