[PATCH RFC 0/5] sched,numa: task placement with complex NUMA topologies

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH RFC 0/5] sched,numa: task placement with complex NUMA topologies
@ 2014-10-08 19:37 riel
  2014-10-08 19:37 ` [PATCH RFC 1/5] sched,numa: build table of node hop distance riel
                   ` (5 more replies)
  0 siblings, 6 replies; 19+ messages in thread
From: riel @ 2014-10-08 19:37 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, mgorman, chegu_vinod, mingo, efault, vincent.guittot

This patch set integrates two algorithms I have previously tested,
one for glueless mesh NUMA topologies, where NUMA nodes communicate
with far-away nodes through intermediary nodes, and backplane
topologies, where communication with far-away NUMA nodes happens
through backplane controllers (which cannot run tasks).

Due to the inavailability of 8 node systems, and the fact that I
am flying out to Linuxcon Europe / Plumbers / KVM Forum on Friday,
I have not tested these patches yet. However, with a conference (and
many familiar faces) coming up, it seemed like a good idea to get
the code out there, anyway.

The algorithms have been tested before, on both kinds of system.
The new thing about this series is that both algorithms have been
integrated into the same code base, and new code to select the
preferred_nid for tasks in numa groups.

Placement of tasks on smaller, directly connected, NUMA systems
should not be affected at all by this patch series.

I am interested in reviews, as well as test results on larger
NUMA systems :)

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH RFC 1/5] sched,numa: build table of node hop distance
  2014-10-08 19:37 [PATCH RFC 0/5] sched,numa: task placement with complex NUMA topologies riel
@ 2014-10-08 19:37 ` riel
  2014-10-12 13:17   ` Peter Zijlstra
  2014-10-08 19:37 ` [PATCH RFC 2/5] sched,numa: classify the NUMA topology of a system riel
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 19+ messages in thread
From: riel @ 2014-10-08 19:37 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, mgorman, chegu_vinod, mingo, efault, vincent.guittot

From: Rik van Riel <riel@redhat.com>

In order to more efficiently figure out where to place workloads
that span multiple NUMA nodes, it makes sense to estimate how
many hops away nodes are from each other.

Also add some comments to sched_init_numa.

Signed-off-by: Rik van Riel <riel@redhat.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
---
 include/linux/topology.h |  1 +
 kernel/sched/core.c      | 35 +++++++++++++++++++++++++++++++++--
 2 files changed, 34 insertions(+), 2 deletions(-)

diff --git a/include/linux/topology.h b/include/linux/topology.h
index dda6ee5..33002f4 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -47,6 +47,7 @@
 		if (nr_cpus_node(node))
 
 int arch_update_cpu_topology(void);
+extern int node_hops(int i, int j);
 
 /* Conform to ACPI 2.0 SLIT distance definitions */
 #define LOCAL_DISTANCE		10
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5a4ad05..0cf501e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6076,6 +6076,7 @@ static void claim_allocations(int cpu, struct sched_domain *sd)
 #ifdef CONFIG_NUMA
 static int sched_domains_numa_levels;
 static int *sched_domains_numa_distance;
+static int *sched_domains_numa_hops;
 static struct cpumask ***sched_domains_numa_masks;
 static int sched_domains_curr_level;
 #endif
@@ -6247,6 +6248,19 @@ static void sched_numa_warn(const char *str)
 	printk(KERN_WARNING "\n");
 }
 
+int node_hops(int i, int j)
+{
+	if (!sched_domains_numa_hops)
+		return 0;
+
+	return sched_domains_numa_hops[i * nr_node_ids + j];
+}
+
+static void set_node_hops(int i, int j, int hops)
+{
+	sched_domains_numa_hops[i * nr_node_ids + j] = hops;
+}
+
 static bool find_numa_distance(int distance)
 {
 	int i;
@@ -6273,6 +6287,10 @@ static void sched_init_numa(void)
 	if (!sched_domains_numa_distance)
 		return;
 
+	sched_domains_numa_hops = kzalloc(sizeof(int) * nr_node_ids * nr_node_ids, GFP_KERNEL);
+	if (!sched_domains_numa_hops)
+		return;
+
 	/*
 	 * O(nr_nodes^2) deduplicating selection sort -- in order to find the
 	 * unique distances in the node_distance() table.
@@ -6340,7 +6358,7 @@ static void sched_init_numa(void)
 
 	/*
 	 * Now for each level, construct a mask per node which contains all
-	 * cpus of nodes that are that many hops away from us.
+	 * cpus of nodes that are that many hops away from us and closer by.
 	 */
 	for (i = 0; i < level; i++) {
 		sched_domains_numa_masks[i] =
@@ -6348,6 +6366,9 @@ static void sched_init_numa(void)
 		if (!sched_domains_numa_masks[i])
 			return;
 
+		/* A node is 0 hops away from itself. */
+		set_node_hops(i, i, 0);
+
 		for (j = 0; j < nr_node_ids; j++) {
 			struct cpumask *mask = kzalloc(cpumask_size(), GFP_KERNEL);
 			if (!mask)
@@ -6356,10 +6377,20 @@ static void sched_init_numa(void)
 			sched_domains_numa_masks[i][j] = mask;
 
 			for (k = 0; k < nr_node_ids; k++) {
-				if (node_distance(j, k) > sched_domains_numa_distance[i])
+				int distance = node_distance(j, k);
+				if (distance > sched_domains_numa_distance[i])
 					continue;
 
+				/* All CPUs at distance or less. */
 				cpumask_or(mask, mask, cpumask_of_node(k));
+
+				/*
+				 * The number of hops is one larger than i,
+				 * because sched_domains_numa_distance[]
+				 * excludes the local distance.
+				 */
+				if (distance == sched_domains_numa_distance[i])
+					set_node_hops(j, k, i+1);
 			}
 		}
 	}
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC 1/5] sched,numa: build table of node hop distance
  2014-10-08 19:37 ` [PATCH RFC 1/5] sched,numa: build table of node hop distance riel
@ 2014-10-12 13:17   ` Peter Zijlstra
  2014-10-12 13:28     ` Rik van Riel
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2014-10-12 13:17 UTC (permalink / raw)
  To: riel; +Cc: linux-kernel, mgorman, chegu_vinod, mingo, efault,
	vincent.guittot

On Wed, Oct 08, 2014 at 03:37:26PM -0400, riel@redhat.com wrote:
> +	sched_domains_numa_hops = kzalloc(sizeof(int) * nr_node_ids * nr_node_ids, GFP_KERNEL);
> +	if (!sched_domains_numa_hops)
> +		return;

That's potentially a _BIG_ table (1M for a 512 node system).
The node_distance has magic allocations and is of u8 size, is there any
way we can re-use node_distance and avoid a second O(n^2) allocation?

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC 1/5] sched,numa: build table of node hop distance
  2014-10-12 13:17   ` Peter Zijlstra
@ 2014-10-12 13:28     ` Rik van Riel
  2014-10-14  6:47       ` Peter Zijlstra
  0 siblings, 1 reply; 19+ messages in thread
From: Rik van Riel @ 2014-10-12 13:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mgorman, chegu_vinod, mingo, efault,
	vincent.guittot

On 10/12/2014 09:17 AM, Peter Zijlstra wrote:
> On Wed, Oct 08, 2014 at 03:37:26PM -0400, riel@redhat.com wrote:
>> +	sched_domains_numa_hops = kzalloc(sizeof(int) * nr_node_ids * nr_node_ids, GFP_KERNEL);
>> +	if (!sched_domains_numa_hops)
>> +		return;
>
> That's potentially a _BIG_ table (1M for a 512 node system).
> The node_distance has magic allocations and is of u8 size, is there any
> way we can re-use node_distance and avoid a second O(n^2) allocation?

You are right, this should be a u8 at the least.

Beyond that, I am not convinced that merging things into
the same array is worthwhile, since (IIRC) nr_node_ids
should be set to the actual number of nodes on the system
by then.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC 1/5] sched,numa: build table of node hop distance
  2014-10-12 13:28     ` Rik van Riel
@ 2014-10-14  6:47       ` Peter Zijlstra
  2014-10-14  7:49         ` Rik van Riel
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2014-10-14  6:47 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, mgorman, chegu_vinod, mingo, efault,
	vincent.guittot

On Sun, Oct 12, 2014 at 09:28:04AM -0400, Rik van Riel wrote:
> On 10/12/2014 09:17 AM, Peter Zijlstra wrote:
> >On Wed, Oct 08, 2014 at 03:37:26PM -0400, riel@redhat.com wrote:
> >>+	sched_domains_numa_hops = kzalloc(sizeof(int) * nr_node_ids * nr_node_ids, GFP_KERNEL);
> >>+	if (!sched_domains_numa_hops)
> >>+		return;
> >
> >That's potentially a _BIG_ table (1M for a 512 node system).
> >The node_distance has magic allocations and is of u8 size, is there any
> >way we can re-use node_distance and avoid a second O(n^2) allocation?
> 
> You are right, this should be a u8 at the least.
> 
> Beyond that, I am not convinced that merging things into
> the same array is worthwhile, since (IIRC) nr_node_ids
> should be set to the actual number of nodes on the system
> by then.

The thing is, it looks like all you do is compare hop distance, and the
order of the hop distances is the exact same order as the regular numa
distance. I could not find a place where you use the actual hop value.

So if all you're interested in is the relative ordering, that should be
the same for both.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC 1/5] sched,numa: build table of node hop distance
  2014-10-14  6:47       ` Peter Zijlstra
@ 2014-10-14  7:49         ` Rik van Riel
  0 siblings, 0 replies; 19+ messages in thread
From: Rik van Riel @ 2014-10-14  7:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mgorman, chegu_vinod, mingo, efault,
	vincent.guittot

On 10/14/2014 02:47 AM, Peter Zijlstra wrote:
> On Sun, Oct 12, 2014 at 09:28:04AM -0400, Rik van Riel wrote:
>> On 10/12/2014 09:17 AM, Peter Zijlstra wrote:
>>> On Wed, Oct 08, 2014 at 03:37:26PM -0400, riel@redhat.com wrote:
>>>> +	sched_domains_numa_hops = kzalloc(sizeof(int) * nr_node_ids * nr_node_ids, GFP_KERNEL);
>>>> +	if (!sched_domains_numa_hops)
>>>> +		return;
>>>
>>> That's potentially a _BIG_ table (1M for a 512 node system).
>>> The node_distance has magic allocations and is of u8 size, is there any
>>> way we can re-use node_distance and avoid a second O(n^2) allocation?
>>
>> You are right, this should be a u8 at the least.
>>
>> Beyond that, I am not convinced that merging things into
>> the same array is worthwhile, since (IIRC) nr_node_ids
>> should be set to the actual number of nodes on the system
>> by then.
>
> The thing is, it looks like all you do is compare hop distance, and the
> order of the hop distances is the exact same order as the regular numa
> distance. I could not find a place where you use the actual hop value.

I use the actual hop distances when doing the scoring
for glueless mesh topologies, in patch 4/5.

> So if all you're interested in is the relative ordering, that should be
> the same for both.
>


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH RFC 2/5] sched,numa: classify the NUMA topology of a system
  2014-10-08 19:37 [PATCH RFC 0/5] sched,numa: task placement with complex NUMA topologies riel
  2014-10-08 19:37 ` [PATCH RFC 1/5] sched,numa: build table of node hop distance riel
@ 2014-10-08 19:37 ` riel
  2014-10-12 14:30   ` Peter Zijlstra
  2014-10-08 19:37 ` [PATCH RFC 3/5] sched,numa: preparations for complex topology placement riel
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 19+ messages in thread
From: riel @ 2014-10-08 19:37 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, mgorman, chegu_vinod, mingo, efault, vincent.guittot

From: Rik van Riel <riel@redhat.com>

Smaller NUMA systems tend to have all NUMA nodes directly connected
to each other. This includes the degenerate case of a system with just
one node, ie. a non-NUMA system.

Larger systems can have two kinds of NUMA topology, which affects how
tasks and memory should be placed on the system.

On glueless mesh systems, nodes that are not directly connected to
each other will bounce traffic through intermediary nodes. Task groups
can be run closer to each other by moving tasks from a node to an
intermediary node between it and the task's preferred node.

On NUMA systems with backplane controllers, the intermediary hops
are incapable of running programs. This creates "islands" of nodes
that are at an equal distance to anywhere else in the system.

Each kind of topology requires a slightly different placement
algorithm; this patch provides the mechanism to detect the kind
of NUMA topology of a system.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 include/linux/topology.h |  7 +++++++
 kernel/sched/core.c      | 53 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 60 insertions(+)

diff --git a/include/linux/topology.h b/include/linux/topology.h
index 33002f4..bf40d46 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -49,6 +49,13 @@
 int arch_update_cpu_topology(void);
 extern int node_hops(int i, int j);
 
+enum numa_topology_type {
+	NUMA_DIRECT,
+	NUMA_GLUELESS_MESH,
+	NUMA_BACKPLANE,
+};
+extern enum numa_topology_type sched_numa_topology_type;
+
 /* Conform to ACPI 2.0 SLIT distance definitions */
 #define LOCAL_DISTANCE		10
 #define REMOTE_DISTANCE		20
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0cf501e..1898914 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6075,6 +6075,7 @@ static void claim_allocations(int cpu, struct sched_domain *sd)
 
 #ifdef CONFIG_NUMA
 static int sched_domains_numa_levels;
+enum numa_topology_type sched_numa_topology_type;
 static int *sched_domains_numa_distance;
 static int *sched_domains_numa_hops;
 static struct cpumask ***sched_domains_numa_masks;
@@ -6276,6 +6277,56 @@ static bool find_numa_distance(int distance)
 	return false;
 }
 
+/*
+ * A system can have three types of NUMA topology:
+ * NUMA_DIRECT: all nodes are directly connected, or not a NUMA system
+ * NUMA_GLUELESS_MESH: some nodes reachable through intermediary nodes
+ * NUMA_BACKPLANE: nodes can reach other nodes through a backplane
+ *
+ * The difference between a glueless mesh topology and a backplane
+ * topology lies in whether communication between not directly
+ * connected nodes goes through intermediary nodes (where programs
+ * could run), or through backplane controllers. This affects
+ * placement of programs.
+ *
+ * The type of topology can be discerned with the following tests:
+ * - If the maximum distance between any nodes is 1 hop, the system
+ *   is directly connected.
+ * - If for two nodes A and B, located N > 1 hops away from each other,
+ *   there is an intermediary node C, which is < N hops away from both
+ *   nodes A and B, the system is a glueless mesh.
+ */
+static void init_numa_topology_type(void)
+{
+	int a, b, c, n;
+
+	n = sched_domains_numa_levels;
+
+	if (n <= 1)
+		sched_numa_topology_type = NUMA_DIRECT;
+
+	for_each_online_node(a) {
+		for_each_online_node(b) {
+			/* Find two nodes furthest removed from each other. */
+			if (node_hops(a, b) < n)
+				continue;
+
+			/* Is there an intermediary node between a and b? */
+			for_each_online_node(c) {
+				if (node_hops(a, c) < n &&
+				    node_hops(b, c) < n) {
+					sched_numa_topology_type =
+							NUMA_GLUELESS_MESH;
+					return;
+				}
+			}
+
+			sched_numa_topology_type = NUMA_BACKPLANE;
+			return;
+		}
+	}
+}
+
 static void sched_init_numa(void)
 {
 	int next_distance, curr_distance = node_distance(0, 0);
@@ -6425,6 +6476,8 @@ static void sched_init_numa(void)
 	sched_domain_topology = tl;
 
 	sched_domains_numa_levels = level;
+
+	init_numa_topology_type();
 }
 
 static void sched_domains_numa_masks_set(int cpu)
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC 2/5] sched,numa: classify the NUMA topology of a system
  2014-10-08 19:37 ` [PATCH RFC 2/5] sched,numa: classify the NUMA topology of a system riel
@ 2014-10-12 14:30   ` Peter Zijlstra
  2014-10-13  7:12     ` Rik van Riel
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2014-10-12 14:30 UTC (permalink / raw)
  To: riel; +Cc: linux-kernel, mgorman, chegu_vinod, mingo, efault,
	vincent.guittot

On Wed, Oct 08, 2014 at 03:37:27PM -0400, riel@redhat.com wrote:
> +static void init_numa_topology_type(void)
> +{
> +	int a, b, c, n;
> +
> +	n = sched_domains_numa_levels;
> +
> +	if (n <= 1)
> +		sched_numa_topology_type = NUMA_DIRECT;
> +
> +	for_each_online_node(a) {
> +		for_each_online_node(b) {
> +			/* Find two nodes furthest removed from each other. */
> +			if (node_hops(a, b) < n)
> +				continue;
> +
> +			/* Is there an intermediary node between a and b? */
> +			for_each_online_node(c) {
> +				if (node_hops(a, c) < n &&
> +				    node_hops(b, c) < n) {
> +					sched_numa_topology_type =
> +							NUMA_GLUELESS_MESH;
> +					return;
> +				}
> +			}
> +
> +			sched_numa_topology_type = NUMA_BACKPLANE;
> +			return;
> +		}
> +	}
> +}

We can find max_distance nodes in sched_init_numa(), right? Could we not
avoid this second iteration?

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC 2/5] sched,numa: classify the NUMA topology of a system
  2014-10-12 14:30   ` Peter Zijlstra
@ 2014-10-13  7:12     ` Rik van Riel
  0 siblings, 0 replies; 19+ messages in thread
From: Rik van Riel @ 2014-10-13  7:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mgorman, chegu_vinod, mingo, efault,
	vincent.guittot

On 10/12/2014 10:30 AM, Peter Zijlstra wrote:
> On Wed, Oct 08, 2014 at 03:37:27PM -0400, riel@redhat.com wrote:
>> +static void init_numa_topology_type(void)
>> +{
>> +	int a, b, c, n;
>> +
>> +	n = sched_domains_numa_levels;
>> +
>> +	if (n <= 1)
>> +		sched_numa_topology_type = NUMA_DIRECT;
>> +
>> +	for_each_online_node(a) {
>> +		for_each_online_node(b) {
>> +			/* Find two nodes furthest removed from each other. */
>> +			if (node_hops(a, b) < n)
>> +				continue;
>> +
>> +			/* Is there an intermediary node between a and b? */
>> +			for_each_online_node(c) {
>> +				if (node_hops(a, c) < n &&
>> +				    node_hops(b, c) < n) {
>> +					sched_numa_topology_type =
>> +							NUMA_GLUELESS_MESH;
>> +					return;
>> +				}
>> +			}
>> +
>> +			sched_numa_topology_type = NUMA_BACKPLANE;
>> +			return;
>> +		}
>> +	}
>> +}
>
> We can find max_distance nodes in sched_init_numa(), right? Could we not
> avoid this second iteration?
>
It's not about finding the max distance, but about finding
two nodes that are at the maximum distance from each other,
in order to see whether there are intermediate nodes
between the two.

I suppose we could just directly access the node_hops
tables to get two nodes that are the furthest away from
each other, but I suspect that searching that table
directly will not make a significant difference compared
with using the macros above.

Especially since this code is only ever run once at system
bootup.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH RFC 3/5] sched,numa: preparations for complex topology placement
  2014-10-08 19:37 [PATCH RFC 0/5] sched,numa: task placement with complex NUMA topologies riel
  2014-10-08 19:37 ` [PATCH RFC 1/5] sched,numa: build table of node hop distance riel
  2014-10-08 19:37 ` [PATCH RFC 2/5] sched,numa: classify the NUMA topology of a system riel
@ 2014-10-08 19:37 ` riel
  2014-10-12 14:37   ` Peter Zijlstra
  2014-10-08 19:37 ` [PATCH RFC 4/5] sched,numa: calculate node scores in complex NUMA topologies riel
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 19+ messages in thread
From: riel @ 2014-10-08 19:37 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, mgorman, chegu_vinod, mingo, efault, vincent.guittot

From: Rik van Riel <riel@redhat.com>

Preparatory patch for adding NUMA placement on systems with
complex NUMA topology. Also fix a potential divide by zero
in group_weight()

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 include/linux/topology.h |  1 +
 kernel/sched/core.c      |  2 +-
 kernel/sched/fair.c      | 57 +++++++++++++++++++++++++++++++-----------------
 3 files changed, 39 insertions(+), 21 deletions(-)

diff --git a/include/linux/topology.h b/include/linux/topology.h
index bf40d46..f8dfad9 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -47,6 +47,7 @@
 		if (nr_cpus_node(node))
 
 int arch_update_cpu_topology(void);
+extern int sched_domains_numa_levels;
 extern int node_hops(int i, int j);
 
 enum numa_topology_type {
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1898914..2528f97 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6074,7 +6074,7 @@ static void claim_allocations(int cpu, struct sched_domain *sd)
 }
 
 #ifdef CONFIG_NUMA
-static int sched_domains_numa_levels;
+int sched_domains_numa_levels;
 enum numa_topology_type sched_numa_topology_type;
 static int *sched_domains_numa_distance;
 static int *sched_domains_numa_hops;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6d44052..8b3f884 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -930,9 +930,10 @@ static inline unsigned long group_faults_cpu(struct numa_group *group, int nid)
  * larger multiplier, in order to group tasks together that are almost
  * evenly spread out between numa nodes.
  */
-static inline unsigned long task_weight(struct task_struct *p, int nid)
+static inline unsigned long task_weight(struct task_struct *p, int nid,
+					int hops)
 {
-	unsigned long total_faults;
+	unsigned long faults, total_faults;
 
 	if (!p->numa_faults_memory)
 		return 0;
@@ -942,15 +943,25 @@ static inline unsigned long task_weight(struct task_struct *p, int nid)
 	if (!total_faults)
 		return 0;
 
-	return 1000 * task_faults(p, nid) / total_faults;
+	faults = task_faults(p, nid);
+	return 1000 * faults / total_faults;
 }
 
-static inline unsigned long group_weight(struct task_struct *p, int nid)
+static inline unsigned long group_weight(struct task_struct *p, int nid,
+					 int hops)
 {
-	if (!p->numa_group || !p->numa_group->total_faults)
+	unsigned long faults, total_faults;
+
+	if (!p->numa_group)
+		return 0;
+
+	total_faults = p->numa_group->total_faults;
+
+	if (!total_faults)
 		return 0;
 
-	return 1000 * group_faults(p, nid) / p->numa_group->total_faults;
+	faults = group_faults(p, nid);
+	return 1000 * faults / total_faults;
 }
 
 bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
@@ -1083,6 +1094,7 @@ struct task_numa_env {
 	struct numa_stats src_stats, dst_stats;
 
 	int imbalance_pct;
+	int hops;
 
 	struct task_struct *best_task;
 	long best_imp;
@@ -1162,6 +1174,7 @@ static void task_numa_compare(struct task_numa_env *env,
 	long load;
 	long imp = env->p->numa_group ? groupimp : taskimp;
 	long moveimp = imp;
+	int hops = env->hops;
 
 	rcu_read_lock();
 	cur = ACCESS_ONCE(dst_rq->curr);
@@ -1185,8 +1198,8 @@ static void task_numa_compare(struct task_numa_env *env,
 		 * in any group then look only at task weights.
 		 */
 		if (cur->numa_group == env->p->numa_group) {
-			imp = taskimp + task_weight(cur, env->src_nid) -
-			      task_weight(cur, env->dst_nid);
+			imp = taskimp + task_weight(cur, env->src_nid, hops) -
+			      task_weight(cur, env->dst_nid, hops);
 			/*
 			 * Add some hysteresis to prevent swapping the
 			 * tasks within a group over tiny differences.
@@ -1200,11 +1213,11 @@ static void task_numa_compare(struct task_numa_env *env,
 			 * instead.
 			 */
 			if (cur->numa_group)
-				imp += group_weight(cur, env->src_nid) -
-				       group_weight(cur, env->dst_nid);
+				imp += group_weight(cur, env->src_nid, hops) -
+				       group_weight(cur, env->dst_nid, hops);
 			else
-				imp += task_weight(cur, env->src_nid) -
-				       task_weight(cur, env->dst_nid);
+				imp += task_weight(cur, env->src_nid, hops) -
+				       task_weight(cur, env->dst_nid, hops);
 		}
 	}
 
@@ -1303,7 +1316,7 @@ static int task_numa_migrate(struct task_struct *p)
 	};
 	struct sched_domain *sd;
 	unsigned long taskweight, groupweight;
-	int nid, ret;
+	int nid, ret, hops;
 	long taskimp, groupimp;
 
 	/*
@@ -1331,12 +1344,13 @@ static int task_numa_migrate(struct task_struct *p)
 		return -EINVAL;
 	}
 
-	taskweight = task_weight(p, env.src_nid);
-	groupweight = group_weight(p, env.src_nid);
-	update_numa_stats(&env.src_stats, env.src_nid);
 	env.dst_nid = p->numa_preferred_nid;
-	taskimp = task_weight(p, env.dst_nid) - taskweight;
-	groupimp = group_weight(p, env.dst_nid) - groupweight;
+	hops = env.hops = node_hops(env.src_nid, env.dst_nid);
+	taskweight = task_weight(p, env.src_nid, hops);
+	groupweight = group_weight(p, env.src_nid, hops);
+	update_numa_stats(&env.src_stats, env.src_nid);
+	taskimp = task_weight(p, env.dst_nid, hops) - taskweight;
+	groupimp = group_weight(p, env.dst_nid, hops) - groupweight;
 	update_numa_stats(&env.dst_stats, env.dst_nid);
 
 	/* Try to find a spot on the preferred nid. */
@@ -1348,12 +1362,15 @@ static int task_numa_migrate(struct task_struct *p)
 			if (nid == env.src_nid || nid == p->numa_preferred_nid)
 				continue;
 
+			hops = node_hops(env.src_nid, env.dst_nid);
+
 			/* Only consider nodes where both task and groups benefit */
-			taskimp = task_weight(p, nid) - taskweight;
-			groupimp = group_weight(p, nid) - groupweight;
+			taskimp = task_weight(p, nid, hops) - taskweight;
+			groupimp = group_weight(p, nid, hops) - groupweight;
 			if (taskimp < 0 && groupimp < 0)
 				continue;
 
+			env.hops = hops;
 			env.dst_nid = nid;
 			update_numa_stats(&env.dst_stats, env.dst_nid);
 			task_numa_find_cpu(&env, taskimp, groupimp);
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC 3/5] sched,numa: preparations for complex topology placement
  2014-10-08 19:37 ` [PATCH RFC 3/5] sched,numa: preparations for complex topology placement riel
@ 2014-10-12 14:37   ` Peter Zijlstra
  2014-10-13  7:12     ` Rik van Riel
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2014-10-12 14:37 UTC (permalink / raw)
  To: riel; +Cc: linux-kernel, mgorman, chegu_vinod, mingo, efault,
	vincent.guittot

On Wed, Oct 08, 2014 at 03:37:28PM -0400, riel@redhat.com wrote:
> From: Rik van Riel <riel@redhat.com>
> 
> Preparatory patch for adding NUMA placement on systems with
> complex NUMA topology. Also fix a potential divide by zero
> in group_weight()
> 
> Signed-off-by: Rik van Riel <riel@redhat.com>
> ---
>  include/linux/topology.h |  1 +
>  kernel/sched/core.c      |  2 +-
>  kernel/sched/fair.c      | 57 +++++++++++++++++++++++++++++++-----------------
>  3 files changed, 39 insertions(+), 21 deletions(-)
> 
> diff --git a/include/linux/topology.h b/include/linux/topology.h
> index bf40d46..f8dfad9 100644
> --- a/include/linux/topology.h
> +++ b/include/linux/topology.h
> @@ -47,6 +47,7 @@
>  		if (nr_cpus_node(node))
>  
>  int arch_update_cpu_topology(void);
> +extern int sched_domains_numa_levels;
>  extern int node_hops(int i, int j);
>  

I'm not sure we want to share this globally, please consider using
kernel/sched/sched.h

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC 3/5] sched,numa: preparations for complex topology placement
  2014-10-12 14:37   ` Peter Zijlstra
@ 2014-10-13  7:12     ` Rik van Riel
  0 siblings, 0 replies; 19+ messages in thread
From: Rik van Riel @ 2014-10-13  7:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mgorman, chegu_vinod, mingo, efault,
	vincent.guittot

On 10/12/2014 10:37 AM, Peter Zijlstra wrote:
> On Wed, Oct 08, 2014 at 03:37:28PM -0400, riel@redhat.com wrote:
>> From: Rik van Riel <riel@redhat.com>
>>
>> Preparatory patch for adding NUMA placement on systems with
>> complex NUMA topology. Also fix a potential divide by zero
>> in group_weight()
>>
>> Signed-off-by: Rik van Riel <riel@redhat.com>
>> ---
>>   include/linux/topology.h |  1 +
>>   kernel/sched/core.c      |  2 +-
>>   kernel/sched/fair.c      | 57 +++++++++++++++++++++++++++++++-----------------
>>   3 files changed, 39 insertions(+), 21 deletions(-)
>>
>> diff --git a/include/linux/topology.h b/include/linux/topology.h
>> index bf40d46..f8dfad9 100644
>> --- a/include/linux/topology.h
>> +++ b/include/linux/topology.h
>> @@ -47,6 +47,7 @@
>>   		if (nr_cpus_node(node))
>>
>>   int arch_update_cpu_topology(void);
>> +extern int sched_domains_numa_levels;
>>   extern int node_hops(int i, int j);
>>
>
> I'm not sure we want to share this globally, please consider using
> kernel/sched/sched.h
>
Good point. I will do that.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH RFC 4/5] sched,numa: calculate node scores in complex NUMA topologies
  2014-10-08 19:37 [PATCH RFC 0/5] sched,numa: task placement with complex NUMA topologies riel
                   ` (2 preceding siblings ...)
  2014-10-08 19:37 ` [PATCH RFC 3/5] sched,numa: preparations for complex topology placement riel
@ 2014-10-08 19:37 ` riel
  2014-10-12 14:53   ` Peter Zijlstra
  2014-10-08 19:37 ` [PATCH RFC 5/5] sched,numa: find the preferred nid with complex NUMA topology riel
       [not found] ` <4168C988EBDF2141B4E0B6475B6A73D126F58E4F@G6W2504.americas.hpqcorp.net>
  5 siblings, 1 reply; 19+ messages in thread
From: riel @ 2014-10-08 19:37 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, mgorman, chegu_vinod, mingo, efault, vincent.guittot

From: Rik van Riel <riel@redhat.com>

In order to do task placement on systems with complex NUMA topologies,
it is necessary to count the faults on nodes nearby the node that is
being examined for a potential move.

In case of a system with a backplane interconnect, we are dealing with
groups of NUMA nodes; each of the nodes within a group is the same number
of hops away from nodes in other groups in the system. Optimal placement
on this topology is achieved by counting all nearby nodes equally. When
comparing nodes A and B at distance N, nearby nodes are those at distances
smaller than N from nodes A or B.

Placement strategy on a system with a glueless mesh NUMA topology needs
to be different, because there are no natural groups of nodes determined
by the hardware. Instead, when dealing with two nodes A and B at distance
N, N >= 2, there will be intermediate nodes at distance < N from both nodes
A and B. Good placement can be achieved by right shifting the faults on
nearby nodes by the number of hops from the node being scored. In this
context, a nearby node is any node less than the maximum distance in the
system away from the node. Those nodes are skipped for efficiency reasons,
there is no real policy reason to do so.

Placement policy on directly connected NUMA systems is not affected.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 kernel/sched/fair.c | 68 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 68 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8b3f884..fb22caf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -924,6 +924,65 @@ static inline unsigned long group_faults_cpu(struct numa_group *group, int nid)
 		group->faults_cpu[task_faults_idx(nid, 1)];
 }
 
+/* Handle placement on systems where not all nodes are directly connected. */
+static unsigned long score_nearby_nodes(struct task_struct *p, int nid,
+					int hoplimit, bool task)
+{
+	unsigned long score = 0;
+	int node;
+
+	/*
+	 * All nodes are directly connected, and the same distance
+	 * from each other. No need for fancy placement algorithms.
+	 */
+	if (sched_numa_topology_type == NUMA_DIRECT)
+		return 0;
+
+	for_each_online_node(node) {
+		unsigned long faults;
+		int hops = node_hops(nid, node);
+
+		/*
+		 * The furthest away nodes in the system are not interesting
+		 * for placement; nid was already counted.
+		 */
+		if (hops == sched_domains_numa_levels || node == nid)
+			continue;
+
+		/*
+		 * On systems with a backplane NUMA topology, compare groups
+		 * of nodes, and move tasks towards the group with the most
+		 * memory accesses. When comparing two nodes at distance
+		 * "hoplimit", only nodes closer by than "hoplimit" are part
+		 * of each group. Skip other nodes.
+		 */
+		if (sched_numa_topology_type == NUMA_BACKPLANE &&
+					hops >= hoplimit)
+			continue;
+
+		/* Add up the faults from nearby nodes. */
+		if (task)
+			faults = task_faults(p, node);
+		else
+			faults = group_faults(p, node);
+
+		/*
+		 * On systems with a glueless mesh NUMA topology, there are
+		 * no fixed "groups of nodes". Instead, nodes that are not
+		 * directly connected bounce traffic through intermediate
+		 * nodes; a numa_group can occupy any set of nodes. Counting
+		 * the faults on nearby hops progressively less as distance
+		 * increases seems to result in good task placement.
+		 */
+		if (sched_numa_topology_type == NUMA_GLUELESS_MESH)
+			faults >>= hops;
+
+		score += faults;
+	}
+
+	return score;
+}
+
 /*
  * These return the fraction of accesses done by a particular task, or
  * task group, on a particular numa node.  The group weight is given a
@@ -944,6 +1003,8 @@ static inline unsigned long task_weight(struct task_struct *p, int nid,
 		return 0;
 
 	faults = task_faults(p, nid);
+	faults += score_nearby_nodes(p, nid, hops, true);
+
 	return 1000 * faults / total_faults;
 }
 
@@ -961,6 +1022,8 @@ static inline unsigned long group_weight(struct task_struct *p, int nid,
 		return 0;
 
 	faults = group_faults(p, nid);
+	faults += score_nearby_nodes(p, nid, hops, false);
+
 	return 1000 * faults / total_faults;
 }
 
@@ -1363,6 +1426,11 @@ static int task_numa_migrate(struct task_struct *p)
 				continue;
 
 			hops = node_hops(env.src_nid, env.dst_nid);
+			if (sched_numa_topology_type == NUMA_BACKPLANE &&
+						hops != env.hops) {
+				taskweight = task_weight(p, env.src_nid, hops);
+				groupweight = group_weight(p, env.src_nid, hops);
+			}
 
 			/* Only consider nodes where both task and groups benefit */
 			taskimp = task_weight(p, nid, hops) - taskweight;
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC 4/5] sched,numa: calculate node scores in complex NUMA topologies
  2014-10-08 19:37 ` [PATCH RFC 4/5] sched,numa: calculate node scores in complex NUMA topologies riel
@ 2014-10-12 14:53   ` Peter Zijlstra
  2014-10-13  7:15     ` Rik van Riel
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2014-10-12 14:53 UTC (permalink / raw)
  To: riel; +Cc: linux-kernel, mgorman, chegu_vinod, mingo, efault,
	vincent.guittot

On Wed, Oct 08, 2014 at 03:37:29PM -0400, riel@redhat.com wrote:
> From: Rik van Riel <riel@redhat.com>
> 
> In order to do task placement on systems with complex NUMA topologies,
> it is necessary to count the faults on nodes nearby the node that is
> being examined for a potential move.
> 
> In case of a system with a backplane interconnect, we are dealing with
> groups of NUMA nodes; each of the nodes within a group is the same number
> of hops away from nodes in other groups in the system. Optimal placement
> on this topology is achieved by counting all nearby nodes equally. When
> comparing nodes A and B at distance N, nearby nodes are those at distances
> smaller than N from nodes A or B.
> 
> Placement strategy on a system with a glueless mesh NUMA topology needs
> to be different, because there are no natural groups of nodes determined
> by the hardware. Instead, when dealing with two nodes A and B at distance
> N, N >= 2, there will be intermediate nodes at distance < N from both nodes
> A and B. Good placement can be achieved by right shifting the faults on
> nearby nodes by the number of hops from the node being scored. In this
> context, a nearby node is any node less than the maximum distance in the
> system away from the node. Those nodes are skipped for efficiency reasons,
> there is no real policy reason to do so.


> +/* Handle placement on systems where not all nodes are directly connected. */
> +static unsigned long score_nearby_nodes(struct task_struct *p, int nid,
> +					int hoplimit, bool task)
> +{
> +	unsigned long score = 0;
> +	int node;
> +
> +	/*
> +	 * All nodes are directly connected, and the same distance
> +	 * from each other. No need for fancy placement algorithms.
> +	 */
> +	if (sched_numa_topology_type == NUMA_DIRECT)
> +		return 0;
> +
> +	for_each_online_node(node) {

> +	}
> +
> +	return score;
> +}

> @@ -944,6 +1003,8 @@ static inline unsigned long task_weight(struct task_struct *p, int nid,
>  		return 0;
>  
>  	faults = task_faults(p, nid);
> +	faults += score_nearby_nodes(p, nid, hops, true);
> +
>  	return 1000 * faults / total_faults;
>  }

> @@ -961,6 +1022,8 @@ static inline unsigned long group_weight(struct task_struct *p, int nid,
>  		return 0;
>  
>  	faults = group_faults(p, nid);
> +	faults += score_nearby_nodes(p, nid, hops, false);
> +
>  	return 1000 * faults / total_faults;
>  }

So this makes {task,group}_weight() O(nr_nodes), and we call these
function from O(nr_nodes) loops, giving a total of O(nr_nodes^2)
computational complexity, right?

Seems important to mention; I realize this is only for !DIRECT, but
still, I bet the real large people (those same 512 nodes we had
previous) would not really appreciate this.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC 4/5] sched,numa: calculate node scores in complex NUMA topologies
  2014-10-12 14:53   ` Peter Zijlstra
@ 2014-10-13  7:15     ` Rik van Riel
  0 siblings, 0 replies; 19+ messages in thread
From: Rik van Riel @ 2014-10-13  7:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mgorman, chegu_vinod, mingo, efault,
	vincent.guittot

On 10/12/2014 10:53 AM, Peter Zijlstra wrote:
> On Wed, Oct 08, 2014 at 03:37:29PM -0400, riel@redhat.com wrote:
>> From: Rik van Riel <riel@redhat.com>
>>
>> In order to do task placement on systems with complex NUMA topologies,
>> it is necessary to count the faults on nodes nearby the node that is
>> being examined for a potential move.
>>
>> In case of a system with a backplane interconnect, we are dealing with
>> groups of NUMA nodes; each of the nodes within a group is the same number
>> of hops away from nodes in other groups in the system. Optimal placement
>> on this topology is achieved by counting all nearby nodes equally. When
>> comparing nodes A and B at distance N, nearby nodes are those at distances
>> smaller than N from nodes A or B.
>>
>> Placement strategy on a system with a glueless mesh NUMA topology needs
>> to be different, because there are no natural groups of nodes determined
>> by the hardware. Instead, when dealing with two nodes A and B at distance
>> N, N >= 2, there will be intermediate nodes at distance < N from both nodes
>> A and B. Good placement can be achieved by right shifting the faults on
>> nearby nodes by the number of hops from the node being scored. In this
>> context, a nearby node is any node less than the maximum distance in the
>> system away from the node. Those nodes are skipped for efficiency reasons,
>> there is no real policy reason to do so.
>
>
>> +/* Handle placement on systems where not all nodes are directly connected. */
>> +static unsigned long score_nearby_nodes(struct task_struct *p, int nid,
>> +					int hoplimit, bool task)
>> +{
>> +	unsigned long score = 0;
>> +	int node;
>> +
>> +	/*
>> +	 * All nodes are directly connected, and the same distance
>> +	 * from each other. No need for fancy placement algorithms.
>> +	 */
>> +	if (sched_numa_topology_type == NUMA_DIRECT)
>> +		return 0;
>> +
>> +	for_each_online_node(node) {
>
>> +	}
>> +
>> +	return score;
>> +}
>
>> @@ -944,6 +1003,8 @@ static inline unsigned long task_weight(struct task_struct *p, int nid,
>>   		return 0;
>>
>>   	faults = task_faults(p, nid);
>> +	faults += score_nearby_nodes(p, nid, hops, true);
>> +
>>   	return 1000 * faults / total_faults;
>>   }
>
>> @@ -961,6 +1022,8 @@ static inline unsigned long group_weight(struct task_struct *p, int nid,
>>   		return 0;
>>
>>   	faults = group_faults(p, nid);
>> +	faults += score_nearby_nodes(p, nid, hops, false);
>> +
>>   	return 1000 * faults / total_faults;
>>   }
>
> So this makes {task,group}_weight() O(nr_nodes), and we call these
> function from O(nr_nodes) loops, giving a total of O(nr_nodes^2)
> computational complexity, right?
>
> Seems important to mention; I realize this is only for !DIRECT, but
> still, I bet the real large people (those same 512 nodes we had
> previous) would not really appreciate this.
>
If desired, we can set up a nodemask for each node that
covers only the nodes that are less than the maximum
distance away from each other.

Even on the very large systems, that is likely to be
less than a dozen nodes.

I am not sure whether to add that complexity now, or
whether that should be considered premature optimization :)


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH RFC 5/5] sched,numa: find the preferred nid with complex NUMA topology
  2014-10-08 19:37 [PATCH RFC 0/5] sched,numa: task placement with complex NUMA topologies riel
                   ` (3 preceding siblings ...)
  2014-10-08 19:37 ` [PATCH RFC 4/5] sched,numa: calculate node scores in complex NUMA topologies riel
@ 2014-10-08 19:37 ` riel
  2014-10-12 14:56   ` Peter Zijlstra
       [not found] ` <4168C988EBDF2141B4E0B6475B6A73D126F58E4F@G6W2504.americas.hpqcorp.net>
  5 siblings, 1 reply; 19+ messages in thread
From: riel @ 2014-10-08 19:37 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, mgorman, chegu_vinod, mingo, efault, vincent.guittot

From: Rik van Riel <riel@redhat.com>

On systems with complex NUMA topologies, the node scoring is adjusted
to allow workloads to converge on nodes that are near each other.

The way a task group's preferred nid is determined needs to be adjusted,
in order for the preferred_nid to be consistent with group_weight scoring.
This ensures that we actually try to converge workloads on adjacent nodes.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 kernel/sched/fair.c | 83 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 82 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fb22caf..17ebf41 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1642,6 +1642,87 @@ static u64 numa_get_avg_runtime(struct task_struct *p, u64 *period)
 	return delta;
 }
 
+/*
+ * Determine the preferred nid for a task in a numa_group. This needs to
+ * be done in a way that produces consistent results with group_weight,
+ * otherwise workloads might not converge.
+ */ 
+static int preferred_group_nid(struct task_struct *p, int nid)
+{
+	nodemask_t nodes;
+	int hops;
+
+	/* Direct connections between all NUMA nodes. */
+	if (sched_numa_topology_type == NUMA_DIRECT)
+		return nid;
+
+	/*
+	 * On a system with glueless mesh NUMA topology, group_weight
+	 * scores nodes according to the number of NUMA hinting faults on
+	 * both the node itself, and on nearby nodes.
+	 */
+	if (sched_numa_topology_type == NUMA_GLUELESS_MESH) {
+		unsigned long score, max_score = 0;
+		int node, max_node = nid;
+
+		hops = sched_domains_numa_levels;
+
+		for_each_online_node(node) {
+			score = group_weight(p, node, hops);
+			if (score > max_score) {
+				max_score = score;
+				max_node = node;
+			}
+		}
+		return max_node;
+	}
+
+	/*
+	 * Finding the preferred nid in a system with NUMA backplane
+	 * interconnect topology is more involved. The goal is to locate
+	 * tasks from numa_groups near each other in the system, and
+	 * untangle workloads from different sides of the system. This requires
+	 * searching down the hierarchy of node groups, recursively searching
+	 * inside the highest scoring group of nodes. The nodemask tricks
+	 * keep the complexity of the search down.
+	 */
+	nodes = node_online_map;
+	for (hops = sched_domains_numa_levels; hops; hops--) {
+		unsigned long max_faults = 0;
+		nodemask_t max_group;
+		int a, b;
+
+		for_each_node_mask(a, nodes) {
+			unsigned long faults = 0;
+			nodemask_t this_group;
+			nodes_clear(this_group);
+
+			/* Sum group's NUMA faults; includes a==b case. */
+			for_each_node_mask(b, nodes) {
+				if (node_hops(a, b) < hops) {
+					faults += group_faults(p, b);
+					node_set(b, this_group);
+					node_clear(b, nodes);
+				}
+			}
+
+			/* Remember the top group. */
+			if (faults > max_faults) {
+				max_faults = faults;
+				max_group = this_group;
+				/*
+				 * subtle: once hops==1 there is just one
+				 * node left, which is the preferred nid.
+				 */
+				nid = a;
+			}
+		}
+		/* Next round, evaluate the nodes within max_group. */
+		nodes = max_group;
+	}
+	return nid;
+}
+
 static void task_numa_placement(struct task_struct *p)
 {
 	int seq, nid, max_nid = -1, max_group_nid = -1;
@@ -1724,7 +1805,7 @@ static void task_numa_placement(struct task_struct *p)
 	if (p->numa_group) {
 		update_numa_active_node_mask(p->numa_group);
 		spin_unlock_irq(group_lock);
-		max_nid = max_group_nid;
+		max_nid = preferred_group_nid(p, max_group_nid);
 	}
 
 	if (max_faults) {
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC 5/5] sched,numa: find the preferred nid with complex NUMA topology
  2014-10-08 19:37 ` [PATCH RFC 5/5] sched,numa: find the preferred nid with complex NUMA topology riel
@ 2014-10-12 14:56   ` Peter Zijlstra
  2014-10-13  7:17     ` Rik van Riel
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2014-10-12 14:56 UTC (permalink / raw)
  To: riel; +Cc: linux-kernel, mgorman, chegu_vinod, mingo, efault,
	vincent.guittot

On Wed, Oct 08, 2014 at 03:37:30PM -0400, riel@redhat.com wrote:
> +static int preferred_group_nid(struct task_struct *p, int nid)
> +{
> +	nodemask_t nodes;
> +	int hops;
> +
> +	/* Direct connections between all NUMA nodes. */
> +	if (sched_numa_topology_type == NUMA_DIRECT)
> +		return nid;
> +
> +	/*
> +	 * On a system with glueless mesh NUMA topology, group_weight
> +	 * scores nodes according to the number of NUMA hinting faults on
> +	 * both the node itself, and on nearby nodes.
> +	 */
> +	if (sched_numa_topology_type == NUMA_GLUELESS_MESH) {
> +		unsigned long score, max_score = 0;
> +		int node, max_node = nid;
> +
> +		hops = sched_domains_numa_levels;
> +
> +		for_each_online_node(node) {
> +			score = group_weight(p, node, hops);
> +			if (score > max_score) {
> +				max_score = score;
> +				max_node = node;
> +			}
> +		}
> +		return max_node;
> +	}

This too is O(nr_nodes^2), right?

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC 5/5] sched,numa: find the preferred nid with complex NUMA topology
  2014-10-12 14:56   ` Peter Zijlstra
@ 2014-10-13  7:17     ` Rik van Riel
  0 siblings, 0 replies; 19+ messages in thread
From: Rik van Riel @ 2014-10-13  7:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mgorman, chegu_vinod, mingo, efault,
	vincent.guittot

On 10/12/2014 10:56 AM, Peter Zijlstra wrote:
> On Wed, Oct 08, 2014 at 03:37:30PM -0400, riel@redhat.com wrote:
>> +static int preferred_group_nid(struct task_struct *p, int nid)
>> +{
>> +	nodemask_t nodes;
>> +	int hops;
>> +
>> +	/* Direct connections between all NUMA nodes. */
>> +	if (sched_numa_topology_type == NUMA_DIRECT)
>> +		return nid;
>> +
>> +	/*
>> +	 * On a system with glueless mesh NUMA topology, group_weight
>> +	 * scores nodes according to the number of NUMA hinting faults on
>> +	 * both the node itself, and on nearby nodes.
>> +	 */
>> +	if (sched_numa_topology_type == NUMA_GLUELESS_MESH) {
>> +		unsigned long score, max_score = 0;
>> +		int node, max_node = nid;
>> +
>> +		hops = sched_domains_numa_levels;
>> +
>> +		for_each_online_node(node) {
>> +			score = group_weight(p, node, hops);
>> +			if (score > max_score) {
>> +				max_score = score;
>> +				max_node = node;
>> +			}
>> +		}
>> +		return max_node;
>> +	}
>
> This too is O(nr_nodes^2), right?
>
It is, but I suspect the glueless mesh topologies are
never larger than on the order of a dozen nodes or so.

Would you prefer me to make the optimization I proposed
in the other email, or should I just add in a comment
stating that that optimization could be made if it turns
out to be necessary?

^ permalink raw reply	[flat|nested] 19+ messages in thread

[parent not found: <4168C988EBDF2141B4E0B6475B6A73D126F58E4F@G6W2504.americas.hpqcorp.net>]

[parent not found: <54367446.3020603@redhat.com>]

* RE: [PATCH RFC 0/5] sched,numa: task placement with complex NUMA topologies
       [not found]   ` <54367446.3020603@redhat.com>
@ 2014-10-10 18:44     ` Vinod, Chegu
  0 siblings, 0 replies; 19+ messages in thread
From: Vinod, Chegu @ 2014-10-10 18:44 UTC (permalink / raw)
  To: Rik van Riel, linux-kernel@vger.kernel.org
  Cc: peterz@infradead.org, mgorman@suse.de, mingo@kernel.org,
	efault@gmx.de, vincent.guittot@linaro.org

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 12829 bytes --]

>This patch set integrates two algorithms I have previously tested,
>one for glueless mesh NUMA topologies, where NUMA nodes communicate
>with far-away nodes through intermediary nodes, and backplane
>topologies, where communication with far-away NUMA nodes happens
>through backplane controllers (which cannot run tasks).
>
>Due to the inavailability of 8 node systems, and the fact that I
>am flying out to Linuxcon Europe / Plumbers / KVM Forum on Friday,
>I have not tested these patches yet. However, with a conference (and
>many familiar faces) coming up, it seemed like a good idea to get
>the code out there, anyway.
>
>The algorithms have been tested before, on both kinds of system.
>The new thing about this series is that both algorithms have been
>integrated into the same code base, and new code to select the
>preferred_nid for tasks in numa groups.
>
>Placement of tasks on smaller, directly connected, NUMA systems
>should not be affected at all by this patch series.
>
>I am interested in reviews, as well as test results on larger
>NUMA systems :)


Tested-by: Chegu Vinod <chegu_vinod@hp.com>

---

Hi Rik,

Applied your RFC patches along with the patch : https://lkml.org/lkml/2014/10/9/604   
on a 3.17.0 kernel on ran a quick test on two platforms with different NUMA topologies:

1) 8 socket Westmere-EX system  (HT-off)
2) 8 socket IvyBridge-EX prototype system  (HT-on)
       
On both these system  ran 4 instances of a 2-socket wide SpecJbb2005 workload and  
then ran 2 instances of a 4-socket wide SpecJbb2005 workload.  Repeated the 
experiment 10 times.

Having these patches enabled the desired NUMA nodes to selected for both sized 
workloads on both these systems...which is quite an encouraging sign !

Pl. see below a sampling of memory consumed by the java processes (as reported 
by numastat captured 120 seconds into each of the runs).

Thanks
Vinod

---


1) 8 socket Westmere-EX system with the following SLIT table :


node distances:
node   0   1   2   3   4   5   6   7
  0:  10  12  17  17  19  19  19  19
  1:  12  10  17  17  19  19  19  19
  2:  17  17  10  12  19  19  19  19
  3:  17  17  12  10  19  19  19  19
  4:  19  19  19  19  10  12  17  17
  5:  19  19  19  19  12  10  17  17
  6:  19  19  19  19  17  17  10  12
  7:  19  19  19  19  17  17  12  10

1a) 4 x  2 socket wide instances  (each instance has 20 warehouse threads)

Note: The correct NUMA nodes got picked 9 out of 10 runs..

PID           Node 0 Node 1 Node 2 Node 3 Node 4 Node 5 Node 6 Node 7 Total
15758 (java)       0     33   1904   1542      0      9      0     14  3503
15759 (java)       1     27      0     36   1694   1738      5      5  3506
15760 (java)    1884   1607      8      9      0     11      0      2  3521
15761 (java)       3     26      2     10      0      2   1553   1929  3525

16536 (java)       2      0      1     23      1     31   1934   1526  3517
16537 (java)    1831   1614      0      9      0     33      2     11  3501
16538 (java)       1      0      0      5   1940   1537      2     11  3497
16539 (java)       1      1   1596   1868      0     25     12      8  3511

17504 (java)       1      1      1      9   1769   1687      3     45  3516     <
17505 (java)       1      4      1      6      6      3   1522   1971  3514
17506 (java)    1413      1      9   2039      0      9      2     33  3506  <
17507 (java)     114   1887   1430      5      7     16      3     37  3500  <

17933 (java)       0      1      0     14   1707   1751      8     28  3510
17934 (java)       0      0   1745   1729      0      7      7     33  3522
17935 (java)    1700   1771      4     25     10      3      1     28  3542
17936 (java)      16     15      0      9      0      0   1702   1765  3507

18344 (java)       5      0      7      4   1567   1910      2     11  3507
18345 (java)       1      3   1557   1932     17     26      6      8  3550
18346 (java)       8      0      2     10      4     29   1317   2124  3497
18347 (java)    1591   1856      0      8      8     28      0     16  3508

18753 (java)    1737   1786      4      5      0      1      0     18  3552
18754 (java)       5     28      0      4   1617   1833      3     26  3515
18755 (java)      21     27      0      6      0      0   1518   1937  3509
18756 (java)       1     25   1690   1785      0      0      0      8  3509

19166 (java)      41      1   1829   1628      0      9      3     13  3523
19167 (java)    1737   1736      3      8      0      1     10     15  3511
19168 (java)      33      8      8     13      2      2   1546   1905  3517
19169 (java)      36      9      4      5   1634   1802      0     10  3500

19983 (java)    1617   1818      8     34     12      5      3      2  3501
19984 (java)       0      1      2     41      0      0   1724   1743  3513
19985 (java)       1      3   1901   1570      0      0      3      9  3487
19986 (java)       3     21      0     32   1634   1816      0      6  3514

20947 (java)       0      0      2     27   1652   1819      0      3  3506
20948 (java)       0      0      2      9      1     33   1848   1635  3528
20949 (java)       1      0   1873   1597      1     31      0      4  3508
20950 (java)    1624   1821     10      9      3     33      0      3  3503

21354 (java)       1     17     30      8    601   1400      1   1445  3503  <
21355 (java)       0      3   1542   1963      0     16      1      9  3536  
21356 (java)    1761   1672     27      8      8      8      1     10  3494   <
21357 (java)       0      0     25      3   1221      8   1843    410  3511

1b) 2 x 4 socket wide instances  (each instance has 40 warehouse threads)

Note: The correct NUMA nodes got picked 9 out of 10 runs.

PID           Node 0 Node 1 Node 2 Node 3 Node 4 Node 5 Node 6 Node 7 Total
21847 (java)       1      0      1     16    821   1154   1241   1330  4563
21848 (java)    1274   1497    917    807     14      3      1     25  4537

22120 (java)       1     26      1      5    860    948   1367   1348  4555
22121 (java)    1269   1392    929    896      1      1      5     13  4505

23298 (java)     847    829   1346   1470      0     30      7      3  4533
23299 (java)       3      4      0     15    876   1130   1184   1308  4521

23603 (java)    1277   1447    951    817      2      1      8     20  4521
23604 (java)       1     10      1     28    876    831   1343   1460  4551

23874 (java)       0      3      2     37    864    853   1309   1473  4542
23875 (java)    1452   1382    859    873      1      1      1     10  4579

24148 (java)       1      1      1     21   1132   1296   1084   1039  4574
24149 (java)     896    836   1223   1615      0     30      1      8  4611

24421 (java)    1447   1413    818    849      1      0      0     24  4551
24422 (java)       1     27      1      6    865    817   1420   1427  4563

25064 (java)     861    907   1344   1404      0     24      0      7  4547
25065 (java)       1      1      0     14   1197   1376    827   1168  4583

25881 (java)       1      3      0     19   1043   1204   1123   1167  4559
25882 (java)     884    874   1402   1323      0      0      1     37  4522

26150 (java)       1      1      6    781     11    886   1347   1514  4546         <
26151 (java)    1129   1322    811    459    760     11      9     29  4531     <


----
2) 8 socket IvyBridge-EX prototype system with the following SLIT table :

node distances:
node   0   1   2   3   4   5   6   7
  0:  10  16  30  30  30  30  30  30
  1:  16  10  30  30  30  30  30  30
  2:  30  30  10  16  30  30  30  30
  3:  30  30  16  10  30  30  30  30
  4:  30  30  30  30  10  16  30  30
  5:  30  30  30  30  16  10  30  30
  6:  30  30  30  30  30  30  10  16
  7:  30  30  30  30  30  30  16  10

2a) 4 x  2 socket wide instances  (each instance has 60 warehouse threads)

PID           Node 0 Node 1 Node 2 Node 3 Node 4 Node 5 Node 6 Node 7 Total
18680 (java)       2     18      1     20      1     12   3626   4820  8500
18681 (java)       1      8      5    174   4721   3579     12      3  8504
18682 (java)       4     18   5002   3461      9     13     13      3  8523
18683 (java)    5045   3428      3      7      1     18      6      3  8511

19745 (java)    5472   2984      4      7      8     22     10     20  8527
19746 (java)       8     30   3467   4976      3     22      7      5  8519
19747 (java)       2     18      4      9   4390   4068      1      2  8493
19748 (java)       5      7      2     11      4     16   4251   4186  8482

20796 (java)    4176   4301      4      6      9     13      5      6  8517
20797 (java)      13     47     27      1   3311   5127      5      3  8535
20798 (java)       6      5   2973   5498      3     19     29     11  8544
20799 (java)       9      7      9    130      2     14   3034   5315  8519

21852 (java)       6      1     14     14     24   2043   2346   4035  8484
21853 (java)    3551   4841     10     23      4     11     86      4  8532
21854 (java)       2      1   3194   5286      2      5     17     12  8519
21855 (java)      11      3      1     12   6943   1288    166     78  8502

22907 (java)       5      1   4165   4307      8     15     21      8  8531
22908 (java)       4     10      2      2   3066   5421      4      1  8511
22909 (java)       3      2      5      8      6     16   3544   4943  8527
22910 (java)    3340   5127      1     18      6     16      9      4  8520

23955 (java)       7      2     19      3   2436   6000      2     47  8515
23956 (java)      12      2   2785   5668     25      5      3     25  8525
23957 (java)    3651   4790     15      2     24      8      1     17  8508
23958 (java)       9      4      5      1     13      5   4794   3688  8519

25011 (java)    3532   4931      1      6      8      7      3     32  8520
25012 (java)       7     14      2      8   2763   5683      1     28  8506
25013 (java)      13     12      1     11     18      5   5373   3095  8528
25014 (java)      15      9   5455   3013      4      7      8     11  8522

26060 (java)       1      1      1      3     29     28   4651   3784  8498
26061 (java)       2      1   6245   2212     13     39      0      5  8519
26062 (java)      68   5294      1      3   1879   1262      1      3  8511
26063 (java)    7522    637      1     78    107    160      1      6  8512

27116 (java)       2     37      2      2      9      5   5110   3368  8535
27117 (java)    5376   3115     12      4     10      3      5      6  8532
27118 (java)       9     22   2850   5610      4     17      1     18  8531
27119 (java)      10     26      2      3   3472   4987      1      5  8505

28162 (java)    3332   5103      1      4      8      4      7     45  8504
28163 (java)      12      4      3     18   3079   5383      1     24  8524
28164 (java)      17      9   5533   2921      9      5      1     16  8510
28165 (java)       7      2      1     16     10      7   5166   3299  8508


2b) 2 x 4 socket wide instances  (each instance has 120 warehouse threads)

PID           Node 0 Node 1 Node 2 Node 3 Node 4 Node 5 Node 6 Node 7 Total
29228 (java)       2      8   2394   3868     15     82   1130   1085  8585
29229 (java)    2824   3773     19      6    963    923     24     59  8589

29964 (java)    1084   1080     28     31   2841   3497      8      2  8569
29965 (java)      16     18   1186   1077      3      6   2841   3436  8583

30683 (java)      19   4281   1830   2137     73     94     13     26  8472
30684 (java)    3763     31     10     10     75     56   2174   2394  8514

31399 (java)       6     25   2608   3116     10     18   1424   1388  8595
31400 (java)    1518   1532      2     12   2259   3188     20     32  8562

32113 (java)     692    620     28     18   3141   4033     25     20  8577
32114 (java)       8      8   2231   2390      3     15   1955   1974  8583

32827 (java)    2324   2663     79   3423     24     20      8     13  8554
32828 (java)      15      3   2448     26    515    855   2214   2489  8566

33543 (java)    1090   1129      3     23     22     25   2923   3366  8581
33544 (java)      12     10   2686   3375   1745    726      4     22  8580

34258 (java)    2127   1928   1924   2558      7     18      8      4  8574
34259 (java)       5     23      8     11   1182   1266   2731   3347  8572

34978 (java)      16     14   1617   1964     14     23   2258   2677  8583
34979 (java)    1721   1704     62     19   2339   2726      6     10  8588

35692 (java)    3599   2721     50     58      6      9    700   1421  8564
35693 (java)      18     49     74     83   3162   4822    309     46  8563



ÿôèº{.nÇ+‰·Ÿ®‰†+%ŠËÿ±éÝ¶\x17¥Šwÿº{.nÇ+‰·¥Š{±þG«éÿŠ{ayº\x1dÊ‡Ú™ë,j\a¢f£¢·hšïêÿ‘êçz_è®\x03(éšŽŠÝ¢j"ú\x1a¶^[m§ÿÿ¾\a«þG«éÿ¢¸?™¨èÚ&£ø§~á¶iO•æ¬z·švØ^\x14\x04\x1a¶^[m§ÿÿÃ\fÿ¶ìÿ¢¸?–I¥

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2014-10-14  7:50 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-10-08 19:37 [PATCH RFC 0/5] sched,numa: task placement with complex NUMA topologies riel
2014-10-08 19:37 ` [PATCH RFC 1/5] sched,numa: build table of node hop distance riel
2014-10-12 13:17   ` Peter Zijlstra
2014-10-12 13:28     ` Rik van Riel
2014-10-14  6:47       ` Peter Zijlstra
2014-10-14  7:49         ` Rik van Riel
2014-10-08 19:37 ` [PATCH RFC 2/5] sched,numa: classify the NUMA topology of a system riel
2014-10-12 14:30   ` Peter Zijlstra
2014-10-13  7:12     ` Rik van Riel
2014-10-08 19:37 ` [PATCH RFC 3/5] sched,numa: preparations for complex topology placement riel
2014-10-12 14:37   ` Peter Zijlstra
2014-10-13  7:12     ` Rik van Riel
2014-10-08 19:37 ` [PATCH RFC 4/5] sched,numa: calculate node scores in complex NUMA topologies riel
2014-10-12 14:53   ` Peter Zijlstra
2014-10-13  7:15     ` Rik van Riel
2014-10-08 19:37 ` [PATCH RFC 5/5] sched,numa: find the preferred nid with complex NUMA topology riel
2014-10-12 14:56   ` Peter Zijlstra
2014-10-13  7:17     ` Rik van Riel
     [not found] ` <4168C988EBDF2141B4E0B6475B6A73D126F58E4F@G6W2504.americas.hpqcorp.net>
     [not found]   ` <54367446.3020603@redhat.com>
2014-10-10 18:44     ` [PATCH RFC 0/5] sched,numa: task placement with complex NUMA topologies Vinod, Chegu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).