* [PATCH 0/3] sched,numa: further numa balancing fixes
From: riel @ 2014-06-14 19:18 UTC
To: linux-kernel; +Cc: peterz, mingo, mgorman, chegu_vinod
A few more bug fixes that seem to improve convergence of
"perf bench numa mem -m -0 -P 1000 -p X -t Y" for various
values of X and Y, on both 4 and 8 node systems.
This does not address the issue I highlighted Friday:
https://lkml.org/lkml/2014/6/13/529
I have an idea on how to fix that issue, but implementing
that as part of this series would be silly, since I would
have to rip the code back out and completely rewrite it
once I started working on placement for systems with complex
NUMA topologies. Better to leave that code for then...
Patches are against the latest -next and -tip trees.
* [PATCH 1/3] sched,numa: use group's max nid as task's preferred nid
From: riel @ 2014-06-14 19:18 UTC
To: linux-kernel; +Cc: peterz, mingo, mgorman, chegu_vinod
From: Rik van Riel <riel@redhat.com>
From task_numa_placement, always try to consolidate the tasks
in a group on the group's top nid.
If this task is part of a group that is interleaved over
multiple nodes, task_numa_migrate will set the task's preferred
nid to the best node it can find for the task, so this patch
causes at most one run through task_numa_migrate.
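For illustration, a minimal stand-alone sketch of the decision this
patch boils down to (the helper name pick_preferred_nid and its bare
int arguments are made up for this sketch, not the kernel interface):

	/*
	 * Illustrative sketch only: a task that belongs to a numa group
	 * always tries the group's top node first.  If that node turns
	 * out to be a bad fit, task_numa_migrate() picks the best node
	 * it can find and updates the preferred nid itself, so at most
	 * one run through task_numa_migrate() is needed.
	 */
	static int pick_preferred_nid(struct task_struct *p,
				      int max_nid, int max_group_nid)
	{
		if (p->numa_group)
			return max_group_nid;

		return max_nid;
	}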
Signed-off-by: Rik van Riel <riel@redhat.com>
---
kernel/sched/fair.c | 17 +----------------
1 file changed, 1 insertion(+), 16 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fea7d33..86c35d6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1595,23 +1595,8 @@ static void task_numa_placement(struct task_struct *p)
if (p->numa_group) {
update_numa_active_node_mask(p->numa_group);
- /*
- * If the preferred task and group nids are different,
- * iterate over the nodes again to find the best place.
- */
- if (max_nid != max_group_nid) {
- unsigned long weight, max_weight = 0;
-
- for_each_online_node(nid) {
- weight = task_weight(p, nid) + group_weight(p, nid);
- if (weight > max_weight) {
- max_weight = weight;
- max_nid = nid;
- }
- }
- }
-
spin_unlock_irq(group_lock);
+ max_nid = max_group_nid;
}
/* Preferred node as the node with the most faults */
--
1.8.5.3
* [PATCH 2/3] sched,numa: move power adjustment into load_too_imbalanced
From: riel @ 2014-06-14 19:18 UTC
To: linux-kernel; +Cc: peterz, mingo, mgorman, chegu_vinod
From: Rik van Riel <riel@redhat.com>
Currently the NUMA code scales the load on each node with the
amount of CPU power available on that node, but it does not
apply any adjustment to the load of the task that is being
moved over.
On systems with SMT/HT, this results in a task being weighed
much more heavily than a CPU core, with the result that task
moves which would even out the load between nodes get disallowed.
The correct thing is to apply the power correction only after
the load of the task being moved has been subtracted from the
source node and added to the destination node.
This also allows us to do the power correction with a multiplication,
rather than a division.
Also drop two function arguments from load_too_imbalanced, since it
already takes those values from env.
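As a rough illustration, a simplified stand-alone sketch of the
multiplication-based correction (the real check in
load_too_imbalanced() below also compares against the pre-move
imbalance and works on the values stored in env):

	/*
	 * Comparing src_load/src_capacity against dst_load/dst_capacity
	 * is done by cross-multiplying, so no division is needed:
	 *
	 *      src_load           dst_load
	 *     ------------  vs  ------------
	 *     src_capacity      dst_capacity
	 */
	static bool sketch_above_threshold(long src_load, long src_capacity,
					   long dst_load, long dst_capacity,
					   int imbalance_pct)
	{
		long imb = dst_load * src_capacity * 100 -
			   src_load * dst_capacity * imbalance_pct;

		return imb > 0;
	}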
Signed-off-by: Rik van Riel <riel@redhat.com>
---
kernel/sched/fair.c | 39 ++++++++++++++++++++++++---------------
1 file changed, 24 insertions(+), 15 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 86c35d6..976dd73 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1062,7 +1062,6 @@ static void update_numa_stats(struct numa_stats *ns, int nid)
if (!cpus)
return;
- ns->load = (ns->load * SCHED_CAPACITY_SCALE) / ns->compute_capacity;
ns->task_capacity =
DIV_ROUND_CLOSEST(ns->compute_capacity, SCHED_CAPACITY_SCALE);
ns->has_free_capacity = (ns->nr_running < ns->task_capacity);
@@ -1096,18 +1095,30 @@ static void task_numa_assign(struct task_numa_env *env,
env->best_cpu = env->dst_cpu;
}
-static bool load_too_imbalanced(long orig_src_load, long orig_dst_load,
- long src_load, long dst_load,
+static bool load_too_imbalanced(long src_load, long dst_load,
struct task_numa_env *env)
{
long imb, old_imb;
+ long orig_src_load, orig_dst_load;
+ long src_capacity, dst_capacity;
+
+ /*
+ * The load is corrected for the CPU capacity available on each node.
+ *
+ * src_load dst_load
+ * ------------ vs ---------
+ * src_capacity dst_capacity
+ */
+ src_capacity = env->src_stats.compute_capacity;
+ dst_capacity = env->dst_stats.compute_capacity;
/* We care about the slope of the imbalance, not the direction. */
if (dst_load < src_load)
swap(dst_load, src_load);
/* Is the difference below the threshold? */
- imb = dst_load * 100 - src_load * env->imbalance_pct;
+ imb = dst_load * src_capacity * 100 -
+ src_load * dst_capacity * env->imbalance_pct;
if (imb <= 0)
return false;
@@ -1115,10 +1126,14 @@ static bool load_too_imbalanced(long orig_src_load, long orig_dst_load,
* The imbalance is above the allowed threshold.
* Compare it with the old imbalance.
*/
+ orig_src_load = env->src_stats.load;
+ orig_dst_load = env->dst_stats.load;
+
if (orig_dst_load < orig_src_load)
swap(orig_dst_load, orig_src_load);
- old_imb = orig_dst_load * 100 - orig_src_load * env->imbalance_pct;
+ old_imb = orig_dst_load * src_capacity * 100 -
+ orig_src_load * dst_capacity * env->imbalance_pct;
/* Would this change make things worse? */
return (imb > old_imb);
@@ -1136,8 +1151,7 @@ static void task_numa_compare(struct task_numa_env *env,
struct rq *src_rq = cpu_rq(env->src_cpu);
struct rq *dst_rq = cpu_rq(env->dst_cpu);
struct task_struct *cur;
- long orig_src_load, src_load;
- long orig_dst_load, dst_load;
+ long src_load, dst_load;
long load;
long imp = (groupimp > 0) ? groupimp : taskimp;
@@ -1211,13 +1225,9 @@ static void task_numa_compare(struct task_numa_env *env,
* In the overloaded case, try and keep the load balanced.
*/
balance:
- orig_dst_load = env->dst_stats.load;
- orig_src_load = env->src_stats.load;
-
- /* XXX missing capacity terms */
load = task_h_load(env->p);
- dst_load = orig_dst_load + load;
- src_load = orig_src_load - load;
+ dst_load = env->dst_stats.load + load;
+ src_load = env->src_stats.load - load;
if (cur) {
load = task_h_load(cur);
@@ -1225,8 +1235,7 @@ static void task_numa_compare(struct task_numa_env *env,
src_load += load;
}
- if (load_too_imbalanced(orig_src_load, orig_dst_load,
- src_load, dst_load, env))
+ if (load_too_imbalanced(src_load, dst_load, env))
goto unlock;
assign:
--
1.8.5.3
* [PATCH 3/3] sched,numa: use effective_load to balance NUMA loads
From: riel @ 2014-06-14 19:18 UTC
To: linux-kernel; +Cc: peterz, mingo, mgorman, chegu_vinod
From: Rik van Riel <riel@redhat.com>
When CONFIG_FAIR_GROUP_SCHED is enabled, the load that a task places
on a CPU is determined by the group the task is in. This is conveniently
calculated for us by effective_load(), which task_numa_compare should
use.
The active groups on the source and destination CPU can be different,
so the calculation needs to be done separately for each CPU.
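A simplified stand-alone sketch of that per-CPU adjustment (the helper
name sketch_numa_move_load is made up for this illustration; the real
code lives in task_numa_compare() in the diff below):

	/*
	 * effective_load(tg, cpu, wl, wg) translates a raw change in
	 * task weight into the load change seen at the root cfs_rq of
	 * that CPU, which depends on the group hierarchy active there.
	 * Moving p from src_cpu to dst_cpu therefore needs two
	 * separate calls, one per CPU:
	 */
	static void sketch_numa_move_load(struct task_struct *p,
					  int src_cpu, int dst_cpu,
					  long *src_load, long *dst_load)
	{
		struct task_group *tg = task_group(p);
		long load = task_h_load(p);

		*src_load += effective_load(tg, src_cpu, -load, -load);
		*dst_load += effective_load(tg, dst_cpu, load, load);
	}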
Signed-off-by: Rik van Riel <riel@redhat.com>
---
kernel/sched/fair.c | 16 ++++++++++++----
1 file changed, 12 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 976dd73..aafc37c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1151,6 +1151,7 @@ static void task_numa_compare(struct task_numa_env *env,
struct rq *src_rq = cpu_rq(env->src_cpu);
struct rq *dst_rq = cpu_rq(env->dst_cpu);
struct task_struct *cur;
+ struct task_group *tg;
long src_load, dst_load;
long load;
long imp = (groupimp > 0) ? groupimp : taskimp;
@@ -1225,14 +1226,21 @@ static void task_numa_compare(struct task_numa_env *env,
* In the overloaded case, try and keep the load balanced.
*/
balance:
+ src_load = env->src_stats.load;
+ dst_load = env->dst_stats.load;
+
+ /* Calculate the effect of moving env->p from src to dst. */
load = task_h_load(env->p);
- dst_load = env->dst_stats.load + load;
- src_load = env->src_stats.load - load;
+ tg = task_group(env->p);
+ src_load += effective_load(tg, env->src_cpu, -load, -load);
+ dst_load += effective_load(tg, env->dst_cpu, load, load);
if (cur) {
+ /* Cur moves in the opposite direction. */
load = task_h_load(cur);
- dst_load -= load;
- src_load += load;
+ tg = task_group(cur);
+ src_load += effective_load(tg, env->src_cpu, load, load);
+ dst_load += effective_load(tg, env->dst_cpu, -load, -load);
}
if (load_too_imbalanced(src_load, dst_load, env))
--
1.8.5.3