public inbox for linux-kernel@vger.kernel.org
* [PATCH RFC] sched,numa: decay wakee_flips instead of zeroing
@ 2014-05-16  4:13 Rik van Riel
  2014-05-16  6:14 ` [PATCH RFC] sched,numa: move tasks to preferred_node at wakeup time Rik van Riel
                   ` (3 more replies)
  0 siblings, 4 replies; 6+ messages in thread
From: Rik van Riel @ 2014-05-16  4:13 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, chegu_vinod, mingo, umgwanakikbuti

Affine wakeups have the potential to interfere with NUMA placement.
If a task wakes up too many other tasks, affine wakeups will get
disabled.

However, regardless of how many other tasks it wakes up, it gets
re-enabled once a second, potentially interfering with NUMA
placement of other tasks.

By decaying wakee_flips in half instead of zeroing it, we can avoid
that problem for some workloads.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4f01e2f1..0381b11 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4009,7 +4009,7 @@ static void record_wakee(struct task_struct *p)
 	 * about the loss.
 	 */
 	if (jiffies > current->wakee_flip_decay_ts + HZ) {
-		current->wakee_flips = 0;
+		current->wakee_flips >>= 1;
 		current->wakee_flip_decay_ts = jiffies;
 	}
 

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH RFC] sched,numa: move tasks to preferred_node at wakeup time
  2014-05-16  4:13 [PATCH RFC] sched,numa: decay wakee_flips instead of zeroing Rik van Riel
@ 2014-05-16  6:14 ` Rik van Riel
  2014-05-16 13:38   ` Peter Zijlstra
  2014-05-16 13:22 ` [PATCH RFC] sched,numa: decay wakee_flips instead of zeroing Peter Zijlstra
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 6+ messages in thread
From: Rik van Riel @ 2014-05-16  6:14 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, chegu_vinod, mingo, umgwanakikbuti

I do not have performance numbers yet, but with this patch on
top of all the previous ones, I see RMA/LMA ratios (as reported
by numatop) drop well below 1 for SPECjbb2013 most of the time.

In other words, most of the time we actually manage to run
processes on the node where their memory is. Sometimes tasks get
moved away from their memory for a bit, but I suspect this patch
is a step in the right direction.

With luck I will have performance numbers this afternoon (EST).

---8<---

Subject: sched,numa: move tasks to preferred_node at wakeup time

If a task is caught running off-node at wakeup time, check if it can
be moved to its preferred node without creating a load imbalance.

This is only done when the system has already decided not to do an
affine wakeup.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 kernel/sched/fair.c | 75 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 75 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0381b11..bb5b048 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4178,6 +4178,79 @@ static int wake_wide(struct task_struct *p)
 	return 0;
 }
 
+#ifdef CONFIG_NUMA_BALANCING
+static int numa_balance_on_wake(struct task_struct *p, int prev_cpu)
+{
+	long load, src_load, dst_load;
+	int cur_node = cpu_to_node(prev_cpu);
+	struct numa_group *numa_group = ACCESS_ONCE(p->numa_group);
+	struct sched_domain *sd;
+	struct task_numa_env env = {
+		.p = p,
+		.best_task = NULL,
+		.best_imp = 0,
+		.best_cpu = -1
+	};
+
+	if (!sched_feat(NUMA))
+		return prev_cpu;
+
+	if (p->numa_preferred_nid == -1)
+		return prev_cpu;
+
+	if (p->numa_preferred_nid == cur_node)
+		return prev_cpu;
+
+	if (numa_group && node_isset(cur_node, numa_group->active_nodes))
+		return prev_cpu;
+
+	sd = rcu_dereference(per_cpu(sd_numa, env.src_cpu));
+	if (sd)
+		env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2;
+
+	/*
+	 * Cpusets can break the scheduler domain tree into smaller
+	 * balance domains, some of which do not cross NUMA boundaries.
+	 * Tasks that are "trapped" in such domains cannot be migrated
+	 * elsewhere, so there is no point in (re)trying.
+	 */
+	if (unlikely(!sd)) {
+		p->numa_preferred_nid = cur_node;
+		return prev_cpu;
+	}
+
+	/*
+	 * Only allow p to move back to its preferred nid if
+	 * that does not create an imbalance that would cause
+	 * the load balancer to move a task around later.
+	 */
+	env.src_nid = cur_node;
+	env.dst_nid = p->numa_preferred_nid;
+
+	update_numa_stats(&env.src_stats, env.src_nid);
+	update_numa_stats(&env.dst_stats, env.dst_nid);
+
+	dst_load = env.dst_stats.load;
+	src_load = env.src_stats.load;
+
+	/* XXX missing power terms */
+	load = task_h_load(p);
+	dst_load += load;
+	src_load -= load;
+
+	if (load_too_imbalanced(env.src_stats.load, env.dst_stats.load,
+				src_load, dst_load, &env))
+		return prev_cpu;
+
+	return cpumask_first(cpumask_of_node(p->numa_preferred_nid));
+}
+#else
+static int numa_balance_on_wake(struct task_struct *p, int prev_cpu)
+{
+	return prev_cpu;
+}
+#endif
+
 static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 {
 	s64 this_load, load;
@@ -4440,6 +4513,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 
 	if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync))
 		prev_cpu = cpu;
+	else if (sd_flag & SD_WAKE_AFFINE)
+		prev_cpu = numa_balance_on_wake(p, prev_cpu);
 
 	if (sd_flag & SD_BALANCE_WAKE) {
 		new_cpu = select_idle_sibling(p, prev_cpu);

* Re: [PATCH RFC] sched,numa: decay wakee_flips instead of zeroing
  2014-05-16  4:13 [PATCH RFC] sched,numa: decay wakee_flips instead of zeroing Rik van Riel
  2014-05-16  6:14 ` [PATCH RFC] sched,numa: move tasks to preferred_node at wakeup time Rik van Riel
@ 2014-05-16 13:22 ` Peter Zijlstra
  2014-05-19 13:11 ` [tip:sched/core] sched,numa: Decay " tip-bot for Rik van Riel
  2014-05-22 12:29 ` [tip:sched/core] sched/numa: Decay -> " tip-bot for Rik van Riel
  3 siblings, 0 replies; 6+ messages in thread
From: Peter Zijlstra @ 2014-05-16 13:22 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, chegu_vinod, mingo, umgwanakikbuti

On Fri, May 16, 2014 at 12:13:32AM -0400, Rik van Riel wrote:
> Affine wakeups have the potential to interfere with NUMA placement.
> If a task wakes up too many other tasks, affine wakeups will get
> disabled.
> 
> However, regardless of how many other tasks it wakes up, it gets
> re-enabled once a second, potentially interfering with NUMA
> placement of other tasks.
> 
> By decaying wakee_flips in half instead of zeroing it, we can avoid
> that problem for some workloads.

See https://lkml.org/lkml/2013/7/2/110 and further

> Signed-off-by: Rik van Riel <riel@redhat.com>
> ---
>  kernel/sched/fair.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4f01e2f1..0381b11 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4009,7 +4009,7 @@ static void record_wakee(struct task_struct *p)
>  	 * about the loss.
>  	 */
>  	if (jiffies > current->wakee_flip_decay_ts + HZ) {
> -		current->wakee_flips = 0;
> +		current->wakee_flips >>= 1;
>  		current->wakee_flip_decay_ts = jiffies;
>  	}

Would it make sense to do something like:

	now = jiffies;
	while (current->wakee_flips && now > current->wakee_flip_decay_ts + HZ) {
		current->wakee_flips >>= 1;
		current->wakee_flip_decay_ts += HZ;
	}
	if (unlikely(now > current->wakee_flip_decay_ts + HZ))
		current->wakee_flip_decay_ts = now;

Or is that over engineering things?



* Re: [PATCH RFC] sched,numa: move tasks to preferred_node at wakeup time
  2014-05-16  6:14 ` [PATCH RFC] sched,numa: move tasks to preferred_node at wakeup time Rik van Riel
@ 2014-05-16 13:38   ` Peter Zijlstra
  0 siblings, 0 replies; 6+ messages in thread
From: Peter Zijlstra @ 2014-05-16 13:38 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, chegu_vinod, mingo, umgwanakikbuti

On Fri, May 16, 2014 at 02:14:50AM -0400, Rik van Riel wrote:
> +#ifdef CONFIG_NUMA_BALANCING
> +static int numa_balance_on_wake(struct task_struct *p, int prev_cpu)
> +{
> +	long load, src_load, dst_load;

> +	int cur_node = cpu_to_node(prev_cpu);
> +	struct numa_group *numa_group = ACCESS_ONCE(p->numa_group);
> +	struct sched_domain *sd;
> +	struct task_numa_env env = {
> +		.p = p,
> +		.best_task = NULL,
> +		.best_imp = 0,
> +		.best_cpu = -1
> +	};

That's all code, ideally you'd move that after we're done checking the
reasons to not do work, say somehere like...

> +
> +	if (!sched_feat(NUMA))
> +		return prev_cpu;

Yah.. :-( I think some people changed that to numabalancing_enabled.

Fixing that is still on the todo list somewhere.

> +
> +	if (p->numa_preferred_nid == -1)
> +		return prev_cpu;
> +
> +	if (p->numa_preferred_nid == cur_node)
> +		return prev_cpu;
> +
> +	if (numa_group && node_isset(cur_node, numa_group->active_nodes))
> +		return prev_cpu;
> +
> +	sd = rcu_dereference(per_cpu(sd_numa, env.src_cpu));
> +	if (sd)
> +		env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2;
> +
> +	/*
> +	 * Cpusets can break the scheduler domain tree into smaller
> +	 * balance domains, some of which do not cross NUMA boundaries.
> +	 * Tasks that are "trapped" in such domains cannot be migrated
> +	 * elsewhere, so there is no point in (re)trying.
> +	 */
> +	if (unlikely(!sd)) {

How about you bail early, and then have the above test evaporate?

> +		p->numa_preferred_nid = cur_node;
> +		return prev_cpu;
> +	}

.. here.

> +
> +	/*
> +	 * Only allow p to move back to its preferred nid if
> +	 * that does not create an imbalance that would cause
> +	 * the load balancer to move a task around later.
> +	 */
> +	env.src_nid = cur_node;
> +	env.dst_nid = p->numa_preferred_nid;
> +
> +	update_numa_stats(&env.src_stats, env.src_nid);
> +	update_numa_stats(&env.dst_stats, env.dst_nid);
> +
> +	dst_load = env.dst_stats.load;
> +	src_load = env.src_stats.load;
> +
> +	/* XXX missing power terms */
> +	load = task_h_load(p);
> +	dst_load += load;
> +	src_load -= load;
> +
> +	if (load_too_imbalanced(env.src_stats.load, env.dst_stats.load,
> +				src_load, dst_load, &env))
> +		return prev_cpu;

So I'm thinking that load_too_imbalanced() is from another patch I
haven't yet seen, lemme go see if you did send it and I missed it.

> +
> +	return cpumask_first(cpumask_of_node(p->numa_preferred_nid));
> +}

* [tip:sched/core] sched,numa: Decay wakee_flips instead of zeroing
  2014-05-16  4:13 [PATCH RFC] sched,numa: decay wakee_flips instead of zeroing Rik van Riel
  2014-05-16  6:14 ` [PATCH RFC] sched,numa: move tasks to preferred_node at wakeup time Rik van Riel
  2014-05-16 13:22 ` [PATCH RFC] sched,numa: decay wakee_flips instead of zeroing Peter Zijlstra
@ 2014-05-19 13:11 ` tip-bot for Rik van Riel
  2014-05-22 12:29 ` [tip:sched/core] sched/numa: Decay -> " tip-bot for Rik van Riel
  3 siblings, 0 replies; 6+ messages in thread
From: tip-bot for Rik van Riel @ 2014-05-19 13:11 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: linux-kernel, riel, hpa, mingo, peterz, tglx

Commit-ID:  5658b4f43e63f8c7b4a27995dcb2cf43a52ee398
Gitweb:     http://git.kernel.org/tip/5658b4f43e63f8c7b4a27995dcb2cf43a52ee398
Author:     Rik van Riel <riel@redhat.com>
AuthorDate: Fri, 16 May 2014 00:13:32 -0400
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Mon, 19 May 2014 22:02:43 +0900

sched,numa: Decay wakee_flips instead of zeroing

Affine wakeups have the potential to interfere with NUMA placement.
If a task wakes up too many other tasks, affine wakeups will get
disabled.

However, regardless of how many other tasks it wakes up, it gets
re-enabled once a second, potentially interfering with NUMA
placement of other tasks.

By decaying wakee_flips in half instead of zeroing it, we can avoid
that problem for some workloads.

Cc: chegu_vinod@hp.com
Cc: mingo@kernel.org
Cc: umgwanakikbuti@gmail.com
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140516001332.67f91af2@annuminas.surriel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 503f750..c9617b7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4065,7 +4065,7 @@ static void record_wakee(struct task_struct *p)
 	 * about the loss.
 	 */
 	if (jiffies > current->wakee_flip_decay_ts + HZ) {
-		current->wakee_flips = 0;
+		current->wakee_flips >>= 1;
 		current->wakee_flip_decay_ts = jiffies;
 	}
 

* [tip:sched/core] sched/numa: Decay -> wakee_flips instead of zeroing
  2014-05-16  4:13 [PATCH RFC] sched,numa: decay wakee_flips instead of zeroing Rik van Riel
                   ` (2 preceding siblings ...)
  2014-05-19 13:11 ` [tip:sched/core] sched,numa: Decay " tip-bot for Rik van Riel
@ 2014-05-22 12:29 ` tip-bot for Rik van Riel
  3 siblings, 0 replies; 6+ messages in thread
From: tip-bot for Rik van Riel @ 2014-05-22 12:29 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: linux-kernel, riel, hpa, mingo, peterz, tglx

Commit-ID:  096aa33863a5e48de52d2ff30e0801b7487944f4
Gitweb:     http://git.kernel.org/tip/096aa33863a5e48de52d2ff30e0801b7487944f4
Author:     Rik van Riel <riel@redhat.com>
AuthorDate: Fri, 16 May 2014 00:13:32 -0400
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 22 May 2014 11:16:41 +0200

sched/numa: Decay ->wakee_flips instead of zeroing

Affine wakeups have the potential to interfere with NUMA placement.
If a task wakes up too many other tasks, affine wakeups will get
disabled.

However, regardless of how many other tasks it wakes up, it gets
re-enabled once a second, potentially interfering with NUMA
placement of other tasks.

By decaying wakee_flips in half instead of zeroing it, we can avoid
that problem for some workloads.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: chegu_vinod@hp.com
Cc: umgwanakikbuti@gmail.com
Link: http://lkml.kernel.org/r/20140516001332.67f91af2@annuminas.surriel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 503f750..c9617b7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4065,7 +4065,7 @@ static void record_wakee(struct task_struct *p)
 	 * about the loss.
 	 */
 	if (jiffies > current->wakee_flip_decay_ts + HZ) {
-		current->wakee_flips = 0;
+		current->wakee_flips >>= 1;
 		current->wakee_flip_decay_ts = jiffies;
 	}
 
