Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE

From: Mike Galbraith <umgwanakikbuti@gmail.com>
To: Josef Bacik <jbacik@fb.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
	riel@redhat.com, mingo@redhat.com, linux-kernel@vger.kernel.org,
	morten.rasmussen@arm.com, kernel-team <Kernel-team@fb.com>
Subject: Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
Date: Sat, 04 Jul 2015 17:57:42 +0200
Message-ID: <1436025462.17152.37.camel@gmail.com>
In-Reply-To: <1435905658.6418.52.camel@gmail.com>

On Fri, 2015-07-03 at 08:40 +0200, Mike Galbraith wrote:

> Hm.  Seems what this load should like best is if we detect 1:N, skip all
> of the routine gyrations, ie move the N (workers) infrequently, expend
> search cycles frequently only on the 1 (dispatch).
> 
> Ponder..

Since it was too hot to do outside chores (any excuse will do;)...

If we're (read /me) on track, the below should help.  Per my tracing,
it may want a wee bit of toning down actually, though when I trace
virgin source I expect to see the same, namely Xorg and friends having
"wide-load" tattooed across their hindquarters earlier than they should.
It doesn't seem to hurt anything, but then demolishing a single llc box
is a tad more difficult than demolishing a NUMA box.


sched: beef up wake_wide()

Josef Bacik reported that Facebook sees better performance with their
1:N load (1 dispatch/node, N workers/node) when carrying an old patch
to try very hard to wake to an idle CPU.  While looking at wake_wide(),
I noticed that it doesn't pay attention to wakeup of the 1:N waker,
returning 1 only when waking one of its N minions.

Correct that, and give the user the option to do an expensive balance IFF
select_idle_sibling() doesn't find an idle CPU, and IFF the wakee is the
1:N dispatcher of work, thus worth some extra effort.

Not-Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
---
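
For illustration only, here is a userspace sketch of the decision the patch
below makes, in case it helps when reading the hunks.  It is not kernel code:
wake_wide_sketch(), decode_wake_wide() and struct wake_hint are names made up
for this example, and the parameters merely stand in for
current->wakee_flips, p->wakee_flips, sd_llc_size and
sched_feat(WAKE_WIDE_BALANCE).

```c
#include <assert.h>

/*
 * Hypothetical userspace stand-in for the patched wake_wide().
 *
 * Returns 0: no 1:N relationship detected, an affine wakeup is fine.
 *         1: 1:N relationship detected, let the load spread.
 *         2: as 1, and additionally the wakee is the busier partner
 *            (the dispatcher) while balance_feature is set.
 */
static int wake_wide_sketch(unsigned long waker_flips,
			    unsigned long wakee_flips,
			    unsigned int llc_size,
			    int balance_feature)
{
	int ret = 1;

	assert(llc_size > 0);

	/* Order the pair so waker_flips holds the larger flip count. */
	if (waker_flips < wakee_flips) {
		unsigned long tmp = waker_flips;

		waker_flips = wakee_flips;
		wakee_flips = tmp;
		/* We are waking the 1:N waker itself. */
		ret += !!balance_feature;
	}

	/* Both thresholds must be met, otherwise stay affine. */
	if (wakee_flips < llc_size || waker_flips < wakee_flips * llc_size)
		return 0;
	return ret;
}

/* Hypothetical container for the two flags select_task_rq_fair() derives. */
struct wake_hint {
	int want_affine;
	int want_balance;
};

/*
 * Mirrors how the patched select_task_rq_fair() splits the tri-state
 * result; waker_cpu_allowed stands in for the cpumask_test_cpu() check.
 */
static struct wake_hint decode_wake_wide(int wide, int waker_cpu_allowed)
{
	struct wake_hint hint;

	hint.want_balance = wide > 1;
	hint.want_affine = !wide && waker_cpu_allowed;
	return hint;
}
```

With llc_size = 4, a waker at 100 flips waking a task at 8 flips gives 1
(8 >= 4 and 100 >= 32), while the reverse direction gives 2 only when the
feature is on.  If either threshold is missed the result is 0 and the usual
affine path applies.
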
 kernel/sched/fair.c     |   89 +++++++++++++++++++++++++-----------------------
 kernel/sched/features.h |    6 +++
 2 files changed, 54 insertions(+), 41 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -666,7 +666,7 @@ static u64 sched_vslice(struct cfs_rq *c
 }
 
 #ifdef CONFIG_SMP
-static int select_idle_sibling(struct task_struct *p, int cpu);
+static int select_idle_sibling(struct task_struct *p, int cpu, void *clear);
 static unsigned long task_h_load(struct task_struct *p);
 
 static inline void __update_task_entity_contrib(struct sched_entity *se);
@@ -1375,7 +1375,7 @@ static void task_numa_compare(struct tas
 	 * Call select_idle_sibling to maybe find a better one.
 	 */
 	if (!cur)
-		env->dst_cpu = select_idle_sibling(env->p, env->dst_cpu);
+		env->dst_cpu = select_idle_sibling(env->p, env->dst_cpu, NULL);
 
 assign:
 	task_numa_assign(env, cur, imp);
@@ -4730,26 +4730,30 @@ static long effective_load(struct task_g
 
 #endif
 
+/*
+ * Detect 1:N waker/wakee relationship via a switching-frequency heuristic.
+ * A waker of many should wake a different task than the one last awakened
+ * at a frequency roughly N times higher than one of its wakees.  In order
+ * to determine whether we should let the load spread vs consolidating to
+ * shared cache, we look for a minimum 'flip' frequency of llc_size in one
+ * partner, and a factor of llc_size higher frequency in the other.  With
+ * both conditions met, we can be relatively sure that we are seeing a 1:N
+ * relationship, and that load size exceeds socket size.
+ */
 static int wake_wide(struct task_struct *p)
 {
-	int factor = this_cpu_read(sd_llc_size);
-
-	/*
-	 * Yeah, it's the switching-frequency, could means many wakee or
-	 * rapidly switch, use factor here will just help to automatically
-	 * adjust the loose-degree, so bigger node will lead to more pull.
-	 */
-	if (p->wakee_flips > factor) {
-		/*
-		 * wakee is somewhat hot, it needs certain amount of cpu
-		 * resource, so if waker is far more hot, prefer to leave
-		 * it alone.
-		 */
-		if (current->wakee_flips > (factor * p->wakee_flips))
-			return 1;
+	unsigned long waker_flips = current->wakee_flips;
+	unsigned long wakee_flips = p->wakee_flips;
+	int factor = this_cpu_read(sd_llc_size), ret = 1;
+
+	if (waker_flips < wakee_flips) {
+		swap(waker_flips, wakee_flips);
+		/* Tell the caller that we're waking a 1:N waker */
+		ret += sched_feat(WAKE_WIDE_BALANCE);
 	}
-
-	return 0;
+	if (wakee_flips < factor || waker_flips < wakee_flips * factor)
+		return 0;
+	return ret;
 }
 
 static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
@@ -4761,13 +4765,6 @@ static int wake_affine(struct sched_doma
 	unsigned long weight;
 	int balanced;
 
-	/*
-	 * If we wake multiple tasks be careful to not bounce
-	 * ourselves around too much.
-	 */
-	if (wake_wide(p))
-		return 0;
-
 	idx	  = sd->wake_idx;
 	this_cpu  = smp_processor_id();
 	prev_cpu  = task_cpu(p);
@@ -4935,20 +4932,22 @@ find_idlest_cpu(struct sched_group *grou
 /*
  * Try and locate an idle CPU in the sched_domain.
  */
-static int select_idle_sibling(struct task_struct *p, int target)
+static int select_idle_sibling(struct task_struct *p, int target, void *clear)
 {
 	struct sched_domain *sd;
 	struct sched_group *sg;
 	int i = task_cpu(p);
 
 	if (idle_cpu(target))
-		return target;
+		goto done;
 
 	/*
 	 * If the prevous cpu is cache affine and idle, don't be stupid.
 	 */
-	if (i != target && cpus_share_cache(i, target) && idle_cpu(i))
-		return i;
+	if (i != target && cpus_share_cache(i, target) && idle_cpu(i)) {
+		target = i;
+		goto done;
+	}
 
 	/*
 	 * Otherwise, iterate the domains and find an elegible idle cpu.
@@ -4973,7 +4972,11 @@ static int select_idle_sibling(struct ta
 			sg = sg->next;
 		} while (sg != sd->groups);
 	}
+	return target;
 done:
+	if (clear)
+		*(void **)clear = 0;
+
 	return target;
 }
 /*
@@ -5021,14 +5024,19 @@ select_task_rq_fair(struct task_struct *
 {
 	struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
 	int cpu = smp_processor_id();
-	int new_cpu = cpu;
-	int want_affine = 0;
+	int new_cpu = prev_cpu;
+	int want_affine = 0, want_balance = 0;
 	int sync = wake_flags & WF_SYNC;
 
-	if (sd_flag & SD_BALANCE_WAKE)
-		want_affine = cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
-
 	rcu_read_lock();
+	if (sd_flag & SD_BALANCE_WAKE) {
+		want_affine = wake_wide(p);
+		want_balance = want_affine > 1;
+		want_affine = !want_affine && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
+		if (!want_affine && !want_balance)
+			goto select;
+	}
+
 	for_each_domain(cpu, tmp) {
 		if (!(tmp->flags & SD_LOAD_BALANCE))
 			continue;
@@ -5043,23 +5051,23 @@ select_task_rq_fair(struct task_struct *
 			break;
 		}
 
-		if (tmp->flags & sd_flag)
+		if (tmp->flags & sd_flag || want_balance)
 			sd = tmp;
 	}
 
 	if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync))
-		prev_cpu = cpu;
+		new_cpu = cpu;
 
 	if (sd_flag & SD_BALANCE_WAKE) {
-		new_cpu = select_idle_sibling(p, prev_cpu);
-		goto unlock;
+select:
+		new_cpu = select_idle_sibling(p, new_cpu, &sd);
 	}
 
 	while (sd) {
 		struct sched_group *group;
 		int weight;
 
-		if (!(sd->flags & sd_flag)) {
+		if (!(sd->flags & sd_flag) && !want_balance) {
 			sd = sd->child;
 			continue;
 		}
@@ -5089,7 +5097,6 @@ select_task_rq_fair(struct task_struct *
 		}
 		/* while loop will break here if sd == NULL */
 	}
-unlock:
 	rcu_read_unlock();
 
 	return new_cpu;
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -96,3 +96,9 @@ SCHED_FEAT(NUMA_FAVOUR_HIGHER, true)
  */
 SCHED_FEAT(NUMA_RESIST_LOWER, false)
 #endif
+
+/*
+ * Perform expensive full wake balance for 1:N wakers when the
+ * selected cpu is not completely idle.
+ */
+SCHED_FEAT(WAKE_WIDE_BALANCE, false)
