From: Peter Zijlstra <a.p.zijlstra@chello.nl>
To: Mike Galbraith <efault@gmx.de>
Cc: lkml <linux-kernel@vger.kernel.org>,
Suresh Siddha <suresh.b.siddha@intel.com>,
Paul Turner <pjt@google.com>,
Arjan Van De Ven <arjan@linux.intel.com>
Subject: Re: [rfc][patch] select_idle_sibling() inducing bouncing on westmere
Date: Thu, 24 May 2012 15:17:11 +0200
Message-ID: <1337865431.9783.148.camel@laptop>
In-Reply-To: <1337857490.7300.19.camel@marge.simpson.net>
On Thu, 2012-05-24 at 13:04 +0200, Mike Galbraith wrote:
> sched: fix select_idle_sibling() induced bouncing
>
> Traversing an entire package is not only expensive, it also leads to tasks
> bouncing all over a partially idle and possibly quite large package. Fix
> that up by assigning a 'buddy' CPU to try to motivate. Each buddy may try
> to motivate that one other CPU; if it's busy, tough, it may then try its
> SMT sibling, but that's all this optimization is allowed to cost.
>
> Sibling cache buddies are cross-wired to prevent bouncing.
>
> Signed-off-by: Mike Galbraith <efault@gmx.de>
>
> ---
> include/linux/sched.h | 1 +
> kernel/sched/core.c | 40 +++++++++++++++++++++++++++++++++++++++-
> kernel/sched/fair.c | 28 +++++++++-------------------
> 3 files changed, 49 insertions(+), 20 deletions(-)
>
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -928,6 +928,7 @@ struct sched_domain {
> struct sched_domain *parent; /* top domain must be null terminated */
> struct sched_domain *child; /* bottom domain must be null terminated */
> struct sched_group *groups; /* the balancing groups of the domain */
> + struct sched_group *sibling; /* group assigned to select_idle_sibling() */
A better name would be idle_sibling, or possibly idle_buddy.
Sibling is oft times understood to mean SMT-sibling, confusion reigns.
Also, it looks like you're explicitly going down the sched_domains to a
single cpu group. If that's the exact purpose of this, to point to a
particular cpu, make it an int and do away with the group/cpumask bits.
> unsigned long min_interval; /* Minimum balance interval ms */
> unsigned long max_interval; /* Maximum balance interval ms */
> unsigned int busy_factor; /* less balancing by factor if busy */
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5888,9 +5888,47 @@ static void update_top_cache_domain(int
> int id = cpu;
>
> sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
> - if (sd)
> + if (sd) {
> + struct sched_domain *tmp = sd;
> + struct sched_group *sg = tmp->groups, *prev = sg;
> + int smt = 0, right = 1;
> +
> id = cpumask_first(sched_domain_span(sd));
>
> + /*
> + * Assign a 'buddy' CPU for select_idle_sibling()
> + * to try to motivate. These point at each other
> + * at the MC level, and at own sibling at SIBLING
> + * to prevent mad bouncing of tasks on a package
> + * with many cores/siblings.
> + */
> + while (cpumask_first(sched_group_cpus(sg)) != id)
> + sg = sg->next;
ok, because we're staring at @cpu's sched domains, @id need not be in
the first group.
> + /*
> + * Ok, have first group, should we point right or left?
> + * sg is tmp->groups again when done, ie our group.
> + */
> + while (!cpumask_test_cpu(cpu, sched_group_cpus(sg))) {
> + prev = sg;
> + sg = sg->next;
> + right = !right;
> + }
Slightly confused by this.. So we find @id's group, then we iterate
until we find @cpu's group (sd->groups in fact), but need the iteration
count to find if it's even or odd numbered.
Now couldn't you have used the iteration count on the first while()
loop?
> + /* A CPU went down, never point back to package start. */
> + if (right && cpumask_first(sched_group_cpus(sg->next)) == id)
> + right = 0;
Slightly more confusion, unplugged cpus aren't part of the sched_domain
structure...
> + sg = right ? sg->next : prev;
So if it was odd we go one more, if it was even we go one back..
Suppose we have 6 cores sharing a cache and no smt, then cpu0 would have
no iterations and right == 1, so we pick cpu1. cpu1 will end up on cpu0
and we continue like 2-3 3-2 etc.
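To check my own reading, here's a minimal userspace model of that pairing
walk (made-up function, flat ring of single-CPU groups, no SMT -- just the
parity logic, not the real sched_group walk):

```c
#include <assert.h>

/*
 * Hypothetical userspace model of the walk above: the hop count from
 * @id's group to @cpu's group is just @cpu, so an even count leaves
 * right == 1 (point at the next group) and an odd count flips it
 * (point at the previous one); with an odd number of CPUs the last
 * one's sg->next would wrap back to @id, so it points left instead.
 */
static int buddy_cpu(int cpu, int ncpus)
{
	int right = (cpu % 2 == 0);

	if (right && cpu + 1 == ncpus)	/* sg->next wraps to @id */
		right = 0;
	return right ? cpu + 1 : cpu - 1;
}
```

That yields the crossed pairs 0-1, 1-0, 2-3, 3-2 on the 6-core box.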
> + do {
> + if (smt)
> + sg = tmp->groups->next;
> + rcu_assign_pointer(tmp->sibling, sg);
> + smt = 1;
> + } while ((tmp = tmp->child));
Oh, wait we keep a ->sibling pointer for each level..
So here we go down, somehow always picking the second smt sibling:
core0 core1
smt0 smt1 smt0 smt1
So cpu0 would end up at smt1, cpu1 would end up at smt0, crossing them
nicely.
For power7 with 4 smt it would end up as 1230 I guess.
So if we then make it a 6 core 2 smt machine we get:
core 0->1, 1->0, 2->3, 3->2 etc..
but at the same time we get the smt stuff.. I suppose you mean to select
an SMT sibling from core1, however your tmp = tmp->child goes down the
topology on core0, selecting core0 threads instead.
Surely that's not the intention?
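To make the objection concrete, a toy model (made-up numbering: core c's
threads are 2c and 2c+1) of what the SIBLING-level assignment actually
picks:

```c
#include <assert.h>

/*
 * At the SIBLING level, tmp is @cpu's own sibling domain, so
 * tmp->groups->next is the other thread group of @cpu's *own* core --
 * with 2 threads per core and the numbering above that is cpu ^ 1,
 * never a thread of the buddy core.
 */
static int sibling_level_pick(int cpu)
{
	return cpu ^ 1;
}
```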
> + }
> +
> rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
> per_cpu(sd_llc_id, cpu) = id;
> }
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2655,29 +2655,19 @@ static int select_idle_sibling(struct ta
> return prev_cpu;
>
> /*
> + * Otherwise, check assigned siblings to find an eligible idle cpu.
> */
> sd = rcu_dereference(per_cpu(sd_llc, target));
> + for_each_lower_domain(sd) {
> + sg = rcu_dereference(sd->sibling);
> + for_each_cpu_and(i, sched_group_cpus(sg), tsk_cpus_allowed(p)) {
> + if (idle_cpu(i))
> + return i;
> + break;
> + }
Ah, I think I see the smt thing..
So suppose cpu0 on our 6 core 2 thread system is doing its thing here,
the first round will try both threads from core1, the second level will
then try our own smt sibling.
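Under that reading, a userspace sketch of the resulting probe order for
the 6-core/2-thread case (made-up names, threads of core c numbered 2c
and 2c+1, idle[] standing in for idle_cpu()):

```c
#include <assert.h>

/*
 * Toy model: the MC-level buddy group contributes the crossed buddy
 * core's two threads, the SIBLING level then contributes our own SMT
 * sibling; first idle one wins, else we stay on @cpu.
 */
static int pick_idle(int cpu, int ncores, const int *idle)
{
	int core = cpu / 2;
	int right = (core % 2 == 0);
	int bc, i;
	int order[3];

	if (right && core + 1 == ncores)	/* don't wrap to package start */
		right = 0;
	bc = right ? core + 1 : core - 1;	/* crossed buddy core */

	order[0] = 2 * bc;			/* buddy core, thread 0 */
	order[1] = 2 * bc + 1;			/* buddy core, thread 1 */
	order[2] = cpu ^ 1;			/* own SMT sibling */

	for (i = 0; i < 3; i++)
		if (idle[order[i]])
			return order[i];
	return cpu;
}
```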
> }
> +
> return target;
> }
And I guess the performance improvement comes from simply doing less
work, right?
Did you do your numbers with the distro NR_CPUS=4096 bloat?
Somewhat related, Arjan recently told me we should try and avoid waking
an idle core for tasks that will run very short. Now currently we don't
have runtime estimation (anymore), but pjt's load tracking stuff would
re-introduce that.
Now if only pjt would re-surface... :-)