linux-kernel.vger.kernel.org archive mirror
From: Michael Wang <wangyun@linux.vnet.ibm.com>
To: Mike Galbraith <bitbucket@online.de>
Cc: linux-kernel@vger.kernel.org, mingo@redhat.com,
	peterz@infradead.org, mingo@kernel.org, a.p.zijlstra@chello.nl
Subject: Re: [RFC PATCH 0/2] sched: simplify the select_task_rq_fair()
Date: Wed, 23 Jan 2013 11:01:54 +0800	[thread overview]
Message-ID: <50FF52A2.9010604@linux.vnet.ibm.com> (raw)
In-Reply-To: <1358854481.5782.405.camel@marge.simpson.net>

On 01/22/2013 07:34 PM, Mike Galbraith wrote:
> On Tue, 2013-01-22 at 16:56 +0800, Michael Wang wrote: 
>> On 01/22/2013 04:03 PM, Mike Galbraith wrote:
>> [snip]
>>> ... 
>>>>>
>>>>> That was with your change backed out, and the q/d below applied.
>>>>
>>>> So that change will help to solve the issue? good to know :)
>>>>
>>>> But it will invoke wake_affine() without any delay; the benefit
>>>> of the patch set will be reduced a lot...
>>>
>>> Yeah, I used size large hammer.
>>>
>>>> I think this change helps to solve the issue because it avoids
>>>> jumping into the balance path on wakeup in all cases. I think we
>>>> can do something like the change below to achieve this and still
>>>> gain the benefit of the delayed wake_affine().
>>>
>>> Yup, I killed it all the way dead.  I'll see what below does.
>>>
>>> I don't really see the point of the wake_affine() change in this set
>>> though.  Its purpose is to decide if a pull is ok or not.  If we don't
>>> need its opinion when we look for an (momentarily?) idle core in
>>> this_domain, we shouldn't need it at all, and could just delete it.
>>
>> I have a question here, so wake_affine() is:
>> A. checking whether a pull would keep things balanced, or
>> B. checking whether it's better to pull than not?
> 
> A, "is it ok to move this guy to where red hot data awaits" is the way
> it has always been used, ie "can we move him here without upsetting
> balance too much", with sync hint meaning the waker is likely going to
> sleep very soon, so we pretend he's already gone when looking at load.
> That sync hint btw doesn't have anywhere near as much real meaning as
> would be nice to have.

Agreed.
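To make sure we mean the same thing by A, here is how I picture the sync discount, as a stand-alone toy in plain C (the helper name, signature and load numbers are all made up for illustration; this is not the real wake_affine()):

```c
#include <assert.h>

/*
 * Toy sketch of the sync hint: the waker is expected to sleep very
 * soon, so its own load is discounted from this cpu's load before
 * asking whether pulling the wakee here would upset balance too much.
 */
static int pull_ok(unsigned long this_load, unsigned long waker_load,
		   unsigned long prev_load, int sync)
{
	if (sync && this_load >= waker_load)
		this_load -= waker_load; /* pretend the waker is already gone */

	/* "within parameters": ok if this cpu would not end up more
	 * loaded than the cpu the task last slept on. */
	return this_load <= prev_load;
}
```

With sync, a waker load of 4 makes a this-cpu load of 10 look like 6, so a pull that would otherwise be refused gets through.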

> 
>> I suppose it's A, so my logic is:
>> 1. find an idle cpu in the prev domain.
>> 2. if that failed and the cpus are affine, find an idle cpu in the
>>    current domain.
> 
> Hm.  If cpu and prev_cpu are cache affine, you already searched both.
> 

Well, that's true if affine cpus always share the same sd topology,
but do we have any guarantee of that?
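In other words, something like this toy check is what I would want a promise about (the bitmask model of sd spans is invented purely for illustration; it is not the kernel's sched_domain code):

```c
#include <assert.h>

/*
 * Toy model: each level of a cpu's sd topology is a span of cpus,
 * encoded as a bitmask. Cpu B is covered by cpu A's topology only
 * if some level of A's hierarchy spans B; being "affine" does not
 * by itself promise the two hierarchies are identical.
 */
static int topology_covers(const unsigned long *spans, int levels, int cpu)
{
	int l;

	for (l = 0; l < levels; l++)
		if (spans[l] & (1UL << cpu))
			return 1; /* some domain level spans that cpu */
	return 0;
}
```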

>> 3. if an idle cpu is found in the current domain, check whether the
>> pull keeps things balanced, via wake_affine().
>> 4. if all of that failed, two choices: go to the balance path or
>> directly return prev_cpu.
>>
>> So I still need wake_affine() for a final check. But to be honest, I
>> really doubt whether it is worth caring about balance at wakeup; if
>> the task just runs for several ns and then sleeps again, it's totally
>> worthless...
> 
> Not totally worthless, but questionable yes.  It really matters most
> when wakee was way over there in cache foo for whatever reason, has no
> big footprint, and red hot data is waiting here in cache bar.  Light
> tasks migrating helps communicating buddies find each other and perform.
> 
> The new NUMA stuff will help heavy tasks, but it won't help with light
> tasks that could benefit by moving to the data.  We currently try to
> migrate on wakeup, if we do stop doing that, we may get hurt more often
> than not, dunno.  Benchmarks will tell.

Agreed. For this patch set, until we have proof that doing balance at
wakeup is worse than not doing it, I will keep doing it. The answer
won't be easy to get, though; we would need some numbers to show when a
task is worth balancing at wakeup, and I doubt such numbers are even
possible to produce...
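For reference, the order I listed above (steps 1 to 4) boils down to something like this toy function, with the search and balance results passed in as plain values (this is a sketch of the decision order only, not the real select_task_rq_fair()):

```c
#include <assert.h>

/*
 * Toy decision order: -1 stands for "no idle cpu found", as in the
 * patched select_idle_sibling(); balance_ok stands in for the final
 * wake_affine() verdict.
 */
static int pick_cpu(int prev_cpu, int idle_near_prev, int affine,
		    int idle_near_cur, int balance_ok)
{
	if (idle_near_prev != -1)		/* 1. idle cpu near prev_cpu */
		return idle_near_prev;
	if (affine && idle_near_cur != -1	/* 2. idle cpu near the waker */
	    && balance_ok)			/* 3. final wake_affine() check */
		return idle_near_cur;
	return prev_cpu;			/* 4. fall back to prev_cpu */
}
```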

> 
>>> If we ever enter balance_path, we can't possibly induce imbalance without
>>> there being something broken in that path, no?
>>
>> So your opinion is that something is broken in the new balance path?
> 
> I don't know that, but it is a logical bug candidate.

And you are right, I think I got something from the debug info you
showed, thanks again for that :)

> 
>>> BTW, it could well be that an unpatched kernel will collapse as well if
>>> WAKE_BALANCE is turned on.  I've never tried that on a largish box, as
>>> doing any of the wakeup time optional stuff used to make tbench scream.
>>>
>>>> Since the issue could not be reproduced on my side, I don't know
>>>> whether the patch helps or not, so if you are willing to send out
>>>> a formal patch, I would be glad to include it in my patch set ;-)
>>>
>>> Just changing to scan prev_cpu before considering pulling would put a
>>> big dent in the bouncing cow problem, but that's the intriguing thing
>>> about this set.. 
>>
>> So that's my first question: if wake_affine() returning 1 means it's
>> better to pull than not, then the new way may be harmful; but if it
>> just tells us that a pull won't break the balance, then I still think
>> the current domain is just a backup, not the first choice.
> 
> wake_affine() doesn't know if it'll be a good or bad move, it only says
> go for it, load numbers are within parameters.
> 
>>> can we have the tbench and pgbench big box gain without
>>> a lot of pain to go with it?  Small boxen will surely benefit, pretty
>>> much can't be hurt, but what about all those fast/light tasks that won't
>>> hop across nodes to red hot data?
>>
>> I don't get it... does "a task won't hop" mean a task is always
>> selected to run on prev_cpu?
> 
> Yes.  If wakee is light, has no large footprint to later have to drag to
> its new home, moving it to the hot data is a win.
> 
>> We will assign an idle cpu if we find one; if not, we can use
>> prev_cpu or go to the balance path and find one, so what's the
>> problem here?
> 
> Wakeup latency may be low, but the task can still perform badly due to
> misses.  In the tbench case, cross node data misses aren't anywhere near
> as bad as the _everything_ is a miss you get from waker/wakee bouncing
> all over a single shared L3. 

Hmm... that seems like another point which can only be decided by
benchmarks.

Regards,
Michael Wang

> 
>>> No formal patch is likely to result from any testing I do atm at least.
>>> I'm testing your patches because I see potential, I really want it to
>>> work out, but have to see it do that with my own two beady eyeballs ;-)
>>
>> Got it.
>>
>>>
>>>> And the other patch below is a debug one, which will print out
>>>> all the sbm info, so we can check whether it was initialized
>>>> correctly; just use the command "dmesg | grep WYT" to show the map.
>>
>> What about this patch? Maybe a wrong map is the killer on the
>> balance path, should we check it? ;-)
> 
> Yeah, I haven't actually looked for any booboos, just ran it straight
> out of the box ;-)
> 
>> Regards,
>> Michael Wang
>>
>>>>
>>>> Regards,
>>>> Michael Wang
>>>>
>>>> ---
>>>>  kernel/sched/fair.c |   42 +++++++++++++++++++++++++-----------------
>>>>  1 files changed, 25 insertions(+), 17 deletions(-)
>>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index 2aa26c1..4361333 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -3250,7 +3250,7 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
>>>>  }
>>>>
>>>>  /*
>>>> - * Try and locate an idle CPU in the sched_domain.
>>>> + * Try and locate an idle CPU in the sched_domain, return -1 on failure.
>>>>   */
>>>>  static int select_idle_sibling(struct task_struct *p, int target)
>>>>  {
>>>> @@ -3292,13 +3292,13 @@ static int select_idle_sibling(struct task_struct *p, int target)
>>>>
>>>>                         target = cpumask_first_and(sched_group_cpus(sg),
>>>>                                         tsk_cpus_allowed(p));
>>>> -                       goto done;
>>>> +                       return target;
>>>>  next:
>>>>                         sg = sg->next;
>>>>                 } while (sg != sd->groups);
>>>>         }
>>>> -done:
>>>> -       return target;
>>>> +
>>>> +       return -1;
>>>>  }
>>>>
>>>>  /*
>>>> @@ -3342,40 +3342,48 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
>>>>                  * may has already been cached on prev_cpu, and usually
>>>>                  * they require low latency.
>>>>                  *
>>>> -                * So firstly try to locate an idle cpu shared the cache
>>>> +                * Therefore, the balance path in such cases may do harm
>>>> +                * as well as good; waking on prev_cpu may be better than
>>>> +                * waking on a new, lower-loaded cpu because of the cached
>>>> +                * memory, and we can never know in advance.
>>>> +                *
>>>> +                * So the principle is: try to find an idle cpu as close to
>>>> +                * prev_cpu as possible; if that fails, just take prev_cpu.
>>>> +                *
>>>> +                * First, try to locate an idle cpu sharing the cache
>>>>                  * with prev_cpu, it has the chance to break the load
>>>>                  * balance, fortunately, select_idle_sibling() will search
>>>>                  * from top to bottom, which help to reduce the chance in
>>>>                  * some cases.
>>>>                  */
>>>>                 new_cpu = select_idle_sibling(p, prev_cpu);
>>>> -               if (idle_cpu(new_cpu))
>>>> +               if (new_cpu != -1)
>>>>                         goto unlock;
>>>>
>>>>                 /*
>>>>                  * No idle cpu could be found in the topology of prev_cpu,
>>>> -                * before jump into the slow balance_path, try search again
>>>> -                * in the topology of current cpu if it is the affine of
>>>> -                * prev_cpu.
>>>> +                * before taking prev_cpu, try searching again in the
>>>> +                * topology of the current cpu if it is affine to prev_cpu.
>>>>                  */
>>>> -               if (cpu == prev_cpu ||
>>>> -                               !sbm->affine_map[prev_cpu] ||
>>>> +               if (cpu == prev_cpu || !sbm->affine_map[prev_cpu] ||
>>>>                                 !cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
>>>> -                       goto balance_path;
>>>> +                       goto take_prev;
>>>>
>>>>                 new_cpu = select_idle_sibling(p, cpu);
>>>> -               if (!idle_cpu(new_cpu))
>>>> -                       goto balance_path;
>>>> -
>>>>                 /*
>>>>                  * Invoke wake_affine() finally since it is no doubt a
>>>>                  * performance killer.
>>>>                  */
>>>> -               if (wake_affine(sbm->affine_map[prev_cpu], p, sync))
>>>> +               if ((new_cpu != -1) &&
>>>> +                       wake_affine(sbm->affine_map[prev_cpu], p, sync))
>>>>                         goto unlock;
>>>> +
>>>> +take_prev:
>>>> +               new_cpu = prev_cpu;
>>>> +               goto unlock;
>>>>         }
>>>>
>>>> -balance_path:
>>>> +       /* Balance path. */
>>>>         new_cpu = (sd_flag & SD_BALANCE_WAKE) ? prev_cpu : cpu;
>>>>         sd = sbm->sd[type][sbm->top_level[type]];
>>>>
>>>> -- 
>>>> 1.7.4.1
>>>>
>>>> DEBUG PATCH:
>>>>
>>>> ---
>>>>  kernel/sched/core.c |   30 ++++++++++++++++++++++++++++++
>>>>  1 files changed, 30 insertions(+), 0 deletions(-)
>>>>
>>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>>> index 0c63303..f251f29 100644
>>>> --- a/kernel/sched/core.c
>>>> +++ b/kernel/sched/core.c
>>>> @@ -5578,6 +5578,35 @@ static void update_top_cache_domain(int cpu)
>>>>  static int sbm_max_level;
>>>>  DEFINE_PER_CPU_SHARED_ALIGNED(struct sched_balance_map, sbm_array);
>>>>
>>>> +static void debug_sched_balance_map(int cpu)
>>>> +{
>>>> +       int i, type, level = 0;
>>>> +       struct sched_balance_map *sbm = &per_cpu(sbm_array, cpu);
>>>> +
>>>> +       printk("WYT: sbm of cpu %d\n", cpu);
>>>> +
>>>> +       for (type = 0; type < SBM_MAX_TYPE; type++) {
>>>> +               if (type == SBM_EXEC_TYPE)
>>>> +                       printk("WYT: \t exec map\n");
>>>> +               else if (type == SBM_FORK_TYPE)
>>>> +                       printk("WYT: \t fork map\n");
>>>> +               else if (type == SBM_WAKE_TYPE)
>>>> +                       printk("WYT: \t wake map\n");
>>>> +
>>>> +               for (level = 0; level < sbm_max_level; level++) {
>>>> +                       if (sbm->sd[type][level])
>>>> +                               printk("WYT: \t\t sd %p, idx %d, level %d, weight %d\n", sbm->sd[type][level], level, sbm->sd[type][level]->level, sbm->sd[type][level]->span_weight);
>>>> +               }
>>>> +       }
>>>> +
>>>> +       printk("WYT: \t affine map\n");
>>>> +
>>>> +       for_each_possible_cpu(i) {
>>>> +               if (sbm->affine_map[i])
>>>> +                       printk("WYT: \t\t affine with cpu %d in sd %p, weight %d\n", i, sbm->affine_map[i], sbm->affine_map[i]->span_weight);
>>>> +       }
>>>> +}
>>>> +
>>>>  static void build_sched_balance_map(int cpu)
>>>>  {
>>>>         struct sched_balance_map *sbm = &per_cpu(sbm_array, cpu);
>>>> @@ -5688,6 +5717,7 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
>>>>          * destroy_sched_domains() already do the work.
>>>>          */
>>>>         build_sched_balance_map(cpu);
>>>> +       debug_sched_balance_map(cpu);
>>>>         rcu_assign_pointer(rq->sbm, sbm);
>>>>  }
>>>>
>>>
>>>
>>
> 
> 

