Re: [PATCH] sched/fair: Skip wake_affine() for core siblings

linux-kernel.vger.kernel.org archive mirror

From: Kirill Tkhai <ktkhai@odin.com>
To: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: <linux-kernel@vger.kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>
Subject: Re: [PATCH] sched/fair: Skip wake_affine() for core siblings
Date: Wed, 30 Sep 2015 22:16:23 +0300
Message-ID: <560C3507.3040906@odin.com>
In-Reply-To: <1443547751.2790.24.camel@gmail.com>



On 29.09.2015 20:29, Mike Galbraith wrote:
> On Tue, 2015-09-29 at 19:00 +0300, Kirill Tkhai wrote:
>>
>> On 29.09.2015 17:55, Mike Galbraith wrote:
>>> On Mon, 2015-09-28 at 18:36 +0300, Kirill Tkhai wrote:
>>>
>>>> ---
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index 4df37a4..dfbe06b 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -4930,8 +4930,13 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>>>>  	int want_affine = 0;
>>>>  	int sync = wake_flags & WF_SYNC;
>>>>  
>>>> -	if (sd_flag & SD_BALANCE_WAKE)
>>>> -		want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
>>>> +	if (sd_flag & SD_BALANCE_WAKE) {
>>>> +		want_affine = 1;
>>>> +		if (cpu == prev_cpu || !cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
>>>> +			goto want_affine;
>>>> +		if (wake_wide(p))
>>>> +			goto want_affine;
>>>> +	}
>>>
>>> That blew wake_wide() right out of the water.
>>>
>>> It's not only about things like pgbench.  Drive multiple tasks in a Xen
>>> guest (single event channel dom0 -> domu, and no select_idle_sibling()
>>> to save the day) via network, and watch workers fail to be all they can
>>> be because they keep being stacked up on the irq source.  Load balancing
>>> yanks them apart, next irq stacks them right back up.  I met that in
>>> enterprise land, thought wake_wide() should cure it, and indeed it did.
>>
>> 1) Hm. The patch makes select_task_rq_fair() prefer the old CPU over the
>> current one, doesn't it? More often than not we don't set affine_sd, so the
>> part of the patch skipped in the quote selects prev_cpu.
> 
> Not the way I read it..
> 
>>> -    if (affine_sd) {
>>> +want_affine:
>>> +    if (want_affine) {
>>>              sd = NULL; /* Prefer wake_affine over balance flags */
>>> -            if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
>>> +            if (affine_sd && wake_affine(affine_sd, p, sync))
>>>                      new_cpu = cpu;
>>> -    }
>>> -
>>> -    if (!sd) {
>>> -            if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */
>>> -                    new_cpu = select_idle_sibling(p, new_cpu);
>>> -
>>> +            new_cpu = select_idle_sibling(p, new_cpu);
> 
> ..it sets new_cpu = cpu if wake_affine() says Ok, wake_wide() has no say
> in the matter.
>  
>> 2) I thought about wakeups from irq handlers and was even going to ask why
>> we use affine logic for them. Device handlers usually aren't bound, and
>> timers may migrate because of the NO_HZ logic. The only explanation I found
>> is that unbound timers are a very unlikely case (I added a statistics printk
>> to my local sched_debug to check that). But if we have situations like the
>> one you described above, shouldn't we disable affine logic for
>> in_interrupt() cases?
> 
> BTDT.  In my experience, the more you try to differentiate sources, the
> more corner cases you create.  I've tried doing special things for irq,
> locks, wake_all, wake_one, and it always turned into a can of worms.
> IMHO, the best policy for the fast path is KISS.
> 
>> 3) I ask just because (being outside of scheduler history) it seems a little
>> strange that we prefer smp_processor_id()'s sd_llc so much. The benefit of a
>> sync wakeup is more or less clear: smp_processor_id()'s sd_llc may contain
>> data that is interesting for the wakee, and this minimizes cache misses.
>> But we do the same in other cases too, and at every migration we lose
>> itlb, dtlb... Of course, this requires more accurate patches than the ones
>> posted (not such crude patches).
> 
> IMHO, the sync wakeup hint is more often a big fat lie than anything
> else, it really just gives us a bit more headroom for affine wakeups in
> cases where that's likely to be a very good thing (affine in the cache
> sense, not affine as in an individual CPU).  What it means is that waker
> is likely to schedule RSN, but if you measure even very fast/light
> things, there is an overlap win to be had by NOT waking CPU affine,
> rather waking cache affine, that's why we cross core schedule so often.
> A real network app doing a wakeup is not necessarily gonna schedule
> RSN, there is very often a latency win to be had by scheduling to a
> nearby core, ie a thread pool worker doing a "sync" wakeup may very
> instantly find that it has more work to do.  If a fast/light wakee can
> slip into an idle crack and get to CPU instantly, it can generate more
> work a little bit sooner.

Yeah, in most places where a sync wakeup is used, the task is not going to
reschedule soon.

Thanks for the explanation, Mike!
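
For reference, here is a small user-space sketch of one reading of the quoted
hunks, the one argued in point 1 above. It is not kernel code. The struct and
function names are hypothetical, the predicates are plain booleans standing in
for wake_wide(), wake_affine() and the cpumask test, and it rests on two
assumptions not visible in the quote: that new_cpu starts out as prev_cpu, and
that the goto jumps over the domain walk so affine_sd stays unset. If either
assumption is wrong, so is the model.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for kernel state; none of these names exist in the kernel. */
struct wakeup {
	int cpu;            /* waking CPU, smp_processor_id() in the patch */
	int prev_cpu;       /* CPU the wakee last ran on */
	bool allowed;       /* stand-in for cpumask_test_cpu(cpu, tsk_cpus_allowed(p)) */
	bool wide;          /* stand-in for wake_wide(p) */
	bool domain_found;  /* ASSUMPTION: the unquoted domain walk found an affine domain */
	bool affine_ok;     /* stand-in for wake_affine(affine_sd, p, sync) */
};

/*
 * Returns the CPU that would be handed to select_idle_sibling() under this
 * reading of the quoted hunks, for an SD_BALANCE_WAKE wakeup.
 */
static int idle_sibling_seed(const struct wakeup *w)
{
	int new_cpu = w->prev_cpu;  /* ASSUMPTION: initial value is prev_cpu */
	bool affine_sd = false;

	/* First hunk: both early exits skip the domain walk. */
	if (w->cpu == w->prev_cpu || !w->allowed)
		goto out;
	if (w->wide)
		goto out;

	/* ASSUMPTION: the domain walk sits here and may set affine_sd. */
	affine_sd = w->domain_found;
out:
	/* Second hunk: only a found domain plus wake_affine() moves the seed. */
	if (affine_sd && w->affine_ok)
		new_cpu = w->cpu;
	return new_cpu;
}
```

Under these assumptions a wake_wide() task always seeds select_idle_sibling()
with prev_cpu, while a non-wide task with a found domain and a positive
wake_affine() seeds it with the waking CPU.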

Thread overview: 14+ messages
2015-09-25 17:54 [PATCH] sched/fair: Skip wake_affine() for core siblings Kirill Tkhai
2015-09-26 15:25 ` Mike Galbraith
2015-09-28 10:28   ` Kirill Tkhai
2015-09-28 13:12     ` Mike Galbraith
2015-09-28 15:36       ` Kirill Tkhai
2015-09-28 15:49         ` Kirill Tkhai
2015-09-28 18:22         ` Mike Galbraith
2015-09-28 19:19           ` Kirill Tkhai
2015-09-29  2:03             ` Mike Galbraith
2015-09-29 14:55         ` Mike Galbraith
2015-09-29 16:00           ` Kirill Tkhai
2015-09-29 16:03             ` Kirill Tkhai
2015-09-29 17:29             ` Mike Galbraith
2015-09-30 19:16               ` Kirill Tkhai [this message]