From: Kirill Tkhai <ktkhai@odin.com>
To: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: <linux-kernel@vger.kernel.org>,
Peter Zijlstra <peterz@infradead.org>,
Ingo Molnar <mingo@redhat.com>
Subject: Re: [PATCH] sched/fair: Skip wake_affine() for core siblings
Date: Mon, 28 Sep 2015 22:19:08 +0300 [thread overview]
Message-ID: <560992AC.7020909@odin.com> (raw)
In-Reply-To: <1443464557.2780.72.camel@gmail.com>
On 28.09.2015 21:22, Mike Galbraith wrote:
> On Mon, 2015-09-28 at 18:36 +0300, Kirill Tkhai wrote:
>
>> Mike, one more moment. wake_wide() and current logic confuses me a bit.
>> It makes us to decide if we want affine wakeup or not, but select_idle_sibling()
>> if a function is not for choosing this_cpu's llc domain only. We use it
>> for searching in prev_cpu llc domain too, and it seems we are not interested
>> in current flips in this case.
>
> We're always interested in "flips", as the point is to try to identify
> N:M load components, and when they may overload a socket. The hope is
> to get it more right than wrong, as making the tracking really accurate
> is too expensive for the fast path.
>
>> Imagine a situation, when we share a mutex
>> with a task on another NUMA node. When the task is realising the mutex
>> it is waking us, but we definitelly won't use affine logic in this case.
>
> Why not? A wakeup is a wakeup is a wakeup, they all do the same thing.
> If wake_wide() doesn't NAK an affine wakeup, we ask wake_affine() for
> its opinion, then look for an idle CPU near the waker's CPU if it says
> OK, or near wakee's previous CPU if it says go away.
But NUMA sd does not have SD_WAKE_AFFINE flag, so this case a new cpu won't
be choosen from previous node. There will be choosen the highest domain
of smp_processor_id(), which has SD_BALANCE_WAKE flag, and the cpu will
be choosen from the idlest group/cpu. And we don't have a deal with old
cache at all. This looks like a completely wrong behaviour...
>> We wake the wakee anywhere and loose hot cache.
>
> Yeah, sometimes we'll make tasks drag their data to them when we could
> have dragged the task to the data in the name of trying to crank up CPU
> utilization. At some point, _somebody_ has to drag their data across
> interconnect, but we really don't know if/when the data transport cost
> will pay off in better utilization.
>
> -Mike
>
> (I'll take a peek at below when damn futexes get done kicking my a$$)
This case, you'll be able to analyze new results below :)
ORIGIN 1 2 3 4 avg
C=1 8607,960451 8344,16111 8381,197569 7991,84102 8331,2900375
C=2 15397,152685 15438,913761 15771,512182 15648,368819 15563,98686175
C=4 30987,860844 31144,127431 31104,153461 30874,292825 31027,60864025
C=8 62447,47612 62179,788923 61534,482204 62787,894021 62237,410317
PATCHED 1 2 3 4 avg
C=1 8782,439938 8675,891877 8609,209537 8735,120895 8700,66556175
C=2 16526,31409 16491,650678 16149,594736 16365,630084 16383,297397
C=4 32286,341708 32313,536565 32538,285157 32299,427398 32359,397707
C=8 63860,310019 63187,152569 63400,930755 63210,460753 63414,713524
This is:
# pgbench -j 1 -S -c X -T 200 test
The test machine has no NUMA, and I suppose, NUMA machine will show much better results.
I'll test it tomorrow.
>> I changed the logic, and
>> tried pgbench 1:8. The results (I threw away 3 first iterations, because
>> they much differ with iter >= 4. Looks like, the reason is in uncached disk IO).
>>
>>
>> Before:
>>
>> trans. | tps (i) | tps (e)
>> --------------------------------------
>> 12098226 | 60491.067392 | 60500.886373
>> 12030184 | 60150.874285 | 60160.654295
>> 11882977 | 59414.829150 | 59424.830637
>> 12020125 | 60100.579023 | 60111.600176
>> 12161917 | 60809.547906 | 60827.321639
>> 12154660 | 60773.249254 | 60783.085165
>>
>> After:
>>
>> trans. | tps (i) | tps (e)
>> --------------------------------------
>> 12770407 | 63849.883578 | 63860.310019
>> 12635366 | 63176.399769 | 63187.152569
>> 12676890 | 63384.396440 | 63400.930755
>> 12639949 | 63199.526330 | 63210.460753
>> 12670626 | 63353.079951 | 63363.274143
>> 12647001 | 63209.613698 | 63219.812331
>>
>> I'm going to test other cases, but could you tell me (if you remember) are there reasons
>> we skip prev_cpu, like I described above? Some types of workloads etc.
>>
>> ---
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 4df37a4..dfbe06b 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4930,8 +4930,13 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>> int want_affine = 0;
>> int sync = wake_flags & WF_SYNC;
>>
>> - if (sd_flag & SD_BALANCE_WAKE)
>> - want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
>> + if (sd_flag & SD_BALANCE_WAKE) {
>> + want_affine = 1;
>> + if (cpu == prev_cpu || !cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
>> + goto want_affine;
>> + if (wake_wide(p))
>> + goto want_affine;
>> + }
>>
>> rcu_read_lock();
>> for_each_domain(cpu, tmp) {
>> @@ -4954,16 +4959,12 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>> break;
>> }
>>
>> - if (affine_sd) {
>> +want_affine:
>> + if (want_affine) {
>> sd = NULL; /* Prefer wake_affine over balance flags */
>> - if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
>> + if (affine_sd && wake_affine(affine_sd, p, sync))
>> new_cpu = cpu;
>> - }
>> -
>> - if (!sd) {
>> - if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */
>> - new_cpu = select_idle_sibling(p, new_cpu);
>> -
>> + new_cpu = select_idle_sibling(p, new_cpu);
>> } else while (sd) {
>> struct sched_group *group;
>> int weight;
>
>
Regards,
Kirill
next prev parent reply other threads:[~2015-09-28 19:19 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-09-25 17:54 [PATCH] sched/fair: Skip wake_affine() for core siblings Kirill Tkhai
2015-09-26 15:25 ` Mike Galbraith
2015-09-28 10:28 ` Kirill Tkhai
2015-09-28 13:12 ` Mike Galbraith
2015-09-28 15:36 ` Kirill Tkhai
2015-09-28 15:49 ` Kirill Tkhai
2015-09-28 18:22 ` Mike Galbraith
2015-09-28 19:19 ` Kirill Tkhai [this message]
2015-09-29 2:03 ` Mike Galbraith
2015-09-29 14:55 ` Mike Galbraith
2015-09-29 16:00 ` Kirill Tkhai
2015-09-29 16:03 ` Kirill Tkhai
2015-09-29 17:29 ` Mike Galbraith
2015-09-30 19:16 ` Kirill Tkhai
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=560992AC.7020909@odin.com \
--to=ktkhai@odin.com \
--cc=linux-kernel@vger.kernel.org \
--cc=mingo@redhat.com \
--cc=peterz@infradead.org \
--cc=umgwanakikbuti@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.