All of lore.kernel.org
 help / color / mirror / Atom feed
From: Kirill Tkhai <ktkhai@odin.com>
To: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: <linux-kernel@vger.kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>
Subject: Re: [PATCH] sched/fair: Skip wake_affine() for core siblings
Date: Mon, 28 Sep 2015 22:19:08 +0300	[thread overview]
Message-ID: <560992AC.7020909@odin.com> (raw)
In-Reply-To: <1443464557.2780.72.camel@gmail.com>



On 28.09.2015 21:22, Mike Galbraith wrote:
> On Mon, 2015-09-28 at 18:36 +0300, Kirill Tkhai wrote:
> 
>> Mike, one more moment. wake_wide() and current logic confuses me a bit.
>> It makes us to decide if we want affine wakeup or not, but select_idle_sibling()
>> if a function is not for choosing this_cpu's llc domain only. We use it
>> for searching in prev_cpu llc domain too, and it seems we are not interested
>> in current flips in this case.
> 
> We're always interested in "flips", as the point is to try to identify
> N:M load components, and when they may overload a socket.  The hope is
> to get it more right than wrong, as making the tracking really accurate
> is too expensive for the fast path.
> 
>>  Imagine a situation, when we share a mutex
>> with a task on another NUMA node. When the task is realising the mutex
>> it is waking us, but we definitelly won't use affine logic in this case.
> 
> Why not?  A wakeup is a wakeup is a wakeup, they all do the same thing.
> If wake_wide() doesn't NAK an affine wakeup, we ask wake_affine() for
> its opinion, then look for an idle CPU near the waker's CPU if it says
> OK, or near wakee's previous CPU if it says go away. 

But NUMA sd does not have SD_WAKE_AFFINE flag, so this case a new cpu won't
be choosen from previous node. There will be choosen the highest domain
of smp_processor_id(), which has SD_BALANCE_WAKE flag, and the cpu will
be choosen from the idlest group/cpu. And we don't have a deal with old
cache at all. This looks like a completely wrong behaviour...

>> We wake the wakee anywhere and loose hot cache.
> 
> Yeah, sometimes we'll make tasks drag their data to them when we could
> have dragged the task to the data in the name of trying to crank up CPU
> utilization.  At some point, _somebody_ has to drag their data across
> interconnect, but we really don't know if/when the data transport cost
> will pay off in better utilization.
> 
> 	-Mike
> 
> (I'll take a peek at below when damn futexes get done kicking my a$$)

This case, you'll be able to analyze new results below :)

ORIGIN  1               2               3               4               avg
C=1	8607,960451	8344,16111	8381,197569	7991,84102	8331,2900375
C=2	15397,152685	15438,913761	15771,512182	15648,368819	15563,98686175
C=4	30987,860844	31144,127431	31104,153461	30874,292825	31027,60864025
C=8	62447,47612	62179,788923	61534,482204	62787,894021	62237,410317
					
					
PATCHED	1               2               3               4               avg		
C=1	8782,439938	8675,891877	8609,209537	8735,120895	8700,66556175
C=2	16526,31409	16491,650678	16149,594736	16365,630084	16383,297397
C=4	32286,341708	32313,536565	32538,285157	32299,427398	32359,397707
C=8	63860,310019	63187,152569	63400,930755	63210,460753	63414,713524

This is:
# pgbench -j 1 -S -c X -T 200 test

The test machine has no NUMA, and I suppose, NUMA machine will show much better results.
I'll test it tomorrow.

>>  I changed the logic, and
>> tried pgbench 1:8. The results (I threw away 3 first iterations, because
>> they much differ with iter >= 4. Looks like, the reason is in uncached disk IO).
>>
>>
>> Before:
>>
>>   trans. |  tps (i)     | tps (e)
>> --------------------------------------
>> 12098226 | 60491.067392 | 60500.886373
>> 12030184 | 60150.874285 | 60160.654295
>> 11882977 | 59414.829150 | 59424.830637
>> 12020125 | 60100.579023 | 60111.600176
>> 12161917 | 60809.547906 | 60827.321639
>> 12154660 | 60773.249254 | 60783.085165
>>
>> After:
>>
>>  trans.  | tps (i)      | tps (e)
>> --------------------------------------
>> 12770407 | 63849.883578 | 63860.310019      
>> 12635366 | 63176.399769 | 63187.152569      
>> 12676890 | 63384.396440 | 63400.930755      
>> 12639949 | 63199.526330 | 63210.460753      
>> 12670626 | 63353.079951 | 63363.274143      
>> 12647001 | 63209.613698 | 63219.812331      
>>
>> I'm going to test other cases, but could you tell me (if you remember) are there reasons
>> we skip prev_cpu, like I described above? Some types of workloads etc.
>>
>> ---
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 4df37a4..dfbe06b 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4930,8 +4930,13 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>>  	int want_affine = 0;
>>  	int sync = wake_flags & WF_SYNC;
>>  
>> -	if (sd_flag & SD_BALANCE_WAKE)
>> -		want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
>> +	if (sd_flag & SD_BALANCE_WAKE) {
>> +		want_affine = 1;
>> +		if (cpu == prev_cpu || !cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
>> +			goto want_affine;
>> +		if (wake_wide(p))
>> +			goto want_affine;
>> +	}
>>  
>>  	rcu_read_lock();
>>  	for_each_domain(cpu, tmp) {
>> @@ -4954,16 +4959,12 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>>  			break;
>>  	}
>>  
>> -	if (affine_sd) {
>> +want_affine:
>> +	if (want_affine) {
>>  		sd = NULL; /* Prefer wake_affine over balance flags */
>> -		if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
>> +		if (affine_sd && wake_affine(affine_sd, p, sync))
>>  			new_cpu = cpu;
>> -	}
>> -
>> -	if (!sd) {
>> -		if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */
>> -			new_cpu = select_idle_sibling(p, new_cpu);
>> -
>> +		new_cpu = select_idle_sibling(p, new_cpu);
>>  	} else while (sd) {
>>  		struct sched_group *group;
>>  		int weight;
> 
> 

Regards,
Kirill

  reply	other threads:[~2015-09-28 19:19 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-09-25 17:54 [PATCH] sched/fair: Skip wake_affine() for core siblings Kirill Tkhai
2015-09-26 15:25 ` Mike Galbraith
2015-09-28 10:28   ` Kirill Tkhai
2015-09-28 13:12     ` Mike Galbraith
2015-09-28 15:36       ` Kirill Tkhai
2015-09-28 15:49         ` Kirill Tkhai
2015-09-28 18:22         ` Mike Galbraith
2015-09-28 19:19           ` Kirill Tkhai [this message]
2015-09-29  2:03             ` Mike Galbraith
2015-09-29 14:55         ` Mike Galbraith
2015-09-29 16:00           ` Kirill Tkhai
2015-09-29 16:03             ` Kirill Tkhai
2015-09-29 17:29             ` Mike Galbraith
2015-09-30 19:16               ` Kirill Tkhai

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=560992AC.7020909@odin.com \
    --to=ktkhai@odin.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=umgwanakikbuti@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.