All of lore.kernel.org
 help / color / mirror / Atom feed
From: Mike Galbraith <umgwanakikbuti@gmail.com>
To: Kirill Tkhai <ktkhai@odin.com>
Cc: linux-kernel@vger.kernel.org,
	Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>
Subject: Re: [PATCH] sched/fair: Skip wake_affine() for core siblings
Date: Mon, 28 Sep 2015 20:22:37 +0200	[thread overview]
Message-ID: <1443464557.2780.72.camel@gmail.com> (raw)
In-Reply-To: <56095E7C.7080300@odin.com>

On Mon, 2015-09-28 at 18:36 +0300, Kirill Tkhai wrote:

> Mike, one more moment. wake_wide() and current logic confuses me a bit.
> It makes us to decide if we want affine wakeup or not, but select_idle_sibling()
> if a function is not for choosing this_cpu's llc domain only. We use it
> for searching in prev_cpu llc domain too, and it seems we are not interested
> in current flips in this case.

We're always interested in "flips", as the point is to try to identify
N:M load components, and when they may overload a socket.  The hope is
to get it more right than wrong, as making the tracking really accurate
is too expensive for the fast path.

>  Imagine a situation, when we share a mutex
> with a task on another NUMA node. When the task is realising the mutex
> it is waking us, but we definitelly won't use affine logic in this case.

Why not?  A wakeup is a wakeup is a wakeup, they all do the same thing.
If wake_wide() doesn't NAK an affine wakeup, we ask wake_affine() for
its opinion, then look for an idle CPU near the waker's CPU if it says
OK, or near wakee's previous CPU if it says go away. 

> We wake the wakee anywhere and loose hot cache.

Yeah, sometimes we'll make tasks drag their data to them when we could
have dragged the task to the data in the name of trying to crank up CPU
utilization.  At some point, _somebody_ has to drag their data across
interconnect, but we really don't know if/when the data transport cost
will pay off in better utilization.

	-Mike

(I'll take a peek at below when damn futexes get done kicking my a$$)

>  I changed the logic, and
> tried pgbench 1:8. The results (I threw away 3 first iterations, because
> they much differ with iter >= 4. Looks like, the reason is in uncached disk IO).
> 
> 
> Before:
> 
>   trans. |  tps (i)     | tps (e)
> --------------------------------------
> 12098226 | 60491.067392 | 60500.886373
> 12030184 | 60150.874285 | 60160.654295
> 11882977 | 59414.829150 | 59424.830637
> 12020125 | 60100.579023 | 60111.600176
> 12161917 | 60809.547906 | 60827.321639
> 12154660 | 60773.249254 | 60783.085165
> 
> After:
> 
>  trans.  | tps (i)      | tps (e)
> --------------------------------------
> 12770407 | 63849.883578 | 63860.310019      
> 12635366 | 63176.399769 | 63187.152569      
> 12676890 | 63384.396440 | 63400.930755      
> 12639949 | 63199.526330 | 63210.460753      
> 12670626 | 63353.079951 | 63363.274143      
> 12647001 | 63209.613698 | 63219.812331      
> 
> I'm going to test other cases, but could you tell me (if you remember) are there reasons
> we skip prev_cpu, like I described above? Some types of workloads etc.
> 
> ---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4df37a4..dfbe06b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4930,8 +4930,13 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>  	int want_affine = 0;
>  	int sync = wake_flags & WF_SYNC;
>  
> -	if (sd_flag & SD_BALANCE_WAKE)
> -		want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
> +	if (sd_flag & SD_BALANCE_WAKE) {
> +		want_affine = 1;
> +		if (cpu == prev_cpu || !cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
> +			goto want_affine;
> +		if (wake_wide(p))
> +			goto want_affine;
> +	}
>  
>  	rcu_read_lock();
>  	for_each_domain(cpu, tmp) {
> @@ -4954,16 +4959,12 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>  			break;
>  	}
>  
> -	if (affine_sd) {
> +want_affine:
> +	if (want_affine) {
>  		sd = NULL; /* Prefer wake_affine over balance flags */
> -		if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
> +		if (affine_sd && wake_affine(affine_sd, p, sync))
>  			new_cpu = cpu;
> -	}
> -
> -	if (!sd) {
> -		if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */
> -			new_cpu = select_idle_sibling(p, new_cpu);
> -
> +		new_cpu = select_idle_sibling(p, new_cpu);
>  	} else while (sd) {
>  		struct sched_group *group;
>  		int weight;



  parent reply	other threads:[~2015-09-28 18:22 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-09-25 17:54 [PATCH] sched/fair: Skip wake_affine() for core siblings Kirill Tkhai
2015-09-26 15:25 ` Mike Galbraith
2015-09-28 10:28   ` Kirill Tkhai
2015-09-28 13:12     ` Mike Galbraith
2015-09-28 15:36       ` Kirill Tkhai
2015-09-28 15:49         ` Kirill Tkhai
2015-09-28 18:22         ` Mike Galbraith [this message]
2015-09-28 19:19           ` Kirill Tkhai
2015-09-29  2:03             ` Mike Galbraith
2015-09-29 14:55         ` Mike Galbraith
2015-09-29 16:00           ` Kirill Tkhai
2015-09-29 16:03             ` Kirill Tkhai
2015-09-29 17:29             ` Mike Galbraith
2015-09-30 19:16               ` Kirill Tkhai

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1443464557.2780.72.camel@gmail.com \
    --to=umgwanakikbuti@gmail.com \
    --cc=ktkhai@odin.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.