* [PATCH] sched/fair: Skip wake_affine() for core siblings
@ 2015-09-25 17:54 Kirill Tkhai
  2015-09-26 15:25 ` Mike Galbraith
  0 siblings, 1 reply; 14+ messages in thread
From: Kirill Tkhai @ 2015-09-25 17:54 UTC (permalink / raw)
  To: linux-kernel, Peter Zijlstra, Ingo Molnar

We are not interested in the actual target if both the
prev and curr CPUs share a CPU cache: select_idle_sibling()
searches in top-down order, the top level is the same
for both of them, and so the result will be the same.
Thus we can save a few CPU cycles and cache misses
and skip the wake_affine() calculations.

tbench on 2 physical CPU Xeon (x 6 cores x 2 ht) inside cgroup:

threads  | Before         |  After
-------------------------------------------
      1  | 203.943 MB/sec |  211.524 MB/sec
      2  | 407.211 MB/sec |  411.701 MB/sec
      3  | 591.089 MB/sec |  608.404 MB/sec
      4  | 743.768 MB/sec |  790.026 MB/sec (+ 6.2%)
      5  | 914.237 MB/sec |  972.882 MB/sec (+ 6.4%)
      6  | 1053.91 MB/sec |  1092.81 MB/sec
      7  | 1208.24 MB/sec |  1281.1 MB/sec  (+ 6.0%)
      8  | 1357.53 MB/sec |  1385.79 MB/sec
      9  | 1474.11 MB/sec |  1496.76 MB/sec
     10  | 1586.89 MB/sec |  1616.76 MB/sec
     11  | 1720.17 MB/sec |  1732.7 MB/sec
     12  | 1835.4 MB/sec  |  1868.77 MB/sec
     13  | 1964.76 MB/sec |  2003.68 MB/sec
     14  | 2117.01 MB/sec |  2128.16 MB/sec
     15  | 2220.97 MB/sec |  2254.8 MB/sec
     16  | 2326.52 MB/sec |  2378.38 MB/sec
     17  | 2458.79 MB/sec |  2484.15 MB/sec
     18  | 2473.59 MB/sec |  2591.01 MB/sec (+ 4.7%)

Signed-off-by: Kirill Tkhai <ktkhai@odin.com>
---
 kernel/sched/fair.c |    3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4df37a4..b378c34 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4666,6 +4666,9 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 	unsigned long weight;
 	int balanced;
 
+	if (sd->flags & SD_SHARE_PKG_RESOURCES)
+		return 1;
+
 	idx	  = sd->wake_idx;
 	this_cpu  = smp_processor_id();
 	prev_cpu  = task_cpu(p);
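
For reference, the "share CPU cache" condition that the new SD_SHARE_PKG_RESOURCES
check encodes is the same thing the scheduler already exposes through the
cpus_share_cache() helper; a minimal sketch of that helper as it looked in kernels
of this era (reconstructed for context, so treat details as approximate):

	/*
	 * Two cpus share their last-level cache iff they hang off the same
	 * sd_llc domain, i.e. their cached sd_llc_id values match.
	 */
	bool cpus_share_cache(int this_cpu, int that_cpu)
	{
		return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
	}

So an equivalent way of writing the bail-out above would be to test
cpus_share_cache(task_cpu(p), smp_processor_id()) before doing any of the load
computations.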



* Re: [PATCH] sched/fair: Skip wake_affine() for core siblings
  2015-09-25 17:54 [PATCH] sched/fair: Skip wake_affine() for core siblings Kirill Tkhai
@ 2015-09-26 15:25 ` Mike Galbraith
  2015-09-28 10:28   ` Kirill Tkhai
  0 siblings, 1 reply; 14+ messages in thread
From: Mike Galbraith @ 2015-09-26 15:25 UTC (permalink / raw)
  To: Kirill Tkhai; +Cc: linux-kernel, Peter Zijlstra, Ingo Molnar

On Fri, 2015-09-25 at 20:54 +0300, Kirill Tkhai wrote:
> We are not interested in actual target if both prev
> and curr cpus share CPU cache. select_idle_sibling()
> searches in top-down order; top level is the same
> for both of them, and the result will be the same.
> So, we can save a little CPU cycles and cache misses
> and skip wake_affine() calculations.

But, whereas previously wake_affine() could NAK a migration if it would
create an imbalance, we'll now just go ahead and stack tasks if
select_idle_sibling() can't find an idle home to override the blanket
approval.  It doesn't look like a good idea to me to bounce tasks around
only to then perhaps stack them, as if we do stack waker/wakee, we
certainly lose concurrency. (microbenchmarks like pipe-test love that,
but not all that many real applications play ping-pong for a living;)

I spent most of the day piddling with your little patch, so I'll post
some condensed mixed load notes.

concurrent tbench 4 + pgbench, 30 seconds per client count (i4790+smt)
                                             master                           master+
pgbench                   1       2       3     avg         1       2       3     avg   comp
clients 1       tps = 18768   18591   18264   18541     18351   17257   17245   17617   .950
clients 2       tps = 30779   30661   31016   30818     29112   28026   29026   28721   .931
clients 4       tps = 54195   55100   54048   54447     53290   52336   52930   52852   .970
clients 8       tps = 60332   67052   64699   64027     38491   35746   37746   37327   .582!!

Do the opposite, wake_affine() always NAKs.
                                             master                           master++
pgbench                   1       2       3     avg         1       2       3     avg   comp
clients 1       tps = 18768   18591   18264   18541     16874   16865   16665   16801   .906
clients 2       tps = 30779   30661   31016   30818     33562   33546   33681   33596  1.090
clients 4       tps = 54195   55100   54048   54447     61544   61482   61117   61381  1.127
clients 8       tps = 60332   67052   64699   64027     75171   75524   75318   75337  1.176

...

virgin vs your patch again, 2 _minutes_ per client count, as I noticed much variance at 8
clients, where wake_wide() is supposed to kick in to keep N:M load spread out.

                                             master                           master+
pgbench                   1       2       3     avg         1       2       3     avg   comp
clients 1       tps = 18548   18673   18390   18537     17879   17652   17621   17717   .955
clients 2       tps = 31083   31110   30859   31017     30274   30003   29796   30024   .967
clients 4       tps = 53107   53156   53601   53288     52658   53024   53449   53043   .995
clients 8       tps = 34213   34310   28844   32455     31360   31416   30732   31169   .960

30 seconds per run isn't enough, and wake_wide() is not doing a wonderful job for 1:N pgbench.

hrmph, twiddle...

waker/wakee coupling strengthened
postgres@homer:~> pgbench.sh
clients 1       tps = 18035
clients 2       tps = 32525
clients 4       tps = 53246
clients 8       tps = 37278

better, but not enough..  + sd_llc_size = #cores vs #threads
postgres@homer:~> pgbench.sh
clients 1       tps = 18482
clients 2       tps = 32366
clients 4       tps = 54557
clients 8       tps = 69643

Ok, that's what I want to see, full repeat.
master = twiddle
master+ = twiddle+patch

concurrent tbench 4 + pgbench, 2 minutes per client count (i4790+smt)
                                             master                           master+
pgbench                   1       2       3     avg         1       2       3     avg   comp
clients 1       tps = 18599   18627   18532   18586     17480   17682   17606   17589   .946
clients 2       tps = 32344   32313   32408   32355     25167   26140   23730   25012   .773
clients 4       tps = 52593   51390   51095   51692     22983   23046   22427   22818   .441
clients 8       tps = 70354   69583   70107   70014     66924   66672   69310   67635   .966

Hrm... turn the tables, measure tbench while pgbench 4 client load runs endlessly.

                                             master                           master+
tbench                    1       2       3     avg         1       2       3     avg   comp
pairs 1        MB/s =   430     426     436     430       481     481     494     485  1.127
pairs 2        MB/s =  1083    1085    1072    1080      1086    1090    1083    1086  1.005
pairs 4        MB/s =  1725    1697    1729    1717      2023    2002    2006    2010  1.170
pairs 8        MB/s =  2740    2631    2700    2690      3016    2977    3071    3021  1.123

tbench without competition
               master        master+   comp
pairs 1        MB/s =   694     692    .997 
pairs 2        MB/s =  1268    1259    .992
pairs 4        MB/s =  2210    2165    .979
pairs 8        MB/s =  3586    3526    .983  (yawn, all within routine variance)

twiddle:

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6048,14 +6048,18 @@ static void update_top_cache_domain(int
 {
 	struct sched_domain *sd;
 	struct sched_domain *busy_sd = NULL;
+	struct sched_group *group;
 	int id = cpu;
 	int size = 1;
 
 	sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
 	if (sd) {
 		id = cpumask_first(sched_domain_span(sd));
-		size = cpumask_weight(sched_domain_span(sd));
 		busy_sd = sd->parent; /* sd_busy */
+		group = sd->groups;
+		/* Set size to the number of cores, not threads */
+		while (group = group->next, group != sd->groups)
+			size++;
 	}
 	rcu_assign_pointer(per_cpu(sd_busy, cpu), busy_sd);
 
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4421,19 +4421,26 @@ static unsigned long cpu_avg_load_per_ta
 
 static void record_wakee(struct task_struct *p)
 {
+	unsigned long now = jiffies;
+
 	/*
 	 * Rough decay (wiping) for cost saving, don't worry
 	 * about the boundary, really active task won't care
 	 * about the loss.
 	 */
-	if (time_after(jiffies, current->wakee_flip_decay_ts + HZ)) {
+	if (time_after(now, current->wakee_flip_decay_ts + HZ)) {
 		current->wakee_flips >>= 1;
-		current->wakee_flip_decay_ts = jiffies;
+		current->wakee_flip_decay_ts = now;
+	}
+	if (time_after(now, p->wakee_flip_decay_ts + HZ)) {
+		p->wakee_flips >>= 1;
+		p->wakee_flip_decay_ts = now;
 	}
 
 	if (current->last_wakee != p) {
 		current->last_wakee = p;
 		current->wakee_flips++;
+		p->wakee_flips++;
 	}
 }
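
For readers following along, both knobs twiddled above feed the wake_wide()
heuristic: sd_llc_size is its "factor", and the wakee_flips counters are what it
compares.  A minimal sketch of wake_wide() as it looked in mainline around this
time (reconstructed from memory, so approximate):

	static int wake_wide(struct task_struct *p)
	{
		unsigned int master = current->wakee_flips;
		unsigned int slave = p->wakee_flips;
		int factor = this_cpu_read(sd_llc_size);

		if (master < slave)
			swap(master, slave);
		/* Low flip counts look 1:1-ish: allow an affine wakeup. */
		if (slave < factor || master < slave * factor)
			return 0;
		/* Otherwise it smells like N:M waking: spread the load out. */
		return 1;
	}

Making sd_llc_size count cores instead of SMT threads lowers the factor, so
wake_wide() starts spreading the load earlier.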
 




* Re: [PATCH] sched/fair: Skip wake_affine() for core siblings
  2015-09-26 15:25 ` Mike Galbraith
@ 2015-09-28 10:28   ` Kirill Tkhai
  2015-09-28 13:12     ` Mike Galbraith
  0 siblings, 1 reply; 14+ messages in thread
From: Kirill Tkhai @ 2015-09-28 10:28 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: linux-kernel, Peter Zijlstra, Ingo Molnar


On 26.09.2015 18:25, Mike Galbraith wrote:
> On Fri, 2015-09-25 at 20:54 +0300, Kirill Tkhai wrote:
>> We are not interested in actual target if both prev
>> and curr cpus share CPU cache. select_idle_sibling()
>> searches in top-down order; top level is the same
>> for both of them, and the result will be the same.
>> So, we can save a little CPU cycles and cache misses
>> and skip wake_affine() calculations.
> 
> But, whereas previously wake_affine() could NAK a migration if it would
> create an imbalance, we'll now just go ahead and stack tasks if
> select_idle_sibling() can't find an idle home to override the blanket
> approval.  It doesn't look like a good idea to me to bounce tasks around
> only to then perhaps stack them, as if we do stack waker/wakee, we
> certainly lose concurrency. (microbenchmarks like pipe-test love that,
> but not all that many real applications play ping-pong for a living;)
> 
> I spent most of the day piddling with your little patch, so I'll post
> some condensed mixed load notes.
> 
> concurrent tbench 4 + pgbench, 30 seconds per client count (i4790+smt)
>                                              master                           master+
> pgbench                   1       2       3     avg         1       2       3     avg   comp
> clients 1       tps = 18768   18591   18264   18541     18351   17257   17245   17617   .950
> clients 2       tps = 30779   30661   31016   30818     29112   28026   29026   28721   .931
> clients 4       tps = 54195   55100   54048   54447     53290   52336   52930   52852   .970
> clients 8       tps = 60332   67052   64699   64027     38491   35746   37746   37327   .582!!

Yeah, this is terrible.

> Do the opposite, wake_affine() always NAKs.
>                                              master                           master++
> pgbench                   1       2       3     avg         1       2       3     avg   comp
> clients 1       tps = 18768   18591   18264   18541     16874   16865   16665   16801   .906
> clients 2       tps = 30779   30661   31016   30818     33562   33546   33681   33596  1.090
> clients 4       tps = 54195   55100   54048   54447     61544   61482   61117   61381  1.127
> clients 8       tps = 60332   67052   64699   64027     75171   75524   75318   75337  1.176

Looks like a NAK may be better, because it preserves the L1 cache, while the patch always invalidates it.

Could you say whether you execute pgbench with just -cX -jY -T30 or something special? I've tried it,
but the dispersion of the results varies a lot from run to run.

> 
> ...
> 
> virgin vs your patch again, 2 _minutes_ per client count, as I noticed much variance at 8
> clients, where wake_wide() is supposed to kick in to keep N:M load spread out.
> 
>                                              master                           master+
> pgbench                   1       2       3     avg         1       2       3     avg   comp
> clients 1       tps = 18548   18673   18390   18537     17879   17652   17621   17717   .955
> clients 2       tps = 31083   31110   30859   31017     30274   30003   29796   30024   .967
> clients 4       tps = 53107   53156   53601   53288     52658   53024   53449   53043   .995
> clients 8       tps = 34213   34310   28844   32455     31360   31416   30732   31169   .960
> 
> 30 seconds per run isn't enough, and wake_wide() is not doing a wonderful job for 1:N pgbench.
> 
> hrmph, twiddle...
> 
> waker/wakee coupling strengthened
> postgres@homer:~> pgbench.sh
> clients 1       tps = 18035
> clients 2       tps = 32525
> clients 4       tps = 53246
> clients 8       tps = 37278
> 
> better, but not enough..  + sd_llc_size = #cores vs #threads
> postgres@homer:~> pgbench.sh
> clients 1       tps = 18482
> clients 2       tps = 32366
> clients 4       tps = 54557
> clients 8       tps = 69643
> 
> Ok, that's what I want to see, full repeat.
> master = twiddle
> master+ = twiddle+patch
> 
> concurrent tbench 4 + pgbench, 2 minutes per client count (i4790+smt)
>                                              master                           master+
> pgbench                   1       2       3     avg         1       2       3     avg   comp
> clients 1       tps = 18599   18627   18532   18586     17480   17682   17606   17589   .946
> clients 2       tps = 32344   32313   32408   32355     25167   26140   23730   25012   .773
> clients 4       tps = 52593   51390   51095   51692     22983   23046   22427   22818   .441
> clients 8       tps = 70354   69583   70107   70014     66924   66672   69310   67635   .966
> 
> Hrm... turn the tables, measure tbench while pgbench 4 client load runs endlessly.
> 
>                                              master                           master+
> tbench                    1       2       3     avg         1       2       3     avg   comp
> pairs 1        MB/s =   430     426     436     430       481     481     494     485  1.127
> pairs 2        MB/s =  1083    1085    1072    1080      1086    1090    1083    1086  1.005
> pairs 4        MB/s =  1725    1697    1729    1717      2023    2002    2006    2010  1.170
> pairs 8        MB/s =  2740    2631    2700    2690      3016    2977    3071    3021  1.123
> 
> tbench without competition
>                master        master+   comp
> pairs 1        MB/s =   694     692    .997 
> pairs 2        MB/s =  1268    1259    .992
> pairs 4        MB/s =  2210    2165    .979
> pairs 8        MB/s =  3586    3526    .983  (yawn, all within routine variance)

Hm, it seems tbench with competition is better only because a busy system makes the tbench
processes be woken on the same cpu.
 
> twiddle:
> 
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6048,14 +6048,18 @@ static void update_top_cache_domain(int
>  {
>  	struct sched_domain *sd;
>  	struct sched_domain *busy_sd = NULL;
> +	struct sched_group *group;
>  	int id = cpu;
>  	int size = 1;
>  
>  	sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
>  	if (sd) {
>  		id = cpumask_first(sched_domain_span(sd));
> -		size = cpumask_weight(sched_domain_span(sd));
>  		busy_sd = sd->parent; /* sd_busy */
> +		group = sd->groups;
> +		/* Set size to the number of cores, not threads */
> +		while (group = group->next, group != sd->groups)
> +			size++;
>  	}
>  	rcu_assign_pointer(per_cpu(sd_busy, cpu), busy_sd);
>  
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4421,19 +4421,26 @@ static unsigned long cpu_avg_load_per_ta
>  
>  static void record_wakee(struct task_struct *p)
>  {
> +	unsigned long now = jiffies;
> +
>  	/*
>  	 * Rough decay (wiping) for cost saving, don't worry
>  	 * about the boundary, really active task won't care
>  	 * about the loss.
>  	 */
> -	if (time_after(jiffies, current->wakee_flip_decay_ts + HZ)) {
> +	if (time_after(now, current->wakee_flip_decay_ts + HZ)) {
>  		current->wakee_flips >>= 1;
> -		current->wakee_flip_decay_ts = jiffies;
> +		current->wakee_flip_decay_ts = now;
> +	}
> +	if (time_after(now, p->wakee_flip_decay_ts + HZ)) {
> +		p->wakee_flips >>= 1;
> +		p->wakee_flip_decay_ts = now;
>  	}
>  
>  	if (current->last_wakee != p) {
>  		current->last_wakee = p;
>  		current->wakee_flips++;
> +		p->wakee_flips++;
>  	}
>  }
>  
> 

Regards,
Kirill


* Re: [PATCH] sched/fair: Skip wake_affine() for core siblings
  2015-09-28 10:28   ` Kirill Tkhai
@ 2015-09-28 13:12     ` Mike Galbraith
  2015-09-28 15:36       ` Kirill Tkhai
  0 siblings, 1 reply; 14+ messages in thread
From: Mike Galbraith @ 2015-09-28 13:12 UTC (permalink / raw)
  To: Kirill Tkhai; +Cc: linux-kernel, Peter Zijlstra, Ingo Molnar

On Mon, 2015-09-28 at 13:28 +0300, Kirill Tkhai wrote:

> Looks like, NAK may be better, because it saves L1 cache, while the patch always invalidates it.

Yeah, bounce hurts more when there's no concurrency win waiting to be
collected.  This mixed load wasn't a great choice, but it turned out to
be pretty interesting.  Something waking a gaggle of waiters on a busy
big socket should do very bad things.

> Could you say, do you execute pgbench using just -cX -jY -T30 or something special? I've tried it,
> but the dispersion of the results much differs from time to time.

pgbench -T $testtime -j 1 -S -c $clients

> > Ok, that's what I want to see, full repeat.
> > master = twiddle
> > master+ = twiddle+patch
> > 
> > concurrent tbench 4 + pgbench, 2 minutes per client count (i4790+smt)
> >                                              master                           master+
> > pgbench                   1       2       3     avg         1       2       3     avg   comp
> > clients 1       tps = 18599   18627   18532   18586     17480   17682   17606   17589   .946
> > clients 2       tps = 32344   32313   32408   32355     25167   26140   23730   25012   .773
> > clients 4       tps = 52593   51390   51095   51692     22983   23046   22427   22818   .441
> > clients 8       tps = 70354   69583   70107   70014     66924   66672   69310   67635   .966
> > 
> > Hrm... turn the tables, measure tbench while pgbench 4 client load runs endlessly.
> > 
> >                                              master                           master+
> > tbench                    1       2       3     avg         1       2       3     avg   comp
> > pairs 1        MB/s =   430     426     436     430       481     481     494     485  1.127
> > pairs 2        MB/s =  1083    1085    1072    1080      1086    1090    1083    1086  1.005
> > pairs 4        MB/s =  1725    1697    1729    1717      2023    2002    2006    2010  1.170
> > pairs 8        MB/s =  2740    2631    2700    2690      3016    2977    3071    3021  1.123
> > 
> > tbench without competition
> >                master        master+   comp
> > pairs 1        MB/s =   694     692    .997 
> > pairs 2        MB/s =  1268    1259    .992
> > pairs 4        MB/s =  2210    2165    .979
> > pairs 8        MB/s =  3586    3526    .983  (yawn, all within routine variance)
> 
> Hm, it seems tbench with competition is better only because of a busy system makes tbench
> processes be woken on the same cpu.

Yeah.  When box is really full, select_idle_sibling() (obviously) turns
into a waste of cycles, but even as you approach that, especially when
filling the box with identical copies of nearly fully synchronous high
frequency localhost packet blasters, stacking is a win.

What bent my head up a bit was the combined effect of making wake_wide()
really keep pgbench from collapsing then adding the affine wakeup grant
for tbench.  It's not at all clear to me why 2,4 would be so demolished.

	-Mike



* Re: [PATCH] sched/fair: Skip wake_affine() for core siblings
  2015-09-28 13:12     ` Mike Galbraith
@ 2015-09-28 15:36       ` Kirill Tkhai
  2015-09-28 15:49         ` Kirill Tkhai
                           ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Kirill Tkhai @ 2015-09-28 15:36 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: linux-kernel, Peter Zijlstra, Ingo Molnar

On 28.09.2015 16:12, Mike Galbraith wrote:
> On Mon, 2015-09-28 at 13:28 +0300, Kirill Tkhai wrote:
> 
>> Looks like, NAK may be better, because it saves L1 cache, while the patch always invalidates it.
> 
> Yeah, bounce hurts more when there's no concurrency win waiting to be
> collected.  This mixed load wasn't a great choice, but it turned out to
> be pretty interesting.  Something waking a gaggle of waiters on a busy
> big socket should do very bad things.
> 
>> Could you say, do you execute pgbench using just -cX -jY -T30 or something special? I've tried it,
>> but the dispersion of the results much differs from time to time.
> 
> pgbench -T $testtime -j 1 -S -c $clients

Using -S, the results stabilized. It looks like my db is enormous and there is some problem with that. I will
investigate.

Thanks!
 
>>> Ok, that's what I want to see, full repeat.
>>> master = twiddle
>>> master+ = twiddle+patch
>>>
>>> concurrent tbench 4 + pgbench, 2 minutes per client count (i4790+smt)
>>>                                              master                           master+
>>> pgbench                   1       2       3     avg         1       2       3     avg   comp
>>> clients 1       tps = 18599   18627   18532   18586     17480   17682   17606   17589   .946
>>> clients 2       tps = 32344   32313   32408   32355     25167   26140   23730   25012   .773
>>> clients 4       tps = 52593   51390   51095   51692     22983   23046   22427   22818   .441
>>> clients 8       tps = 70354   69583   70107   70014     66924   66672   69310   67635   .966
>>>
>>> Hrm... turn the tables, measure tbench while pgbench 4 client load runs endlessly.
>>>
>>>                                              master                           master+
>>> tbench                    1       2       3     avg         1       2       3     avg   comp
>>> pairs 1        MB/s =   430     426     436     430       481     481     494     485  1.127
>>> pairs 2        MB/s =  1083    1085    1072    1080      1086    1090    1083    1086  1.005
>>> pairs 4        MB/s =  1725    1697    1729    1717      2023    2002    2006    2010  1.170
>>> pairs 8        MB/s =  2740    2631    2700    2690      3016    2977    3071    3021  1.123
>>>
>>> tbench without competition
>>>                master        master+   comp
>>> pairs 1        MB/s =   694     692    .997 
>>> pairs 2        MB/s =  1268    1259    .992
>>> pairs 4        MB/s =  2210    2165    .979
>>> pairs 8        MB/s =  3586    3526    .983  (yawn, all within routine variance)
>>
>> Hm, it seems tbench with competition is better only because of a busy system makes tbench
>> processes be woken on the same cpu.
> 
> Yeah.  When box is really full, select_idle_sibling() (obviously) turns
> into a waste of cycles, but even as you approach that, especially when
> filling the box with identical copies of nearly fully synchronous high
> frequency localhost packet blasters, stacking is a win.
> 
> What bent my head up a bit was the combined effect of making wake_wide()
> really keep pgbench from collapsing then adding the affine wakeup grant
> for tbench.  It's not at all clear to me why 2,4 would be so demolished.

Mike, one more moment. wake_wide() and the current logic confuse me a bit.
They make us decide whether we want an affine wakeup or not, but select_idle_sibling()
is not a function for choosing this_cpu's llc domain only. We use it
for searching in prev_cpu's llc domain too, and it seems we are not interested
in current's flips in that case. Imagine a situation where we share a mutex
with a task on another NUMA node. When that task is releasing the mutex
it wakes us, but we definitely won't use the affine logic in this case.
We wake the wakee anywhere and lose the hot cache. I changed the logic and
tried pgbench 1:8. The results (I threw away the first 3 iterations, because
they differ much from iterations >= 4; it looks like the reason is uncached disk IO):


Before:

  trans. |  tps (i)     | tps (e)
--------------------------------------
12098226 | 60491.067392 | 60500.886373
12030184 | 60150.874285 | 60160.654295
11882977 | 59414.829150 | 59424.830637
12020125 | 60100.579023 | 60111.600176
12161917 | 60809.547906 | 60827.321639
12154660 | 60773.249254 | 60783.085165

After:

 trans.  | tps (i)      | tps (e)
--------------------------------------
12770407 | 63849.883578 | 63860.310019      
12635366 | 63176.399769 | 63187.152569      
12676890 | 63384.396440 | 63400.930755      
12639949 | 63199.526330 | 63210.460753      
12670626 | 63353.079951 | 63363.274143      
12647001 | 63209.613698 | 63219.812331      

I'm going to test other cases, but could you tell me (if you remember) whether there are reasons
we skip prev_cpu, like I described above? Some types of workloads, etc.?

---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4df37a4..dfbe06b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4930,8 +4930,13 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 	int want_affine = 0;
 	int sync = wake_flags & WF_SYNC;
 
-	if (sd_flag & SD_BALANCE_WAKE)
-		want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
+	if (sd_flag & SD_BALANCE_WAKE) {
+		want_affine = 1;
+		if (cpu == prev_cpu || !cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
+			goto want_affine;
+		if (wake_wide(p))
+			goto want_affine;
+	}
 
 	rcu_read_lock();
 	for_each_domain(cpu, tmp) {
@@ -4954,16 +4959,12 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 			break;
 	}
 
-	if (affine_sd) {
+want_affine:
+	if (want_affine) {
 		sd = NULL; /* Prefer wake_affine over balance flags */
-		if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
+		if (affine_sd && wake_affine(affine_sd, p, sync))
 			new_cpu = cpu;
-	}
-
-	if (!sd) {
-		if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */
-			new_cpu = select_idle_sibling(p, new_cpu);
-
+		new_cpu = select_idle_sibling(p, new_cpu);
 	} else while (sd) {
 		struct sched_group *group;
 		int weight;


* Re: [PATCH] sched/fair: Skip wake_affine() for core siblings
  2015-09-28 15:36       ` Kirill Tkhai
@ 2015-09-28 15:49         ` Kirill Tkhai
  2015-09-28 18:22         ` Mike Galbraith
  2015-09-29 14:55         ` Mike Galbraith
  2 siblings, 0 replies; 14+ messages in thread
From: Kirill Tkhai @ 2015-09-28 15:49 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: linux-kernel, Peter Zijlstra, Ingo Molnar



On 28.09.2015 18:36, Kirill Tkhai wrote:
> On 28.09.2015 16:12, Mike Galbraith wrote:
>> On Mon, 2015-09-28 at 13:28 +0300, Kirill Tkhai wrote:
>>
>>> Looks like, NAK may be better, because it saves L1 cache, while the patch always invalidates it.
>>
>> Yeah, bounce hurts more when there's no concurrency win waiting to be
>> collected.  This mixed load wasn't a great choice, but it turned out to
>> be pretty interesting.  Something waking a gaggle of waiters on a busy
>> big socket should do very bad things.
>>
>>> Could you say, do you execute pgbench using just -cX -jY -T30 or something special? I've tried it,
>>> but the dispersion of the results much differs from time to time.
>>
>> pgbench -T $testtime -j 1 -S -c $clients
> 
> Using -S the results stabilized. It looks like my db is enormous, and some problem with that. I will
> investigate.
> 
> Thanks!
>  
>>>> Ok, that's what I want to see, full repeat.
>>>> master = twiddle
>>>> master+ = twiddle+patch
>>>>
>>>> concurrent tbench 4 + pgbench, 2 minutes per client count (i4790+smt)
>>>>                                              master                           master+
>>>> pgbench                   1       2       3     avg         1       2       3     avg   comp
>>>> clients 1       tps = 18599   18627   18532   18586     17480   17682   17606   17589   .946
>>>> clients 2       tps = 32344   32313   32408   32355     25167   26140   23730   25012   .773
>>>> clients 4       tps = 52593   51390   51095   51692     22983   23046   22427   22818   .441
>>>> clients 8       tps = 70354   69583   70107   70014     66924   66672   69310   67635   .966
>>>>
>>>> Hrm... turn the tables, measure tbench while pgbench 4 client load runs endlessly.
>>>>
>>>>                                              master                           master+
>>>> tbench                    1       2       3     avg         1       2       3     avg   comp
>>>> pairs 1        MB/s =   430     426     436     430       481     481     494     485  1.127
>>>> pairs 2        MB/s =  1083    1085    1072    1080      1086    1090    1083    1086  1.005
>>>> pairs 4        MB/s =  1725    1697    1729    1717      2023    2002    2006    2010  1.170
>>>> pairs 8        MB/s =  2740    2631    2700    2690      3016    2977    3071    3021  1.123
>>>>
>>>> tbench without competition
>>>>                master        master+   comp
>>>> pairs 1        MB/s =   694     692    .997 
>>>> pairs 2        MB/s =  1268    1259    .992
>>>> pairs 4        MB/s =  2210    2165    .979
>>>> pairs 8        MB/s =  3586    3526    .983  (yawn, all within routine variance)
>>>
>>> Hm, it seems tbench with competition is better only because of a busy system makes tbench
>>> processes be woken on the same cpu.
>>
>> Yeah.  When box is really full, select_idle_sibling() (obviously) turns
>> into a waste of cycles, but even as you approach that, especially when
>> filling the box with identical copies of nearly fully synchronous high
>> frequency localhost packet blasters, stacking is a win.
>>
>> What bent my head up a bit was the combined effect of making wake_wide()
>> really keep pgbench from collapsing then adding the affine wakeup grant
>> for tbench.  It's not at all clear to me why 2,4 would be so demolished.
> 
> Mike, one more moment. wake_wide() and current logic confuses me a bit.
> It makes us to decide if we want affine wakeup or not, but select_idle_sibling()
> if a function is not for choosing this_cpu's llc domain only. We use it
> for searching in prev_cpu llc domain too, and it seems we are not interested
> in current flips in this case. Imagine a situation, when we share a mutex
> with a task on another NUMA node. When the task is realising the mutex
> it is waking us, but we definitelly won't use affine logic in this case.
> We wake the wakee anywhere and loose hot cache. I changed the logic, and
> tried pgbench 1:8. The results (I threw away 3 first iterations, because
> they much differ with iter >= 4. Looks like, the reason is in uncached disk IO).
> 
> 
> Before:
> 
>   trans. |  tps (i)     | tps (e)
> --------------------------------------
> 12098226 | 60491.067392 | 60500.886373
> 12030184 | 60150.874285 | 60160.654295
> 11882977 | 59414.829150 | 59424.830637
> 12020125 | 60100.579023 | 60111.600176
> 12161917 | 60809.547906 | 60827.321639
> 12154660 | 60773.249254 | 60783.085165
> 
> After:
> 
>  trans.  | tps (i)      | tps (e)
> --------------------------------------
> 12770407 | 63849.883578 | 63860.310019      
> 12635366 | 63176.399769 | 63187.152569      
> 12676890 | 63384.396440 | 63400.930755      
> 12639949 | 63199.526330 | 63210.460753      
> 12670626 | 63353.079951 | 63363.274143      
> 12647001 | 63209.613698 | 63219.812331      

All of the above is pgbench -j 1 -S -c 8 -T 200.
 
> I'm going to test other cases, but could you tell me (if you remember) are there reasons
> we skip prev_cpu, like I described above? Some types of workloads etc.
> 
> ---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4df37a4..dfbe06b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4930,8 +4930,13 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>  	int want_affine = 0;
>  	int sync = wake_flags & WF_SYNC;
>  
> -	if (sd_flag & SD_BALANCE_WAKE)
> -		want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
> +	if (sd_flag & SD_BALANCE_WAKE) {
> +		want_affine = 1;
> +		if (cpu == prev_cpu || !cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
> +			goto want_affine;
> +		if (wake_wide(p))
> +			goto want_affine;
> +	}
>  
>  	rcu_read_lock();
>  	for_each_domain(cpu, tmp) {
> @@ -4954,16 +4959,12 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>  			break;
>  	}
>  
> -	if (affine_sd) {
> +want_affine:
> +	if (want_affine) {
>  		sd = NULL; /* Prefer wake_affine over balance flags */
> -		if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
> +		if (affine_sd && wake_affine(affine_sd, p, sync))
>  			new_cpu = cpu;
> -	}
> -
> -	if (!sd) {
> -		if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */
> -			new_cpu = select_idle_sibling(p, new_cpu);
> -
> +		new_cpu = select_idle_sibling(p, new_cpu);
>  	} else while (sd) {
>  		struct sched_group *group;
>  		int weight;
> 

Regards,
Kirill


* Re: [PATCH] sched/fair: Skip wake_affine() for core siblings
  2015-09-28 15:36       ` Kirill Tkhai
  2015-09-28 15:49         ` Kirill Tkhai
@ 2015-09-28 18:22         ` Mike Galbraith
  2015-09-28 19:19           ` Kirill Tkhai
  2015-09-29 14:55         ` Mike Galbraith
  2 siblings, 1 reply; 14+ messages in thread
From: Mike Galbraith @ 2015-09-28 18:22 UTC (permalink / raw)
  To: Kirill Tkhai; +Cc: linux-kernel, Peter Zijlstra, Ingo Molnar

On Mon, 2015-09-28 at 18:36 +0300, Kirill Tkhai wrote:

> Mike, one more moment. wake_wide() and current logic confuses me a bit.
> It makes us to decide if we want affine wakeup or not, but select_idle_sibling()
> if a function is not for choosing this_cpu's llc domain only. We use it
> for searching in prev_cpu llc domain too, and it seems we are not interested
> in current flips in this case.

We're always interested in "flips", as the point is to try to identify
N:M load components, and when they may overload a socket.  The hope is
to get it more right than wrong, as making the tracking really accurate
is too expensive for the fast path.

>  Imagine a situation, when we share a mutex
> with a task on another NUMA node. When the task is realising the mutex
> it is waking us, but we definitelly won't use affine logic in this case.

Why not?  A wakeup is a wakeup is a wakeup, they all do the same thing.
If wake_wide() doesn't NAK an affine wakeup, we ask wake_affine() for
its opinion, then look for an idle CPU near the waker's CPU if it says
OK, or near wakee's previous CPU if it says go away. 
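
In mainline terms, that sequence is roughly the following (a condensed sketch of
the select_task_rq_fair() tail around this time, reconstructed for context and
matching the "-" lines of the patch quoted further down):

	if (affine_sd) {
		sd = NULL;	/* prefer wake_affine over balance flags */
		/* "OK": search for an idle cpu near the waker's cpu ... */
		if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
			new_cpu = cpu;
	}

	if (!sd && (sd_flag & SD_BALANCE_WAKE))
		/* ... otherwise new_cpu is still prev_cpu: search near it. */
		new_cpu = select_idle_sibling(p, new_cpu);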

> We wake the wakee anywhere and loose hot cache.

Yeah, sometimes we'll make tasks drag their data to them when we could
have dragged the task to the data in the name of trying to crank up CPU
utilization.  At some point, _somebody_ has to drag their data across
interconnect, but we really don't know if/when the data transport cost
will pay off in better utilization.

	-Mike

(I'll take a peek at below when damn futexes get done kicking my a$$)

>  I changed the logic, and
> tried pgbench 1:8. The results (I threw away 3 first iterations, because
> they much differ with iter >= 4. Looks like, the reason is in uncached disk IO).
> 
> 
> Before:
> 
>   trans. |  tps (i)     | tps (e)
> --------------------------------------
> 12098226 | 60491.067392 | 60500.886373
> 12030184 | 60150.874285 | 60160.654295
> 11882977 | 59414.829150 | 59424.830637
> 12020125 | 60100.579023 | 60111.600176
> 12161917 | 60809.547906 | 60827.321639
> 12154660 | 60773.249254 | 60783.085165
> 
> After:
> 
>  trans.  | tps (i)      | tps (e)
> --------------------------------------
> 12770407 | 63849.883578 | 63860.310019      
> 12635366 | 63176.399769 | 63187.152569      
> 12676890 | 63384.396440 | 63400.930755      
> 12639949 | 63199.526330 | 63210.460753      
> 12670626 | 63353.079951 | 63363.274143      
> 12647001 | 63209.613698 | 63219.812331      
> 
> I'm going to test other cases, but could you tell me (if you remember) are there reasons
> we skip prev_cpu, like I described above? Some types of workloads etc.
> 
> ---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4df37a4..dfbe06b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4930,8 +4930,13 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>  	int want_affine = 0;
>  	int sync = wake_flags & WF_SYNC;
>  
> -	if (sd_flag & SD_BALANCE_WAKE)
> -		want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
> +	if (sd_flag & SD_BALANCE_WAKE) {
> +		want_affine = 1;
> +		if (cpu == prev_cpu || !cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
> +			goto want_affine;
> +		if (wake_wide(p))
> +			goto want_affine;
> +	}
>  
>  	rcu_read_lock();
>  	for_each_domain(cpu, tmp) {
> @@ -4954,16 +4959,12 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>  			break;
>  	}
>  
> -	if (affine_sd) {
> +want_affine:
> +	if (want_affine) {
>  		sd = NULL; /* Prefer wake_affine over balance flags */
> -		if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
> +		if (affine_sd && wake_affine(affine_sd, p, sync))
>  			new_cpu = cpu;
> -	}
> -
> -	if (!sd) {
> -		if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */
> -			new_cpu = select_idle_sibling(p, new_cpu);
> -
> +		new_cpu = select_idle_sibling(p, new_cpu);
>  	} else while (sd) {
>  		struct sched_group *group;
>  		int weight;




* Re: [PATCH] sched/fair: Skip wake_affine() for core siblings
  2015-09-28 18:22         ` Mike Galbraith
@ 2015-09-28 19:19           ` Kirill Tkhai
  2015-09-29  2:03             ` Mike Galbraith
  0 siblings, 1 reply; 14+ messages in thread
From: Kirill Tkhai @ 2015-09-28 19:19 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: linux-kernel, Peter Zijlstra, Ingo Molnar



On 28.09.2015 21:22, Mike Galbraith wrote:
> On Mon, 2015-09-28 at 18:36 +0300, Kirill Tkhai wrote:
> 
>> Mike, one more moment. wake_wide() and current logic confuses me a bit.
>> It makes us to decide if we want affine wakeup or not, but select_idle_sibling()
>> if a function is not for choosing this_cpu's llc domain only. We use it
>> for searching in prev_cpu llc domain too, and it seems we are not interested
>> in current flips in this case.
> 
> We're always interested in "flips", as the point is to try to identify
> N:M load components, and when they may overload a socket.  The hope is
> to get it more right than wrong, as making the tracking really accurate
> is too expensive for the fast path.
> 
>>  Imagine a situation, when we share a mutex
>> with a task on another NUMA node. When the task is realising the mutex
>> it is waking us, but we definitelly won't use affine logic in this case.
> 
> Why not?  A wakeup is a wakeup is a wakeup, they all do the same thing.
> If wake_wide() doesn't NAK an affine wakeup, we ask wake_affine() for
> its opinion, then look for an idle CPU near the waker's CPU if it says
> OK, or near wakee's previous CPU if it says go away. 

But the NUMA sd does not have the SD_WAKE_AFFINE flag, so in this case a new cpu won't
be chosen from the previous node. Instead, the highest domain of smp_processor_id()
which has the SD_BALANCE_WAKE flag will be chosen, and the cpu will
be picked from the idlest group/cpu. And we don't deal with the old
cache at all. This looks like completely wrong behaviour...

>> We wake the wakee anywhere and loose hot cache.
> 
> Yeah, sometimes we'll make tasks drag their data to them when we could
> have dragged the task to the data in the name of trying to crank up CPU
> utilization.  At some point, _somebody_ has to drag their data across
> interconnect, but we really don't know if/when the data transport cost
> will pay off in better utilization.
> 
> 	-Mike
> 
> (I'll take a peek at below when damn futexes get done kicking my a$$)

In that case, you'll be able to analyze the new results below :)

ORIGIN  1               2               3               4               avg
C=1     8607.960451     8344.16111      8381.197569     7991.84102      8331.2900375
C=2     15397.152685    15438.913761    15771.512182    15648.368819    15563.98686175
C=4     30987.860844    31144.127431    31104.153461    30874.292825    31027.60864025
C=8     62447.47612     62179.788923    61534.482204    62787.894021    62237.410317


PATCHED 1               2               3               4               avg
C=1     8782.439938     8675.891877     8609.209537     8735.120895     8700.66556175
C=2     16526.31409     16491.650678    16149.594736    16365.630084    16383.297397
C=4     32286.341708    32313.536565    32538.285157    32299.427398    32359.397707
C=8     63860.310019    63187.152569    63400.930755    63210.460753    63414.713524

This is:
# pgbench -j 1 -S -c X -T 200 test

The test machine has no NUMA, and I suppose a NUMA machine will show much better results.
I'll test it tomorrow.

>>  I changed the logic, and
>> tried pgbench 1:8. The results (I threw away 3 first iterations, because
>> they much differ with iter >= 4. Looks like, the reason is in uncached disk IO).
>>
>>
>> Before:
>>
>>   trans. |  tps (i)     | tps (e)
>> --------------------------------------
>> 12098226 | 60491.067392 | 60500.886373
>> 12030184 | 60150.874285 | 60160.654295
>> 11882977 | 59414.829150 | 59424.830637
>> 12020125 | 60100.579023 | 60111.600176
>> 12161917 | 60809.547906 | 60827.321639
>> 12154660 | 60773.249254 | 60783.085165
>>
>> After:
>>
>>  trans.  | tps (i)      | tps (e)
>> --------------------------------------
>> 12770407 | 63849.883578 | 63860.310019      
>> 12635366 | 63176.399769 | 63187.152569      
>> 12676890 | 63384.396440 | 63400.930755      
>> 12639949 | 63199.526330 | 63210.460753      
>> 12670626 | 63353.079951 | 63363.274143      
>> 12647001 | 63209.613698 | 63219.812331      
>>
>> I'm going to test other cases, but could you tell me (if you remember) are there reasons
>> we skip prev_cpu, like I described above? Some types of workloads etc.
>>
>> ---
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 4df37a4..dfbe06b 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4930,8 +4930,13 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>>  	int want_affine = 0;
>>  	int sync = wake_flags & WF_SYNC;
>>  
>> -	if (sd_flag & SD_BALANCE_WAKE)
>> -		want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
>> +	if (sd_flag & SD_BALANCE_WAKE) {
>> +		want_affine = 1;
>> +		if (cpu == prev_cpu || !cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
>> +			goto want_affine;
>> +		if (wake_wide(p))
>> +			goto want_affine;
>> +	}
>>  
>>  	rcu_read_lock();
>>  	for_each_domain(cpu, tmp) {
>> @@ -4954,16 +4959,12 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>>  			break;
>>  	}
>>  
>> -	if (affine_sd) {
>> +want_affine:
>> +	if (want_affine) {
>>  		sd = NULL; /* Prefer wake_affine over balance flags */
>> -		if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
>> +		if (affine_sd && wake_affine(affine_sd, p, sync))
>>  			new_cpu = cpu;
>> -	}
>> -
>> -	if (!sd) {
>> -		if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */
>> -			new_cpu = select_idle_sibling(p, new_cpu);
>> -
>> +		new_cpu = select_idle_sibling(p, new_cpu);
>>  	} else while (sd) {
>>  		struct sched_group *group;
>>  		int weight;
> 
> 

Regards,
Kirill


* Re: [PATCH] sched/fair: Skip wake_affine() for core siblings
  2015-09-28 19:19           ` Kirill Tkhai
@ 2015-09-29  2:03             ` Mike Galbraith
  0 siblings, 0 replies; 14+ messages in thread
From: Mike Galbraith @ 2015-09-29  2:03 UTC (permalink / raw)
  To: Kirill Tkhai; +Cc: linux-kernel, Peter Zijlstra, Ingo Molnar

On Mon, 2015-09-28 at 22:19 +0300, Kirill Tkhai wrote:
> >>  Imagine a situation, when we share a mutex
> >> with a task on another NUMA node. When the task is realising the mutex
> >> it is waking us, but we definitelly won't use affine logic in this case.
> > 
> > Why not?  A wakeup is a wakeup is a wakeup, they all do the same thing.
> > If wake_wide() doesn't NAK an affine wakeup, we ask wake_affine() for
> > its opinion, then look for an idle CPU near the waker's CPU if it says
> > OK, or near wakee's previous CPU if it says go away. 
> 
> But NUMA sd does not have SD_WAKE_AFFINE flag, so this case a new cpu won't
> be choosen from previous node. There will be choosen the highest domain
> of smp_processor_id(), which has SD_BALANCE_WAKE flag, and the cpu will
> be choosen from the idlest group/cpu. And we don't have a deal with old
> cache at all. This looks like a completely wrong behaviour...

SD_WAKE_AFFINE is enabled globally by default, and SD_BALANCE_WAKE is
disabled globally due to cost and whatnot.

wingenfelder:~/:[0]# tune-sched-domains
{cpu0/domain0:SMT} SD flag: 4783
+   1: SD_LOAD_BALANCE:          Do load balancing on this domain
+   2: SD_BALANCE_NEWIDLE:       Balance when about to become idle
+   4: SD_BALANCE_EXEC:          Balance on exec
+   8: SD_BALANCE_FORK:          Balance on fork, clone
-  16: SD_BALANCE_WAKE:          Wake to idle CPU on task wakeup
+  32: SD_WAKE_AFFINE:           Wake task to waking CPU
-  64:                           [unused]
+ 128: SD_SHARE_CPUCAPACITY:     Domain members share cpu power
- 256: SD_SHARE_POWERDOMAIN:     Domain members share power domain
+ 512: SD_SHARE_PKG_RESOURCES:   Domain members share cpu pkg resources
-1024: SD_SERIALIZE:             Only a single load balancing instance
-2048: SD_ASYM_PACKING:          Place busy groups earlier in the domain
+4096: SD_PREFER_SIBLING:        Prefer to place tasks in a sibling domain
-8192: SD_OVERLAP:               sched_domains of this level overlap
-16384: SD_NUMA:                 cross-node balancing
{cpu0/domain1:MC} SD flag: 4655
+   1: SD_LOAD_BALANCE:          Do load balancing on this domain
+   2: SD_BALANCE_NEWIDLE:       Balance when about to become idle
+   4: SD_BALANCE_EXEC:          Balance on exec
+   8: SD_BALANCE_FORK:          Balance on fork, clone
-  16: SD_BALANCE_WAKE:          Wake to idle CPU on task wakeup
+  32: SD_WAKE_AFFINE:           Wake task to waking CPU
-  64:                           [unused]
- 128: SD_SHARE_CPUCAPACITY:     Domain members share cpu power
- 256: SD_SHARE_POWERDOMAIN:     Domain members share power domain
+ 512: SD_SHARE_PKG_RESOURCES:   Domain members share cpu pkg resources
-1024: SD_SERIALIZE:             Only a single load balancing instance
-2048: SD_ASYM_PACKING:          Place busy groups earlier in the domain
+4096: SD_PREFER_SIBLING:        Prefer to place tasks in a sibling domain
-8192: SD_OVERLAP:               sched_domains of this level overlap
-16384: SD_NUMA:                 cross-node balancing
{cpu0/domain2:NUMA} SD flag: 25647
+   1: SD_LOAD_BALANCE:          Do load balancing on this domain
+   2: SD_BALANCE_NEWIDLE:       Balance when about to become idle
+   4: SD_BALANCE_EXEC:          Balance on exec
+   8: SD_BALANCE_FORK:          Balance on fork, clone
-  16: SD_BALANCE_WAKE:          Wake to idle CPU on task wakeup
+  32: SD_WAKE_AFFINE:           Wake task to waking CPU
-  64:                           [unused]
- 128: SD_SHARE_CPUCAPACITY:     Domain members share cpu power
- 256: SD_SHARE_POWERDOMAIN:     Domain members share power domain
- 512: SD_SHARE_PKG_RESOURCES:   Domain members share cpu pkg resources
+1024: SD_SERIALIZE:             Only a single load balancing instance
-2048: SD_ASYM_PACKING:          Place busy groups earlier in the domain
-4096: SD_PREFER_SIBLING:        Prefer to place tasks in a sibling domain
+8192: SD_OVERLAP:               sched_domains of this level overlap
+16384: SD_NUMA:                 cross-node balancing
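
For context, a minimal sketch of the domain walk in select_task_rq_fair() around
this time (reconstructed, details approximate): affine_sd is simply the first
domain, walking up from the waking cpu, that has SD_WAKE_AFFINE set and spans
prev_cpu.  With the flags shown above that includes the NUMA level, so even a
cross-node wakeup can be handled affinely although SD_BALANCE_WAKE is off
everywhere.

	for_each_domain(cpu, tmp) {
		if (!(tmp->flags & SD_LOAD_BALANCE))
			break;

		/*
		 * If both cpu and prev_cpu are part of this domain,
		 * cpu is a valid SD_WAKE_AFFINE target.
		 */
		if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
		    cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
			affine_sd = tmp;
			break;
		}

		if (tmp->flags & sd_flag)
			sd = tmp;
		else if (!want_affine)
			break;
	}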

	-Mike




* Re: [PATCH] sched/fair: Skip wake_affine() for core siblings
  2015-09-28 15:36       ` Kirill Tkhai
  2015-09-28 15:49         ` Kirill Tkhai
  2015-09-28 18:22         ` Mike Galbraith
@ 2015-09-29 14:55         ` Mike Galbraith
  2015-09-29 16:00           ` Kirill Tkhai
  2 siblings, 1 reply; 14+ messages in thread
From: Mike Galbraith @ 2015-09-29 14:55 UTC (permalink / raw)
  To: Kirill Tkhai; +Cc: linux-kernel, Peter Zijlstra, Ingo Molnar

On Mon, 2015-09-28 at 18:36 +0300, Kirill Tkhai wrote:

> ---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4df37a4..dfbe06b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4930,8 +4930,13 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>  	int want_affine = 0;
>  	int sync = wake_flags & WF_SYNC;
>  
> -	if (sd_flag & SD_BALANCE_WAKE)
> -		want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
> +	if (sd_flag & SD_BALANCE_WAKE) {
> +		want_affine = 1;
> +		if (cpu == prev_cpu || !cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
> +			goto want_affine;
> +		if (wake_wide(p))
> +			goto want_affine;
> +	}

That blew wake_wide() right out of the water.

It's not only about things like pgbench.  Drive multiple tasks in a Xen
guest (single event channel dom0 -> domu, and no select_idle_sibling()
to save the day) via network, and watch workers fail to be all they can
be because they keep being stacked up on the irq source.  Load balancing
yanks them apart, next irq stacks them right back up.  I met that in
enterprise land, thought wake_wide() should cure it, and indeed it did.

	-Mike



* Re: [PATCH] sched/fair: Skip wake_affine() for core siblings
  2015-09-29 14:55         ` Mike Galbraith
@ 2015-09-29 16:00           ` Kirill Tkhai
  2015-09-29 16:03             ` Kirill Tkhai
  2015-09-29 17:29             ` Mike Galbraith
  0 siblings, 2 replies; 14+ messages in thread
From: Kirill Tkhai @ 2015-09-29 16:00 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: linux-kernel, Peter Zijlstra, Ingo Molnar



On 29.09.2015 17:55, Mike Galbraith wrote:
> On Mon, 2015-09-28 at 18:36 +0300, Kirill Tkhai wrote:
> 
>> ---
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 4df37a4..dfbe06b 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4930,8 +4930,13 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>>  	int want_affine = 0;
>>  	int sync = wake_flags & WF_SYNC;
>>  
>> -	if (sd_flag & SD_BALANCE_WAKE)
>> -		want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
>> +	if (sd_flag & SD_BALANCE_WAKE) {
>> +		want_affine = 1;
>> +		if (cpu == prev_cpu || !cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
>> +			goto want_affine;
>> +		if (wake_wide(p))
>> +			goto want_affine;
>> +	}
> 
> That blew wake_wide() right out of the water.
> 
> It's not only about things like pgbench.  Drive multiple tasks in a Xen
> guest (single event channel dom0 -> domu, and no select_idle_sibling()
> to save the day) via network, and watch workers fail to be all they can
> be because they keep being stacked up on the irq source.  Load balancing
> yanks them apart, next irq stacks them right back up.  I met that in
> enterprise land, thought wake_wide() should cure it, and indeed it did.

1) Hm.. The patch makes select_task_rq_fair() prefer the old cpu instead of the
current one, doesn't it? We more often don't set affine_sd. So the skipped
part of the patch (skipped in the quote) selects prev_cpu.

2) I thought about wakeups from irq handlers and was even going to ask why
we use the affine logic for such wakeups. Device handlers usually aren't
bound, and timers may migrate since the NO_HZ logic is present. The only explanation
I found is that unbound timers are a very unlikely case (I added a statistics printk
to my local sched_debug to check that). But if we have situations like the one
you described above, don't we have to disable the affine logic for in_interrupt()
cases? (See the illustrative sketch below, after point 3.)

3) I ask just because (being outside of the scheduler's history) it seems a little
bit strange that we prefer smp_processor_id()'s sd_llc so much. The sync wakeup's
profit is more or less clear: smp_processor_id()'s sd_llc may contain some
data which is interesting for the wakee, and this minimizes cache misses.
But we do the same in other cases too, and at every migration we lose
itlb, dtlb... Of course, this requires more accurate patches than the ones posted
(not such crude patches).
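
A purely illustrative sketch of the check point (2) is asking about -- an
assumption about how such a gate could look, not something that was posted or
tested: refuse the affine fast path when the wakeup happens from interrupt
context, so an irq-handling cpu doesn't keep pulling its wakees onto itself.

	if (sd_flag & SD_BALANCE_WAKE)
		/* Hypothetical: never do affine wakeups from irq context. */
		want_affine = !in_interrupt() && !wake_wide(p) &&
			      cpumask_test_cpu(cpu, tsk_cpus_allowed(p));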

Thanks,
Kirill



* Re: [PATCH] sched/fair: Skip wake_affine() for core siblings
  2015-09-29 16:00           ` Kirill Tkhai
@ 2015-09-29 16:03             ` Kirill Tkhai
  2015-09-29 17:29             ` Mike Galbraith
  1 sibling, 0 replies; 14+ messages in thread
From: Kirill Tkhai @ 2015-09-29 16:03 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: linux-kernel, Peter Zijlstra, Ingo Molnar



On 29.09.2015 19:00, Kirill Tkhai wrote:
> 
> 
> On 29.09.2015 17:55, Mike Galbraith wrote:
>> On Mon, 2015-09-28 at 18:36 +0300, Kirill Tkhai wrote:
>>
>>> ---
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 4df37a4..dfbe06b 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -4930,8 +4930,13 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>>>  	int want_affine = 0;
>>>  	int sync = wake_flags & WF_SYNC;
>>>  
>>> -	if (sd_flag & SD_BALANCE_WAKE)
>>> -		want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
>>> +	if (sd_flag & SD_BALANCE_WAKE) {
>>> +		want_affine = 1;
>>> +		if (cpu == prev_cpu || !cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
>>> +			goto want_affine;
>>> +		if (wake_wide(p))
>>> +			goto want_affine;
>>> +	}
>>
>> That blew wake_wide() right out of the water.
>>
>> It's not only about things like pgbench.  Drive multiple tasks in a Xen
>> guest (single event channel dom0 -> domu, and no select_idle_sibling()
>> to save the day) via network, and watch workers fail to be all they can
>> be because they keep being stacked up on the irq source.  Load balancing
>> yanks them apart, next irq stacks them right back up.  I met that in
>> enterprise land, thought wake_wide() should cure it, and indeed it did.
> 
> 1)Hm.. The patch makes select_task_rq_fair() to prefer old cpu instead of
> current, doesn't it? We more often don't set affine_sd. So, the skipped
> part of patch (skipped in quote) selects prev_cpu.
> 
> 2)I thought about waking by irq handler and even was going to ask why
> we use affine logic for such wakeups. Device handlers usually aren't
> bound, timers may migrate since NO_HZ logic presents. The only explanation
> I found is unbound timers is very unlikely case (I added statistics printk
> to my local sched_debug to check that). But if we have the situations like
> you described above, don't we have to disable affine logic for in_interrupt()
> cases?
> 
> 3)I ask about just because (being outside of scheduler history) it's a little
> bit strange, we prefer smp_processor_id()'s sd_llc so much. Sync wakeup's
> profit is less or more clear: smp_processor_id()'s sd_llc may contain some
> data, which is interesting for a wakee, and this minimizes cache misses.
> But we do the same in other cases too, and at every migration we loose
> itlb, dtlb... Of course, it requires more accurate patches, then posted

***typo: instruction and data caches

> (not so rude patches).
> 
> Thanks,
> Kirill
> 


* Re: [PATCH] sched/fair: Skip wake_affine() for core siblings
  2015-09-29 16:00           ` Kirill Tkhai
  2015-09-29 16:03             ` Kirill Tkhai
@ 2015-09-29 17:29             ` Mike Galbraith
  2015-09-30 19:16               ` Kirill Tkhai
  1 sibling, 1 reply; 14+ messages in thread
From: Mike Galbraith @ 2015-09-29 17:29 UTC (permalink / raw)
  To: Kirill Tkhai; +Cc: linux-kernel, Peter Zijlstra, Ingo Molnar

On Tue, 2015-09-29 at 19:00 +0300, Kirill Tkhai wrote:
> 
> On 29.09.2015 17:55, Mike Galbraith wrote:
> > On Mon, 2015-09-28 at 18:36 +0300, Kirill Tkhai wrote:
> > 
> >> ---
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index 4df37a4..dfbe06b 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -4930,8 +4930,13 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
> >>  	int want_affine = 0;
> >>  	int sync = wake_flags & WF_SYNC;
> >>  
> >> -	if (sd_flag & SD_BALANCE_WAKE)
> >> -		want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
> >> +	if (sd_flag & SD_BALANCE_WAKE) {
> >> +		want_affine = 1;
> >> +		if (cpu == prev_cpu || !cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
> >> +			goto want_affine;
> >> +		if (wake_wide(p))
> >> +			goto want_affine;
> >> +	}
> > 
> > That blew wake_wide() right out of the water.
> > 
> > It's not only about things like pgbench.  Drive multiple tasks in a Xen
> > guest (single event channel dom0 -> domu, and no select_idle_sibling()
> > to save the day) via network, and watch workers fail to be all they can
> > be because they keep being stacked up on the irq source.  Load balancing
> > yanks them apart, next irq stacks them right back up.  I met that in
> > enterprise land, thought wake_wide() should cure it, and indeed it did.
> 
> 1) Hm.. The patch makes select_task_rq_fair() prefer the old cpu instead of
> the current one, doesn't it? More often than not we don't set affine_sd, so
> the part of the patch skipped in the quote selects prev_cpu.

Not the way I read it..

>> -    if (affine_sd) {
>> +want_affine:
>> +    if (want_affine) {
>>              sd = NULL; /* Prefer wake_affine over balance flags */
>> -            if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
>> +            if (affine_sd && wake_affine(affine_sd, p, sync))
>>                      new_cpu = cpu;
>> -    }
>> -
>> -    if (!sd) {
>> -            if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */
>> -                    new_cpu = select_idle_sibling(p, new_cpu);
>> -
>> +            new_cpu = select_idle_sibling(p, new_cpu);

..it sets new_cpu = cpu if wake_affine() says OK; wake_wide() has no say
in the matter.
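
In other words, with that hunk applied the tail of the wakeup path reads
roughly like this (my paraphrase of the quoted diff for readability, not
verbatim kernel code):

want_affine:
	if (want_affine) {
		sd = NULL;	/* prefer wake_affine over balance flags */
		if (affine_sd && wake_affine(affine_sd, p, sync))
			new_cpu = cpu;	/* affine approval; wake_wide() is not consulted here */
		/* select_idle_sibling() is the only thing left standing in the way */
		new_cpu = select_idle_sibling(p, new_cpu);
	}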
 
> 2) I thought about wakeups from irq handlers and was even going to ask why
> we use the affine logic for such wakeups. Device handlers usually aren't
> bound, and timers may migrate since the NO_HZ logic exists. The only
> explanation I found is that an unbound timer is a very unlikely case (I added
> a statistics printk to my local sched_debug to check that). But if we have
> situations like the one you described above, don't we have to disable the
> affine logic for in_interrupt() cases?

BTDT.  In my experience, the more you try to differentiate sources, the
more corner cases you create.  I've tried doing special things for irq,
locks, wake_all, wake_one, and it always turned into a can of worms.
IMHO, the best policy for the fast path is KISS.
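
Just so we're talking about the same thing: the kind of source test 2)
suggests would be something along these lines (sketch only, bolted onto
the current want_affine computation; the in_interrupt() check is the
hypothetical part):

	if (sd_flag & SD_BALANCE_WAKE) {
		/* sketch: don't pull the wakee toward an unbound irq source */
		want_affine = !wake_wide(p) && !in_interrupt() &&
			      cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
	}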

> 3) I ask just because (being outside of scheduler history) it's a little
> strange that we prefer smp_processor_id()'s sd_llc so much. The profit of a
> sync wakeup is more or less clear: smp_processor_id()'s sd_llc may contain
> data the wakee is interested in, and this minimizes cache misses.
> But we do the same in the other cases too, and at every migration we lose
> itlb, dtlb... Of course, this requires more careful patches than the ones
> posted (not such crude patches).

IMHO, the sync wakeup hint is more often a big fat lie than anything
else; it really just gives us a bit more headroom for affine wakeups in
cases where that's likely to be a very good thing (affine in the cache
sense, not affine as in an individual CPU).  What it means is that the
waker is likely to schedule RSN, but if you measure even very fast/light
things, there is an overlap win to be had by NOT waking CPU affine but
rather cache affine; that's why we cross-core schedule so often.
A real network app doing a wakeup is not necessarily gonna schedule
RSN, and there is very often a latency win to be had by scheduling to a
nearby core, ie a thread pool worker doing a "sync" wakeup may well
instantly find that it has more work to do.  If a fast/light wakee can
slip into an idle crack and get to a CPU instantly, it can generate more
work a little bit sooner.
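
If it helps to see what that "headroom" means concretely, here's a toy
model -- made-up numbers and a simplified comparison, NOT the mainline
wake_affine() math -- showing that sync merely tilts the balance check
by discounting the waker:

/* toy model, not kernel code: sync only discounts the waker's weight */
#include <stdio.h>
#include <stdbool.h>

static bool affine_ok(unsigned long this_load, unsigned long prev_load,
		      unsigned long waker_weight, bool sync)
{
	if (sync && this_load >= waker_weight)
		this_load -= waker_weight;	/* waker claims it will sleep RSN */
	return this_load <= prev_load;		/* simplified "balanced" test */
}

int main(void)
{
	/* made-up loads: waking CPU busier than the wakee's previous CPU */
	printf("no sync: %d\n", affine_ok(2048, 1536, 1024, false));	/* prints 0 */
	printf("   sync: %d\n", affine_ok(2048, 1536, 1024, true));	/* prints 1 */
	return 0;
}

Whether that extra approval is a win then depends entirely on whether the
waker really does go to sleep RSN.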

	-Mike


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] sched/fair: Skip wake_affine() for core siblings
  2015-09-29 17:29             ` Mike Galbraith
@ 2015-09-30 19:16               ` Kirill Tkhai
  0 siblings, 0 replies; 14+ messages in thread
From: Kirill Tkhai @ 2015-09-30 19:16 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: linux-kernel, Peter Zijlstra, Ingo Molnar



On 29.09.2015 20:29, Mike Galbraith wrote:
> On Tue, 2015-09-29 at 19:00 +0300, Kirill Tkhai wrote:
>>
>> On 29.09.2015 17:55, Mike Galbraith wrote:
>>> On Mon, 2015-09-28 at 18:36 +0300, Kirill Tkhai wrote:
>>>
>>>> ---
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index 4df37a4..dfbe06b 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -4930,8 +4930,13 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>>>>  	int want_affine = 0;
>>>>  	int sync = wake_flags & WF_SYNC;
>>>>  
>>>> -	if (sd_flag & SD_BALANCE_WAKE)
>>>> -		want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
>>>> +	if (sd_flag & SD_BALANCE_WAKE) {
>>>> +		want_affine = 1;
>>>> +		if (cpu == prev_cpu || !cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
>>>> +			goto want_affine;
>>>> +		if (wake_wide(p))
>>>> +			goto want_affine;
>>>> +	}
>>>
>>> That blew wake_wide() right out of the water.
>>>
>>> It's not only about things like pgbench.  Drive multiple tasks in a Xen
>>> guest (single event channel dom0 -> domu, and no select_idle_sibling()
>>> to save the day) via network, and watch workers fail to be all they can
>>> be because they keep being stacked up on the irq source.  Load balancing
>>> yanks them apart, next irq stacks them right back up.  I met that in
>>> enterprise land, thought wake_wide() should cure it, and indeed it did.
>>
>> 1) Hm.. The patch makes select_task_rq_fair() prefer the old cpu instead of
>> the current one, doesn't it? More often than not we don't set affine_sd, so
>> the part of the patch skipped in the quote selects prev_cpu.
> 
> Not the way I read it..
> 
>>> -    if (affine_sd) {
>>> +want_affine:
>>> +    if (want_affine) {
>>>              sd = NULL; /* Prefer wake_affine over balance flags */
>>> -            if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
>>> +            if (affine_sd && wake_affine(affine_sd, p, sync))
>>>                      new_cpu = cpu;
>>> -    }
>>> -
>>> -    if (!sd) {
>>> -            if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */
>>> -                    new_cpu = select_idle_sibling(p, new_cpu);
>>> -
>>> +            new_cpu = select_idle_sibling(p, new_cpu);
> 
> ..it sets new_cpu = cpu if wake_affine() says OK; wake_wide() has no say
> in the matter.
>  
>> 2) I thought about wakeups from irq handlers and was even going to ask why
>> we use the affine logic for such wakeups. Device handlers usually aren't
>> bound, and timers may migrate since the NO_HZ logic exists. The only
>> explanation I found is that an unbound timer is a very unlikely case (I added
>> a statistics printk to my local sched_debug to check that). But if we have
>> situations like the one you described above, don't we have to disable the
>> affine logic for in_interrupt() cases?
> 
> BTDT.  In my experience, the more you try to differentiate sources, the
> more corner cases you create.  I've tried doing special things for irq,
> locks, wake_all, wake_one, and it always turned into a can of worms.
> IMHO, the best policy for the fast path is KISS.
> 
>> 3) I ask just because (being outside of scheduler history) it's a little
>> strange that we prefer smp_processor_id()'s sd_llc so much. The profit of a
>> sync wakeup is more or less clear: smp_processor_id()'s sd_llc may contain
>> data the wakee is interested in, and this minimizes cache misses.
>> But we do the same in the other cases too, and at every migration we lose
>> itlb, dtlb... Of course, this requires more careful patches than the ones
>> posted (not such crude patches).
> 
> IMHO, the sync wakeup hint is more often a big fat lie than anything
> else; it really just gives us a bit more headroom for affine wakeups in
> cases where that's likely to be a very good thing (affine in the cache
> sense, not affine as in an individual CPU).  What it means is that the
> waker is likely to schedule RSN, but if you measure even very fast/light
> things, there is an overlap win to be had by NOT waking CPU affine but
> rather cache affine; that's why we cross-core schedule so often.
> A real network app doing a wakeup is not necessarily gonna schedule
> RSN, and there is very often a latency win to be had by scheduling to a
> nearby core, ie a thread pool worker doing a "sync" wakeup may well
> instantly find that it has more work to do.  If a fast/light wakee can
> slip into an idle crack and get to a CPU instantly, it can generate more
> work a little bit sooner.

Yeah, in most of the places where a sync wakeup is used, the waking task is not
actually going to reschedule soon..

Thanks for the explanation, Mike!

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2015-09-30 19:16 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-09-25 17:54 [PATCH] sched/fair: Skip wake_affine() for core siblings Kirill Tkhai
2015-09-26 15:25 ` Mike Galbraith
2015-09-28 10:28   ` Kirill Tkhai
2015-09-28 13:12     ` Mike Galbraith
2015-09-28 15:36       ` Kirill Tkhai
2015-09-28 15:49         ` Kirill Tkhai
2015-09-28 18:22         ` Mike Galbraith
2015-09-28 19:19           ` Kirill Tkhai
2015-09-29  2:03             ` Mike Galbraith
2015-09-29 14:55         ` Mike Galbraith
2015-09-29 16:00           ` Kirill Tkhai
2015-09-29 16:03             ` Kirill Tkhai
2015-09-29 17:29             ` Mike Galbraith
2015-09-30 19:16               ` Kirill Tkhai

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).