From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1752061AbbIZPZQ (ORCPT <rfc822;w@1wt.eu>);
	Sat, 26 Sep 2015 11:25:16 -0400
Received: from mail-wi0-f170.google.com ([209.85.212.170]:35384 "EHLO
	mail-wi0-f170.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1751177AbbIZPZP (ORCPT
	<rfc822;linux-kernel@vger.kernel.org>);
	Sat, 26 Sep 2015 11:25:15 -0400
Message-ID: <1443281111.3521.30.camel@gmail.com>
Subject: Re: [PATCH] sched/fair: Skip wake_affine() for core siblings
From: Mike Galbraith <umgwanakikbuti@gmail.com>
To: Kirill Tkhai <ktkhai@odin.com>
Cc: linux-kernel@vger.kernel.org, Peter Zijlstra <peterz@infradead.org>,
        Ingo Molnar <mingo@redhat.com>
Date: Sat, 26 Sep 2015 17:25:11 +0200
In-Reply-To: <56058A3F.5060408@odin.com>
References: <56058A3F.5060408@odin.com>
Content-Type: text/plain; charset="UTF-8"
X-Mailer: Evolution 3.12.11 
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, 2015-09-25 at 20:54 +0300, Kirill Tkhai wrote:
> We are not interested in actual target if both prev
> and curr cpus share CPU cache. select_idle_sibling()
> searches in top-down order; top level is the same
> for both of them, and the result will be the same.
> So, we can save a little CPU cycles and cache misses
> and skip wake_affine() calculations.

But, whereas previously wake_affine() could NAK a migration if it would
create an imbalance, we'll now just go ahead and stack tasks if
select_idle_sibling() can't find an idle home to override the blanket
approval.  It doesn't look like a good idea to me to bounce tasks around
only to then perhaps stack them: if we do stack waker and wakee, we
certainly lose concurrency.  (Microbenchmarks like pipe-test love that,
but not all that many real applications play ping-pong for a living ;)
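
To make the stacking concern concrete, here is a toy userspace model,
not kernel code.  Every name in it (toy_select_cpu(), cpu_is_idle[],
NR_LLC_CPUS) is hypothetical, and the idle search is a deliberately
simplified stand-in for what select_idle_sibling() does.  It only
illustrates that once approval is unconditional and nothing is idle,
the wakee lands on the waker's CPU.

```c
#include <stdbool.h>

#define NR_LLC_CPUS 8	/* assumed: one LLC spanning 8 logical CPUs */

/* Hypothetical stand-in for per-CPU idle state; false means busy. */
static bool cpu_is_idle[NR_LLC_CPUS];

/*
 * Toy stand-in for select_idle_sibling(): prefer the target if it is
 * idle, then the wakee's previous CPU, then any idle CPU in the LLC.
 * With nothing idle, the target comes back unchanged.
 */
static int toy_select_idle_sibling(int prev_cpu, int target)
{
	int cpu;

	if (cpu_is_idle[target])
		return target;
	if (cpu_is_idle[prev_cpu])
		return prev_cpu;
	for (cpu = 0; cpu < NR_LLC_CPUS; cpu++)
		if (cpu_is_idle[cpu])
			return cpu;
	return target;
}

/*
 * affine_ok models the wake_affine() verdict.  Skipping wake_affine()
 * for cache siblings amounts to passing affine_ok == true every time.
 */
static int toy_select_cpu(int waker_cpu, int prev_cpu, bool affine_ok)
{
	int target = affine_ok ? waker_cpu : prev_cpu;

	return toy_select_idle_sibling(prev_cpu, target);
}
```

With every CPU busy, toy_select_cpu(0, 3, true) picks the waker's CPU 0
(waker and wakee stacked), while toy_select_cpu(0, 3, false) leaves the
wakee on CPU 3.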

I spent most of the day piddling with your little patch, so I'll post
some condensed mixed load notes.

concurrent tbench 4 + pgbench, 30 seconds per client count (i4790+smt)
                                             master                           master+
pgbench                   1       2       3     avg         1       2       3     avg   comp
clients 1       tps = 18768   18591   18264   18541     18351   17257   17245   17617   .950
clients 2       tps = 30779   30661   31016   30818     29112   28026   29026   28721   .931
clients 4       tps = 54195   55100   54048   54447     53290   52336   52930   52852   .970
clients 8       tps = 60332   67052   64699   64027     38491   35746   37746   37327   .582!!
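
For anyone re-deriving the columns: the figures in this table are
consistent with avg being the truncated integer mean of the three runs
and comp being avg(master+) / avg(master) truncated to three decimals.
That is inferred from the numbers, not stated anywhere, and the helper
names below are made up for illustration.

```c
/* Truncated integer mean of three runs, as the "avg" column appears to be. */
static long avg3(long r1, long r2, long r3)
{
	return (r1 + r2 + r3) / 3;
}

/* patched/baseline ratio in thousandths, truncated: 582 reads as .582 */
static long comp_milli(long avg_patched, long avg_base)
{
	return avg_patched * 1000 / avg_base;
}
```

For the 8 client row, avg3(60332, 67052, 64699) gives 64027,
avg3(38491, 35746, 37746) gives 37327, and comp_milli(37327, 64027)
gives 582.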

Now do the opposite: make wake_affine() always NAK (master++).
                                             master                           master++
pgbench                   1       2       3     avg         1       2       3     avg   comp
clients 1       tps = 18768   18591   18264   18541     16874   16865   16665   16801   .906
clients 2       tps = 30779   30661   31016   30818     33562   33546   33681   33596  1.090
clients 4       tps = 54195   55100   54048   54447     61544   61482   61117   61381  1.127
clients 8       tps = 60332   67052   64699   64027     75171   75524   75318   75337  1.176

...

Virgin master vs your patch again, this time at 2 _minutes_ per client
count, as I noticed a lot of variance at 8 clients, where wake_wide() is
supposed to kick in to keep N:M load spread out.

                                             master                           master+
pgbench                   1       2       3     avg         1       2       3     avg   comp
clients 1       tps = 18548   18673   18390   18537     17879   17652   17621   17717   .955
clients 2       tps = 31083   31110   30859   31017     30274   30003   29796   30024   .967
clients 4       tps = 53107   53156   53601   53288     52658   53024   53449   53043   .995
clients 8       tps = 34213   34310   28844   32455     31360   31416   30732   31169   .960

30 seconds per run isn't enough, and wake_wide() is not doing a wonderful job for 1:N pgbench.

hrmph, twiddle...

With waker/wakee coupling strengthened:
postgres@homer:~> pgbench.sh
clients 1       tps = 18035
clients 2       tps = 32525
clients 4       tps = 53246
clients 8       tps = 37278

Better, but not enough.  Add sd_llc_size = #cores instead of #threads:
postgres@homer:~> pgbench.sh
clients 1       tps = 18482
clients 2       tps = 32366
clients 4       tps = 54557
clients 8       tps = 69643
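
Why the cores vs threads change can matter: below is a userspace model
of the wake_wide() flip-ratio test as it is usually described for
kernels of this era, with factor standing in for sd_llc_size.  Treat
the exact form as an assumption rather than a copy of the kernel
source.  The flip counts in the usage note are invented, and 4 cores /
8 threads is assumed for the i4790.

```c
/*
 * Model of the wake_wide() heuristic (assumed form).
 * Returns 1 to spread the wakeup wide, 0 to allow the affine path.
 */
static int toy_wake_wide(unsigned int waker_flips, unsigned int wakee_flips,
			 unsigned int factor)
{
	unsigned int master = waker_flips;
	unsigned int slave = wakee_flips;

	/* Order the pair so master holds the larger flip count. */
	if (master < slave) {
		unsigned int tmp = master;

		master = slave;
		slave = tmp;
	}
	/* Too few flips, or too small a ratio: not a wide pattern. */
	if (slave < factor || master < slave * factor)
		return 0;
	return 1;
}
```

With made-up counts of 20 flips on one side and 5 on the other,
toy_wake_wide(20, 5, 8) returns 0 while toy_wake_wide(20, 5, 4) returns
1, so shrinking the factor from threads to cores lets the spread-out
path trigger at lower flip counts.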

Ok, that's what I want to see, full repeat.
master = twiddle
master+ = twiddle+patch

concurrent tbench 4 + pgbench, 2 minutes per client count (i4790+smt)
                                             master                           master+
pgbench                   1       2       3     avg         1       2       3     avg   comp
clients 1       tps = 18599   18627   18532   18586     17480   17682   17606   17589   .946
clients 2       tps = 32344   32313   32408   32355     25167   26140   23730   25012   .773
clients 4       tps = 52593   51390   51095   51692     22983   23046   22427   22818   .441
clients 8       tps = 70354   69583   70107   70014     66924   66672   69310   67635   .966

Hrm... turn the tables, measure tbench while pgbench 4 client load runs endlessly.

                                             master                           master+
tbench                    1       2       3     avg         1       2       3     avg   comp
pairs 1        MB/s =   430     426     436     430       481     481     494     485  1.127
pairs 2        MB/s =  1083    1085    1072    1080      1086    1090    1083    1086  1.005
pairs 4        MB/s =  1725    1697    1729    1717      2023    2002    2006    2010  1.170
pairs 8        MB/s =  2740    2631    2700    2690      3016    2977    3071    3021  1.123

tbench without competition
                     master master+    comp
pairs 1        MB/s =   694     692    .997
pairs 2        MB/s =  1268    1259    .992
pairs 4        MB/s =  2210    2165    .979
pairs 8        MB/s =  3586    3526    .983  (yawn, all within routine variance)

twiddle:

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6048,14 +6048,18 @@ static void update_top_cache_domain(int
 {
 	struct sched_domain *sd;
 	struct sched_domain *busy_sd = NULL;
+	struct sched_group *group;
 	int id = cpu;
 	int size = 1;
 
 	sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
 	if (sd) {
 		id = cpumask_first(sched_domain_span(sd));
-		size = cpumask_weight(sched_domain_span(sd));
 		busy_sd = sd->parent; /* sd_busy */
+		group = sd->groups;
+		/* Set size to the number of cores, not threads */
+		while (group = group->next, group != sd->groups)
+			size++;
 	}
 	rcu_assign_pointer(per_cpu(sd_busy, cpu), busy_sd);
 
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4421,19 +4421,26 @@ static unsigned long cpu_avg_load_per_ta
 
 static void record_wakee(struct task_struct *p)
 {
+	unsigned long now = jiffies;
+
 	/*
 	 * Rough decay (wiping) for cost saving, don't worry
 	 * about the boundary, really active task won't care
 	 * about the loss.
 	 */
-	if (time_after(jiffies, current->wakee_flip_decay_ts + HZ)) {
+	if (time_after(now, current->wakee_flip_decay_ts + HZ)) {
 		current->wakee_flips >>= 1;
-		current->wakee_flip_decay_ts = jiffies;
+		current->wakee_flip_decay_ts = now;
+	}
+	if (time_after(now, p->wakee_flip_decay_ts + HZ)) {
+		p->wakee_flips >>= 1;
+		p->wakee_flip_decay_ts = now;
 	}
 
 	if (current->last_wakee != p) {
 		current->last_wakee = p;
 		current->wakee_flips++;
+		p->wakee_flips++;
 	}
 }
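
For playing with the record_wakee() change outside the kernel, a
minimal userspace sketch of the same bookkeeping follows.  struct
toy_task, TOY_HZ and the toy_* helpers are stand-ins invented for
illustration; the waker is passed explicitly instead of using current,
and the time is passed in rather than read from jiffies.

```c
#include <stddef.h>

#define TOY_HZ 250	/* assumed tick rate, for illustration only */

struct toy_task {
	unsigned long wakee_flips;
	unsigned long wakee_flip_decay_ts;
	struct toy_task *last_wakee;
};

/* Wrap-safe "a is after b", in the spirit of the kernel's time_after(). */
static int toy_time_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/* Halve the flip count once more than a second has passed. */
static void toy_decay(struct toy_task *t, unsigned long now)
{
	if (toy_time_after(now, t->wakee_flip_decay_ts + TOY_HZ)) {
		t->wakee_flips >>= 1;
		t->wakee_flip_decay_ts = now;
	}
}

/* Models the twiddled record_wakee(): waker wakes p at time now. */
static void toy_record_wakee(struct toy_task *waker, struct toy_task *p,
			     unsigned long now)
{
	toy_decay(waker, now);
	toy_decay(p, now);

	if (waker->last_wakee != p) {
		waker->last_wakee = p;
		waker->wakee_flips++;
		p->wakee_flips++;	/* the coupling added by the twiddle */
	}
}
```

One waker rotating through three wakees for six wakeups within the same
second ends with 6 flips on the waker and 2 on each wakee, where the
unmodified accounting would leave the wakees at 0.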