Re: [PATCH] sched/fair: Skip wake_affine() for core siblings

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Mike Galbraith <umgwanakikbuti@gmail.com>
To: Kirill Tkhai <ktkhai@odin.com>
Cc: linux-kernel@vger.kernel.org,
	Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>
Subject: Re: [PATCH] sched/fair: Skip wake_affine() for core siblings
Date: Sat, 26 Sep 2015 17:25:11 +0200	[thread overview]
Message-ID: <1443281111.3521.30.camel@gmail.com> (raw)
In-Reply-To: <56058A3F.5060408@odin.com>

On Fri, 2015-09-25 at 20:54 +0300, Kirill Tkhai wrote:
> We are not interested in actual target if both prev
> and curr cpus share CPU cache. select_idle_sibling()
> searches in top-down order; top level is the same
> for both of them, and the result will be the same.
> So, we can save a little CPU cycles and cache misses
> and skip wake_affine() calculations.

But, whereas previously wake_affine() could NAK a migration if it would
create an imbalance, we'll now just go ahead and stack tasks if
select_idle_sibling() can't find an idle home to override the blanket
approval.  It doesn't look like a good idea to me to bounce tasks around
only to then perhaps stack them, as if we do stack waker/wakee, we
certainly lose concurrency. (microbenchmarks like pipe-test love that,
but not all that many real applications play ping-pong for a living;)

I spent most of the day piddling with your little patch, so I'll post
some condensed mixed load notes.

concurrent tbench 4 + pgbench, 30 seconds per client count (i4790+smt)
                                             master                           master+
pgbench                   1       2       3     avg         1       2       3     avg   comp
clients 1       tps = 18768   18591   18264   18541     18351   17257   17245   17617   .950
clients 2       tps = 30779   30661   31016   30818     29112   28026   29026   28721   .931
clients 4       tps = 54195   55100   54048   54447     53290   52336   52930   52852   .970
clients 8       tps = 60332   67052   64699   64027     38491   35746   37746   37327   .582!!

Do the opposite, wake_affine() always NAKs.
                                             master                           master++
pgbench                   1       2       3     avg         1       2       3     avg   comp
clients 1       tps = 18768   18591   18264   18541     16874   16865   16665   16801   .906
clients 2       tps = 30779   30661   31016   30818     33562   33546   33681   33596  1.090
clients 4       tps = 54195   55100   54048   54447     61544   61482   61117   61381  1.127
clients 8       tps = 60332   67052   64699   64027     75171   75524   75318   75337  1.176

...

virgin vs your patch again, 2 _minutes_ per client count, as I noticed much variance at 8
clients, where wake_wide() is supposed to kick in to keep N:M load spread out.

                                             master                           master+
pgbench                   1       2       3     avg         1       2       3     avg   comp
clients 1       tps = 18548   18673   18390   18537     17879   17652   17621   17717   .955
clients 2       tps = 31083   31110   30859   31017     30274   30003   29796   30024   .967
clients 4       tps = 53107   53156   53601   53288     52658   53024   53449   53043   .995
clients 8       tps = 34213   34310   28844   32455     31360   31416   30732   31169   .960

30 seconds per run isn't enough, and wake_wide() is not doing a wonderful job for 1:N pgbench.

hrmph, twiddle...

waker/wakee coupling strengthened
postgres@homer:~> pgbench.sh
clients 1       tps = 18035
clients 2       tps = 32525
clients 4       tps = 53246
clients 8       tps = 37278

better, but not enough..  + sd_llc_size = #cores vs #threads
postgres@homer:~> pgbench.sh
clients 1       tps = 18482
clients 2       tps = 32366
clients 4       tps = 54557
clients 8       tps = 69643

Ok, that's what I want to see, full repeat.
master = twiddle
master+ = twiddle+patch

concurrent tbench 4 + pgbench, 2 minutes per client count (i4790+smt)
                                             master                           master+
pgbench                   1       2       3     avg         1       2       3     avg   comp
clients 1       tps = 18599   18627   18532   18586     17480   17682   17606   17589   .946
clients 2       tps = 32344   32313   32408   32355     25167   26140   23730   25012   .773
clients 4       tps = 52593   51390   51095   51692     22983   23046   22427   22818   .441
clients 8       tps = 70354   69583   70107   70014     66924   66672   69310   67635   .966

Hrm... turn the tables, measure tbench while pgbench 4 client load runs endlessly.

                                             master                           master+
tbench                    1       2       3     avg         1       2       3     avg   comp
pairs 1        MB/s =   430     426     436     430       481     481     494     485  1.127
pairs 2        MB/s =  1083    1085    1072    1080      1086    1090    1083    1086  1.005
pairs 4        MB/s =  1725    1697    1729    1717      2023    2002    2006    2010  1.170
pairs 8        MB/s =  2740    2631    2700    2690      3016    2977    3071    3021  1.123

tbench without competition
               master        master+   comp
pairs 1        MB/s =   694     692    .997 
pairs 2        MB/s =  1268    1259    .992
pairs 4        MB/s =  2210    2165    .979
pairs 8        MB/s =  3586    3526    .983  (yawn, all within routine variance)

twiddle:

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6048,14 +6048,18 @@ static void update_top_cache_domain(int
 {
 	struct sched_domain *sd;
 	struct sched_domain *busy_sd = NULL;
+	struct sched_group *group;
 	int id = cpu;
 	int size = 1;
 
 	sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
 	if (sd) {
 		id = cpumask_first(sched_domain_span(sd));
-		size = cpumask_weight(sched_domain_span(sd));
 		busy_sd = sd->parent; /* sd_busy */
+		group = sd->groups;
+		/* Set size to the number of cores, not threads */
+		while (group = group->next, group != sd->groups)
+			size++;
 	}
 	rcu_assign_pointer(per_cpu(sd_busy, cpu), busy_sd);
 
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4421,19 +4421,26 @@ static unsigned long cpu_avg_load_per_ta
 
 static void record_wakee(struct task_struct *p)
 {
+	unsigned long now = jiffies;
+
 	/*
 	 * Rough decay (wiping) for cost saving, don't worry
 	 * about the boundary, really active task won't care
 	 * about the loss.
 	 */
-	if (time_after(jiffies, current->wakee_flip_decay_ts + HZ)) {
+	if (time_after(now, current->wakee_flip_decay_ts + HZ)) {
 		current->wakee_flips >>= 1;
-		current->wakee_flip_decay_ts = jiffies;
+		current->wakee_flip_decay_ts = now;
+	}
+	if (time_after(now, p->wakee_flip_decay_ts + HZ)) {
+		p->wakee_flips >>= 1;
+		p->wakee_flip_decay_ts = now;
 	}
 
 	if (current->last_wakee != p) {
 		current->last_wakee = p;
 		current->wakee_flips++;
+		p->wakee_flips++;
 	}
 }

next prev parent reply	other threads:[~2015-09-26 15:25 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-09-25 17:54 [PATCH] sched/fair: Skip wake_affine() for core siblings Kirill Tkhai
2015-09-26 15:25 ` Mike Galbraith [this message]
2015-09-28 10:28   ` Kirill Tkhai
2015-09-28 13:12     ` Mike Galbraith
2015-09-28 15:36       ` Kirill Tkhai
2015-09-28 15:49         ` Kirill Tkhai
2015-09-28 18:22         ` Mike Galbraith
2015-09-28 19:19           ` Kirill Tkhai
2015-09-29  2:03             ` Mike Galbraith
2015-09-29 14:55         ` Mike Galbraith
2015-09-29 16:00           ` Kirill Tkhai
2015-09-29 16:03             ` Kirill Tkhai
2015-09-29 17:29             ` Mike Galbraith
2015-09-30 19:16               ` Kirill Tkhai

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1443281111.3521.30.camel@gmail.com \
    --to=umgwanakikbuti@gmail.com \
    --cc=ktkhai@odin.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.