From: Mike Galbraith <umgwanakikbuti@gmail.com>
To: Kirill Tkhai <ktkhai@odin.com>
Cc: linux-kernel@vger.kernel.org,
Peter Zijlstra <peterz@infradead.org>,
Ingo Molnar <mingo@redhat.com>
Subject: Re: [PATCH] sched/fair: Skip wake_affine() for core siblings
Date: Sat, 26 Sep 2015 17:25:11 +0200 [thread overview]
Message-ID: <1443281111.3521.30.camel@gmail.com> (raw)
In-Reply-To: <56058A3F.5060408@odin.com>
On Fri, 2015-09-25 at 20:54 +0300, Kirill Tkhai wrote:
> We are not interested in actual target if both prev
> and curr cpus share CPU cache. select_idle_sibling()
> searches in top-down order; top level is the same
> for both of them, and the result will be the same.
> So, we can save a little CPU cycles and cache misses
> and skip wake_affine() calculations.
But, whereas previously wake_affine() could NAK a migration if it would
create an imbalance, we'll now just go ahead and stack tasks if
select_idle_sibling() can't find an idle home to override the blanket
approval. It doesn't look like a good idea to me to bounce tasks around
only to then perhaps stack them, as if we do stack waker/wakee, we
certainly lose concurrency. (microbenchmarks like pipe-test love that,
but not all that many real applications play ping-pong for a living;)
I spent most of the day piddling with your little patch, so I'll post
some condensed mixed load notes.
concurrent tbench 4 + pgbench, 30 seconds per client count (i4790+smt)
master master+
pgbench 1 2 3 avg 1 2 3 avg comp
clients 1 tps = 18768 18591 18264 18541 18351 17257 17245 17617 .950
clients 2 tps = 30779 30661 31016 30818 29112 28026 29026 28721 .931
clients 4 tps = 54195 55100 54048 54447 53290 52336 52930 52852 .970
clients 8 tps = 60332 67052 64699 64027 38491 35746 37746 37327 .582!!
Do the opposite, wake_affine() always NAKs.
master master++
pgbench 1 2 3 avg 1 2 3 avg comp
clients 1 tps = 18768 18591 18264 18541 16874 16865 16665 16801 .906
clients 2 tps = 30779 30661 31016 30818 33562 33546 33681 33596 1.090
clients 4 tps = 54195 55100 54048 54447 61544 61482 61117 61381 1.127
clients 8 tps = 60332 67052 64699 64027 75171 75524 75318 75337 1.176
...
virgin vs your patch again, 2 _minutes_ per client count, as I noticed much variance at 8
clients, where wake_wide() is supposed to kick in to keep N:M load spread out.
master master+
pgbench 1 2 3 avg 1 2 3 avg comp
clients 1 tps = 18548 18673 18390 18537 17879 17652 17621 17717 .955
clients 2 tps = 31083 31110 30859 31017 30274 30003 29796 30024 .967
clients 4 tps = 53107 53156 53601 53288 52658 53024 53449 53043 .995
clients 8 tps = 34213 34310 28844 32455 31360 31416 30732 31169 .960
30 seconds per run isn't enough, and wake_wide() is not doing a wonderful job for 1:N pgbench.
hrmph, twiddle...
waker/wakee coupling strengthened
postgres@homer:~> pgbench.sh
clients 1 tps = 18035
clients 2 tps = 32525
clients 4 tps = 53246
clients 8 tps = 37278
better, but not enough.. + sd_llc_size = #cores vs #threads
postgres@homer:~> pgbench.sh
clients 1 tps = 18482
clients 2 tps = 32366
clients 4 tps = 54557
clients 8 tps = 69643
Ok, that's what I want to see, full repeat.
master = twiddle
master+ = twiddle+patch
concurrent tbench 4 + pgbench, 2 minutes per client count (i4790+smt)
master master+
pgbench 1 2 3 avg 1 2 3 avg comp
clients 1 tps = 18599 18627 18532 18586 17480 17682 17606 17589 .946
clients 2 tps = 32344 32313 32408 32355 25167 26140 23730 25012 .773
clients 4 tps = 52593 51390 51095 51692 22983 23046 22427 22818 .441
clients 8 tps = 70354 69583 70107 70014 66924 66672 69310 67635 .966
Hrm... turn the tables, measure tbench while pgbench 4 client load runs endlessly.
master master+
tbench 1 2 3 avg 1 2 3 avg comp
pairs 1 MB/s = 430 426 436 430 481 481 494 485 1.127
pairs 2 MB/s = 1083 1085 1072 1080 1086 1090 1083 1086 1.005
pairs 4 MB/s = 1725 1697 1729 1717 2023 2002 2006 2010 1.170
pairs 8 MB/s = 2740 2631 2700 2690 3016 2977 3071 3021 1.123
tbench without competition
master master+ comp
pairs 1 MB/s = 694 692 .997
pairs 2 MB/s = 1268 1259 .992
pairs 4 MB/s = 2210 2165 .979
pairs 8 MB/s = 3586 3526 .983 (yawn, all within routine variance)
twiddle:
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6048,14 +6048,18 @@ static void update_top_cache_domain(int
{
struct sched_domain *sd;
struct sched_domain *busy_sd = NULL;
+ struct sched_group *group;
int id = cpu;
int size = 1;
sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
if (sd) {
id = cpumask_first(sched_domain_span(sd));
- size = cpumask_weight(sched_domain_span(sd));
busy_sd = sd->parent; /* sd_busy */
+ group = sd->groups;
+ /* Set size to the number of cores, not threads */
+ while (group = group->next, group != sd->groups)
+ size++;
}
rcu_assign_pointer(per_cpu(sd_busy, cpu), busy_sd);
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4421,19 +4421,26 @@ static unsigned long cpu_avg_load_per_ta
static void record_wakee(struct task_struct *p)
{
+ unsigned long now = jiffies;
+
/*
* Rough decay (wiping) for cost saving, don't worry
* about the boundary, really active task won't care
* about the loss.
*/
- if (time_after(jiffies, current->wakee_flip_decay_ts + HZ)) {
+ if (time_after(now, current->wakee_flip_decay_ts + HZ)) {
current->wakee_flips >>= 1;
- current->wakee_flip_decay_ts = jiffies;
+ current->wakee_flip_decay_ts = now;
+ }
+ if (time_after(now, p->wakee_flip_decay_ts + HZ)) {
+ p->wakee_flips >>= 1;
+ p->wakee_flip_decay_ts = now;
}
if (current->last_wakee != p) {
current->last_wakee = p;
current->wakee_flips++;
+ p->wakee_flips++;
}
}
next prev parent reply other threads:[~2015-09-26 15:25 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-09-25 17:54 [PATCH] sched/fair: Skip wake_affine() for core siblings Kirill Tkhai
2015-09-26 15:25 ` Mike Galbraith [this message]
2015-09-28 10:28 ` Kirill Tkhai
2015-09-28 13:12 ` Mike Galbraith
2015-09-28 15:36 ` Kirill Tkhai
2015-09-28 15:49 ` Kirill Tkhai
2015-09-28 18:22 ` Mike Galbraith
2015-09-28 19:19 ` Kirill Tkhai
2015-09-29 2:03 ` Mike Galbraith
2015-09-29 14:55 ` Mike Galbraith
2015-09-29 16:00 ` Kirill Tkhai
2015-09-29 16:03 ` Kirill Tkhai
2015-09-29 17:29 ` Mike Galbraith
2015-09-30 19:16 ` Kirill Tkhai
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1443281111.3521.30.camel@gmail.com \
--to=umgwanakikbuti@gmail.com \
--cc=ktkhai@odin.com \
--cc=linux-kernel@vger.kernel.org \
--cc=mingo@redhat.com \
--cc=peterz@infradead.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.