From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752061AbbIZPZQ (ORCPT ); Sat, 26 Sep 2015 11:25:16 -0400 Received: from mail-wi0-f170.google.com ([209.85.212.170]:35384 "EHLO mail-wi0-f170.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751177AbbIZPZP (ORCPT ); Sat, 26 Sep 2015 11:25:15 -0400 Message-ID: <1443281111.3521.30.camel@gmail.com> Subject: Re: [PATCH] sched/fair: Skip wake_affine() for core siblings From: Mike Galbraith To: Kirill Tkhai Cc: linux-kernel@vger.kernel.org, Peter Zijlstra , Ingo Molnar Date: Sat, 26 Sep 2015 17:25:11 +0200 In-Reply-To: <56058A3F.5060408@odin.com> References: <56058A3F.5060408@odin.com> Content-Type: text/plain; charset="UTF-8" X-Mailer: Evolution 3.12.11 Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, 2015-09-25 at 20:54 +0300, Kirill Tkhai wrote: > We are not interested in actual target if both prev > and curr cpus share CPU cache. select_idle_sibling() > searches in top-down order; top level is the same > for both of them, and the result will be the same. > So, we can save a little CPU cycles and cache misses > and skip wake_affine() calculations. But, whereas previously wake_affine() could NAK a migration if it would create an imbalance, we'll now just go ahead and stack tasks if select_idle_sibling() can't find an idle home to override the blanket approval. It doesn't look like a good idea to me to bounce tasks around only to then perhaps stack them, as if we do stack waker/wakee, we certainly lose concurrency. (microbenchmarks like pipe-test love that, but not all that many real applications play ping-pong for a living;) I spent most of the day piddling with your little patch, so I'll post some condensed mixed load notes. concurrent tbench 4 + pgbench, 30 seconds per client count (i4790+smt) master master+ pgbench 1 2 3 avg 1 2 3 avg comp clients 1 tps = 18768 18591 18264 18541 18351 17257 17245 17617 .950 clients 2 tps = 30779 30661 31016 30818 29112 28026 29026 28721 .931 clients 4 tps = 54195 55100 54048 54447 53290 52336 52930 52852 .970 clients 8 tps = 60332 67052 64699 64027 38491 35746 37746 37327 .582!! Do the opposite, wake_affine() always NAKs. master master++ pgbench 1 2 3 avg 1 2 3 avg comp clients 1 tps = 18768 18591 18264 18541 16874 16865 16665 16801 .906 clients 2 tps = 30779 30661 31016 30818 33562 33546 33681 33596 1.090 clients 4 tps = 54195 55100 54048 54447 61544 61482 61117 61381 1.127 clients 8 tps = 60332 67052 64699 64027 75171 75524 75318 75337 1.176 ... virgin vs your patch again, 2 _minutes_ per client count, as I noticed much variance at 8 clients, where wake_wide() is supposed to kick in to keep N:M load spread out. master master+ pgbench 1 2 3 avg 1 2 3 avg comp clients 1 tps = 18548 18673 18390 18537 17879 17652 17621 17717 .955 clients 2 tps = 31083 31110 30859 31017 30274 30003 29796 30024 .967 clients 4 tps = 53107 53156 53601 53288 52658 53024 53449 53043 .995 clients 8 tps = 34213 34310 28844 32455 31360 31416 30732 31169 .960 30 seconds per run isn't enough, and wake_wide() is not doing a wonderful job for 1:N pgbench. hrmph, twiddle... waker/wakee coupling strengthened postgres@homer:~> pgbench.sh clients 1 tps = 18035 clients 2 tps = 32525 clients 4 tps = 53246 clients 8 tps = 37278 better, but not enough.. + sd_llc_size = #cores vs #threads postgres@homer:~> pgbench.sh clients 1 tps = 18482 clients 2 tps = 32366 clients 4 tps = 54557 clients 8 tps = 69643 Ok, that's what I want to see, full repeat. master = twiddle master+ = twiddle+patch concurrent tbench 4 + pgbench, 2 minutes per client count (i4790+smt) master master+ pgbench 1 2 3 avg 1 2 3 avg comp clients 1 tps = 18599 18627 18532 18586 17480 17682 17606 17589 .946 clients 2 tps = 32344 32313 32408 32355 25167 26140 23730 25012 .773 clients 4 tps = 52593 51390 51095 51692 22983 23046 22427 22818 .441 clients 8 tps = 70354 69583 70107 70014 66924 66672 69310 67635 .966 Hrm... turn the tables, measure tbench while pgbench 4 client load runs endlessly. master master+ tbench 1 2 3 avg 1 2 3 avg comp pairs 1 MB/s = 430 426 436 430 481 481 494 485 1.127 pairs 2 MB/s = 1083 1085 1072 1080 1086 1090 1083 1086 1.005 pairs 4 MB/s = 1725 1697 1729 1717 2023 2002 2006 2010 1.170 pairs 8 MB/s = 2740 2631 2700 2690 3016 2977 3071 3021 1.123 tbench without competition master master+ comp pairs 1 MB/s = 694 692 .997 pairs 2 MB/s = 1268 1259 .992 pairs 4 MB/s = 2210 2165 .979 pairs 8 MB/s = 3586 3526 .983 (yawn, all within routine variance) twiddle: --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6048,14 +6048,18 @@ static void update_top_cache_domain(int { struct sched_domain *sd; struct sched_domain *busy_sd = NULL; + struct sched_group *group; int id = cpu; int size = 1; sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES); if (sd) { id = cpumask_first(sched_domain_span(sd)); - size = cpumask_weight(sched_domain_span(sd)); busy_sd = sd->parent; /* sd_busy */ + group = sd->groups; + /* Set size to the number of cores, not threads */ + while (group = group->next, group != sd->groups) + size++; } rcu_assign_pointer(per_cpu(sd_busy, cpu), busy_sd); --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4421,19 +4421,26 @@ static unsigned long cpu_avg_load_per_ta static void record_wakee(struct task_struct *p) { + unsigned long now = jiffies; + /* * Rough decay (wiping) for cost saving, don't worry * about the boundary, really active task won't care * about the loss. */ - if (time_after(jiffies, current->wakee_flip_decay_ts + HZ)) { + if (time_after(now, current->wakee_flip_decay_ts + HZ)) { current->wakee_flips >>= 1; - current->wakee_flip_decay_ts = jiffies; + current->wakee_flip_decay_ts = now; + } + if (time_after(now, p->wakee_flip_decay_ts + HZ)) { + p->wakee_flips >>= 1; + p->wakee_flip_decay_ts = now; } if (current->last_wakee != p) { current->last_wakee = p; current->wakee_flips++; + p->wakee_flips++; } }