From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1756428AbYE0F7Z (ORCPT ); Tue, 27 May 2008 01:59:25 -0400
Received: (majordomo@vger.kernel.org) by vger.kernel.org
	id S1752906AbYE0F7R (ORCPT ); Tue, 27 May 2008 01:59:17 -0400
Received: from mail.gmx.net ([213.165.64.20]:39330 "HELO mail.gmx.net"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with SMTP
	id S1750774AbYE0F7P (ORCPT ); Tue, 27 May 2008 01:59:15 -0400
X-Authenticated: #14349625
X-Provags-ID: V01U2FsdGVkX191c9NDz8h1M7ufzgCuXAWjqXfl6YM5dC5WpRTWH2 60WVSzn7DSg6h7
Subject: [patch] Re: PostgreSQL pgbench performance regression in 2.6.23+
From: Mike Galbraith
To: Greg Smith
Cc: Ingo Molnar, Peter Zijlstra, Dhaval Giani, lkml, Srivatsa Vaddagiri
In-Reply-To:
References: <1211440207.5733.8.camel@marge.simson.net>
	 <20080522082814.GA4499@linux.vnet.ibm.com>
	 <1211447105.4823.7.camel@marge.simson.net>
	 <1211452465.7606.8.camel@marge.simson.net>
	 <1211455553.4381.9.camel@marge.simson.net>
	 <1211456659.29104.20.camel@twins>
	 <1211458176.5693.6.camel@marge.simson.net>
	 <1211459081.29104.40.camel@twins>
	 <1211536814.5851.18.camel@marge.simson.net>
	 <20080523101000.GA13964@elte.hu>
	 <1211537717.5851.22.camel@marge.simson.net>
	 <1211586407.4786.5.camel@marge.simson.net>
Content-Type: multipart/mixed; boundary="=-N239WkgeKDHrBBQO+3yB"
Date: Tue, 27 May 2008 07:59:10 +0200
Message-Id: <1211867950.5505.47.camel@marge.simson.net>
Mime-Version: 1.0
X-Mailer: Evolution 2.12.0
X-Y-GMX-Trusted: 0
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

--=-N239WkgeKDHrBBQO+3yB
Content-Type: text/plain
Content-Transfer-Encoding: 7bit

On Mon, 2008-05-26 at 20:28 -0400, Greg Smith wrote:
> I did again get useful results here with the stock 2.6.26.git kernel and
> default parameters using Peter's small patch to adjust se.waker.
> 
> What I found most interesting was how the results changed when I set
> /proc/sys/kernel/sched_features = 0, without doing anything with batch
> mode.  The default for that is 1101111111=895.  What I then did was run
> through setting each of those bits off one by one to see which feature(s)
> were getting in the way here.  The two that mattered a lot were 895-32=863
> (no SCHED_FEAT_SYNC_WAKEUPS) and 895-2=893 (no SCHED_FEAT_WAKEUP_PREEMPT).
> Combining those two but keeping the rest of the features on (895-32-2=861)
> actually gave the best result I've ever seen here, better than with all
> the features disabled.  Tossing out all the tests I did that didn't show
> anything useful, here's my table of the interesting results.
> 
> Clients  .22.19  .26.git   waker     f=0   f=893   f=863   f=861
>       1    7660    11043   11041    9214   11204    9232    9433
>       2   17798    11452   16306   16916   11165   16686   16097
>       3   29612    13231   18476   24202   11348   26210   26906
>       4   25584    13053   17942   26639   11331   25094   25679
>       6   25295    12263   18472   28918   11761   30525   33297
>       8   24344    11748   19109   32730   12190   31775   35912
>      10   23963    11612   19537   31688   12331   29644   36215
>      15   23026    11414   19518   33209   13050   28651   36452
>      20   22549    11332   19029   32583   13544   25776   35707
>      30   22074    10743   18884   32447   14191   21772   33501
>      40   21495    10406   18609   31704   11017   20600   32743
>      50   20051    10534   17478   29483   14683   19949   31047
>      60   18690     9816   17467   28614   14817   18681   29576
> 
> Note that compared to earlier test runs, I replaced the 5 client case with
> a 60 client one to get more data on the top end.  I also wouldn't pay too
> much attention to the single client case; that one really bounces around a
> lot on most of the kernel revs, even with me doing 5 runs and using the
> median.

Care to give the below a whirl?  It fixes the over-enthusiastic affinity
bug in a less restrictive way.  It doesn't attempt to address the needs
of any particular load though; that needs more thought (tricky issue).

With default features, I get the below.
2.6.26-smp x86_64

 1  10121.600913
 2  14360.229517
 3  17048.770371
 4  18748.777814
 5  22086.493358
 6  24913.416187
 8  27976.026783
10  29346.503261
15  29157.239431
20  28392.257204
30  26590.199787
40  24422.481578
50  23305.981434

(I can get a bit more by disabling HR_TICK along with a dinky patchlet to
reduce overhead when it's disabled.  Bottom line is that the bug is fixed
though; maximizing performance is a separate issue imho.)

Prevent short-running wakers of short-running threads from overloading a
single cpu via wakeup affinity, and wire up disconnected debug option.

Signed-off-by: Mike Galbraith

 kernel/sched_fair.c |   25 ++++++++++++++-----------
 1 files changed, 14 insertions(+), 11 deletions(-)

Index: linux-2.6.26.git/kernel/sched_fair.c
===================================================================
--- linux-2.6.26.git.orig/kernel/sched_fair.c
+++ linux-2.6.26.git/kernel/sched_fair.c
@@ -1057,16 +1057,27 @@ wake_affine(struct rq *rq, struct sched_
 	struct task_struct *curr = this_rq->curr;
 	unsigned long tl = this_load;
 	unsigned long tl_per_task;
+	int bad_imbalance;
 
-	if (!(this_sd->flags & SD_WAKE_AFFINE))
+	if (!(this_sd->flags & SD_WAKE_AFFINE) || !sched_feat(AFFINE_WAKEUPS))
 		return 0;
 
 	/*
+	 * If sync wakeup then subtract the (maximum possible)
+	 * effect of the currently running task from the load
+	 * of the current CPU:
+	 */
+	if (sync && tl)
+		tl -= curr->se.load.weight;
+
+	bad_imbalance = 100*(tl + p->se.load.weight) > imbalance*load;
+
+	/*
 	 * If the currently running task will sleep within
 	 * a reasonable amount of time then attract this newly
 	 * woken task:
 	 */
-	if (sync && curr->sched_class == &fair_sched_class) {
+	if (sync && !bad_imbalance && curr->sched_class == &fair_sched_class) {
 		if (curr->se.avg_overlap < sysctl_sched_migration_cost &&
 		    p->se.avg_overlap < sysctl_sched_migration_cost)
 			return 1;
@@ -1075,16 +1086,8 @@ wake_affine(struct rq *rq, struct sched_
 	schedstat_inc(p, se.nr_wakeups_affine_attempts);
 	tl_per_task = cpu_avg_load_per_task(this_cpu);
 
-	/*
-	 * If sync wakeup then subtract the (maximum possible)
-	 * effect of the currently running task from the load
-	 * of the current CPU:
-	 */
-	if (sync)
-		tl -= current->se.load.weight;
-
 	if ((tl <= load && tl + target_load(prev_cpu, idx) <= tl_per_task) ||
-			100*(tl + p->se.load.weight) <= imbalance*load) {
+	    !bad_imbalance) {
 		/*
 		 * This domain has SD_WAKE_AFFINE and
 		 * p is cache cold in this domain, and

> Some still open questions after this long investigation that I'd like to
> know the answers to are:
> 
> 1) Why are my 2.6.26.git results so dramatically worse than the ones Mike
> posted?  I'm not sure what was different about his test setup here.  The
> 2.6.22 results are pretty similar, and the fully tuned ones as well, so
> the big difference on that column bugs me.

Because those were git _with_ Peter's patch.  I was looking at the
difference between a non-broken 2.6.26 with and without minimizing
preemption, to match the way I tested 2.6.22.

> 2) Mike suggested a patch to 2.6.25 in this thread that backports the
> feature for disabling SCHED_FEAT_SYNC_WAKEUPS.  Would it be reasonable to
> push that into 2.6.25.5?  It's clearly quite useful for this load and
> therefore possibly others.

If the patch above flies, imho, it should be folded into the backport to
stable.  The heart of the patch is a bugfix, so definitely stable
material.  Whether to enable the debugging/tweaking knobs as well is a
different question.  (I would do it, but I ain't the maintainer ;)

> 3) Peter's se.waker patch is a big step forward on this workload without
> any tuning, closing a significant amount of the gap between the default
> setup and what I get with the two troublesome features turned off
> altogether.  What issues might there be with pushing that into the stock
> {2.6.25|2.6.26} kernel?

(It doesn't fully address the 1:N needs; that needs much more thought.)

> 4) What known workloads are there that suffer if SCHED_FEAT_SYNC_WAKEUPS
> and SCHED_FEAT_WAKEUP_PREEMPT are disabled?
> I'd think that any attempt to
> tune those code paths would need my case for "works better when
> SYNC/PREEMPT wakeups disabled" as well as a case that works worse to
> balance modifications against.

I suspect very many, certainly any load where latency is of major
importance.  Peak performance of the mysql+oltp benchmark for sure is
injured with your settings.

As a really good demonstration of how important wakeup preemption can
be, try running the attached testcase, which was distilled from a real
world load and posted to lkml a few years ago, without wakeup preemption
and nailed to one cpu.

> 5) Once (4) has identified some test cases, what else might be done to
> make the default behavior better without killing the situations it's
> intended for?

See patch :)

	-Mike

--=-N239WkgeKDHrBBQO+3yB
Content-Disposition: attachment; filename=starve.c
Content-Type: text/x-csrc; name=starve.c; charset=UTF-8
Content-Transfer-Encoding: 7bit

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>

volatile unsigned long loop = 10000000;

/* SIGUSR1 handler: count down one per delivered signal. */
void handler (int n)
{
	if (loop > 0)
		--loop;
}

/* Child: flood the parent with signals as fast as possible. */
static int child (void)
{
	pid_t ppid = getppid ();

	sleep (1);
	while (1)
		kill (ppid, SIGUSR1);
	return 0;
}

int main (int argc, char **argv)
{
	pid_t child_pid;
	int r;

	loop = argc > 1 ? strtoul (argv[1], NULL, 10) : 10000000;
	printf ("expecting to receive %lu signals\n", loop);

	if ((child_pid = fork ()) == 0)
		exit (child ());

	signal (SIGUSR1, handler);
	while (loop)
		sleep (1);

	r = kill (child_pid, SIGTERM);
	waitpid (child_pid, NULL, 0);

	return 0;
}

--=-N239WkgeKDHrBBQO+3yB--