From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1753976AbcA2Kem (ORCPT ); Fri, 29 Jan 2016 05:34:42 -0500
Received: from mx5-phx2.redhat.com ([209.132.183.37]:46645 "EHLO
	mx5-phx2.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1753440AbcA2KeM (ORCPT );
	Fri, 29 Jan 2016 05:34:12 -0500
Date: Fri, 29 Jan 2016 05:33:45 -0500 (EST)
From: Jan Stancek
To: Peter Zijlstra
Cc: alex shi , guz fnst , mingo@redhat.com, jolsa@redhat.com,
	riel@redhat.com, linux-kernel@vger.kernel.org
Message-ID: <654964868.14006956.1454063625314.JavaMail.zimbra@redhat.com>
In-Reply-To: <20160129101522.GF6357@twins.programming.kicks-ass.net>
References: <56A8D994.6050205@redhat.com> <56AA39D6.4070509@redhat.com>
	<20160128174903.GV6356@twins.programming.kicks-ass.net>
	<333246323.13611103.1454006593261.JavaMail.zimbra@redhat.com>
	<20160129101522.GF6357@twins.programming.kicks-ass.net>
Subject: Re: [BUG] scheduler doesn't balance thread to idle cpu for 3 seconds
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-Originating-IP: [10.34.26.57]
X-Mailer: Zimbra 8.0.6_GA_5922 (ZimbraWebClient - FF38 (Linux)/8.0.6_GA_5922)
Thread-Topic: scheduler doesn't balance thread to idle cpu for 3 seconds
Thread-Index: R8XvGPrJqThBk9SAdL4Jy7LpafZZ+w==
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

----- Original Message -----
> From: "Peter Zijlstra"
> To: "Jan Stancek"
> Cc: "alex shi" , "guz fnst" , mingo@redhat.com, jolsa@redhat.com,
> riel@redhat.com, linux-kernel@vger.kernel.org
> Sent: Friday, 29 January, 2016 11:15:22 AM
> Subject: Re: [BUG] scheduler doesn't balance thread to idle cpu for 3 seconds
>
> On Thu, Jan 28, 2016 at 01:43:13PM -0500, Jan Stancek wrote:
> > > How long should I have to wait for a fail?
> >
> > It's about 1000-2000 iterations for me, which I think you covered
> > by now in those 2 hours.
>
> So I've been running:
>
>   while ! ./pthread_cond_wait_1 ; do sleep 1; done
>
> overnight on the machine, and have yet to hit a wobbly -- that is, its
> still running.

I have seen a similar result. Then, instead of turning CPUs off,
I spawned more low-prio threads to scale with the number of CPUs
on the system:

@@ -213,10 +213,14 @@
 		printf(ERROR_PREFIX "pthread_attr_setschedparam\n");
 		exit(PTS_UNRESOLVED);
 	}
-	rc = pthread_create(&low_id, &low_attr, low_priority_thread, NULL);
-	if (rc != 0) {
-		printf(ERROR_PREFIX "pthread_create\n");
-		exit(PTS_UNRESOLVED);
+
+	int i, ncpus = sysconf(_SC_NPROCESSORS_ONLN);
+	for (i = 0; i < ncpus - 1; i++) {
+		rc = pthread_create(&low_id, &low_attr, low_priority_thread, NULL);
+		if (rc != 0) {
+			printf(ERROR_PREFIX "pthread_create\n");
+			exit(PTS_UNRESOLVED);
+		}

and let this run on 3 bare-metal x86 systems overnight (v4.5-rc1).
It failed on 2 systems (12 and 24 CPUs) with a 1:1000 chance;
it never failed on the 3rd one (4 CPUs).

>
> Also note that I don't think failing this test is a bug per se.
> Undesirable maybe, but within spec, since SIGALRM is process wide, so it
> being delivered to the SCHED_OTHER task is accepted, and SCHED_OTHER has
> no timeliness guarantees.
>
> That said; if I could reliably reproduce I'd have a go at fixing this, I
> suspect there's a 'fun' problem at the bottom of this.

Thanks for trying, I'll see if I can find a more reliable way
to reproduce it.

Regards,
Jan