From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751104AbcEGPZK (ORCPT ); Sat, 7 May 2016 11:25:10 -0400 Received: from e17.ny.us.ibm.com ([129.33.205.207]:36142 "EHLO e17.ny.us.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750732AbcEGPZG (ORCPT ); Sat, 7 May 2016 11:25:06 -0400 X-IBM-Helo: d01dlp02.pok.ibm.com X-IBM-MailFrom: paulmck@linux.vnet.ibm.com X-IBM-RcptTo: linux-kernel@vger.kernel.org Date: Sat, 7 May 2016 08:25:01 -0700 From: "Paul E. McKenney" To: Ross Green Cc: Peter Zijlstra , Mathieu Desnoyers , "Chatre, Reinette" , Jacob Pan , Josh Triplett , John Stultz , Thomas Gleixner , lkml , Ingo Molnar , Lai Jiangshan , dipankar@in.ibm.com, Andrew Morton , rostedt , David Howells , Eric Dumazet , Darren Hart , =?iso-8859-1?Q?Fr=E9d=E9ric?= Weisbecker , Oleg Nesterov , pranith kumar Subject: Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17 Message-ID: <20160507152501.GW3593@linux.vnet.ibm.com> Reply-To: paulmck@linux.vnet.ibm.com References: <20160327210641.GB4287@linux.vnet.ibm.com> <20160328062547.GD6344@twins.programming.kicks-ass.net> <20160328130841.GE4287@linux.vnet.ibm.com> <20160329002518.GA13058@linux.vnet.ibm.com> <20160329002814.GB13058@linux.vnet.ibm.com> <20160329134908.GA27588@linux.vnet.ibm.com> <20160330145547.GA3929@linux.vnet.ibm.com> <20160331154255.GA22915@linux.vnet.ibm.com> <20160403081853.GA32220@linux.vnet.ibm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) X-TM-AS-MML: disable X-Content-Scanned: Fidelis XPS MAILER x-cbid: 16050715-0041-0000-0000-00000413A0A6 X-IBM-AV-DETECTION: SAVI=unused REMOTE=unused XFE=unused Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, May 06, 2016 at 04:25:16PM +1000, Ross Green wrote: > On Sun, Apr 3, 2016 at 6:18 PM, Paul E. McKenney > wrote: > > On Thu, Mar 31, 2016 at 08:42:55AM -0700, Paul E. McKenney wrote: > >> On Wed, Mar 30, 2016 at 07:55:47AM -0700, Paul E. McKenney wrote: > >> > On Tue, Mar 29, 2016 at 06:49:08AM -0700, Paul E. McKenney wrote: > >> > > On Mon, Mar 28, 2016 at 05:28:14PM -0700, Paul E. McKenney wrote: > >> > > > On Mon, Mar 28, 2016 at 05:25:18PM -0700, Paul E. McKenney wrote: > >> > > > > On Mon, Mar 28, 2016 at 06:08:41AM -0700, Paul E. McKenney wrote: > >> > > > > > On Mon, Mar 28, 2016 at 08:25:47AM +0200, Peter Zijlstra wrote: > >> > > > > > > On Sun, Mar 27, 2016 at 02:06:41PM -0700, Paul E. McKenney wrote: > >> > > > > > >> > > > > [ . . . ] > >> > > > > > >> > > > > > > > OK, so I should instrument migration_call() if I get the repro rate up? > >> > > > > > > > >> > > > > > > Can do, maybe try the below first. (yes I know how long it all takes :/) > >> > > > > > > >> > > > > > OK, will run this today, then run calibration for last night's run this > >> > > > > > evening. > >> > > > >> > > And of 18 two-hour runs, there were five failures, or about 28%. > >> > > That said, I don't have even one significant digit on the failure rate, > >> > > as 5 of 18 is within the 95% confidence limits for a failure probability > >> > > as low as 12.5% and as high as 47%. > >> > > >> > And after last night's run, this is narrowed down to between 23% and 38%, > >> > which is close enough. Average is 30%, 18 failures in 60 runs. > >> > > >> > Next step is to test Peter's patch some more. Might take a couple of > >> > night's worth of runs to get statistical significance. After which > >> > it will be time to rebase to 4.6-rc1. > >> > >> And the first night was not so good: 6 failures out of 24 runs. Adding > >> this to the 1-of-10 earlier gets 7 failures of 34. Here are how things > >> stack up given the range of base failure estimates: > >> > >> Low 95% bound of 23%: 84% confidence. > >> > >> Actual measurement of 30%: 92% confidence. > >> > >> High 95% bound of 38%: 98% confidence. > >> > >> So there is still some chance that Peter's patch is helping. I will > >> run for one more evening, after which it will be time to move forward > >> to 4.6-rc1. > > > > And no luck reducing bounds. However, moving to 4.6-rc1 did get some > > of the trace_printk() to print. The ftrace_dump()s resulted in RCU > > CPU stall warnings, and the dumps were truncated due to test timeouts > > in my scripting. (I need to make my scripts more patient when they > > see an ftrace dump in progress, I guess.) > > > > Here are the results: > > > > http://www2.rdrop.com/users/paulmck/submission/TREE03.console.log.tgz > > http://www2.rdrop.com/users/paulmck/submission/TREE03.1.console.log.tgz > > http://www2.rdrop.com/users/paulmck/submission/TREE03.2.console.log.tgz > > http://www2.rdrop.com/users/paulmck/submission/TREE03.3.console.log.tgz > > http://www2.rdrop.com/users/paulmck/submission/TREE03.4.console.log.tgz > > http://www2.rdrop.com/users/paulmck/submission/TREE03.5.console.log.tgz > > http://www2.rdrop.com/users/paulmck/submission/TREE03.6.console.log.tgz > > http://www2.rdrop.com/users/paulmck/submission/TREE03.7.console.log.tgz > > http://www2.rdrop.com/users/paulmck/submission/TREE03.8.console.log.tgz > > http://www2.rdrop.com/users/paulmck/submission/TREE03.9.console.log.tgz > > http://www2.rdrop.com/users/paulmck/submission/TREE03.11.console.log.tgz > > http://www2.rdrop.com/users/paulmck/submission/TREE03.12.console.log.tgz > > http://www2.rdrop.com/users/paulmck/submission/TREE03.13.console.log.tgz > > http://www2.rdrop.com/users/paulmck/submission/TREE03.14.console.log.tgz > > http://www2.rdrop.com/users/paulmck/submission/TREE03.15.console.log.tgz > > http://www2.rdrop.com/users/paulmck/submission/TREE03.16.console.log.tgz > > http://www2.rdrop.com/users/paulmck/submission/TREE03.17.console.log.tgz > > http://www2.rdrop.com/users/paulmck/submission/TREE03.18.console.log.tgz > > http://www2.rdrop.com/users/paulmck/submission/TREE03.19.console.log.tgz > > http://www2.rdrop.com/users/paulmck/submission/TREE03.20.console.log.tgz > > http://www2.rdrop.com/users/paulmck/submission/TREE03.21.console.log.tgz > > http://www2.rdrop.com/users/paulmck/submission/TREE03.22.console.log.tgz > > http://www2.rdrop.com/users/paulmck/submission/TREE03.23.console.log.tgz > > http://www2.rdrop.com/users/paulmck/submission/TREE03.24.console.log.tgz > > > > The config is here: > > > > http://www2.rdrop.com/users/paulmck/submission/config.tgz > > > > More runs to measure 4.6-rc1 base error rate... > > > > Thanx, Paul > > > G'day Paul, > > > Thought i would update a few runs with the linux-4.6-rc kernels. > > I have attached log outputs through dmesg showing rcu_preempt stall warnings. > > > Thought it might be interesting for someone else to look at. > > Currently running linux-4.6-rc6 in testing. Thank you for sending these, I will look them over! Still working to reproduce this quickly enough to do real debug... :-/ Thanx, Paul