From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Mon, 12 Sep 2016 08:36:04 -0700
From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Rich Felker
Cc: linux-kernel@vger.kernel.org, john.stultz@linaro.org, tglx@linutronix.de
Subject: Re: rcu_sched stalls in idle task introduced in pre-4.8?
Reply-To: paulmck@linux.vnet.ibm.com
References: <20160802170414.GA20083@brightrain.aerifal.cx>
 <20160802181636.GJ3482@linux.vnet.ibm.com>
 <20160802192036.GW15995@brightrain.aerifal.cx>
 <20160802194802.GK3482@linux.vnet.ibm.com>
 <20160802203217.GZ15995@brightrain.aerifal.cx>
 <20160802204504.GL3482@linux.vnet.ibm.com>
 <20160803161631.GA20790@linux.vnet.ibm.com>
 <20160908221653.GR15995@brightrain.aerifal.cx>
 <20160910101938.GO32751@linux.vnet.ibm.com>
 <20160912151345.GW15995@brightrain.aerifal.cx>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
In-Reply-To: <20160912151345.GW15995@brightrain.aerifal.cx>
User-Agent: Mutt/1.5.21 (2010-09-15)
Message-Id: <20160912153604.GN3851@linux.vnet.ibm.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Sep 12, 2016 at 11:13:45AM -0400, Rich Felker wrote:
> On Sat, Sep 10, 2016 at 03:19:38AM -0700, Paul E. McKenney wrote:
> > On Thu, Sep 08, 2016 at 06:16:53PM -0400, Rich Felker wrote:
> > > On Wed, Aug 03, 2016 at 09:16:31AM -0700, Paul E. McKenney wrote:
> > > > On Tue, Aug 02, 2016 at 01:45:04PM -0700, Paul E. McKenney wrote:
> > > > > On Tue, Aug 02, 2016 at 04:32:17PM -0400, Rich Felker wrote:
> > > > > > On Tue, Aug 02, 2016 at 12:48:02PM -0700, Paul E. McKenney wrote:

[ . . . ]

> > > > > > > Does the problem reproduce easily?
> > > > > >
> > > > > > Yes, it happens right after boot and repeats every 30-90 seconds or
> > > > > > so.
> > > > >
> > > > > Well, that at least makes it easier to test any patches!
> > > > >
> > > > > > > A bisection might be very helpful.
> > > > > >
> > > > > > Bisection would require some manual work to setup because the whole
> > > > > > reason I was rebasing on Linus's tree was to adapt the drivers to
> > > > > > upstream infrastructure changes (the new cpuhp stuff replacing
> > > > > > notifier for cpu starting). The unfortunate way it was done, each
> > > > > > driver adds an enum to linux/cpuhotplug.h so all the patches have
> > > > > > gratuitous conflicts. In addition, for older revisions in Linus's
> > > > > > tree, there's at least one show-stopping (hang during boot) bug that
> > > > > > needs a cherry-pick to fix. There may be other small issues too.
> > > > > > I don't think they're at all insurmountable but it requires an
> > > > > > annoying amount of scripting.
> > > > >
> > > > > I had to ask!  Might eventually be necessary, but let's see what we
> > > > > can learn from what you currently have.
> > > >
> > > > And at first glance, my overnight run looks uglier than I would expect.
> > > > I am now running tests at v4.7, and will run other tests to see if
> > > > there really is a statistically significant degradation.  If there is,
> > > > then I might be able to bisect, though with nine-hour runs this could
> > > > take quite some time.
> > >
> > > Any more thoughts on this? I'm testing v4.8-rc5 (plus jcore drivers
> > > not yet upstream) and it's still happening.
> >
> > Not seeing it, but please do send me a recent splat from your dmesg and
> > your .config.
> >
> > Because I am not seeing it, I also suggest inspecting your jcore drivers
> > with the information in Documentation/RCU/stallwarn.txt in mind.
>
> I found a plausible cause -- it looks like the kernel timer system is
> programming the clock event device with intervals so low that another
> timer interrupt is pending before the first handler finishes, and it
> immediately interrupts the softirq part of the timer. Here's a snippet
> from ftrace event tracing:

That sort of thing could definitely result in RCU CPU stall warnings,
and much else besides!  Not so good for battery lifetime, for example,
if that is an issue in your case.

> <idle>-0       [001] d.h.  2646.702790: irq_handler_entry: irq=72 name=jcore_pit
> <idle>-0       [001] d.h.  2646.703398: softirq_raise: vec=1 [action=TIMER]
> <idle>-0       [001] d.h.  2646.703607: softirq_raise: vec=9 [action=RCU]
> <idle>-0       [001] d.h.  2646.703884: softirq_raise: vec=7 [action=SCHED]
> <idle>-0       [001] d.h.  2646.704191: irq_handler_exit: irq=72 ret=handled
> <idle>-0       [001] d.H.  2646.704509: irq_handler_entry: irq=72 name=jcore_pit
> <idle>-0       [001] d.H.  2646.704990: softirq_raise: vec=1 [action=TIMER]
> <idle>-0       [001] d.H.  2646.705182: softirq_raise: vec=9 [action=RCU]
> <idle>-0       [001] d.H.  2646.705465: softirq_raise: vec=7 [action=SCHED]
> <idle>-0       [001] d.H.  2646.705761: irq_handler_exit: irq=72 ret=handled
> <idle>-0       [001] d.H.  2646.706071: irq_handler_entry: irq=72 name=jcore_pit
> <idle>-0       [001] d.H.  2646.706328: irq_handler_exit: irq=72 ret=handled
> <idle>-0       [001] ..s.  2646.706530: softirq_entry: vec=1 [action=TIMER]
> <idle>-0       [001] ..s.  2646.706708: softirq_exit: vec=1 [action=TIMER]
> <idle>-0       [001] ..s.  2646.706816: softirq_entry: vec=7 [action=SCHED]
> <idle>-0       [001] ..s.  2646.707270: softirq_exit: vec=7 [action=SCHED]
> <idle>-0       [001] ..s.  2646.707397: softirq_entry: vec=9 [action=RCU]
> <idle>-0       [001] ..s.  2646.707641: softirq_exit: vec=9 [action=RCU]
> ksoftirqd/1-14 [001] ..s.  2646.708569: softirq_entry: vec=1 [action=TIMER]
> ksoftirqd/1-14 [001] ..s.  2646.708727: softirq_exit: vec=1 [action=TIMER]
> ksoftirqd/1-14 [001] ..s.  2646.708829: softirq_entry: vec=7 [action=SCHED]
> ksoftirqd/1-14 [001] ..s.  2646.709073: softirq_exit: vec=7 [action=SCHED]
> ksoftirqd/1-14 [001] ..s.  2646.709180: softirq_entry: vec=9 [action=RCU]
> ksoftirqd/1-14 [001] ..s.  2646.709882: softirq_exit: vec=9 [action=RCU]
>
> Presumably it usually just hits one or two levels of nesting like
> above, but in the worst case it gets in some sort of feedback loop and
> the backlog grows until it lasts >21 sec. I was able to see that there
> are plenty of shorter stalls (too short for the stall detector to kick
> in) by removing the SD card and observing a multi-second delay before:
>
> [ 401.375518] mmc0: SPI card removed
>
> The spi mmc host code that does this probe (no irq) is scheduled to
> run every 1s, so if the message takes >1s to appear (>0.5s avg),
> timers must be stalled.

Makes sense!

> Other interrupt activity (like serial or ethernet) breaks the stall
> immediately so it's not noticeable in most interactive usage.

Interesting...
This other interrupt activity stops the timer from posting so often,
or just runs at a higher interrupt priority level?

> I was able to make it go away by increasing the min_delta for the
> clock event device to 2.5ms (chosen because I measured timer interrupts
> taking up to ~1.5ms -- keep in mind this is at 50 MHz) but hard-coding
> a number like that does not seem like a proper solution. The kernel
> timer subsystem should have some way to choose oneshot expiration
> appropriately based on feedback, or it should otherwise ensure that
> the feedback loop can't happen (maybe by waiting to program the next
> expiration from the softirq instead of the hard one, but I'm not sure
> if that's possible...?).

I am not sure what the timer guys will want to do, but it might be that
they program at the beginning to increase accuracy.

I could imagine varying the default min_delta value based on the
processor frequency, but I have no idea whether or not that is a
reasonable approach.

							Thanx, Paul