From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Tue, 2 Aug 2016 13:45:04 -0700
From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Rich Felker
Cc: linux-kernel@vger.kernel.org, john.stultz@linaro.org, tglx@linutronix.de
Subject: Re: rcu_sched stalls in idle task introduced in pre-4.8?
Reply-To: paulmck@linux.vnet.ibm.com
References: <20160802170414.GA20083@brightrain.aerifal.cx> <20160802181636.GJ3482@linux.vnet.ibm.com> <20160802192036.GW15995@brightrain.aerifal.cx> <20160802194802.GK3482@linux.vnet.ibm.com> <20160802203217.GZ15995@brightrain.aerifal.cx>
In-Reply-To: <20160802203217.GZ15995@brightrain.aerifal.cx>
Message-Id: <20160802204504.GL3482@linux.vnet.ibm.com>

On Tue, Aug 02, 2016 at 04:32:17PM -0400, Rich Felker wrote:
> On Tue, Aug 02, 2016 at 12:48:02PM -0700, Paul E. McKenney wrote:
> > > > > The system I'm testing on is an SMP configuration of the J2 SoC on
> > > > > FPGA (see: j-core.org), and the SMP support and timer driver (most
> > > > > relevant, imo) are new, but never exhibited this problem on 4.6 or
> > > > > 4.7. A tree with all drivers merged is available in the development
> > > > > branch of git://git.libc.org/linux-sh; it's regularly rebased but
> > > > > there's a perma-tag devel-20160801 you can check out instead. Fetching
> > > > > into an existing clone of Linus's tree is recommended since the server
> > > > > has limited resources.
> > > > >
> > > > > I doubt the code from my devel branch will be useful without hardware
> > > > > to test on, though. Hopefully the issue is reproducible on other
> > > > > systems. Very low cpu speed (50 MHz) could be a factor. The problem
> > > > > seems to go away with dyntick disabled.
> > > >
> > > > Indeed, it is quite possible that RCU's self-tuning needs a bit of help
> > > > for CPUs below a few hundred MHz. As might other pieces of the kernel.
> > > > Should not be a big deal (famous last words...). However, given that
> > > > you are running HZ=100, it should be reasonably close. So first, let's
> > > > see what is going on here...
> > > >
> > > > The following dmesg line usually indicates that RCU's kernel thread is
> > > > stuck for some reason.
> > > >
> > > > [42039.188293] rcu_sched kthread starved for 2152 jiffies! g70249 c70248 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x1
> > > >
> > > > The RCU_GP_WAIT_FQS indicates that it is waiting for a timer to expire.
> > > > The task is in TASK_INTERRUPTIBLE state, but it should have been awakened
> > > > after only a few jiffies sleep time. The stack trace looks like there
> > > > might be trouble canceling the timer.
> > >
> > > Can you offer any advice on tracking down what it's waiting for, or
> > > what's going on with the timer? Is any arch- or clockevent-specific
> > > stuff that might be my fault involved here?
> >
> > It is in an swait_event_interruptible_timeout(), most likely waiting on
> > a one-jiffy timer, as in __mod_timer(). If I understood your earlier
> > email correctly, setting CONFIG_HZ_PERIODIC avoided this problem.
> > If my understanding is correct, it might be worth reviewing any timer
> > code that might be pushed along by the once-per-jiffy scheduler tick.
>
> Is it possible that something wacky is going on with broadcast timers?
> A dummy broadcast timer device gets initialized momentarily before the
> percpu timer for the secondary cpu is initialized, and in my
> experience these broadcast timers don't work when the real timer
> device has a cpumask limiting it to a specific cpu (which seems like a
> bug, but one I haven't had to care about since I'm not using it).

I need to defer to the timer guys on this one, adding them on CC.

> > Does the problem reproduce easily?
>
> Yes, it happens right after boot and repeats every 30-90 seconds or
> so.

Well, that at least makes it easier to test any patches!

> > A bisection might be very helpful.
>
> Bisection would require some manual work to set up because the whole
> reason I was rebasing on Linus's tree was to adapt the drivers to
> upstream infrastructure changes (the new cpuhp stuff replacing
> notifier for cpu starting). The unfortunate way it was done, each
> driver adds an enum to linux/cpuhotplug.h, so all the patches have
> gratuitous conflicts. In addition, for older revisions in Linus's
> tree, there's at least one show-stopping (hang during boot) bug that
> needs a cherry-pick to fix. There may be other small issues too. I
> don't think they're at all insurmountable, but it requires an annoying
> amount of scripting.

I had to ask! Might eventually be necessary, but let's see what we
can learn from what you currently have.

> > > > Is it possible that the timer
> > > > handler is being prevented from making progress due to heavy interrupt
> > > > load or some such?
> > >
> > > Actually the stall messages are only appearing when the system is idle
> > > and interrupts are infrequent. That's what leads me to believe it
> > > might be a situation where, on every (infrequent) wake, both the rcu
> > > task and something blocking its progress happen at roughly the same
> > > time, and the slowness of the cpu leads the rcu task to decide to wait
> > > for the next timer/schedule event and try again -- over and over. Of
> > > course this hypothesis is just based on a very high-level
> > > understanding of what's going on; I'm not sufficiently familiar with
> > > the implementation specifics to offer more.
> > >
> > > One observation I just made versus old kernels where this was not
> > > happening: I've got a lot more timer interrupts happening than before.
> > > In 6 minutes of uptime:
> > >
> > >            CPU0       CPU1
> > >  17:         38          0   AIC  17  abce0000.ethernet
> > >  18:        811          0   AIC  18  uartlite
> > >  20:        188        121   AIC  20  ipi
> > >  72:      11295       6499   AIC  72  jcore_pit
> > >
> > > That's 30+ timer interrupts per sec on cpu0. Previously it was more
> > > like 1-5 per second when idle (with dyntick).
> >
> > That is interesting, but I don't know what to make of it. RCU does depend
> > heavily on timed sleeps to make forward progress, so if timer interrupts
> > were lost, that would be a smoking gun. But you actually appear to have
> > -more- timer interrupts in the failure case rather than fewer!
>
> Could the extra interrupts be a symptom of the rcu task doing a
> very-short sleep to retry over and over?

Possibly.
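[Editor's note: a quick arithmetic check, not part of the original message. The jcore_pit counts quoted above do work out to the stated rate; given a saved copy of that /proc/interrupts line and roughly 6 minutes (~360 s) of uptime, the per-CPU rate can be computed like this:]

```shell
# Estimate per-second interrupt rates from a saved /proc/interrupts line.
# The sample line is the jcore_pit row quoted above; 6 min ~= 360 s.
uptime_secs=360
printf ' 72:      11295       6499   AIC  72  jcore_pit\n' |
awk -v up="$uptime_secs" '{ printf "%s: %.1f/s cpu0, %.1f/s cpu1\n", $NF, $2/up, $3/up }'
```

[That comes to about 31 interrupts per second on CPU0 and 18 on CPU1, matching the "30+ timer interrupts per sec on cpu0" observation.]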
One way to check for this is to set the kernel boot variables
rcutree.jiffies_till_sched_qs=10 and rcutree.jiffies_till_first_fqs=10.

> > > > > My .config and message log are below. Let me know if there are any
> > > > > other details I can provide that would be helpful.
> > > >
> > > > In case the timer issue I am suspecting is a red herring...
> > > >
> > > > RCU does tune itself for different values of HZ, but it might need more
> > > > help for 50MHz CPUs. (Although if this were the problem, I would expect
> > > > you to be seeing excessive CPU consumption by the rcu_sched kthread.)
> > > > But it might be worthwhile booting with rcutree.jiffies_till_sched_qs=10
> > > > and rcutree.jiffies_till_first_fqs=10 (or thereabouts). It might also
> > > > be necessary to use a longer RCU CPU stall warning timeout (assuming
> > > > you don't just suppress it as above), for example, by booting with (say)
> > > > rcupdate.rcu_cpu_stall_timeout=60.
> > >
> > > With that in mind, it seems plausible that the problem existed before
> > > and just went over the threshold for reporting the stall with a
> > > non-bug change going into 4.8. Would building an older known-good
> > > kernel with TREE_RCU_TRACE enabled and checking debugfs
> > > rcu/rcu_sched/rcugp for max=??? be informative?
> >
> > It might be. However, if ftrace is enabled on your platform, getting
> > a trace of the RCU-related events would be even more likely to be
> > helpful. And for that matter, the timer-related events.
>
> It looks like ftrace is supported on sh. Looks like I have some
> reading to do -- this could be very useful.

Sounds good!

> > One way to determine whether we were just on the edge on
> > the older systems would be to boot them with smaller values of
> > rcupdate.rcu_cpu_stall_timeout=60. The default values of the other two
> > boot parameters look to already be at the minimum useful value, though.
>
> CONFIG_RCU_CPU_STALL_TIMEOUT=21 seems to be what it's hitting.

Confused.
If you set CONFIG_RCU_CPU_STALL_TIMEOUT to smaller values on older
kernels, do you see the RCU CPU stall warnings?

							Thanx, Paul
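[Editor's note, appended for context: the stall report and the configured timeout discussed above are at least self-consistent. At HZ=100, as stated earlier in the thread, the "starved for 2152 jiffies" message corresponds to roughly 21.5 seconds, just past CONFIG_RCU_CPU_STALL_TIMEOUT=21:]

```shell
# Convert the reported jiffies count to seconds at HZ=100;
# 21.52 s is just over the 21 s stall-warning timeout.
awk 'BEGIN { hz = 100; jiffies = 2152; printf "%.2f s\n", jiffies / hz }'
```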
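[Editor's note: for readers wanting to try the suggestions above, the RCU boot parameters mentioned in this thread collect into a single command-line fragment; how it gets appended to the kernel command line depends on the platform's bootloader:]

```shell
# Editorial sketch: the RCU boot parameters suggested in the thread,
# assembled into one kernel command-line fragment.
rcu_params="rcutree.jiffies_till_sched_qs=10 rcutree.jiffies_till_first_fqs=10"
rcu_params="$rcu_params rcupdate.rcu_cpu_stall_timeout=60"
echo "$rcu_params"
```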