From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1760065Ab2IFV7b (ORCPT ); Thu, 6 Sep 2012 17:59:31 -0400
Received: from e35.co.us.ibm.com ([32.97.110.153]:33728 "EHLO
        e35.co.us.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1759933Ab2IFV73 (ORCPT );
        Thu, 6 Sep 2012 17:59:29 -0400
Date: Thu, 6 Sep 2012 14:58:38 -0700
From: "Paul E. McKenney"
To: Steven Rostedt
Cc: Peter Zijlstra, linux-kernel@vger.kernel.org, mingo@elte.hu,
        laijs@cn.fujitsu.com, dipankar@in.ibm.com,
        akpm@linux-foundation.org, mathieu.desnoyers@polymtl.ca,
        josh@joshtriplett.org, niv@us.ibm.com, tglx@linutronix.de,
        Valdis.Kletnieks@vt.edu, dhowells@redhat.com,
        eric.dumazet@gmail.com, darren@dvhart.com, fweisbec@gmail.com,
        sbw@mit.edu, patches@linaro.org, "Paul E. McKenney"
Subject: Re: [PATCH tip/core/rcu 11/15] rcu: Avoid spurious RCU CPU stall warnings
Message-ID: <20120906215838.GM2448@linux.vnet.ibm.com>
Reply-To: paulmck@linux.vnet.ibm.com
References: <20120830185607.GA32148@linux.vnet.ibm.com>
        <1346352988-32444-1-git-send-email-paulmck@linux.vnet.ibm.com>
        <1346352988-32444-11-git-send-email-paulmck@linux.vnet.ibm.com>
        <1346943414.18408.31.camel@twins>
        <1346944049.1680.23.camel@gandalf.local.home>
        <1346944758.18408.35.camel@twins>
        <20120906210354.GC2448@linux.vnet.ibm.com>
        <1346967661.1680.52.camel@gandalf.local.home>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <1346967661.1680.52.camel@gandalf.local.home>
User-Agent: Mutt/1.5.21 (2010-09-15)
X-Content-Scanned: Fidelis XPS MAILER
x-cbid: 12090621-6148-0000-0000-0000095CEF17
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Sep 06, 2012 at 05:41:01PM -0400, Steven Rostedt wrote:
> On Thu, 2012-09-06 at 14:03 -0700, Paul E. McKenney wrote:
>
> > Here are a few other ways that stalls can happen:
> >
> > o A CPU looping in an RCU read-side critical section.
>
> For a minute? That's a bug.
>
> >
> > o A CPU looping with interrupts disabled. This condition can
> >   result in RCU-sched and RCU-bh stalls.
>
> Also a bug.
>
> >
> > o A CPU looping with preemption disabled. This condition can
> >   result in RCU-sched stalls and, if ksoftirqd is in use, RCU-bh
> >   stalls.
>
> Bug as well.
>
> >
> > o A CPU looping with bottom halves disabled. This condition can
> >   result in RCU-sched and RCU-bh stalls.
>
> Bug too.
>
> >
> > o For !CONFIG_PREEMPT kernels, a CPU looping anywhere in the kernel
> >   without invoking schedule().
>
> Another bug.
>
> >
> > o A CPU-bound real-time task in a CONFIG_PREEMPT kernel, which might
> >   happen to preempt a low-priority task in the middle of an RCU
> >   read-side critical section. This is especially damaging if
> >   that low-priority task is not permitted to run on any other CPU,
> >   in which case the next RCU grace period can never complete, which
> >   will eventually cause the system to run out of memory and hang.
> >   While the system is in the process of running itself out of
> >   memory, you might see stall-warning messages.
>
> Buggy system.
>
> >
> > o A CPU-bound real-time task in a CONFIG_PREEMPT_RT kernel that
> >   is running at a higher priority than the RCU softirq threads.
> >   This will prevent RCU callbacks from ever being invoked,
> >   and in a CONFIG_TREE_PREEMPT_RCU kernel will further prevent
> >   RCU grace periods from ever completing. Either way, the
> >   system will eventually run out of memory and hang. In the
> >   CONFIG_TREE_PREEMPT_RCU case, you might see stall-warning
> >   messages.
>
> Not really a bug, but the developers need a spanking.

And RCU does what it can, which is limited to a splat on the console.

> > o A hardware or software issue shuts off the scheduler-clock
> >   interrupt on a CPU that is not in dyntick-idle mode. This
> >   problem really has happened, and seems to be most likely to
> >   result in RCU CPU stall warnings for CONFIG_NO_HZ=n kernels.
>
> Driving the bug.
>
> >
> > o A bug in the RCU implementation.
>
> Bug in the name.
>
> >
> > o A hardware failure. This is quite unlikely, but has occurred
> >   at least once in real life. A CPU failed in a running system,
> >   becoming unresponsive, but not causing an immediate crash.
> >   This resulted in a series of RCU CPU stall warnings, eventually
> >   leading to the realization that the CPU had failed.
>
> Hardware bug.
>
> So, where's the "spurious RCU CPU stall warnings"?

I figured that would count as a bug in the RCU implementation. ;-)

> All these cases deserve a warning.

Agreed, and that is the whole purpose of the stall warnings.

                                                        Thanx, Paul
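
The stall conditions listed in this message share one shape: a grace period cannot end until every CPU has passed through a quiescent state, and a warning is printed when one stays open too long. The following is a minimal userspace sketch of that idea only, not the kernel's implementation. The `GracePeriod` class, its method names, and the 60-second `STALL_TIMEOUT` (taken loosely from the "For a minute?" remark above) are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical timeout in seconds; an assumption for this sketch,
# not a value read from any kernel configuration.
STALL_TIMEOUT = 60.0


@dataclass
class GracePeriod:
    """Toy model of one grace period (not a kernel data structure)."""
    start: float   # time at which the grace period began
    pending: set   # CPUs that have not yet reported a quiescent state

    def note_quiescent_state(self, cpu):
        # A CPU that leaves its read-side critical section, re-enables
        # preemption, or calls schedule() would end up here.
        self.pending.discard(cpu)

    def completed(self):
        # The grace period ends only when no CPU is still pending.
        return not self.pending

    def check_stall(self, now, timeout=STALL_TIMEOUT):
        """Return the sorted CPUs blocking the grace period, or []."""
        if self.completed() or now - self.start < timeout:
            return []
        return sorted(self.pending)


if __name__ == "__main__":
    gp = GracePeriod(start=0.0, pending={0, 1, 2})
    gp.note_quiescent_state(0)
    gp.note_quiescent_state(2)
    # CPU 1 is "looping with preemption disabled" and never reports.
    print(gp.check_stall(now=30.0))   # [] (timeout not yet reached)
    print(gp.check_stall(now=61.0))   # [1] (CPU 1 is blamed)
```

In this model, each cause in the list corresponds to some CPU never calling `note_quiescent_state()`, which is why all of them end in the same warning regardless of where the underlying bug lives.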