From: Dave Chinner <david@fromorbit.com>
To: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: linux-kernel@vger.kernel.org
Subject: trigger_all_cpu_backtrace() has no generic implementation (was Re: [regression, 3.1, rcu] rcu_sched_state detected stall on CPU 8 (t=15000 jiffies))
Date: Sat, 6 Aug 2011 10:20:56 +1000
Message-ID: <20110806002056.GC3162@dastard>
In-Reply-To: <20110805112429.GM13065@linux.vnet.ibm.com>

On Fri, Aug 05, 2011 at 04:24:29AM -0700, Paul E. McKenney wrote:
> On Fri, Aug 05, 2011 at 06:48:39PM +1000, Dave Chinner wrote:
> > On Thu, Aug 04, 2011 at 11:41:19PM -0700, Paul E. McKenney wrote:
> > > On Fri, Aug 05, 2011 at 10:33:16AM +1000, Dave Chinner wrote:
> > > > On Tue, Aug 02, 2011 at 11:30:50PM -0700, Paul E. McKenney wrote:
> > > > > On Wed, Aug 03, 2011 at 12:52:22PM +1000, Dave Chinner wrote:
> > > > > > On Wed, Aug 03, 2011 at 12:28:57PM +1000, Dave Chinner wrote:
> > > > > > > Hi Paul,
> > > > > > >
> > > > > > > I've had this hang a couple of times now, so I figured it isn't an
> > > > > > > isolated event. I am getting kernels occasionally hanging with the
> > > > > > > following output occurring:
> > > > > > >
> > > > > > > [ 62.812011] INFO: rcu_sched_state detected stall on CPU 8 (t=15000 jiffies)
> > > > > > > [ 242.936009] INFO: rcu_sched_state detected stall on CPU 8 (t=60031 jiffies)
> > > >
> > > > ....
> > > >
> > > > > > This might be a false alarm - I've just diagnosed(*) that a kernel
> > > > > > thread was stuck in a hard loop therefore not giving up the CPU.
> > > > >
> > > > > Ah, that is indeed one of the conditions that RCU CPU stall warnings
> > > > > can catch.
> > > > >
> > > > > > Perhaps this error message could be more informative?
> > > > > > The detector is acting like the hung task detector, except it's
> > > > > > working on kernel code stuck in a loop burning CPU, so maybe dumping
> > > > > > a stack trace of the spinning CPU (i.e. similar to sysrq-l output)
> > > > > > might be a useful addition to tracking down such stalls?
> > > > >
> > > > > Strange. There is a trigger_all_cpu_backtrace() call that is supposed
> > > > > to dump all CPUs' stacks. It has been working in the past, but you are
> > > > > the second person in a couple of weeks to report that it isn't doing
> > > > > its job. (Though the other one was running the -rt tree.)
> > > >
> > > > Ok, so it is supposed to be dumping the stack. Good.
> > >
> > > Yep!
....
> > > > I'm running on x86_64 (inside a KVM VM) so it should be present.
> > >
> > > Indeed it should! Is NMI delivery busted or something?
> >
> > Not that I know of.
....
> > > Just out of curiosity, what are you thinking of doing in the code to
> > > figure out that trigger_all_cpu_backtrace() didn't work and that it
> > > was time to fall back on sysrq-l-style processing?
> >
> > - trigger_all_cpu_backtrace();
> > + if (!trigger_all_cpu_backtrace()) {
> > + pr_err("trigger_all_cpu_backtrace returned false!");
>
> So a good defensive-programming patch would be something like the
> following, right?
>
> if (!trigger_all_cpu_backtrace()) {
> pr_err("Falling back to dump_stack()");
> dump_stack();
> }
>
> > And I see:
> >
> > [ 90.808037] INFO: rcu_sched_state detected stall on CPU 8 (t=15000 jiffies)
> > [ 90.808037] trigger_all_cpu_backtrace returned false
> >
> > Which indicates that arch_trigger_all_cpu_backtrace is indeed not
> > defined, and that sysrq-l is using the fallback.
> >
> > My .config is below if you want to look into it further.
>
> My ability to do so is quite limited, but you do have
> CONFIG_X86_LOCAL_APIC=y. I took a quick look but don't immediately
> see why you would not be getting the NMI variant.

It would appear to be because include/linux/nmi.h does not do
#include <asm/nmi.h> where arch_trigger_all_cpu_backtrace is
defined.

I think this is the case from looking at the build deps for
rcutree.h: I see include/linux/nmi.h but no asm/nmi.h. It appears
that asm/nmi.h is only included if this config ifdef is true:

#if defined(ARCH_HAS_NMI_WATCHDOG) || defined(CONFIG_HARDLOCKUP_DETECTOR)
#include <asm/nmi.h>

And my .config has:

$ grep CONFIG_HARDLOCKUP_DETECTOR .config
# CONFIG_HARDLOCKUP_DETECTOR is not set
$

And ARCH_HAS_NMI_WATCHDOG is only defined for these archs:

$ grep ARCH_HAS_NMI_WATCHDOG -r arch
arch/blackfin/include/asm/irq.h:# define ARCH_HAS_NMI_WATCHDOG
arch/mn10300/include/asm/reset-regs.h:#define ARCH_HAS_NMI_WATCHDOG /* See include/linux/nmi.h */
arch/sparc/include/asm/irq_64.h:#define ARCH_HAS_NMI_WATCHDOG
$

So on x86, unless you configure CONFIG_HARDLOCKUP_DETECTOR into the
kernel, trigger_all_cpu_backtrace() will simply fail (return false).
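
To make the failure mode concrete, here is a minimal user-space sketch of
the pattern as I understand it. It is not the actual contents of
include/linux/nmi.h. DEMO_INCLUDE_ARCH_HEADER is a made-up macro standing
in for the "#if defined(ARCH_HAS_NMI_WATCHDOG) ||
defined(CONFIG_HARDLOCKUP_DETECTOR)" condition above, and the arch function
body is a placeholder. The assumption is that the generic wrapper keys off
whether the arch symbol is visible at preprocessing time:

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * Stand-in for the conditional "#include <asm/nmi.h>".
 * DEMO_INCLUDE_ARCH_HEADER is hypothetical; it is deliberately left
 * undefined here to model a config without the hardlockup detector.
 */
#ifdef DEMO_INCLUDE_ARCH_HEADER
static void arch_trigger_all_cpu_backtrace(void)
{
	puts("arch: asking every CPU to dump its stack");
}
/* Self-referencing define so that generic code can test for the hook. */
#define arch_trigger_all_cpu_backtrace arch_trigger_all_cpu_backtrace
#endif

#ifdef arch_trigger_all_cpu_backtrace
static inline bool trigger_all_cpu_backtrace(void)
{
	arch_trigger_all_cpu_backtrace();
	return true;
}
#else
/* The arch header was never seen, so this silently does nothing. */
static inline bool trigger_all_cpu_backtrace(void)
{
	return false;
}
#endif
```

Building this as-is gives the stub that returns false. Building it with
-DDEMO_INCLUDE_ARCH_HEADER selects the variant that calls the arch hook
and returns true. Nothing at the call site tells you which one you got.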

And I can't just convert this to always include asm/nmi.h, because:

$ find arch -name nmi.h
arch/blackfin/include/asm/nmi.h
arch/mips/include/asm/sn/nmi.h
arch/mn10300/include/asm/nmi.h
arch/s390/include/asm/nmi.h
arch/sparc/include/asm/nmi.h
arch/x86/include/asm/nmi.h
$

Only a few platforms actually define asm/nmi.h. The whole platform
specific trigger_all_cpu_backtrace() stuff is just broken - it's not
just platform specific, it's platform and config specific, without
having any indication that it is config specific....

Arch specific code is supposed to have a generic implementation for
platforms that don't implement it. It's pretty clear from the
sysrq-l code that a generic implementation is possible, so I'm
wondering why trigger_all_cpu_backtrace doesn't own that generic
fallback code.....
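
As a rough illustration of what "owning the fallback" could look like,
here is a user-space model. It is not a proposed patch. Every demo_* name
is hypothetical, the CPU count is fixed at four, and the per-CPU dump is
a printf placeholder; a real kernel version would presumably reuse the
cross-CPU call path that sysrq-l falls back on rather than a plain loop:

```c
#include <stdbool.h>
#include <stdio.h>

#define DEMO_NR_CPUS 4			/* hypothetical fixed CPU count */

static int demo_dumped[DEMO_NR_CPUS];	/* how often each CPU was dumped */

/* Placeholder for whatever would print one CPU's registers and stack. */
static void demo_dump_cpu_stack(int cpu)
{
	printf("CPU%d: <stack dump would go here>\n", cpu);
	demo_dumped[cpu]++;
}

/*
 * Optional arch hook. NULL models an arch/config combination that
 * provides no NMI-based implementation. Returns true on success.
 */
static bool (*demo_arch_backtrace_hook)(void);

/* The generic entry point owns the fallback, so callers never need one. */
static void demo_trigger_all_cpu_backtrace(void)
{
	int cpu;

	if (demo_arch_backtrace_hook && demo_arch_backtrace_hook())
		return;

	/* Generic fallback: visit every CPU and dump its stack. */
	for (cpu = 0; cpu < DEMO_NR_CPUS; cpu++)
		demo_dump_cpu_stack(cpu);
}
```

With that shape, the RCU stall detector and sysrq-l could both call the
one function and always get some output, whether or not the arch hook
exists.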

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com