From: Dave Chinner <david@fromorbit.com>
To: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: linux-kernel@vger.kernel.org
Subject: Re: [regression, 3.1, rcu] rcu_sched_state detected stall on CPU 8 (t=15000 jiffies)
Date: Fri, 5 Aug 2011 10:33:16 +1000
Message-ID: <20110805003316.GA3162@dastard>
In-Reply-To: <20110803063049.GD13065@linux.vnet.ibm.com>

On Tue, Aug 02, 2011 at 11:30:50PM -0700, Paul E. McKenney wrote:
> On Wed, Aug 03, 2011 at 12:52:22PM +1000, Dave Chinner wrote:
> > On Wed, Aug 03, 2011 at 12:28:57PM +1000, Dave Chinner wrote:
> > > Hi Paul,
> > > 
> > > I've had this hang a couple of times now, so I figured it isn't an
> > > isolated event. I am getting kernels occasionally hanging with the
> > > following output occurring:
> > > 
> > > [   62.812011] INFO: rcu_sched_state detected stall on CPU 8 (t=15000 jiffies)
> > > [  242.936009] INFO: rcu_sched_state detected stall on CPU 8 (t=60031 jiffies)

....

> > This might be a false alarm - I've just diagnosed(*) that a kernel
> > thread was stuck in a hard loop therefore not giving up the CPU.
> 
> Ah, that is indeed one of the conditions that RCU CPU stall warnings
> can catch.
> 
> > Perhaps this error message could be more informative?
> > The detector is acting like the hung task detector, except it's
> > working on kernel code stuck in a loop burning CPU, so maybe dumping
> > a stack trace of the spinning CPU (i.e. similar to sysrq-l output)
> > might be a useful addition to tracking down such stalls?
> 
> Strange.  There is a trigger_all_cpu_backtrace() call that is supposed
> to dump all CPUs' stacks.  It has been working in the past, but you are
> the second person in a couple of weeks to report that it isn't doing
> its job.  (Though the other one was running the -rt tree.)

Ok, so it is supposed to be dumping the stack. Good.

> 
> Wait a minute...  Here is the definition:
> 
> 	#ifdef arch_trigger_all_cpu_backtrace
> 	static inline bool trigger_all_cpu_backtrace(void)
> 	{
> 		arch_trigger_all_cpu_backtrace();
> 
> 		return true;
> 	}
> 	#else
> 	static inline bool trigger_all_cpu_backtrace(void)
> 	{
> 		return false;
> 	}
> 	#endif
> 
> Passing a lower-case symbol to #ifdef is a bit of a red flag.  Where
> is it defined?
> 
> o	arch/sparc/include/asm/irq_64.h:
> 
> 	#define arch_trigger_all_cpu_backtrace arch_trigger_all_cpu_backtrace
> 
> o	arch/sparc/kernel/process_64.c:
> 
> 	void arch_trigger_all_cpu_backtrace(void)
> 	{
> 		...
> 	}
> 
> o	arch/x86/include/asm/nmi.h:
> 
> 	#define arch_trigger_all_cpu_backtrace arch_trigger_all_cpu_backtrace
> 
> o	arch/x86/kernel/apic/hw_nmi.c:
> 
> 	void arch_trigger_all_cpu_backtrace(void)
> 	{
> 		...
> 	}
> 
> So I am guessing that you are running some architecture other than
> x86 or SPARC.  And the implementation is a bit hostile on other
> architectures.  So I suggest adding a dump_stack() before the
> "return false" in trigger_all_cpu_backtrace(), as in the patch
> shown below.

I'm running on x86_64 (inside a KVM VM) so it should be present.

Hmmm - I note that sysrq-l has a fallback implementation that uses
smp_call_function() should trigger_all_cpu_backtrace() return false.
I'd bet that's why sysrq-l is working and the rcu stall detection
isn't. i.e. arch_trigger_all_cpu_backtrace() is either broken or for
some reason not compiled in. I can't tell why - I get lost in all
the different ways that arch specific code is inlined by
preprocessor magic...
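
A minimal userspace sketch of the idiom in question, not kernel code. The
self-referential #define exists only so that #ifdef can see a function
name; if the header carrying it is not included before the generic wrapper
is compiled, the stub that returns false is selected without any warning.
DEMO_INCLUDE_ARCH_HEADER and DEMO_HAVE_ARCH_BACKTRACE are hypothetical
names introduced for illustration only.

```c
#include <stdbool.h>

/*
 * DEMO_INCLUDE_ARCH_HEADER is a hypothetical switch standing in for
 * "the arch header that carries the #define was included first".
 */
#ifdef DEMO_INCLUDE_ARCH_HEADER
static void arch_trigger_all_cpu_backtrace(void)
{
	/* an arch would send an NMI to every CPU here */
}
/* Expands to itself; its only job is to make #ifdef below succeed. */
#define arch_trigger_all_cpu_backtrace arch_trigger_all_cpu_backtrace
#endif

#ifdef arch_trigger_all_cpu_backtrace
static inline bool trigger_all_cpu_backtrace(void)
{
	arch_trigger_all_cpu_backtrace();
	return true;
}
#define DEMO_HAVE_ARCH_BACKTRACE 1	/* illustration only */
#else
static inline bool trigger_all_cpu_backtrace(void)
{
	return false;	/* silently selected when the #define is not visible */
}
#define DEMO_HAVE_ARCH_BACKTRACE 0	/* illustration only */
#endif
```

Built as is, the stub is chosen; building with -DDEMO_INCLUDE_ARCH_HEADER
selects the real wrapper. Nothing in the source of the caller changes
between the two cases, which is what makes this hard to spot.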

> But this is still strange.  I -know- I have seen stack dumps for
> all CPUs when running on Power...  But the code has not changed
> for quite some time.
> 
> Nevertheless, could you please try out the patch below?  It should
> get you at least the stack dump for the current CPU, which in your
> case was the offending CPU.

I'll give it a go, though perhaps using the same fallback as sysrq-l
might be a better idea?
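
A rough userspace sketch of what such a fallback could look like in a
stall-reporting path, under stated assumptions: demo_report_stall(),
demo_for_each_cpu(), demo_show_stack() and DEMO_NR_CPUS are hypothetical
stand-ins, and the loop only imitates an smp_call_function()-style
broadcast. It is not the sysrq-l code.

```c
#include <stdbool.h>
#include <stdio.h>

#define DEMO_NR_CPUS 4	/* hypothetical CPU count for the demo */

static int demo_dumped[DEMO_NR_CPUS];

/* Stub matching the #else branch quoted above: no arch support. */
static bool trigger_all_cpu_backtrace(void)
{
	return false;
}

/* Stand-in for a per-CPU callback that would print that CPU's stack. */
static void demo_show_stack(int cpu)
{
	demo_dumped[cpu]++;
	printf("CPU %d: <stack dump would go here>\n", cpu);
}

/* Stand-in for a cross-CPU broadcast; here just a loop. */
static void demo_for_each_cpu(void (*fn)(int))
{
	for (int cpu = 0; cpu < DEMO_NR_CPUS; cpu++)
		fn(cpu);
}

/* Returns true if the arch path was used, false if the fallback ran. */
static bool demo_report_stall(void)
{
	if (trigger_all_cpu_backtrace())
		return true;

	/* The arch hook is unavailable, so ask each CPU individually. */
	demo_for_each_cpu(demo_show_stack);
	return false;
}
```

One caveat with this approach in a real kernel: a broadcast delivered by
ordinary interrupts cannot reach a CPU that is spinning with interrupts
disabled, so it would not cover every kind of stall.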

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

Thread overview: 11+ messages
2011-08-03  2:28 [regression, 3.1, rcu] rcu_sched_state detected stall on CPU 8 (t=15000 jiffies) Dave Chinner
2011-08-03  2:52 ` Dave Chinner
2011-08-03  6:30   ` Paul E. McKenney
2011-08-05  0:33     ` Dave Chinner [this message]
2011-08-05  6:41       ` Paul E. McKenney
2011-08-05  8:48         ` Dave Chinner
2011-08-05 11:24           ` Paul E. McKenney
2011-08-06  0:20             ` trigger_all_cpu_backtrace() has no generic implementation (was Re: [regression, 3.1, rcu] rcu_sched_state detected stall on CPU 8 (t=15000 jiffies)) Dave Chinner
2011-08-08 18:33               ` Paul E. McKenney
2011-08-22 15:47               ` Don Zickus
2011-08-23 16:42               ` Don Zickus
