From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1758604Ab2IMQ7z (ORCPT ); Thu, 13 Sep 2012 12:59:55 -0400
Received: from e36.co.us.ibm.com ([32.97.110.154]:50890 "EHLO
	e36.co.us.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1755167Ab2IMQ7v (ORCPT );
	Thu, 13 Sep 2012 12:59:51 -0400
Date: Thu, 13 Sep 2012 09:58:44 -0700
From: "Paul E. McKenney"
To: John Stultz
Cc: Linus Walleij, Daniel Lezcano, linux-kernel@vger.kernel.org
Subject: Re: RCU lockup in the SMP idle thread, help...
Message-ID: <20120913165844.GW4257@linux.vnet.ibm.com>
Reply-To: paulmck@linux.vnet.ibm.com
References: <50520E8A.9030408@linaro.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <50520E8A.9030408@linaro.org>
User-Agent: Mutt/1.5.21 (2010-09-15)
X-Content-Scanned: Fidelis XPS MAILER
x-cbid: 12091316-7606-0000-0000-000003A1472B
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Sep 13, 2012 at 09:49:14AM -0700, John Stultz wrote:
> On 09/13/2012 05:36 AM, Linus Walleij wrote:
> >Hi Paul et al,
> >
> >I have this sporadic lockup in the SMP idle thread on ARM U8500:
> >
> >root@ME:/
> >root@ME:/
> >root@ME:/ INFO: rcu_preempt detected stalls on CPUs/tasks: { 0}
> >(detected by 1, t=23190 jiffies)
> >[] (unwind_backtrace+0x0/0xf8) from []
> >(rcu_check_callbacks+0x69c/0x6e0)
> >[] (rcu_check_callbacks+0x69c/0x6e0) from []
> >(update_process_times+0x38/0x4c)
> >[] (update_process_times+0x38/0x4c) from []
> >(tick_sched_timer+0x80/0xe4)
> >[] (tick_sched_timer+0x80/0xe4) from []
> >(__run_hrtimer.isra.18+0x44/0xd0)
> >[] (__run_hrtimer.isra.18+0x44/0xd0) from []
> >(hrtimer_interrupt+0x118/0x2b4)
> >[] (hrtimer_interrupt+0x118/0x2b4) from []
> >(twd_handler+0x30/0x44)
> >[] (twd_handler+0x30/0x44) from []
> >(handle_percpu_devid_irq+0x80/0xa0)
> >[] (handle_percpu_devid_irq+0x80/0xa0) from []
> >(generic_handle_irq+0x2c/0x40)
> >[] (generic_handle_irq+0x2c/0x40) from []
> >(handle_IRQ+0x4c/0xac)
> >[] (handle_IRQ+0x4c/0xac) from [] (gic_handle_irq+0x24/0x58)
> >[] (gic_handle_irq+0x24/0x58) from [] (__irq_svc+0x40/0x70)
> >Exception stack(0xcf851f88 to 0xcf851fd0)
> >1f80: 00000020 c05d5920 00000001 00000000 cf850000 cf850000
> >1fa0: c05f4d48 c02de0b4 c05d8d90 412fc091 cf850000 00000000 01000000 cf851fd0
> >1fc0: c000f234 c000f238 60000013 ffffffff
> >[] (__irq_svc+0x40/0x70) from [] (default_idle+0x28/0x30)
> >[] (default_idle+0x28/0x30) from [] (cpu_idle+0x98/0xe4)
> >[] (cpu_idle+0x98/0xe4) from [<002d2ef4>] (0x2d2ef4)
> >
> >The hangup has been there in the v3.6-rc series for a while (probably
> >since the merge window).
> >
> >I haven't been able to bisect out why this is happening, because the bug
> >is pretty hazardous to check - you have to boot the system and leave it alone
> >or use it sporadically for a while. Then all of a sudden it happens.
> >
> >So: reproducible, but not deterministically reproducible (I hate this kind
> >of thing...)
> >
> >The code involved seems to be generic kernel code apart from the
> >ARM GIC and TWD timer drivers.
> >
> >Any hints or debug options I should switch on?
>
> I saw this once as well testing the fix to Daniel's deep idle hang
> issue (also on 32 bit).
>
> Really briefly looking at the code in rcutree.c, I'm curious if
> we're hitting a false positive on the 5 minute jiffies overflow?

Hmmm...  Might be.  Does the patch below help?

							Thanx, Paul

------------------------------------------------------------------------

rcu: Avoid spurious RCU CPU stall warnings

If a given CPU avoids the idle loop but also avoids starting a new
RCU grace period for a full minute, RCU can issue spurious RCU CPU stall
warnings.  This commit fixes this issue by adding a check for ongoing
grace period to avoid these spurious stall warnings.

Reported-by: Becky Bruce
Signed-off-by: Paul E. McKenney
Signed-off-by: Paul E. McKenney
Reviewed-by: Josh Triplett

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 3d63d1c..aea3157 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -819,7 +819,8 @@ static void check_cpu_stall(struct rcu_state *rsp, struct rcu_data *rdp)
 	j = ACCESS_ONCE(jiffies);
 	js = ACCESS_ONCE(rsp->jiffies_stall);
 	rnp = rdp->mynode;
-	if ((ACCESS_ONCE(rnp->qsmask) & rdp->grpmask) && ULONG_CMP_GE(j, js)) {
+	if (rcu_gp_in_progress(rsp) &&
+	    (ACCESS_ONCE(rnp->qsmask) & rdp->grpmask) && ULONG_CMP_GE(j, js)) {

 		/* We haven't checked in, so go dump stack. */
 		print_cpu_stall(rsp);
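
For readers following the thread, the effect of the extra rcu_gp_in_progress() test can be sketched in user space. This is only a rough model, not kernel code: struct stall_model, U32_CMP_GE(), stall_check_old(), stall_check_new() and stall_model_selfcheck() are hypothetical names invented for illustration, the counter is pinned to 32 bits to mimic a 32-bit jiffies, and the macro copies the shape of the kernel's ULONG_CMP_GE(). It assumes the stall deadline is left over from an earlier grace period and that a stale qsmask bit is still set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Wrap-safe "a is at or after b" on a 32-bit counter.  Same shape as the
 * kernel's ULONG_CMP_GE(), but fixed at 32 bits (an assumption made so the
 * model behaves the same on 64-bit hosts).
 */
#define U32_CMP_GE(a, b) \
	(UINT32_MAX / 2 >= (uint32_t)((uint32_t)(a) - (uint32_t)(b)))

/* Hypothetical stand-in for the few fields the check reads. */
struct stall_model {
	bool gp_in_progress;    /* models rcu_gp_in_progress(rsp) */
	uint32_t qsmask;        /* models rnp->qsmask */
	uint32_t grpmask;       /* models rdp->grpmask */
	uint32_t jiffies_stall; /* models rsp->jiffies_stall */
};

/* Condition before the patch: only the mask and the deadline are tested. */
bool stall_check_old(const struct stall_model *m, uint32_t j)
{
	return (m->qsmask & m->grpmask) && U32_CMP_GE(j, m->jiffies_stall);
}

/* Condition after the patch: no grace period means no stall report. */
bool stall_check_new(const struct stall_model *m, uint32_t j)
{
	return m->gp_in_progress && stall_check_old(m, j);
}

/* Small self-check exercising both the wrap handling and the new guard. */
void stall_model_selfcheck(void)
{
	/* Stale deadline, no grace period running, mask bit still set. */
	struct stall_model idle = { false, 0x1, 0x1, 1000u };

	/* The comparison survives the counter wrapping past zero. */
	assert(U32_CMP_GE(5u, 0xFFFFFFF0u));
	assert(!U32_CMP_GE(0xFFFFFFF0u, 5u));

	/* Old test fires on the stale deadline; new test stays quiet. */
	assert(stall_check_old(&idle, 70000u));
	assert(!stall_check_new(&idle, 70000u));
}
```

In this model the old condition reports a stall purely because the deadline is in the past, while the patched condition first requires a grace period to be in progress. Whether that matches the U8500 lockup reported above is exactly what the patch is meant to test.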