Date: Mon, 11 Aug 2014 16:42:19 -0700
From: "Paul E. McKenney"
To: Anton Blanchard
Subject: Re: [PATCH 2/2] powerpc: Add ppc64 hard lockup detector support
Message-ID: <20140811234219.GH5821@linux.vnet.ibm.com>
Reply-To: paulmck@linux.vnet.ibm.com
References: <20140805145500.773004e9@kryten> <20140805145621.2fa2a372@kryten> <20140812093137.404e8930@kryten>
In-Reply-To: <20140812093137.404e8930@kryten>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: mikey@neuling.org, paulus@samba.org, linuxppc-dev@lists.ozlabs.org
List-Id: Linux on PowerPC Developers Mail List

On Tue, Aug 12, 2014 at 09:31:37AM +1000, Anton Blanchard wrote:
> The hard lockup detector uses a PMU event as a periodic NMI to
> detect if we are stuck (where stuck means no timer interrupts have
> occurred).
>
> Ben's rework of the ppc64 soft disable code has made ppc64 PMU
> exceptions a partial NMI. They can get disabled if an external interrupt
> comes in, but otherwise PMU interrupts will fire in interrupt disabled
> regions.
>
> I wrote a kernel module to test this patch and noticed we sometimes
> missed hard lockup warnings. The RCU code detected the stall first and
> issued an IPI to backtrace all CPUs. Unfortunately an IPI is an external
> interrupt and that will hard disable interrupts, preventing the hard
> lockup detector from going off.

If it helps, commit bc1dce514e9b (rcu: Don't use NMIs to dump other
CPUs' stacks) makes RCU avoid this behavior.  With that commit applied,
the CPU that detects the stall reads the other CPUs' stacks out remotely
instead.  It is in -tip, and should make mainline this merge window.

Corresponding patch below.

							Thanx, Paul

------------------------------------------------------------------------

rcu: Don't use NMIs to dump other CPUs' stacks

Although NMI-based stack dumps are in principle more accurate, they
are also more likely to trigger deadlocks.  This commit therefore
replaces all uses of trigger_all_cpu_backtrace() with
rcu_dump_cpu_stacks(), so that the CPU detecting an RCU CPU stall
does the stack dumping.

Signed-off-by: Paul E. McKenney
Reviewed-by: Lai Jiangshan

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 3f93033d3c61..8f3e4d43d736 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1013,10 +1013,7 @@ static void record_gp_stall_check_time(struct rcu_state *rsp)
 }
 
 /*
- * Dump stacks of all tasks running on stalled CPUs.  This is a fallback
- * for architectures that do not implement trigger_all_cpu_backtrace().
- * The NMI-triggered stack traces are more accurate because they are
- * printed by the target CPU.
+ * Dump stacks of all tasks running on stalled CPUs.
  */
 static void rcu_dump_cpu_stacks(struct rcu_state *rsp)
 {
@@ -1094,7 +1091,7 @@ static void print_other_cpu_stall(struct rcu_state *rsp)
 	       (long)rsp->gpnum, (long)rsp->completed, totqlen);
 	if (ndetected == 0)
 		pr_err("INFO: Stall ended before state dump start\n");
-	else if (!trigger_all_cpu_backtrace())
+	else
 		rcu_dump_cpu_stacks(rsp);
 
 	/* Complain about tasks blocking the grace period. */
@@ -1125,8 +1122,7 @@ static void print_cpu_stall(struct rcu_state *rsp)
 	pr_cont(" (t=%lu jiffies g=%ld c=%ld q=%lu)\n",
 		jiffies - rsp->gp_start, (long)rsp->gpnum,
 		(long)rsp->completed, totqlen);
-	if (!trigger_all_cpu_backtrace())
-		dump_stack();
+	rcu_dump_cpu_stacks(rsp);
 
 	raw_spin_lock_irqsave(&rnp->lock, flags);
 	if (ULONG_CMP_GE(jiffies, ACCESS_ONCE(rsp->jiffies_stall)))
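
For anyone reading along without the source handy, the "periodic NMI"
scheme Anton describes is the generic one in kernel/watchdog.c: the
watchdog hrtimer bumps a per-CPU counter from an ordinary timer
interrupt, and the perf-driven NMI declares a hard lockup if that
counter has not moved since the previous NMI.  The sketch below is a
simplification for illustration, not the verbatim kernel code; the
names mirror kernel/watchdog.c but the bodies are reduced to the core
check.

/*
 * Simplified sketch of the generic hard lockup check (illustrative,
 * not the exact kernel/watchdog.c implementation).
 */
static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts);
static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts_saved);

/* Runs from the watchdog hrtimer, i.e. from a normal timer interrupt. */
static void watchdog_interrupt_count(void)
{
	__this_cpu_inc(hrtimer_interrupts);
}

/*
 * Runs from the PMU overflow NMI.  If no timer interrupts have been
 * counted since the last NMI, this CPU has been stuck with timer
 * interrupts blocked: report a hard lockup.
 */
static bool is_hardlockup(void)
{
	unsigned long hrint = __this_cpu_read(hrtimer_interrupts);

	if (__this_cpu_read(hrtimer_interrupts_saved) == hrint)
		return true;

	__this_cpu_write(hrtimer_interrupts_saved, hrint);
	return false;
}

On ppc64 the interesting part is whether that NMI can still fire while
interrupts are soft-disabled, which is exactly the partial-NMI property
Anton relies on and which the backtrace IPI defeats by hard-disabling
interrupts on the target CPU.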
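
For reference, the rcu_dump_cpu_stacks() that the patch funnels both
stall paths through never touches the stalled CPUs: the CPU that
detected the stall walks the leaf rcu_node structures and prints the
stack of every CPU still blocking the grace period via dump_cpu_task().
The helpers named here (rcu_for_each_leaf_node(), dump_cpu_task(),
rnp->qsmask) are the real kernel ones, but the body below is a rough
reconstruction for illustration rather than a copy of the function.

/*
 * Sketch of remote stack dumping for stalled CPUs: the detecting CPU
 * does all the printing itself, so no IPI or NMI is sent to the
 * stalled CPUs and nothing hard-disables interrupts out from under
 * the hard lockup detector on those CPUs.
 */
static void rcu_dump_cpu_stacks(struct rcu_state *rsp)
{
	int cpu;
	unsigned long flags;
	struct rcu_node *rnp;

	rcu_for_each_leaf_node(rsp, rnp) {
		raw_spin_lock_irqsave(&rnp->lock, flags);
		if (rnp->qsmask != 0) {
			/* Each set qsmask bit is a CPU still blocking the GP. */
			for (cpu = 0; cpu <= rnp->grphi - rnp->grplo; cpu++)
				if (rnp->qsmask & (1UL << cpu))
					dump_cpu_task(rnp->grplo + cpu);
		}
		raw_spin_unlock_irqrestore(&rnp->lock, flags);
	}
}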