From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1753692AbdJNMvg (ORCPT ); Sat, 14 Oct 2017 08:51:36 -0400
Received: from mx0a-001b2d01.pphosted.com ([148.163.156.1]:47738 "EHLO
	mx0a-001b2d01.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1753530AbdJNMvc (ORCPT ); Sat, 14 Oct 2017 08:51:32 -0400
Date: Sat, 14 Oct 2017 05:51:16 -0700
From: "Paul E. McKenney"
To: Wang YanQing
Cc: linux-kernel@vger.kernel.org
Subject: Re: Bug report for RCU stalled warning [3.10.69]
Reply-To: paulmck@linux.vnet.ibm.com
References: <20171011042139.GA5038@udknight>
	<20171012203824.GK3521@linux.vnet.ibm.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20171012203824.GK3521@linux.vnet.ibm.com>
User-Agent: Mutt/1.5.21 (2010-09-15)
X-TM-AS-GCONF: 00
x-cbid: 17101412-0024-0000-0000-000002E2CD48
X-IBM-SpamModules-Scores:
X-IBM-SpamModules-Versions: BY=3.00007896; HX=3.00000241; KW=3.00000007;
	PH=3.00000004; SC=3.00000236; SDB=6.00931026; UDB=6.00468694;
	IPR=6.00711275; BA=6.00005639; NDR=6.00000001; ZLA=6.00000005;
	ZF=6.00000009; ZB=6.00000000; ZP=6.00000000; ZH=6.00000000;
	ZU=6.00000002; MB=3.00017541; XFM=3.00000015; UTC=2017-10-14 12:51:29
X-IBM-AV-DETECTION: SAVI=unused REMOTE=unused XFE=unused
x-cbparentid: 17101412-0025-0000-0000-000045BA55C6
Message-Id: <20171014125116.GA8791@linux.vnet.ibm.com>
X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10432:,,
	definitions=2017-10-14_01:,, signatures=0
X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0
	spamscore=0 suspectscore=0 malwarescore=0 phishscore=0 adultscore=0
	bulkscore=0 classifier=spam adjust=0 reason=mlx scancount=1
	engine=8.0.1-1707230000 definitions=main-1710140182
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Oct 12, 2017 at 01:38:24PM -0700, Paul E.
McKenney wrote:
> [ Adding LKML on CC so that others can find this. ]
>
> On Wed, Oct 11, 2017 at 12:21:39PM +0800, Wang YanQing wrote:
> > Hi, Paul McKenney.
> >
> > I have received many machine-stopped-respone reports, after reboot and
> > inspect message, all of them show RCU stalled, but I can't figure out
> > how to fix it. I can't update the kernel, it is the painful point, so I
> > need to fix it in 3.10. I have attached four messages come from different
> > cpu and broads(so I guess it is a BUG instead of hardware fault), any
> > suggestion is welcome.
>
> The first step is of course to report this to your distro, as they are
> the ones who do the care and feeding of such old kernels. Please include
> the information below in that report, as it might help your distro find
> and fix the problem.
>
> It looks like the stalled CPU is idle, and that the activity resulting
> from the stall-warning message gets things going again. Callbacks are
> being processed, so no OOM. But you are getting the splat every 60
> seconds. The system has only two CPUs, and is x86.
>
> If you cannot upgrade the kernel, my ability to help is limited. And the
> diagnostics printed with the v3.10 CPU stall warnings are also quite
> limited. However, there are some things you could try as workarounds:
>
> 	1.	Check to make sure that the rcu_sched kthread is getting
> 		the CPU time that it needs. Preventing this kthread from
> 		running would create exactly this output, assuming that
> 		the stall warning got it going again temporarily.
>
> 	2.	It looks like the disturbance of the RCU CPU stall warning
> 		is getting things going again. Try artificially providing
> 		this disturbance, for example, by running a usermode program
> 		or script that runs on each CPU in turn, then sleeps for
> 		(say) five seconds.
>
> 	3.	If you can reconfigure your kernel, try building with
> 		CONFIG_RCU_FAST_NO_HZ=n.

And if you can reconfigure your kernel, then in v3.10, building with
CONFIG_RCU_CPU_STALL_INFO=y and CONFIG_RCU_CPU_STALL_VERBOSE=y will
provide more information on the CPUs and tasks stalling the grace period.

							Thanx, Paul

> 	4.	Was the system running reliably on some earlier version?
> 		If so, consider reverting back to that version, and include
> 		the version information in your report to your distro. If
> 		your distro provides individual patches, you should consider
> 		bisecting so as to locate the offending patch.
>
> Good luck with it!
>
> 							Thanx, Paul
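
For reference, the "runs on each CPU in turn, then sleeps" workaround from item 2 of the quoted message might look roughly like the sketch below. This is only an illustration and was not part of the original thread: the function name poke_each_cpu and its parameters are made up here, and it assumes Linux with Python 3.3 or later so that os.sched_getaffinity() and os.sched_setaffinity() are available. Whether this disturbance actually suppresses the stalls on a given system is untested.

```python
import os
import time


def poke_each_cpu(cpus, rounds=1, pause_s=5.0):
    """Run briefly on each CPU in `cpus`, then sleep; repeat `rounds` times.

    Hypothetical helper for illustration only. Returns the CPUs visited,
    in order, so that the caller can check what was done.
    """
    pid = 0  # 0 means "the calling process"
    original = os.sched_getaffinity(pid)
    visited = []
    try:
        for _ in range(rounds):
            for cpu in sorted(cpus):
                # Restrict this process to a single CPU so that it has
                # to run there, at least momentarily.
                os.sched_setaffinity(pid, {cpu})
                visited.append(cpu)
            # Sleep between passes (the message suggests about 5 seconds).
            time.sleep(pause_s)
    finally:
        # Restore the affinity mask we started with.
        os.sched_setaffinity(pid, original)
    return visited
```

To keep it running, one would call poke_each_cpu(os.sched_getaffinity(0), rounds=N) with a large N, or wrap the call in a loop, from a process that is allowed to run on every CPU of interest.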