From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
 id S932146AbcLABT5 (ORCPT ); Wed, 30 Nov 2016 20:19:57 -0500
Received: from mx0a-001b2d01.pphosted.com ([148.163.156.1]:55297
 "EHLO mx0a-001b2d01.pphosted.com" rhost-flags-OK-OK-OK-OK)
 by vger.kernel.org with ESMTP id S1751150AbcLABTz (ORCPT );
 Wed, 30 Nov 2016 20:19:55 -0500
Date: Wed, 30 Nov 2016 17:19:50 -0800
From: "Paul E. McKenney"
To: Guenter Roeck
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton ,
 sparclinux@vger.kernel.org, davem@davemloft.net
Subject: Re: next: Commit 'mm: Prevent __alloc_pages_nodemask() RCU CPU stall
 ...' causing hang on sparc32 qemu
Reply-To: paulmck@linux.vnet.ibm.com
References: <20161129212308.GA12447@roeck-us.net>
 <20161130012817.GH3924@linux.vnet.ibm.com>
 <20161130070212.GM3924@linux.vnet.ibm.com>
 <929f6b29-461a-6e94-fcfd-710c3da789e9@roeck-us.net>
 <20161130120333.GQ3924@linux.vnet.ibm.com>
 <20161130192159.GB22216@roeck-us.net>
 <20161130210152.GL3924@linux.vnet.ibm.com>
 <20161130231846.GB17244@roeck-us.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20161130231846.GB17244@roeck-us.net>
User-Agent: Mutt/1.5.21 (2010-09-15)
X-TM-AS-GCONF: 00
X-Content-Scanned: Fidelis XPS MAILER
x-cbid: 16120101-0008-0000-0000-00000637E9FB
X-IBM-SpamModules-Scores:
X-IBM-SpamModules-Versions: BY=3.00006171; HX=3.00000240; KW=3.00000007;
 PH=3.00000004; SC=3.00000193; SDB=6.00787599; UDB=6.00380999;
 IPR=6.00565280; BA=6.00004933; NDR=6.00000001; ZLA=6.00000005;
 ZF=6.00000009; ZB=6.00000000; ZP=6.00000000; ZH=6.00000000;
 ZU=6.00000002; MB=3.00013496; XFM=3.00000011; UTC=2016-12-01 01:19:52
X-IBM-AV-DETECTION: SAVI=unused REMOTE=unused XFE=unused
x-cbparentid: 16120101-0009-0000-0000-00003D790431
Message-Id: <20161201011950.GX3924@linux.vnet.ibm.com>
X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10432:,,
 definitions=2016-11-30_20:,, signatures=0
X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0
 spamscore=0 suspectscore=0 malwarescore=0 phishscore=0 adultscore=0
 bulkscore=0 classifier=spam adjust=0 reason=mlx scancount=1
 engine=8.0.1-1609300000 definitions=main-1612010022
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Nov 30, 2016 at 03:18:46PM -0800, Guenter Roeck wrote:
> On Wed, Nov 30, 2016 at 01:01:52PM -0800, Paul E. McKenney wrote:
> > On Wed, Nov 30, 2016 at 11:21:59AM -0800, Guenter Roeck wrote:
> > > On Wed, Nov 30, 2016 at 04:03:33AM -0800, Paul E. McKenney wrote:
> > > > On Wed, Nov 30, 2016 at 02:52:11AM -0800, Guenter Roeck wrote:
> > > > > On 11/29/2016 11:02 PM, Paul E. McKenney wrote:
> > > > > >On Tue, Nov 29, 2016 at 08:32:51PM -0800, Guenter Roeck wrote:
> > > > > >>On 11/29/2016 05:28 PM, Paul E. McKenney wrote:
> > > > > >>>On Tue, Nov 29, 2016 at 01:23:08PM -0800, Guenter Roeck wrote:
> > > > > >>>>Hi Paul,
> > > > > >>>>
> > > > > >>>>most of my qemu tests for sparc32 targets started to fail in next-20161129.
> > > > > >>>>The problem is only seen in SMP builds; non-SMP builds are fine.
> > > > > >>>>Bisect points to commit 2d66cccd73436 ("mm: Prevent __alloc_pages_nodemask()
> > > > > >>>>RCU CPU stall warnings"); reverting that commit fixes the problem.
> > > >
> > > > And I have dropped this patch. Michal Hocko showed me the error of
> > > > my ways with this patch.
> > > >
> > >
> > > :-)
> > >
> > > On another note, I still get RCU tracebacks in the s390 tests.
> > >
> > > BUG: sleeping function called from invalid context at mm/page_alloc.c:3775
> > >
> > > That is caused by 'rcu: Maintain special bits at bottom of ->dynticks counter';
> > > if I recall correctly we had discussed that earlier.
> >
> > Indeed, I had missed a dyntick counter update back on Nov 11, which meant
> > that some of the code was still looking at the low-order bit instead of
> > the next bit up. This is now fixed.
> >
> > So to get to the error message you call out above, I need to have improperly
> > left the system in bh state or left irqs disabled, while the system was
> > running normally without an oops. I am having a hard time seeing how this
> > patch can do that.
> >
> > I would be more suspicious of f2a471ffc8a8 ("rcu: Allow boot-time use
> > of cond_resched_rcu_qs()").
> >
> > So you bisected or did a revert to work out which was the offending commit?
> >
>
> My most recent bisect was with the November 10 image, so that would have missed
> any later fix. Comparing the log messages, the current message is indeed
> different. Sorry, I mixed that up; I just assumed that the problem would be
> the same without really checking. My bad.
>
> Bisect would be tricky, since the s390 image was broken for some time after
> November 10. The first time I have seen the above BUG: was with next-20161128
> (which is the first build after the crash was fixed). That version did not
> include f2a471ffc8a8, so that can not be the cause.
>
> I'll try to set up a bisect tonight, working around the crash problem.
> I'll let you know how it goes.

Whew! You had me going for a bit there. ;-)

Thanx, Paul