Date: Mon, 27 Nov 2017 13:49:18 +0100
From: Martin Schwidefsky
To: Will Deacon
Cc: Peter Zijlstra, Sebastian Ott, Ingo Molnar, Heiko Carstens,
    linux-kernel@vger.kernel.org
Subject: Re: [bisected] system hang after boot
In-Reply-To: <20171127114947.GA30679@arm.com>
References: <20171122182659.GA22648@arm.com> <20171122202217.GO3326@worktop>
    <20171127114947.GA30679@arm.com>
Message-Id: <20171127134918.4da71a78@mschwideX1>

On Mon, 27 Nov 2017 11:49:48 +0000
Will Deacon wrote:

> Hi Peter,
>
> On Wed, Nov 22, 2017 at 09:22:17PM +0100, Peter Zijlstra wrote:
> > On Wed, Nov 22, 2017 at 06:26:59PM +0000, Will Deacon wrote:
> >
> > > Now, I can't see what the break_lock is doing here other than causing
> > > problems. Is there a good reason for it, or can you just try removing it
> > > altogether? Patch below.
> >
> > The main use is spin_is_contended(), which in turn ends up used in
> > __cond_resched_lock() through spin_needbreak().
> >
> > This allows better lock wait times for PREEMPT kernels on platforms
> > where the lock implementation itself cannot provide 'contended' state.
> >
> > In that capacity the write-write race shouldn't be a problem though.
>
> I'm not sure why it isn't a problem: given that the break_lock variable
> can read as 1 for a lock that is no longer contended and 0 for a lock that
> is currently contended, then the __cond_resched_lock is likely to see a
> value of 0 (i.e. spin_needbreak always returns false) more often than not
> since it's checked by the lock holder.

Grepping for 'break_lock' shows that the two locking blueprints are the
only places where the field is written to. Unless I am blind, the
associated unlock functions do *not* reset 'break_lock'.

Without the raw_##op##_can_lock(lock) check the first of the blueprints
now looks like this:

void __lockfunc __raw_##op##_lock(locktype##_t *lock)			\
{									\
	for (;;) {							\
		preempt_disable();					\
		if (likely(do_raw_##op##_trylock(lock)))		\
			break;						\
		preempt_enable();					\
									\
		if (!(lock)->break_lock)				\
			(lock)->break_lock = 1;				\
		while ((lock)->break_lock)				\
			arch_##op##_relax(&lock->raw_lock);		\
	}								\
	(lock)->break_lock = 0;						\
}									\

All it takes to create an endless loop is two CPUs: the first has
acquired the lock and the second tries to get it. After the unsuccessful
trylock of the second CPU, the first CPU releases the lock and never
tries to take it again. The second CPU then sets 'break_lock' to 1 and
waits for it to become 0 again, but only a CPU that successfully takes
the lock clears it, so the second CPU is stuck in an endless loop.

I guess my best course of action is to remove GENERIC_LOCKBREAK from
arch/s390/Kconfig to avoid this construct altogether. Let us see what
breaks if I do that ..

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.
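
For illustration, the two-CPU interleaving described above can be replayed
in a small single-threaded user-space sketch. This is not kernel code: all
names (toy_lock, toy_trylock, toy_unlock, toy_replay, max_spins) are made up
for this example, and the inner wait loop is capped so the program
terminates instead of hanging.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel lock type; not the real spinlock_t. */
struct toy_lock {
	bool locked;      /* models the raw lock word */
	int  break_lock;  /* models (lock)->break_lock */
};

/* Outcome of replaying the interleaving. */
struct toy_result {
	bool acquired;    /* did CPU 2 ever get the lock? */
	bool lock_free;   /* is the lock free at the end? */
	int  break_lock;  /* final value of break_lock */
	long spins;       /* iterations of the inner wait loop */
};

static bool toy_trylock(struct toy_lock *l)
{
	if (l->locked)
		return false;
	l->locked = true;
	return true;
}

/* Assumption taken from the mail: unlock does not touch break_lock. */
static void toy_unlock(struct toy_lock *l)
{
	l->locked = false;
}

/* Replay: CPU 1 locks, CPU 2 fails trylock, CPU 1 unlocks, CPU 2 waits. */
static struct toy_result toy_replay(long max_spins)
{
	struct toy_lock l = { false, 0 };
	struct toy_result r = { false, false, 0, 0 };

	assert(max_spins > 0);

	bool cpu1_got = toy_trylock(&l);   /* CPU 1 takes the lock */
	(void)cpu1_got;

	bool cpu2_got = toy_trylock(&l);   /* CPU 2: trylock fails */
	toy_unlock(&l);                    /* CPU 1 releases, never returns */

	if (cpu2_got) {
		r.acquired = true;
	} else {
		if (!l.break_lock)
			l.break_lock = 1;
		/* Nobody is left to clear break_lock; cap the wait for the demo. */
		while (l.break_lock && r.spins < max_spins)
			r.spins++;
		/* Retry only if the wait ended because break_lock was cleared. */
		if (!l.break_lock)
			r.acquired = toy_trylock(&l);
	}

	r.lock_free = !l.locked;
	r.break_lock = l.break_lock;
	return r;
}
```

Calling toy_replay() with any positive cap should report that the lock is
free, break_lock is still 1, and the waiter used up its whole spin budget
without acquiring the lock, which is the hang in miniature.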