From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S967283AbcA1Kwl (ORCPT );
	Thu, 28 Jan 2016 05:52:41 -0500
Received: from mail-pf0-f194.google.com ([209.85.192.194]:36055 "EHLO
	mail-pf0-f194.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S967077AbcA1Kw3 (ORCPT );
	Thu, 28 Jan 2016 05:52:29 -0500
Date: Thu, 28 Jan 2016 19:53:42 +0900
From: Sergey Senozhatsky
To: Sergey Senozhatsky
Cc: Byungchul Park , akpm@linux-foundation.org, mingo@kernel.org,
	linux-kernel@vger.kernel.org, akinobu.mita@gmail.com, jack@suse.cz,
	torvalds@linux-foundation.org, peter@hurleysoftware.com,
	sergey.senozhatsky@gmail.com
Subject: Re: [PATCH v4] lib/spinlock_debug.c: prevent a recursive cycle in the debug code
Message-ID: <20160128105342.GB610@swordfish>
References: <1453896061-14102-1-git-send-email-byungchul.park@lge.com>
	<20160128014253.GC1538@X58A-UD3R>
	<20160128023750.GB1834@swordfish>
	<000301d15985$7f416690$7dc433b0$@lge.com>
	<20160128060530.GC1834@swordfish>
	<20160128081313.GB31266@X58A-UD3R>
	<20160128104137.GA610@swordfish>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20160128104137.GA610@swordfish>
User-Agent: Mutt/1.5.24 (2015-08-30)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

two small corrections.

On (01/28/16 19:41), Sergey Senozhatsky wrote:
[..]
> > Unfortunately, it's not reproduced anymore.
> >
> > If it's clearly a spinlock caller's bug as you said, modifying the
> > spinlock debug code does not help it at all. But I found there's a
> > possibility in the debug code *itself* to cause a lockup. So I tried
> > to fix it. What do you think about it?
>
> ah... silly me... you mean the first CPU that triggers the spin_dump() will
                                                          ^^^
	this, of course, is true for console_sem->lock and logbuf_lock only.
> deadlock itself, so the rest of the CPUs will see endless recursive
> spin_lock()->spin_dump()->spin_lock()->spin_dump() calls?
>
> like the one below?
>
> 	CPUZ is doing vprintk_emit()->spin_lock(), CPUA is the spin_lock's owner
>
> 	CPUZ
> 	-> vprintk_emit()
> 	   __spin_lock_debug()
> 		for (i = 0; i < `loops_per_jiffy * HZ'; i++) {	<< wait for the lock
> 			if (arch_spin_trylock())
> 				return;
> 			__delay(1);
> 		}
> 		spin_dump()	<< lock is still owned by CPUA
> 		{
> 		-> vprintk_emit()
> 		   __spin_lock_debug()
> 			for (...) {
> 				if (arch_spin_trylock())
> 					return;
> 				__delay(1);
> 			}
-			<< CPUA unlocked the lock
> 			spin_dump()
> 			{
> 			-> vprintk_emit()
> 			   __spin_lock_debug()

the "<< CPUA unlocked the lock" line better be here, to make it correct:

+				<< CPUA unlocked the lock
> 				for (...) {
> 					if (arch_spin_trylock())	<< success!!
> 						/* CPUZ now owns the lock */
> 						return;
> 				}
> 			}
>
> 	<< we return here with the spin_lock being owned by this CPUZ
>
> 	trigger_all_cpu_backtrace()
>
> 	<< and... now it does the arch_spin_lock()
> 	/*
> 	 * The trylock above was causing a livelock. Give the lower level arch
> 	 * specific lock code a chance to acquire the lock. We have already
> 	 * printed a warning/backtrace at this point. The non-debug arch
> 	 * specific code might actually succeed in acquiring the lock. If it is
> 	 * not successful, the end-result is the same - there is no forward
> 	 * progress.
> 	 */
> 	arch_spin_lock(&lock->raw_lock);
>
> 	<< which obviously deadlocks this CPU...
> 	}
>
> 	trigger_all_cpu_backtrace()
>
> 	arch_spin_lock()
>
>
> so
> "the CPUZ is now keeping the lock forever, and not going to release it"
> and
> "CPUA-CPUX will do vprintk_emit()->spin_lock()->spin_dump()->vprintk_emit()->..."
>
> My apologies for not getting it right the first time. Sorry!
>
> Can you please update your bug description in the commit message?
> It's the deadlock that is causing the recursion on other CPUs in the
> first place.

	-ss