From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S967283AbcA1Kwl (ORCPT );
	Thu, 28 Jan 2016 05:52:41 -0500
Received: from mail-pf0-f194.google.com ([209.85.192.194]:36055 "EHLO
	mail-pf0-f194.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S967077AbcA1Kw3 (ORCPT );
	Thu, 28 Jan 2016 05:52:29 -0500
Date: Thu, 28 Jan 2016 19:53:42 +0900
From: Sergey Senozhatsky
To: Sergey Senozhatsky
Cc: Byungchul Park , akpm@linux-foundation.org, mingo@kernel.org,
	linux-kernel@vger.kernel.org, akinobu.mita@gmail.com, jack@suse.cz,
	torvalds@linux-foundation.org, peter@hurleysoftware.com,
	sergey.senozhatsky@gmail.com
Subject: Re: [PATCH v4] lib/spinlock_debug.c: prevent a recursive cycle in the debug code
Message-ID: <20160128105342.GB610@swordfish>
References: <1453896061-14102-1-git-send-email-byungchul.park@lge.com>
	<20160128014253.GC1538@X58A-UD3R>
	<20160128023750.GB1834@swordfish>
	<000301d15985$7f416690$7dc433b0$@lge.com>
	<20160128060530.GC1834@swordfish>
	<20160128081313.GB31266@X58A-UD3R>
	<20160128104137.GA610@swordfish>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20160128104137.GA610@swordfish>
User-Agent: Mutt/1.5.24 (2015-08-30)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

two small corrections.

On (01/28/16 19:41), Sergey Senozhatsky wrote:
[..]
> > Unfortunately, it's not reproduced anymore.
> >
> > If it's clearly a spinlock caller's bug as you said, modifying the
> > spinlock debug code does not help it at all. But I found there's a
> > possibility in the debug code *itself* to cause a lockup. So I tried
> > to fix it. What do you think about it?
>
> ah... silly me... you mean the first CPU that triggers the spin_dump() will
                                                          ^^^
	this, of course, is true for console_sem->lock and logbuf_lock only.
> deadlock itself, so the rest of the CPUs will see endless recursive
> spin_lock()->spin_dump()->spin_lock()->spin_dump() calls?
>
> like the one below?
>
> 	CPUZ is doing vprintk_emit()->spin_lock(), CPUA is the spin_lock's owner
>
> 	CPUZ
> 	-> vprintk_emit()
> 	   __spin_lock_debug()
> 		for (i = 0; i < `loops_per_jiffy * HZ'; i++) {	<< wait for the lock
> 			if (arch_spin_trylock())
> 				return;
> 			__delay(1);
> 		}
> 		spin_dump()	<< lock is still owned by CPUA
> 		{
> 		-> vprintk_emit()
> 		   __spin_lock_debug()
> 			for (...) {
> 				if (arch_spin_trylock())
> 					return;
> 				__delay(1);
> 			}
-			<< CPUA unlocked the lock
> 			spin_dump()
> 			{
> 			-> vprintk_emit()
> 			   __spin_lock_debug()

the "<< CPUA unlocked the lock" line better be here, to make it correct:

+				<< CPUA unlocked the lock
> 				for (...) {
> 					if (arch_spin_trylock())	<< success!!
> 						/* CPUZ now owns the lock */
> 						return;
> 				}
> 			}
>
> 	<< we return here with the spin_lock being owned by this CPUZ
>
> 	trigger_all_cpu_backtrace()
>
> 	<< and... now it does the arch_spin_lock()
> 	/*
> 	 * The trylock above was causing a livelock. Give the lower level arch
> 	 * specific lock code a chance to acquire the lock. We have already
> 	 * printed a warning/backtrace at this point. The non-debug arch
> 	 * specific code might actually succeed in acquiring the lock. If it is
> 	 * not successful, the end-result is the same - there is no forward
> 	 * progress.
> 	 */
> 	arch_spin_lock(&lock->raw_lock);
>
> 	<< which obviously deadlocks this CPU...
> 	}
>
> 	trigger_all_cpu_backtrace()
>
> 	arch_spin_lock()
>
>
> so
> "the CPUZ is now keeping the lock forever, and not going to release it"
> and
> "CPUA-CPUX will do vprintk_emit()->spin_lock()->spin_dump()->vprintk_emit()->..."
>
> My apologies for not getting it right the first time. Sorry!
>
> Can you please update your bug description in the commit message?
> It's the deadlock that is causing the recursion on other CPUs in the
> first place.

	-ss