From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S933481AbcK3K5F (ORCPT ); Wed, 30 Nov 2016 05:57:05 -0500
Received: from mx2.suse.de ([195.135.220.15]:52289 "EHLO mx2.suse.de"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1752786AbcK3K44 (ORCPT ); Wed, 30 Nov 2016 05:56:56 -0500
Date: Wed, 30 Nov 2016 11:56:53 +0100
From: Petr Mladek
To: linyongting@huawei.com
Cc: kejinling@huawei.com, akpm@linux-foundation.org,
	sergey.senozhatsky@gmail.com, bp@suse.de, tj@kernel.org,
	treding@nvidia.com, linux-kernel@vger.kernel.org,
	leisure.wang@huawei.com, Peter Zijlstra
Subject: Re: [PATCH] printk: Fix spinlock deadlock in printk reenty
Message-ID: <20161130105653.GF24060@pathway.suse.cz>
References: <1480490119-63559-1-git-send-email-linyongting@huawei.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <1480490119-63559-1-git-send-email-linyongting@huawei.com>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed 2016-11-30 15:15:19, linyongting@huawei.com wrote:
> From: Jinling Ke
>
> When an Oops happens in printk, printk calls zap_locks() to reinitialize
> the spinlock and prevent a deadlock. On arm, arm64, x86 and other SMP
> architectures, this races with other CPUs on the printk spinlock
> logbuf_lock, and the CPUs that are waiting for that spinlock end up
> deadlocked (in raw_spin_lock()). Because those CPUs are deadlocked,
> you can see this error in the printk log:
>
> "SMP: failed to stop secondary CPUs"
>
> On arm, arm64, x86 and other architectures, the spinlock variable
> is divided into two parts, for example 'owner' and 'next' on arm.
> To take the spinlock, a CPU atomically increments the 'next' part,
> keeps the old value as its ticket, and waits until that ticket equals
> 'owner'. At this point the ticket ('next') is a local variable, while
> the 'owner' value is read from the global variable logbuf_lock.
> However, raw_spin_lock_init(&logbuf_lock) sets both the 'owner' and
> 'next' parts to zero. As a result, the waiting CPU deadlocks in
> raw_spin_lock() (the while loop in arch_spin_lock()).
>
> The arm spinlock structure:
>
> typedef struct {
> 	union {
> 		u32 slock;
> 		struct __raw_tickets {
> 			u16 owner;
> 			u16 next;
> 		} tickets;
> 	};
> } arch_spinlock_t;
>
> static inline void arch_spin_lock(arch_spinlock_t *lock)
> {
> 	...
> 	<--- At this moment another CPU calls zap_locks()->spin_lock_init(),
> 	<--- which sets the 'owner' part to zero, but lockval.tickets.next
> 	<--- is a local variable
> 	while (lockval.tickets.next != lockval.tickets.owner) {
> 		lockval.tickets.owner = ACCESS_ONCE(lock->tickets.owner);
> 	}
> 	...
> }
>
> The solution is to replace raw_spin_lock_init(&logbuf_lock) with
> raw_spin_unlock(&logbuf_lock) in zap_locks(), so that the spinlock
> is left unlocked.
>
> Signed-off-by: Jinling Ke
> ---
>  kernel/printk/printk.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index f7a55e9..05b1886 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -1603,7 +1603,7 @@ static void zap_locks(void)
>
>  	debug_locks_off();
>  	/* If a crash is occurring, make sure we can't deadlock */
> -	raw_spin_lock_init(&logbuf_lock);
> +	raw_spin_unlock(&logbuf_lock);

But what if the lock was not locked in the first place?

A solution might be to use

	if (raw_spin_is_locked(&logbuf_lock))
		raw_spin_unlock(&logbuf_lock);

But this would fail if the lock looks locked because it was unlocked
twice, or if the next waiter in line is blocked for some reason.

The idea behind the current code is a best-effort attempt to print the
Oops message. That means letting the process that calls zap_locks()
take the printk lock. For this, raw_spin_lock_init() looks like the
best solution.

Note that we are going to remove zap_locks() completely.
See https://lkml.kernel.org/r/20161027154933.1211-7-sergey.senozhatsky@gmail.com

Another solution would be to make printk() ignore locks when an Oops
is in progress. Peter Zijlstra suggested this somewhere. However, it
might also cause problems when other CPUs are still running and
printing.

Best Regards,
Petr
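The ticket-lock arithmetic behind this thread can be checked with a small model. What follows is a minimal, hypothetical sketch in plain C, not kernel code. struct ticket_lock, take_ticket(), zap_init(), zap_unlock(), zap_unlock_if_locked(), oopsing_cpu_gets_lock() and init_lets_two_cpus_in() are all invented for illustration. The model treats the lock as two 16-bit counters, like the 'owner' and 'next' fields of the arm arch_spinlock_t. It runs single-threaded and ignores atomicity, memory ordering and big-endian field order. The three zap functions stand in for raw_spin_lock_init(), an unconditional raw_spin_unlock(), and the raw_spin_is_locked() check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical single-threaded model of an arm-style ticket lock. */
struct ticket_lock {
	uint16_t owner;	/* ticket currently allowed into the critical section */
	uint16_t next;	/* ticket handed to the next caller */
};

/* A caller takes a ticket, then would spin until owner == ticket. */
uint16_t take_ticket(struct ticket_lock *l)
{
	return l->next++;
}

bool is_served(const struct ticket_lock *l, uint16_t ticket)
{
	return l->owner == ticket;
}

void ticket_unlock(struct ticket_lock *l)
{
	l->owner++;
}

bool looks_locked(const struct ticket_lock *l)
{
	return l->owner != l->next;
}

/* Stand-in for the current code: raw_spin_lock_init(). */
void zap_init(struct ticket_lock *l)
{
	l->owner = 0;
	l->next = 0;
}

/* Stand-in for the proposed patch: unconditional raw_spin_unlock(). */
void zap_unlock(struct ticket_lock *l)
{
	ticket_unlock(l);
}

/* Stand-in for raw_spin_is_locked() followed by raw_spin_unlock(). */
void zap_unlock_if_locked(struct ticket_lock *l)
{
	if (looks_locked(l))
		ticket_unlock(l);
}

/*
 * Apply a zap strategy to a copy of the lock state, then let the
 * oopsing CPU take a ticket. Returns true if that ticket is served
 * immediately, without any other CPU having to unlock.
 */
bool oopsing_cpu_gets_lock(struct ticket_lock state,
			   void (*zap)(struct ticket_lock *))
{
	uint16_t ticket;

	zap(&state);
	ticket = take_ticket(&state);
	return is_served(&state, ticket);
}

/*
 * Downside of zap_init(): a CPU that was already spinning keeps its old
 * ticket, which no longer matches the reset counters. Returns true if
 * that stale ticket and a freshly issued ticket are served together.
 */
bool init_lets_two_cpus_in(void)
{
	struct ticket_lock l = { .owner = 0, .next = 0 };
	uint16_t crashed = take_ticket(&l);	/* ticket 0, never released */
	uint16_t waiter = take_ticket(&l);	/* ticket 1, still spinning */
	uint16_t oopser, late;

	(void)crashed;
	zap_init(&l);				/* owner = next = 0 */
	oopser = take_ticket(&l);		/* ticket 0 is handed out again */
	assert(is_served(&l, oopser));
	ticket_unlock(&l);			/* owner = 1 */
	late = take_ticket(&l);			/* ticket 1 is handed out again */
	return is_served(&l, waiter) && is_served(&l, late);
}
```

Under these assumptions, the model reproduces the points above. An unconditional unlock of a free lock (for example { .owner = 3, .next = 3 }) pushes 'owner' past every ticket that will be handed out, so the oopsing CPU is effectively never served. The is_locked() check cannot tell a double-unlocked lock ({ .owner = 4, .next = 3 }) from a held one. The reinit always lets the oopsing CPU in, but it leaves earlier waiters holding stale tickets.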