Date: Mon, 4 Dec 2017 16:07:14 +1000
From: Nicholas Piggin
To: Michael Ellerman
Cc: linuxppc-dev@lists.ozlabs.org, Benjamin Herrenschmidt
Subject: Re: [PATCH 2/4] powerpc/64: do not trace irqs-off at interrupt return to soft-disabled context
Message-ID: <20171204160714.2ff62d11@roar.ozlabs.ibm.com>
In-Reply-To: <87d13v3wpm.fsf@concordia.ellerman.id.au>
References: <20171116160052.18672-1-npiggin@gmail.com> <20171116160052.18672-3-npiggin@gmail.com> <87d13v3wpm.fsf@concordia.ellerman.id.au>
List-Id: Linux on PowerPC Developers Mail List

On Mon, 04 Dec 2017 16:09:57 +1100
Michael Ellerman wrote:

> Nicholas Piggin writes:
>
> > When an interrupt is returning to a soft-disabled context (which can
> > happen for non-maskable interrupts or synchronous interrupts), it goes
> > through the motions of soft-disabling again, including calling
> > TRACE_DISABLE_INTS (i.e., trace_hardirqs_off()).
> >
> > This is not necessary, because we must already be soft-disabled in the
> > interrupt context, it also may be causing crashes in the irq tracing
> > code to re-enter as an nmi. Replace it with a warning to ensure that
> > soft-interrupts are still disabled.
> >
> > Signed-off-by: Nicholas Piggin
> > ---
> > arch/powerpc/kernel/entry_64.S | 10 +++++++---
> > 1 file changed, 7 insertions(+), 3 deletions(-)
>
> So this patch is the core of the bug fix I gather.
>
> Git blame says:
>
> Fixes: 7c0482e3d055 ("powerpc/irq: Fix another case of lazy IRQ state getting out of sync")
> Cc: stable@vger.kernel.org # v3.4+
>
> But I'm wondering how this has been broken that long without us
> noticing? You hit it doing some sort of perf stress test I think - so is
> it just that we've never pushed hard enough? Or did something change to
> expose this? Or we're just not sure?

I'm not really sure. A customer hit it, during either a stress test or
a long-running workload with lockdep irq tracing and perf running at the
same time. I don't have a lot more details but we might be able to get
some offline if necessary.

Thanks,
Nick