From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S933833AbcBQJ3Z (ORCPT ); Wed, 17 Feb 2016 04:29:25 -0500
Received: from mail.skyhub.de ([78.46.96.112]:48688 "EHLO mail.skyhub.de"
        rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
        id S933798AbcBQJ3Q (ORCPT ); Wed, 17 Feb 2016 04:29:16 -0500
Date: Wed, 17 Feb 2016 10:29:11 +0100
From: Borislav Petkov
To: Ingo Molnar
Cc: Andy Lutomirski, "linux-kernel@vger.kernel.org", X86 ML
Subject: Re: WARNING: CPU: 0 PID: 3031 at ./arch/x86/include/asm/fpu/internal.h:530 fpu__restore+0x90/0x130()
Message-ID: <20160217092911.GA2023@pd.tnic>
References: <20160211192741.GG5565@pd.tnic> <20160212170010.GE4099@pd.tnic>
        <20160215191422.GB32716@pd.tnic> <20160217081646.GA32354@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <20160217081646.GA32354@gmail.com>
User-Agent: Mutt/1.5.24 (2015-08-30)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Feb 17, 2016 at 09:16:46AM +0100, Ingo Molnar wrote:
> So I'm wondering why this started triggering only now. Is this a pre-existing bug
> that somehow got triggered via:
>
>   58122bf1d856 x86/fpu: Default eagerfpu=on on all CPUs
>
> ?

Well, that's an interesting question. See, the thing is, I triggered
this only *once* by accident and I haven't seen it ever since. The
"reliable" "reproducer" I used to debug this was Andy's suggestion to
stick a schedule() in __fpu__restore_sig().

So the answer to that question is not easy.

BUT(!), regardless, the bug still needs to be fixed because my tracing
here

https://lkml.kernel.org/r/20160215191422.GB32716@pd.tnic

showed that getting preempted after setting

        fpu->fpstate_active = 1;

leads to the WARN.

Because - and please double-check me on that - when we're in
__switch_to() and the next task, the one we're about to switch to,
already has ->fpstate_active set, switch_fpu_prepare() does:

        fpu.preload = static_cpu_has(X86_FEATURE_FPU) &&
                      new_fpu->fpstate_active &&
                      ^^^^^^^^^^^^^^^^^^^^^^^

so that fpu.preload is set now.

A bit later in that same function:

                /* Don't change CR0.TS if we just switch! */
                if (fpu.preload) {
                        new_fpu->counter++;
                        __fpregs_activate(new_fpu);
                        ^^^^^^^^^^^^^^^^^

->fpregs_active gets set here and when the task returns to
__fpu__restore_sig(), fpu__restore() tries to set it again, leading to
the WARN.

Mind you, this happens on 32-bit only because there we sigreturn with
irqs enabled. Look at the call trace.

> If yes then we need a plausible theory of how that never triggered on
> modern Intel CPUs that had eagerfpu enabled for years.

AFAICT, it triggers - and the window is very small at that - only on
32-bit. If at all.

> Or perhaps was it caused by one of the other changes in tip:x86/fpu:
>
>   c6ab109f7e0e x86/fpu: Speed up lazy FPU restores slightly
>   a20d7297045f x86/fpu: Fold fpu_copy() into fpu__copy()
>   5ed73f40735c x86/fpu: Fix FNSAVE usage in eagerfpu mode
>   4ecd16ec7059 x86/fpu: Fix math emulation in eager fpu mode
>
> ?

I can certainly try to test all those but I don't have a reliable
reproducer. The only thing I could do is check out each of those commits
and stick a schedule() in __fpu__restore_sig() and see what happens.

But if my analysis above is right, none of those would matter because of
the mechanism of how the WARN happens...

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
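
For anyone who wants to poke at the window described above without a
32-bit box, here is a minimal user-space sketch of the sequence. It is a
toy model and not kernel code: struct toy_fpu and every toy_*() function
are made-up stand-ins, and it assumes that the check which fires in
fpu__restore() amounts to "->fpregs_active was already set when we tried
to activate the registers". The only thing it keeps from the analysis is
the ordering of the two flag writes around a possible preemption.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two flags discussed above; NOT struct fpu. */
struct toy_fpu {
	bool fpstate_active;	/* task has valid FPU state */
	bool fpregs_active;	/* that state is live in the registers */
	int warnings;		/* how often the "already active" check fired */
};

/* Models __fpregs_activate(): sets the flag, no sanity check. */
static void toy_raw_fpregs_activate(struct toy_fpu *fpu)
{
	fpu->fpregs_active = true;
}

/* Models the restore path: complain if the registers are already active. */
static void toy_fpu_restore(struct toy_fpu *fpu)
{
	if (fpu->fpregs_active)
		fpu->warnings++;	/* assumed equivalent of the WARN */
	toy_raw_fpregs_activate(fpu);
}

/* Models being switched out and back in; preload follows fpstate_active. */
static void toy_preempt(struct toy_fpu *fpu)
{
	fpu->fpregs_active = false;		/* switched out */
	if (fpu->fpstate_active)		/* fpu.preload */
		toy_raw_fpregs_activate(fpu);	/* switched back in */
}

/* Models the sigreturn path; returns how many times the check fired. */
static int toy_restore_sig(bool preempt_in_window)
{
	struct toy_fpu fpu = { false, false, 0 };

	fpu.fpstate_active = true;	/* the window opens here */
	if (preempt_in_window)
		toy_preempt(&fpu);	/* like the schedule() debugging hack */
	toy_fpu_restore(&fpu);		/* the window closes here */

	assert(fpu.fpregs_active);
	return fpu.warnings;
}
```

In this model toy_restore_sig(false) never trips the check, while
toy_restore_sig(true) trips it once, because the switch back in has
already set ->fpregs_active by the time the restore runs.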