From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S933833AbcBQJ3Z (ORCPT <rfc822;w@1wt.eu>);
	Wed, 17 Feb 2016 04:29:25 -0500
Received: from mail.skyhub.de ([78.46.96.112]:48688 "EHLO mail.skyhub.de"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S933798AbcBQJ3Q (ORCPT <rfc822;linux-kernel@vger.kernel.org>);
	Wed, 17 Feb 2016 04:29:16 -0500
Date: Wed, 17 Feb 2016 10:29:11 +0100
From: Borislav Petkov <bp@alien8.de>
To: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>,
        "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
        X86 ML <x86@kernel.org>
Subject: Re: WARNING: CPU: 0 PID: 3031 at
 ./arch/x86/include/asm/fpu/internal.h:530 fpu__restore+0x90/0x130()
Message-ID: <20160217092911.GA2023@pd.tnic>
References: <20160211192741.GG5565@pd.tnic>
 <CALCETrVSChX9BFyLn2mf4q3a0WxyNgFzcb4A4nFfAQnbfO02Pg@mail.gmail.com>
 <CALCETrVU8RvcDAUPfwoW9FVvgyn3z-5R86+4-mXtubpTd4YiKg@mail.gmail.com>
 <20160212170010.GE4099@pd.tnic>
 <20160215191422.GB32716@pd.tnic>
 <CALCETrW3r89hbU3u2xG2u+ZDrk-WzwmicL9ZL-L1GVGZfH-sdQ@mail.gmail.com>
 <20160217081646.GA32354@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <20160217081646.GA32354@gmail.com>
User-Agent: Mutt/1.5.24 (2015-08-30)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Feb 17, 2016 at 09:16:46AM +0100, Ingo Molnar wrote:
> So I'm wondering why this started triggering only now. Is this a pre-existing bug 
> that somehow got triggered via:
> 
>   58122bf1d856 x86/fpu: Default eagerfpu=on on all CPUs
> 
> ?

Well, that's an interesting question. See, the thing is, I triggered
this only *once* by accident and I haven't seen it ever since.

The "reliable" "reproducer" I used to debug this was Andy's suggestion
to stick a schedule() in __fpu__restore_sig().

So the answer to that question is not easy.

BUT(!), regardless, the bug still needs to be fixed because my tracing
here

https://lkml.kernel.org/r/20160215191422.GB32716@pd.tnic

showed that getting preempted after setting

	fpu->fpstate_active = 1;

leads to the WARN. Because - and please doublecheck me on that - if the
task which already has ->fpstate_active set is the next task we switch
to in __switch_to(), then switch_fpu_prepare() does this for it:

        fpu.preload = static_cpu_has(X86_FEATURE_FPU) &&
                      new_fpu->fpstate_active &&
		      ^^^^^^^^^^^^^^^^^^^^^^^

so that fpu.preload is set now.

A bit later in that same function:

                /* Don't change CR0.TS if we just switch! */
                if (fpu.preload) {
                        new_fpu->counter++;
                        __fpregs_activate(new_fpu);
			^^^^^^^^^^^^^^^^^

->fpregs_active gets set here, so when the task returns to
__fpu__restore_sig(), fpu__restore() tries to set it a second time,
which triggers the WARN.
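
To make the sequence easier to doublecheck, here is a minimal userspace
sketch of it. This is a toy model and not kernel code: every toy_* name
is made up for illustration, only the two preload terms quoted above are
modelled (the rest of the condition is assumed to be true), and the
"WARN" is just a counter that gets bumped when the registers are
activated while ->fpregs_active is already set.

```c
#include <stdbool.h>
#include <stdio.h>

/* Toy stand-in for the per-task FPU state; NOT the kernel's struct fpu. */
struct toy_fpu {
	bool fpstate_active;	/* task has FPU state set up */
	bool fpregs_active;	/* task's FPU state is live in the registers */
	unsigned int counter;
};

/* Number of times the modelled WARN has fired. */
static int toy_warn_count;

/* Models the activation step: complain if the registers are already active. */
static void toy_fpregs_activate(struct toy_fpu *fpu)
{
	if (fpu->fpregs_active) {
		toy_warn_count++;
		fprintf(stderr, "WARN: fpregs_active already set\n");
	}
	fpu->fpregs_active = true;
}

/* Models the preload decision when this task is the one being switched to. */
static void toy_switch_in(struct toy_fpu *new_fpu, bool cpu_has_fpu)
{
	/* Assumption: the remaining terms of the real condition are true. */
	bool preload = cpu_has_fpu && new_fpu->fpstate_active;

	if (preload) {
		new_fpu->counter++;
		toy_fpregs_activate(new_fpu);
	}
}

/*
 * Models the sigreturn path: mark the state active, optionally get
 * preempted and switched back in, then do the restore step.
 * Returns how many modelled WARNs fired during this call.
 */
static int toy_restore_sig(struct toy_fpu *fpu, bool preempted_in_window)
{
	int warns_before = toy_warn_count;

	fpu->fpregs_active = false;	/* start with registers not active */
	fpu->fpstate_active = true;	/* the store after which preemption hurts */

	if (preempted_in_window)
		toy_switch_in(fpu, true);	/* scheduled out and back in */

	toy_fpregs_activate(fpu);	/* the restore step activates again */

	return toy_warn_count - warns_before;
}
```

Calling toy_restore_sig() with preempted_in_window == false goes through
without a warning; with true, the switch-in path activates the registers
first and the restore step then trips the check once. The flag plays the
role of the schedule() stuck into __fpu__restore_sig().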

Mind you, this happens only on 32-bit, because there we sigreturn with
irqs enabled. Look at the call trace.

> If yes then we need a plausible theory of how that never triggered on
> modern Intel CPUs that had eagerfpu enabled for years.

AFAICT, it triggers - and the window is very small at that - only on
32-bit. If at all.

> Or perhaps was it caused by one of the other changes in tip:x86/fpu:
> 
>   c6ab109f7e0e x86/fpu: Speed up lazy FPU restores slightly
>   a20d7297045f x86/fpu: Fold fpu_copy() into fpu__copy()
>   5ed73f40735c x86/fpu: Fix FNSAVE usage in eagerfpu mode
>   4ecd16ec7059 x86/fpu: Fix math emulation in eager fpu mode
> 
> ?

I can certainly try to test all those but I don't have a reliable
reproducer. The only thing I could do is check out each of those commits
and stick a schedule() in __fpu__restore_sig() and see what happens.

But if my analysis above is right, none of those would matter, given the
mechanism by which the WARN happens...

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.