From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1755807AbbCRWVR (ORCPT );
	Wed, 18 Mar 2015 18:21:17 -0400
Received: from mail-la0-f43.google.com ([209.85.215.43]:33178 "EHLO
	mail-la0-f43.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1750822AbbCRWVO (ORCPT );
	Wed, 18 Mar 2015 18:21:14 -0400
MIME-Version: 1.0
In-Reply-To: <5509F986.2050506@redhat.com>
References: <5505400B.8050300@message-id.googlemail.com>
	<5509CBF7.3040602@message-id.googlemail.com>
	<5509F161.3010101@redhat.com>
	<5509F986.2050506@redhat.com>
From: Andy Lutomirski
Date: Wed, 18 Mar 2015 15:20:52 -0700
Message-ID:
Subject: Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?
To: Denys Vlasenko
Cc: Linus Torvalds , Stefan Seyfried , Takashi Iwai , X86 ML ,
	LKML , Tejun Heo
Content-Type: text/plain; charset=UTF-8
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Mar 18, 2015 at 3:17 PM, Denys Vlasenko wrote:
> On 03/18/2015 10:55 PM, Andy Lutomirski wrote:
>> On Wed, Mar 18, 2015 at 2:42 PM, Denys Vlasenko wrote:
>>> On 03/18/2015 10:32 PM, Linus Torvalds wrote:
>>>> On Wed, Mar 18, 2015 at 12:26 PM, Andy Lutomirski wrote:
>>>>>>
>>>>>> crash> disassemble page_fault
>>>>>> Dump of assembler code for function page_fault:
>>>>>>    0xffffffff816834a0 <+0>:     data32 xchg %ax,%ax
>>>>>>    0xffffffff816834a3 <+3>:     data32 xchg %ax,%ax
>>>>>>    0xffffffff816834a6 <+6>:     data32 xchg %ax,%ax
>>>>>>    0xffffffff816834a9 <+9>:     sub    $0x78,%rsp
>>>>>>    0xffffffff816834ad <+13>:    callq  0xffffffff81683620
>>>>>
>>>>> The callq was the double-faulting instruction, and it is indeed the
>>>>> first instruction in here that would have accessed the stack. (The
>>>>> sub *changes* rsp but isn't a memory access.) So, since RSP is
>>>>> bogus, we page fault, and the page fault is promoted to a double
>>>>> fault.
>>>>> The surprising thing is that the page fault itself seems to have
>>>>> been delivered okay, and RSP wasn't on a page boundary.
>>>>
>>>> Not at all surprising, and sure it was on a page boundary..
>>>>
>>>> Look closer.
>>>>
>>>> %rsp is 00007fffa55eafb8.
>>>>
>>>> But that's *after* page_fault has done that
>>>>
>>>>    sub $0x78,%rsp
>>>>
>>>> so %rsp when the page fault happened was 0x7fffa55eb030. Which is a
>>>> different page.
>>
>> Ah, I forgot to add 0x78. You're right, of course.
>>
>>>> And that page happened to be mapped.
>>>>
>>>> So what happened is:
>>>>
>>>>  - we somehow entered kernel mode without switching stacks
>>>>    (ie presumably syscall)
>>>>
>>>>  - the user stack was still fine
>>>>
>>>>  - we took a page fault, which once again didn't switch stacks,
>>>>    because we were already in kernel mode. And this page fault
>>>>    worked, because it just pushed the error code onto the user
>>>>    stack, which was mapped.
>>>>
>>>>  - we now took a second page fault within the page fault handler,
>>>>    because now the stack pointer has been decremented and points
>>>>    one user page down, to a page that is *not* mapped, so now that
>>>>    page fault cannot push the error code and return information.
>>>>
>>>> Now, how we took that original page fault is sadly not very clear
>>>> at all. I agree that it's something about system call entry (how
>>>> could we not change stacks otherwise), but why it should have
>>>> started now, I don't know. I don't think "system_call" has changed
>>>> at all.
>>>>
>>>> Maybe there is something wrong with the new "ret_from_sys_call"
>>>> logic, and that "use sysret to return to user mode" thing. Because
>>>> this code sequence:
>>>>
>>>> +       movq (RSP-RIP)(%rsp), %rsp
>>>> +       USERGS_SYSRET64
>>>>
>>>> in 'irq_return_via_sysret' is new to 4.0, and instead of entering
>>>> the kernel with a user stack pointer, maybe we're *exiting* the
>>>> kernel, and have just reloaded the user stack pointer when
>>>> "USERGS_SYSRET64" takes some fault.
>>>
>>> Yes, so far we happily thought that SYSRET never fails...
>>>
>>> This merits adding some code which would at least BUG_ON
>>> if the faulting address is seen to match SYSRET64.
>>
>> sysret64 can only fail with #GP, and we're totally screwed if that
>> happens, although I agree about the BUG_ON in principle. Where would
>> we add it that would help in this case, though? We never even made it
>> to C code.
>>
>> In any event, this was a page fault. sysret64 doesn't access memory.
>
> Let's see.
>
> A faulting SYSRET will still be in CPL0.
> It would drop the CPU into the #GP handler,
> but %rsp is already loaded with the _user_ %rsp (!).
>
> The #GP handler will start pushing stuff onto the stack,
> happily thinking that it is a kernel stack.
>
> This can cause a page fault.
>
> Most likely, this page fault won't succeed,
> and we'd get a double fault with %rip somewhere in the #GP handler.
>
> Yes, this doesn't entirely match what we see...
>
> There is an easy way to test the theory that SYSRET is to blame.
>
> Just replace
>
>         movq RCX(%rsp), %rcx
>         cmpq %rcx, RIP(%rsp)            /* RCX == RIP */
>         jne  opportunistic_sysret_failed
>
> this "jne" with "jmp", and try to reproduce.

This is a classic root exploit, and it's why we check for
non-canonical RIP. In theory, that's the only way this can happen.
Intel screwed up -- AMD never fails SYSRET.

--Andy