From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1755807AbbCRWVR (ORCPT );
	Wed, 18 Mar 2015 18:21:17 -0400
Received: from mail-la0-f43.google.com ([209.85.215.43]:33178 "EHLO
	mail-la0-f43.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1750822AbbCRWVO (ORCPT );
	Wed, 18 Mar 2015 18:21:14 -0400
MIME-Version: 1.0
In-Reply-To: <5509F986.2050506@redhat.com>
References: <5505400B.8050300@message-id.googlemail.com>
	<5509CBF7.3040602@message-id.googlemail.com>
	<5509F161.3010101@redhat.com>
	<5509F986.2050506@redhat.com>
From: Andy Lutomirski
Date: Wed, 18 Mar 2015 15:20:52 -0700
Message-ID:
Subject: Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?
To: Denys Vlasenko
Cc: Linus Torvalds , Stefan Seyfried , Takashi Iwai , X86 ML ,
	LKML , Tejun Heo
Content-Type: text/plain; charset=UTF-8
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Mar 18, 2015 at 3:17 PM, Denys Vlasenko wrote:
> On 03/18/2015 10:55 PM, Andy Lutomirski wrote:
>> On Wed, Mar 18, 2015 at 2:42 PM, Denys Vlasenko wrote:
>>> On 03/18/2015 10:32 PM, Linus Torvalds wrote:
>>>> On Wed, Mar 18, 2015 at 12:26 PM, Andy Lutomirski wrote:
>>>>>>
>>>>>> crash> disassemble page_fault
>>>>>> Dump of assembler code for function page_fault:
>>>>>>    0xffffffff816834a0 <+0>:     data32 xchg %ax,%ax
>>>>>>    0xffffffff816834a3 <+3>:     data32 xchg %ax,%ax
>>>>>>    0xffffffff816834a6 <+6>:     data32 xchg %ax,%ax
>>>>>>    0xffffffff816834a9 <+9>:     sub    $0x78,%rsp
>>>>>>    0xffffffff816834ad <+13>:    callq  0xffffffff81683620
>>>>>
>>>>> The callq was the double-faulting instruction, and it is indeed the
>>>>> first instruction in here that would have accessed the stack. (The
>>>>> sub *changes* rsp but isn't a memory access.) So, since RSP is
>>>>> bogus, we page fault, and the page fault is promoted to a double
>>>>> fault.
>>>>> The surprising thing is that the page fault itself seems to have
>>>>> been delivered okay, and RSP wasn't on a page boundary.
>>>>
>>>> Not at all surprising, and sure it was on a page boundary..
>>>>
>>>> Look closer.
>>>>
>>>> %rsp is 00007fffa55eafb8.
>>>>
>>>> But that's *after* page_fault has done that
>>>>
>>>>    sub $0x78,%rsp
>>>>
>>>> so %rsp when the page fault happened was 0x7fffa55eb030. Which is a
>>>> different page.
>>
>> Ah, I forgot to add 0x78. You're right, of course.
>>
>>>> And that page happened to be mapped.
>>>>
>>>> So what happened is:
>>>>
>>>>  - we somehow entered kernel mode without switching stacks
>>>>    (ie presumably syscall)
>>>>
>>>>  - the user stack was still fine
>>>>
>>>>  - we took a page fault, which once again didn't switch stacks,
>>>>    because we were already in kernel mode. And this page fault
>>>>    worked, because it just pushed the error code onto the user
>>>>    stack, which was mapped.
>>>>
>>>>  - we now took a second page fault within the page fault handler,
>>>>    because now the stack pointer has been decremented and points
>>>>    one user page down, to a page that is *not* mapped, so now that
>>>>    page fault cannot push the error code and return information.
>>>>
>>>> Now, how we took that original page fault is sadly not very clear
>>>> at all. I agree that it's something about system call entry (how
>>>> could we not change stacks otherwise), but why it should have
>>>> started now, I don't know. I don't think "system_call" has changed
>>>> at all.
>>>>
>>>> Maybe there is something wrong with the new "ret_from_sys_call"
>>>> logic, and that "use sysret to return to user mode" thing. Because
>>>> this code sequence:
>>>>
>>>> +       movq (RSP-RIP)(%rsp), %rsp
>>>> +       USERGS_SYSRET64
>>>>
>>>> in 'irq_return_via_sysret' is new to 4.0, and instead of entering
>>>> the kernel with a user stack pointer, maybe we're *exiting* the
>>>> kernel, and have just reloaded the user stack pointer when
>>>> "USERGS_SYSRET64" takes some fault.
>>>
>>> Yes, so far we happily thought that SYSRET never fails...
>>>
>>> This merits adding some code which would at least BUG_ON
>>> if the faulting address is seen to match SYSRET64.
>>
>> sysret64 can only fail with #GP, and we're totally screwed if that
>> happens, although I agree about the BUG_ON in principle. Where would
>> we add it that would help in this case, though? We never even made it
>> to C code.
>>
>> In any event, this was a page fault. sysret64 doesn't access memory.
>
> Let's see.
>
> A faulting SYSRET will still be in CPL0.
> It would drop the CPU into the #GP handler,
> but %rsp is already loaded with the _user_ %rsp (!).
>
> The #GP handler will start pushing stuff onto the stack,
> happily thinking that it is a kernel stack.
>
> This can cause a page fault.
>
> Most likely, this page fault won't succeed,
> and we'd get a double fault with %rip somewhere in the #GP handler.
>
> Yes, this doesn't entirely match what we see...
>
> There is an easy way to test the theory that SYSRET is to blame.
>
> Just replace
>
>         movq RCX(%rsp), %rcx
>         cmpq %rcx, RIP(%rsp)            /* RCX == RIP */
>         jne  opportunistic_sysret_failed
>
> this "jne" with "jmp", and try to reproduce.

This is a classic root exploit, and it's why we check for
non-canonical RIP. In theory, that's the only way this can happen.
Intel screwed up -- AMD never fails SYSRET.

--Andy