From: Denys Vlasenko
Date: Wed, 18 Mar 2015 23:17:42 +0100
To: Andy Lutomirski
Cc: Linus Torvalds, Stefan Seyfried, Takashi Iwai, X86 ML, LKML, Tejun Heo
Subject: Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?
Message-ID: <5509F986.2050506@redhat.com>
List: linux-kernel@vger.kernel.org

On 03/18/2015 10:55 PM, Andy Lutomirski wrote:
> On Wed, Mar 18, 2015 at 2:42 PM, Denys Vlasenko wrote:
>> On 03/18/2015 10:32 PM, Linus Torvalds wrote:
>>> On Wed, Mar 18, 2015 at 12:26 PM, Andy Lutomirski wrote:
>>>>>
>>>>> crash> disassemble page_fault
>>>>> Dump of assembler code for function page_fault:
>>>>>    0xffffffff816834a0 <+0>:     data32 xchg %ax,%ax
>>>>>    0xffffffff816834a3 <+3>:     data32 xchg %ax,%ax
>>>>>    0xffffffff816834a6 <+6>:     data32 xchg %ax,%ax
>>>>>    0xffffffff816834a9 <+9>:     sub    $0x78,%rsp
>>>>>    0xffffffff816834ad <+13>:    callq  0xffffffff81683620
>>>>
>>>> The callq was the double-faulting instruction, and it is indeed the
>>>> first function in here that would have accessed the stack.  (The sub
>>>> *changes* rsp but isn't a memory access.)  So, since RSP is bogus, we
>>>> page fault, and the page fault is promoted to a double fault.
>>>> The surprising thing is that the page fault itself seems to have
>>>> been delivered okay, and RSP wasn't on a page boundary.
>>>
>>> Not at all surprising, and sure it was on a page boundary..
>>>
>>> Look closer.
>>>
>>> %rsp is 00007fffa55eafb8.
>>>
>>> But that's *after* page_fault has done that
>>>
>>>         sub $0x78,%rsp
>>>
>>> so %rsp when the page fault happened was 0x7fffa55eb030.  Which is a
>>> different page.
>
> Ah, I forgot to add 0x78.  You're right, of course.
>
>>> And that page happened to be mapped.
>>>
>>> So what happened is:
>>>
>>>  - we somehow entered kernel mode without switching stacks
>>>    (ie presumably syscall)
>>>
>>>  - the user stack was still fine
>>>
>>>  - we took a page fault, which once again didn't switch stacks,
>>> because we were already in kernel mode.  And this page fault worked,
>>> because it just pushed the error code onto the user stack, which was
>>> mapped.
>>>
>>>  - we now took a second page fault within the page fault handler,
>>> because now the stack pointer has been decremented and points one user
>>> page down, which is *not* mapped, so now that page fault cannot push
>>> the error code and return information.
>>>
>>> Now, how we took that original page fault is sadly not very clear at
>>> all.  I agree that it's something about system-call (how could we not
>>> change stacks otherwise), but why it should have started now, I don't
>>> know.  I don't think "system_call" has changed at all.
>>>
>>> Maybe there is something wrong with the new "ret_from_sys_call" logic,
>>> and that "use sysret to return to user mode" thing.  Because this code
>>> sequence:
>>>
>>> +       movq (RSP-RIP)(%rsp),%rsp
>>> +       USERGS_SYSRET64
>>>
>>> in 'irq_return_via_sysret' is new to 4.0, and instead of entering the
>>> kernel with a user stack pointer, maybe we're *exiting* the kernel,
>>> and have just reloaded the user stack pointer when "USERGS_SYSRET64"
>>> takes some fault.
>>
>> Yes, so far we happily thought that SYSRET never fails...
>>
>> This merits adding some code which would at least BUG_ON
>> if the faulting address is seen to match SYSRET64.
>
> sysret64 can only fail with #GP, and we're totally screwed if that
> happens, although I agree about the BUG_ON in principle.  Where would
> we add it that would help in this case, though?  We never even made it
> to C code.
>
> In any event, this was a page fault.  sysret64 doesn't access memory.

Let's see.  A faulting SYSRET will still be in CPL0.  It would drop the
CPU into the #GP handler, but %rsp is already loaded with the _user_
%rsp (!).  The #GP handler will start pushing stuff onto the stack,
happily thinking that it is a kernel stack.  This can cause a page
fault.  Most likely, this page fault won't succeed, and we'd get a
double fault with %rip somewhere in the #GP handler.

Yes, this doesn't entirely match what we see...

There is an easy way to test the theory that SYSRET is to blame.
In

        movq    RCX(%rsp),%rcx
        cmpq    %rcx,RIP(%rsp)          /* RCX == RIP */
        jne     opportunistic_sysret_failed

just replace the "jne" with "jmp", and try to reproduce.