From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753007AbbCWRqx (ORCPT ); Mon, 23 Mar 2015 13:46:53 -0400
Received: from mx1.redhat.com ([209.132.183.28]:44273 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752214AbbCWRqs (ORCPT ); Mon, 23 Mar 2015 13:46:48 -0400
Message-ID: <55105185.5090200@redhat.com>
Date: Mon, 23 Mar 2015 18:46:45 +0100
From: Denys Vlasenko
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.2.0
MIME-Version: 1.0
To: Takashi Iwai
CC: Andy Lutomirski, Denys Vlasenko, Jiri Kosina, Linus Torvalds, Stefan Seyfried, X86 ML, LKML, Tejun Heo
Subject: Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?
References: <5505400B.8050300@message-id.googlemail.com> <5509CBF7.3040602@message-id.googlemail.com> <5509F161.3010101@redhat.com> <550AABCB.9040502@redhat.com> <550C6415.9050402@redhat.com> <55103A33.1060704@redhat.com>
In-Reply-To:
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On 03/23/2015 06:18 PM, Takashi Iwai wrote:
> At Mon, 23 Mar 2015 17:07:15 +0100, Denys Vlasenko wrote:
>>>> I pulled tip tree on top of 4.0-rc5, built with your patch and now
>>>> succeeded to get a better message:
>>>>
>>>> kvm: zapping shadow pages for mmio generation wraparound
>>>> kvm [5126]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0xffff
>>>> Exception on user stack 00007ffd22c23ef0: RSP: 0018:00007ffd22c23f28 EFLAGS: 00010006
>>>> RIP: 0010:[] [] netlink_attachskb+0x1d/0x1d0
>>>> PANIC: double fault, error_code: 0x0
>>>> CPU: 1 PID: 10819 Comm: cc1 Tainted: G W 4.0.0-rc5-debug1+ #2
>>>> Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013
>>>> task: ffff8800d1b34b10 ti: ffff8800d1b30000 task.ti: ffff8800d1b30000
>>>> RIP: 0010:[] [] netlink_attachskb+0x1d/0x1d0
>>>> RSP: 0018:00007ffd22c23f28 EFLAGS: 00010006
>>>> RAX: 0000000000000000 RBX: 0000000000000005 RCX: 00000000c0000101
>>>> RDX: 0000000000000000 RSI: 0000000000000001 RDI: 00007ffd22c23ef0
>> FYI: the disassembly of netlink_attachskb (from "Code:" line) is:
>>
>>    0:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
>>    5:   55                      push   %rbp
>>    6:   48 89 e5                mov    %rsp,%rbp
>>    9:   41 56                   push   %r14
>>    b:   41 55                   push   %r13
>>    d:   49 89 d5                mov    %rdx,%r13
>>   10:   41 54                   push   %r12
>>   12:   49 89 f4                mov    %rsi,%r12
>>   15:   53                      push   %rbx
>>   16:   48 89 fb                mov    %rdi,%rbx
>>   19:   48 83 ec 30             sub    $0x30,%rsp
>>   1d:   8b 87 68 01 00 00       mov    0x168(%rdi),%eax
>>   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>   23:   39 87 9c 01 00 00       cmp    %eax,0x19c(%rdi)
>>   29:   7c 25                   jl     50 <_start+0x50>
>>   2b:   48 8b 87 88 04 00 00    mov    0x488(%rdi),%rax
>>
>> The ^^^^^ instruction is the one which faults. Since you said it
>> consistently happens here, this should be a page fault, not an external
>> hardware interrupt.
>>
>> The code corresponds to the comparison in if():
>>
>> int netlink_attachskb(struct sock *sk, struct sk_buff *skb,
>>                       long *timeo, struct sock *ssk)
>> {
>>         struct netlink_sock *nlk;
>>
>>         nlk = nlk_sk(sk);
>>
>>         if ((atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||

>>> - Another piece is that the bug happens only when a KVM is running.
>>>   The kernel ran without problem over days with similar tasks
>>>   (compiling kernel, etc) when no KVM was used.
>>
>> Conceivably virtualization support in CPUs can have nasty erratas.
>> However, you and other reporter have different CPUs - yours
>> is Ivy Bridge, his CPU is a Penryn.
>>
>> I don't see the path how KVM helps to trigger this.
>>
>>> - And now I get the trace as above, pointing netlink_attachskb().
>>>
>>> I have a difficulty to imagine how all these pieces fit into a single
>>> picture. Is something already screwed up before that?
>>
>> Well, a tiny bit more info will be seen if you'd change %rdi
>> to, say, %r15 in these two lines in my patch:
>>
>>         /* Save bogus RSP value */
>>         movq %rsp,%rdi
>>         ...
>>         push %rdi /* pt_regs->sp */
>>
>> Then original %rdi will be visible in the crash message.
>
> OK, here we go.
>
> kvm: zapping shadow pages for mmio generation wraparound
> kvm [5490]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0xffff
> Exception on user stack 00007fff1d7e5ec0: RSP: 0018:00007fff1d7e5ef8 EFLAGS: 00010002
> RIP: 0010:[] [] netlink_attachskb+0x1d/0x1d0
> PANIC: double fault, error_code: 0x0
> CPU: 5 PID: 14285 Comm: fixdep Tainted: G W 4.0.0-rc5-debug1+ #3
> Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013
> task: ffff88020ba1c690 ti: ffff880206ba4000 task.ti: ffff880206ba4000
> RIP: 0010:[] [] netlink_attachskb+0x1d/0x1d0
> RSP: 0018:00007fff1d7e5ef8 EFLAGS: 00010002
> RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00000000c0000101
> RDX: 0000000000000000 RSI: 0000000000001ebb RDI: 0000000000000000

Thanks for your testing.

So the %rdi was NULL... not very informative.

Notice that your every crash is preceded by

kvm: zapping shadow pages for mmio generation wraparound
kvm [5490]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0xffff

This hints that kvm _is_ somehow responsible.
I'm no expert on kvm, I need to take a look around that code...
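
For anyone triaging similar reports: both dumps show a kernel code selector (RIP: 0010) together with an RSP in the user half of the address space, which is what the "Exception on user stack" line flags. Below is a minimal sketch of that check, not anything from the patch. It assumes x86-64 with 4-level paging (48-bit virtual addresses), and the helper names `classify` and `kernel_on_user_stack` are hypothetical, chosen only for illustration.

```python
import re

# Assumption: 4-level paging, so canonical addresses split into a user half
# ending at 0x00007fffffffffff and a kernel half starting at 0xffff800000000000.
USER_MAX = 0x0000_7FFF_FFFF_FFFF
KERNEL_MIN = 0xFFFF_8000_0000_0000


def classify(addr: int) -> str:
    """Return which half of the canonical address space addr falls in."""
    if addr <= USER_MAX:
        return "user"
    if addr >= KERNEL_MIN:
        return "kernel"
    return "non-canonical"


def kernel_on_user_stack(oops_text: str) -> bool:
    """True if the dump shows ring-0 code running with a user-space RSP."""
    rip = re.search(r"RIP: ([0-9a-f]{4}):", oops_text)
    rsp = re.search(r"RSP: ([0-9a-f]{4}):([0-9a-f]{16})", oops_text)
    if rip is None or rsp is None:
        return False
    cs = int(rip.group(1), 16)
    # The low two bits of a segment selector are the requested privilege level.
    ring0 = (cs & 3) == 0
    return ring0 and classify(int(rsp.group(2), 16)) == "user"


# Register lines copied from the second crash message in this thread.
SAMPLE = """\
RIP: 0010:[] [] netlink_attachskb+0x1d/0x1d0
RSP: 0018:00007fff1d7e5ef8 EFLAGS: 00010002
"""

if __name__ == "__main__":
    print(kernel_on_user_stack(SAMPLE))
```

The check only says that the stack pointer was already bogus when netlink_attachskb was entered; it does not say how it got that way.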