From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1753007AbbCWRqx (ORCPT <rfc822;w@1wt.eu>);
	Mon, 23 Mar 2015 13:46:53 -0400
Received: from mx1.redhat.com ([209.132.183.28]:44273 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1752214AbbCWRqs (ORCPT <rfc822;linux-kernel@vger.kernel.org>);
	Mon, 23 Mar 2015 13:46:48 -0400
Message-ID: <55105185.5090200@redhat.com>
Date: Mon, 23 Mar 2015 18:46:45 +0100
From: Denys Vlasenko <dvlasenk@redhat.com>
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.2.0
MIME-Version: 1.0
To: Takashi Iwai <tiwai@suse.de>
CC: Andy Lutomirski <luto@amacapital.net>,
        Denys Vlasenko <vda.linux@googlemail.com>,
        Jiri Kosina <jkosina@suse.cz>,
        Linus Torvalds <torvalds@linux-foundation.org>,
        Stefan Seyfried <stefan.seyfried@googlemail.com>,
        X86 ML <x86@kernel.org>, LKML <linux-kernel@vger.kernel.org>,
        Tejun Heo <tj@kernel.org>
Subject: Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?
References: <5505400B.8050300@message-id.googlemail.com>	<5509CBF7.3040602@message-id.googlemail.com>	<CALCETrU2R020HVniX2sczxexPO2qhEPbS++9DXzcxeycgxoGQg@mail.gmail.com>	<CA+55aFwT4BJVR10i2Cm8pMH0UGd-J3EwnEUYKf3BWTM0awebbA@mail.gmail.com>	<5509F161.3010101@redhat.com>	<CALCETrXZvSiT41+AYAPizSsGZ_=O=7wmb+Lwo_ChEZySxUnH-A@mail.gmail.com>	<alpine.LRH.2.00.1503182320490.13021@twin.jikos.cz>	<CALCETrVnJHXhz81QCr7qmm0uwdw2t0EWe_zUw4E7bZB2WXQNTQ@mail.gmail.com>	<s5hiodxjnqs.wl-tiwai@suse.de>	<550AABCB.9040502@redhat.com>	<s5hk2ydteq2.wl-tiwai@suse.de>	<CAK1hOcMTPgmrnrUEqTMXjcEdkP8NiLDt4Tq3Z1yvHFLgKx+cdQ@mail.gmail.com>	<s5hh9tht7z3.wl-tiwai@suse.de>	<s5hoanprq8x.wl-tiwai@suse.de>	<s5hk2ydnhav.wl-tiwai@suse.de>	<CALCETrVQEnL6w2E=JJ9_jXBSAyutFWRpYSrDBO9WOGYq_TM74Q@mail.gmail.com>	<s5hbnjpnfxu.wl-tiwai@suse.de>	<550C6415.9050402@redhat.com>	<s5hr3sgxf0j.wl-tiwai@suse.de>	<s5hvbhsoy36.wl-tiwai@suse.de>	<s5hk2y74zmw.wl-tiwai@suse.de>	<55103A33.1060704@redhat.com>
 <s5h619reiog.wl-tiwai@suse.de>
In-Reply-To: <s5h619reiog.wl-tiwai@suse.de>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On 03/23/2015 06:18 PM, Takashi Iwai wrote:
> At Mon, 23 Mar 2015 17:07:15 +0100, Denys Vlasenko wrote:
>>>> I pulled tip tree on top of 4.0-rc5, built with your patch and now
>>>> succeeded to get a better message:
>>>>
>>>>  kvm: zapping shadow pages for mmio generation wraparound
>>>>  kvm [5126]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0xffff
>>>>  Exception on user stack 00007ffd22c23ef0: RSP: 0018:00007ffd22c23f28  EFLAGS: 00010006
>>>>  RIP: 0010:[<ffffffff8162681d>]  [<ffffffff8162681d>] netlink_attachskb+0x1d/0x1d0
>>>>  PANIC: double fault, error_code: 0x0
>>>>  CPU: 1 PID: 10819 Comm: cc1 Tainted: G        W       4.0.0-rc5-debug1+ #2
>>>>  Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013
>>>>  task: ffff8800d1b34b10 ti: ffff8800d1b30000 task.ti: ffff8800d1b30000
>>>>  RIP: 0010:[<ffffffff8162681d>]  [<ffffffff8162681d>] netlink_attachskb+0x1d/0x1d0
>>>>  RSP: 0018:00007ffd22c23f28  EFLAGS: 00010006
>>>>  RAX: 0000000000000000 RBX: 0000000000000005 RCX: 00000000c0000101
>>>>  RDX: 0000000000000000 RSI: 0000000000000001 RDI: 00007ffd22c23ef0

>> FYI: the disassembly of netlink_attachskb (from "Code:" line) is:
>>
>>    0:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
>>    5:   55                      push   %rbp
>>    6:   48 89 e5                mov    %rsp,%rbp
>>    9:   41 56                   push   %r14
>>    b:   41 55                   push   %r13
>>    d:   49 89 d5                mov    %rdx,%r13
>>   10:   41 54                   push   %r12
>>   12:   49 89 f4                mov    %rsi,%r12
>>   15:   53                      push   %rbx
>>   16:   48 89 fb                mov    %rdi,%rbx
>>   19:   48 83 ec 30             sub    $0x30,%rsp
>>   1d:   8b 87 68 01 00 00       mov    0x168(%rdi),%eax
>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>   23:   39 87 9c 01 00 00       cmp    %eax,0x19c(%rdi)
>>   29:   7c 25                   jl     50 <_start+0x50>
>>   2b:   48 8b 87 88 04 00 00    mov    0x488(%rdi),%rax
>>
>> The ^^^^^ instruction is the one which faults. Since you said it
>> consistently happens here, this should be a page fault, not an external
>> hardware interrupt.
>>
>> The code corresponds to the comparison in if():
>>
>> int netlink_attachskb(struct sock *sk, struct sk_buff *skb,
>>                       long *timeo, struct sock *ssk)
>> {
>>         struct netlink_sock *nlk;
>>
>>         nlk = nlk_sk(sk);
>>
>>         if ((atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||

>>> - Another piece is that the bug happens only when a KVM is running.
>>>   The kernel ran without problem over days with similar tasks
>>>   (compiling kernel, etc) when no KVM was used.
>>
>> Conceivably virtualization support in CPUs can have nasty erratas.
>> However, you and other reporter have different CPUs - yours
>> is Ivy Bridge, his CPU is a Penryn.
>>
>> I don't see the path how KVM helps to trigger this.
>>
>>> - And now I get the trace as above, pointing netlink_attachskb().
>>>
>>> I have a difficulty to imagine how all these pieces fit into a single
>>> picture.  Is something already screwed up before that?
>>
>> Well, a tiny bit more info will be seen if you'd change %rdi
>> to, say, %r15 in these two lines in my patch:
>>
>>        /* Save bogus RSP value */
>>        movq    %rsp,%rdi
>> ...
>>        push    %rdi            /* pt_regs->sp */
>>
>> Then original %rdi will be visible in the crash message.
> 
> OK, here we go.
> 
>  kvm: zapping shadow pages for mmio generation wraparound
>  kvm [5490]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0xffff
>  Exception on user stack 00007fff1d7e5ec0: RSP: 0018:00007fff1d7e5ef8  EFLAGS: 00010002
>  RIP: 0010:[<ffffffff8162681d>]  [<ffffffff8162681d>] netlink_attachskb+0x1d/0x1d0
>  PANIC: double fault, error_code: 0x0
>  CPU: 5 PID: 14285 Comm: fixdep Tainted: G        W       4.0.0-rc5-debug1+ #3
>  Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013
>  task: ffff88020ba1c690 ti: ffff880206ba4000 task.ti: ffff880206ba4000
>  RIP: 0010:[<ffffffff8162681d>]  [<ffffffff8162681d>] netlink_attachskb+0x1d/0x1d0
>  RSP: 0018:00007fff1d7e5ef8  EFLAGS: 00010002
>  RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00000000c0000101
>  RDX: 0000000000000000 RSI: 0000000000001ebb RDI: 0000000000000000

Thanks for testing. So %rdi was NULL... not very informative.

Notice that every one of your crashes is preceded by

    kvm: zapping shadow pages for mmio generation wraparound
    kvm [5490]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0xffff

This hints that kvm _is_ somehow responsible.
I'm no expert on kvm, so I need to take a look around that code...