From: Andrey Korolyov
Date: Wed, 1 Apr 2015 19:29:47 +0300
Subject: Re: [Qemu-devel] E5-2620v2 - emulation stop error
To: Paolo Bonzini
Cc: "kvm@vger.kernel.org", Radim Krčmář, "qemu-devel@nongnu.org", "Dr. David Alan Gilbert", Bandan Das, Kevin O'Connor, Gerd Hoffmann
References: <20150330185634.GE13271@potion.brq.redhat.com> <20150331134512.GG13271@potion.brq.redhat.com> <20150331164539.GD14262@potion.brq.redhat.com> <20150401114923.GH13271@potion.brq.redhat.com> <551BF064.9030506@redhat.com>

On Wed, Apr 1, 2015 at 6:37 PM, Andrey Korolyov wrote:
> On Wed, Apr 1, 2015 at 4:19 PM, Paolo Bonzini wrote:
>>
>>
>> On 01/04/2015 14:26, Andrey Korolyov wrote:
>>> Yes, I disabled host watchdog during runtime. Indeed guest-induced NMI
>>> would look different and they had no reasons to be fired at this stage
>>> inside guest.
>>> I`d suspect a hypervisor hardware misbehavior there but
>>> have a very little idea on how APICv behavior (which is completely
>>> microcode-dependent and CPU-dependent but decoupled from peripheral
>>> hardware) may vary at this point, I am using 1.20140913.1 ucode
>>> version from debian if this can matter. Will send trace suggested by
>>> Paolo in a next couple of hours. Also it would be awesome to ask
>>> hardware folks from Intel who can prove or disprove my abovementioned
>>> statement (as I was unable to catch the problem on 2603v2 so far, this
>>> hypothesis has some chance to be real).
>>
>> Yes, the interaction with the NMI watchdog is unexpected and makes a
>> processor erratum somewhat more likely.
>>
>> Paolo
>
>
> http://xdel.ru/downloads/kvm-e5v2-issue/trace-nmi-apicv-fail-at-reboot.dat.gz
>
> err, no NMI entries nearby failure event, though capture should be correct:
> /sys/kernel/debug/tracing/events/kvm*/filter
> /sys/kernel/debug/tracing/events/*/kvm*/filter
> /sys/kernel/debug/tracing/events/nmi*/filter
> /sys/kernel/debug/tracing/events/*/nmi*/filter

I moved the 2603v2s back and the issue is still there. I used the wrong
reproduction pattern in the previous series of tests on those CPUs in the
middle of the month: I was continuously respawning VMs, whereas the real
issue hides in the *first* reboot events after a hypervisor reboot (or
module load). So either it should be reproducible anywhere, or this is
not a hardware issue (or it is related to the mainboard rather than the
CPU itself :) ).
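
For anyone repeating the capture, here is a minimal sketch that expands the four filter globs quoted above and prints which files they actually match on a given host. It is not part of the original capture procedure: the helper name `matching_filters` and its `tracing_root` parameter are made up for illustration, and reading the tracing directory normally requires root.

```python
import glob
import os

# The four glob patterns quoted in the message, relative to the tracing root.
FILTER_PATTERNS = [
    "events/kvm*/filter",
    "events/*/kvm*/filter",
    "events/nmi*/filter",
    "events/*/nmi*/filter",
]


def matching_filters(tracing_root="/sys/kernel/debug/tracing"):
    """Return a sorted list of filter files matched by the quoted patterns.

    `tracing_root` is a hypothetical parameter added so the function can
    be pointed at a different mount point or a test directory.
    """
    found = set()
    for pattern in FILTER_PATTERNS:
        found.update(glob.glob(os.path.join(tracing_root, pattern)))
    return sorted(found)


if __name__ == "__main__":
    # An empty listing means the patterns matched nothing readable here.
    for path in matching_filters():
        print(path)
```

If the listing shows no nmi entries at all, the absence of NMI events in the trace would say nothing about whether NMIs fired.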