From: Radim Krčmář
Date: Mon, 30 Mar 2015 20:56:35 +0200
Subject: Re: [Qemu-devel] E5-2620v2 - emulation stop error
To: Andrey Korolyov
Cc: "kvm@vger.kernel.org", "qemu-devel@nongnu.org", "Dr. David Alan Gilbert", Bandan Das, Kevin O'Connor, Gerd Hoffmann, Paolo Bonzini
Message-ID: <20150330185634.GE13271@potion.brq.redhat.com>
References: <20150326155807.GA13271@potion.brq.redhat.com> <20150326163657.GA16305@morn.localdomain> <20150326170654.GB16305@morn.localdomain> <20150326174056.GC13271@potion.brq.redhat.com> <20150326204053.GC27093@potion.brq.redhat.com>

2015-03-27 13:16+0300, Andrey Korolyov:
> On Fri, Mar 27, 2015 at 12:03 AM, Bandan Das wrote:
> > Radim Krčmář writes:
> >> I second Bandan -- checking that it reproduces on other machine would be
> >> great for sanity :)  (Although a bug in our APICv is far more likely.)
> >
> > If it's APICv related, a run without apicv enabled could give more hints.
> >
> > Your "devices not getting reset" hypothesis makes the most sense to me,
> > maybe the timer vector in the error message is just one part of
> > the whole story.
> > Another misbehaving interrupt from the dark comes in at the
> > same time and leads to a double fault.
>
> Default trace (APICv enabled, first reboot introduced the issue):
> http://xdel.ru/downloads/kvm-e5v2-issue/hanged-reboot-apic-on.dat.gz

The relevant part is here,
prefixed with "qemu-system-x86-4180  [002]   697.111550:"

  kvm_exit: reason CR_ACCESS rip 0xd272 info 0 0
  kvm_cr: cr_write 0 = 0x10
  kvm_mmu_get_page: existing sp gfn 0 0/4 q0 direct --- !pge !nxe root 0 sync
  kvm_entry: vcpu 0
  kvm_emulate_insn: f0000:d275: ea 7a d2 00 f0
  kvm_emulate_insn: f0000:d27a: 2e 0f 01 1e f0 6c
  kvm_emulate_insn: f0000:d280: 31 c0
  kvm_emulate_insn: f0000:d282: 8e e0
  kvm_emulate_insn: f0000:d284: 8e e8
  kvm_emulate_insn: f0000:d286: 8e c0
  kvm_emulate_insn: f0000:d288: 8e d8
  kvm_emulate_insn: f0000:d28a: 8e d0
  kvm_entry: vcpu 0
  kvm_exit: reason EXTERNAL_INTERRUPT rip 0xd28f info 0 800000f6
  kvm_entry: vcpu 0
  kvm_exit: reason EPT_VIOLATION rip 0x8dd0 info 184 0
  kvm_page_fault: address f8dd0 error_code 184
  kvm_entry: vcpu 0
  kvm_exit: reason EXTERNAL_INTERRUPT rip 0x8dd0 info 0 800000f6
  kvm_entry: vcpu 0
  kvm_exit: reason EPT_VIOLATION rip 0x76d6 info 184 0
  kvm_page_fault: address f76d6 error_code 184
  kvm_entry: vcpu 0
  kvm_exit: reason EXTERNAL_INTERRUPT rip 0x76d6 info 0 800000f6
  kvm_entry: vcpu 0
  kvm_exit: reason PENDING_INTERRUPT rip 0xd331 info 0 0
  kvm_inj_virq: irq 8
  kvm_entry: vcpu 0
  kvm_exit: reason EXTERNAL_INTERRUPT rip 0xfea5 info 0 800000f6
  kvm_entry: vcpu 0
  kvm_exit: reason EPT_VIOLATION rip 0xfea5 info 184 0
  kvm_page_fault: address ffea5 error_code 184
  kvm_entry: vcpu 0
  kvm_exit: reason EXTERNAL_INTERRUPT rip 0xfea5 info 0 800000f6
  kvm_entry: vcpu 0
  kvm_exit: reason EPT_VIOLATION rip 0xe990 info 184 0
  kvm_page_fault: address fe990 error_code 184
  kvm_entry: vcpu 0
  kvm_exit: reason EXTERNAL_INTERRUPT rip 0xe990 info 0 800000f6
  kvm_entry: vcpu 0
  kvm_exit: reason EXCEPTION_NMI rip 0xd334 info 0 80000b0d
  kvm_userspace_exit: reason KVM_EXIT_INTERNAL_ERROR (17)
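For reading the second "info" value in the kvm_exit lines above, here is a minimal sketch of a decoder. It assumes (this is my assumption, not something the trace states) that on Intel hosts this value is the VMX VM-exit interruption-information field: vector in bits 7:0, type in bits 10:8, error-code-valid in bit 11, valid in bit 31. The function name decode_intr_info is made up for the example.

```python
# Sketch only. Assumption: the second "info" value printed by kvm_exit is the
# VMX VM-exit interruption-information field.

# Interruption types that this sketch names; anything else is reported as "other".
INTR_TYPES = {
    0: "external interrupt",
    2: "NMI",
    3: "hardware exception",
    6: "software exception",
}


def decode_intr_info(info):
    """Split a 32-bit interruption-information value into its fields."""
    return {
        "valid": bool((info >> 31) & 1),           # bit 31
        "vector": info & 0xFF,                     # bits 7:0
        "type": INTR_TYPES.get((info >> 8) & 0x7, "other"),  # bits 10:8
        "has_error_code": bool((info >> 11) & 1),  # bit 11
    }


if __name__ == "__main__":
    # The two values that appear in the excerpt above.
    for raw in (0x800000F6, 0x80000B0D):
        print(hex(raw), decode_intr_info(raw))
```

Under that assumption, 800000f6 is an external interrupt with vector 0xf6, and 80000b0d in the final EXCEPTION_NMI exit is a hardware exception with vector 0x0d and an error code.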
> Trace without APICv (three reboots, just to make sure to hit the
> problematic condition of supposed DF, as it still have not one hundred
> percent reproducibility):
> http://xdel.ru/downloads/kvm-e5v2-issue/apic-off.dat.gz

The trace here contains a well-matching excerpt, just instead of the
EXCEPTION_NMI, it does

  169.905098: kvm_exit: reason EPT_VIOLATION rip 0xd334 info 181 0
  169.905102: kvm_page_fault: address feffd066 error_code 181

and works.

Page fault says we tried to read 0xfeffd066 -- probably IOPB of TSS.
(I guess it is a pre-fetch for the following IO instruction.)

Nothing strikes me when looking at it, but some APICv boots don't fail,
so it would be interesting to compare them ... host's 0xf6 interrupt
(IRQ_WORK_VECTOR) is a possible source of races.
(We could look more closely.  It is fired too often for my liking as well.)
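On the 181 vs. 184 values in the kvm_page_fault lines: a small sketch of how they can be decoded. It assumes (my assumption) that on EPT hosts the error_code printed there is the VMX exit qualification for EPT violations, where bits 0-2 flag read/write/instruction fetch, bits 5:3 give the EPT permissions of the page, bit 7 says the guest linear address is valid, and bit 8 says the access was to the final translation rather than to a paging-structure entry. The name decode_ept_qual is hypothetical.

```python
# Sketch only. Assumption: kvm_page_fault's error_code on an EPT host is the
# VMX exit qualification of the EPT violation.


def decode_ept_qual(qual):
    """Return the access type and related flags of an EPT-violation qualification."""
    return {
        "read": bool(qual & (1 << 0)),        # data read
        "write": bool(qual & (1 << 1)),       # data write
        "fetch": bool(qual & (1 << 2)),       # instruction fetch
        "ept_perms": (qual >> 3) & 0x7,       # R/W/X bits of the EPT entry
        "gla_valid": bool(qual & (1 << 7)),   # guest linear address valid
        "final_translation": bool(qual & (1 << 8)),
    }


if __name__ == "__main__":
    # 0x184 appears in the APICv trace, 0x181 in the trace without APICv.
    for raw in (0x184, 0x181):
        print(hex(raw), decode_ept_qual(raw))
```

With that reading, 184 is an instruction fetch and 181 is a data read, both from pages with no EPT permissions set, which matches "we tried to read 0xfeffd066".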