Date: Tue, 10 Mar 2015 16:57:56 +0000
From: "Dr. David Alan Gilbert"
Message-ID: <20150310165755.GL2338@work-vm>
Subject: Re: [Qemu-devel] E5-2620v2 - emulation stop error
To: Andrey Korolyov
Cc: Bandan Das, "qemu-devel@nongnu.org", "kvm@vger.kernel.org"

* Andrey Korolyov (andrey@xdel.ru) wrote:
> On Sat, Mar 7, 2015 at 3:00 AM, Andrey Korolyov wrote:
> > On Fri, Mar 6, 2015 at 7:57 PM, Bandan Das wrote:
> >> Andrey Korolyov writes:
> >>
> >>> On Fri, Mar 6, 2015 at 1:14 AM, Andrey Korolyov wrote:
> >>>> Hello,
> >>>>
> >>>> recently I've got a couple of shiny new Intel 2620v2s for future
> >>>> replacement of the E5-2620v1, but I experienced relatively many events
> >>>> with emulation errors, all traces look similar to the one below. I am
> >>>> running qemu-2.1 on x86 on top of 3.10 branch for testing purposes but
> >>>> can switch to some other versions if necessary. Most of crashes
> >>>> happened during reboot cycle or at the end of ACPI-based shutdown
> >>>> action, if this can help. I have zero clues of what can introduce such
> >>>> a mess inside same processor family using identical software, as
> >>>> 2620v1 has no similar problem ever.
> >>>> Please let me know if there can be
> >>>> some side measures for making entire story more clear.
> >>>>
> >>>> Thanks!
> >>>>
> >>>> KVM internal error. Suberror: 2
> >>>> extra data[0]: 800000d1
> >>>> extra data[1]: 80000b0d
> >>>> EAX=00000003 EBX=00000000 ECX=00000000 EDX=00000000
> >>>> ESI=00000000 EDI=00000000 EBP=00000000 ESP=00006cd4
> >>>> EIP=0000d3f9 EFL=00010202 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
> >>>> ES =0000 00000000 0000ffff 00009300
> >>>> CS =f000 000f0000 0000ffff 00009b00
> >>>> SS =0000 00000000 0000ffff 00009300
> >>>> DS =0000 00000000 0000ffff 00009300
> >>>> FS =0000 00000000 0000ffff 00009300
> >>>> GS =0000 00000000 0000ffff 00009300
> >>>> LDT=0000 00000000 0000ffff 00008200
> >>>> TR =0000 00000000 0000ffff 00008b00
> >>>> GDT= 000f6e98 00000037
> >>>> IDT= 00000000 000003ff
> >>>> CR0=00000010 CR2=00000000 CR3=00000000 CR4=00000000
> >>>> DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000
> >>>> DR3=0000000000000000
> >>>> DR6=00000000ffff0ff0 DR7=0000000000000400
> >>>> EFER=0000000000000000
> >>>> Code=48 18 67 8c 00 8c d1 8e d9 66 5a 66 58 66 5d 66 c3 cd 02 cb
> >>>> 10 cb cd 13 cb cd 15 cb cd 16 cb cd 18 cb cd 19 cb cd 1c cb fa fc 66
> >>>> b8 00 e0 00 00 8e
> >>>
> >>>
> >>> It turns out that those errors are introduced by APICv, which gets
> >>> enabled due to different feature set. If anyone is interested in
> >>> reproducing/fixing this exactly on 3.10, it takes about one hundred of
> >>> migrations/power state changes for an issue to appear, guest OS can be
> >>> Linux or Win.
> >>
> >> Are you able to reproduce this on a more recent upstream kernel as well ?
> >>
> >> Bandan
> >
> > I'll go through test cycle with 3.18 and 2603v2 around tomorrow and
> > follow up with any reproducible results.
>
> Heh.. issue is not triggered on 2603v2 at all, at least I am not able
> to hit this. The only difference with 2620v2 except lower frequency is
> an Intel Dynamic Acceleration feature.
> I'd appreciate any testing with
> higher CPU models with same or richer feature set. The testing itself
> can be done on both generic 3.10 or RH7 kernels, as both of them are
> experiencing this issue. I conducted all tests with disabled cstates
> so I advise to do the same for a first reproduction step.
>
> Thanks!
>
> model name : Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz
> stepping : 4
> microcode : 0x416
> cpu MHz : 2100.039
> cache size : 15360 KB
> siblings : 12
> apicid : 43
> initial apicid : 43
> fpu : yes
> fpu_exception : yes
> cpuid level : 13
> wp : yes
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
> mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
> syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts
> rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq
> dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca
> sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c
> rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi
> flexpriority ept vpid fsgsbase smep erms

I'm seeing something similar; it's very intermittent and generally
happening right at boot of the guest; I'm running this on qemu head+my
postcopy world (but it's happening right at boot before postcopy gets a
chance), and I'm using a 3.19ish kernel.
Xeon E5-2407 in my case but hey maybe I'm seeing a different bug.

Dave
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
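
Note on reading the quoted dump: the two "extra data" words can be picked
apart with a few lines of Python. This is only a sketch under assumptions
that are not stated anywhere in the thread: that for "Suberror: 2" KVM puts
the IDT-vectoring information in extra data[0] and the VM-exit interruption
information in extra data[1], and that both words use the Intel
interruption-information layout (bits 7:0 vector, bits 10:8 type, bit 11
error code valid, bit 31 valid). The helper name decode_event_info is made
up for illustration; check the layout against the kernel source for the
version in use before relying on it.

```python
# Sketch only: decode one 32-bit interruption-information word.
# Assumed layout (not taken from the thread):
#   bits 7:0  vector
#   bits 10:8 event type
#   bit 11    error code valid
#   bit 31    valid
EVENT_TYPES = {
    0: "external interrupt",
    1: "reserved",
    2: "NMI",
    3: "hardware exception",
    4: "software interrupt",
    5: "privileged software exception",
    6: "software exception",
    7: "other event",
}


def decode_event_info(word):
    """Split a 32-bit word into the assumed fields (hypothetical helper)."""
    return {
        "valid": bool((word >> 31) & 1),
        "vector": word & 0xFF,
        "type": EVENT_TYPES[(word >> 8) & 0x7],
        "error_code_valid": bool((word >> 11) & 1),
    }


if __name__ == "__main__":
    # The two values from the quoted "KVM internal error" dump.
    for label, word in (("extra data[0]", 0x800000D1),
                        ("extra data[1]", 0x80000B0D)):
        print(label, hex(word), decode_event_info(word))
```

Under those assumptions, 800000d1 reads as a valid external interrupt with
vector 0xd1, and 80000b0d as a valid hardware exception with vector 13 and
an error code, i.e. a fault raised while that interrupt was being delivered.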