From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Thimo E."
Subject: Re: cpuidle and un-eoid interrupts at the local apic
Date: Wed, 04 Sep 2013 21:56:40 +0200
Message-ID: <52279078.3030701@digithi.de>
References: <51A908CA.7050604@citrix.com><51F8CB15.1070608@digithi.de>
 <51F8DD40.2090207@citrix.com><51FC37A9.9090809@digithi.de>
 <51FC418D.8020708@citrix.com><51FFBA8502000078000E9462@nat28.tlf.novell.com>
 <51FFBC08.6070804@citrix.com><52055EC9.8030207@digithi.de>
 <520561E1.8020809@citrix.com><520562C8.8080703@citrix.com>
 <5207CE0C.1000502@digithi.de><5208CC8A.7070703@digithi.de>
 <5208CF6B.7030505@citrix.com><5212365E.7010803@digithi.de>
 <52130202.5020909@digithi.de><521347A702000078000ED015@nat28.tlf.novell.com>
 <52170DC4.30507@digithi.de> <52277CDA.8010401@digithi.de>
 <5227821A.9090201@citrix.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; Format="flowed"
Content-Transfer-Encoding: 7bit
In-Reply-To: <5227821A.9090201@citrix.com>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Andrew Cooper
Cc: Keir Fraser, Jan Beulich, "Dong, Eddie", Xen-develList,
 "Nakajima, Jun", "Zhang, Yang Z", "Zhang, Xiantao"
List-Id: xen-devel@lists.xenproject.org

Hello Andrew,

thanks for your response. At least I have already seen the trigger of
the new crash (2e) before, so they seem to belong together.

I can't imagine that I am the only one in the world who is using a
Haswell board. And as I haven't seen any other Xen bug/crash reports
like mine (and, once, yours), nor bug reports from users of other
operating systems, I ask myself whether only my hardware is buggy or
whether other operating systems handle those "spurious" interrupts in
another way?

What does "ioapic_ack=old" change?

Best regards
   Thimo

On 04.09.2013 20:55, Andrew Cooper wrote:
> On 04/09/13 19:32, Thimo E. wrote:
>> Hello again,
>>
>> for the last two weeks there was no crash with dom0_vcpus_pin and
>> dom0 restricted to 1 cpu. But yesterday it crashed again. So I changed
>> the command line again to:
>>
>> iommu=no-intremap noirqbalance com1=115200,8n1,0xe050,0
>> console=com1,vga mem=1024G dom0_max_vcpus=4 dom0_mem=752M,max:752M
>> watchdog_timeout=300 lowmem_emergency_pool=1M crashkernel=64M@32M
>> cpuid_mask_xsave_eax=0
>>
>> And today the server crashed again and produced a lot of debugging
>> messages, see attached. The "..." in the logfiles means that the
>> message above the dots was repeated very often.
>>
>> My summary so far:
>> - With only 1 cpu attached to dom0 the server was stable for 2 weeks;
>> the crash there did not really show any irq problems, see
>> crash20130903.txt
>> You can find Andrew's ideas on this in
>> http://forums.citrix.com/thread.jspa?messageID=1760771#1760771
>> - With more than 1 cpu and irqbalance the server produced the crashes
>> I've already posted before
>> - Without irqbalance it crashed with some other fancy output, see
>> crash20130904.txt
>>
>> The next step is to change the network card.
>>
>> Zhang, any update from your side? Or do the others have any idea?
>> Could "ioapic_ack=old" help somewhere?
>>
>> Best regards
>> Thimo
>>
> Ok - the second attachment (crash20130903.txt) is the one I have triaged
> before, and the crash is impossible given the expected code flow through
> the function.
>
> %r14 is calculated as the per-cpu cpu_info, which cannot possibly be
> -1 at the point of the fault. The only explanation is that the
> pagefault is a result of a spurious jump to this location.
>
> From a quick glance at the other crash, vector 2e was the problematic
> one (iirc). The "Bad vmexit (reason 3)" at the top would suggest that
> something on the system has sent an INIT to pcpu 2, which seems antisocial.
>
> As we have identified that the hardware is delivering invalid
> interrupts, I wouldn't necessarily read any more into this new crash;
> something is very broken in the hardware.
>
> I would be interested in any update from Intel regarding the ISR violation.
>
> ~Andrew