From mboxrd@z Thu Jan  1 00:00:00 1970
From: "Thimo E." <abc@digithi.de>
Subject: Re: cpuidle and un-eoid interrupts at the local apic
Date: Wed, 04 Sep 2013 21:56:40 +0200
Message-ID: <52279078.3030701@digithi.de>
References: <51A908CA.7050604@citrix.com><51F8CB15.1070608@digithi.de><51F8DD40.2090207@citrix.com><51FC37A9.9090809@digithi.de><51FC418D.8020708@citrix.com><51FFBA8502000078000E9462@nat28.tlf.novell.com><51FFBC08.6070804@citrix.com><52055EC9.8030207@digithi.de><520561E1.8020809@citrix.com><520562C8.8080703@citrix.com><5207CE0C.1000502@digithi.de><A9667DDFB95DB7438FA9D7D576C3D87E0A8E11A4@SHSMSX104.ccr.corp.intel.com><5208CC8A.7070703@digithi.de><5208CF6B.7030505@citrix.com><5212365E.7010803@digithi.de><52130202.5020909@digithi.de><521347A702000078000ED015@nat28.tlf.novell.com><A9667DDFB95DB7438FA9D7D576C3D87E0A8E70C4@SHSMSX104.ccr.corp.intel.com><52170DC4.30507@digithi.de>
	<A9667DDFB95DB7438FA9D7D576C3D87E0A8EF776@SHSMSX104.ccr.corp.intel.com>
	<52277CDA.8010401@digithi.de> <5227821A.9090201@citrix.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; Format="flowed"
Content-Transfer-Encoding: 7bit
Return-path: <xen-devel-bounces@lists.xen.org>
In-Reply-To: <5227821A.9090201@citrix.com>
List-Unsubscribe: <http://lists.xen.org/cgi-bin/mailman/options/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xen.org>
List-Help: <mailto:xen-devel-request@lists.xen.org?subject=help>
List-Subscribe: <http://lists.xen.org/cgi-bin/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=subscribe>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Keir Fraser <keir@xen.org>, Jan Beulich <JBeulich@suse.com>, "Dong,
	Eddie" <eddie.dong@intel.com>, Xen-develList <xen-devel@lists.xen.org>, "Nakajima,
	Jun" <jun.nakajima@intel.com>, "Zhang,
	Yang Z" <yang.z.zhang@intel.com>, "Zhang,
	Xiantao" <xiantao.zhang@intel.com>
List-Id: xen-devel@lists.xenproject.org

Hello Andrew,

thanks for your response. I have at least seen the trigger of the new 
crash (2e) before, so the crashes seem to belong together.

I can't imagine that I am the only one in the world who is using a Haswell 
board. As I haven't seen any other Xen bug/crash reports
like mine (apart from the one time you saw it), nor bug reports from users of other 
operating systems, I ask myself whether only my hardware is buggy
or whether other operating systems handle those "spurious" interrupts in 
a different way.

What does "ioapic_ack=old" change?

Best regards
   Thimo

Am 04.09.2013 20:55, schrieb Andrew Cooper:
> On 04/09/13 19:32, Thimo E. wrote:
>> Hello again,
>>
>> the last two weeks no crash with pinning dom0_vcpus_pin and
>> restricting dom0 to 1 cpu. But yesterday it crashed again. So changed
>> the command line again to:
>>
>> iommu=no-intremap noirqbalance com1=115200,8n1,0xe050,0
>> console=com1,vga mem=1024G dom0_max_vcpus=4 dom0_mem=752M,max:752M
>> watchdog_timeout=300 lowmem_emergency_pool=1M crashkernel=64M@32M
>> cpuid_mask_xsave_eax=0
>>
>> And today server crashed again and produced a lot of debugging
>> messages, see attached. The "..." in the logfiles mean that the
>> message above the points was repeated very often.
>>
>> My summary so far:
>> - With only 1 cpu attached to dom0 the server was stable for 2 weeks,
>> the crash there did not really show any irq problems, see
>> crash20130903.txt
>>     You can find Andrew's ideas on this in
>> http://forums.citrix.com/thread.jspa?messageID=1760771#1760771
>> - With more than 1 cpu and irqbalance the server produced the crashes
>> I've already posted before
>> - Without irqbalance crash with some other fancy output, see
>> crash20130904.txt
>>
>> Next step is to change the network card.
>>
>> Zhang, any update from your side ? Or do the others have any idea ?
>> Could "ioapic_ack=old" help somewhere ?
>>
>> Best regards
>>    Thimo
>>
> Ok - the second attachment (crash20130903.txt) is the one I have triaged
> before, and the crash is impossible given the expected code flow through
> the function.
>
> %r14 is calculated as the per-cpu cpu_info, which cannot possibly be
> -1 at the point of the fault.  The only explanation is that the
> pagefault is a result of a spurious jump to this location.
>
>  From a quick glance at the other crash, vector 2e was the problematic
> one (iirc).  The "Bad vmexit (reason 3)" at the top would suggest that
> something on the system has sent an INIT to pcpu 2, which seems antisocial.
>
> As we have identified that the hardware is delivering invalid
> interrupts, I wouldn't necessarily read any more into this new crash;
> something is very broken in the hardware.
>
> I would be interested for any update from Intel regarding the ISR violation.
>
> ~Andrew