From mboxrd@z Thu Jan 1 00:00:00 1970
From: Andrew Cooper
Subject: Re: [PATCH] x86/watchdog: Use real timestamps for watchdog timeout
Date: Fri, 24 May 2013 11:03:25 +0100
Message-ID: <519F3AED.2090209@citrix.com>
References: <20130524093712.GA54769@ocelot.phlegethon.org>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
In-Reply-To: <20130524093712.GA54769@ocelot.phlegethon.org>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Tim Deegan
Cc: "Keir (Xen.org)", Jan Beulich, "xen-devel@lists.xen.org"
List-Id: xen-devel@lists.xenproject.org

On 24/05/13 10:37, Tim Deegan wrote:
> At 21:32 +0100 on 23 May (1369344726), Andrew Cooper wrote:
>> Do not assume that we will only receive interrupts at a rate of nmi_hz. On a
>> test system being debugged, I observed a PCI SERR being continuously asserted
>> without the SERR bit being set. The result was Xen "exceeding" a 300 second
>> timeout within 1 second.
> Sounds like the CPU is indeed stuck, and the watchdog has just optimized
> away the 5 minutes of back-to-back NMIs. :)
>
> Handling this case it nice, but I wonder whether this patch ought to
> detect and report ludicrous NMI rates rather than silently ignoring
> them. I guess that's hard to do in an NMI handler, other than by
> adjusting the printk when we crash.
>
> Tim.

Actually I suspect the system was livelocked with PCI SERRs being issued
from a PCIe switch.

I only have second granularity on the serial console, but can confirm
that cpu0 was perfectly alive and well within the same second as the
watchdog supposedly expiring.

I was considering trying to work around a ludicrous rate of interrupts,
but decided to go for the easier patch first

~Andrew
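
For illustration only, the difference between a watchdog that counts NMIs and one that compares real timestamps might be sketched as below. This is not the code from the patch: the function names, the TIMEOUT_SECONDS constant and the nanosecond timestamps are all assumptions made for the example. Only nmi_hz and the 300 second timeout come from the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names throughout; this is not Xen's watchdog code. */
#define TIMEOUT_SECONDS 300u

/*
 * Count-based check: treats every NMI as one tick of a clock running at
 * nmi_hz. If NMIs arrive faster than nmi_hz (for example from a device
 * continuously asserting an error), the count reaches the threshold
 * long before TIMEOUT_SECONDS of real time has passed.
 */
static bool expired_by_count(uint64_t nmis_since_progress, unsigned int nmi_hz)
{
    assert(nmi_hz > 0);
    return nmis_since_progress >= (uint64_t)TIMEOUT_SECONDS * nmi_hz;
}

/*
 * Timestamp-based check: compares elapsed time since the CPU last made
 * progress, so the result does not depend on how often NMIs arrive.
 */
static bool expired_by_time(uint64_t now_ns, uint64_t last_progress_ns)
{
    const uint64_t timeout_ns = (uint64_t)TIMEOUT_SECONDS * 1000000000ull;
    return now_ns - last_progress_ns >= timeout_ns;
}
```

Under these assumptions, 300 NMIs delivered within one second with nmi_hz set to 1 would satisfy expired_by_count(), while expired_by_time() would still report that only one second has elapsed.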