From mboxrd@z Thu Jan  1 00:00:00 1970
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: Re: [PATCH] x86/watchdog: Use real timestamps for
	watchdog timeout
Date: Fri, 24 May 2013 11:03:25 +0100
Message-ID: <519F3AED.2090209@citrix.com>
References: <ebb0070be9fd3fb26bec.1369341126@andrewcoop.uk.xensource.com>
	<20130524093712.GA54769@ocelot.phlegethon.org>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Return-path: <xen-devel-bounces@lists.xen.org>
In-Reply-To: <20130524093712.GA54769@ocelot.phlegethon.org>
List-Unsubscribe: <http://lists.xen.org/cgi-bin/mailman/options/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xen.org>
List-Help: <mailto:xen-devel-request@lists.xen.org?subject=help>
List-Subscribe: <http://lists.xen.org/cgi-bin/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=subscribe>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Tim Deegan <tim@xen.org>
Cc: "Keir (Xen.org)" <keir@xen.org>, Jan Beulich <JBeulich@suse.com>, "xen-devel@lists.xen.org" <xen-devel@lists.xen.org>
List-Id: xen-devel@lists.xenproject.org

On 24/05/13 10:37, Tim Deegan wrote:
> At 21:32 +0100 on 23 May (1369344726), Andrew Cooper wrote:
>> Do not assume that we will only receive interrupts at a rate of nmi_hz.  On a
>> test system being debugged, I observed a PCI SERR being continuously asserted
>> without the SERR bit being set.  The result was Xen "exceeding" a 300 second
>> timeout within 1 second.
> Sounds like the CPU is indeed stuck, and the watchdog has just optimized
> away the 5 minutes of back-to-back NMIs. :)
>
> Handling this case is nice, but I wonder whether this patch ought to
> detect and report ludicrous NMI rates rather than silently ignoring
> them.  I guess that's hard to do in an NMI handler, other than by
> adjusting the printk when we crash.
>
> Tim.

Actually, I suspect the system was livelocked by PCI SERRs being issued
from a PCIe switch.  I only have one-second granularity on the serial
console, but can confirm that cpu0 was perfectly alive and well within
the same second in which the watchdog supposedly expired.

I was considering trying to work around a ludicrous rate of interrupts,
but decided to go for the easier patch first.
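
To illustrate the difference, here is a rough standalone sketch (not the
actual Xen code; every name in it is hypothetical, and the timestamp is
simply passed in as nanoseconds rather than read from a real clock).  The
tick-based check assumes each NMI represents 1/nmi_hz seconds, so a burst
of NMIs "uses up" the timeout almost instantly, whereas the timestamp-based
check only cares how much real time has passed since the CPU last made
progress.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical constants for illustration only. */
#define NMI_HZ          1u            /* assumed watchdog NMI rate */
#define TIMEOUT_SECONDS 300u          /* timeout from the report above */
#define NS_PER_SEC      1000000000ull

/* Tick-based check: every NMI is assumed to be worth 1/NMI_HZ seconds. */
struct tick_watchdog {
    unsigned int stuck_ticks;         /* NMIs seen without progress */
};

static bool tick_watchdog_expired(struct tick_watchdog *w,
                                  bool cpu_made_progress)
{
    if (cpu_made_progress) {
        w->stuck_ticks = 0;
        return false;
    }
    /* Fires after a fixed number of NMIs, however fast they arrive. */
    return ++w->stuck_ticks >= TIMEOUT_SECONDS * NMI_HZ;
}

/* Timestamp-based check: compare real elapsed time against the timeout. */
struct time_watchdog {
    uint64_t last_progress_ns;        /* time progress was last observed */
};

static bool time_watchdog_expired(struct time_watchdog *w,
                                  bool cpu_made_progress,
                                  uint64_t now_ns)
{
    if (cpu_made_progress) {
        w->last_progress_ns = now_ns;
        return false;
    }
    /* Fires only once TIMEOUT_SECONDS of real time have elapsed. */
    return now_ns - w->last_progress_ns >= TIMEOUT_SECONDS * NS_PER_SEC;
}
```

With 300 NMIs arriving inside one second, the first check reports an
expiry while the second does not, which matches what I observed.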

~Andrew