From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1754665Ab0IMSYI (ORCPT ); Mon, 13 Sep 2010 14:24:08 -0400
Received: from mx1.redhat.com ([209.132.183.28]:4550 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1752515Ab0IMSYG (ORCPT ); Mon, 13 Sep 2010 14:24:06 -0400
Date: Mon, 13 Sep 2010 14:23:54 -0400
From: Don Zickus
To: Andi Kleen
Cc: Huang Ying, Ingo Molnar, "H. Peter Anvin",
	"linux-kernel@vger.kernel.org"
Subject: Re: [RFC 5/6] x86, NMI, Add support to notify hardware error with
	unknown NMI
Message-ID: <20100913182354.GE26290@redhat.com>
References: <20100910160211.GH4879@redhat.com>
	<20100910181929.4f35ab7c@basil.nowhere.org>
	<20100910184039.GK4879@redhat.com>
	<1284344389.3269.82.camel@yhuang-dev.sh.intel.com>
	<20100913141140.GB27371@redhat.com>
	<20100913172438.37443bf7@basil.nowhere.org>
	<20100913154750.GA26290@redhat.com>
	<20100913185721.59ad9b4d@basil.nowhere.org>
	<20100913175346.GC26290@redhat.com>
	<20100913200707.3b31429e@basil.nowhere.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20100913200707.3b31429e@basil.nowhere.org>
User-Agent: Mutt/1.5.20 (2009-08-17)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Sep 13, 2010 at 08:07:07PM +0200, Andi Kleen wrote:
> >
> > Honestly, I don't think you need much screen real estate.  It would be
> > nice when an unknown NMI comes in, if the kernel just pokes around
> > the hardware registers and displays a summary of what it found.  For
> > example,
> >
> > The following devices had error bits set in the status registers:
> > PCI device x:y.z - STATUS_BIT1 | STATUS_BIT2
> > HW device xyz - STATUS_BIT3
> > ...
>
> You mean data from the generic PCI config space?

Yes.  I normally just look at the Status register.  With PCI-e I'll look at
the other status registers in the capabilities field too.

>
> I don't think I would feel comfortable with arbitrary driver callbacks
> (the risk of the driver breaking the panic would be high)

Neither would I.

>
> But if it's generic, if not on the screen it should
> be at least in the error serialization data and logged after boot.

I guess I don't know what 'error serialization data' is.  Is there
somewhere I can read more about it?

>
> At least on PCI-E it may be enough to simply dump all recent AER
> data.

This assumes AER is supported on the bridge?  That is probably true for
newer chips, but I wasn't sure about older ones.  How would I dump AER
data from within the kernel?

>
> >
> > But I guess if we accept the fact that an unknown NMI will panic the
> > box, then we can probably be a little more liberal in breaking
> > spinlocks and poking around the hardware to display some useful info.
>
> You have to be a bit careful with that, you may cause nested errors
> (e.g. machine checks or more NMIs). I suppose this could be checked for
> though.

Of course.

Cheers,
Don
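
A minimal userspace sketch of the status summary described above, assuming
the 16-bit PCI Status register (config offset 0x06) has already been read
for a device.  The bit values are the standard PCI Status error bits (the
same values as the kernel's PCI_STATUS_* definitions); the function name
decode_pci_status() and the short bit labels are hypothetical and chosen
only for illustration.  This is not the code proposed in the patch series,
and it does not touch the PCI-e capability or AER registers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Error bits of the 16-bit PCI Status register (config offset 0x06). */
#define ST_PARITY            0x0100 /* master data parity error */
#define ST_SIG_TARGET_ABORT  0x0800 /* signaled target abort */
#define ST_REC_TARGET_ABORT  0x1000 /* received target abort */
#define ST_REC_MASTER_ABORT  0x2000 /* received master abort */
#define ST_SIG_SYSTEM_ERROR  0x4000 /* signaled system error (SERR#) */
#define ST_DETECTED_PARITY   0x8000 /* detected parity error */

/* Hypothetical label table, ordered by ascending bit position. */
struct status_bit {
	uint16_t mask;
	const char *name;
};

static const struct status_bit status_bits[] = {
	{ ST_PARITY,           "PARITY" },
	{ ST_SIG_TARGET_ABORT, "SIG_TARGET_ABORT" },
	{ ST_REC_TARGET_ABORT, "REC_TARGET_ABORT" },
	{ ST_REC_MASTER_ABORT, "REC_MASTER_ABORT" },
	{ ST_SIG_SYSTEM_ERROR, "SIG_SYSTEM_ERROR" },
	{ ST_DETECTED_PARITY,  "DETECTED_PARITY" },
};

/*
 * Write the labels of all error bits set in 'status' into 'buf',
 * separated by " | ".  Non-error bits are ignored.  Returns the number
 * of labels written; stops early if 'buf' is too small for the next one.
 */
static int decode_pci_status(uint16_t status, char *buf, size_t len)
{
	size_t used = 0;
	int count = 0;

	assert(buf != NULL && len > 0);
	buf[0] = '\0';

	for (size_t i = 0; i < sizeof(status_bits) / sizeof(status_bits[0]); i++) {
		if (!(status & status_bits[i].mask))
			continue;

		int n = snprintf(buf + used, len - used, "%s%s",
				 count ? " | " : "", status_bits[i].name);
		if (n < 0 || (size_t)n >= len - used) {
			buf[used] = '\0';	/* drop the partial label */
			break;
		}
		used += (size_t)n;
		count++;
	}
	return count;
}
```

A caller would print one line per device for which decode_pci_status()
returns non-zero, in the "PCI device x:y.z - ..." style shown in the quoted
example.  Keeping the decode step free of locks and driver callbacks matches
the concern raised above about not breaking the panic path.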