linuxppc-dev.lists.ozlabs.org archive mirror
From: "Michal Suchánek" <msuchanek@suse.de>
To: Nicholas Piggin <npiggin@gmail.com>
Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>,
	"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>,
	Michal Suchanek <msuchanek@suse.com>,
	Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com>,
	linuxppc-dev <linuxppc-dev@ozlabs.org>,
	"Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>,
	Laurent Dufour <ldufour@linux.vnet.ibm.com>
Subject: Re: [PATCH v7 8/9] powerpc/mce: Add sysctl control for recovery action on MCE.
Date: Thu, 9 Aug 2018 12:26:46 +0200
Message-ID: <20180809122646.2f1a827d@naga.suse.cz>
In-Reply-To: <20180809183333.6097d5ec@roar.ozlabs.ibm.com>

On Thu, 9 Aug 2018 18:33:33 +1000
Nicholas Piggin <npiggin@gmail.com> wrote:

> On Thu, 9 Aug 2018 13:39:45 +0530
> Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com> wrote:
> 
> > On Thu, Aug 09, 2018 at 06:02:53PM +1000, Nicholas Piggin wrote:  
> > > On Thu, 09 Aug 2018 16:34:07 +1000
> > > Michael Ellerman <mpe@ellerman.id.au> wrote:
> > >     
> > > > "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:    
> > > > > On 08/08/2018 08:26 PM, Michael Ellerman wrote:      
> > > > >> Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com> writes:      
> > > > >>> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
> > > > >>>
> > > > >>> Introduce recovery action for recovered memory errors
> > > > >>> (MCEs). There are soft memory errors like SLB Multihit,
> > > > >>> which can be a result of a bad hardware OR software BUG.
> > > > >>> Kernel can easily recover from these soft errors by
> > > > >>> flushing SLB contents. After the recovery kernel can still
> > > > >>> continue to function without any issue. But in some
> > > > > >>> scenarios we may keep getting these soft errors until the
> > > > >>> root cause is fixed. To be able to analyze and find the
> > > > >>> root cause, best way is to gather enough data and system
> > > > >>> state at the time of MCE. Hence this patch introduces a
> > > > >>> sysctl knob where user can decide either to continue after
> > > > >>> recovery or panic the kernel to capture the dump.      
> > > > >> 
> > > > >> I'm not convinced we want this.
> > > > >> 
> > > > >> As we've discovered it's often not possible to reconstruct
> > > > >> what happened based on a dump anyway.
> > > > >> 
> > > > >> The key thing you need is the content of the SLB and that's
> > > > >> not included in a dump.
> > > > >> 
> > > > >> So I think we should dump the SLB content when we get the
> > > > >> MCE (which this series does) and any other useful info, and
> > > > >> then if we can recover we should.      
> > > > >
> > > > > > The reasoning there is: what if we got the multi-hit due to some
> > > > > > corruption in slb_cache_ptr, i.e. some part of the kernel is
> > > > > > wrongly updating the paca data structure through a bad
> > > > > > pointer? That is far-fetched, but possible, right?
> > > > > > Hence the idea that, if the dmesg output (which includes the SLB
> > > > > > contents, slb_cache contents, etc.) doesn't give much insight into
> > > > > > why an SLB multi-hit occurred, there should be an easy way to force
> > > > > > a dump that might assist in further debugging.
> > > > 
> > > > If you're debugging something complex that you can't determine
> > > > from the SLB dump then you should be running a debug kernel
> > > > anyway. And if anything you want to drop into xmon and sit
> > > > there, preserving the most state, rather than taking a dump.    
> > > 
> > > I'm not saying for a dump specifically, just some form of crash.
> > > And we really should have an option to xmon on panic, but that's
> > > another story.    
> > 
> > That's fine during development or in a lab, not something we could
> > enforce in a customer environment, could we?  
> 
> xmon on panic? Not something to enforce, but IMO (without thinking
> about it too much, but having encountered it several times) it should
> probably be tied to the xmon on BUG option.

You should get that with this patch and xmon=on, or am I missing
something?
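
To make the behaviour under discussion concrete, here is a minimal
user-space sketch of the decision the proposed knob controls: after a
recovered MCE, either continue or panic. All names (recover_on_mce,
mce_recovery_action, enum mce_action) are hypothetical and not taken from
the patch, and treating an unrecovered error as fatal is an assumption of
the sketch. Whether the panic path then lands in xmon with xmon=on is
exactly the open question above, so it is not modelled here.

```c
#include <stdbool.h>

/* Hypothetical names throughout; none are taken from the patch. */
enum mce_action { MCE_CONTINUE, MCE_PANIC };

/*
 * Stand-in for the sysctl value (assumed semantics):
 * nonzero = keep running after a recovered MCE,
 * zero    = panic so that a dump can be captured.
 */
static int recover_on_mce = 1;

/* Decide what to do once the MCE handler has finished its recovery attempt. */
static enum mce_action mce_recovery_action(bool recovered)
{
	/* Assumption of this sketch: an unrecovered error is always fatal. */
	if (!recovered)
		return MCE_PANIC;

	/* Recovered error: the knob chooses between continuing and panicking. */
	return recover_on_mce ? MCE_CONTINUE : MCE_PANIC;
}
```

With the knob left at its assumed default, a recovered SLB multihit would
just continue; clearing it turns the same event into a panic.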

Thanks

Michal

