[PATCH] powerpc/powernv: Panic on unhandled Machine Check

* [PATCH] powerpc/powernv: Panic on unhandled Machine Check
@ 2015-09-23  6:41 Daniel Axtens
  2015-09-24  0:12 ` Andrew Donnellan
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Daniel Axtens @ 2015-09-23  6:41 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: mpe, benh, mikey, imunsie, andrew.donnellan, Matthew R. Ochs,
	Manoj Kumar, Daniel Axtens

All unrecovered machine check errors on PowerNV should cause an
immediate panic. There are 2 reasons that this is the right policy:
it's not safe to continue, and we're already trying to reboot.

Firstly, if we go through the recovery process and do not successfully
recover, we can't be sure about the state of the machine, and it is
not safe to proceed.

Linux knows about the following sources of Machine Check Errors:
- Uncorrectable Errors (UE)
- Effective to Real Address Translation (ERAT)
- Segment Lookaside Buffer (SLB)
- Translation Lookaside Buffer (TLB)
- Unknown/Unrecognised

In the SLB, TLB and ERAT cases, we can further categorise these as
parity errors, multihit errors or unknown/unrecognised.

We can handle SLB errors by flushing and reloading the SLB. We can
handle TLB and ERAT multihit errors by flushing the TLB. (It appears
we may not handle TLB and ERAT parity errors: I will investigate
further and send a followup patch if appropriate.)
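
As a rough illustration of the classification above (a standalone
sketch, not the kernel's code: every enum, struct and function name
here is hypothetical, and TLB/ERAT parity errors are assumed to be
unhandled, which is exactly the open question noted above):

```c
#include <assert.h>
#include <stddef.h>
#include <stdbool.h>

/* Hypothetical names for illustration; not the kernel's definitions. */
enum mce_source {
	MCE_SRC_UE,
	MCE_SRC_ERAT,
	MCE_SRC_SLB,
	MCE_SRC_TLB,
	MCE_SRC_UNKNOWN,
};

enum mce_kind {
	MCE_KIND_PARITY,
	MCE_KIND_MULTIHIT,
	MCE_KIND_UNKNOWN,
};

enum mce_action {
	MCE_ACT_NONE,		/* no recovery action known here */
	MCE_ACT_FLUSH_SLB,	/* flush and reload the SLB */
	MCE_ACT_FLUSH_TLB,	/* flush the TLB */
};

struct mce_event {
	enum mce_source source;
	enum mce_kind kind;
};

/* Pick the recovery action described in the text for a given error. */
enum mce_action mce_recovery_action(const struct mce_event *ev)
{
	assert(ev != NULL);

	switch (ev->source) {
	case MCE_SRC_SLB:
		return MCE_ACT_FLUSH_SLB;
	case MCE_SRC_TLB:
	case MCE_SRC_ERAT:
		/* Assumption: only multihit errors are handled. */
		if (ev->kind == MCE_KIND_MULTIHIT)
			return MCE_ACT_FLUSH_TLB;
		return MCE_ACT_NONE;
	default:
		/* UEs and unknown errors are not recoverable by flushing. */
		return MCE_ACT_NONE;
	}
}

bool mce_has_recovery_action(const struct mce_event *ev)
{
	return mce_recovery_action(ev) != MCE_ACT_NONE;
}
```

Anything that comes back as MCE_ACT_NONE in this sketch is what the
rest of this message is concerned with.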

This leaves us with uncorrectable errors. Uncorrectable errors are
usually the result of ECC memory detecting an error that it cannot
correct, but they also crop up in the context of PCI cards failing
during DMA writes, and during CAPI error events.

There are several types of UE, and there are 3 places a UE can occur:
Skiboot, the kernel, and userspace. For Skiboot errors, we have the
facility to make some recoverable. For userspace, we can simply kill
(SIGBUS) the affected process. We have no meaningful way to deal with
UEs in kernel space or in unrecoverable sections of Skiboot.
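
The UE cases above can be summarised in a small standalone sketch.
This is only a model of the policy as described, not kernel code: the
types, the function and the skiboot_section_recoverable flag are all
made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdbool.h>

/* Hypothetical names for illustration only. */
enum ue_location {
	UE_IN_SKIBOOT,
	UE_IN_KERNEL,
	UE_IN_USERSPACE,
};

enum ue_outcome {
	UE_RECOVERED,		/* Skiboot section that was made recoverable */
	UE_KILL_PROCESS,	/* SIGBUS the affected userspace process */
	UE_UNRECOVERED,		/* no meaningful way to deal with it */
};

struct ue_event {
	enum ue_location where;
	/* Only meaningful when where == UE_IN_SKIBOOT. */
	bool skiboot_section_recoverable;
};

/* Map where a UE occurred to what can be done about it. */
enum ue_outcome ue_policy(const struct ue_event *ev)
{
	assert(ev != NULL);

	switch (ev->where) {
	case UE_IN_USERSPACE:
		return UE_KILL_PROCESS;
	case UE_IN_SKIBOOT:
		return ev->skiboot_section_recoverable ?
			UE_RECOVERED : UE_UNRECOVERED;
	case UE_IN_KERNEL:
	default:
		return UE_UNRECOVERED;
	}
}
```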

Currently, these unrecovered UEs fall through to
machine_check_exception() in traps.c, which calls die(), which OOPSes
and sends SIGBUS to the process. This sometimes allows us to stumble
onwards. For example, we've seen UEs kill the kernel threads eehd and
khugepaged. However, the process killed could have held a lock, or it
could have been a more important process, etc: we can no longer make
any assertions about the state of the machine. Similarly, if we see a
UE in Skiboot (and again we've seen this happen), we're not in a
position where we can make any assertions about the state of the
machine.

Likewise, for unknown or unrecognised errors, we're not able to say
anything about the state of the machine.

Therefore, if we have an unrecovered MCE, the most appropriate thing
to do is to panic.

The second reason is that since e784b6499d9c ("powerpc/powernv: Invoke
opal_cec_reboot2() on unrecoverable machine check errors."), we
attempt a special OPAL reboot on an unhandled MCE. This is so the
hardware can record error data for later debugging.

The comments in that commit assert that we are heading down the panic
path anyway. At the moment this is not always true. UEs in kernel
space, for instance, are marked as recoverable by the hardware, so if
the attempt to reboot failed (e.g. old Skiboot), we wouldn't panic()
but would simply die() and OOPS. It doesn't make sense to be
staggering on if we've just tried to reboot: we should panic().
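
To make the before/after behaviour concrete, here is a standalone
model of the decision described above. It is a sketch under stated
assumptions, not the code in opal.c or traps.c: the struct, the enum
and the "patched" switch are hypothetical, and the reboot attempt is
reduced to a single flag.

```c
#include <assert.h>
#include <stddef.h>
#include <stdbool.h>

/* Hypothetical names for illustration only. */
enum mce_outcome {
	MCE_REBOOTED,		/* the special OPAL reboot took effect */
	MCE_PANIC,		/* panic() */
	MCE_OOPS_AND_CONTINUE,	/* die(), OOPS, stumble onwards */
};

struct unrecovered_mce {
	bool hw_marked_recoverable;	/* e.g. a UE in kernel space */
	bool reboot_succeeded;		/* false on e.g. old Skiboot */
};

/* What happens to an MCE that recovery could not handle. */
enum mce_outcome unrecovered_mce_outcome(const struct unrecovered_mce *mce,
					 bool patched)
{
	assert(mce != NULL);

	if (mce->reboot_succeeded)
		return MCE_REBOOTED;

	/* With this patch: always panic if the reboot did not happen. */
	if (patched)
		return MCE_PANIC;

	/* Without it: only hardware-unrecoverable MCEs end in a panic. */
	return mce->hw_marked_recoverable ?
		MCE_OOPS_AND_CONTINUE : MCE_PANIC;
}
```

The only row that changes with the patch is the one where the
hardware marked the error as recoverable and the reboot did not
happen: that used to OOPS and continue, and now panics.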

Explicitly panic() on unrecovered MCEs on PowerNV.
Update the comments appropriately.

This fixes some hangs following EEH events on cxlflash setups.

Signed-off-by: Daniel Axtens <dja@axtens.net>
---
 arch/powerpc/platforms/powernv/opal.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/opal.c b/arch/powerpc/platforms/powernv/opal.c
index 230f3a7cdea4..4296d55e88f3 100644
--- a/arch/powerpc/platforms/powernv/opal.c
+++ b/arch/powerpc/platforms/powernv/opal.c
@@ -487,9 +487,12 @@ int opal_machine_check(struct pt_regs *regs)
 	 *    PRD component would have already got notified about this
 	 *    error through other channels.
 	 *
-	 * In any case, let us just fall through. We anyway heading
-	 * down to panic path.
+	 * If hardware marked this as an unrecoverable MCE, we are
+	 * going to panic anyway. Even if it didn't, it's not safe to
+	 * continue at this point, so we should explicitly panic.
 	 */
+
+	panic("PowerNV Unrecovered Machine Check");
 	return 0;
 }

-- 
2.5.0


* Re: [PATCH] powerpc/powernv: Panic on unhandled Machine Check
  2015-09-23  6:41 [PATCH] powerpc/powernv: Panic on unhandled Machine Check Daniel Axtens
@ 2015-09-24  0:12 ` Andrew Donnellan
  2015-09-24  3:17 ` Ian Munsie
  2015-10-13  3:47 ` Michael Ellerman
  2 siblings, 0 replies; 4+ messages in thread
From: Andrew Donnellan @ 2015-09-24  0:12 UTC (permalink / raw)
  To: Daniel Axtens, linuxppc-dev; +Cc: mikey, Matthew R. Ochs, imunsie, Manoj Kumar

On 23/09/15 16:41, Daniel Axtens wrote:
> All unrecovered machine check errors on PowerNV should cause an
> immediate panic. There are 2 reasons that this is the right policy:
> it's not safe to continue, and we're already trying to reboot.
>
> Firstly, if we go through the recovery process and do not successfully
> recover, we can't be sure about the state of the machine, and it is
> not safe to recover and proceed.
>
> Linux knows about the following sources of Machine Check Errors:
> - Uncorrectable Errors (UE)
> - Effective to Real Address Translation (ERAT)
> - Segment Lookaside Buffer (SLB)
> - Translation Lookaside Buffer (TLB)
> - Unknown/Unrecognised
>
> In the SLB, TLB and ERAT cases, we can further categorise these as
> parity errors, multihit errors or unknown/unrecognised.
>
> We can handle SLB errors by flushing and reloading the SLB. We can
> handle TLB and ERAT multihit errors by flushing the TLB. (It appears
> we may not handle TLB and ERAT parity errors: I will investigate
> further and send a followup patch if appropriate.)
>
> This leaves us with uncorrectable errors. Uncorrectable errors are
> usually the result of ECC memory detecting an error that it cannot
> correct, but they also crop up in the context of PCI cards failing
> during DMA writes, and during CAPI error events.
>
> There are several types of UE, and there are 3 places a UE can occur:
> Skiboot, the kernel, and userspace. For Skiboot errors, we have the
> facility to make some recoverable. For userspace, we can simply kill
> (SIGBUS) the affected process. We have no meaningful way to deal with
> UEs in kernel space or in unrecoverable sections of Skiboot.
>
> Currently, these unrecovered UEs fall through to
> machine_check_exception() in traps.c, which calls die(), which OOPSes
> and sends SIGBUS to the process. This sometimes allows us to stumble
> onwards. For example we've seen UEs kill the kernel eehd and
> khugepaged. However, the process killed could have held a lock, or it
> could have been a more important process, etc: we can no longer make
> any assertions about the state of the machine. Similarly if we see a
> UE in skiboot (and again we've seen this happen), we're not in a
> position where we can make any assertions about the state of the
> machine.
>
> Likewise, for unknown or unrecognised errors, we're not able to say
> anything about the state of the machine.
>
> Therefore, if we have an unrecovered MCE, the most appropriate thing
> to do is to panic.
>
> The second reason is that since e784b6499d9c ("powerpc/powernv: Invoke
> opal_cec_reboot2() on unrecoverable machine check errors."), we
> attempt a special OPAL reboot on an unhandled MCE. This is so the
> hardware can record error data for later debugging.
>
> The comments in that commit assert that we are heading down the panic
> path anyway. At the moment this is not always true. With UEs in kernel
> space, for instance, they are marked as recoverable by the hardware,
> so if the attempt to reboot failed (e.g. old Skiboot), we wouldn't
> panic() but would simply die() and OOPS. It doesn't make sense to be
> staggering on if we've just tried to reboot: we should panic().
>
> Explicitly panic() on unrecovered MCEs on PowerNV.
> Update the comments appropriately.
>
> This fixes some hangs following EEH events on cxlflash setups.
>
> Signed-off-by: Daniel Axtens <dja@axtens.net>

Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>

-- 
Andrew Donnellan              Software Engineer, OzLabs
andrew.donnellan@au1.ibm.com  Australia Development Lab, Canberra
+61 2 6201 8874 (work)        IBM Australia Limited


* Re: [PATCH] powerpc/powernv: Panic on unhandled Machine Check
  2015-09-23  6:41 [PATCH] powerpc/powernv: Panic on unhandled Machine Check Daniel Axtens
  2015-09-24  0:12 ` Andrew Donnellan
@ 2015-09-24  3:17 ` Ian Munsie
  2015-10-13  3:47 ` Michael Ellerman
  2 siblings, 0 replies; 4+ messages in thread
From: Ian Munsie @ 2015-09-24  3:17 UTC (permalink / raw)
  To: Daniel Axtens
  Cc: linuxppc-dev, mpe, benh, mikey, andrew.donnellan, Matthew R. Ochs,
	Manoj Kumar

Thanks Daniel.

Reviewed-by: Ian Munsie <imunsie@au1.ibm.com>


* Re: powerpc/powernv: Panic on unhandled Machine Check
  2015-09-23  6:41 [PATCH] powerpc/powernv: Panic on unhandled Machine Check Daniel Axtens
  2015-09-24  0:12 ` Andrew Donnellan
  2015-09-24  3:17 ` Ian Munsie
@ 2015-10-13  3:47 ` Michael Ellerman
  2 siblings, 0 replies; 4+ messages in thread
From: Michael Ellerman @ 2015-10-13  3:47 UTC (permalink / raw)
  To: Daniel Axtens, linuxppc-dev
  Cc: mikey, Matthew R. Ochs, imunsie, andrew.donnellan, Manoj Kumar,
	Daniel Axtens

On Wed, 2015-09-23 at 06:41:48 UTC, Daniel Axtens wrote:
> All unrecovered machine check errors on PowerNV should cause an
> immediate panic. There are 2 reasons that this is the right policy:
> it's not safe to continue, and we're already trying to reboot.
...
> Explicitly panic() on unrecovered MCEs on PowerNV.
> Update the comments appropriately.
> 
> This fixes some hangs following EEH events on cxlflash setups.
> 
> Signed-off-by: Daniel Axtens <dja@axtens.net>
> Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
> Reviewed-by: Ian Munsie <imunsie@au1.ibm.com>

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/f2dd80ecca5f06b46134f2bd

cheers
