[PATCH] cxl: Remove racy attempt to force EEH invocation in reset

linuxppc-dev.lists.ozlabs.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] cxl: Remove racy attempt to force EEH invocation in reset
@ 2015-08-21  7:25 Daniel Axtens
  2015-08-21  8:06 ` Ian Munsie
  2015-08-27 21:58 ` Michael Ellerman
  0 siblings, 2 replies; 3+ messages in thread
From: Daniel Axtens @ 2015-08-21  7:25 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: mpe, benh, mikey, imunsie, Ryan Grimm, Daniel Axtens

cxl_reset currently PERSTs the slot, and then repeatedly tries to
read MMIO space in order to kick off EEH.

There are 2 problems with this: it's unnecessary, and it's racy.

It's unnecessary because the PERST will bring down the PHB link.
That will be picked up by the CAPP, which will send out an HMI.
Skiboot, noticing an HMI from the CAPP, will send an OPAL
notification to the kernel, which will trigger EEH recovery.

It's also racy: the EEH recovery triggered by the CAPP will
eventually cause the MMIO space to have its mapping invalidated
and the pointer NULLed out. This races with our attempt to read
the MMIO space. This is causing OOPSes in testing.

Simply drop all the attempts to force EEH detection, and trust
that Skiboot will send the notification and that we'll act on it.
The Skiboot code to send the EEH notification has been in Skiboot
for as long as CAPP recovery has been supported, so we don't need
to worry about breaking obscure setups with ancient firmware.

Cc: Ryan Grimm <grimm@linux.vnet.ibm.com>
Cc: stable@vger.kernel.org
Fixes: 62fa19d4b4fd ("cxl: Add ability to reset the card")
Signed-off-by: Daniel Axtens <dja@axtens.net>

---
Oddly, the commit message in 62fa19d4b4fd ("cxl: Add ability to
reset the card") is correct in its explanation of EEH triggering.

My guess is that the extra code was added because in our development
environment (bml) the notifications don't seem to work reliably.

However, I have tested that the notifications do work on real/normal
hardware.
---
 drivers/misc/cxl/pci.c | 16 ----------------
 1 file changed, 16 deletions(-)

diff --git a/drivers/misc/cxl/pci.c b/drivers/misc/cxl/pci.c
index be3839e34b7c..84bb9adbccb1 100644
--- a/drivers/misc/cxl/pci.c
+++ b/drivers/misc/cxl/pci.c
@@ -887,8 +887,6 @@ int cxl_reset(struct cxl *adapter)
 {
 	struct pci_dev *dev = to_pci_dev(adapter->dev.parent);
 	int rc;
-	int i;
-	u32 val;

 	if (adapter->perst_same_image) {
 		dev_warn(&dev->dev,
@@ -906,20 +904,6 @@ int cxl_reset(struct cxl *adapter)
 		return rc;
 	}

-	/* the PERST done above fences the PHB.  So, reset depends on EEH
-	 * to unbind the driver, tell Sapphire to reinit the PHB, and rebind
-	 * the driver.  Do an mmio read explictly to ensure EEH notices the
-	 * fenced PHB.  Retry for a few seconds before giving up. */
-	i = 0;
-	while (((val = mmio_read32be(adapter->p1_mmio)) != 0xffffffff) &&
-		(i < 5)) {
-		msleep(500);
-		i++;
-	}
-
-	if (val != 0xffffffff)
-		dev_err(&dev->dev, "cxl: PERST failed to trigger EEH\n");
-
 	return rc;
 }

-- 
2.5.0

^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [PATCH] cxl: Remove racy attempt to force EEH invocation in reset
  2015-08-21  7:25 [PATCH] cxl: Remove racy attempt to force EEH invocation in reset Daniel Axtens
@ 2015-08-21  8:06 ` Ian Munsie
  2015-08-27 21:58 ` Michael Ellerman
  1 sibling, 0 replies; 3+ messages in thread
From: Ian Munsie @ 2015-08-21  8:06 UTC (permalink / raw)
  To: Daniel Axtens; +Cc: linuxppc-dev, mpe, benh, mikey, Ryan Grimm

Acked-by: Ian Munsie <imunsie@au1.ibm.com>

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: cxl: Remove racy attempt to force EEH invocation in reset
  2015-08-21  7:25 [PATCH] cxl: Remove racy attempt to force EEH invocation in reset Daniel Axtens
  2015-08-21  8:06 ` Ian Munsie
@ 2015-08-27 21:58 ` Michael Ellerman
  1 sibling, 0 replies; 3+ messages in thread
From: Michael Ellerman @ 2015-08-27 21:58 UTC (permalink / raw)
  To: Daniel Axtens, linuxppc-dev; +Cc: mikey, imunsie, Ryan Grimm, Daniel Axtens

On Fri, 2015-21-08 at 07:25:15 UTC, Daniel Axtens wrote:
> cxl_reset currently PERSTs the slot, and then repeatedly tries to
> read MMIO space in order to kick off EEH.
> 
> There are 2 problems with this: it's unnecessary, and it's racy.
> 
> It's unnecessary because the PERST will bring down the PHB link.
> That will be picked up by the CAPP, which will send out an HMI.
> Skiboot, noticing an HMI from the CAPP, will send an OPAL
> notification to the kernel, which will trigger EEH recovery.
> 
> It's also racy: the EEH recovery triggered by the CAPP will
> eventually cause the MMIO space to have its mapping invalidated
> and the pointer NULLed out. This races with our attempt to read
> the MMIO space. This is causing OOPSes in testing.
> 
> Simply drop all the attempts to force EEH detection, and trust
> that Skiboot will send the notification and that we'll act on it.
> The Skiboot code to send the EEH notification has been in Skiboot
> for as long as CAPP recovery has been supported, so we don't need
> to worry about breaking obscure setups with ancient firmware.
> 
> Cc: Ryan Grimm <grimm@linux.vnet.ibm.com>
> Cc: stable@vger.kernel.org
> Fixes: 62fa19d4b4fd ("cxl: Add ability to reset the card")
> Signed-off-by: Daniel Axtens <dja@axtens.net>
> Acked-by: Ian Munsie <imunsie@au1.ibm.com>

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/9d8e27673c45927fee9e7d89

cheers

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2015-08-27 21:58 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-08-21  7:25 [PATCH] cxl: Remove racy attempt to force EEH invocation in reset Daniel Axtens
2015-08-21  8:06 ` Ian Munsie
2015-08-27 21:58 ` Michael Ellerman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).