[PATCH 1/2]: PCI Error Recovery: Symbios SCSI base support

linuxppc-dev.lists.ozlabs.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 1/2]: PCI Error Recovery: Symbios SCSI base support
@ 2007-04-20 20:41 Linas Vepstas
  2007-04-20 20:47 ` [PATCH 2/2]: PCI Error Recovery: Symbios SCSI First Failure Linas Vepstas
  0 siblings, 1 reply; 13+ messages in thread
From: Linas Vepstas @ 2007-04-20 20:41 UTC (permalink / raw)
  To: matthew; +Cc: linuxppc-dev, linux-pci, linux-kernel, linux-scsi



Hi Matthew,

After a long hiatus, I took another stab at pci error recovery 
for the symbios. This is very nearly the same patch as before, 
with only an update to enable MWI, and to support chip workarounds.
I think I've addressed all the other issues that came up. Thus,
again, I'll ask that the patch go in (for 2.6.22 of course).


To recap the only outstanding issue:

>> @@ -657,6 +657,10 @@ static irqreturn_t sym53c8xx_intr(int ir
>> +     /* Avoid spinloop trying to handle interrupts on frozen device */
>> +     if (pci_channel_offline(np->s.device))
>> +             return IRQ_HANDLED;
>
>Just wondering ... should we really be returning HANDLED?  What if the
>IRQ is shared?  Will the hardware de-assert the level interrupt when it
>puts the device in reset (ie is this a transitory glitch?), or do we
>have to cope with a screaming interrupt?

This routine *always* returns HANDLED anyway, so this patch does
not change semantics. For a symbios device plugged into a shared
irq line, this is a problem with or without my patch.

Yes, irq's will typically scream until handled. Yes, the device
reset will eventually clear the irq, assuming the system doesn't 
deadlock on a screaming irq. 

--linas

Here's the formal changelog entry:

Various PCI bus errors can be signaled by newer PCI controllers.  
This patch adds the PCI error recovery callbacks to the Symbios 
SCSI device driver.  The patch has been tested, and appears to 
work well.

Signed-off-by: Linas Vepstas <linas@austin.ibm.com>

--
 drivers/scsi/sym53c8xx_2/sym_glue.c |  136 ++++++++++++++++++++++++++++++++++++
 drivers/scsi/sym53c8xx_2/sym_glue.h |    4 +
 drivers/scsi/sym53c8xx_2/sym_hipd.c |    6 +
 3 files changed, 146 insertions(+)

Index: linux-2.6.21-rc4-git4/drivers/scsi/sym53c8xx_2/sym_glue.c
===================================================================
--- linux-2.6.21-rc4-git4.orig/drivers/scsi/sym53c8xx_2/sym_glue.c	2007-04-20 12:07:38.000000000 -0500
+++ linux-2.6.21-rc4-git4/drivers/scsi/sym53c8xx_2/sym_glue.c	2007-04-20 12:52:01.000000000 -0500
@@ -657,6 +657,10 @@ static irqreturn_t sym53c8xx_intr(int ir
 	unsigned long flags;
 	struct sym_hcb *np = (struct sym_hcb *)dev_id;
 
+	/* Avoid spinloop trying to handle interrupts on frozen device */
+	if (pci_channel_offline(np->s.device))
+		return IRQ_HANDLED;
+
 	if (DEBUG_FLAGS & DEBUG_TINY) printf_debug ("[");
 
 	spin_lock_irqsave(np->s.host->host_lock, flags);
@@ -726,6 +730,20 @@ static int sym_eh_handler(int op, char *
 
 	dev_warn(&cmd->device->sdev_gendev, "%s operation started.\n", opname);
 
+	/* We may be in an error condition because the PCI bus
+	 * went down. In this case, we need to wait until the
+	 * PCI bus is reset, the card is reset, and only then
+	 * proceed with the scsi error recovery.  There's no
+	 * point in hurrying; take a leisurely wait.
+	 */
+#define WAIT_FOR_PCI_RECOVERY	35
+	if (pci_channel_offline(np->s.device)) {
+		int finished_reset = wait_for_completion_timeout(
+			&np->s.io_reset_wait, WAIT_FOR_PCI_RECOVERY*HZ);
+		if (!finished_reset)
+			return SCSI_FAILED;
+	}
+
 	spin_lock_irq(host->host_lock);
 	/* This one is queued in some place -> to wait for completion */
 	FOR_EACH_QUEUED_ELEMENT(&np->busy_ccbq, qp) {
@@ -1510,6 +1528,7 @@ static struct Scsi_Host * __devinit sym_
 	np->maxoffs	= dev->chip.offset_max;
 	np->maxburst	= dev->chip.burst_max;
 	np->myaddr	= dev->host_id;
+	init_completion(&np->s.io_reset_wait);
 
 	/*
 	 *  Edit its name.
@@ -1948,6 +1967,116 @@ static void __devexit sym2_remove(struct
 	attach_count--;
 }
 
+/**
+ * sym2_io_error_detected() -- called when PCI error is detected
+ * @pdev: pointer to PCI device
+ * @state: current state of the PCI slot
+ */
+static pci_ers_result_t sym2_io_error_detected (struct pci_dev *pdev,
+                                         enum pci_channel_state state)
+{
+	struct sym_hcb *np = pci_get_drvdata(pdev);
+
+	/* If slot is permanently frozen, turn everything off */
+	if (state == pci_channel_io_perm_failure) {
+		sym2_remove(pdev);
+		return PCI_ERS_RESULT_DISCONNECT;
+	}
+
+	init_completion(&np->s.io_reset_wait);
+	disable_irq(pdev->irq);
+	pci_disable_device(pdev);
+
+	/* Request a slot reset. */
+	return PCI_ERS_RESULT_NEED_RESET;
+}
+
+/**
+ * sym2_reset_workarounds -- hardware-specific work-arounds
+ *
+ * This routine is similar to sym_set_workarounds(), except
+ * that, at this point, we already know that the device was 
+ * succesfully intialized at least once before, and so most
+ * of the steps taken there are un-needed here. 
+ */
+static void sym2_reset_workarounds (struct pci_dev *pdev)
+{
+	u_char revision;
+	u_short status_reg;
+	struct sym_chip *chip;
+	
+	pci_read_config_byte(pdev, PCI_CLASS_REVISION, &revision);
+	chip = sym_lookup_chip_table(pdev->device, revision);
+	
+	/* Work around for errant bit in 895A, in a fashion
+	 * similar to what is done in sym_set_workarounds(). 
+	 */
+	pci_read_config_word(pdev, PCI_STATUS, &status_reg);
+	if (!(chip->features & FE_66MHZ) && (status_reg & PCI_STATUS_66MHZ)) {
+	 	status_reg = PCI_STATUS_66MHZ;
+		pci_write_config_word(pdev, PCI_STATUS, status_reg);
+		pci_read_config_word(pdev, PCI_STATUS, &status_reg);
+	}
+}
+
+/**
+ * sym2_io_slot_reset() -- called when the pci bus has been reset.
+ * @pdev: pointer to PCI device
+ *
+ * Restart the card from scratch.
+ */
+static pci_ers_result_t sym2_io_slot_reset (struct pci_dev *pdev)
+{
+	struct sym_hcb *np = pci_get_drvdata(pdev);
+
+	printk(KERN_INFO "%s: recovering from a PCI slot reset\n",
+	          sym_name(np));
+
+	if (pci_enable_device(pdev)) {
+		printk(KERN_ERR "%s: Unable to enable afer PCI reset\n",
+		        sym_name(np));
+		return PCI_ERS_RESULT_DISCONNECT;
+	}
+
+	pci_set_master(pdev);
+	enable_irq(pdev->irq);
+
+	/* If the chip can do Memory Write Invalidate, enable it */
+	if (np->features & FE_WRIE) {
+		if (pci_set_mwi(pdev))
+				return PCI_ERS_RESULT_DISCONNECT;
+	}
+
+	/* Perform work-arounds, analogous to sym_set_workarounds() */
+	sym2_reset_workarounds(pdev);
+
+	/* Perform host reset only on one instance of the card */
+	if (PCI_FUNC (pdev->devfn) == 0) {
+		if (sym_reset_scsi_bus(np, 0)) {
+			printk(KERN_ERR "%s: Unable to reset scsi host\n",
+			        sym_name(np));
+			return PCI_ERS_RESULT_DISCONNECT;
+		}
+		sym_start_up(np, 1);
+	}
+
+	return PCI_ERS_RESULT_RECOVERED;
+}
+
+/**
+ * sym2_io_resume() -- resume normal ops after PCI reset
+ * @pdev: pointer to PCI device
+ *
+ * Called when the error recovery driver tells us that its
+ * OK to resume normal operation. Use completion to allow
+ * halted scsi ops to resume.
+ */
+static void sym2_io_resume (struct pci_dev *pdev)
+{
+	struct sym_hcb *np = pci_get_drvdata(pdev);
+	complete_all(&np->s.io_reset_wait);
+}
+
 static void sym2_get_signalling(struct Scsi_Host *shost)
 {
 	struct sym_hcb *np = sym_get_hcb(shost);
@@ -2110,11 +2239,18 @@ static struct pci_device_id sym2_id_tabl
 
 MODULE_DEVICE_TABLE(pci, sym2_id_table);
 
+static struct pci_error_handlers sym2_err_handler = {
+	.error_detected = sym2_io_error_detected,
+	.slot_reset = sym2_io_slot_reset,
+	.resume = sym2_io_resume,
+};
+
 static struct pci_driver sym2_driver = {
 	.name		= NAME53C8XX,
 	.id_table	= sym2_id_table,
 	.probe		= sym2_probe,
 	.remove		= __devexit_p(sym2_remove),
+	.err_handler 	= &sym2_err_handler,
 };
 
 static int __init sym2_init(void)
Index: linux-2.6.21-rc4-git4/drivers/scsi/sym53c8xx_2/sym_glue.h
===================================================================
--- linux-2.6.21-rc4-git4.orig/drivers/scsi/sym53c8xx_2/sym_glue.h	2007-04-20 12:07:38.000000000 -0500
+++ linux-2.6.21-rc4-git4/drivers/scsi/sym53c8xx_2/sym_glue.h	2007-04-20 12:15:07.000000000 -0500
@@ -40,6 +40,7 @@
 #ifndef SYM_GLUE_H
 #define SYM_GLUE_H
 
+#include <linux/completion.h>
 #include <linux/delay.h>
 #include <linux/ioport.h>
 #include <linux/pci.h>
@@ -179,6 +180,9 @@ struct sym_shcb {
 	char		chip_name[8];
 	struct pci_dev	*device;
 
+	/* Waiter for clearing of frozen PCI bus */
+	struct completion io_reset_wait;
+
 	struct Scsi_Host *host;
 
 	void __iomem *	ioaddr;		/* MMIO kernel io address	*/
Index: linux-2.6.21-rc4-git4/drivers/scsi/sym53c8xx_2/sym_hipd.c
===================================================================
--- linux-2.6.21-rc4-git4.orig/drivers/scsi/sym53c8xx_2/sym_hipd.c	2007-04-20 12:07:38.000000000 -0500
+++ linux-2.6.21-rc4-git4/drivers/scsi/sym53c8xx_2/sym_hipd.c	2007-04-20 12:18:59.000000000 -0500
@@ -2809,6 +2809,12 @@ void sym_interrupt (struct sym_hcb *np)
 			dstat |= INB(np, nc_dstat);
 		istatc = INB(np, nc_istat);
 		istat |= istatc;
+
+		/* Prevent deadlock waiting on a condition that may never clear. */
+		if (unlikely(sist == 0xffff && dstat == 0xff)) {
+			if (pci_channel_offline(np->s.device))
+				return;
+		}
 	} while (istatc & (SIP|DIP));
 
 	if (DEBUG_FLAGS & DEBUG_TINY)

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 2/2]: PCI Error Recovery: Symbios SCSI First Failure
  2007-04-20 20:41 [PATCH 1/2]: PCI Error Recovery: Symbios SCSI base support Linas Vepstas
@ 2007-04-20 20:47 ` Linas Vepstas
  2007-05-09 20:26   ` Linas Vepstas
  2007-09-26 15:02   ` Matthew Wilcox
  0 siblings, 2 replies; 13+ messages in thread
From: Linas Vepstas @ 2007-04-20 20:47 UTC (permalink / raw)
  To: matthew; +Cc: linuxppc-dev, linux-pci, linux-kernel, linux-scsi


Implement the so-called "first failure data capture" (FFDC) for the
symbios PCI error recovery.  After a PCI error event is reported,
the driver requests that MMIO be enabled. Once enabled, it 
then reads and dumps assorted status registers, and concludes
by requesting the usual reset sequence.

(includes a whitespace fix for bad indentation).

Signed-off-by: Linas Vepstas <linas@austin.ibm.com>

----
 drivers/scsi/sym53c8xx_2/sym_glue.c |   15 +++++++++++++++
 drivers/scsi/sym53c8xx_2/sym_glue.h |    1 +
 drivers/scsi/sym53c8xx_2/sym_hipd.c |   18 ++++++++++++++----
 3 files changed, 30 insertions(+), 4 deletions(-)

Index: linux-2.6.21-rc4-git4/drivers/scsi/sym53c8xx_2/sym_glue.c
===================================================================
--- linux-2.6.21-rc4-git4.orig/drivers/scsi/sym53c8xx_2/sym_glue.c	2007-04-20 12:52:01.000000000 -0500
+++ linux-2.6.21-rc4-git4/drivers/scsi/sym53c8xx_2/sym_glue.c	2007-04-20 15:25:35.000000000 -0500
@@ -1987,6 +1987,20 @@ static pci_ers_result_t sym2_io_error_de
 	disable_irq(pdev->irq);
 	pci_disable_device(pdev);
 
+	/* Request that MMIO be enabled, so register dump can be taken. */
+	return PCI_ERS_RESULT_CAN_RECOVER;
+}
+
+/**
+ * sym2_io_slot_dump -- Enable MMIO and dump debug registers
+ * @pdev: pointer to PCI device
+ */
+static pci_ers_result_t sym2_io_slot_dump (struct pci_dev *pdev)
+{
+	struct sym_hcb *np = pci_get_drvdata(pdev);
+
+	sym_dump_registers(np);
+
 	/* Request a slot reset. */
 	return PCI_ERS_RESULT_NEED_RESET;
 }
@@ -2241,6 +2255,7 @@ MODULE_DEVICE_TABLE(pci, sym2_id_table);
 
 static struct pci_error_handlers sym2_err_handler = {
 	.error_detected = sym2_io_error_detected,
+	.mmio_enabled = sym2_io_slot_dump,
 	.slot_reset = sym2_io_slot_reset,
 	.resume = sym2_io_resume,
 };
Index: linux-2.6.21-rc4-git4/drivers/scsi/sym53c8xx_2/sym_glue.h
===================================================================
--- linux-2.6.21-rc4-git4.orig/drivers/scsi/sym53c8xx_2/sym_glue.h	2007-04-20 12:15:07.000000000 -0500
+++ linux-2.6.21-rc4-git4/drivers/scsi/sym53c8xx_2/sym_glue.h	2007-04-20 15:21:31.000000000 -0500
@@ -270,5 +270,6 @@ void sym_xpt_async_bus_reset(struct sym_
 void sym_xpt_async_sent_bdr(struct sym_hcb *np, int target);
 int  sym_setup_data_and_start (struct sym_hcb *np, struct scsi_cmnd *csio, struct sym_ccb *cp);
 void sym_log_bus_error(struct sym_hcb *np);
+void sym_dump_registers(struct sym_hcb *np);
 
 #endif /* SYM_GLUE_H */
Index: linux-2.6.21-rc4-git4/drivers/scsi/sym53c8xx_2/sym_hipd.c
===================================================================
--- linux-2.6.21-rc4-git4.orig/drivers/scsi/sym53c8xx_2/sym_hipd.c	2007-04-20 12:18:59.000000000 -0500
+++ linux-2.6.21-rc4-git4/drivers/scsi/sym53c8xx_2/sym_hipd.c	2007-04-20 15:18:01.000000000 -0500
@@ -1180,10 +1180,10 @@ static void sym_log_hard_error(struct sy
 			scr_to_cpu((int) *(u32 *)(script_base + script_ofs)));
 	}
 
-        printf ("%s: regdump:", sym_name(np));
-        for (i=0; i<24;i++)
-            printf (" %02x", (unsigned)INB_OFF(np, i));
-        printf (".\n");
+	printf ("%s: regdump:", sym_name(np));
+	for (i=0; i<24;i++)
+	    printf (" %02x", (unsigned)INB_OFF(np, i));
+	printf (".\n");
 
 	/*
 	 *  PCI BUS error.
@@ -1192,6 +1192,16 @@ static void sym_log_hard_error(struct sy
 		sym_log_bus_error(np);
 }
 
+void sym_dump_registers(struct sym_hcb *np)
+{
+	u_short sist;
+	u_char dstat;
+
+	sist = INW(np, nc_sist);
+	dstat = INB(np, nc_dstat);
+	sym_log_hard_error(np, sist, dstat);
+}
+
 static struct sym_chip sym_dev_table[] = {
  {PCI_DEVICE_ID_NCR_53C810, 0x0f, "810", 4, 8, 4, 64,
  FE_ERL}

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2]: PCI Error Recovery: Symbios SCSI First Failure
  2007-04-20 20:47 ` [PATCH 2/2]: PCI Error Recovery: Symbios SCSI First Failure Linas Vepstas
@ 2007-05-09 20:26   ` Linas Vepstas
  2007-05-17 19:53     ` Linas Vepstas
  2007-09-26 15:02   ` Matthew Wilcox
  1 sibling, 1 reply; 13+ messages in thread
From: Linas Vepstas @ 2007-05-09 20:26 UTC (permalink / raw)
  To: matthew; +Cc: linuxppc-dev, linux-pci, linux-kernel, linux-scsi

Hi Matthew,

I had been hoping these patches might make it into 2.6.22,
... this is a nag note; please forward upstream.

--linas

On Fri, Apr 20, 2007 at 03:47:20PM -0500, Linas Vepstas wrote:
> 
> Implement the so-called "first failure data capture" (FFDC) for the
> symbios PCI error recovery.  After a PCI error event is reported,
> the driver requests that MMIO be enabled. Once enabled, it 
> then reads and dumps assorted status registers, and concludes
> by requesting the usual reset sequence.
> 
> (includes a whitespace fix for bad indentation).
> 
> Signed-off-by: Linas Vepstas <linas@austin.ibm.com>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2]: PCI Error Recovery: Symbios SCSI First Failure
  2007-05-09 20:26   ` Linas Vepstas
@ 2007-05-17 19:53     ` Linas Vepstas
  0 siblings, 0 replies; 13+ messages in thread
From: Linas Vepstas @ 2007-05-17 19:53 UTC (permalink / raw)
  To: matthew; +Cc: linuxppc-dev, linux-pci, linux-kernel, linux-scsi

On Wed, May 09, 2007 at 03:26:21PM -0500, Linas Vepstas wrote:
> Hi Matthew,
> 
> I had been hoping these patches might make it into 2.6.22,
> ... this is a nag note; please forward upstream.


... should I repost the patches? 

--linas 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2]: PCI Error Recovery: Symbios SCSI First Failure
  2007-04-20 20:47 ` [PATCH 2/2]: PCI Error Recovery: Symbios SCSI First Failure Linas Vepstas
  2007-05-09 20:26   ` Linas Vepstas
@ 2007-09-26 15:02   ` Matthew Wilcox
  2007-09-27 22:00     ` Linas Vepstas
  1 sibling, 1 reply; 13+ messages in thread
From: Matthew Wilcox @ 2007-09-26 15:02 UTC (permalink / raw)
  To: Linas Vepstas; +Cc: linuxppc-dev, linux-pci, linux-kernel, linux-scsi

On Fri, Apr 20, 2007 at 03:47:20PM -0500, Linas Vepstas wrote:
> Implement the so-called "first failure data capture" (FFDC) for the
> symbios PCI error recovery.  After a PCI error event is reported,
> the driver requests that MMIO be enabled. Once enabled, it 
> then reads and dumps assorted status registers, and concludes
> by requesting the usual reset sequence.

> +	/* Request that MMIO be enabled, so register dump can be taken. */
> +	return PCI_ERS_RESULT_CAN_RECOVER;
> +}

I'm a little concerned by the mention of MMIO.  It's entirely possible
for the sym2 driver to be using ioports to access the card rather than
MMIO.  Is it simply that it can't on the platform you test on?

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2]: PCI Error Recovery: Symbios SCSI First Failure
  2007-09-26 15:02   ` Matthew Wilcox
@ 2007-09-27 22:00     ` Linas Vepstas
  2007-09-27 22:10       ` Matthew Wilcox
  0 siblings, 1 reply; 13+ messages in thread
From: Linas Vepstas @ 2007-09-27 22:00 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linuxppc-dev, linux-pci, linux-kernel, linux-scsi

On Wed, Sep 26, 2007 at 09:02:16AM -0600, Matthew Wilcox wrote:
> On Fri, Apr 20, 2007 at 03:47:20PM -0500, Linas Vepstas wrote:
> > Implement the so-called "first failure data capture" (FFDC) for the
> > symbios PCI error recovery.  After a PCI error event is reported,
> > the driver requests that MMIO be enabled. Once enabled, it 
> > then reads and dumps assorted status registers, and concludes
> > by requesting the usual reset sequence.
> 
> > +	/* Request that MMIO be enabled, so register dump can be taken. */
> > +	return PCI_ERS_RESULT_CAN_RECOVER;
> > +}
> 
> I'm a little concerned by the mention of MMIO.  It's entirely possible
> for the sym2 driver to be using ioports to access the card rather than
> MMIO.  Is it simply that it can't on the platform you test on?

The comment is misleading. I've been in the bad habit of calling
it "mmio" whenever its not DMA.

The habit is because there are two distinct enable bits in the 
pci-host bridge during error recovery: one to enable mmio/ioports, 
and the other to enable DMA. If the adapter has gone crazy, I don't 
want to enable DMA, so that it doesn't scribble to bad places. But, 
by enabling mmio/ioports, perhaps it can be finessed back into a 
semi-sane state, e.g. sane enough to perform a dump of its internal
state.

--linas

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2]: PCI Error Recovery: Symbios SCSI First Failure
  2007-09-27 22:00     ` Linas Vepstas
@ 2007-09-27 22:10       ` Matthew Wilcox
  2007-09-27 23:34         ` Linas Vepstas
  0 siblings, 1 reply; 13+ messages in thread
From: Matthew Wilcox @ 2007-09-27 22:10 UTC (permalink / raw)
  To: Linas Vepstas; +Cc: linuxppc-dev, linux-pci, linux-kernel, linux-scsi

On Thu, Sep 27, 2007 at 05:00:22PM -0500, Linas Vepstas wrote:
> On Wed, Sep 26, 2007 at 09:02:16AM -0600, Matthew Wilcox wrote:
> > I'm a little concerned by the mention of MMIO.  It's entirely possible
> > for the sym2 driver to be using ioports to access the card rather than
> > MMIO.  Is it simply that it can't on the platform you test on?
> 
> The comment is misleading. I've been in the bad habit of calling
> it "mmio" whenever its not DMA.

OK, cool, thanks.  I'll update the comment for you.

One last thing (sorry, I only just noticed):
In the error handler, we wait_for_completion(io_reset_wait).
In sym2_io_error_detected, we init_completion(io_reset_wait).
Isn't it possible that we hit the error handler before we hit the
io_error_detected path, and thus the completion wait is lost?
Since the completion is already initialised in sym_attach(), I don't
think we need to initialise it in sym2_io_error_detected().
Makes sense to just delete it?

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2]: PCI Error Recovery: Symbios SCSI First Failure
  2007-09-27 22:10       ` Matthew Wilcox
@ 2007-09-27 23:34         ` Linas Vepstas
  2007-10-01 20:12           ` Matthew Wilcox
  0 siblings, 1 reply; 13+ messages in thread
From: Linas Vepstas @ 2007-09-27 23:34 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linuxppc-dev, linux-pci, linux-kernel, linux-scsi

On Thu, Sep 27, 2007 at 04:10:31PM -0600, Matthew Wilcox wrote:
> In the error handler, we wait_for_completion(io_reset_wait).
> In sym2_io_error_detected, we init_completion(io_reset_wait).
> Isn't it possible that we hit the error handler before we hit the
> io_error_detected path, and thus the completion wait is lost?
> Since the completion is already initialised in sym_attach(), I don't
> think we need to initialise it in sym2_io_error_detected().
> Makes sense to just delete it?

Good catch. But no ... and I had to study this a bit. Bear with me:

It is enough to call init_completion() once, and not once per use:
it initializes spinlocks, which shouldn't be intialized twice. 

But, that completion might be used multiple times when there are
multiple errors, and so, before using it a second time, one must 
set completion->done = 0.  The INIT_COMPLETION() macro does this. 

One must have completion->done = 0 before every use, as otherwise, 
wait_for_completion() won't actually wait. And since complete_all()
sets x->done += UINT_MAX/2, I'm pretty sure x->done won't be zero
the next time we use it, unless we make it so.

So I need to find a place to safely call INIT_COMPLETION() again, 
after the completion has been used. At the moment, I'm stumped
as to where to do this. 

---- [think ... think ... think] ----

I think the race you describe above is harmless. The first time
that sym_eh_handler() will run, it will be with SYM_EH_ABORT, 
in it doesn't matter if we lose that, since the device is hosed
anyway. At some later time, it will run with SYM_EH_DEVICE_RESET
and then SYM_EH_BUS_RESET and then SYM_EH_HOST_RESET, and we won't 
miss those, since, by now, sym2_io_error_detected() will have run.

So, by my reading, I'd say that init_completion() in
sym2_io_error_detected() has to stay (although perhaps
it should be replaced by the INIT_COMPLETION() macro.)
Removing it will prevent correct operation on the second 
and subsequent errors.

--Linas

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2]: PCI Error Recovery: Symbios SCSI First Failure
  2007-09-27 23:34         ` Linas Vepstas
@ 2007-10-01 20:12           ` Matthew Wilcox
  2007-10-01 22:41             ` Linas Vepstas
  0 siblings, 1 reply; 13+ messages in thread
From: Matthew Wilcox @ 2007-10-01 20:12 UTC (permalink / raw)
  To: Linas Vepstas; +Cc: linuxppc-dev, linux-pci, linux-kernel, linux-scsi

On Thu, Sep 27, 2007 at 06:34:37PM -0500, Linas Vepstas wrote:
> Good catch. But no ... and I had to study this a bit. Bear with me:

I agree with the analysis which I've now snipped.

> I think the race you describe above is harmless. The first time
> that sym_eh_handler() will run, it will be with SYM_EH_ABORT, 
> in it doesn't matter if we lose that, since the device is hosed
> anyway. At some later time, it will run with SYM_EH_DEVICE_RESET
> and then SYM_EH_BUS_RESET and then SYM_EH_HOST_RESET, and we won't 
> miss those, since, by now, sym2_io_error_detected() will have run.
> 
> So, by my reading, I'd say that init_completion() in
> sym2_io_error_detected() has to stay (although perhaps
> it should be replaced by the INIT_COMPLETION() macro.)
> Removing it will prevent correct operation on the second 
> and subsequent errors.

I think the fundamental problem is that completions aren't really
supposed to be used like this.  Here's one attempt at using completions
perhaps a little more the way they're supposed to be used, although now
I've written it, I wonder if we shouldn't just use a waitqueue instead.

diff --git a/drivers/scsi/sym53c8xx_2/sym_glue.c b/drivers/scsi/sym53c8xx_2/sym_glue.c
index e8a4361..b425b89 100644
--- a/drivers/scsi/sym53c8xx_2/sym_glue.c
+++ b/drivers/scsi/sym53c8xx_2/sym_glue.c
@@ -602,6 +602,7 @@ static int sym_eh_handler(int op, char *opname, struct scsi_cmnd *cmd)
 	struct sym_hcb *np = SYM_SOFTC_PTR(cmd);
 	struct sym_ucmd *ucmd = SYM_UCMD_PTR(cmd);
 	struct Scsi_Host *host = cmd->device->host;
+	struct pci_dev *pdev = np->s.device;
 	SYM_QUEHEAD *qp;
 	int cmd_queued = 0;
 	int sts = -1;
@@ -616,9 +617,19 @@ static int sym_eh_handler(int op, char *opname, struct scsi_cmnd *cmd)
 	 * point in hurrying; take a leisurely wait.
 	 */
 #define WAIT_FOR_PCI_RECOVERY	35
-	if (pci_channel_offline(np->s.device)) {
-		int finished_reset = wait_for_completion_timeout(
-			&np->s.io_reset_wait, WAIT_FOR_PCI_RECOVERY*HZ);
+	if (pci_channel_offline(pdev)) {
+		struct host_data *hostdata = shost_priv(host);
+		int finished_reset = 0;
+		init_completion(&eh_done);
+		spin_lock_irq(host->host_lock);
+		if (!hostdata->io_reset)
+			hostdata->io_reset = &eh_done;
+		if (!pci_channel_offline(pdev))
+			finished_reset = 1;
+		spin_unlock_irq(host->host_lock);
+		if (!finished_reset)
+			finished_reset = wait_for_completion_timeout(
+				hostdata->io_reset, WAIT_FOR_PCI_RECOVERY*HZ);
 		if (!finished_reset)
 			return SCSI_FAILED;
 	}
@@ -1396,7 +1407,6 @@ static struct Scsi_Host * __devinit sym_attach(struct scsi_host_template *tpnt,
 	np->maxoffs	= dev->chip.offset_max;
 	np->maxburst	= dev->chip.burst_max;
 	np->myaddr	= dev->host_id;
-	init_completion(&np->s.io_reset_wait);
 
 	/*
 	 *  Edit its name.
@@ -1842,15 +1852,12 @@ static void __devexit sym2_remove(struct pci_dev *pdev)
 static pci_ers_result_t sym2_io_error_detected(struct pci_dev *pdev,
                                          enum pci_channel_state state)
 {
-	struct sym_hcb *np = pci_get_drvdata(pdev);
-
 	/* If slot is permanently frozen, turn everything off */
 	if (state == pci_channel_io_perm_failure) {
 		sym2_remove(pdev);
 		return PCI_ERS_RESULT_DISCONNECT;
 	}
 
-	init_completion(&np->s.io_reset_wait);
 	disable_irq(pdev->irq);
 	pci_disable_device(pdev);
 
@@ -1912,7 +1919,7 @@ static pci_ers_result_t sym2_io_slot_reset(struct pci_dev *pdev)
 	          sym_name(np));
 
 	if (pci_enable_device(pdev)) {
-		printk(KERN_ERR "%s: Unable to enable afer PCI reset\n",
+		printk(KERN_ERR "%s: Unable to enable after PCI reset\n",
 		        sym_name(np));
 		return PCI_ERS_RESULT_DISCONNECT;
 	}
@@ -1953,7 +1960,14 @@ static pci_ers_result_t sym2_io_slot_reset(struct pci_dev *pdev)
 static void sym2_io_resume(struct pci_dev *pdev)
 {
 	struct sym_hcb *np = pci_get_drvdata(pdev);
-	complete_all(&np->s.io_reset_wait);
+	struct Scsi_Host *shost = np->s.host;
+	struct host_data *hostdata = shost_priv(shost);
+
+	spin_lock_irq(shost->host_lock);
+	if (hostdata->io_reset)
+		complete_all(hostdata->io_reset);
+	hostdata->io_reset = NULL;
+	spin_unlock_irq(shost->host_lock);
 }
 
 static void sym2_get_signalling(struct Scsi_Host *shost)
diff --git a/drivers/scsi/sym53c8xx_2/sym_glue.h b/drivers/scsi/sym53c8xx_2/sym_glue.h
index a172cc5..b961f70 100644
--- a/drivers/scsi/sym53c8xx_2/sym_glue.h
+++ b/drivers/scsi/sym53c8xx_2/sym_glue.h
@@ -180,9 +180,6 @@ struct sym_shcb {
 	char		chip_name[8];
 	struct pci_dev	*device;
 
-	/* Waiter for clearing of frozen PCI bus */
-	struct completion io_reset_wait;
-
 	struct Scsi_Host *host;
 
 	void __iomem *	ioaddr;		/* MMIO kernel io address	*/
@@ -223,6 +220,7 @@ struct sym_device {
  */
 struct host_data {
 	struct sym_hcb *ncb;
+	struct completion *io_reset;		/* PCI error handling */
 };
 
 static inline struct sym_hcb * sym_get_hcb(struct Scsi_Host *host)

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2]: PCI Error Recovery: Symbios SCSI First Failure
  2007-10-01 20:12           ` Matthew Wilcox
@ 2007-10-01 22:41             ` Linas Vepstas
  2007-10-02  1:27               ` Matthew Wilcox
  0 siblings, 1 reply; 13+ messages in thread
From: Linas Vepstas @ 2007-10-01 22:41 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linuxppc-dev, linux-pci, linux-kernel, linux-scsi

On Mon, Oct 01, 2007 at 02:12:47PM -0600, Matthew Wilcox wrote:
> 
> I think the fundamental problem is that completions aren't really
> supposed to be used like this.  Here's one attempt at using completions
> perhaps a little more the way they're supposed to be used, 

Yes, that looks very good to me.  I see it solves a bug that
I hadn't been quite aware of. I don't understand why 
struct host_data is preferable to struct sym_shcb (is it because 
this is the structure that is "naturally protectected" by the 
spinlock?)

My gut instinct is to say "ack", although prudence dictates that 
I should test first. Which might take a few days...

> although now
> I've written it, I wonder if we shouldn't just use a waitqueue instead.

I thought that earlier versions of the driver used waitqueues (I vaguely
remember "eh_wait" in the code), which were later converted to 
completions (I also vaguely recall thinking that the new code was
more elegant/simpler). I converted my patch to use the completions 
likewise, and, as you've clearly shown, did a rather sloppy job in 
the conversion.

I'm tempted to go with this patch; but if you prod, I could attempt
a wait-queue based patch.

--linas

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2]: PCI Error Recovery: Symbios SCSI First Failure
  2007-10-01 22:41             ` Linas Vepstas
@ 2007-10-02  1:27               ` Matthew Wilcox
  2007-10-02 21:59                 ` Linas Vepstas
  2007-10-04 18:36                 ` Linas Vepstas
  0 siblings, 2 replies; 13+ messages in thread
From: Matthew Wilcox @ 2007-10-02  1:27 UTC (permalink / raw)
  To: Linas Vepstas; +Cc: linuxppc-dev, linux-pci, linux-kernel, linux-scsi

On Mon, Oct 01, 2007 at 05:41:32PM -0500, Linas Vepstas wrote:
> On Mon, Oct 01, 2007 at 02:12:47PM -0600, Matthew Wilcox wrote:
> > I think the fundamental problem is that completions aren't really
> > supposed to be used like this.  Here's one attempt at using completions
> > perhaps a little more the way they're supposed to be used, 
> 
> Yes, that looks very good to me.  I see it solves a bug that
> I hadn't been quite aware of. I don't understand why 
> struct host_data is preferable to struct sym_shcb (is it because 
> this is the structure that is "naturally protectected" by the 
> spinlock?)

The thing to remember is that sym2 is in transition from being a dual
BSD/Linux driver to being a purely Linux driver.  I know, it's been more
than two years, and I'm still not done.

My latest thing I'm looking at fixing is to make Scsi_Host very
much the primary entity in the sym2 driver, and leave just the
device/scripts-accessible stuff in the hcb.

> My gut instinct is to say "ack", although prudence dictates that 
> I should test first. Which might take a few days...

Fine by me.  Do you have the ability to produce failures on a whim on
your platforms?  I've been vaguely musing a PCI device failure patch for
x86, just so people can test driver failure paths.

> I thought that earlier versions of the driver used waitqueues (I vaguely
> remember "eh_wait" in the code), which were later converted to 
> completions (I also vaguely recall thinking that the new code was
> more elegant/simpler). I converted my patch to use the completions 
> likewise, and, as you've clearly shown, did a rather sloppy job in 
> the conversion.

eh_wait (when I removed it) contained a completion ... I think it used
to be a semaphore, some time before 2.6.12

> I'm tempted to go with this patch; but if you prod, I could attempt
> a wait-queue based patch.

Let's leave it with a completion for now.  I think it works out more
efficiently in the end.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2]: PCI Error Recovery: Symbios SCSI First Failure
  2007-10-02  1:27               ` Matthew Wilcox
@ 2007-10-02 21:59                 ` Linas Vepstas
  2007-10-04 18:36                 ` Linas Vepstas
  1 sibling, 0 replies; 13+ messages in thread
From: Linas Vepstas @ 2007-10-02 21:59 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linuxppc-dev, linux-pci, linux-kernel, linux-scsi

On Mon, Oct 01, 2007 at 07:27:30PM -0600, Matthew Wilcox wrote:
> 
> Fine by me.  Do you have the ability to produce failures on a whim on
> your platforms?  

Yes, although it is very platform specific -- there are actually
transistors in the pci bridge chip, which actually short out lines,
and so, from the point of view of the rest of the chip, it did
actually see a "real" error. Its supposed to be a very realistic 
test.

> I've been vaguely musing a PCI device failure patch for
> x86, just so people can test driver failure paths.

That would be good ... I've recently agreed to accept a fedex
to test someone elses card for them, which is outside my usual
activities.

There's also supposed to be some PCI-X riser card out there, 
(never seen one) which has the ability to inject actual pci 
errors. Its the Agilent PCI BestX card; I got the impression 
they might not sell it anymore; dunno.

One guy in the lab used to brush a grounding strap across
the pins; this usually got a rise out of the audience.

--linas

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2]: PCI Error Recovery: Symbios SCSI First Failure
  2007-10-02  1:27               ` Matthew Wilcox
  2007-10-02 21:59                 ` Linas Vepstas
@ 2007-10-04 18:36                 ` Linas Vepstas
  1 sibling, 0 replies; 13+ messages in thread
From: Linas Vepstas @ 2007-10-04 18:36 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linuxppc-dev, linux-pci, linux-kernel, linux-scsi

On Mon, Oct 01, 2007 at 07:27:30PM -0600, Matthew Wilcox wrote:
> 
> The thing to remember is that sym2 is in transition from being a dual
> BSD/Linux driver to being a purely Linux driver. 

I was wondering about that; couldn't tell if the split in the code
was historical, or being intentionally maintained.

> > My gut instinct is to say "ack", although prudence dictates that 
> > I should test first. Which might take a few days...
> 
> Fine by me.  

I tested the patch, it worked great. It also seemed to recover 
much more quickly -- so quickly, in fact, that I thought something 
had gone wrong.

I reviewed it one more time, it really does look good. A formal
submission and acked by's at earliest convenience would be good. 

--linas

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2007-10-04 18:36 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-04-20 20:41 [PATCH 1/2]: PCI Error Recovery: Symbios SCSI base support Linas Vepstas
2007-04-20 20:47 ` [PATCH 2/2]: PCI Error Recovery: Symbios SCSI First Failure Linas Vepstas
2007-05-09 20:26   ` Linas Vepstas
2007-05-17 19:53     ` Linas Vepstas
2007-09-26 15:02   ` Matthew Wilcox
2007-09-27 22:00     ` Linas Vepstas
2007-09-27 22:10       ` Matthew Wilcox
2007-09-27 23:34         ` Linas Vepstas
2007-10-01 20:12           ` Matthew Wilcox
2007-10-01 22:41             ` Linas Vepstas
2007-10-02  1:27               ` Matthew Wilcox
2007-10-02 21:59                 ` Linas Vepstas
2007-10-04 18:36                 ` Linas Vepstas

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).