[PATCH 0/5] mpt fusion error handler patches

* [PATCH 0/5] mpt fusion error handler patches
@ 2008-09-12 18:57 Bernd Schubert
  2008-09-12 18:59 ` [PATCH 1/5] scsi abort Bernd Schubert
                   ` (6 more replies)
  0 siblings, 7 replies; 9+ messages in thread
From: Bernd Schubert @ 2008-09-12 18:57 UTC (permalink / raw)
  To: Linux SCSI Mailing List; +Cc: Eric Moore, Sathya Prakash

Hello,

I am submitting several error handler patches for the MPT fusion
driver. Their main purpose is to fix errors on the second port of
dual-port 53C1030-based HBAs.
As I complained on this list some time ago, a device failure on one port
of an LSI22320R HBA also causes failures of innocent devices on the other
port of the same HBA. To debug this, Eric Moore sent me a fusion-tip
version of the driver, which we have been using ever since. However, that
version has issues with SAS HBAs and probably won't work with recent kernel
versions. So I spent quite some time figuring out why the fusion-tip
version (4.x) of the driver doesn't have the issue.

Below are some examples of these issues. Errors on one of the attached
SCSI devices were simulated with lsiutil, directed at one of the attached
devices on one of the ports (5 0 4 0).

Unpatched 2.6.26 + a few scsi diagnostics and error handler patches:

[  224.819697] sd 5:0:4:0: last recovery: 4294911483, now: 4294948403
[  224.826142] sd 5:0:4:0: [sdc] Result: hostbyte=DID_OK driverbyte=DRIVER_OK,SUGGEST_OK
[  224.831676] sd 5:0:4:0: [sdc] CDB: Write(16): 8a 00 00 00 00 01 0c 27 2e 98 00 00 04 00 00 00
[  224.842803] sd 5:0:4:0: Activating scsi error recovery (1)
[  224.857824] sd 5:0:4:0: trying to abort command
[  224.865697] mptscsih: ioc1: attempting task abort! (sc=ffff8100f8f10000)
[  224.870572] sd 5:0:4:0: [sdc] CDB: Write(16): 8a 00 00 00 00 01 0c 27 2e 98 00 00 04 00 00 00
[  227.047968] mptbase: ioc1: Initiating recovery
[  229.481849] sd 5:0:4:0: mptscsih: ioc1: completing cmds: fw_channel 0, fw_id 4, sc=ffff8100f8fbb180, mf = ffff8100
[...]
[  364.322013] mptscsih: ioc1: bus reset: SUCCESS (sc=ffff8100f8f11b80)
[  371.924342] sd 4:0:2:0: scmd retry 6/6
[  371.928364] sd 4:0:2:0: last recovery: 0, now: 4294985148
[  371.932924] sd 4:0:2:0: [sda] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK,SUGGEST_OK
[  371.932924] sd 4:0:2:0: [sda] CDB: Write(16): 8a 00 00 00 00 01 31 8b 4a 4e 00 00 00 39 00 00
[  371.932924] sd 4:0:2:0: Activating scsi error recovery (1)
[  371.960382] sd 4:0:2:0: Sending BDR 0xffff81007eaf2538
[  371.984936] sd 4:0:2:0: trying device reset
[  371.989426] mptscsih: ioc0: attempting target reset! (sc=ffff81007eb7c780)

As you can see, target 4 0 2 0, which is on ioc0, suddenly fails as well. In the end:

[  398.596119] sd 4:0:2:0: [sda] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK,SUGGEST_OK
[  398.605291] end_request: I/O error, dev sda, sector 5126179406
[  398.612360] end_request: I/O error, dev sda, sector 5126179406
[  398.617818]  target4:0:2: Beginning Domain Validation

So the innocent device sda, which is a different device from the one where the error was injected, failed.

Now the same with the patches applied, but with the soft reset handler deactivated:

[  912.861708] sd 5:0:4:0: last recovery: 4295082734, now: 4295120387
[  912.868130] sd 5:0:4:0: [sdc] Result: hostbyte=DID_OK driverbyte=DRIVER_OK,SUGGEST_
[  912.873757] sd 5:0:4:0: [sdc] CDB: Write(10): 2a 00 73 11 33 08 00 04 00 00
[  912.873757] sd 5:0:4:0: Activating scsi error recovery (2)
[  912.889492] sd 5:0:4:0: trying to abort command
[  912.894118] mptscsih: ioc1: attempting task abort! (sc=ffff8100e361d180)
[  912.900951] sd 5:0:4:0: [sdc] CDB: Write(10): 2a 00 73 11 33 08 00 04 00 00
[  913.025771] mptscsih: ioc1: task abort: FAILED (sc=ffff8100e361d180)
[  913.032269] sd 5:0:4:0: Sending BDR 0xffff8100f99e1428
[  913.040264] sd 5:0:4:0: trying device reset
[  913.044597] mptscsih: ioc1: attempting target reset! (sc=ffff8100e361d180)
[  913.049955] sd 5:0:4:0: [sdc] CDB: Write(10): 2a 00 73 11 33 08 00 04 00 00
[  913.177284] mptscsih: ioc1: target reset: FAILED (sc=ffff8100e361d180)
[  913.181946] Sending BRST chan: 0
[  913.185945] sd 5:0:4:0: trying bus reset
[  913.189974] mptscsih: ioc1: attempting bus reset! (sc=ffff8100e361d180)
[  913.197310] sd 5:0:4:0: [sdc] CDB: Write(10): 2a 00 73 11 33 08 00 04 00 00
[  913.325079] mptscsih: ioc1: bus reset: FAILED (sc=ffff8100e361d180)
[  913.329668] sd 5:0:4:0: trying host reset
[  913.333864] mptscsih: ioc1: attempting host reset! (sc=ffff8100e361d180)
[  913.341832] mptscsih: ioc1: Skipping hard reset in order to prevent failures on ioc
[  913.349821] mptscsih: ioc1: host reset: FAILED (sc=ffff8100e361d180)
[  913.356704] sd 5:0:4:0: Device offlined - not ready after error recovery
[  913.363199] sd 5:0:4:0: [sdc] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT,SUGGEST_OK

=> The device was not recovered, but at least 4 0 2 0 didn't fail :)

Now with all patches applied:

[  214.903699] sd 5:0:4:0: last recovery: 0, now: 4294945953
[  214.910652] sd 5:0:4:0: [sdc] Result: hostbyte=DID_OK driverbyte=DRIVER_OK,SUGGEST_OK
[  214.918652] sd 5:0:4:0: [sdc] CDB: Write(16): 8a 00 00 00 00 01 31 8b 9c e7 00 00 00 39 00 00
[  214.918652] sd 5:0:4:0: Activating scsi error recovery (1)
[  214.934655] sd 5:0:4:0: trying to abort command
[  214.939581] mptscsih: ioc1: attempting task abort! (sc=ffff8100f9be0c80)
[  214.947581] sd 5:0:4:0: [sdc] CDB: Write(16): 8a 00 00 00 00 01 31 8b 9c e7 00 00 00 39 00 00
[  215.077430] mptscsih: ioc1: task abort: FAILED (sc=ffff8100f9be0c80)
[  215.083645] sd 5:0:4:0: Sending BDR 0xffff81007eb51428
[  215.090298] sd 5:0:4:0: trying device reset
[  215.094810] mptscsih: ioc1: attempting target reset! (sc=ffff8100f9be0c80)
[  215.101917] sd 5:0:4:0: [sdc] CDB: Write(16): 8a 00 00 00 00 01 31 8b 9c e7 00 00 00 39 00 00
[  215.229659] mptscsih: ioc1: target reset: FAILED (sc=ffff8100f9be0c80)
[  215.236367] Sending BRST chan: 0
[  215.240173] sd 5:0:4:0: trying bus reset
[  215.244313] mptscsih: ioc1: attempting bus reset! (sc=ffff8100f9be0c80)
[  215.251731] sd 5:0:4:0: [sdc] CDB: Write(16): 8a 00 00 00 00 01 31 8b 9c e7 00 00 00 39 00 00
[  215.382449] mptscsih: ioc1: bus reset: FAILED (sc=ffff8100f9be0c80)
[  215.388946] sd 5:0:4:0: trying host reset
[  215.393162] mptscsih: ioc1: attempting host reset! (sc=ffff8100f9be0c80)
[  215.400489] sd 5:0:4:0: mptscsih: ioc1: completing cmds: fw_channel 0, fw_id 4, sc=ffff8100f9be0c80, mf = ffff8105
[  217.317914] mptbase: ioc1: SoftResetHandler: completed (1 seconds): SUCCESS
[  217.324924] mptscsih: ioc1: host reset: SUCCESS (sc=ffff8100f9be0c80)
[  227.546452]  target5:0:4: Beginning Domain Validation
[  227.578775]  target5:0:4: Ending Domain Validation
[  227.584099]  target5:0:4: FAST-160 WIDE SCSI 320.0 MB/s DT IU QAS PCOMP (6.25 ns, offset 127)
[  227.596959]  target5:0:5: Beginning Domain Validation
[  227.651196]  target5:0:5: Ending Domain Validation
[  227.656977]  target5:0:5: FAST-160 WIDE SCSI 320.0 MB/s DT IU QAS PCOMP (6.25 ns, offset 127)


-- 
Bernd Schubert
Q-Leap Networks GmbH

* [PATCH 1/5] scsi abort
  2008-09-12 18:57 [PATCH 0/5] mpt fusion error handler patches Bernd Schubert
@ 2008-09-12 18:59 ` Bernd Schubert
  2008-09-12 19:00 ` [PATCH 2/5] fusion reset handler Bernd Schubert
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 9+ messages in thread
From: Bernd Schubert @ 2008-09-12 18:59 UTC (permalink / raw)
  To: Linux SCSI Mailing List; +Cc: Eric Moore, Sathya Prakash

For unknown reasons, issuing task aborts through mptscsih_TMHandler()
causes trouble on the second port of dual-port 53C1030-based HBAs.
mptscsih_IssueTaskMgmt() takes the same arguments as
mptscsih_TMHandler() but doesn't have this issue.
This is a backport from the fusion-4.x driver from LSI.

Signed-off-by: Bernd Schubert <bs@q-leap.de>

 drivers/message/fusion/mptscsih.c |   55 ++++++++++++++++++----------
  1 file changed, 37 insertions(+), 18 deletions(-)


Index: linux-2.6.26/drivers/message/fusion/mptscsih.c
===================================================================
--- linux-2.6.26.orig/drivers/message/fusion/mptscsih.c
+++ linux-2.6.26/drivers/message/fusion/mptscsih.c
@@ -1810,7 +1810,22 @@ mptscsih_abort(struct scsi_cmnd * SCpnt)
 		    ioc->name, SCpnt));
 		SCpnt->result = DID_NO_CONNECT << 16;
 		SCpnt->scsi_done(SCpnt);
-		retval = 0;
+		retval = SUCCESS;
+		goto out;
+	}
+
+	/* Find this command
+	 */
+	scpnt_idx = SCPNT_TO_LOOKUP_IDX(ioc, SCpnt);
+	if (scpnt_idx < 0) {
+		/* Cmd not found in ScsiLookup.
+		 * Do OS callback.
+		 */
+		SCpnt->result = DID_RESET << 16;
+		dtmprintk(ioc, printk(MYIOC_s_DEBUG_FMT
+		    "task abort: command not in the active list! (sc=%p)\n",
+		    ioc->name, SCpnt));
+		retval = SUCCESS;
 		goto out;
 	}
 
@@ -1825,21 +1840,16 @@ mptscsih_abort(struct scsi_cmnd * SCpnt)
 		goto out;
 	}
 
-	/* Find this command
+	/* Task aborts are not supported for volumes.
 	 */
-	if ((scpnt_idx = SCPNT_TO_LOOKUP_IDX(ioc, SCpnt)) < 0) {
-		/* Cmd not found in ScsiLookup.
-		 * Do OS callback.
-		 */
+	if (vdevice->vtarget->raidVolume) {
+		dtmprintk(ioc, printk(MYIOC_s_DEBUG_FMT
+		    "task abort: raid volume (sc=%p)\n",
+		    ioc->name, SCpnt));
 		SCpnt->result = DID_RESET << 16;
 		dtmprintk(ioc, printk(MYIOC_s_DEBUG_FMT "task abort: "
 		   "Command not in the active list! (sc=%p)\n", ioc->name,
 		   SCpnt));
-		retval = 0;
-		goto out;
-	}
-
-	if (hd->resetPending) {
 		retval = FAILED;
 		goto out;
 	}
@@ -1859,22 +1869,31 @@ mptscsih_abort(struct scsi_cmnd * SCpnt)
 
 	hd->abortSCpnt = SCpnt;
 
-	retval = mptscsih_TMHandler(hd, MPI_SCSITASKMGMT_TASKTYPE_ABORT_TASK,
+	mptscsih_IssueTaskMgmt(hd, MPI_SCSITASKMGMT_TASKTYPE_ABORT_TASK,
 	    vdevice->vtarget->channel, vdevice->vtarget->id, vdevice->lun,
 	    ctx2abort, mptscsih_get_tm_timeout(ioc));
 
+	/* check to see whether command actually completed and/or
+	 * terminated
+	 */
 	if (SCPNT_TO_LOOKUP_IDX(ioc, SCpnt) == scpnt_idx &&
-	    SCpnt->serial_number == sn)
+	    SCpnt->serial_number == sn) {
+		dtmprintk(ioc, printk(MYIOC_s_DEBUG_FMT
+		    "task abort: command still in active list! (sc=%p)\n",
+		    ioc->name, SCpnt));
 		retval = FAILED;
+	} else {
+		dtmprintk(ioc, printk(MYIOC_s_DEBUG_FMT
+		    "task abort: command cleared from active list! (sc=%p)\n",
+		    ioc->name, SCpnt));
+		retval = SUCCESS;
+	}
 
  out:
 	printk(MYIOC_s_INFO_FMT "task abort: %s (sc=%p)\n",
-	    ioc->name, ((retval == 0) ? "SUCCESS" : "FAILED" ), SCpnt);
+	    ioc->name, ((retval == SUCCESS) ? "SUCCESS" : "FAILED"), SCpnt);
 
-	if (retval == 0)
-		return SUCCESS;
-	else
-		return FAILED;
+	return retval;
 }
 
 /*=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=*/

* [PATCH 2/5] fusion reset handler
  2008-09-12 18:57 [PATCH 0/5] mpt fusion error handler patches Bernd Schubert
  2008-09-12 18:59 ` [PATCH 1/5] scsi abort Bernd Schubert
@ 2008-09-12 19:00 ` Bernd Schubert
  2008-09-12 19:01 ` [PATCH 3/5] fusion remove the TMHandler Bernd Schubert
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 9+ messages in thread
From: Bernd Schubert @ 2008-09-12 19:00 UTC (permalink / raw)
  To: Linux SCSI Mailing List; +Cc: Eric Moore, Sathya Prakash

On dual-port 53C1030-based HBAs such as the LSI22320R, the hard reset handler
causes DID_SOFT_ERROR on innocent devices on the second port.
Introduce mpt_SoftResetHandler(), which doesn't cause this issue, and
slightly improve mpt_HardResetHandler().
This is a backport from the fusion-4.x driver available from LSI.

Signed-off-by: Bernd Schubert <bs@q-leap.de>

 drivers/message/fusion/mptbase.c  |  187 ++++++++++++++++++++++++----
 drivers/message/fusion/mptbase.h  |   11 +
 drivers/message/fusion/mptctl.c   |    7 -
 drivers/message/fusion/mptsas.c   |    6
 drivers/message/fusion/mptscsih.c |   42 +++---
 5 files changed, 200 insertions(+), 53 deletions(-)


Index: linux-2.6.26/drivers/message/fusion/mptbase.c
===================================================================
--- linux-2.6.26.orig/drivers/message/fusion/mptbase.c
+++ linux-2.6.26/drivers/message/fusion/mptbase.c
@@ -6232,6 +6232,129 @@ mpt_print_ioc_summary(MPT_ADAPTER *ioc, 
 /*
  *	Reset Handling
  */
+
+/**
+ *	mpt_SoftResetHandler - Issues a less expensive reset
+ *	@ioc: Pointer to MPT_ADAPTER structure
+ *	@sleepFlag: Indicates if sleep or schedule must be called.
+
+ *
+ *	Returns 0 for SUCCESS or -1 if FAILED.
+ *
+ *	Message Unit Reset - instructs the IOC to reset the Reply Post and
+ *	Free FIFO's. All the Message Frames on Reply Free FIFO are discarded.
+ *	All posted buffers are freed, and event notification is turned off.
+ *	IOC doesnt reply to any outstanding request. This will transfer IOC
+ *	to READY state.
+ **/
+int
+mpt_SoftResetHandler(MPT_ADAPTER *ioc, int sleepFlag)
+{
+	int		 rc;
+	int		 ii;
+	u8		 cb_idx;
+	unsigned long	 flags;
+	u32		 ioc_state;
+	unsigned long	 time_count;
+
+	dtmprintk(ioc, printk(MYIOC_s_DEBUG_FMT "SoftResetHandler Entered!\n",
+			      ioc->name));
+
+	ioc_state = mpt_GetIocState(ioc, 0) & MPI_IOC_STATE_MASK;
+	if (ioc_state == MPI_IOC_STATE_FAULT ||
+	    ioc_state == MPI_IOC_STATE_RESET) {
+		dtmprintk(ioc, printk(MYIOC_s_DEBUG_FMT
+		    "skipping, either in FAULT or RESET state!\n", ioc->name));
+		return -1;
+	}
+
+	spin_lock_irqsave(&ioc->diagLock, flags);
+	if (ioc->ioc_reset_in_progress) {
+		spin_unlock_irqrestore(&ioc->diagLock, flags);
+		return -1;
+	}
+	ioc->ioc_reset_in_progress = 1;
+	spin_unlock_irqrestore(&ioc->diagLock, flags);
+
+	rc = -1;
+
+	for (cb_idx = MPT_MAX_PROTOCOL_DRIVERS-1; cb_idx; cb_idx--) {
+		if (MptResetHandlers[cb_idx])
+			mpt_signal_reset(cb_idx, ioc, MPT_IOC_SETUP_RESET);
+	}
+
+	for (cb_idx = MPT_MAX_PROTOCOL_DRIVERS-1; cb_idx; cb_idx--) {
+		if (MptResetHandlers[cb_idx])
+			mpt_signal_reset(cb_idx, ioc, MPT_IOC_PRE_RESET);
+	}
+
+	/* Disable reply interrupts (also blocks FreeQ) */
+	CHIPREG_WRITE32(&ioc->chip->IntMask, 0xFFFFFFFF);
+	ioc->active = 0;
+	time_count = jiffies;
+	rc = SendIocReset(ioc, MPI_FUNCTION_IOC_MESSAGE_UNIT_RESET, sleepFlag);
+	if (rc != 0)
+		goto out;
+	ioc_state = mpt_GetIocState(ioc, 0) & MPI_IOC_STATE_MASK;
+	if (ioc_state != MPI_IOC_STATE_READY)
+		goto out;
+
+	for (ii = 0; ii < 5; ii++) {
+		/* Get IOC facts! Allow 5 retries */
+		rc = GetIocFacts(ioc, sleepFlag, MPT_HOSTEVENT_IOC_RECOVER);
+		if (rc == 0)
+			break;
+		if (sleepFlag == CAN_SLEEP)
+			msleep(100);
+		else
+			mdelay(100);
+	}
+	if (ii == 5)
+		goto out;
+
+	rc = PrimeIocFifos(ioc);
+	if (rc != 0)
+		goto out;
+
+	rc = SendIocInit(ioc, sleepFlag);
+	if (rc != 0)
+		goto out;
+
+	rc = SendEventNotification(ioc, 1);
+	if (rc != 0)
+		goto out;
+
+	if (ioc->hard_resets < -1)
+		ioc->hard_resets++;
+
+	/*
+	 * At this point, we know soft reset succeeded.
+	 */
+
+	ioc->active = 1;
+	CHIPREG_WRITE32(&ioc->chip->IntMask, MPI_HIM_DIM);
+
+ out:
+	spin_lock_irqsave(&ioc->diagLock, flags);
+	ioc->ioc_reset_in_progress = 0;
+	ioc->taskmgmt_quiesce_io = 0;
+	ioc->taskmgmt_in_progress = 0;
+	spin_unlock_irqrestore(&ioc->diagLock, flags);
+
+	if (ioc->active) {	/* otherwise, hard reset coming */
+		for (cb_idx = MPT_MAX_PROTOCOL_DRIVERS-1; cb_idx; cb_idx--) {
+			if (MptResetHandlers[cb_idx])
+				mpt_signal_reset(cb_idx, ioc, MPT_IOC_POST_RESET);
+		}
+	}
+
+	printk(MYIOC_s_INFO_FMT "SoftResetHandler: completed (%d seconds): %s\n",
+	    ioc->name, jiffies_to_msecs(jiffies - time_count)/1000,
+	    ((rc == 0) ? "SUCCESS" : "FAILED"));
+
+	return rc;
+}
+
 /*=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=*/
 /**
  *	mpt_HardResetHandler - Generic reset handler
@@ -6253,9 +6376,10 @@ int
 mpt_HardResetHandler(MPT_ADAPTER *ioc, int sleepFlag)
 {
 	int		 rc;
+	u8		 cb_idx;
 	unsigned long	 flags;
+	unsigned long	 time_count;
 
-	dtmprintk(ioc, printk(MYIOC_s_DEBUG_FMT "HardResetHandler Entered!\n", ioc->name));
 #ifdef MFCNT
 	printk(MYIOC_s_INFO_FMT "HardResetHandler Entered!\n", ioc->name);
 	printk("MF count 0x%x !\n", ioc->mfcnt);
@@ -6265,12 +6389,15 @@ mpt_HardResetHandler(MPT_ADAPTER *ioc, i
 	 * mpt_do_ioc_recovery at any instant in time.
 	 */
 	spin_lock_irqsave(&ioc->diagLock, flags);
-	if ((ioc->diagPending) || (ioc->alt_ioc && ioc->alt_ioc->diagPending)){
+	if (ioc->ioc_reset_in_progress) {
 		spin_unlock_irqrestore(&ioc->diagLock, flags);
 		return 0;
 	} else {
 		ioc->diagPending = 1;
 	}
+	ioc->ioc_reset_in_progress = 1;
+	if (ioc->alt_ioc)
+		ioc->alt_ioc->ioc_reset_in_progress = 1;
 	spin_unlock_irqrestore(&ioc->diagLock, flags);
 
 	/* FIXME: If do_ioc_recovery fails, repeat....
@@ -6281,39 +6408,46 @@ mpt_HardResetHandler(MPT_ADAPTER *ioc, i
 	 * Prevents timeouts occurring during a diagnostic reset...very bad.
 	 * For all other protocol drivers, this is a no-op.
 	 */
-	{
-		u8	 cb_idx;
-		int	 r = 0;
-
-		for (cb_idx = MPT_MAX_PROTOCOL_DRIVERS-1; cb_idx; cb_idx--) {
-			if (MptResetHandlers[cb_idx]) {
-				dtmprintk(ioc, printk(MYIOC_s_DEBUG_FMT "Calling IOC reset_setup handler #%d\n",
-						ioc->name, cb_idx));
-				r += mpt_signal_reset(cb_idx, ioc, MPT_IOC_SETUP_RESET);
-				if (ioc->alt_ioc) {
-					dtmprintk(ioc, printk(MYIOC_s_DEBUG_FMT "Calling alt-%s setup reset handler #%d\n",
-							ioc->name, ioc->alt_ioc->name, cb_idx));
-					r += mpt_signal_reset(cb_idx, ioc->alt_ioc, MPT_IOC_SETUP_RESET);
-				}
-			}
+	for (cb_idx = MPT_MAX_PROTOCOL_DRIVERS-1; cb_idx; cb_idx--) {
+		if (MptResetHandlers[cb_idx]) {
+			mpt_signal_reset(cb_idx, ioc, MPT_IOC_SETUP_RESET);
+			if (ioc->alt_ioc)
+				mpt_signal_reset(cb_idx, ioc->alt_ioc, MPT_IOC_SETUP_RESET);
 		}
 	}
 
-	if ((rc = mpt_do_ioc_recovery(ioc, MPT_HOSTEVENT_IOC_RECOVER, sleepFlag)) != 0) {
-		printk(MYIOC_s_WARN_FMT "Cannot recover rc = %d!\n", ioc->name, rc);
+	time_count = jiffies;
+	rc = mpt_do_ioc_recovery(ioc, MPT_HOSTEVENT_IOC_RECOVER, sleepFlag);
+	if (rc != 0) {
+		printk(KERN_WARNING MYNAM ": WARNING - (%d) Cannot recover %s\n",
+			rc, ioc->name);
+	} else {
+		if (ioc->hard_resets < -1)
+			ioc->hard_resets++;
 	}
-	ioc->reload_fw = 0;
-	if (ioc->alt_ioc)
-		ioc->alt_ioc->reload_fw = 0;
 
 	spin_lock_irqsave(&ioc->diagLock, flags);
-	ioc->diagPending = 0;
-	if (ioc->alt_ioc)
-		ioc->alt_ioc->diagPending = 0;
+	ioc->ioc_reset_in_progress = 0;
+	ioc->taskmgmt_quiesce_io = 0;
+	ioc->taskmgmt_in_progress = 0;
+	if (ioc->alt_ioc) {
+		ioc->alt_ioc->ioc_reset_in_progress = 0;
+		ioc->alt_ioc->taskmgmt_quiesce_io   = 0;
+		ioc->alt_ioc->taskmgmt_in_progress  = 0;
+	}
 	spin_unlock_irqrestore(&ioc->diagLock, flags);
 
-	dtmprintk(ioc, printk(MYIOC_s_DEBUG_FMT "HardResetHandler rc = %d!\n", ioc->name, rc));
+	for (cb_idx = MPT_MAX_PROTOCOL_DRIVERS-1; cb_idx; cb_idx--) {
+		if (MptResetHandlers[cb_idx]) {
+			mpt_signal_reset(cb_idx, ioc, MPT_IOC_POST_RESET);
+			if (ioc->alt_ioc)
+				mpt_signal_reset(cb_idx, ioc->alt_ioc, MPT_IOC_POST_RESET);
+		}
+	}
 
+	dtmprintk(ioc, printk(MYIOC_s_DEBUG_FMT "HardResetHandler: completed (%d seconds): %s\n",
+	    ioc->name, jiffies_to_msecs(jiffies - time_count)/1000,
+	    ((rc == 0) ? "SUCCESS" : "FAILED")));
 	return rc;
 }
 
@@ -7474,6 +7608,7 @@ EXPORT_SYMBOL(mpt_send_handshake_request
 EXPORT_SYMBOL(mpt_verify_adapter);
 EXPORT_SYMBOL(mpt_GetIocState);
 EXPORT_SYMBOL(mpt_print_ioc_summary);
+EXPORT_SYMBOL(mpt_SoftResetHandler);
 EXPORT_SYMBOL(mpt_HardResetHandler);
 EXPORT_SYMBOL(mpt_config);
 EXPORT_SYMBOL(mpt_findImVolumes);
Index: linux-2.6.26/drivers/message/fusion/mptbase.h
===================================================================
--- linux-2.6.26.orig/drivers/message/fusion/mptbase.h
+++ linux-2.6.26/drivers/message/fusion/mptbase.h
@@ -699,6 +699,9 @@ typedef struct _MPT_ADAPTER
 	MPT_SAS_MGMT		 sas_mgmt;
 	struct work_struct	 sas_persist_task;
 
+	int			 taskmgmt_in_progress;
+	u8			 taskmgmt_quiesce_io;
+
 	struct work_struct	 fc_setup_reset_work;
 	struct list_head	 fc_rports;
 	struct work_struct	 fc_lsc_work;
@@ -707,6 +710,11 @@ typedef struct _MPT_ADAPTER
 	struct work_struct	 fc_rescan_work;
 	char			 fc_rescan_work_q_name[KOBJ_NAME_LEN];
 	struct workqueue_struct *fc_rescan_work_q;
+
+	unsigned long		  hard_resets;	/* driver forced bus resets count */
+	unsigned long		  soft_resets;	/* fw/external bus resets count */
+	u8			  ioc_reset_in_progress;
+
 	struct scsi_cmnd	**ScsiLookup;
 	spinlock_t		  scsi_lookup_lock;
 } MPT_ADAPTER;
@@ -836,8 +844,6 @@ typedef struct _MPT_SCSI_HOST {
 	MPT_FRAME_HDR		 *cmdPtr;		/* Ptr to nonOS request */
 	struct scsi_cmnd	 *abortSCpnt;
 	MPT_LOCAL_REPLY		  localReply;		/* internal cmd reply struct */
-	unsigned long		  hard_resets;		/* driver forced bus resets count */
-	unsigned long		  soft_resets;		/* fw/external bus resets count */
 	unsigned long		  timeouts;		/* cmd timeouts */
 	ushort			  sel_timeout[MPT_MAX_FC_DEVICES];
 	char 			  *info_kbuf;
@@ -908,6 +914,7 @@ extern int	 mpt_verify_adapter(int iocid
 extern u32	 mpt_GetIocState(MPT_ADAPTER *ioc, int cooked);
 extern void	 mpt_print_ioc_summary(MPT_ADAPTER *ioc, char *buf, int *size, int len, int showlan);
 extern int	 mpt_HardResetHandler(MPT_ADAPTER *ioc, int sleepFlag);
+extern int	 mpt_SoftResetHandler(MPT_ADAPTER *ioc, int sleepFlag);
 extern int	 mpt_config(MPT_ADAPTER *ioc, CONFIGPARMS *cfg);
 extern int	 mpt_alloc_fw_memory(MPT_ADAPTER *ioc, int size);
 extern void	 mpt_free_fw_memory(MPT_ADAPTER *ioc);
Index: linux-2.6.26/drivers/message/fusion/mptscsih.c
===================================================================
--- linux-2.6.26.orig/drivers/message/fusion/mptscsih.c
+++ linux-2.6.26/drivers/message/fusion/mptscsih.c
@@ -1621,8 +1621,8 @@ mptscsih_TMHandler(MPT_SCSI_HOST *hd, u8
 
 	/* Isse the Task Mgmt request.
 	 */
-	if (hd->hard_resets < -1)
-		hd->hard_resets++;
+	if (ioc->hard_resets < -1)
+		ioc->hard_resets++;
 
 	rc = mptscsih_IssueTaskMgmt(hd, type, channel, id, lun,
 	    ctx2abort, timeout);
@@ -1724,7 +1724,9 @@ mptscsih_IssueTaskMgmt(MPT_SCSI_HOST *hd
 			ioc, mf));
 		dtmprintk(ioc, printk(MYIOC_s_DEBUG_FMT "Calling HardReset! \n",
 			 ioc->name));
-		retval = mpt_HardResetHandler(ioc, CAN_SLEEP);
+		retval = mpt_SoftResetHandler(ioc, CAN_SLEEP);
+		if (retval != 0)
+			retval = mpt_HardResetHandler(ioc, CAN_SLEEP);
 		dtmprintk(ioc, printk(MYIOC_s_DEBUG_FMT "rc=%d \n",
 			 ioc->name, retval));
 		goto fail_out;
@@ -2018,11 +2020,12 @@ int
 mptscsih_host_reset(struct scsi_cmnd *SCpnt)
 {
 	MPT_SCSI_HOST *  hd;
-	int              retval;
+	int		 retval, status;
 	MPT_ADAPTER	*ioc;
 
 	/*  If we can't locate the host to reset, then we failed. */
-	if ((hd = shost_priv(SCpnt->device->host)) == NULL){
+	hd = shost_priv(SCpnt->device->host);
+	if (hd == NULL) {
 		printk(KERN_ERR MYNAM ": host reset: "
 		    "Can't locate host! (sc=%p)\n", SCpnt);
 		return FAILED;
@@ -2035,21 +2038,19 @@ mptscsih_host_reset(struct scsi_cmnd *SC
 	/*  If our attempts to reset the host failed, then return a failed
 	 *  status.  The host will be taken off line by the SCSI mid-layer.
 	 */
-	if (mpt_HardResetHandler(ioc, CAN_SLEEP) < 0) {
-		retval = FAILED;
-	} else {
-		/*  Make sure TM pending is cleared and TM state is set to
-		 *  NONE.
-		 */
-		retval = 0;
-		hd->tmPending = 0;
-		hd->tmState = TM_STATE_NONE;
-	}
+	retval = mpt_SoftResetHandler(ioc, CAN_SLEEP);
+	if (retval != 0)
+		retval = mpt_HardResetHandler(ioc, CAN_SLEEP);
+
+	if (retval < 0)
+		status = FAILED;
+	else
+		status = SUCCESS;
 
 	printk(MYIOC_s_INFO_FMT "host reset: %s (sc=%p)\n",
-	    ioc->name, ((retval == 0) ? "SUCCESS" : "FAILED" ), SCpnt);
+	    ioc->name, ((status == SUCCESS) ? "SUCCESS" : "FAILED"), SCpnt);
 
-	return retval;
+	return status;
 }
 
 /*=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=*/
@@ -2238,7 +2239,8 @@ mptscsih_taskmgmt_complete(MPT_ADAPTER *
 		 */
 		if (iocstatus == MPI_IOCSTATUS_SCSI_TASK_MGMT_FAILED ||
 		    hd->cmdPtr)
-			if (mpt_HardResetHandler(ioc, NO_SLEEP) < 0)
+			if (mpt_SoftResetHandler(ioc, NO_SLEEP) &&
+			   (mpt_HardResetHandler(ioc, NO_SLEEP) < 0))
 				printk(MYIOC_s_WARN_FMT " Firmware Reload FAILED!!\n", ioc->name);
 		break;
 
@@ -2760,8 +2762,8 @@ mptscsih_event_process(MPT_ADAPTER *ioc,
 		break;
 	case MPI_EVENT_IOC_BUS_RESET:			/* 04 */
 	case MPI_EVENT_EXT_BUS_RESET:			/* 05 */
-		if (hd && (ioc->bus_type == SPI) && (hd->soft_resets < -1))
-			hd->soft_resets++;
+		if (hd && (ioc->bus_type == SPI) && (ioc->soft_resets < -1))
+			ioc->soft_resets++;
 		break;
 	case MPI_EVENT_LOGOUT:				/* 09 */
 		/* FIXME! */
Index: linux-2.6.26/drivers/message/fusion/mptctl.c
===================================================================
--- linux-2.6.26.orig/drivers/message/fusion/mptctl.c
+++ linux-2.6.26/drivers/message/fusion/mptctl.c
@@ -323,7 +323,8 @@ static void mptctl_timeout_expired (MPT_
 		 */
 		dctlprintk(ioctl->ioc, printk(MYIOC_s_DEBUG_FMT "Calling HardReset! \n",
 			 ioctl->ioc->name));
-		mpt_HardResetHandler(ioctl->ioc, CAN_SLEEP);
+		if (mpt_SoftResetHandler(ioctl->ioc, CAN_SLEEP) != 0)
+			mpt_HardResetHandler(ioctl->ioc, CAN_SLEEP);
 	}
 	return;
 
@@ -2467,8 +2468,8 @@ mptctl_hp_hostinfo(unsigned long arg, un
 		MPT_SCSI_HOST *hd =  shost_priv(ioc->sh);
 
 		if (hd && (cim_rev == 1)) {
-			karg.hard_resets = hd->hard_resets;
-			karg.soft_resets = hd->soft_resets;
+			karg.hard_resets = ioc->hard_resets;
+			karg.soft_resets = ioc->soft_resets;
 			karg.timeouts = hd->timeouts;
 		}
 	}
Index: linux-2.6.26/drivers/message/fusion/mptsas.c
===================================================================
--- linux-2.6.26.orig/drivers/message/fusion/mptsas.c
+++ linux-2.6.26/drivers/message/fusion/mptsas.c
@@ -1167,7 +1167,8 @@ static int mptsas_phy_reset(struct sas_p
 	if (!timeleft) {
 		/* On timeout reset the board */
 		mpt_free_msg_frame(ioc, mf);
-		mpt_HardResetHandler(ioc, CAN_SLEEP);
+		if (mpt_SoftResetHandler(ioc, CAN_SLEEP) != 0)
+			mpt_HardResetHandler(ioc, CAN_SLEEP);
 		error = -ETIMEDOUT;
 		goto out_unlock;
 	}
@@ -1345,7 +1346,8 @@ static int mptsas_smp_handler(struct Scs
 	if (!timeleft) {
 		printk(MYIOC_s_ERR_FMT "%s: smp timeout!\n", ioc->name, __FUNCTION__);
 		/* On timeout reset the board */
-		mpt_HardResetHandler(ioc, CAN_SLEEP);
+		if (mpt_SoftResetHandler(ioc, CAN_SLEEP) != 0)
+			mpt_HardResetHandler(ioc, CAN_SLEEP);
 		ret = -ETIMEDOUT;
 		goto unmap;
 	}

* [PATCH 3/5] fusion remove the TMHandler
  2008-09-12 18:57 [PATCH 0/5] mpt fusion error handler patches Bernd Schubert
  2008-09-12 18:59 ` [PATCH 1/5] scsi abort Bernd Schubert
  2008-09-12 19:00 ` [PATCH 2/5] fusion reset handler Bernd Schubert
@ 2008-09-12 19:01 ` Bernd Schubert
  2008-09-12 19:02 ` [PATCH 4/5] fusion prevent DV deadlock Bernd Schubert
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 9+ messages in thread
From: Bernd Schubert @ 2008-09-12 19:01 UTC (permalink / raw)
  To: Linux SCSI Mailing List; +Cc: Eric Moore, Sathya Prakash

mptscsih_TMHandler() appears to be redundant with mptscsih_IssueTaskMgmt().
Since there is no need for two functions doing the same thing, replace
all calls to mptscsih_TMHandler() with mptscsih_IssueTaskMgmt() and remove
mptscsih_TMHandler().

Signed-off-by: Bernd Schubert <bs@q-leap.de>

 drivers/message/fusion/mptscsih.c |  156 ----------------------------
 drivers/message/fusion/mptscsih.h |    2
 drivers/message/fusion/mptspi.c   |    2
 3 files changed, 6 insertions(+), 154 deletions(-)


Index: linux-2.6.26/drivers/message/fusion/mptscsih.c
===================================================================
--- linux-2.6.26.orig/drivers/message/fusion/mptscsih.c
+++ linux-2.6.26/drivers/message/fusion/mptscsih.c
@@ -92,11 +92,8 @@ static int	mptscsih_AddSGE(MPT_ADAPTER *
 				 SCSIIORequest_t *pReq, int req_idx);
 static void	mptscsih_freeChainBuffers(MPT_ADAPTER *ioc, int req_idx);
 static void	mptscsih_copy_sense_data(struct scsi_cmnd *sc, MPT_SCSI_HOST *hd, MPT_FRAME_HDR *mf, SCSIIOReply_t *pScsiReply);
-static int	mptscsih_tm_pending_wait(MPT_SCSI_HOST * hd);
 static int	mptscsih_tm_wait_for_completion(MPT_SCSI_HOST * hd, ulong timeout );
 
-static int	mptscsih_IssueTaskMgmt(MPT_SCSI_HOST *hd, u8 type, u8 channel, u8 id, int lun, int ctx2abort, ulong timeout);
-
 int		mptscsih_ioc_reset(MPT_ADAPTER *ioc, int post_reset);
 int		mptscsih_event_process(MPT_ADAPTER *ioc, EventNotificationReply_t *pEvReply);
 
@@ -1528,120 +1525,6 @@ mptscsih_freeChainBuffers(MPT_ADAPTER *i
 
 /*=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=*/
 /**
- *	mptscsih_TMHandler - Generic handler for SCSI Task Management.
- *	@hd: Pointer to MPT SCSI HOST structure
- *	@type: Task Management type
- *	@channel: channel number for task management
- *	@id: Logical Target ID for reset (if appropriate)
- *	@lun: Logical Unit for reset (if appropriate)
- *	@ctx2abort: Context for the task to be aborted (if appropriate)
- *	@timeout: timeout for task management control
- *
- *	Fall through to mpt_HardResetHandler if: not operational, too many
- *	failed TM requests or handshake failure.
- *
- *	Remark: Currently invoked from a non-interrupt thread (_bh).
- *
- *	Note: With old EH code, at most 1 SCSI TaskMgmt function per IOC
- *	will be active.
- *
- *	Returns 0 for SUCCESS, or %FAILED.
- **/
-int
-mptscsih_TMHandler(MPT_SCSI_HOST *hd, u8 type, u8 channel, u8 id, int lun, int ctx2abort, ulong timeout)
-{
-	MPT_ADAPTER	*ioc;
-	int		 rc = -1;
-	u32		 ioc_raw_state;
-	unsigned long	 flags;
-
-	ioc = hd->ioc;
-	dtmprintk(ioc, printk(MYIOC_s_DEBUG_FMT "TMHandler Entered!\n", ioc->name));
-
-	// SJR - CHECKME - Can we avoid this here?
-	// (mpt_HardResetHandler has this check...)
-	spin_lock_irqsave(&ioc->diagLock, flags);
-	if ((ioc->diagPending) || (ioc->alt_ioc && ioc->alt_ioc->diagPending)) {
-		spin_unlock_irqrestore(&ioc->diagLock, flags);
-		return FAILED;
-	}
-	spin_unlock_irqrestore(&ioc->diagLock, flags);
-
-	/*  Wait a fixed amount of time for the TM pending flag to be cleared.
-	 *  If we time out and not bus reset, then we return a FAILED status
-	 *  to the caller.
-	 *  The call to mptscsih_tm_pending_wait() will set the pending flag
-	 *  if we are
-	 *  successful. Otherwise, reload the FW.
-	 */
-	if (mptscsih_tm_pending_wait(hd) == FAILED) {
-		if (type == MPI_SCSITASKMGMT_TASKTYPE_ABORT_TASK) {
-			dtmprintk(ioc, printk(MYIOC_s_DEBUG_FMT "TMHandler abort: "
-			   "Timed out waiting for last TM (%d) to complete! \n",
-			   ioc->name, hd->tmPending));
-			return FAILED;
-		} else if (type == MPI_SCSITASKMGMT_TASKTYPE_TARGET_RESET) {
-			dtmprintk(ioc, printk(MYIOC_s_DEBUG_FMT "TMHandler target "
-				"reset: Timed out waiting for last TM (%d) "
-				"to complete! \n", ioc->name,
-				hd->tmPending));
-			return FAILED;
-		} else if (type == MPI_SCSITASKMGMT_TASKTYPE_RESET_BUS) {
-			dtmprintk(ioc, printk(MYIOC_s_DEBUG_FMT "TMHandler bus reset: "
-			   "Timed out waiting for last TM (%d) to complete! \n",
-			  ioc->name, hd->tmPending));
-			return FAILED;
-		}
-	} else {
-		spin_lock_irqsave(&ioc->FreeQlock, flags);
-		hd->tmPending |=  (1 << type);
-		spin_unlock_irqrestore(&ioc->FreeQlock, flags);
-	}
-
-	ioc_raw_state = mpt_GetIocState(ioc, 0);
-
-	if ((ioc_raw_state & MPI_IOC_STATE_MASK) != MPI_IOC_STATE_OPERATIONAL) {
-		printk(MYIOC_s_WARN_FMT
-			"TM Handler for type=%x: IOC Not operational (0x%x)!\n",
-			ioc->name, type, ioc_raw_state);
-		printk(MYIOC_s_WARN_FMT " Issuing HardReset!!\n", ioc->name);
-		if (mpt_HardResetHandler(ioc, CAN_SLEEP) < 0)
-			printk(MYIOC_s_WARN_FMT "TMHandler: HardReset "
-			    "FAILED!!\n", ioc->name);
-		return FAILED;
-	}
-
-	if (ioc_raw_state & MPI_DOORBELL_ACTIVE) {
-		printk(MYIOC_s_WARN_FMT
-			"TM Handler for type=%x: ioc_state: "
-			"DOORBELL_ACTIVE (0x%x)!\n",
-			ioc->name, type, ioc_raw_state);
-		return FAILED;
-	}
-
-	/* Isse the Task Mgmt request.
-	 */
-	if (ioc->hard_resets < -1)
-		ioc->hard_resets++;
-
-	rc = mptscsih_IssueTaskMgmt(hd, type, channel, id, lun,
-	    ctx2abort, timeout);
-	if (rc)
-		printk(MYIOC_s_INFO_FMT "Issue of TaskMgmt failed!\n",
-		       ioc->name);
-	else
-		dtmprintk(ioc, printk(MYIOC_s_DEBUG_FMT "Issue of TaskMgmt Successful!\n",
-			   ioc->name));
-
-	dtmprintk(ioc, printk(MYIOC_s_DEBUG_FMT
-			"TMHandler rc = %d!\n", ioc->name, rc));
-
-	return rc;
-}
-
-
-/*=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=*/
-/**
  *	mptscsih_IssueTaskMgmt - Generic send Task Management function.
  *	@hd: Pointer to MPT_SCSI_HOST structure
  *	@type: Task Management type
@@ -1659,7 +1542,7 @@ mptscsih_TMHandler(MPT_SCSI_HOST *hd, u8
  *	Returns 0 for SUCCESS, or FAILED.
  *
  **/
-static int
+int
 mptscsih_IssueTaskMgmt(MPT_SCSI_HOST *hd, u8 type, u8 channel, u8 id, int lun, int ctx2abort, ulong timeout)
 {
 	MPT_FRAME_HDR	*mf;
@@ -1946,7 +1829,7 @@ mptscsih_dev_reset(struct scsi_cmnd * SC
 		goto out;
 	}
 
-	retval = mptscsih_TMHandler(hd, MPI_SCSITASKMGMT_TASKTYPE_TARGET_RESET,
+	retval = mptscsih_IssueTaskMgmt(hd, MPI_SCSITASKMGMT_TASKTYPE_TARGET_RESET,
 	    vdevice->vtarget->channel, vdevice->vtarget->id, 0, 0,
 	    mptscsih_get_tm_timeout(ioc));
 
@@ -1995,7 +1878,7 @@ mptscsih_bus_reset(struct scsi_cmnd * SC
 		hd->timeouts++;
 
 	vdevice = SCpnt->device->hostdata;
-	retval = mptscsih_TMHandler(hd, MPI_SCSITASKMGMT_TASKTYPE_RESET_BUS,
+	retval = mptscsih_IssueTaskMgmt(hd, MPI_SCSITASKMGMT_TASKTYPE_RESET_BUS,
 	    vdevice->vtarget->channel, 0, 0, 0, mptscsih_get_tm_timeout(ioc));
 
 	printk(MYIOC_s_INFO_FMT "bus reset: %s (sc=%p)\n",
@@ -2055,37 +1938,6 @@ mptscsih_host_reset(struct scsi_cmnd *SC
 
 /*=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=*/
 /**
- *	mptscsih_tm_pending_wait - wait for pending task management request to complete
- *	@hd: Pointer to MPT host structure.
- *
- *	Returns {SUCCESS,FAILED}.
- */
-static int
-mptscsih_tm_pending_wait(MPT_SCSI_HOST * hd)
-{
-	unsigned long  flags;
-	int            loop_count = 4 * 10;  /* Wait 10 seconds */
-	int            status = FAILED;
-	MPT_ADAPTER	*ioc = hd->ioc;
-
-	do {
-		spin_lock_irqsave(&ioc->FreeQlock, flags);
-		if (hd->tmState == TM_STATE_NONE) {
-			hd->tmState = TM_STATE_IN_PROGRESS;
-			hd->tmPending = 1;
-			spin_unlock_irqrestore(&ioc->FreeQlock, flags);
-			status = SUCCESS;
-			break;
-		}
-		spin_unlock_irqrestore(&ioc->FreeQlock, flags);
-		msleep(250);
-	} while (--loop_count);
-
-	return status;
-}
-
-/*=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=*/
-/**
  *	mptscsih_tm_wait_for_completion - wait for completion of TM task
  *	@hd: Pointer to MPT host structure.
  *	@timeout: timeout value
@@ -3526,12 +3378,12 @@ EXPORT_SYMBOL(mptscsih_bus_reset);
 EXPORT_SYMBOL(mptscsih_host_reset);
 EXPORT_SYMBOL(mptscsih_bios_param);
 EXPORT_SYMBOL(mptscsih_io_done);
+EXPORT_SYMBOL(mptscsih_IssueTaskMgmt);
 EXPORT_SYMBOL(mptscsih_taskmgmt_complete);
 EXPORT_SYMBOL(mptscsih_scandv_complete);
 EXPORT_SYMBOL(mptscsih_event_process);
 EXPORT_SYMBOL(mptscsih_ioc_reset);
 EXPORT_SYMBOL(mptscsih_change_queue_depth);
 EXPORT_SYMBOL(mptscsih_timer_expired);
-EXPORT_SYMBOL(mptscsih_TMHandler);
 
 /*=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=*/
Index: linux-2.6.26/drivers/message/fusion/mptscsih.h
===================================================================
--- linux-2.6.26.orig/drivers/message/fusion/mptscsih.h
+++ linux-2.6.26/drivers/message/fusion/mptscsih.h
@@ -120,13 +120,13 @@ extern int mptscsih_bus_reset(struct scs
 extern int mptscsih_host_reset(struct scsi_cmnd *SCpnt);
 extern int mptscsih_bios_param(struct scsi_device * sdev, struct block_device *bdev, sector_t capacity, int geom[]);
 extern int mptscsih_io_done(MPT_ADAPTER *ioc, MPT_FRAME_HDR *mf, MPT_FRAME_HDR *r);
+extern int mptscsih_IssueTaskMgmt(MPT_SCSI_HOST *hd, u8 type, u8 channel, u8 id, int lun, int ctx2abort, ulong timeout);
 extern int mptscsih_taskmgmt_complete(MPT_ADAPTER *ioc, MPT_FRAME_HDR *mf, MPT_FRAME_HDR *r);
 extern int mptscsih_scandv_complete(MPT_ADAPTER *ioc, MPT_FRAME_HDR *mf, MPT_FRAME_HDR *r);
 extern int mptscsih_event_process(MPT_ADAPTER *ioc, EventNotificationReply_t *pEvReply);
 extern int mptscsih_ioc_reset(MPT_ADAPTER *ioc, int post_reset);
 extern int mptscsih_change_queue_depth(struct scsi_device *sdev, int qdepth);
 extern void mptscsih_timer_expired(unsigned long data);
-extern int mptscsih_TMHandler(MPT_SCSI_HOST *hd, u8 type, u8 channel, u8 id, int lun, int ctx2abort, ulong timeout);
 extern u8 mptscsih_raid_id_to_num(MPT_ADAPTER *ioc, u8 channel, u8 id);
 extern int mptscsih_is_phys_disk(MPT_ADAPTER *ioc, u8 channel, u8 id);
 extern struct device_attribute *mptscsih_host_attrs[];
Index: linux-2.6.26/drivers/message/fusion/mptspi.c
===================================================================
--- linux-2.6.26.orig/drivers/message/fusion/mptspi.c
+++ linux-2.6.26/drivers/message/fusion/mptspi.c
@@ -1513,7 +1513,7 @@ mptspi_probe(struct pci_dev *pdev, const
 	 * issue internal bus reset
 	 */
 	if (ioc->spi_data.bus_reset)
-		mptscsih_TMHandler(hd,
+		mptscsih_IssueTaskMgmt(hd,
 		    MPI_SCSITASKMGMT_TASKTYPE_RESET_BUS,
 		    0, 0, 0, 0, 5);
 

* [PATCH 4/5] fusion prevent DV deadlock
  2008-09-12 18:57 [PATCH 0/5] mpt fusion error handler patches Bernd Schubert
                   ` (2 preceding siblings ...)
  2008-09-12 19:01 ` [PATCH 3/5] fusion remove the TMHandler Bernd Schubert
@ 2008-09-12 19:02 ` Bernd Schubert
  2008-09-13 12:24   ` Bernd Schubert
  2008-09-12 19:03 ` [PATCH 5/5] fusion disable scsi hard resets Bernd Schubert
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 9+ messages in thread
From: Bernd Schubert @ 2008-09-12 19:02 UTC (permalink / raw)
  To: Linux SCSI Mailing List; +Cc: Eric Moore, Sathya Prakash

The mpt fusion driver does a domain revalidation on an ioc reset, and this
DV might cause a live deadlock. The problem has been analyzed in full in
this thread, but so far no real solution has been implemented.
This patch simply skips the domain revalidation if it would run into
the deadlock.
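
To illustrate the condition this patch checks, here is a minimal userspace
sketch. It is only a model, not driver code: request_list_model and
dv_would_deadlock() are hypothetical names, and the struct carries nothing
but the two per-direction request counters that the patch reads.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the block layer's request list; only the
 * two per-direction counters read by the patch are modelled here. */
struct request_list_model {
	int count[2];	/* requests currently allocated, one per direction */
};

/* Same comparison as the one added to mptspi_dv_device(): report a
 * likely deadlock when either direction has at most one free slot. */
static bool dv_would_deadlock(const struct request_list_model *rl,
			      unsigned long nr_requests)
{
	assert(rl != NULL);
	return (unsigned long)rl->count[0] + 1 >= nr_requests ||
	       (unsigned long)rl->count[1] + 1 >= nr_requests;
}
```

With nr_requests = 128 (an example value only), counters of {3, 5} leave
plenty of room, so DV would run. Counters of {10, 127} leave only one free
slot in the second direction, so DV would be skipped.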

Signed-off-by: Bernd Schubert <bs@q-leap.de>

 drivers/message/fusion/mptspi.c |   12 ++++++++++++
 1 file changed, 12 insertions(+)


Index: linux-2.6.26/drivers/message/fusion/mptspi.c
===================================================================
--- linux-2.6.26.orig/drivers/message/fusion/mptspi.c
+++ linux-2.6.26/drivers/message/fusion/mptspi.c
@@ -672,12 +672,24 @@ static void mptspi_dv_device(struct _MPT
 {
 	VirtTarget *vtarget = scsi_target(sdev)->hostdata;
 	MPT_ADAPTER *ioc = hd->ioc;
+	unsigned long nr_requests = sdev->request_queue->nr_requests;
+	struct request_list *rl = &sdev->request_queue->rq;
 
 	/* no DV on RAID devices */
 	if (sdev->channel == 0 &&
 	    mptspi_is_raid(hd, sdev->id))
 		return;
 
+	if (rl->count[0] + 1 >= nr_requests
+	||  rl->count[1] + 1 >= nr_requests) {
+		/* We must NOT do a DV after an error recovery when
+		 * there is no space left in the request list, since
+		 * this will cause a live deadlock. */
+		starget_printk(KERN_INFO, scsi_target(sdev), MYIOC_s_FMT
+		    "Skipping DV, to prevent dead lock!\n", ioc->name);
+		return;
+	}
+
 	/* If this is a piece of a RAID, then quiesce first */
 	if (sdev->channel == 1 &&
 	    mptscsih_quiesce_raid(hd, 1, vtarget->channel, vtarget->id) < 0) {

* Re: [PATCH 4/5] fusion prevent DV deadlock
  2008-09-12 19:02 ` [PATCH 4/5] fusion prevent DV deadlock Bernd Schubert
@ 2008-09-13 12:24   ` Bernd Schubert
  0 siblings, 0 replies; 9+ messages in thread
From: Bernd Schubert @ 2008-09-13 12:24 UTC (permalink / raw)
  To: Linux SCSI Mailing List; +Cc: Eric Moore, Sathya Prakash, Andrew Patterson

On Fri, Sep 12, 2008 at 09:02:16PM +0200, Bernd Schubert wrote:
> The mpt fusion driver will do a domain revalidation on an ion reset, this

s/ion/ioc/

> DV might cause a live deadlock. The problem has been entirely analyzed in
> this thread, but so far no real solution has been implemented.
> This patch simply disables the domain revalidation, if it will run into
> the deadlock.

Forgot to paste the thread: http://marc.info/?t=118039577800004

> 
> Signed-off-by: Bernd Schubert <bs@q-leap.de>
> 
>  drivers/message/fusion/mptspi.c |   12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> 
> Index: linux-2.6.26/drivers/message/fusion/mptspi.c
> ===================================================================
> --- linux-2.6.26.orig/drivers/message/fusion/mptspi.c
> +++ linux-2.6.26/drivers/message/fusion/mptspi.c
> @@ -672,12 +672,24 @@ static void mptspi_dv_device(struct _MPT
>  {
>  	VirtTarget *vtarget = scsi_target(sdev)->hostdata;
>  	MPT_ADAPTER *ioc = hd->ioc;
> +	unsigned long nr_requests = sdev->request_queue->nr_requests;
> +	struct request_list *rl = &sdev->request_queue->rq;
>  
>  	/* no DV on RAID devices */
>  	if (sdev->channel == 0 &&
>  	    mptspi_is_raid(hd, sdev->id))
>  		return;
>  
> +	if (rl->count[0] + 1 >= nr_requests
> +	||  rl->count[1] + 1 >= nr_requests) {
> +		/* we must NOT do a DV after an error recovery, when we
> +		 * don't have left a space in the request list, since
> +		 * this will cause a live dead lock */
> +		starget_printk(KERN_INFO, scsi_target(sdev), MYIOC_s_FMT
> +		    "Skipping DV, to prevent dead lock!\n", ioc->name);
> +		return;
> +	}
> +
>  	/* If this is a piece of a RAID, then quiesce first */
>  	if (sdev->channel == 1 &&
>  	    mptscsih_quiesce_raid(hd, 1, vtarget->channel, vtarget->id) < 0) {

* [PATCH 5/5] fusion disable scsi hard resets
  2008-09-12 18:57 [PATCH 0/5] mpt fusion error handler patches Bernd Schubert
                   ` (3 preceding siblings ...)
  2008-09-12 19:02 ` [PATCH 4/5] fusion prevent DV deadlock Bernd Schubert
@ 2008-09-12 19:03 ` Bernd Schubert
  2008-09-13  4:32 ` [PATCH 0/5] mpt fusion error handler patches Mr. James W. Laferriere
  2008-09-13 12:25 ` Bernd Schubert
  6 siblings, 0 replies; 9+ messages in thread
From: Bernd Schubert @ 2008-09-12 19:03 UTC (permalink / raw)
  To: Linux SCSI Mailing List; +Cc: Eric Moore, Sathya Prakash

For 53C1030 based dual port HBAs, the hard reset handler will still cause
trouble for innocent devices on the second channel. It is then better
to fail the device which activated the error handler than to fail
entirely innocent devices. The real solution would of course be to figure out
why the hard reset handler causes trouble on the second channel. Probably
only LSI can do that, though. Is it o.k. to do this for all mpt fusion based
HBAs, i.e. are all of these 53C1030 based?
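
The escalation decision described above can be sketched as a small userspace
model. This is an illustration only: ioc_model, reset_action and
host_reset_action() are hypothetical names, and the attached_devices counter
stands in for walking the device list of the alternate host.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, stripped-down stand-in for MPT_ADAPTER. */
struct ioc_model {
	const char *name;
	struct ioc_model *alt_ioc;	/* other port of a dual port HBA, or NULL */
	int attached_devices;		/* stand-in for the host's device list */
};

enum reset_action { RESET_DONE, RESET_HARD, RESET_SKIP_HARD };

/* Returns 1 if the alternate ioc exists and has at least one device. */
static int alt_ioc_with_dev(const struct ioc_model *ioc)
{
	if (!ioc->alt_ioc)
		return 0;
	return ioc->alt_ioc->attached_devices > 0;
}

/* soft_reset_rc models the soft reset return value; 0 means success. */
static enum reset_action host_reset_action(const struct ioc_model *ioc,
					   int soft_reset_rc)
{
	assert(ioc != NULL);
	if (soft_reset_rc == 0)
		return RESET_DONE;	/* soft reset worked, nothing more to do */
	if (alt_ioc_with_dev(ioc))
		return RESET_SKIP_HARD;	/* protect devices on the other port */
	return RESET_HARD;		/* no bystanders, escalate as before */
}
```

In this model a single port HBA, or a dual port HBA whose other port has no
devices, still escalates to the hard reset. Only a failed soft reset on a
port whose sibling has devices attached ends without a hard reset.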

Signed-off-by: Bernd Schubert <bs@q-leap.de>

 drivers/message/fusion/mptscsih.c |   45 ++++++++++++++++++++++++++--
 1 file changed, 43 insertions(+), 2 deletions(-)


Index: linux-2.6.26/drivers/message/fusion/mptscsih.c
===================================================================
--- linux-2.6.26.orig/drivers/message/fusion/mptscsih.c
+++ linux-2.6.26/drivers/message/fusion/mptscsih.c
@@ -1890,6 +1890,33 @@ mptscsih_bus_reset(struct scsi_cmnd * SC
 		return FAILED;
 }
 
+/**
+  * Check if there are devices connected to the second (alt) ioc.
+  * Return 1 if there is at least one device and 0 if there are
+  * none or no alt_ioc.
+  */
+static int
+alt_ioc_with_dev(MPT_ADAPTER *ioc)
+{
+	struct Scsi_Host	*shost;
+	struct scsi_device	*sdev;
+	int 			have_devices = 0;
+
+	if (!ioc->alt_ioc)
+		return 0;
+
+	shost = ioc->alt_ioc->sh;
+
+	shost_for_each_device(sdev, shost) {
+		/* when we are here, we know there is a device
+		 * attached to this host, which is all we need to know */
+		have_devices = 1;
+		break;
+	}
+
+	return have_devices ? 1 : 0;
+}
+
 /*=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=*/
 /**
  *	mptscsih_host_reset - Perform a SCSI host adapter RESET (new_eh variant)
@@ -1922,8 +1949,22 @@ mptscsih_host_reset(struct scsi_cmnd *SC
 	 *  status.  The host will be taken off line by the SCSI mid-layer.
 	 */
 	retval = mpt_SoftResetHandler(ioc, CAN_SLEEP);
-	if (retval != 0)
-		retval = mpt_HardResetHandler(ioc, CAN_SLEEP);
+	if (retval != 0) {
+		if (alt_ioc_with_dev(ioc) == 0) {
+			/* On dual port HBAs based on the 53C1030 chip the
+			* hard reset handler will cause DID_SOFT_ERROR on
+			* the second (in principle independent) port.
+			* Almost always this error cannot be recovered
+			* causing entire device failures. So it is better not
+			* to call the hard reset handler at all in order to
+			* prevent failures of independent devices */
+			retval = mpt_HardResetHandler(ioc, CAN_SLEEP);
+		} else {
+			printk(MYIOC_s_INFO_FMT "Skipping hard reset in "
+				"order to prevent failures on %s.\n",
+				ioc->name, ioc->alt_ioc->name);
+		}
+	}
 
 	if (retval < 0)
 		status = FAILED;

* Re: [PATCH 0/5] mpt fusion error handler patches
  2008-09-12 18:57 [PATCH 0/5] mpt fusion error handler patches Bernd Schubert
                   ` (4 preceding siblings ...)
  2008-09-12 19:03 ` [PATCH 5/5] fusion disable scsi hard resets Bernd Schubert
@ 2008-09-13  4:32 ` Mr. James W. Laferriere
  2008-09-13 12:25 ` Bernd Schubert
  6 siblings, 0 replies; 9+ messages in thread
From: Mr. James W. Laferriere @ 2008-09-13  4:32 UTC (permalink / raw)
  To: Bernd Schubert; +Cc: Linux SCSI Mailing List, Eric Moore, Sathya Prakash

 	Hello Bernd & All ,

On Fri, 12 Sep 2008, Bernd Schubert wrote:
> Hello,
> I'm going to submit several error handler patches for the MPT fusion
> driver. The purpose of these patches is mainly to fix errors happening
> on the second port of dual port 53C1030 based HBAs.
> As I complained some time ago on this list, a device failure on one of the
> ports of LSI22320R HBAs, will also cause device failures of innocent devices
> on the other port of this HBA. In order to debug this Eric Moore sent me a
> fusion-tip version of this driver, which we have been using ever since. However,
> this version has issues with SAS HBAs and probably also won't work for recent kernel
> versions. So I spent quite some amount of time to figure out why fusion-tip
> version (4.x) of the driver doesn't have the issue.
>
> Below I will provide the some examples of these issues. Errors on one of the attached
> scsi devices have been simulated using lsiutil by doing target of one of the attached
> devices on one of the port (5 0 4 0).
>
> Unpatched 2.6.26 + a few scsi diagnostics and error handler patches:
>
> [  224.819697] sd 5:0:4:0: last recovery: 4294911483, now: 4294948403
> [  224.826142] sd 5:0:4:0: [sdc] Result: hostbyte=DID_OK driverbyte=DRIVER_OK,SUGGEST_OK
> [  224.831676] sd 5:0:4:0: [sdc] CDB: Write(16): 8a 00 00 00 00 01 0c 27 2e 98 00 00 04 00 00 00
> [  224.842803] sd 5:0:4:0: Activating scsi error recovery (1)
> [  224.857824] sd 5:0:4:0: trying to abort command
> [  224.865697] mptscsih: ioc1: attempting task abort! (sc=ffff8100f8f10000)
> [  224.870572] sd 5:0:4:0: [sdc] CDB: Write(16): 8a 00 00 00 00 01 0c 27 2e 98 00 00 04 00 00 00
> [  227.047968] mptbase: ioc1: Initiating recovery
> [  229.481849] sd 5:0:4:0: mptscsih: ioc1: completing cmds: fw_channel 0, fw_id 4, sc=ffff8100f8fbb180, mf = ffff8100
> [...]
> [  364.322013] mptscsih: ioc1: bus reset: SUCCESS (sc=ffff8100f8f11b80)
> [  371.924342] sd 4:0:2:0: scmd retry 6/6
> [  371.928364] sd 4:0:2:0: last recovery: 0, now: 4294985148
> [  371.932924] sd 4:0:2:0: [sda] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK,SUGGEST_OK
> [  371.932924] sd 4:0:2:0: [sda] CDB: Write(16): 8a 00 00 00 00 01 31 8b 4a 4e 00 00 00 39 00 00
> [  371.932924] sd 4:0:2:0: Activating scsi error recovery (1)
> [  371.960382] sd 4:0:2:0: Sending BDR 0xffff81007eaf2538
> [  371.984936] sd 4:0:2:0: trying device reset
> [  371.989426] mptscsih: ioc0: attempting target reset! (sc=ffff81007eb7c780)
>
> As you can see, suddenly also target 4 0 2 0 fails, which is ioc0. In the end:
>
> [  398.596119] sd 4:0:2:0: [sda] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK,SUGGEST_OK
> [  398.605291] end_request: I/O error, dev sda, sector 5126179406
> [  398.612360] end_request: I/O error, dev sda, sector 5126179406
> [  398.617818]  target4:0:2: Beginning Domain Validation
>
> So the innocent device sda (which is really another device) failed.
>
> Now the same with patches applied, but with the soft reset-handler deactivated:
>
> [  912.861708] sd 5:0:4:0: last recovery: 4295082734, now: 4295120387
> [  912.868130] sd 5:0:4:0: [sdc] Result: hostbyte=DID_OK driverbyte=DRIVER_OK,SUGGEST_
>
> [  912.873757] sd 5:0:4:0: [sdc] CDB: Write(10): 2a 00 73 11 33 08 00 04 00 00
> [  912.873757] sd 5:0:4:0: Activating scsi error recovery (2)
> [  912.889492] sd 5:0:4:0: trying to abort command
> [  912.894118] mptscsih: ioc1: attempting task abort! (sc=ffff8100e361d180)
> [  912.900951] sd 5:0:4:0: [sdc] CDB: Write(10): 2a 00 73 11 33 08 00 04 00 00
> [  913.025771] mptscsih: ioc1: task abort: FAILED (sc=ffff8100e361d180)
> [  913.032269] sd 5:0:4:0: Sending BDR 0xffff8100f99e1428
> [  913.040264] sd 5:0:4:0: trying device reset
> [  913.044597] mptscsih: ioc1: attempting target reset! (sc=ffff8100e361d180)
> [  913.049955] sd 5:0:4:0: [sdc] CDB: Write(10): 2a 00 73 11 33 08 00 04 00 00
> [  913.177284] mptscsih: ioc1: target reset: FAILED (sc=ffff8100e361d180)
> [  913.181946] Sending BRST chan: 0
> [  913.185945] sd 5:0:4:0: trying bus reset
> [  913.189974] mptscsih: ioc1: attempting bus reset! (sc=ffff8100e361d180)
> [  913.197310] sd 5:0:4:0: [sdc] CDB: Write(10): 2a 00 73 11 33 08 00 04 00 00
> [  913.325079] mptscsih: ioc1: bus reset: FAILED (sc=ffff8100e361d180)
> [  913.329668] sd 5:0:4:0: trying host reset
> [  913.333864] mptscsih: ioc1: attempting host reset! (sc=ffff8100e361d180)
> [  913.341832] mptscsih: ioc1: Skipping hard reset in order to prevent failures on ioc
>
> [  913.349821] mptscsih: ioc1: host reset: FAILED (sc=ffff8100e361d180)
> [  913.356704] sd 5:0:4:0: Device offlined - not ready after error recovery
> [  913.363199] sd 5:0:4:0: [sdc] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT,SUGGEST_OK
>
> => The device was not recovered, but at least 4 0 2 0 didn't fail :)
>
> Now with all patches applied:
>
> [  214.903699] sd 5:0:4:0: last recovery: 0, now: 4294945953
> [  214.910652] sd 5:0:4:0: [sdc] Result: hostbyte=DID_OK driverbyte=DRIVER_OK,SUGGEST_OK
> [  214.918652] sd 5:0:4:0: [sdc] CDB: Write(16): 8a 00 00 00 00 01 31 8b 9c e7 00 00 00 39 00 00
> [  214.918652] sd 5:0:4:0: Activating scsi error recovery (1)
> [  214.934655] sd 5:0:4:0: trying to abort command
> [  214.939581] mptscsih: ioc1: attempting task abort! (sc=ffff8100f9be0c80)
> [  214.947581] sd 5:0:4:0: [sdc] CDB: Write(16): 8a 00 00 00 00 01 31 8b 9c e7 00 00 00 39 00 00
> [  215.077430] mptscsih: ioc1: task abort: FAILED (sc=ffff8100f9be0c80)
> [  215.083645] sd 5:0:4:0: Sending BDR 0xffff81007eb51428
> [  215.090298] sd 5:0:4:0: trying device reset
> [  215.094810] mptscsih: ioc1: attempting target reset! (sc=ffff8100f9be0c80)
> [  215.101917] sd 5:0:4:0: [sdc] CDB: Write(16): 8a 00 00 00 00 01 31 8b 9c e7 00 00 00 39 00 00
> [  215.229659] mptscsih: ioc1: target reset: FAILED (sc=ffff8100f9be0c80)
> [  215.236367] Sending BRST chan: 0
> [  215.240173] sd 5:0:4:0: trying bus reset
> [  215.244313] mptscsih: ioc1: attempting bus reset! (sc=ffff8100f9be0c80)
> [  215.251731] sd 5:0:4:0: [sdc] CDB: Write(16): 8a 00 00 00 00 01 31 8b 9c e7 00 00 00 39 00 00
> [  215.382449] mptscsih: ioc1: bus reset: FAILED (sc=ffff8100f9be0c80)
> [  215.388946] sd 5:0:4:0: trying host reset
> [  215.393162] mptscsih: ioc1: attempting host reset! (sc=ffff8100f9be0c80)
> [  215.400489] sd 5:0:4:0: mptscsih: ioc1: completing cmds: fw_channel 0, fw_id 4, sc=ffff8100f9be0c80, mf = ffff8105
> [  217.317914] mptbase: ioc1: SoftResetHandler: completed (1 seconds): SUCCESS
> [  217.324924] mptscsih: ioc1: host reset: SUCCESS (sc=ffff8100f9be0c80)
> [  227.546452]  target5:0:4: Beginning Domain Validation
> [  227.578775]  target5:0:4: Ending Domain Validation
> [  227.584099]  target5:0:4: FAST-160 WIDE SCSI 320.0 MB/s DT IU QAS PCOMP (6.25 ns, offset 127)
> [  227.596959]  target5:0:5: Beginning Domain Validation
> [  227.651196]  target5:0:5: Ending Domain Validation
> [  227.656977]  target5:0:5: FAST-160 WIDE SCSI 320.0 MB/s DT IU QAS PCOMP (6.25 ns, offset 127)

 	Thank you Bernd for tracking this down .  I have run into this very 
issue every time one of my drives starts going bad on me & I reboot to see if 
the errors are actually just a bad firmware that requires a power reset to clear 
up .  But when it is a truly failing drive (or I am testing a new type of 
cabling which shows to be inferior) then these rolling resets of the controller 
-> channel -> device itself ,  caused me no end of impatient waiting for them to 
end ,  So far they do eventually end .  In some earlier driver versions they did 
NOT timeout ,  This usually told me that something was amiss & I had to just hit 
and miss change things out trying to find the actual culprit ,  There always was 
a culprit in the chain someplace .

 	I would also like to Thank "The LSI team" for creating this in kernel (& 
module) driver for their line of fusion cards (& fixing atto's as well) ,  as 
well as for maintaining it & putting up with my pissing and moaning about this 
and some other issues that had cropped up .

 	Again Thank you all ,  JimL
-- 
+------------------------------------------------------------------+
| James   W.   Laferriere | System    Techniques | Give me VMS     |
| Network&System Engineer | 2133    McCullam Ave |  Give me Linux  |
| babydr@baby-dragons.com | Fairbanks, AK. 99701 |   only  on  AXP |
+------------------------------------------------------------------+

* Re: [PATCH 0/5] mpt fusion error handler patches
  2008-09-12 18:57 [PATCH 0/5] mpt fusion error handler patches Bernd Schubert
                   ` (5 preceding siblings ...)
  2008-09-13  4:32 ` [PATCH 0/5] mpt fusion error handler patches Mr. James W. Laferriere
@ 2008-09-13 12:25 ` Bernd Schubert
  6 siblings, 0 replies; 9+ messages in thread
From: Bernd Schubert @ 2008-09-13 12:25 UTC (permalink / raw)
  To: Linux SCSI Mailing List; +Cc: Eric Moore, Sathya Prakash

On Fri, Sep 12, 2008 at 08:57:40PM +0200, Bernd Schubert wrote:
> Hello,
> 
> I'm going to submit several error handler patches for the MPT fusion 
> driver. The purpose of these patches is mainly to fix errors happening 
> on the second port of dual port 53C1030 based HBAs.
> As I complained some time ago on this list, a device failure on one of the 
> ports of LSI22320R HBAs, will also cause device failures of innocent devices 
> on the other port of this HBA. In order to debug this Eric Moore sent me a 
> fusion-tip version of this driver, which we have been using ever since. However, 
> this version has issues with SAS HBAs and probably also won't work for recent kernel 
> versions. So I spent quite some amount of time to figure out why fusion-tip 
> version (4.x) of the driver doesn't have the issue.
> 
> Below I will provide the some examples of these issues. Errors on one of the attached 
> scsi devices have been simulated using lsiutil by doing target of one of the attached 

This was supposed to be "... by doing target resets of one ..."

> devices on one of the port (5 0 4 0).
> 
> Unpatched 2.6.26 + a few scsi diagnostics and error handler patches:
> 
> [  224.819697] sd 5:0:4:0: last recovery: 4294911483, now: 4294948403
> [  224.826142] sd 5:0:4:0: [sdc] Result: hostbyte=DID_OK driverbyte=DRIVER_OK,SUGGEST_OK
> [  224.831676] sd 5:0:4:0: [sdc] CDB: Write(16): 8a 00 00 00 00 01 0c 27 2e 98 00 00 04 00 00 00
> [  224.842803] sd 5:0:4:0: Activating scsi error recovery (1)
> [  224.857824] sd 5:0:4:0: trying to abort command
> [  224.865697] mptscsih: ioc1: attempting task abort! (sc=ffff8100f8f10000)
> [  224.870572] sd 5:0:4:0: [sdc] CDB: Write(16): 8a 00 00 00 00 01 0c 27 2e 98 00 00 04 00 00 00
> [  227.047968] mptbase: ioc1: Initiating recovery
> [  229.481849] sd 5:0:4:0: mptscsih: ioc1: completing cmds: fw_channel 0, fw_id 4, sc=ffff8100f8fbb180, mf = ffff8100
> [...]
> [  364.322013] mptscsih: ioc1: bus reset: SUCCESS (sc=ffff8100f8f11b80)
> [  371.924342] sd 4:0:2:0: scmd retry 6/6
> [  371.928364] sd 4:0:2:0: last recovery: 0, now: 4294985148
> [  371.932924] sd 4:0:2:0: [sda] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK,SUGGEST_OK
> [  371.932924] sd 4:0:2:0: [sda] CDB: Write(16): 8a 00 00 00 00 01 31 8b 4a 4e 00 00 00 39 00 00
> [  371.932924] sd 4:0:2:0: Activating scsi error recovery (1)
> [  371.960382] sd 4:0:2:0: Sending BDR 0xffff81007eaf2538
> [  371.984936] sd 4:0:2:0: trying device reset
> [  371.989426] mptscsih: ioc0: attempting target reset! (sc=ffff81007eb7c780)
> 
> As you can see, suddenly also target 4 0 2 0 fails, which is ioc0. In the end:
> 
> [  398.596119] sd 4:0:2:0: [sda] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK,SUGGEST_OK
> [  398.605291] end_request: I/O error, dev sda, sector 5126179406
> [  398.612360] end_request: I/O error, dev sda, sector 5126179406
> [  398.617818]  target4:0:2: Beginning Domain Validation
> 
> So the innocent device sda (which is really another device) failed.
> 
> Now the same with patches applied, but with the soft reset-handler deactivated:
> 
> [  912.861708] sd 5:0:4:0: last recovery: 4295082734, now: 4295120387
> [  912.868130] sd 5:0:4:0: [sdc] Result: hostbyte=DID_OK driverbyte=DRIVER_OK,SUGGEST_
> 
> [  912.873757] sd 5:0:4:0: [sdc] CDB: Write(10): 2a 00 73 11 33 08 00 04 00 00
> [  912.873757] sd 5:0:4:0: Activating scsi error recovery (2)
> [  912.889492] sd 5:0:4:0: trying to abort command
> [  912.894118] mptscsih: ioc1: attempting task abort! (sc=ffff8100e361d180)
> [  912.900951] sd 5:0:4:0: [sdc] CDB: Write(10): 2a 00 73 11 33 08 00 04 00 00
> [  913.025771] mptscsih: ioc1: task abort: FAILED (sc=ffff8100e361d180)
> [  913.032269] sd 5:0:4:0: Sending BDR 0xffff8100f99e1428
> [  913.040264] sd 5:0:4:0: trying device reset
> [  913.044597] mptscsih: ioc1: attempting target reset! (sc=ffff8100e361d180)
> [  913.049955] sd 5:0:4:0: [sdc] CDB: Write(10): 2a 00 73 11 33 08 00 04 00 00
> [  913.177284] mptscsih: ioc1: target reset: FAILED (sc=ffff8100e361d180)
> [  913.181946] Sending BRST chan: 0
> [  913.185945] sd 5:0:4:0: trying bus reset
> [  913.189974] mptscsih: ioc1: attempting bus reset! (sc=ffff8100e361d180)
> [  913.197310] sd 5:0:4:0: [sdc] CDB: Write(10): 2a 00 73 11 33 08 00 04 00 00
> [  913.325079] mptscsih: ioc1: bus reset: FAILED (sc=ffff8100e361d180)
> [  913.329668] sd 5:0:4:0: trying host reset
> [  913.333864] mptscsih: ioc1: attempting host reset! (sc=ffff8100e361d180)
> [  913.341832] mptscsih: ioc1: Skipping hard reset in order to prevent failures on ioc
> 
> [  913.349821] mptscsih: ioc1: host reset: FAILED (sc=ffff8100e361d180)
> [  913.356704] sd 5:0:4:0: Device offlined - not ready after error recovery
> [  913.363199] sd 5:0:4:0: [sdc] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT,SUGGEST_OK
> 
> => The device was not recovered, but at least 4 0 2 0 didn't fail :)
> 
> Now with all patches applied:
> 
> [  214.903699] sd 5:0:4:0: last recovery: 0, now: 4294945953
> [  214.910652] sd 5:0:4:0: [sdc] Result: hostbyte=DID_OK driverbyte=DRIVER_OK,SUGGEST_OK
> [  214.918652] sd 5:0:4:0: [sdc] CDB: Write(16): 8a 00 00 00 00 01 31 8b 9c e7 00 00 00 39 00 00
> [  214.918652] sd 5:0:4:0: Activating scsi error recovery (1)
> [  214.934655] sd 5:0:4:0: trying to abort command
> [  214.939581] mptscsih: ioc1: attempting task abort! (sc=ffff8100f9be0c80)
> [  214.947581] sd 5:0:4:0: [sdc] CDB: Write(16): 8a 00 00 00 00 01 31 8b 9c e7 00 00 00 39 00 00
> [  215.077430] mptscsih: ioc1: task abort: FAILED (sc=ffff8100f9be0c80)
> [  215.083645] sd 5:0:4:0: Sending BDR 0xffff81007eb51428
> [  215.090298] sd 5:0:4:0: trying device reset
> [  215.094810] mptscsih: ioc1: attempting target reset! (sc=ffff8100f9be0c80)
> [  215.101917] sd 5:0:4:0: [sdc] CDB: Write(16): 8a 00 00 00 00 01 31 8b 9c e7 00 00 00 39 00 00
> [  215.229659] mptscsih: ioc1: target reset: FAILED (sc=ffff8100f9be0c80)
> [  215.236367] Sending BRST chan: 0
> [  215.240173] sd 5:0:4:0: trying bus reset
> [  215.244313] mptscsih: ioc1: attempting bus reset! (sc=ffff8100f9be0c80)
> [  215.251731] sd 5:0:4:0: [sdc] CDB: Write(16): 8a 00 00 00 00 01 31 8b 9c e7 00 00 00 39 00 00
> [  215.382449] mptscsih: ioc1: bus reset: FAILED (sc=ffff8100f9be0c80)
> [  215.388946] sd 5:0:4:0: trying host reset
> [  215.393162] mptscsih: ioc1: attempting host reset! (sc=ffff8100f9be0c80)
> [  215.400489] sd 5:0:4:0: mptscsih: ioc1: completing cmds: fw_channel 0, fw_id 4, sc=ffff8100f9be0c80, mf = ffff8105
> [  217.317914] mptbase: ioc1: SoftResetHandler: completed (1 seconds): SUCCESS
> [  217.324924] mptscsih: ioc1: host reset: SUCCESS (sc=ffff8100f9be0c80)
> [  227.546452]  target5:0:4: Beginning Domain Validation
> [  227.578775]  target5:0:4: Ending Domain Validation
> [  227.584099]  target5:0:4: FAST-160 WIDE SCSI 320.0 MB/s DT IU QAS PCOMP (6.25 ns, offset 127)
> [  227.596959]  target5:0:5: Beginning Domain Validation
> [  227.651196]  target5:0:5: Ending Domain Validation
> [  227.656977]  target5:0:5: FAST-160 WIDE SCSI 320.0 MB/s DT IU QAS PCOMP (6.25 ns, offset 127)
> 
> 
> -- 
> Bernd Schubert
> Q-Leap Networks GmbH

Thread overview: 9+ messages
2008-09-12 18:57 [PATCH 0/5] mpt fusion error handler patches Bernd Schubert
2008-09-12 18:59 ` [PATCH 1/5] scsi abort Bernd Schubert
2008-09-12 19:00 ` [PATCH 2/5] fusion reset handler Bernd Schubert
2008-09-12 19:01 ` [PATCH 3/5] fusion remove the TMHandler Bernd Schubert
2008-09-12 19:02 ` [PATCH 4/5] fusion prevent DV deadlock Bernd Schubert
2008-09-13 12:24   ` Bernd Schubert
2008-09-12 19:03 ` [PATCH 5/5] fusion disable scsi hard resets Bernd Schubert
2008-09-13  4:32 ` [PATCH 0/5] mpt fusion error handler patches Mr. James W. Laferriere
2008-09-13 12:25 ` Bernd Schubert
