[PATCH 0/7] scsi error handler improvements

linux-scsi.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/7] scsi error handler improvements
@ 2008-11-26 17:40 Bernd Schubert
  2008-11-26 17:44 ` [PATCH 1/7] print eh activation Bernd Schubert
                   ` (6 more replies)
  0 siblings, 7 replies; 19+ messages in thread
From: Bernd Schubert @ 2008-11-26 17:40 UTC (permalink / raw)
  To: linux-scsi; +Cc: James Bottomley

Hello,

over the last year we had an endless number of issues with hard raid units 
from various vendors. Due this we also have been running into problems with 
the scsi error handler. 


-- 
Bernd Schubert
Q-Leap Networks GmbH

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 1/7] print eh activation
  2008-11-26 17:40 [PATCH 0/7] scsi error handler improvements Bernd Schubert
@ 2008-11-26 17:44 ` Bernd Schubert
  2008-11-26 18:47   ` James Bottomley
  2008-11-26 17:46 ` [PATCH 2/7] Allow requeuement on DID_SOFT_ERROR Bernd Schubert
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 19+ messages in thread
From: Bernd Schubert @ 2008-11-26 17:44 UTC (permalink / raw)
  To: linux-scsi; +Cc: James Bottomley

Print activation of the scsi error handler to let the user know what was 
the the error handler was activated. These information are essential to
diagnose hardware issues.

Signed-off-by: Bernd Schubert <bs@q-leap.de>

---
 drivers/scsi/scsi_error.c |   17 ++++++++++++-----
 drivers/scsi/scsi_lib.c   |    2 +-
 drivers/scsi/scsi_priv.h  |    2 +-
 3 files changed, 14 insertions(+), 7 deletions(-)

Index: linux-2.6/drivers/scsi/scsi_error.c
===================================================================
--- linux-2.6.orig/drivers/scsi/scsi_error.c
+++ linux-2.6/drivers/scsi/scsi_error.c
@@ -48,12 +48,18 @@
 #define HOST_RESET_SETTLE_TIME  (10)
 
 /* called with shost->host_lock held */
-void scsi_eh_wakeup(struct Scsi_Host *shost)
+void scsi_eh_wakeup(struct Scsi_Host *shost, struct scsi_cmnd *scmd)
 {
 	if (shost->host_busy == shost->host_failed) {
-		wake_up_process(shost->ehandler);
-		SCSI_LOG_ERROR_RECOVERY(5,
+		if (scmd) {
+			scsi_print_result(scmd);
+			scsi_print_command(scmd);
+			sdev_printk(KERN_WARNING, scmd->device,
+				    "Activating error handler\n");
+		} else
+			SCSI_LOG_ERROR_RECOVERY(5,
 				printk("Waking error handler thread\n"));
+		wake_up_process(shost->ehandler);
 	}
 }
 
@@ -72,7 +78,7 @@ void scsi_schedule_eh(struct Scsi_Host *
 	if (scsi_host_set_state(shost, SHOST_RECOVERY) == 0 ||
 	    scsi_host_set_state(shost, SHOST_CANCEL_RECOVERY) == 0) {
 		shost->host_eh_scheduled++;
-		scsi_eh_wakeup(shost);
+		scsi_eh_wakeup(shost, NULL);
 	}
 
 	spin_unlock_irqrestore(shost->host_lock, flags);
@@ -105,7 +111,8 @@ int scsi_eh_scmd_add(struct scsi_cmnd *s
 	scmd->eh_eflags |= eh_flag;
 	list_add_tail(&scmd->eh_entry, &shost->eh_cmd_q);
 	shost->host_failed++;
-	scsi_eh_wakeup(shost);
+
+	scsi_eh_wakeup(shost, scmd);
  out_unlock:
 	spin_unlock_irqrestore(shost->host_lock, flags);
 	return ret;
Index: linux-2.6/drivers/scsi/scsi_priv.h
===================================================================
--- linux-2.6.orig/drivers/scsi/scsi_priv.h
+++ linux-2.6/drivers/scsi/scsi_priv.h
@@ -52,7 +52,7 @@ extern void scsi_exit_devinfo(void);
 extern enum blk_eh_timer_return scsi_times_out(struct request *req);
 extern int scsi_error_handler(void *host);
 extern int scsi_decide_disposition(struct scsi_cmnd *cmd);
-extern void scsi_eh_wakeup(struct Scsi_Host *shost);
+extern void scsi_eh_wakeup(struct Scsi_Host *shost, struct scsi_cmnd *scmd);
 extern int scsi_eh_scmd_add(struct scsi_cmnd *, int);
 void scsi_eh_ready_devs(struct Scsi_Host *shost,
 			struct list_head *work_q,
Index: linux-2.6/drivers/scsi/scsi_lib.c
===================================================================
--- linux-2.6.orig/drivers/scsi/scsi_lib.c
+++ linux-2.6/drivers/scsi/scsi_lib.c
@@ -476,7 +476,7 @@ void scsi_device_unbusy(struct scsi_devi
 	starget->target_busy--;
 	if (unlikely(scsi_host_in_recovery(shost) &&
 		     (shost->host_failed || shost->host_eh_scheduled)))
-		scsi_eh_wakeup(shost);
+		scsi_eh_wakeup(shost, NULL);
 	spin_unlock(shost->host_lock);
 	spin_lock(sdev->request_queue->queue_lock);
 	sdev->device_busy--;

-- 
Bernd Schubert
Q-Leap Networks GmbH

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 2/7] Allow requeuement on DID_SOFT_ERROR
  2008-11-26 17:40 [PATCH 0/7] scsi error handler improvements Bernd Schubert
  2008-11-26 17:44 ` [PATCH 1/7] print eh activation Bernd Schubert
@ 2008-11-26 17:46 ` Bernd Schubert
  2008-11-26 18:47   ` James Bottomley
  2008-11-26 18:25 ` [PATCH 03/07] Don't online offlined devices in scsi_target_quiesce() Bernd Schubert
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 19+ messages in thread
From: Bernd Schubert @ 2008-11-26 17:46 UTC (permalink / raw)
  To: linux-scsi; +Cc: James Bottomley

Activate the error handler if DID_SOFT_ERROR failed to often, but only 
for commands which have a scmd->allowed > 1. 
Also make a function out of a goto-block.

Signed-off-by: Bernd Schubert <bs@q-leap.de>

---
 drivers/scsi/scsi_error.c |   65 +++++++++++++++++++++++++++++++---------------
 1 file changed, 45 insertions(+), 20 deletions(-)

Index: linux-2.6/drivers/scsi/scsi_error.c
===================================================================
--- linux-2.6.orig/drivers/scsi/scsi_error.c
+++ linux-2.6/drivers/scsi/scsi_error.c
@@ -1271,6 +1271,47 @@ int scsi_noretry_cmd(struct scsi_cmnd *s
 }
 
 /**
+ * maybe_retry - decide if to retry a scsi-command or to return an error
+ * @scmd:	scsi command to examine
+ * @retry_code	ADD_TO_MLQUEUE or NEEDS_RETRY
+ *
+ * Notes:
+ *    Returning FAILED will activate the error handler,
+ *    returning SUCCESS will fail the scsi command
+ **/
+static int maybe_retry(struct scsi_cmnd *scmd, int retry_code)
+{
+	struct scsi_device *sdev = scmd->device;
+
+	scmd->retries++;
+
+	/* we requeue for retry because the error was retryable, and
+	 * the request was not marked fast fail.  Note that above,
+	 * even if the request is marked fast fail, we still requeue
+	 * for queue congestion conditions (QUEUE_FULL or BUSY) */
+	if (scmd->retries <= scmd->allowed
+	    && !blk_noretry_request(scmd->request)) {
+		return retry_code;
+	} else  if (scmd->retries <= scmd->allowed + 1
+		&& !blk_noretry_request(scmd->request)
+		&& retry_code == ADD_TO_MLQUEUE
+		&& scmd->allowed > 1) {
+		/*
+		 * activate error recovery
+		 */
+		sdev_printk(KERN_INFO, sdev, "scmd retry %d/%d\n",
+			    scmd->retries, scmd->allowed + 1);
+		return FAILED;
+	} else {
+		/*
+		 * no more retries - report this one back to upper level.
+		 */
+		return SUCCESS;
+	}
+}
+
+
+/**
  * scsi_decide_disposition - Disposition a cmd on return from LLD.
  * @scmd:	SCSI cmd to examine.
  *
@@ -1336,7 +1377,7 @@ int scsi_decide_disposition(struct scsi_
 		 * and not get stuck in a loop.
 		 */
 	case DID_SOFT_ERROR:
-		goto maybe_retry;
+		return maybe_retry(scmd, ADD_TO_MLQUEUE);
 	case DID_IMM_RETRY:
 		return NEEDS_RETRY;
 
@@ -1350,7 +1391,7 @@ int scsi_decide_disposition(struct scsi_
 		 * based on its timers and recovery capablilities if
 		 * there are enough retries.
 		 */
-		goto maybe_retry;
+		return maybe_retry(scmd, NEEDS_RETRY);
 	case DID_TRANSPORT_FAILFAST:
 		/*
 		 * The transport decided to failfast the IO (most likely
@@ -1369,7 +1410,7 @@ int scsi_decide_disposition(struct scsi_
 
 	case DID_BUS_BUSY:
 	case DID_PARITY:
-		goto maybe_retry;
+		return maybe_retry(scmd, NEEDS_RETRY);
 	case DID_TIME_OUT:
 		/*
 		 * when we scan the bus, we get timeout messages for
@@ -1418,7 +1459,7 @@ int scsi_decide_disposition(struct scsi_
 	case CHECK_CONDITION:
 		rtn = scsi_check_sense(scmd);
 		if (rtn == NEEDS_RETRY)
-			goto maybe_retry;
+			return maybe_retry(scmd, NEEDS_RETRY);
 		/* if rtn == FAILED, we have no sense information;
 		 * returning FAILED will wake the error handler thread
 		 * to collect the sense and redo the decide
@@ -1441,22 +1482,6 @@ int scsi_decide_disposition(struct scsi_
 		return FAILED;
 	}
 	return FAILED;
-
-      maybe_retry:
-
-	/* we requeue for retry because the error was retryable, and
-	 * the request was not marked fast fail.  Note that above,
-	 * even if the request is marked fast fail, we still requeue
-	 * for queue congestion conditions (QUEUE_FULL or BUSY) */
-	if ((++scmd->retries) <= scmd->allowed
-	    && !scsi_noretry_cmd(scmd)) {
-		return NEEDS_RETRY;
-	} else {
-		/*
-		 * no more retries - report this one back to upper level.
-		 */
-		return SUCCESS;
-	}
 }
 
 /**


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 03/07] Don't online offlined devices in scsi_target_quiesce()
  2008-11-26 17:40 [PATCH 0/7] scsi error handler improvements Bernd Schubert
  2008-11-26 17:44 ` [PATCH 1/7] print eh activation Bernd Schubert
  2008-11-26 17:46 ` [PATCH 2/7] Allow requeuement on DID_SOFT_ERROR Bernd Schubert
@ 2008-11-26 18:25 ` Bernd Schubert
  2008-11-26 18:26 ` [PATCH 4/7] allow activation of eh on DID_NO_CONNECT Bernd Schubert
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 19+ messages in thread
From: Bernd Schubert @ 2008-11-26 18:25 UTC (permalink / raw)
  To: linux-scsi; +Cc: James Bottomley

Don't online offlined devices in scsi_target_quiesce(), it often onlines
devices which just have been take offline due to an error on the host.

Signed-off-by: Bernd Schubert <bs@q-leap.de>

---
 drivers/scsi/scsi_lib.c |   10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

Index: linux-2.6/drivers/scsi/scsi_lib.c
===================================================================
--- linux-2.6.orig/drivers/scsi/scsi_lib.c
+++ linux-2.6/drivers/scsi/scsi_lib.c
@@ -2487,7 +2487,15 @@ device_quiesce_fn(struct scsi_device *sd
 void
 scsi_target_quiesce(struct scsi_target *starget)
 {
-	starget_for_each_device(starget, NULL, device_quiesce_fn);
+	struct Scsi_Host *shost = dev_to_shost(starget->dev.parent);
+	struct scsi_device *sdev;
+
+	shost_for_each_device(sdev, shost) {
+		if ((sdev->channel == starget->channel) &&
+		    (sdev->id == starget->id) &&
+		    (sdev->sdev_state != SDEV_OFFLINE))
+			device_quiesce_fn(sdev, NULL);
+	}
 }
 EXPORT_SYMBOL(scsi_target_quiesce);

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 4/7] allow activation of eh on DID_NO_CONNECT
  2008-11-26 17:40 [PATCH 0/7] scsi error handler improvements Bernd Schubert
                   ` (2 preceding siblings ...)
  2008-11-26 18:25 ` [PATCH 03/07] Don't online offlined devices in scsi_target_quiesce() Bernd Schubert
@ 2008-11-26 18:26 ` Bernd Schubert
  2008-11-26 18:29 ` [PATCH 5/7] time needs to be adjusted when eh was running Bernd Schubert
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 19+ messages in thread
From: Bernd Schubert @ 2008-11-26 18:26 UTC (permalink / raw)
  To: linux-scsi; +Cc: James Bottomley

Introduce a sysfs attribute to allow to activate the error handler for
these devices in case of DID_NO_CONNECT. This is useful for our parallel
scsi devices.

Signed-off-by: Bernd Schubert: <bs@q-leap.de>

---
 drivers/scsi/scsi_error.c  |    2 ++
 drivers/scsi/scsi_sysfs.c  |   28 ++++++++++++++++++++++++++++
 include/scsi/scsi_device.h |    1 +
 3 files changed, 31 insertions(+)

Index: linux-2.6/include/scsi/scsi_device.h
===================================================================
--- linux-2.6.orig/include/scsi/scsi_device.h
+++ linux-2.6/include/scsi/scsi_device.h
@@ -145,6 +145,7 @@ struct scsi_device {
 	unsigned retry_hwerror:1;	/* Retry HARDWARE_ERROR */
 	unsigned last_sector_bug:1;	/* do not use multisector accesses on
 					   SD_LAST_BUGGY_SECTORS */
+	unsigned eh_did_no_connect:1;	/* eh on DID_NO_CONNECT */
 
 	DECLARE_BITMAP(supported_events, SDEV_EVT_MAXBITS); /* supported events */
 	struct list_head event_list;	/* asserted events */
Index: linux-2.6/drivers/scsi/scsi_error.c
===================================================================
--- linux-2.6.orig/drivers/scsi/scsi_error.c
+++ linux-2.6/drivers/scsi/scsi_error.c
@@ -1359,6 +1359,8 @@ int scsi_decide_disposition(struct scsi_
 		 */
 		break;
 	case DID_NO_CONNECT:
+		if (scmd->device->eh_did_no_connect)
+			return FAILED;
 	case DID_BAD_TARGET:
 	case DID_ABORT:
 		/*
Index: linux-2.6/drivers/scsi/scsi_sysfs.c
===================================================================
--- linux-2.6.orig/drivers/scsi/scsi_sysfs.c
+++ linux-2.6/drivers/scsi/scsi_sysfs.c
@@ -565,6 +565,33 @@ sdev_rd_attr (rev, "%.4s\n");
  * TODO: can we make these symlinks to the block layer ones?
  */
 static ssize_t
+sdev_show_eh_dnc(struct device *dev, struct device_attribute *attr, char *buf)
+{
+	struct scsi_device *sdev;
+	sdev = to_scsi_device(dev);
+	return snprintf(buf, 20, "%d\n", sdev->eh_did_no_connect);
+}
+
+static ssize_t
+sdev_store_eh_dnc(struct device *dev, struct device_attribute *attr,
+		  const char *buf, size_t count)
+{
+	struct scsi_device *sdev;
+	int eh_dnc;
+	sdev = to_scsi_device(dev);
+	sscanf(buf, "%d\n", &eh_dnc);
+
+	if (eh_dnc != 0 && eh_dnc != 1)
+		return -EINVAL;
+
+	sdev->eh_did_no_connect = eh_dnc;
+	return count;
+}
+static DEVICE_ATTR(eh_did_no_connect, S_IRUGO | S_IWUSR,
+		   sdev_show_eh_dnc, sdev_store_eh_dnc);
+
+
+static ssize_t
 sdev_show_timeout (struct device *dev, struct device_attribute *attr, char *buf)
 {
 	struct scsi_device *sdev;
@@ -755,6 +782,7 @@ static struct attribute *scsi_sdev_attrs
 	&dev_attr_ioerr_cnt.attr,
 	&dev_attr_modalias.attr,
 	REF_EVT(media_change),
+	&dev_attr_eh_did_no_connect,
 	NULL
 };
 



^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 5/7] time needs to be adjusted when eh was running
  2008-11-26 17:40 [PATCH 0/7] scsi error handler improvements Bernd Schubert
                   ` (3 preceding siblings ...)
  2008-11-26 18:26 ` [PATCH 4/7] allow activation of eh on DID_NO_CONNECT Bernd Schubert
@ 2008-11-26 18:29 ` Bernd Schubert
  2009-01-07 18:09   ` Bernd Schubert
  2008-11-26 18:31 ` [PATCH 6/7] SYNCHRONIZE_CACHE command used fixed value Bernd Schubert
  2008-11-26 18:32 ` [PATCH 0/7] trivial: move a variable from function to if-scope Bernd Schubert
  6 siblings, 1 reply; 19+ messages in thread
From: Bernd Schubert @ 2008-11-26 18:29 UTC (permalink / raw)
  To: linux-scsi; +Cc: James Bottomley

If the error handler was active and we use the normal command timeout,
the command will probably fail, even though the error handler may have
succeeded to recover the device.

Signed-off-by: Bernd Schubert <bs@q-leap.de>


 drivers/scsi/scsi_error.c |    3 +++
 drivers/scsi/scsi_lib.c   |   13 ++++++++++++-
 include/scsi/scsi_host.h  |    2 ++
 3 files changed, 17 insertions(+), 1 deletion(-)

Index: linux-2.6/drivers/scsi/scsi_lib.c
===================================================================
--- linux-2.6.orig/drivers/scsi/scsi_lib.c
+++ linux-2.6/drivers/scsi/scsi_lib.c
@@ -1537,7 +1537,9 @@ static void scsi_softirq_done(struct req
 {
 	struct scsi_cmnd *cmd = rq->special;
 	unsigned long wait_for = (cmd->allowed + 1) * rq->timeout;
+	unsigned long cmd_start;
 	int disposition;
+	struct Scsi_Host *shost = cmd->device->host;
 
 	INIT_LIST_HEAD(&cmd->eh_entry);
 
@@ -1550,9 +1552,18 @@ static void scsi_softirq_done(struct req
 	if (cmd->result)
 		atomic_inc(&cmd->device->ioerr_cnt);
 
+	/*
+	 * If the error handler was active, the command will need more time,
+	 * of course.
+	 */
+	if (shost->last_recovery > cmd->jiffies_at_alloc)
+		cmd_start = shost->last_recovery;
+	else
+		cmd_start = cmd->jiffies_at_alloc;
+
 	disposition = scsi_decide_disposition(cmd);
 	if (disposition != SUCCESS &&
-	    time_before(cmd->jiffies_at_alloc + wait_for, jiffies)) {
+	    time_before(cmd_start + wait_for, jiffies)) {
 		sdev_printk(KERN_ERR, cmd->device,
 			    "timing out command, waited %lus\n",
 			    wait_for/HZ);
Index: linux-2.6/include/scsi/scsi_host.h
===================================================================
--- linux-2.6.orig/include/scsi/scsi_host.h
+++ linux-2.6/include/scsi/scsi_host.h
@@ -521,6 +521,8 @@ struct Scsi_Host {
 	struct task_struct    * ehandler;  /* Error recovery thread. */
 	struct completion     * eh_action; /* Wait for specific actions on the
 					      host. */
+	unsigned long		last_recovery;  /* last time eh completed */
+
 	wait_queue_head_t       host_wait;
 	struct scsi_host_template *hostt;
 	struct scsi_transport_template *transportt;
Index: linux-2.6/drivers/scsi/scsi_error.c
===================================================================
--- linux-2.6.orig/drivers/scsi/scsi_error.c
+++ linux-2.6/drivers/scsi/scsi_error.c
@@ -1559,6 +1559,9 @@ static void scsi_restart_operations(stru
 			BUG_ON(scsi_host_set_state(shost, SHOST_DEL));
 	spin_unlock_irqrestore(shost->host_lock, flags);
 
+	/* before starting the queues save the time of recovery */
+	shost->last_recovery = jiffies;
+
 	wake_up(&shost->host_wait);
 
 	/*


-- 
Bernd Schubert
Q-Leap Networks GmbH

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 6/7] SYNCHRONIZE_CACHE command used fixed value
  2008-11-26 17:40 [PATCH 0/7] scsi error handler improvements Bernd Schubert
                   ` (4 preceding siblings ...)
  2008-11-26 18:29 ` [PATCH 5/7] time needs to be adjusted when eh was running Bernd Schubert
@ 2008-11-26 18:31 ` Bernd Schubert
  2008-11-26 18:32 ` [PATCH 0/7] trivial: move a variable from function to if-scope Bernd Schubert
  6 siblings, 0 replies; 19+ messages in thread
From: Bernd Schubert @ 2008-11-26 18:31 UTC (permalink / raw)
  To: linux-scsi; +Cc: James Bottomley

Switch to adjustable timeout value for scsi SYNCHRONIZE_CACHE commands.
We have been running into problems with fixed values on troublesome devices.

Signed-off-by: Bernd Schubert <bs@q-leap.de>


---
 drivers/scsi/sd.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

Index: linux-2.6/drivers/scsi/sd.c
===================================================================
--- linux-2.6.orig/drivers/scsi/sd.c
+++ linux-2.6/drivers/scsi/sd.c
@@ -884,7 +884,8 @@ static int sd_sync_cache(struct scsi_dis
 		 * flush everything.
 		 */
 		res = scsi_execute_req(sdp, cmd, DMA_NONE, NULL, 0, &sshdr,
-				       SD_TIMEOUT, SD_MAX_RETRIES);
+				       sdp->request_queue->rq_timeout,
+				       SD_MAX_RETRIES);
 		if (res == 0)
 			break;
 	}
@@ -903,7 +904,7 @@ static int sd_sync_cache(struct scsi_dis
 static void sd_prepare_flush(struct request_queue *q, struct request *rq)
 {
 	rq->cmd_type = REQ_TYPE_BLOCK_PC;
-	rq->timeout = SD_TIMEOUT;
+	rq->timeout = q->rq_timeout;
 	rq->cmd[0] = SYNCHRONIZE_CACHE;
 	rq->cmd_len = 10;
 }


-- 
Bernd Schubert
Q-Leap Networks GmbH

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 0/7] trivial: move a variable from function to if-scope
  2008-11-26 17:40 [PATCH 0/7] scsi error handler improvements Bernd Schubert
                   ` (5 preceding siblings ...)
  2008-11-26 18:31 ` [PATCH 6/7] SYNCHRONIZE_CACHE command used fixed value Bernd Schubert
@ 2008-11-26 18:32 ` Bernd Schubert
  6 siblings, 0 replies; 19+ messages in thread
From: Bernd Schubert @ 2008-11-26 18:32 UTC (permalink / raw)
  To: linux-scsi; +Cc: James Bottomley

Move a variable from functions scope to if-scope.

Signed-off-by: Bernd Schubert <bs@q-leap.de>

---
 drivers/scsi/scsi_error.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Index: linux-2.6/drivers/scsi/scsi_error.c
===================================================================
--- linux-2.6.orig/drivers/scsi/scsi_error.c
+++ linux-2.6/drivers/scsi/scsi_error.c
@@ -132,7 +132,6 @@ enum blk_eh_timer_return scsi_times_out(
 {
 	struct scsi_cmnd *scmd = req->special;
 	enum blk_eh_timer_return (*eh_timed_out)(struct scsi_cmnd *);
-	enum blk_eh_timer_return rtn = eh_timed_out(scmd);
 
 	scsi_log_completion(scmd, TIMEOUT_ERROR);
 
@@ -152,6 +151,8 @@ enum blk_eh_timer_return scsi_times_out(
 		eh_timed_out = NULL;
 
 	if (eh_timed_out) {
+		enum blk_eh_timer_return rtn = eh_timed_out(scmd);
+
 		switch (rtn) {
 		case BLK_EH_NOT_HANDLED:
 			break;



-- 
Bernd Schubert
Q-Leap Networks GmbH

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/7] print eh activation
  2008-11-26 17:44 ` [PATCH 1/7] print eh activation Bernd Schubert
@ 2008-11-26 18:47   ` James Bottomley
  2008-12-03 11:19     ` Bernd Schubert
  0 siblings, 1 reply; 19+ messages in thread
From: James Bottomley @ 2008-11-26 18:47 UTC (permalink / raw)
  To: Bernd Schubert; +Cc: linux-scsi

On Wed, 2008-11-26 at 18:44 +0100, Bernd Schubert wrote:
> Print activation of the scsi error handler to let the user know what was 
> the the error handler was activated. These information are essential to
> diagnose hardware issues.

But it can be turned on already with SCSI logging ... at least the
activation message.  I don't think we want this to be printed all the
time, because the error handler can be activated in non-error situations
for some HBAs (like sense collection for non-ACA emulating drivers).

James

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/7] Allow requeuement on DID_SOFT_ERROR
  2008-11-26 17:46 ` [PATCH 2/7] Allow requeuement on DID_SOFT_ERROR Bernd Schubert
@ 2008-11-26 18:47   ` James Bottomley
  2008-12-03 12:17     ` Bernd Schubert
  0 siblings, 1 reply; 19+ messages in thread
From: James Bottomley @ 2008-11-26 18:47 UTC (permalink / raw)
  To: Bernd Schubert; +Cc: linux-scsi

On Wed, 2008-11-26 at 18:46 +0100, Bernd Schubert wrote:
> Activate the error handler if DID_SOFT_ERROR failed to often, but only 
> for commands which have a scmd->allowed > 1. 
> Also make a function out of a goto-block.

What is the rationale for this?  It really doesn't look right since
DID_SOFT_ERROR is supposed to be for temporary out of resource
conditions in the HBA driver ... activating the error handler isn't
really going to fix this because the eh is taking us through a state
model for device conditions, which DID_SOFT_ERROR shouldn't be.

If you just need a DID_FAIL to activate the eh, it can be added without
changing the meaning of DID_SOFT_ERROR.

Also, you changed the return to make it device blocking (which also
doesn't look right) but didn't document that in the change log.

James

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/7] print eh activation
  2008-11-26 18:47   ` James Bottomley
@ 2008-12-03 11:19     ` Bernd Schubert
  2008-12-03 15:16       ` James Bottomley
  0 siblings, 1 reply; 19+ messages in thread
From: Bernd Schubert @ 2008-12-03 11:19 UTC (permalink / raw)
  To: James Bottomley; +Cc: linux-scsi

On Wednesday 26 November 2008 19:47:02 James Bottomley wrote:
> On Wed, 2008-11-26 at 18:44 +0100, Bernd Schubert wrote:
> > Print activation of the scsi error handler to let the user know what was
> > the the error handler was activated. These information are essential to
> > diagnose hardware issues.
>
> But it can be turned on already with SCSI logging ... at least the
> activation message.  I don't think we want this to be printed all the
> time, because the error handler can be activated in non-error situations
> for some HBAs (like sense collection for non-ACA emulating drivers).

Sorry for the late reply, I didn't have access to my mails for a few days. 

Actually I entirely disagree, activating the error handler should be an 
exception and as such exception, it shall print it was activated and also the 
reason why it was activated. Without these information we see quite often in 
our logs something like:

[12165690.357905] mptscsih: ioc1: attempting task abort! (sc=ffff81012a957500)
[12165690.357966] sd 3:0:1:0:
[12165690.358018]         command: cdb[0]=0x28: 28 00 37 10 e9 4f 00 00 08 00
[12165690.732712] mptbase: ioc1: IOCStatus(0x0048): SCSI Task Terminated
[12165690.733699] mptscsih: ioc1: task abort: SUCCESS (sc=ffff81012a957500)

But this gives you no chance to see, where it comes from. After adding the 
additional printks from my patch, we recognized the error handler was 
activated mostly due to command timeouts. So increasing the timeouts to >90s 
already solved 2/3rds of our problems. Please also see patch nr. 6, the 
additional printks did help me to recognize always only one special scsi 
command fails.

In my opinion, if a driver needs the error handler for specific actions, we 
should create another interface for that. Could you please point me to such a 
non-ACA river?
I also only see two calling functions of scsi_eh_scmd_add(), namely 
scsi_times_out() and scsi_softirq_done() and only for these calls the 
additinal printks will be done (since scmd is required to do the printks).

Thanks,
Bernd

-- 
Bernd Schubert
Q-Leap Networks GmbH

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/7] Allow requeuement on DID_SOFT_ERROR
  2008-11-26 18:47   ` James Bottomley
@ 2008-12-03 12:17     ` Bernd Schubert
  2008-12-03 15:16       ` James Bottomley
  0 siblings, 1 reply; 19+ messages in thread
From: Bernd Schubert @ 2008-12-03 12:17 UTC (permalink / raw)
  To: James Bottomley; +Cc: linux-scsi

On Wednesday 26 November 2008 19:47:47 James Bottomley wrote:
> On Wed, 2008-11-26 at 18:46 +0100, Bernd Schubert wrote:
> > Activate the error handler if DID_SOFT_ERROR failed to often, but only
> > for commands which have a scmd->allowed > 1.
> > Also make a function out of a goto-block.
>
> What is the rationale for this?  It really doesn't look right since
> DID_SOFT_ERROR is supposed to be for temporary out of resource
> conditions in the HBA driver ... activating the error handler isn't
> really going to fix this because the eh is taking us through a state
> model for device conditions, which DID_SOFT_ERROR shouldn't be.

What do you suggest instead of? Just returning an I/O error without even to 
try to recover the device isn't nice.

>
> If you just need a DID_FAIL to activate the eh, it can be added without
> changing the meaning of DID_SOFT_ERROR.
>
> Also, you changed the return to make it device blocking (which also
> doesn't look right) but didn't document that in the change log.

Last year you suggested to switch from NEEDS_RETRY to ADD_TO_MLQUEUE

http://www.mail-archive.com/linux-scsi%40vger.kernel.org/msg12475.html

When I wrote the patch documentation, I already forgot about it, sorry. 
Unfortunately, it didn't help much for our devices. So I made it to activate 
the eh only, if it fails too often. With activated eh, devices sometimes can 
be recovered. But I'm certainly grateful for any hints to further improve 
recovery and to prevent i/o errors. 

Thanks,
Bernd


-- 
Bernd Schubert
Q-Leap Networks GmbH

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/7] print eh activation
  2008-12-03 11:19     ` Bernd Schubert
@ 2008-12-03 15:16       ` James Bottomley
  2008-12-03 15:52         ` Bernd Schubert
  0 siblings, 1 reply; 19+ messages in thread
From: James Bottomley @ 2008-12-03 15:16 UTC (permalink / raw)
  To: Bernd Schubert; +Cc: linux-scsi

On Wed, 2008-12-03 at 12:19 +0100, Bernd Schubert wrote:
> On Wednesday 26 November 2008 19:47:02 James Bottomley wrote:
> > On Wed, 2008-11-26 at 18:44 +0100, Bernd Schubert wrote:
> > > Print activation of the scsi error handler to let the user know what was
> > > the the error handler was activated. These information are essential to
> > > diagnose hardware issues.
> >
> > But it can be turned on already with SCSI logging ... at least the
> > activation message.  I don't think we want this to be printed all the
> > time, because the error handler can be activated in non-error situations
> > for some HBAs (like sense collection for non-ACA emulating drivers).
> 
> Sorry for the late reply, I didn't have access to my mails for a few days. 
> 
> Actually I entirely disagree, activating the error handler should be an 
> exception and as such exception, it shall print it was activated and also the 
> reason why it was activated. Without these information we see quite often in 
> our logs something like:
> 
> [12165690.357905] mptscsih: ioc1: attempting task abort! (sc=ffff81012a957500)
> [12165690.357966] sd 3:0:1:0:
> [12165690.358018]         command: cdb[0]=0x28: 28 00 37 10 e9 4f 00 00 08 00
> [12165690.732712] mptbase: ioc1: IOCStatus(0x0048): SCSI Task Terminated
> [12165690.733699] mptscsih: ioc1: task abort: SUCCESS (sc=ffff81012a957500)
> 
> But this gives you no chance to see, where it comes from. After adding the 
> additional printks from my patch, we recognized the error handler was 
> activated mostly due to command timeouts. So increasing the timeouts to >90s 
> already solved 2/3rds of our problems. Please also see patch nr. 6, the 
> additional printks did help me to recognize always only one special scsi 
> command fails.

But surely what you're arguing for then, is a printk on command timeout?

> In my opinion, if a driver needs the error handler for specific actions, we 
> should create another interface for that. Could you please point me to such a 
> non-ACA river?
> I also only see two calling functions of scsi_eh_scmd_add(), namely 
> scsi_times_out() and scsi_softirq_done() and only for these calls the 
> additinal printks will be done (since scmd is required to do the printks).

Mostly we converted the in-use drivers, but things like the parallel
port drivers still use this mechanism.

James



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/7] Allow requeuement on DID_SOFT_ERROR
  2008-12-03 12:17     ` Bernd Schubert
@ 2008-12-03 15:16       ` James Bottomley
  2008-12-03 16:00         ` Bernd Schubert
  0 siblings, 1 reply; 19+ messages in thread
From: James Bottomley @ 2008-12-03 15:16 UTC (permalink / raw)
  To: Bernd Schubert; +Cc: linux-scsi

On Wed, 2008-12-03 at 13:17 +0100, Bernd Schubert wrote:
> On Wednesday 26 November 2008 19:47:47 James Bottomley wrote:
> > On Wed, 2008-11-26 at 18:46 +0100, Bernd Schubert wrote:
> > > Activate the error handler if DID_SOFT_ERROR failed to often, but only
> > > for commands which have a scmd->allowed > 1.
> > > Also make a function out of a goto-block.
> >
> > What is the rationale for this?  It really doesn't look right since
> > DID_SOFT_ERROR is supposed to be for temporary out of resource
> > conditions in the HBA driver ... activating the error handler isn't
> > really going to fix this because the eh is taking us through a state
> > model for device conditions, which DID_SOFT_ERROR shouldn't be.
> 
> What do you suggest instead of? Just returning an I/O error without even to 
> try to recover the device isn't nice.

it doesn't do that ... it retries up to the retry limit before failing
the command.  There is an argument that we should treat this as other
temporary resource conditions like BUSY and QUEUE_FULL, so return
ADD_TO_MLQUEUE.  On the other hand, DID_REQUEUE already does that, so
this would lose the only unconditional DID_ code going generically
through the retry path.

> >
> > If you just need a DID_FAIL to activate the eh, it can be added without
> > changing the meaning of DID_SOFT_ERROR.
> >
> > Also, you changed the return to make it device blocking (which also
> > doesn't look right) but didn't document that in the change log.
> 
> Last year you suggested to switch from NEEDS_RETRY to ADD_TO_MLQUEUE
> 
> http://www.mail-archive.com/linux-scsi%40vger.kernel.org/msg12475.html
> 
> When I wrote the patch documentation, I already forgot about it, sorry. 
> Unfortunately, it didn't help much for our devices. So I made it to activate 
> the eh only, if it fails too often. With activated eh, devices sometimes can 
> be recovered. But I'm certainly grateful for any hints to further improve 
> recovery and to prevent i/o errors. 

Well, what exactly is the problem?  changing to ADD_TO_MLQUEUE will
retry intermittently up to the command timeout.  If activating the error
handler actually fixes the problem, then the driver was probably wrongly
returning DID_SOFT_ERROR.

James



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/7] print eh activation
  2008-12-03 15:16       ` James Bottomley
@ 2008-12-03 15:52         ` Bernd Schubert
  0 siblings, 0 replies; 19+ messages in thread
From: Bernd Schubert @ 2008-12-03 15:52 UTC (permalink / raw)
  To: James Bottomley; +Cc: linux-scsi

On Wednesday 03 December 2008 16:16:30 James Bottomley wrote:
> On Wed, 2008-12-03 at 12:19 +0100, Bernd Schubert wrote:
> > On Wednesday 26 November 2008 19:47:02 James Bottomley wrote:
> > > On Wed, 2008-11-26 at 18:44 +0100, Bernd Schubert wrote:
> > > > Print activation of the scsi error handler to let the user know what
> > > > was the the error handler was activated. These information are
> > > > essential to diagnose hardware issues.
> > >
> > > But it can be turned on already with SCSI logging ... at least the
> > > activation message.  I don't think we want this to be printed all the
> > > time, because the error handler can be activated in non-error
> > > situations for some HBAs (like sense collection for non-ACA emulating
> > > drivers).
> >
> > Sorry for the late reply, I didn't have access to my mails for a few
> > days.
> >
> > Actually I entirely disagree, activating the error handler should be an
> > exception and as such exception, it shall print it was activated and also
> > the reason why it was activated. Without these information we see quite
> > often in our logs something like:
> >
> > [12165690.357905] mptscsih: ioc1: attempting task abort!
> > (sc=ffff81012a957500) [12165690.357966] sd 3:0:1:0:
> > [12165690.358018]         command: cdb[0]=0x28: 28 00 37 10 e9 4f 00 00
> > 08 00 [12165690.732712] mptbase: ioc1: IOCStatus(0x0048): SCSI Task
> > Terminated [12165690.733699] mptscsih: ioc1: task abort: SUCCESS
> > (sc=ffff81012a957500)
> >
> > But this gives you no chance to see, where it comes from. After adding
> > the additional printks from my patch, we recognized the error handler was
> > activated mostly due to command timeouts. So increasing the timeouts to
> > >90s already solved 2/3rds of our problems. Please also see patch nr. 6,
> > the additional printks did help me to recognize always only one special
> > scsi command fails.
>
> But surely what you're arguing for then, is a printk on command timeout?

I'm arguing that calling the error handler is a rare exception and that the 
admin wants to know what has caused this exception. This is also nothing
you want to enable with scsi logging, since it mostly errors happen after 
weeks when the system is already in production and nobody then has error
logging active.

Another example for the timeouts patch:

sd 6:0:2:2: [sdk] Result: hostbyte=DID_OK driverbyte=DRIVER_OK,SUGGEST_OK
sd 6:0:2:2: [sdk] CDB: Synchronize Cache(10): 35 00 00 00 00 00 00 00 00 00
sd 6:0:2:2: Activating scsi error recovery (1)
sd 6:0:2:2: trying to abort command
qla2xxx 0000:07:02.0: scsi(6:2:2): Abort command issued -- 1 36e2df2 20


So without the printk patch you would see many messages like these:

qla2xxx 0000:07:02.0: scsi(6:2:2): Abort command issued -- 1 36e2df2 20
qla2xxx 0000:07:02.0: scsi(6:2:2): Abort command issued -- 1 36e2df2 20
qla2xxx 0000:07:02.0: scsi(6:2:2): Abort command issued -- 1 36e2df2 20
qla2xxx 0000:07:02.0: scsi(6:2:2): Abort command issued -- 1 36e2df2 20
qla2xxx 0000:07:02.0: scsi(6:2:2): Abort command issued -- 1 36e2df2 20

But you wouldn't have an idea why and which command was aborted. Eventually
this will cause a severe failure of qla2xxx driver, but you never would
figure out the underlying reason. Actually, I wouldn't mind to suppress these
driver messages, but the eh activation printks are essential to understand
what is going.

>
> > In my opinion, if a driver needs the error handler for specific actions,
> > we should create another interface for that. Could you please point me to
> > such a non-ACA river?
> > I also only see two calling functions of scsi_eh_scmd_add(), namely
> > scsi_times_out() and scsi_softirq_done() and only for these calls the
> > additinal printks will be done (since scmd is required to do the
> > printks).
>
> Mostly we converted the in-use drivers, but things like the parallel
> port drivers still use this mechanism.

Thanks, going to check these now.


Cheers,
Bernd

-- 
Bernd Schubert
Q-Leap Networks GmbH

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/7] Allow requeuement on DID_SOFT_ERROR
  2008-12-03 15:16       ` James Bottomley
@ 2008-12-03 16:00         ` Bernd Schubert
  2008-12-03 16:29           ` James Bottomley
  0 siblings, 1 reply; 19+ messages in thread
From: Bernd Schubert @ 2008-12-03 16:00 UTC (permalink / raw)
  To: James Bottomley; +Cc: linux-scsi

On Wednesday 03 December 2008 16:16:31 James Bottomley wrote:
> On Wed, 2008-12-03 at 13:17 +0100, Bernd Schubert wrote:
> > On Wednesday 26 November 2008 19:47:47 James Bottomley wrote:
> > > On Wed, 2008-11-26 at 18:46 +0100, Bernd Schubert wrote:
> > > > Activate the error handler if DID_SOFT_ERROR failed to often, but
> > > > only for commands which have a scmd->allowed > 1.
> > > > Also make a function out of a goto-block.
> > >
> > > What is the rationale for this?  It really doesn't look right since
> > > DID_SOFT_ERROR is supposed to be for temporary out of resource
> > > conditions in the HBA driver ... activating the error handler isn't
> > > really going to fix this because the eh is taking us through a state
> > > model for device conditions, which DID_SOFT_ERROR shouldn't be.
> >
> > What do you suggest instead of? Just returning an I/O error without even
> > to try to recover the device isn't nice.
>
> it doesn't do that ... it retries up to the retry limit before failing
> the command.  There is an argument that we should treat this as other
> temporary resource conditions like BUSY and QUEUE_FULL, so return
> ADD_TO_MLQUEUE.  On the other hand, DID_REQUEUE already does that, so
> this would lose the only unconditional DID_ code going generically
> through the retry path.
>
> > > If you just need a DID_FAIL to activate the eh, it can be added without
> > > changing the meaning of DID_SOFT_ERROR.
> > >
> > > Also, you changed the return to make it device blocking (which also
> > > doesn't look right) but didn't document that in the change log.
> >
> > Last year you suggested to switch from NEEDS_RETRY to ADD_TO_MLQUEUE
> >
> > http://www.mail-archive.com/linux-scsi%40vger.kernel.org/msg12475.html
> >
> > When I wrote the patch documentation, I already forgot about it, sorry.
> > Unfortunately, it didn't help much for our devices. So I made it to
> > activate the eh only, if it fails too often. With activated eh, devices
> > sometimes can be recovered. But I'm certainly grateful for any hints to
> > further improve recovery and to prevent i/o errors.
>
> Well, what exactly is the problem?  changing to ADD_TO_MLQUEUE will
> retry intermittently up to the command timeout.  If activating the error
> handler actually fixes the problem, then the driver was probably wrongly
> returning DID_SOFT_ERROR.

Well, I have certainly no experience with hardware/driver programming but all 
drivers I have looked at, seem to use DID_SOFT_ERROR as something like 
DID_UNKNOWN_ERROR. 

I certainly do not insist on using ADD_TO_MLQUEUE instead of NEEDS_RETRY and I 
will happlily modify the patch, if you think NEEDS_RETRY is better. But I 
would really prefer to try to recover the device when DID_SOFT_ERROR came up. 
I mean without the eh we get an I/O error anyway. So as last attempt to try 
to do some device resets won't hurt, will it?
And I have really seen some successful mpt fusion device recoveries.


Thanks,
Bernd


-- 
Bernd Schubert
Q-Leap Networks GmbH

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/7] Allow requeuement on DID_SOFT_ERROR
  2008-12-03 16:00         ` Bernd Schubert
@ 2008-12-03 16:29           ` James Bottomley
  2008-12-03 17:06             ` Bernd Schubert
  0 siblings, 1 reply; 19+ messages in thread
From: James Bottomley @ 2008-12-03 16:29 UTC (permalink / raw)
  To: Bernd Schubert; +Cc: linux-scsi

On Wed, 2008-12-03 at 17:00 +0100, Bernd Schubert wrote:
> On Wednesday 03 December 2008 16:16:31 James Bottomley wrote:
> > On Wed, 2008-12-03 at 13:17 +0100, Bernd Schubert wrote:
> > > On Wednesday 26 November 2008 19:47:47 James Bottomley wrote:
> > > > On Wed, 2008-11-26 at 18:46 +0100, Bernd Schubert wrote:
> > > > > Activate the error handler if DID_SOFT_ERROR failed to often, but
> > > > > only for commands which have a scmd->allowed > 1.
> > > > > Also make a function out of a goto-block.
> > > >
> > > > What is the rationale for this?  It really doesn't look right since
> > > > DID_SOFT_ERROR is supposed to be for temporary out of resource
> > > > conditions in the HBA driver ... activating the error handler isn't
> > > > really going to fix this because the eh is taking us through a state
> > > > model for device conditions, which DID_SOFT_ERROR shouldn't be.
> > >
> > > What do you suggest instead of? Just returning an I/O error without even
> > > to try to recover the device isn't nice.
> >
> > it doesn't do that ... it retries up to the retry limit before failing
> > the command.  There is an argument that we should treat this as other
> > temporary resource conditions like BUSY and QUEUE_FULL, so return
> > ADD_TO_MLQUEUE.  On the other hand, DID_REQUEUE already does that, so
> > this would lose the only unconditional DID_ code going generically
> > through the retry path.
> >
> > > > If you just need a DID_FAIL to activate the eh, it can be added without
> > > > changing the meaning of DID_SOFT_ERROR.
> > > >
> > > > Also, you changed the return to make it device blocking (which also
> > > > doesn't look right) but didn't document that in the change log.
> > >
> > > Last year you suggested to switch from NEEDS_RETRY to ADD_TO_MLQUEUE
> > >
> > > http://www.mail-archive.com/linux-scsi%40vger.kernel.org/msg12475.html
> > >
> > > When I wrote the patch documentation, I already forgot about it, sorry.
> > > Unfortunately, it didn't help much for our devices. So I made it to
> > > activate the eh only, if it fails too often. With activated eh, devices
> > > sometimes can be recovered. But I'm certainly grateful for any hints to
> > > further improve recovery and to prevent i/o errors.
> >
> > Well, what exactly is the problem?  changing to ADD_TO_MLQUEUE will
> > retry intermittently up to the command timeout.  If activating the error
> > handler actually fixes the problem, then the driver was probably wrongly
> > returning DID_SOFT_ERROR.
> 
> Well, I have certainly no experience with hardware/driver programming but all 
> drivers I have looked at, seem to use DID_SOFT_ERROR as something like 
> DID_UNKNOWN_ERROR. 

DID_SOFT_ERROR means specifically that the driver ran into a resource or
other soft (as in retryable) error.

> I certainly do not insist on using ADD_TO_MLQUEUE instead of NEEDS_RETRY and I 
> will happlily modify the patch, if you think NEEDS_RETRY is better. But I 
> would really prefer to try to recover the device when DID_SOFT_ERROR came up. 
> I mean without the eh we get an I/O error anyway. So as last attempt to try 
> to do some device resets won't hurt, will it?

Yes, it will on a SAN.  For an error condition internal to the driver
what is the point of causing external disruptions to the devices?

> And I have really seen some successful mpt fusion device recoveries.

Well, my guess there would be the internal sequencer is hosed and the
host reset corrected it, is that right?  In which case, there might be a
place where mpt fusion is returning DID_SOFT_ERROR incorrectly.

James



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/7] Allow requeuement on DID_SOFT_ERROR
  2008-12-03 16:29           ` James Bottomley
@ 2008-12-03 17:06             ` Bernd Schubert
  0 siblings, 0 replies; 19+ messages in thread
From: Bernd Schubert @ 2008-12-03 17:06 UTC (permalink / raw)
  To: James Bottomley; +Cc: linux-scsi

On Wednesday 03 December 2008 17:29:22 James Bottomley wrote:
> On Wed, 2008-12-03 at 17:00 +0100, Bernd Schubert wrote:
> > On Wednesday 03 December 2008 16:16:31 James Bottomley wrote:
> > > On Wed, 2008-12-03 at 13:17 +0100, Bernd Schubert wrote:
> > > > On Wednesday 26 November 2008 19:47:47 James Bottomley wrote:
> > > > > On Wed, 2008-11-26 at 18:46 +0100, Bernd Schubert wrote:
> > > > > > Activate the error handler if DID_SOFT_ERROR failed to often, but
> > > > > > only for commands which have a scmd->allowed > 1.
> > > > > > Also make a function out of a goto-block.
> > > > >
> > > > > What is the rationale for this?  It really doesn't look right since
> > > > > DID_SOFT_ERROR is supposed to be for temporary out of resource
> > > > > conditions in the HBA driver ... activating the error handler isn't
> > > > > really going to fix this because the eh is taking us through a
> > > > > state model for device conditions, which DID_SOFT_ERROR shouldn't
> > > > > be.
> > > >
> > > > What do you suggest instead of? Just returning an I/O error without
> > > > even to try to recover the device isn't nice.
> > >
> > > it doesn't do that ... it retries up to the retry limit before failing
> > > the command.  There is an argument that we should treat this as other
> > > temporary resource conditions like BUSY and QUEUE_FULL, so return
> > > ADD_TO_MLQUEUE.  On the other hand, DID_REQUEUE already does that, so
> > > this would lose the only unconditional DID_ code going generically
> > > through the retry path.
> > >
> > > > > If you just need a DID_FAIL to activate the eh, it can be added
> > > > > without changing the meaning of DID_SOFT_ERROR.
> > > > >
> > > > > Also, you changed the return to make it device blocking (which also
> > > > > doesn't look right) but didn't document that in the change log.
> > > >
> > > > Last year you suggested to switch from NEEDS_RETRY to ADD_TO_MLQUEUE
> > > >
> > > > http://www.mail-archive.com/linux-scsi%40vger.kernel.org/msg12475.htm
> > > >l
> > > >
> > > > When I wrote the patch documentation, I already forgot about it,
> > > > sorry. Unfortunately, it didn't help much for our devices. So I made
> > > > it to activate the eh only, if it fails too often. With activated eh,
> > > > devices sometimes can be recovered. But I'm certainly grateful for
> > > > any hints to further improve recovery and to prevent i/o errors.
> > >
> > > Well, what exactly is the problem?  changing to ADD_TO_MLQUEUE will
> > > retry intermittently up to the command timeout.  If activating the
> > > error handler actually fixes the problem, then the driver was probably
> > > wrongly returning DID_SOFT_ERROR.
> >
> > Well, I have certainly no experience with hardware/driver programming but
> > all drivers I have looked at, seem to use DID_SOFT_ERROR as something
> > like DID_UNKNOWN_ERROR.
>
> DID_SOFT_ERROR means specifically that the driver ran into a resource or
> other soft (as in retryable) error.
>
> > I certainly do not insist on using ADD_TO_MLQUEUE instead of NEEDS_RETRY
> > and I will happlily modify the patch, if you think NEEDS_RETRY is better.
> > But I would really prefer to try to recover the device when
> > DID_SOFT_ERROR came up. I mean without the eh we get an I/O error anyway.
> > So as last attempt to try to do some device resets won't hurt, will it?
>
> Yes, it will on a SAN.  For an error condition internal to the driver
> what is the point of causing external disruptions to the devices?

Hrm, right. For SAN it is really a problem :(

>
> > And I have really seen some successful mpt fusion device recoveries.
>
> Well, my guess there would be the internal sequencer is hosed and the
> host reset corrected it, is that right?  In which case, there might be a
> place where mpt fusion is returning DID_SOFT_ERROR incorrectly.

Well, these DID_SOFT_ERRORs are not easily reproducibly. If they would, I 
already would have debugged it more in detail.

Any idea what we can do?

Thanks,
Bernd


-- 
Bernd Schubert
Q-Leap Networks GmbH

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 5/7] time needs to be adjusted when eh was running
  2008-11-26 18:29 ` [PATCH 5/7] time needs to be adjusted when eh was running Bernd Schubert
@ 2009-01-07 18:09   ` Bernd Schubert
  0 siblings, 0 replies; 19+ messages in thread
From: Bernd Schubert @ 2009-01-07 18:09 UTC (permalink / raw)
  To: linux-scsi; +Cc: James Bottomley

James,

you rejected the patches 1/7 and 2/7, but what about the other patches?

Especially 

[PATCH 5/7] time needs to be adjusted when eh was running
[PATCH 6/7] SYNCHRONIZE_CACHE command used fixed value

are important for us.


Thanks,
Bernd

-- 
Bernd Schubert
Q-Leap Networks GmbH

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2009-01-07 18:09 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-11-26 17:40 [PATCH 0/7] scsi error handler improvements Bernd Schubert
2008-11-26 17:44 ` [PATCH 1/7] print eh activation Bernd Schubert
2008-11-26 18:47   ` James Bottomley
2008-12-03 11:19     ` Bernd Schubert
2008-12-03 15:16       ` James Bottomley
2008-12-03 15:52         ` Bernd Schubert
2008-11-26 17:46 ` [PATCH 2/7] Allow requeuement on DID_SOFT_ERROR Bernd Schubert
2008-11-26 18:47   ` James Bottomley
2008-12-03 12:17     ` Bernd Schubert
2008-12-03 15:16       ` James Bottomley
2008-12-03 16:00         ` Bernd Schubert
2008-12-03 16:29           ` James Bottomley
2008-12-03 17:06             ` Bernd Schubert
2008-11-26 18:25 ` [PATCH 03/07] Don't online offlined devices in scsi_target_quiesce() Bernd Schubert
2008-11-26 18:26 ` [PATCH 4/7] allow activation of eh on DID_NO_CONNECT Bernd Schubert
2008-11-26 18:29 ` [PATCH 5/7] time needs to be adjusted when eh was running Bernd Schubert
2009-01-07 18:09   ` Bernd Schubert
2008-11-26 18:31 ` [PATCH 6/7] SYNCHRONIZE_CACHE command used fixed value Bernd Schubert
2008-11-26 18:32 ` [PATCH 0/7] trivial: move a variable from function to if-scope Bernd Schubert

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).