[RFC] Prevent infinite retries due to DID

public inbox for linux-scsi@vger.kernel.org
 help / color / mirror / Atom feed

* [RFC] Prevent infinite retries due to DID_RESET return status
@ 2006-12-11 21:42 Michael Reed
  2007-01-02 12:15 ` Christoph Hellwig
  0 siblings, 1 reply; 3+ messages in thread
From: Michael Reed @ 2006-12-11 21:42 UTC (permalink / raw)
  To: linux-scsi; +Cc: Jeremy Higdon

[-- Attachment #1: Type: text/plain, Size: 1147 bytes --]

Due to a firmware mismatch between a host and target (names withheld to
protect the innocent?), the LLDD was returning DID_RESET for every
i/o command.  This patch modifies the scsi layer to take into account
when the command which received DID_RESET was issued and eventually
give up on it instead of unconditionally reissuing it forever
when it receives a DID_RESET.  With this patch, on my test system,
the command receiving the constant DID_RESET times out after about
360 seconds.

The premise for this patch is that no command should have an infinite
lifetime.  The impetus for this patch was a system which would not
reach a command prompt without disconnecting the storage from the
host.

The significant change in this patch is to call scsi_retry_command()
instead of scsi_requeue_command() if the command which receives a
DID_RESET did not complete any i/o (good_bytes==0).  scsi_retry_command()
does not release the command and regenerate it like scsi_requeue_command()
does, hence jiffies_at_alloc reflects when the command was first issued.

This patch is based upon 2.6.19.  Thanks for taking the time to
look at this.

Mike









[-- Attachment #2: did_reset.patch --]
[-- Type: text/x-patch, Size: 6879 bytes --]

--- kdbu/drivers/scsi/scsi_priv.h	2006-10-09 01:58:19.000000000 -0500
+++ kdb/drivers/scsi/scsi_priv.h	2006-12-07 14:15:19.925332776 -0600
@@ -28,7 +28,7 @@ extern int scsi_dispatch_cmd(struct scsi
 extern int scsi_setup_command_freelist(struct Scsi_Host *shost);
 extern void scsi_destroy_command_freelist(struct Scsi_Host *shost);
 extern void __scsi_done(struct scsi_cmnd *cmd);
-extern int scsi_retry_command(struct scsi_cmnd *cmd);
+extern int scsi_retry_command(struct scsi_cmnd *cmd, int reason);
 #ifdef CONFIG_SCSI_LOGGING
 void scsi_log_send(struct scsi_cmnd *cmd);
 void scsi_log_completion(struct scsi_cmnd *cmd, int disposition);
--- kdbu/include/scsi/scsi.h	2006-10-31 21:08:47.000000000 -0600
+++ kdb/include/scsi/scsi.h	2006-12-07 14:13:09.188052974 -0600
@@ -353,6 +353,7 @@ struct scsi_lun {
 #define SCSI_MLQUEUE_HOST_BUSY   0x1055
 #define SCSI_MLQUEUE_DEVICE_BUSY 0x1056
 #define SCSI_MLQUEUE_EH_RETRY    0x1057
+#define SCSI_MLQUEUE_DID_RESET   0x1058
 
 /*
  *  Use these to separate status msg and our bytes
--- kdbu/drivers/scsi/scsi.c	2006-10-09 01:58:19.000000000 -0500
+++ kdb/drivers/scsi/scsi.c	2006-12-07 14:15:49.835794930 -0600
@@ -673,7 +673,7 @@ void __scsi_done(struct scsi_cmnd *cmd)
  *              level drivers should not become re-entrant as a result of
  *              this.
  */
-int scsi_retry_command(struct scsi_cmnd *cmd)
+int scsi_retry_command(struct scsi_cmnd *cmd, int reason)
 {
         /*
          * Zero the sense information from the last time we tried
@@ -681,7 +681,7 @@ int scsi_retry_command(struct scsi_cmnd 
          */
 	memset(cmd->sense_buffer, 0, sizeof(cmd->sense_buffer));
 
-	return scsi_queue_insert(cmd, SCSI_MLQUEUE_EH_RETRY);
+	return scsi_queue_insert(cmd, reason);
 }
 
 /*
--- kdbu/drivers/scsi/scsi_lib.c	2006-11-29 21:09:07.000000000 -0600
+++ kdb/drivers/scsi/scsi_lib.c	2006-12-11 14:22:52.756579311 -0600
@@ -65,6 +65,7 @@ static struct scsi_host_sg_pool scsi_sg_
 #undef SP
 
 static void scsi_run_queue(struct request_queue *q);
+static void scsi_release_buffers(struct scsi_cmnd *cmd);
 
 /*
  * Function:	scsi_unprep_request()
@@ -100,10 +101,10 @@ static void scsi_unprep_request(struct r
  *
  * Returns:     Nothing.
  *
- * Notes:       We do this for one of two cases.  Either the host is busy
+ * Notes:       We do this for one of three cases.  1) the host is busy
  *              and it cannot accept any more commands for the time being,
- *              or the device returned QUEUE_FULL and can accept no more
- *              commands.
+ *              2) the device returned QUEUE_FULL and can accept no more
+ *              commands, or 3) the LLDD returned DID_RESET.
  * Notes:       This could be called either from an interrupt context or a
  *              normal process context.
  */
@@ -137,9 +138,11 @@ int scsi_queue_insert(struct scsi_cmnd *
 
 	/*
 	 * Decrement the counters, since these commands are no longer
-	 * active on the host/device.
+	 * active on the host/device.  If the reason is SCSI_MLQUEUE_DID_RESET
+	 * then scsi_device_unbusy() was previously called.
 	 */
-	scsi_device_unbusy(device);
+	if (reason != SCSI_MLQUEUE_DID_RESET)
+		scsi_device_unbusy(device);
 
 	/*
 	 * Requeue this command.  It will go before all other commands
@@ -601,6 +604,7 @@ static void scsi_requeue_command(struct 
 	struct request *req = cmd->request;
 	unsigned long flags;
 
+	scsi_release_buffers(cmd);
 	scsi_unprep_request(req);
 	spin_lock_irqsave(q->queue_lock, flags);
 	blk_requeue_request(q, req);
@@ -646,6 +650,7 @@ void scsi_run_host_queues(struct Scsi_Ho
  * Lock status: Assumed that lock is not held upon entry.
  *
  * Returns:     cmd if requeue required, NULL otherwise.
+ * 		If cmd is returned then its buffers have not been released.
  *
  * Notes:       This is called for block device requests in order to
  *              mark some number of sectors as complete.
@@ -688,6 +693,7 @@ static struct scsi_cmnd *scsi_end_reques
 		}
 	}
 
+	scsi_release_buffers(cmd);
 	add_disk_randomness(req->rq_disk);
 
 	spin_lock_irqsave(q->queue_lock, flags);
@@ -786,6 +792,33 @@ static void scsi_release_buffers(struct 
 }
 
 /*
+ * Function:    scsi_command_expired()
+ *
+ * Purpose:     Check scsi a command's age before retrying it.
+ *
+ * Arguments:   cmd	- command that we are checking for timeout.
+ *
+ * Returns:     non-zero if command has exceeded its lifetime
+ *              zero otherwise
+ *
+ * Notes:       A commands lifetime is considered to be the number
+ *              of (retries permitted plus one) * command timeout.
+ *
+ */
+static int scsi_command_expired(struct scsi_cmnd *cmd)
+{
+	int ret = 0;
+	unsigned long wait_for = (cmd->allowed + 1) * cmd->timeout_per_command;
+	if (time_before(cmd->jiffies_at_alloc + wait_for, jiffies)) {
+		sdev_printk(KERN_ERR, cmd->device,
+			    "timing out command, waited %lus\n",
+			    wait_for/HZ);
+		ret = 1;
+	}
+	return ret;
+}
+
+/*
  * Function:    scsi_io_completion()
  *
  * Purpose:     Completion processing for block device I/O requests.
@@ -824,8 +857,6 @@ void scsi_io_completion(struct scsi_cmnd
 	int sense_valid = 0;
 	int sense_deferred = 0;
 
-	scsi_release_buffers(cmd);
-
 	if (result) {
 		sense_valid = scsi_command_normalize_sense(cmd, &sshdr);
 		if (sense_valid)
@@ -961,9 +992,20 @@ void scsi_io_completion(struct scsi_cmnd
 		/* Third party bus reset or reset for error recovery
 		 * reasons.  Just retry the request and see what
 		 * happens.
+		 * If no data was transferred, just reissue this
+		 * command.  If data was transferred, regenerate
+		 * the command to transfer only untransferred data.
 		 */
-		scsi_requeue_command(q, cmd);
-		return;
+		if (!good_bytes) {
+			if (!(scsi_command_expired(cmd))) {
+				scsi_retry_command(cmd, SCSI_MLQUEUE_DID_RESET);
+				return;
+			}
+		}
+		else {
+			scsi_requeue_command(q, cmd);
+			return;
+		}
 	}
 	if (result) {
 		if (!(req->cmd_flags & REQ_QUIET)) {
@@ -1358,17 +1400,12 @@ static void scsi_kill_request(struct req
 static void scsi_softirq_done(struct request *rq)
 {
 	struct scsi_cmnd *cmd = rq->completion_data;
-	unsigned long wait_for = (cmd->allowed + 1) * cmd->timeout_per_command;
 	int disposition;
 
 	INIT_LIST_HEAD(&cmd->eh_entry);
 
 	disposition = scsi_decide_disposition(cmd);
-	if (disposition != SUCCESS &&
-	    time_before(cmd->jiffies_at_alloc + wait_for, jiffies)) {
-		sdev_printk(KERN_ERR, cmd->device,
-			    "timing out command, waited %lus\n",
-			    wait_for/HZ);
+	if (disposition != SUCCESS && scsi_command_expired(cmd)) {
 		disposition = SUCCESS;
 	}
 			
@@ -1379,7 +1416,7 @@ static void scsi_softirq_done(struct req
 			scsi_finish_command(cmd);
 			break;
 		case NEEDS_RETRY:
-			scsi_retry_command(cmd);
+			scsi_retry_command(cmd, SCSI_MLQUEUE_EH_RETRY);
 			break;
 		case ADD_TO_MLQUEUE:
 			scsi_queue_insert(cmd, SCSI_MLQUEUE_DEVICE_BUSY);

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [RFC] Prevent infinite retries due to DID_RESET return status
  2006-12-11 21:42 [RFC] Prevent infinite retries due to DID_RESET return status Michael Reed
@ 2007-01-02 12:15 ` Christoph Hellwig
  2007-01-31 18:54   ` [PATCH 0/2] " Michael Reed
  0 siblings, 1 reply; 3+ messages in thread
From: Christoph Hellwig @ 2007-01-02 12:15 UTC (permalink / raw)
  To: Michael Reed; +Cc: linux-scsi, Jeremy Higdon

On Mon, Dec 11, 2006 at 03:42:34PM -0600, Michael Reed wrote:
> Due to a firmware mismatch between a host and target (names withheld to
> protect the innocent?), the LLDD was returning DID_RESET for every
> i/o command.  This patch modifies the scsi layer to take into account
> when the command which received DID_RESET was issued and eventually
> give up on it instead of unconditionally reissuing it forever
> when it receives a DID_RESET.  With this patch, on my test system,
> the command receiving the constant DID_RESET times out after about
> 360 seconds.
> 
> The premise for this patch is that no command should have an infinite
> lifetime.  The impetus for this patch was a system which would not
> reach a command prompt without disconnecting the storage from the
> host.
> 
> The significant change in this patch is to call scsi_retry_command()
> instead of scsi_requeue_command() if the command which receives a
> DID_RESET did not complete any i/o (good_bytes==0).  scsi_retry_command()
> does not release the command and regenerate it like scsi_requeue_command()
> does, hence jiffies_at_alloc reflects when the command was first issued.

Generally this patch looks good to me.  Some comments:


> -extern int scsi_retry_command(struct scsi_cmnd *cmd);
> +extern int scsi_retry_command(struct scsi_cmnd *cmd, int reason);

	I've just sent a patch to kill scsi_retry_command.  Use
	scsi_queue_insert directly instead.

> +	scsi_release_buffers(cmd);

	Can you please separate out the moving of the scsi_release_buffer
	calls into a separate patch so it can be audited better?

> @@ -961,9 +992,20 @@ void scsi_io_completion(struct scsi_cmnd
>  		/* Third party bus reset or reset for error recovery
>  		 * reasons.  Just retry the request and see what
>  		 * happens.
> +		 * If no data was transferred, just reissue this
> +		 * command.  If data was transferred, regenerate
> +		 * the command to transfer only untransferred data.
>  		 */

The whole comment should look more like:

		/*
		 * Third party bus reset or reset for error recovery reasons.
		 * If no data was transferred, just reissue this command.
		 * If data was transferred, regenerate the command to transfer
		 * only untransferred data.
		 */

> -		scsi_requeue_command(q, cmd);
> -		return;
> +		if (!good_bytes) {
> +			if (!(scsi_command_expired(cmd))) {
> +				scsi_retry_command(cmd, SCSI_MLQUEUE_DID_RESET);
> +				return;
> +			}
> +		}
> +		else {
> +			scsi_requeue_command(q, cmd);
> +			return;
> +		}

	With this code we now fallthrough if we don't have any good bytes
	and the command has expired.  Is this the expected behaviour? If
	yes we need a good comment describing it.


^ permalink raw reply	[flat|nested] 3+ messages in thread

* [PATCH 0/2] Prevent infinite retries due to DID_RESET return status
  2007-01-02 12:15 ` Christoph Hellwig
@ 2007-01-31 18:54   ` Michael Reed
  0 siblings, 0 replies; 3+ messages in thread
From: Michael Reed @ 2007-01-31 18:54 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-scsi, Jeremy Higdon

I have implemented Christoph's suggestions and comments in his reply
to my RFC.
--------------------------------------------------------------------

Due to a firmware mismatch between a host and target (names withheld to
protect the innocent?), the LLDD was returning DID_RESET for every
i/o command.  This patch modifies the scsi layer to take into account
when the command which received DID_RESET was issued and eventually
give up on it instead of unconditionally reissuing it forever.
With this patch, on my test system, the command receiving the solid
DID_RESET times out after about 360 seconds.

The premise for this patch is that no command should have an infinite
lifetime.  The impetus for this patch was a system which would not
reach a command prompt without disconnecting the storage from the
host.

The significant change in this patch is to call scsi_queue_insert()
instead of scsi_requeue_command() if the command which receives a
DID_RESET did not complete any i/o (good_bytes==0).  scsi_queue_insert()
does not release the command and regenerate it like scsi_requeue_command()
does, hence jiffies_at_alloc reflects when the command was first issued.

Per Christoph's suggestion, I have broken this into two patches.  The
first implements the moving of the scsi_release_buffer() calls so that
it can be more easily reviewed.  The second patch implements the
lifetime timer for commands receiving DID_RESET.

These patches differ from the first posting in that scsi_queue_insert()
is called directly instead of calling the since removed scsi_retry_command();
comments have been cleaned up; and a comment has been added to indicate
that the code is supposed to fall through to call scsi_end_request()
if the command which received DID_RESET has expired.

These patches were tested by modifying a LLDD to return DID_RESET for
every command received.

Patches apply to 2.6.20-rc6-git1.

Signed-off-by: Michael Reed <mdr@sgi.com>

Christoph Hellwig wrote:
> On Mon, Dec 11, 2006 at 03:42:34PM -0600, Michael Reed wrote:
>> Due to a firmware mismatch between a host and target (names withheld to
>> protect the innocent?), the LLDD was returning DID_RESET for every
>> i/o command.  This patch modifies the scsi layer to take into account
>> when the command which received DID_RESET was issued and eventually
>> give up on it instead of unconditionally reissuing it forever
>> when it receives a DID_RESET.  With this patch, on my test system,
>> the command receiving the constant DID_RESET times out after about
>> 360 seconds.
>>
>> The premise for this patch is that no command should have an infinite
>> lifetime.  The impetus for this patch was a system which would not
>> reach a command prompt without disconnecting the storage from the
>> host.
>>
>> The significant change in this patch is to call scsi_retry_command()
>> instead of scsi_requeue_command() if the command which receives a
>> DID_RESET did not complete any i/o (good_bytes==0).  scsi_retry_command()
>> does not release the command and regenerate it like scsi_requeue_command()
>> does, hence jiffies_at_alloc reflects when the command was first issued.
> 
> Generally this patch looks good to me.  Some comments:
> 
> 
>> -extern int scsi_retry_command(struct scsi_cmnd *cmd);
>> +extern int scsi_retry_command(struct scsi_cmnd *cmd, int reason);
> 
> 	I've just sent a patch to kill scsi_retry_command.  Use
> 	scsi_queue_insert directly instead.
> 
>> +	scsi_release_buffers(cmd);
> 
> 	Can you please separate out the moving of the scsi_release_buffer
> 	calls into a separate patch so it can be audited better?
> 
>> @@ -961,9 +992,20 @@ void scsi_io_completion(struct scsi_cmnd
>>  		/* Third party bus reset or reset for error recovery
>>  		 * reasons.  Just retry the request and see what
>>  		 * happens.
>> +		 * If no data was transferred, just reissue this
>> +		 * command.  If data was transferred, regenerate
>> +		 * the command to transfer only untransferred data.
>>  		 */
> 
> The whole comment should look more like:
> 
> 		/*
> 		 * Third party bus reset or reset for error recovery reasons.
> 		 * If no data was transferred, just reissue this command.
> 		 * If data was transferred, regenerate the command to transfer
> 		 * only untransferred data.
> 		 */
> 
>> -		scsi_requeue_command(q, cmd);
>> -		return;
>> +		if (!good_bytes) {
>> +			if (!(scsi_command_expired(cmd))) {
>> +				scsi_retry_command(cmd, SCSI_MLQUEUE_DID_RESET);
>> +				return;
>> +			}
>> +		}
>> +		else {
>> +			scsi_requeue_command(q, cmd);
>> +			return;
>> +		}
> 
> 	With this code we now fallthrough if we don't have any good bytes
> 	and the command has expired.  Is this the expected behaviour? If
> 	yes we need a good comment describing it.
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2007-01-31 18:54 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-12-11 21:42 [RFC] Prevent infinite retries due to DID_RESET return status Michael Reed
2007-01-02 12:15 ` Christoph Hellwig
2007-01-31 18:54   ` [PATCH 0/2] " Michael Reed

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox