[PATCH 2/4] 2.4 SCSI error handling fixes

Linux SCSI subsystem development
 help / color / mirror / Atom feed

* [PATCH 2/4] 2.4 SCSI error handling fixes
@ 2002-09-12 18:13 Russell King
  2002-09-12 18:18 ` Doug Ledford
  0 siblings, 1 reply; 4+ messages in thread
From: Russell King @ 2002-09-12 18:13 UTC (permalink / raw)
  To: linux-scsi

I've been chasing a few SCSI problems today, and here's 4 patches to
address the issues I found.  These patches are against 2.4.19.

My main aim here was to solve the bad behaviour with a medium error.
Some problems were found in my HBA driver, but some were found in
the error handling code.

Problems found were:

a) retrying a command with stale or invalid command information.
   (Patches 1 and 2)
b) reporting the wrong command on IO error.
   (Patch 3)
c) drives behaving badly in error condition, reporting success without
   data transfer.
   (Patch 4)

Results of investigation into these issues can be found at in my lkml
messages at:

  http://marc.theaimsgroup.com/?l=linux-kernel&m=103184937214222&w=2

Patch 4 is the one I'm least happy about; we shouldn't unconditionally
set the FUA bit in READ10 commands.  It seems that we need to set this
bit only after receiving an error, and only when we try to read the
remaining blocks which we believe are good.

Patch 2: remove loop from scsi_send_eh_cmd().  The parent function
is responsible for re-initialising the various command fields when
we need to retry the command.

--- orig/drivers/scsi/scsi_error.c	Thu Sep 12 18:01:02 2002
+++ linux/drivers/scsi/scsi_error.c	Thu Sep 12 18:07:40 2002
@@ -385,8 +385,10 @@
  */
 STATIC int scsi_eh_retry_command(Scsi_Cmnd * SCpnt)
 {
-	scsi_setup_cmd_retry(SCpnt);
-	scsi_send_eh_cmnd(SCpnt, SCpnt->timeout_per_command);
+	do {
+		scsi_setup_cmd_retry(SCpnt);
+		scsi_send_eh_cmnd(SCpnt, SCpnt->timeout_per_command);
+	} while (SCpnt->eh_state == NEEDS_RETRY);
 
 	/*
 	 * Hey, we are done.  Let's look to see what happened.
@@ -417,12 +419,6 @@
 
 	ASSERT_LOCK(&io_request_lock, 0);
 
-	memcpy((void *) SCpnt->cmnd, (void *) generic_sense,
-	       sizeof(generic_sense));
-
-	if (SCpnt->device->scsi_level <= SCSI_2)
-		SCpnt->cmnd[1] = SCpnt->lun << 5;
-
 	scsi_result = (!SCpnt->host->hostt->unchecked_isa_dma)
 	    ? &scsi_result0[0] : kmalloc(512, GFP_ATOMIC | GFP_DMA);
 
@@ -430,24 +426,40 @@
 		printk("cannot allocate scsi_result in scsi_request_sense.\n");
 		return FAILED;
 	}
-	/*
-	 * Zero the sense buffer.  Some host adapters automatically always request
-	 * sense, so it is not a good idea that SCpnt->request_buffer and
-	 * SCpnt->sense_buffer point to the same address (DB).
-	 * 0 is not a valid sense code. 
-	 */
-	memset((void *) SCpnt->sense_buffer, 0, sizeof(SCpnt->sense_buffer));
-	memset((void *) scsi_result, 0, 256);
 
 	saved_result = SCpnt->result;
-	SCpnt->request_buffer = scsi_result;
-	SCpnt->request_bufflen = 256;
-	SCpnt->use_sg = 0;
-	SCpnt->cmd_len = COMMAND_SIZE(SCpnt->cmnd[0]);
-	SCpnt->sc_data_direction = SCSI_DATA_READ;
-	SCpnt->underflow = 0;
 
-	scsi_send_eh_cmnd(SCpnt, SENSE_TIMEOUT);
+	do {
+		memcpy((void *) SCpnt->cmnd, (void *) generic_sense,
+		       sizeof(generic_sense));
+
+		if (SCpnt->device->scsi_level <= SCSI_2)
+			SCpnt->cmnd[1] = SCpnt->lun << 5;
+
+		/*
+		 * Zero the sense buffer.  Some host adapters automatically
+		 * always request sense, so it is not a good idea that
+		 * SCpnt->request_buffer and SCpnt->sense_buffer point to
+		 * the same address (DB).  0 is not a valid sense code. 
+		 */
+		memset((void *) SCpnt->sense_buffer, 0,
+		       sizeof(SCpnt->sense_buffer));
+		memset((void *) scsi_result, 0, 256);
+
+		SCpnt->request_buffer = scsi_result;
+		SCpnt->request_bufflen = 256;
+		SCpnt->use_sg = 0;
+		SCpnt->cmd_len = COMMAND_SIZE(SCpnt->cmnd[0]);
+		SCpnt->sc_data_direction = SCSI_DATA_READ;
+		SCpnt->underflow = 0;
+
+		scsi_send_eh_cmnd(SCpnt, SENSE_TIMEOUT);
+		/*
+		 * If the SCSI device responded with "logical unit
+		 * is in process of becoming ready", we need to
+		 * retry this command.
+		 */
+	} while (SCpnt->eh_state == NEEDS_RETRY);
 
 	/* Last chance to have valid sense data */
 	if (!scsi_sense_valid(SCpnt))
@@ -489,26 +501,34 @@
 	static unsigned char tur_command[6] =
 	{TEST_UNIT_READY, 0, 0, 0, 0, 0};
 
-	memcpy((void *) SCpnt->cmnd, (void *) tur_command,
-	       sizeof(tur_command));
+	do {
+		memcpy((void *) SCpnt->cmnd, (void *) tur_command,
+		       sizeof(tur_command));
 
-	if (SCpnt->device->scsi_level <= SCSI_2)
-		SCpnt->cmnd[1] = SCpnt->lun << 5;
+		if (SCpnt->device->scsi_level <= SCSI_2)
+			SCpnt->cmnd[1] = SCpnt->lun << 5;
 
-	/*
-	 * Zero the sense buffer.  The SCSI spec mandates that any
-	 * untransferred sense data should be interpreted as being zero.
-	 */
-	memset((void *) SCpnt->sense_buffer, 0, sizeof(SCpnt->sense_buffer));
+		/*
+		 * Zero the sense buffer.  The SCSI spec mandates that any
+		 * untransferred sense data should be interpreted as being zero.
+		 */
+		memset((void *) SCpnt->sense_buffer, 0,
+		       sizeof(SCpnt->sense_buffer));
 
-	SCpnt->request_buffer = NULL;
-	SCpnt->request_bufflen = 0;
-	SCpnt->use_sg = 0;
-	SCpnt->cmd_len = COMMAND_SIZE(SCpnt->cmnd[0]);
-	SCpnt->underflow = 0;
-	SCpnt->sc_data_direction = SCSI_DATA_NONE;
+		SCpnt->request_buffer = NULL;
+		SCpnt->request_bufflen = 0;
+		SCpnt->use_sg = 0;
+		SCpnt->cmd_len = COMMAND_SIZE(SCpnt->cmnd[0]);
+		SCpnt->underflow = 0;
+		SCpnt->sc_data_direction = SCSI_DATA_NONE;
 
-	scsi_send_eh_cmnd(SCpnt, SENSE_TIMEOUT);
+		scsi_send_eh_cmnd(SCpnt, SENSE_TIMEOUT);
+		/*
+		 * If the SCSI device responded with "logical unit
+		 * is in process of becoming ready", we need to
+		 * retry this command.
+		 */
+	} while (SCpnt->eh_state == NEEDS_RETRY);
 
 	/*
 	 * When we eventually call scsi_finish, we really wish to complete
@@ -581,7 +601,6 @@
 
 	host = SCpnt->host;
 
-      retry:
 	/*
 	 * We will use a queued command if possible, otherwise we will emulate the
 	 * queuing and calling of completion function ourselves.
@@ -664,14 +683,13 @@
 		SCSI_LOG_ERROR_RECOVERY(3,
 			printk("scsi_send_eh_cmnd: scsi_eh_completed_normally %x\n", ret));
 		switch (ret) {
-		case SUCCESS:
-			SCpnt->eh_state = SUCCESS;
-			break;
-		case NEEDS_RETRY:
-			goto retry;
-		case FAILED:
 		default:
-			SCpnt->eh_state = FAILED;
+			ret = FAILED;
+			/*FALLTHROUGH*/
+		case FAILED:
+		case NEEDS_RETRY:
+		case SUCCESS:
+			SCpnt->eh_state = ret;
 			break;
 		}
 	} else {

-- 
Russell King (rmk@arm.linux.org.uk)                The developer of ARM Linux
             http://www.arm.linux.org.uk/personal/aboutme.html


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH 2/4] 2.4 SCSI error handling fixes
  2002-09-12 18:13 [PATCH 2/4] 2.4 SCSI error handling fixes Russell King
@ 2002-09-12 18:18 ` Doug Ledford
  2002-09-12 18:20   ` Russell King
  0 siblings, 1 reply; 4+ messages in thread
From: Doug Ledford @ 2002-09-12 18:18 UTC (permalink / raw)
  To: Russell King; +Cc: linux-scsi

On Thu, Sep 12, 2002 at 07:13:54PM +0100, Russell King wrote:
> -	scsi_setup_cmd_retry(SCpnt);
> -	scsi_send_eh_cmnd(SCpnt, SCpnt->timeout_per_command);
> +	do {
> +		scsi_setup_cmd_retry(SCpnt);
> +		scsi_send_eh_cmnd(SCpnt, SCpnt->timeout_per_command);
> +	} while (SCpnt->eh_state == NEEDS_RETRY);

> +		 * If the SCSI device responded with "logical unit
> +		 * is in process of becoming ready", we need to
> +		 * retry this command.
> +		 */
> +	} while (SCpnt->eh_state == NEEDS_RETRY);

> +		/*
> +		 * If the SCSI device responded with "logical unit
> +		 * is in process of becoming ready", we need to
> +		 * retry this command.
> +		 */
> +	} while (SCpnt->eh_state == NEEDS_RETRY);

>  		default:
> -			SCpnt->eh_state = FAILED;
> +			ret = FAILED;
> +			/*FALLTHROUGH*/
> +		case FAILED:
> +		case NEEDS_RETRY:
> +		case SUCCESS:
> +			SCpnt->eh_state = ret;
>  			break;
>  		}
>  	} else {

I don't see any bounding here.  You *have* to bound this.  It is not that 
uncommon for a device to report "logical unit is in the process of 
becoming ready" forever on certain types of hardware failures.  Without 
bounding, we won't ever give up on it.

-- 
  Doug Ledford <dledford@redhat.com>     919-754-3700 x44233
         Red Hat, Inc. 
         1801 Varsity Dr.
         Raleigh, NC 27606
  

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH 2/4] 2.4 SCSI error handling fixes
  2002-09-12 18:18 ` Doug Ledford
@ 2002-09-12 18:20   ` Russell King
  2002-09-12 19:31     ` Mike Anderson
  0 siblings, 1 reply; 4+ messages in thread
From: Russell King @ 2002-09-12 18:20 UTC (permalink / raw)
  To: linux-scsi

On Thu, Sep 12, 2002 at 02:18:06PM -0400, Doug Ledford wrote:
> On Thu, Sep 12, 2002 at 07:13:54PM +0100, Russell King wrote:
> > -	scsi_setup_cmd_retry(SCpnt);
> > -	scsi_send_eh_cmnd(SCpnt, SCpnt->timeout_per_command);
> > +	do {
> > +		scsi_setup_cmd_retry(SCpnt);
> > +		scsi_send_eh_cmnd(SCpnt, SCpnt->timeout_per_command);
> > +	} while (SCpnt->eh_state == NEEDS_RETRY);
> 
> > +		 * If the SCSI device responded with "logical unit
> > +		 * is in process of becoming ready", we need to
> > +		 * retry this command.
> > +		 */
> > +	} while (SCpnt->eh_state == NEEDS_RETRY);
> 
> > +		/*
> > +		 * If the SCSI device responded with "logical unit
> > +		 * is in process of becoming ready", we need to
> > +		 * retry this command.
> > +		 */
> > +	} while (SCpnt->eh_state == NEEDS_RETRY);
> 
> >  		default:
> > -			SCpnt->eh_state = FAILED;
> > +			ret = FAILED;
> > +			/*FALLTHROUGH*/
> > +		case FAILED:
> > +		case NEEDS_RETRY:
> > +		case SUCCESS:
> > +			SCpnt->eh_state = ret;
> >  			break;
> >  		}
> >  	} else {
> 
> I don't see any bounding here.  You *have* to bound this.  It is not that 
> uncommon for a device to report "logical unit is in the process of 
> becoming ready" forever on certain types of hardware failures.  Without 
> bounding, we won't ever give up on it.

Shrug.  I've not changed the behaviour here.  I've just fixed the ONE
problem where we reissue the command back to the HBA driver with stale
state information.  I never claimed that it fixed ALL bugs.

I'm afraid that I'm going to have to leave the other bugs are for
someone else to address; I've already spent too long on this issue.

-- 
Russell King (rmk@arm.linux.org.uk)                The developer of ARM Linux
             http://www.arm.linux.org.uk/personal/aboutme.html


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH 2/4] 2.4 SCSI error handling fixes
  2002-09-12 18:20   ` Russell King
@ 2002-09-12 19:31     ` Mike Anderson
  0 siblings, 0 replies; 4+ messages in thread
From: Mike Anderson @ 2002-09-12 19:31 UTC (permalink / raw)
  To: Russell King; +Cc: linux-scsi

Russell King [rmk@arm.linux.org.uk] wrote:
> On Thu, Sep 12, 2002 at 02:18:06PM -0400, Doug Ledford wrote:
> > On Thu, Sep 12, 2002 at 07:13:54PM +0100, Russell King wrote:
> > > -	scsi_setup_cmd_retry(SCpnt);
> > > -	scsi_send_eh_cmnd(SCpnt, SCpnt->timeout_per_command);
> > > +	do {
> > > +		scsi_setup_cmd_retry(SCpnt);
> > > +		scsi_send_eh_cmnd(SCpnt, SCpnt->timeout_per_command);
> > > +	} while (SCpnt->eh_state == NEEDS_RETRY);
> > 
> > > +		 * If the SCSI device responded with "logical unit
> > > +		 * is in process of becoming ready", we need to
> > > +		 * retry this command.
> > > +		 */
> > > +	} while (SCpnt->eh_state == NEEDS_RETRY);
> > 
> > > +		/*
> > > +		 * If the SCSI device responded with "logical unit
> > > +		 * is in process of becoming ready", we need to
> > > +		 * retry this command.
> > > +		 */
> > > +	} while (SCpnt->eh_state == NEEDS_RETRY);
> > 
> > >  		default:
> > > -			SCpnt->eh_state = FAILED;
> > > +			ret = FAILED;
> > > +			/*FALLTHROUGH*/
> > > +		case FAILED:
> > > +		case NEEDS_RETRY:
> > > +		case SUCCESS:
> > > +			SCpnt->eh_state = ret;
> > >  			break;
> > >  		}
> > >  	} else {
> > 
> > I don't see any bounding here.  You *have* to bound this.  It is not that 
> > uncommon for a device to report "logical unit is in the process of 
> > becoming ready" forever on certain types of hardware failures.  Without 
> > bounding, we won't ever give up on it.
> 
> Shrug.  I've not changed the behaviour here.  I've just fixed the ONE
> problem where we reissue the command back to the HBA driver with stale
> state information.  I never claimed that it fixed ALL bugs.
> 
> I'm afraid that I'm going to have to leave the other bugs are for
> someone else to address; I've already spent too long on this issue.
> 

In 2.5 we are carrying a change in scsi_eh_completed_normally (I believe
this came from  Andries.Brouwer@cwi.nl as I cannot find another
reference to it) that limits the number of retries. 

Though this is modifying the original cmds retry count which I do not
believe it is a good thing at least it stops the loop. I know retries
are going to change, but until then I made a similar adjustment to my
scsi_error cleanup and will repost shortly.

-andmike
-- 
Michael Anderson
andmike@us.ibm.com


^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2002-09-12 19:31 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2002-09-12 18:13 [PATCH 2/4] 2.4 SCSI error handling fixes Russell King
2002-09-12 18:18 ` Doug Ledford
2002-09-12 18:20   ` Russell King
2002-09-12 19:31     ` Mike Anderson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox