public inbox for linux-scsi@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH] sd: medium access timeout counter fails to reset
@ 2014-04-10 15:08 David Jeffery
  2014-04-14 15:50 ` Ewan Milne
  0 siblings, 1 reply; 3+ messages in thread
From: David Jeffery @ 2014-04-10 15:08 UTC (permalink / raw)
  To: linux-scsi

There is an error with the medium access timeout feature of the sd driver. The
sdkp->medium_access_timed_out value is reset to zero in sd_done() in the wrong
place.  Currently it is reset to zero only when a command returns sense data.
This can result in cases where the medium access check falsely triggers from
timed out commands which are hours or days apart.

For example, an I/O command times out and is aborted.  It then retries and
succeeds.  But with no sense data generated and returned, the
medium_access_timed_out value is not reset.  If no sd command returns sense
data, then the next command to time out (however far in time from the first
failure) will trigger the medium access timeout and put the device offline.

The resetting of sdkp->medium_access_timed_out should occur before the check
for sense data.

Signed-off-by: David Jeffery <djeffery@redhat.com>

---

To reproduce using scsi_debug, use SCSI_DEBUG_OPT_TIMEOUT or
SCSI_DEBUG_OPT_MAC_TIMEOUT to force an I/O command to timeout.  Then, remove
the opt value so the I/O will succeed on retry.  Perform more I/O as desired.
Finally, repeat the process to make a new I/O command time out.  Without the
patch, the device will be marked offline even though many I/O commands have
succeeded between the 2 instances of timed out commands.


diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 470954a..a41e68e 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -1689,12 +1689,12 @@ static int sd_done(struct scsi_cmnd *SCpnt)
 						   sshdr.ascq));
 	}
 #endif
+	sdkp->medium_access_timed_out = 0;
+
 	if (driver_byte(result) != DRIVER_SENSE &&
 	    (!sense_valid || sense_deferred))
 		goto out;
 
-	sdkp->medium_access_timed_out = 0;
-
 	switch (sshdr.sense_key) {
 	case HARDWARE_ERROR:
 	case MEDIUM_ERROR:

^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [PATCH] sd: medium access timeout counter fails to reset
  2014-04-10 15:08 [PATCH] sd: medium access timeout counter fails to reset David Jeffery
@ 2014-04-14 15:50 ` Ewan Milne
  2014-04-17 19:23   ` Martin K. Petersen
  0 siblings, 1 reply; 3+ messages in thread
From: Ewan Milne @ 2014-04-14 15:50 UTC (permalink / raw)
  To: linux-scsi; +Cc: djeffery

On Thu, 2014-04-10 at 11:08 -0400, David Jeffery wrote:
> There is an error with the medium access timeout feature of the sd driver. The
> sdkp->medium_access_timed_out value is reset to zero in sd_done() in the wrong
> place.  Currently it is reset to zero only when a command returns sense data.
> This can result in cases where the medium access check falsely triggers from
> timed out commands which are hours or days apart.
> 
> For example, an I/O command times out and is aborted.  It then retries and
> succeeds.  But with no sense data generated and returned, the
> medium_access_timed_out value is not reset.  If no sd command returns sense
> data, then the next command to time out (however far in time from the first
> failure) will trigger the medium access timeout and put the device offline.
> 
> The resetting of sdkp->medium_access_timed_out should occur before the check
> for sense data.
> 
> Signed-off-by: David Jeffery <djeffery@redhat.com>
> 
> ---
> 
> To reproduce using scsi_debug, use SCSI_DEBUG_OPT_TIMEOUT or
> SCSI_DEBUG_OPT_MAC_TIMEOUT to force an I/O command to timeout.  Then, remove
> the opt value so the I/O will succeed on retry.  Perform more I/O as desired.
> Finally, repeat the process to make a new I/O command time out.  Without the
> patch, the device will be marked offline even though many I/O commands have
> succeeded between the 2 instances of timed out commands.
> 
> 
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index 470954a..a41e68e 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -1689,12 +1689,12 @@ static int sd_done(struct scsi_cmnd *SCpnt)
>  						   sshdr.ascq));
>  	}
>  #endif
> +	sdkp->medium_access_timed_out = 0;
> +
>  	if (driver_byte(result) != DRIVER_SENSE &&
>  	    (!sense_valid || sense_deferred))
>  		goto out;
>  
> -	sdkp->medium_access_timed_out = 0;
> -
>  	switch (sshdr.sense_key) {
>  	case HARDWARE_ERROR:
>  	case MEDIUM_ERROR:
> --
> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Hey James-

Is there some reason why this patch was never accepted?  David posted it
a couple of times last year and Martin ack'ed it, but I don't see it in
your tree, and I don't see any other comments on it.

It seems like something that ought to be fixed.

Reviewed-by:  Ewan D. Milne <emilne@redhat.com>





^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH] sd: medium access timeout counter fails to reset
  2014-04-14 15:50 ` Ewan Milne
@ 2014-04-17 19:23   ` Martin K. Petersen
  0 siblings, 0 replies; 3+ messages in thread
From: Martin K. Petersen @ 2014-04-17 19:23 UTC (permalink / raw)
  To: emilne; +Cc: linux-scsi, djeffery

>>>>> "Ewan" == Ewan Milne <emilne@redhat.com> writes:

Ewan> Is there some reason why this patch was never accepted?  David
Ewan> posted it a couple of times last year and Martin ack'ed it, but I
Ewan> don't see it in your tree, and I don't see any other comments on
Ewan> it.

I still don't have any objections.

FWIW, the reason my patch didn't handle the recovery scenario is that
the devices I originally wrote it for never came back (irreversibly
stuck head assembly or dead FTL).

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2014-04-17 19:23 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-04-10 15:08 [PATCH] sd: medium access timeout counter fails to reset David Jeffery
2014-04-14 15:50 ` Ewan Milne
2014-04-17 19:23   ` Martin K. Petersen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox