Scsi error handler strategy question

* Scsi error handler strategy question
@ 2004-01-04 23:24 Willem Riede
  2004-01-05  2:42 ` Andre Hedrick
                   ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Willem Riede @ 2004-01-04 23:24 UTC (permalink / raw)
  To: linux-scsi

While testing the ide-scsi error handling, I observed that my ATAPI
device gets offlined too easily.

At some point, the host + device are getting reset. That's desired.
The error handler is programmed to then expect a "CC/UA" (check
condition / unit attention) when it does TUR (test unit ready)
following reset. That's appropriate.

But here is my first question: is there typically any need to wait 
some time between doing the host/bus/device reset and the first TUR?
Is there a standard that governs how fast devices have to be done
resetting to the point that they can respond to commands (if only to
say they're not ready)?

When the first TUR completes, the CC/UA expected flag takes care of
the reported sense 06:29:00 (power on reset or device reset occurred).
So far so good. Second TUR issued. That one typically gets 02:04:01
(not ready - in the process of becoming ready) reported. The error
handler is programmed to retry TUR once if it sees this.
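
For reference, the triplets quoted above are sense key, additional sense code (ASC) and additional sense code qualifier (ASCQ). The following is a minimal sketch of pulling them out of a raw sense buffer, assuming fixed-format sense data (response code 0x70 or 0x71). `struct sense_triplet` and `decode_fixed_sense` are hypothetical names used only for illustration; they are not kernel interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type for illustration, not a kernel structure. */
struct sense_triplet {
    uint8_t key;   /* sense key: low nibble of byte 2 */
    uint8_t asc;   /* additional sense code: byte 12 */
    uint8_t ascq;  /* additional sense code qualifier: byte 13 */
};

/*
 * Decode fixed-format sense data into key/ASC/ASCQ.
 * Returns 0 on success, -1 if the buffer is too short to hold
 * bytes 12 and 13 or does not carry a fixed-format response code.
 */
int decode_fixed_sense(const uint8_t *buf, size_t len,
                       struct sense_triplet *out)
{
    uint8_t response_code;

    assert(buf != NULL && out != NULL);

    if (len < 14)
        return -1;

    response_code = buf[0] & 0x7f;   /* bit 7 is the VALID flag */
    if (response_code != 0x70 && response_code != 0x71)
        return -1;

    out->key = buf[2] & 0x0f;
    out->asc = buf[12];
    out->ascq = buf[13];
    return 0;
}
```

With this layout, 06:29:00 is sense key 0x06 (unit attention) with ASC 0x29 and ASCQ 0x00, and 02:04:01 is sense key 0x02 (not ready) with ASC 0x04 and ASCQ 0x01.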

Second question: if the device firmware takes some time to re-initialize
the device, this code can be returned multiple times. So am I allowed
to submit a patch to increase that retry count? What would be a good
number? Hard to say in general, as this depends on what devices you
have and how fast commands get executed :-(

Finally, at least my device, the OnStream DI-30, will eventually want 
to report 06:28:00 (not ready to ready transition, medium may have
changed). The error handler considers that an error, and is guaranteed
to take the device offline, just as it came back to life :-(

Am I allowed to submit a patch that will also retry on that condition?

Thanks, Willem Riede.

* Re: Scsi error handler strategy question
  2004-01-04 23:24 Scsi error handler strategy question Willem Riede
@ 2004-01-05  2:42 ` Andre Hedrick
  2004-01-05  2:55   ` Willem Riede
  2004-01-05  9:44 ` Kurt Garloff
  2004-01-18 15:20 ` Scsi error handler strategy question - now with [PATCH] Willem Riede
  2 siblings, 1 reply; 5+ messages in thread
From: Andre Hedrick @ 2004-01-05  2:42 UTC (permalink / raw)
  To: Willem Riede; +Cc: linux-scsi


Willem,

You are free to actually deploy an entire independent eh_strategy series
of functions.  I never got around to doing it but one of the goals was to
remove the mid-layer timer and manage it in ide-scsi.

There is much more that can be done, I just do not have my notes handy.

Cheers,

Andre Hedrick
LAD Storage Consulting Group

On Sun, 4 Jan 2004, Willem Riede wrote:

> While testing the ide-scsi error handling, I observed that my ATAPI
> device gets offlined too easily.
> 
> At some point, the host + device are getting reset. That's desired.
> The error handler is programmed to then expect a "CC/UA" (check
> condition / unit attention) when it does TUR (test unit ready)
> following reset. That's appropriate.
> 
> But here is my first question: is there typically any need to wait 
> some time between doing the host/bus/device reset and the first TUR?
> Is there a standard that governs how fast devices have to be done
> resetting to the point that they can respond to commands (if only to
> say they're not ready)?
> 
> When the first TUR completes, the CC/UA expected flag takes care of
> the reported sense 06:29:00 (power on reset or device reset occurred).
> So far so good. Second TUR issued. That one typically gets 02:04:01
> (not ready - in the process of becoming ready) reported. The error
> handler is programmed to retry TUR once if it sees this.
> 
> Second question: if the device firmware takes some time to re-initialize
> the device, this code can be returned multiple times. So am I allowed
> to submit a patch to increase that retry count? What would be a good
> number? Hard to say in general, as this depends on what devices you
> have and how fast commands get executed :-(
> 
> Finally, at least my device, the OnStream DI-30, will eventually want 
> to report 06:28:00 (not ready to ready transition, medium may have
> changed). The error handler considers that an error, and is guaranteed
> to take the device offline, just as it came back to life :-(
> 
> Am I allowed to submit a patch that will also retry on that condition?
> 
> Thanks, Willem Riede.

* Re: Scsi error handler strategy question
  2004-01-05  2:42 ` Andre Hedrick
@ 2004-01-05  2:55   ` Willem Riede
  0 siblings, 0 replies; 5+ messages in thread
From: Willem Riede @ 2004-01-05  2:55 UTC (permalink / raw)
  To: Andre Hedrick; +Cc: linux-scsi

On 2004.01.04 21:42, Andre Hedrick wrote:
> 
> Willem,
> 
> You are free to actually deploy an entire independent eh_strategy series
> of functions.  I never got around to doing it but one of the goals was to
> remove the mid-layer timer and manage it in ide-scsi.
> 
> There is much more that can be done, I just do not have my notes handy.

Andre,

If you are able to locate those notes, I'd be _very_ interested.

Thanks, Willem Riede.

* Re: Scsi error handler strategy question
  2004-01-04 23:24 Scsi error handler strategy question Willem Riede
  2004-01-05  2:42 ` Andre Hedrick
@ 2004-01-05  9:44 ` Kurt Garloff
  2004-01-18 15:20 ` Scsi error handler strategy question - now with [PATCH] Willem Riede
  2 siblings, 0 replies; 5+ messages in thread
From: Kurt Garloff @ 2004-01-05  9:44 UTC (permalink / raw)
  To: Willem Riede; +Cc: linux-scsi

Hi Willem,

On Sun, Jan 04, 2004 at 06:24:25PM -0500, Willem Riede wrote:
> But here is my first question: is there typically any need to wait 
> some time between doing the host/bus/device reset and the first TUR?
> Is there a standard that governs how fast devices have to be done
> resetting to the point that they can respond to commands (if only to
> say they're not ready)?

The SCSI-2 standard does not say anything about this.
However, it seems to be assumed that a device has recovered enough
to respond to commands like INQUIRY and TUR after the normal
parallel SCSI selection timeout of 250ms.

In practice, many devices need more time to recover.
If you look at the SCSI scanning code, there are timeouts of a few
(6) seconds. They are only needed because of devices needing time to 
recover after a reset.

> When the first TUR completes, the CC/UA expected flag takes care of
> the reported sense 06:29:00 (power on reset or device reset occurred).
> So far so good. Second TUR issued. That one typically gets 02:04:01
> (not ready - in the process of becoming ready) reported. The error
> handler is programmed to retry TUR once if it sees this.

Good.

> Second question: if the device firmware takes some time to re-initialize
> the device, this code can be returned multiple times. So am I allowed
> to submit a patch to increase that retry count? What would be a good
> number? Hard to say in general, as this depends on what devices you
> have and how fast commands get executed :-(

TUR tends to be answered immediately. So you should wait a second 
before retrying. Allowing for 32 retries does not seem exaggerated then,
as we know that the device is expected to come back.
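
As a rough user-space sketch of that polling policy (up to 32 retries, one second apart): the code below is not the kernel implementation. `poll_until_ready`, the callback types, the simulated device and the counting delay are all hypothetical names invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical outcome of one TEST UNIT READY attempt. */
enum tur_result { TUR_READY, TUR_RETRY, TUR_FAILED };

typedef enum tur_result (*tur_fn)(void *ctx);
typedef void (*delay_fn)(unsigned int seconds);

/*
 * Send TUR until the device is ready, pausing one second between attempts.
 * Returns the number of attempts used (>= 1) on success, or 0 if the
 * device reported a hard failure or the retry budget ran out.
 */
unsigned int poll_until_ready(tur_fn send_tur, void *ctx,
                              delay_fn delay, unsigned int max_retries)
{
    unsigned int attempt;

    assert(send_tur != NULL && delay != NULL);

    for (attempt = 0; ; attempt++) {
        enum tur_result r = send_tur(ctx);

        if (r == TUR_READY)
            return attempt + 1;
        if (r == TUR_FAILED || attempt == max_retries)
            return 0;
        delay(1);   /* TUR is answered immediately, so pace the retries */
    }
}

/* Simulated device: answers "becoming ready" for the first busy_polls TURs. */
struct fake_device { unsigned int busy_polls; };

enum tur_result fake_tur(void *ctx)
{
    struct fake_device *dev = ctx;

    if (dev->busy_polls == 0)
        return TUR_READY;
    dev->busy_polls--;
    return TUR_RETRY;
}

/* Stand-in for a real sleep: counts the requested seconds instead. */
unsigned int seconds_waited;

void count_delay(unsigned int seconds)
{
    seconds_waited += seconds;
}
```

For a simulated device that needs three polls before it is ready, `poll_until_ready(fake_tur, &dev, count_delay, 32)` succeeds on the fourth attempt after three one-second waits. A real caller would pass a delay function that actually sleeps.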

> Finally, at least my device, the OnStream DI-30, will eventually want 
> to report 06:28:00 (not ready to ready transition, medium may have
> changed). The error handler considers that an error, and is guaranteed
> to take the device offline, just as it came back to life :-(
> 
> Am I allowed to submit a patch that will also retry on that condition?

I'd support it.

Regards,
-- 
Kurt Garloff  <garloff@suse.de>                            Cologne, DE 
SUSE LINUX AG, Nuernberg, DE                          SUSE Labs (Head)

* Re: Scsi error handler strategy question - now with [PATCH]
  2004-01-04 23:24 Scsi error handler strategy question Willem Riede
  2004-01-05  2:42 ` Andre Hedrick
  2004-01-05  9:44 ` Kurt Garloff
@ 2004-01-18 15:20 ` Willem Riede
  2 siblings, 0 replies; 5+ messages in thread
From: Willem Riede @ 2004-01-18 15:20 UTC (permalink / raw)
  To: linux-scsi

On 2004.01.04 18:24, Willem Riede wrote:
> While testing the ide-scsi error handling, I observed that my ATAPI
> device gets offlined too easily.
> 
[snip]
> 
> Am I allowed to submit a patch that will also retry on that condition?

Below is the patch I developed to make the error recovery robust for
my OnStream DI-30 with osst and ide-scsi.

I realize that there may be objections to applying this to the main scsi
subsystem because the slowdown may be extreme for say a fiber connected
SAN with hundreds of disks, but it is absolutely necessary for my ATAPI
devices. So if you don't want to apply this, tell me, and I'll work it 
into an ide-scsi-specific error strategy handler.

Thanks, Willem Riede.

--- linux-2.6.1-1.34/drivers/scsi/scsi_error.c	2004-01-09 01:59:03.000000000 -0500
+++ linux-2.6.1-test/drivers/scsi/scsi_error.c	2004-01-18 08:59:30.000000000 -0500
@@ -282,6 +282,11 @@
 			(scmd->sense_buffer[13] == 0x01)) {
 			return NEEDS_RETRY;
 		}
+		/* same for reset occurred and not ready to ready transition */
+		if (((scmd->sense_buffer[12] & ~1) == 0x28) &&
+			(scmd->sense_buffer[13] == 0x00)) {
+			return NEEDS_RETRY;
+		}
 		return SUCCESS;
 
 		/* these three are not supported */
@@ -713,9 +718,11 @@
 static int scsi_eh_tur(struct scsi_cmnd *scmd)
 {
 	static unsigned char tur_command[6] = {TEST_UNIT_READY, 0, 0, 0, 0, 0};
-	int retry_cnt = 1, rtn;
+	int retry_cnt = 32, rtn;
 
 retry_tur:
+	SCSI_LOG_ERROR_RECOVERY(3, printk("%s: send TUR, attempt %d\n",
+					  current->comm, 33-retry_cnt));
 	memcpy(scmd->cmnd, tur_command, sizeof(tur_command));
 
 	/*
@@ -747,8 +754,10 @@
 	if (rtn == SUCCESS)
 		return 0;
 	else if (rtn == NEEDS_RETRY)
-		if (retry_cnt--)
+		if (retry_cnt--) {
+			scsi_sleep(HZ);
 			goto retry_tur;
+		}
 	return 1;
 }
 
