* Re: Scsi error handler strategy question
2004-01-04 23:24 Scsi error handler strategy question Willem Riede
@ 2004-01-05 2:42 ` Andre Hedrick
2004-01-05 2:55 ` Willem Riede
2004-01-05 9:44 ` Kurt Garloff
2004-01-18 15:20 ` Scsi error handler strategy question - now with [PATCH] Willem Riede
2 siblings, 1 reply; 5+ messages in thread
From: Andre Hedrick @ 2004-01-05 2:42 UTC (permalink / raw)
To: Willem Riede; +Cc: linux-scsi
Willem,
You are free to actually deploy an entire independent eh_strategy series
of functions. I never got around to doing it but one of the goals was to
remove the mid-layer timer and manage it in ide-scsi.
There is much more that can be done, I just do not have my notes handy.
Cheers,
Andre Hedrick
LAD Storage Consulting Group
On Sun, 4 Jan 2004, Willem Riede wrote:
> While testing the ide-scsi error handling, I observed that my ATAPI
> device gets offlined too easily.
>
> At some point, the host + device are getting reset. That's desired.
> The error handler is programmed to then expect a "CC/UA" (check
> condition / unit attention) when it does TUR (test unit ready)
> following reset. That's appropriate.
>
> But here is my first question: is there typically any need to wait
> some time between doing the host/bus/device reset and the first TUR?
> Is there a standard that governs how fast devices have to be done
> resetting to the point that they can respond to commands (if only to
> say they're not ready)?
>
> When the first TUR completes, the CC/UA expected flag takes care of
> the reported sense 06:29:00 (power on reset or device reset occurred).
> So far so good. Second TUR issued. That one typically gets 02:04:01
> (not ready - in the process of becoming ready) reported. The error
> handler is programmed to retry TUR once if it sees this.
>
> Second question: if the device firmware takes some time to re-initiate
> the device, this code can be returned multiple times. So am I allowed
> to submit a patch to increase that retry count? What would be a good
> number? Hard to say in general, as this depends on what devices you
> have and how fast commands get executed :-(
>
> Finally, at least my device, the OnStream DI-30, will eventually want
> to report 06:28:00 (not ready to ready transition, medium may have
> changed). The error handler considers that an error, and is guaranteed
> to take the device offline, just as it came back to life :-(
>
> Am I allowed to submit a patch that will also retry on that condition?
>
> Thanks, Willem Riede.
* Re: Scsi error handler strategy question
2004-01-05 2:42 ` Andre Hedrick
@ 2004-01-05 2:55 ` Willem Riede
0 siblings, 0 replies; 5+ messages in thread
From: Willem Riede @ 2004-01-05 2:55 UTC (permalink / raw)
To: Andre Hedrick; +Cc: linux-scsi
On 2004.01.04 21:42, Andre Hedrick wrote:
>
> Willem,
>
> You are free to actually deploy an entire independent eh_strategy series
> of functions. I never got around to doing it but one of the goals was to
> remove the mid-layer timer and manage it in ide-scsi.
>
> There is much more that can be done, I just do not have my notes handy.
Andre,
If you are able to locate those notes, I'd be _very_ interested.
Thanks, Willem Riede.
* Re: Scsi error handler strategy question
2004-01-04 23:24 Scsi error handler strategy question Willem Riede
2004-01-05 2:42 ` Andre Hedrick
@ 2004-01-05 9:44 ` Kurt Garloff
2004-01-18 15:20 ` Scsi error handler strategy question - now with [PATCH] Willem Riede
2 siblings, 0 replies; 5+ messages in thread
From: Kurt Garloff @ 2004-01-05 9:44 UTC (permalink / raw)
To: Willem Riede; +Cc: linux-scsi
Hi Willem,
On Sun, Jan 04, 2004 at 06:24:25PM -0500, Willem Riede wrote:
> But here is my first question: is there typically any need to wait
> some time between doing the host/bus/device reset and the first TUR?
> Is there a standard that governs how fast devices have to be done
> resetting to the point that they can respond to commands (if only to
> say they're not ready)?
The SCSI-2 standard does not say anything about this.
However, it seems to be assumed that a device has recovered enough
to respond to commands like INQUIRY and TUR after the normal
parallel SCSI selection timeout of 250ms.
In practice, many devices need more time to recover.
If you look at the SCSI scanning code, there are timeouts of a few
(6) seconds. They are only needed because of devices needing time to
recover after a reset.
> When the first TUR completes, the CC/UA expected flag takes care of
> the reported sense 06:29:00 (power on reset or device reset occurred).
> So far so good. Second TUR issued. That one typically gets 02:04:01
> (not ready - in the process of becoming ready) reported. The error
> handler is programmed to retry TUR once if it sees this.
Good.
> Second question: if the device firmware takes some time to re-initiate
> the device, this code can be returned multiple times. So am I allowed
> to submit a patch to increase that retry count? What would be a good
> number? Hard to say in general, as this depends on what devices you
> have and how fast commands get executed :-(
TUR tends to be answered immediately. So you should wait a second
before retrying. Allowing for 32 retries does not seem exaggerated then,
as we know that the device is expected to come back.
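The retry policy suggested above (wait a second between attempts, allow up
to 32 retries) can be sketched in plain userspace C. This is only an
illustration under stated assumptions, not the mid-layer code: the names
probe_with_retries, probe_fn, delay_fn, fake_dev, fake_tur and no_delay are
all hypothetical, and the delay is injected as a callback so the sketch can
run without actually sleeping.

```c
/* Outcome of one probe, loosely modelled on the SUCCESS / NEEDS_RETRY
 * results discussed in the thread. These names are illustrative only. */
enum probe_status { PROBE_READY, PROBE_NEEDS_RETRY, PROBE_FAILED };

typedef enum probe_status (*probe_fn)(void *ctx);
typedef void (*delay_fn)(unsigned int seconds);

/*
 * Probe until the device reports ready, waiting one second between
 * attempts. Returns 0 when the device is ready and 1 otherwise.
 * At most max_retries + 1 probes are sent.
 */
static int probe_with_retries(probe_fn probe, void *ctx,
                              delay_fn delay, int max_retries)
{
    int retries_left = max_retries;

    for (;;) {
        enum probe_status st = probe(ctx);

        if (st == PROBE_READY)
            return 0;
        /* Give up on a hard failure or when the retry budget is spent. */
        if (st != PROBE_NEEDS_RETRY || retries_left-- == 0)
            return 1;
        delay(1);
    }
}

/* Hypothetical stand-in for a device that answers "becoming ready"
 * a fixed number of times before TUR finally succeeds. */
struct fake_dev {
    int not_ready_left;
    int probes;
};

static enum probe_status fake_tur(void *ctx)
{
    struct fake_dev *d = ctx;

    d->probes++;
    if (d->not_ready_left > 0) {
        d->not_ready_left--;
        return PROBE_NEEDS_RETRY;
    }
    return PROBE_READY;
}

/* No-op delay so the sketch runs instantly; real code would sleep. */
static void no_delay(unsigned int seconds)
{
    (void)seconds;
}
```

With a budget of 32 retries, a device that stays busy costs at most 33
probes and roughly 32 seconds before it is given up on, which is the
trade-off being weighed here.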
> Finally, at least my device, the OnStream DI-30, will eventually want
> to report 06:28:00 (not ready to ready transition, medium may have
> changed). The error handler considers that an error, and is guaranteed
> to take the device offline, just as it came back to life :-(
>
> Am I allowed to submit a patch that will also retry on that condition?
I'd support it.
Regards,
--
Kurt Garloff <garloff@suse.de> Cologne, DE
SUSE LINUX AG, Nuernberg, DE SUSE Labs (Head)
* Re: Scsi error handler strategy question - now with [PATCH]
2004-01-04 23:24 Scsi error handler strategy question Willem Riede
2004-01-05 2:42 ` Andre Hedrick
2004-01-05 9:44 ` Kurt Garloff
@ 2004-01-18 15:20 ` Willem Riede
2 siblings, 0 replies; 5+ messages in thread
From: Willem Riede @ 2004-01-18 15:20 UTC (permalink / raw)
To: linux-scsi
On 2004.01.04 18:24, Willem Riede wrote:
> While testing the ide-scsi error handling, I observed that my ATAPI
> device gets offlined too easily.
>
[snip]
>
> Am I allowed to submit a patch that will also retry on that condition?
Below is the patch I developed to make the error recovery robust for
my OnStream DI-30 with osst and ide-scsi.
I realize that there may be objections to applying this to the main scsi
subsystem because the slowdown may be extreme for say a fiber connected
SAN with hundreds of disks, but it is absolutely necessary for my ATAPI
devices. So if you don't want to apply this, tell me, and I'll work it
into a ide-scsi specific error strategy handler.
Thanks, Willem Riede.
--- linux-2.6.1-1.34/drivers/scsi/scsi_error.c 2004-01-09 01:59:03.000000000 -0500
+++ linux-2.6.1-test/drivers/scsi/scsi_error.c 2004-01-18 08:59:30.000000000 -0500
@@ -282,6 +282,11 @@
(scmd->sense_buffer[13] == 0x01)) {
return NEEDS_RETRY;
}
+ /* same for reset occurred and not ready to ready transition */
+ if (((scmd->sense_buffer[12] & ~1) == 0x28) &&
+ (scmd->sense_buffer[13] == 0x00)) {
+ return NEEDS_RETRY;
+ }
return SUCCESS;
/* these three are not supported */
@@ -713,9 +718,11 @@
static int scsi_eh_tur(struct scsi_cmnd *scmd)
{
static unsigned char tur_command[6] = {TEST_UNIT_READY, 0, 0, 0, 0, 0};
- int retry_cnt = 1, rtn;
+ int retry_cnt = 32, rtn;
retry_tur:
+ SCSI_LOG_ERROR_RECOVERY(3, printk("%s: send TUR, attempt %d\n",
+ current->comm, 33-retry_cnt));
memcpy(scmd->cmnd, tur_command, sizeof(tur_command));
/*
@@ -747,8 +754,10 @@
if (rtn == SUCCESS)
return 0;
else if (rtn == NEEDS_RETRY)
- if (retry_cnt--)
+ if (retry_cnt--) {
+ scsi_sleep(HZ);
goto retry_tur;
+ }
return 1;
}
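The mask in the first hunk is compact, so here is a small standalone sketch
of what it matches. It assumes fixed-format sense data, where the additional
sense code (ASC) is byte 12 and the qualifier (ASCQ) is byte 13. The helpers
sense_wants_retry and retry_for are hypothetical names for illustration and
are not part of scsi_error.c.

```c
/* Byte offsets of ASC and ASCQ in fixed-format sense data. */
#define SENSE_ASC   12
#define SENSE_ASCQ  13

/*
 * Return 1 if the sense data describes a condition the patch treats as
 * "retry", 0 otherwise. Clearing bit 0 of the ASC folds 0x28 (not ready
 * to ready transition) and 0x29 (power on or reset occurred) into a
 * single comparison against 0x28.
 */
static int sense_wants_retry(const unsigned char *sense)
{
    unsigned char asc = sense[SENSE_ASC];
    unsigned char ascq = sense[SENSE_ASCQ];

    /* 04:01, in the process of becoming ready */
    if (asc == 0x04 && ascq == 0x01)
        return 1;
    /* 28:00 or 29:00 */
    if ((asc & ~1) == 0x28 && ascq == 0x00)
        return 1;
    return 0;
}

/* Convenience wrapper: build a zeroed sense buffer with the given
 * ASC/ASCQ pair and classify it. */
static int retry_for(unsigned char asc, unsigned char ascq)
{
    unsigned char sense[18] = {0};

    sense[SENSE_ASC] = asc;
    sense[SENSE_ASCQ] = ascq;
    return sense_wants_retry(sense);
}
```

Only the two neighbouring codes 0x28 and 0x29 pass the mask; 0x2a, or either
code with a non-zero qualifier, does not.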