Re: [usb-storage] Re: MPIO HS200 Gigabox weird behaviour again

public inbox for linux-scsi@vger.kernel.org

* Re: [usb-storage] Re: MPIO HS200 Gigabox weird behaviour again
       [not found] <20050210134432.GA12229@kassiopeia.juls.savba.sk>
@ 2005-02-11 16:00 ` Alan Stern
  2005-02-11 16:18   ` James Bottomley
  0 siblings, 1 reply; 12+ messages in thread
From: Alan Stern @ 2005-02-11 16:00 UTC (permalink / raw)
  To: Radovan Garabik; +Cc: USB Storage list, SCSI development list

On Thu, 10 Feb 2005, Radovan Garabik wrote:

> I put the dmesg output to:
> http://kassiopeia.juls.savba.sk/~garabik/junk/gigabox.txt
> 
> the exact sequence to reproduce the behaviour is:
> 1) modprobe ehci-hcd
> 2) plug in gigabox, wait a second till the message about a new usb
>    device appears
> 3) modprobe usb-storage delay_use=10

I see.  The drive reports a failure with Sense Key = 0x04.  According to 
the SCSI specification, this indicates a _nonrecoverable_ hardware 
failure.  So you can't blame the SCSI drivers for not retrying the 
request, even though a later attempt to read the partition did end up 
succeeding.
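
As a rough illustration of where that Sense Key value comes from, here is
a minimal sketch in C of pulling the sense key out of a raw sense buffer.
The helper name sense_key_of() and the SK_* macros are made up for this
example and are not kernel APIs; the byte offsets assume the fixed-format
(response codes 0x70/0x71) and descriptor-format (0x72/0x73) sense layouts
from SPC-3.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sense key values as defined by SPC; the macro names are illustrative. */
#define SK_MEDIUM_ERROR   0x03
#define SK_HARDWARE_ERROR 0x04

/*
 * Hypothetical helper: return the sense key held in a raw sense buffer,
 * or -1 if the buffer does not look like valid sense data.
 */
static int sense_key_of(const uint8_t *sense, size_t len)
{
	assert(sense != NULL);

	if (len < 3)
		return -1;

	switch (sense[0] & 0x7f) {	/* response code, VALID bit masked off */
	case 0x70:			/* fixed format, current error */
	case 0x71:			/* fixed format, deferred error */
		return sense[2] & 0x0f;
	case 0x72:			/* descriptor format, current error */
	case 0x73:			/* descriptor format, deferred error */
		return sense[1] & 0x0f;
	default:
		return -1;
	}
}
```

A fixed-format buffer whose byte 2 has 0x04 in its low nibble is what the
midlayer sees as HARDWARE_ERROR, which is the case discussed here.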

Maybe the SCSI midlayer could be changed to retry anyway, just as an
optimistic workaround for buggy devices like yours.

SCSI developers: Is there any hope of this?

> > What patch did you apply to the driver, to treat the Gigabox the same as a
> > Genesys drive?  And what's in your /proc/bus/usb/devices entry for the Gigabox?
> 
> see
> https://lists.one-eyed-alien.net/pipermail/usb-storage/2004-November/001154.html
> you responded to that mail :-)

Ah yes.  That contains your original bug report.  I didn't remember that
was from you, because the message was resent by Matt Dharm.

Alan Stern


* Re: [usb-storage] Re: MPIO HS200 Gigabox weird behaviour again
  2005-02-11 16:00 ` [usb-storage] Re: MPIO HS200 Gigabox weird behaviour again Alan Stern
@ 2005-02-11 16:18   ` James Bottomley
  2005-02-11 18:31     ` Alan Stern
  0 siblings, 1 reply; 12+ messages in thread
From: James Bottomley @ 2005-02-11 16:18 UTC (permalink / raw)
  To: Alan Stern; +Cc: Radovan Garabik, USB Storage list, SCSI Mailing List

On Fri, 2005-02-11 at 11:00 -0500, Alan Stern wrote:
> SCSI developers: Is there any hope of this?

Well, IBM also has some old buggy piece of hardware that apparently
gives fatal errors but actually wants them retried:

http://marc.theaimsgroup.com/?t=110088696100005

After a bit of argument, they agreed to do it as a blacklist flag, so
you should be able to add the same flag to your gigabox.

I planned to put the patch in just as soon as they managed to send the
patch unmangled by a mailer, but they went very quiet.

James


* Re: [usb-storage] Re: MPIO HS200 Gigabox weird behaviour again
  2005-02-11 16:18   ` James Bottomley
@ 2005-02-11 18:31     ` Alan Stern
  2005-02-11 19:07       ` Patrick Mansfield
  2005-02-16 14:37       ` Radovan Garabik
  0 siblings, 2 replies; 12+ messages in thread
From: Alan Stern @ 2005-02-11 18:31 UTC (permalink / raw)
  To: Radovan Garabik; +Cc: USB Storage list, SCSI Mailing List

On Fri, 11 Feb 2005, James Bottomley wrote:

> On Fri, 2005-02-11 at 11:00 -0500, Alan Stern wrote:
> > SCSI developers: Is there any hope of this?
> 
> Well, IBM also has some old buggy piece of hardware that apparently
> gives fatal errors but actually wants them retried:
> 
> http://marc.theaimsgroup.com/?t=110088696100005
> 
> After a bit of argument, they agreed to do it as a blacklist flag, so
> you should be able to add the same flag to your gigabox.
> 
> I planned to put the patch in just as soon as they managed to send the
> patch unmangled by a mailer, but they went very quiet.

Thanks, James.

Radovan: Below is an updated version of the patch James mentioned, with an
entry included for your Gigabox drive.  It's meant to apply to
2.6.11-rc3-bk5, but it will probably go with older kernels too.  Try it
out and let me know how it works.  If it solves your problem, I'll submit
it for inclusion in the official kernel.

Alan Stern


===== include/scsi/scsi_devinfo.h 1.7 vs edited =====
--- 1.7/include/scsi/scsi_devinfo.h	2004-10-05 11:25:13 -04:00
+++ edited/include/scsi/scsi_devinfo.h	2005-02-11 13:21:51 -05:00
@@ -27,4 +27,5 @@
 #define BLIST_NOT_LOCKABLE	0x80000	/* don't use PREVENT-ALLOW commands */
 #define BLIST_NO_ULD_ATTACH	0x100000 /* device is actually for RAID config */
 #define BLIST_SELECT_NO_ATN	0x200000 /* select without ATN */
+#define BLIST_RETRY_HWERROR	0x400000 /* retry HARDWARE_ERROR */
 #endif
===== drivers/scsi/scsi_devinfo.c 1.14 vs edited =====
--- 1.14/drivers/scsi/scsi_devinfo.c	2004-12-10 11:54:56 -05:00
+++ edited/drivers/scsi/scsi_devinfo.c	2005-02-11 13:25:41 -05:00
@@ -159,6 +159,7 @@
 	{"HP", "C3323-300", "4269", BLIST_NOTQ},
 	{"IBM", "AuSaV1S2", NULL, BLIST_FORCELUN},
 	{"IBM", "ProFibre 4000R", "*", BLIST_SPARSELUN | BLIST_LARGELUN},
+	{"IBM", "2105", NULL, BLIST_RETRY_HWERROR},
 	{"iomega", "jaz 1GB", "J.86", BLIST_NOTQ | BLIST_NOLUN},
 	{"IOMEGA", "Io20S         *F", NULL, BLIST_KEY},
 	{"INSITE", "Floptical   F*8I", NULL, BLIST_KEY},
@@ -192,6 +193,7 @@
 	{"SMSC", "USB 2 HS-CF", NULL, BLIST_SPARSELUN | BLIST_INQUIRY_36},
 	{"SONY", "CD-ROM CDU-8001", NULL, BLIST_BORKEN},
 	{"SONY", "TSL", NULL, BLIST_FORCELUN},		/* DDS3 & DDS4 autoloaders */
+	{"ST650211", "CF", NULL, BLIST_RETRY_HWERROR},
 	{"SUN", "T300", "*", BLIST_SPARSELUN},
 	{"SUN", "T4", "*", BLIST_SPARSELUN},
 	{"TEXEL", "CD-ROM", "1.06", BLIST_BORKEN},
===== drivers/scsi/scsi_error.c 1.47 vs edited =====
--- 1.47/drivers/scsi/scsi_error.c	2005-02-02 21:56:01 -05:00
+++ edited/drivers/scsi/scsi_error.c	2005-02-11 13:23:45 -05:00
@@ -31,6 +31,7 @@
 #include <scsi/scsi_host.h>
 #include <scsi/scsi_ioctl.h>
 #include <scsi/scsi_request.h>
+#include <scsi/scsi_devinfo.h>
 
 #include "scsi_priv.h"
 #include "scsi_logging.h"
@@ -350,10 +351,18 @@
 	case MEDIUM_ERROR:
 		return NEEDS_RETRY;
 
+	case HARDWARE_ERROR:
+		if (scsi_get_device_flags(scmd->device,
+					scmd->device->vendor,
+					scmd->device->model)
+				& BLIST_RETRY_HWERROR)
+			return NEEDS_RETRY;
+		else
+			return SUCCESS;
+
 	case ILLEGAL_REQUEST:
 	case BLANK_CHECK:
 	case DATA_PROTECT:
-	case HARDWARE_ERROR:
 	default:
 		return SUCCESS;
 	}



* Re: [usb-storage] Re: MPIO HS200 Gigabox weird behaviour again
  2005-02-11 18:31     ` Alan Stern
@ 2005-02-11 19:07       ` Patrick Mansfield
  2005-02-11 19:41         ` Alan Stern
  2005-02-16 14:37       ` Radovan Garabik
  1 sibling, 1 reply; 12+ messages in thread
From: Patrick Mansfield @ 2005-02-11 19:07 UTC (permalink / raw)
  To: Alan Stern; +Cc: Radovan Garabik, USB Storage list, SCSI Mailing List

On Fri, Feb 11, 2005 at 01:31:30PM -0500, Alan Stern wrote:
> On Fri, 11 Feb 2005, James Bottomley wrote:
> 
> > On Fri, 2005-02-11 at 11:00 -0500, Alan Stern wrote:
> > > SCSI developers: Is there any hope of this?
> > 
> > Well, IBM also has some old buggy piece of hardware that apparently
> > gives fatal errors but actually wants them retried:
> > 
> > http://marc.theaimsgroup.com/?t=110088696100005
> > 
> > After a bit of argument, they agreed to do it as a blacklist flag, so
> > you should be able to add the same flag to your gigabox.
> > 
> > I planned to put the patch in just as soon as they managed to send the
> > patch unmangled by a mailer, but they went very quiet.
> 
> Thanks, James.
> 
> Radovan: Below is an updated version of the patch James mentioned, with an
> entry included for your Gigabox drive.  It's meant to apply to
> 2.6.11-rc3-bk5, but it will probably go with older kernels too.  Try it
> out and let me know how it works.  If it solves your problem, I'll submit
> it for inclusion in the official kernel.

Do you have the asc/ascq for the USB device?

Longer term, it would be nice to have a black list modifiable via sysfs
(similar to the scsi devinfo one) with a vendor + model + sense key + asc
+ ascq, for both this USB device and for the IBM one. AFAIUI, the IBM
ESS/2105 really wants only certain hardware errors retried, not all of
them. I don't know if the IBM ESS/storage developers were OK with Martin's
patch (I never see responses from them to anything posted on linux-scsi).

The sense black list could be used for quirks plus vendor specific
asc/ascq values. It could also be used by dm multipath, if error codes
(like Mike C's old patch) and not sense data were passed up.
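
To make the proposed table concrete, here is a minimal user-space sketch
in C of a lookup keyed on vendor + model + sense key + asc + ascq. All of
the names (struct sense_quirk, sense_quirk_lookup(), the ACTION_* values)
are hypothetical and exist only for this example; no such interface is in
the kernel. The single entry uses the vendor/model strings from the patch
earlier in the thread and the ASC/ASCQ values reported for the Gigabox in
the follow-up message. Prefix matching on the INQUIRY strings is an
assumption made here to cope with space padding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum sense_action { ACTION_DEFAULT, ACTION_RETRY };

/* Hypothetical table entry: one quirk per device and sense triple. */
struct sense_quirk {
	const char *vendor;
	const char *model;
	uint8_t sense_key;
	uint8_t asc;
	uint8_t ascq;
	enum sense_action action;
};

static const struct sense_quirk sense_quirks[] = {
	/* Gigabox drive: sense key 0x04, ASC 0x4b, ASCQ 0x00 (from this thread) */
	{ "ST650211", "CF", 0x04, 0x4b, 0x00, ACTION_RETRY },
};

/* Return the action for this device and sense triple, or ACTION_DEFAULT. */
static enum sense_action sense_quirk_lookup(const char *vendor,
					    const char *model,
					    uint8_t key, uint8_t asc,
					    uint8_t ascq)
{
	size_t i;

	assert(vendor != NULL && model != NULL);

	for (i = 0; i < sizeof(sense_quirks) / sizeof(sense_quirks[0]); i++) {
		const struct sense_quirk *q = &sense_quirks[i];

		/* Assumption: table strings are prefixes of the INQUIRY data. */
		if (strncmp(vendor, q->vendor, strlen(q->vendor)) == 0 &&
		    strncmp(model, q->model, strlen(q->model)) == 0 &&
		    key == q->sense_key && asc == q->asc && ascq == q->ascq)
			return q->action;
	}
	return ACTION_DEFAULT;
}
```

With a table like this, a device could have some hardware errors retried
and others not, which the single per-device flag cannot express.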

James - would you be OK with such an approach?

It should be populated after loading scsi_mod, but before loading HBA
drivers (the same is true for devinfo); I think this means manually
loading scsi_mod, populating the tables, then allowing hotplug loading of
HBA drivers.

We still need a nice way to display and modify a list or array in sysfs (for
either the devinfo or this asc/ascq table).

-- Patrick Mansfield


* Re: [usb-storage] Re: MPIO HS200 Gigabox weird behaviour again
  2005-02-11 19:07       ` Patrick Mansfield
@ 2005-02-11 19:41         ` Alan Stern
  0 siblings, 0 replies; 12+ messages in thread
From: Alan Stern @ 2005-02-11 19:41 UTC (permalink / raw)
  To: Patrick Mansfield; +Cc: Radovan Garabik, USB Storage list, SCSI Mailing List

On Fri, 11 Feb 2005, Patrick Mansfield wrote:

> Do you have the asc/ascq for the USB device?

ASC = 0x4b, ASCQ = 0.

If you want to see the details of the entire device initialization, they 
are available at

http://kassiopeia.juls.savba.sk/~garabik/junk/gigabox.txt

Alan Stern



* Re: [usb-storage] Re: MPIO HS200 Gigabox weird behaviour again
  2005-02-11 18:31     ` Alan Stern
  2005-02-11 19:07       ` Patrick Mansfield
@ 2005-02-16 14:37       ` Radovan Garabik
  2005-02-16 16:53         ` [PATCH as468] Retry supposedly "unrecoverable" hardware errors Alan Stern
  1 sibling, 1 reply; 12+ messages in thread
From: Radovan Garabik @ 2005-02-16 14:37 UTC (permalink / raw)
  To: Alan Stern; +Cc: usb-storage, linux-scsi

On Fri, Feb 11, 2005 at 01:31:30PM -0500, Alan Stern wrote:
...

> 
> Radovan: Below is an updated version of the patch James mentioned, with an
> entry included for your Gigabox drive.  It's meant to apply to
> 2.6.11-rc3-bk5, but it will probably go with older kernels too.  Try it
> out and let me know how it works.  If it solves your problem, I'll submit
> it for inclusion in the official kernel.

I tried it (with 2.6.10), and it really seems to solve the
problems (though I have run only a few tests so far).

thanks a lot

P.S. Just FYI, I tried the ub driver, it seems to suffer from the same symptoms...

-- 
 -----------------------------------------------------------
| Radovan Garabík http://melkor.dnp.fmph.uniba.sk/~garabik/ |
| __..--^^^--..__    garabik @ kassiopeia.juls.savba.sk     |
 -----------------------------------------------------------
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!


* [PATCH as468] Retry supposedly "unrecoverable" hardware errors
  2005-02-16 14:37       ` Radovan Garabik
@ 2005-02-16 16:53         ` Alan Stern
  2005-02-17  4:27           ` Douglas Gilbert
  0 siblings, 1 reply; 12+ messages in thread
From: Alan Stern @ 2005-02-16 16:53 UTC (permalink / raw)
  To: James Bottomley; +Cc: Martin Peschke, Radovan Garabik, SCSI development list

James:

This is an updated and unmangled version of the patch sent in by Martin 
Peschke.  Apparently some drives report Hardware Error sense for 
problems which do improve after retrying, so the patch retries these 
supposedly "unrecoverable" errors for such devices.

In addition to the IBM ESS drive it adds a blacklist entry for the drive
inside the MPIO HS200 Gigabox.

Alan Stern



Signed-off-by: Martin Peschke <mpeschke@de.ibm.com>
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>

===== include/scsi/scsi_devinfo.h 1.7 vs edited =====
--- 1.7/include/scsi/scsi_devinfo.h	2004-10-05 11:25:13 -04:00
+++ edited/include/scsi/scsi_devinfo.h	2005-02-11 13:21:51 -05:00
@@ -27,4 +27,5 @@
 #define BLIST_NOT_LOCKABLE	0x80000	/* don't use PREVENT-ALLOW commands */
 #define BLIST_NO_ULD_ATTACH	0x100000 /* device is actually for RAID config */
 #define BLIST_SELECT_NO_ATN	0x200000 /* select without ATN */
+#define BLIST_RETRY_HWERROR	0x400000 /* retry HARDWARE_ERROR */
 #endif
===== drivers/scsi/scsi_devinfo.c 1.14 vs edited =====
--- 1.14/drivers/scsi/scsi_devinfo.c	2004-12-10 11:54:56 -05:00
+++ edited/drivers/scsi/scsi_devinfo.c	2005-02-11 13:25:41 -05:00
@@ -159,6 +159,7 @@
 	{"HP", "C3323-300", "4269", BLIST_NOTQ},
 	{"IBM", "AuSaV1S2", NULL, BLIST_FORCELUN},
 	{"IBM", "ProFibre 4000R", "*", BLIST_SPARSELUN | BLIST_LARGELUN},
+	{"IBM", "2105", NULL, BLIST_RETRY_HWERROR},
 	{"iomega", "jaz 1GB", "J.86", BLIST_NOTQ | BLIST_NOLUN},
 	{"IOMEGA", "Io20S         *F", NULL, BLIST_KEY},
 	{"INSITE", "Floptical   F*8I", NULL, BLIST_KEY},
@@ -192,6 +193,7 @@
 	{"SMSC", "USB 2 HS-CF", NULL, BLIST_SPARSELUN | BLIST_INQUIRY_36},
 	{"SONY", "CD-ROM CDU-8001", NULL, BLIST_BORKEN},
 	{"SONY", "TSL", NULL, BLIST_FORCELUN},		/* DDS3 & DDS4 autoloaders */
+	{"ST650211", "CF", NULL, BLIST_RETRY_HWERROR},
 	{"SUN", "T300", "*", BLIST_SPARSELUN},
 	{"SUN", "T4", "*", BLIST_SPARSELUN},
 	{"TEXEL", "CD-ROM", "1.06", BLIST_BORKEN},
===== drivers/scsi/scsi_error.c 1.47 vs edited =====
--- 1.47/drivers/scsi/scsi_error.c	2005-02-02 21:56:01 -05:00
+++ edited/drivers/scsi/scsi_error.c	2005-02-11 13:23:45 -05:00
@@ -31,6 +31,7 @@
 #include <scsi/scsi_host.h>
 #include <scsi/scsi_ioctl.h>
 #include <scsi/scsi_request.h>
+#include <scsi/scsi_devinfo.h>
 
 #include "scsi_priv.h"
 #include "scsi_logging.h"
@@ -350,10 +351,18 @@
 	case MEDIUM_ERROR:
 		return NEEDS_RETRY;
 
+	case HARDWARE_ERROR:
+		if (scsi_get_device_flags(scmd->device,
+					scmd->device->vendor,
+					scmd->device->model)
+				& BLIST_RETRY_HWERROR)
+			return NEEDS_RETRY;
+		else
+			return SUCCESS;
+
 	case ILLEGAL_REQUEST:
 	case BLANK_CHECK:
 	case DATA_PROTECT:
-	case HARDWARE_ERROR:
 	default:
 		return SUCCESS;
 	}



* Re: [PATCH as468] Retry supposedly "unrecoverable" hardware errors
  2005-02-16 16:53         ` [PATCH as468] Retry supposedly "unrecoverable" hardware errors Alan Stern
@ 2005-02-17  4:27           ` Douglas Gilbert
  2005-02-17  5:06             ` Douglas Gilbert
  2005-02-17 15:11             ` James Bottomley
  0 siblings, 2 replies; 12+ messages in thread
From: Douglas Gilbert @ 2005-02-17  4:27 UTC (permalink / raw)
  To: Alan Stern
  Cc: James Bottomley, Martin Peschke, Radovan Garabik,
	SCSI development list

Alan Stern wrote:
> James:
> 
> This is an updated and unmangled version of the patch sent in by Martin 
> Peschke.  Apparently some drives report Hardware Error sense for 
> problems which do improve after retrying, so the patch retries these 
> supposedly "unrecoverable" errors for such devices.

Recent SPC-3 and SBC-2 drafts treat the sense keys of
MEDIUM ERROR and HARDWARE ERROR in a similar way.
Both can return an "info" field which has the same
meaning (lba of first failure). The distinction is that
MEDIUM ERROR is a little more precise (at least for
magnetic rotating media) **. For flash ram the distinction
is moot.

I believe MEDIUM ERROR and HARDWARE ERROR should be
treated the same way in scsi_check_sense() (i.e.
both return NEEDS_RETRY). That way an extra black list
category is avoided.

** HARDWARE ERROR is returned in cases of self diagnostic
failure and lack of available blocks for reassignment.
It seems valid for a device to return a HARDWARE ERROR
sense key both for these cases and unrecoverable data
errors (and ignore MEDIUM ERROR).

Doug Gilbert


* Re: [PATCH as468] Retry supposedly "unrecoverable" hardware errors
  2005-02-17  4:27           ` Douglas Gilbert
@ 2005-02-17  5:06             ` Douglas Gilbert
  2005-02-17 15:20               ` Alan Stern
  2005-02-17 15:11             ` James Bottomley
  1 sibling, 1 reply; 12+ messages in thread
From: Douglas Gilbert @ 2005-02-17  5:06 UTC (permalink / raw)
  To: Alan Stern
  Cc: James Bottomley, Martin Peschke, Radovan Garabik,
	SCSI development list

[-- Attachment #1: Type: text/plain, Size: 1470 bytes --]

Douglas Gilbert wrote:
> Alan Stern wrote:
> 
>> James:
>>
>> This is an updated and unmangled version of the patch sent in by 
>> Martin Peschke.  Apparently some drives report Hardware Error sense 
>> for problems which do improve after retrying, so the patch retries 
>> these supposedly "unrecoverable" errors for such devices.
> 
> 
> Recent SPC-3 and SBC-2 drafts treat the sense keys of
> MEDIUM ERROR and HARDWARE ERROR in a similar way.
> Both can return an "info" field which has the same
> meaning (lba of first failure). The distinction is that
> MEDIUM ERROR is a little more precise (at least for
> magnetic rotating media) **. For flash ram the distinction
> is moot.
> 
> I believe MEDIUM ERROR and HARDWARE ERROR should be
> treated the same way in scsi_check_sense() (i.e.
> both return NEEDS_RETRY). That way an extra black list
> category is avoided.
> 
> 
> ** HARDWARE ERROR is returned in cases of self diagnostic
> failure and lack of available blocks for reassignment.
> It seems valid for a device to return a HARDWARE ERROR
> sense key both for these cases and unrecoverable data
> errors (and ignore MEDIUM ERROR).

... after a bit further thought, a retry (arguably) is only
needed when an unrecoverable (data) error is detected. If we
assume the "info" field indicates an unrecoverable error
then the following patch combines the processing of
MEDIUM and HARDWARE ERROR sense keys without the need for
a black list category.

Doug Gilbert


[-- Attachment #2: scsi_error2611rc4he.diff --]
[-- Type: text/x-patch, Size: 753 bytes --]

--- linux/drivers/scsi/scsi_error.c	2005-02-13 20:46:31.000000000 +1000
+++ linux/drivers/scsi/scsi_error.c2611rc4he	2005-02-17 14:55:44.000000000 +1000
@@ -279,6 +279,7 @@
 static int scsi_check_sense(struct scsi_cmnd *scmd)
 {
 	struct scsi_sense_hdr sshdr;
+	u64 info;
 
 	if (! scsi_command_normalize_sense(scmd, &sshdr))
 		return FAILED;	/* no valid sense data */
@@ -348,12 +349,15 @@
 		return SUCCESS;
 
 	case MEDIUM_ERROR:
-		return NEEDS_RETRY;
-
+	case HARDWARE_ERROR:
+		if (scsi_get_sense_info_fld(scmd->sense_buffer,
+					    sizeof(scmd->sense_buffer), &info))
+			return NEEDS_RETRY;
+		else
+			return SUCCESS;
 	case ILLEGAL_REQUEST:
 	case BLANK_CHECK:
 	case DATA_PROTECT:
-	case HARDWARE_ERROR:
 	default:
 		return SUCCESS;
 	}


* Re: [PATCH as468] Retry supposedly "unrecoverable" hardware errors
  2005-02-17  4:27           ` Douglas Gilbert
  2005-02-17  5:06             ` Douglas Gilbert
@ 2005-02-17 15:11             ` James Bottomley
  2005-02-18  0:49               ` Douglas Gilbert
  1 sibling, 1 reply; 12+ messages in thread
From: James Bottomley @ 2005-02-17 15:11 UTC (permalink / raw)
  To: Douglas Gilbert
  Cc: Alan Stern, Martin Peschke, Radovan Garabik, SCSI Mailing List

On Thu, 2005-02-17 at 14:27 +1000, Douglas Gilbert wrote:
> Recent SPC-3 and SBC-2 drafts treat the sense keys of
> MEDIUM ERROR and HARDWARE ERROR in a similar way.
> Both can return an "info" field which has the same
> meaning (lba of first failure). The distinction is that
> MEDIUM ERROR is a little more precise (at least for
> magnetic rotating media) **. For flash ram the distinction
> is moot.

My copy of SPC-3 (r21d) still defined HARDWARE ERROR in Table 27 as

HARDWARE ERROR: Indicates that the device server detected a
non-recoverable hardware failure (e.g., controller failure, device
failure, or parity error) while performing the command or during a
self test.

which looks pretty non-retryable to me ... where does it say that the
error might be retryable?

James




* Re: [PATCH as468] Retry supposedly "unrecoverable" hardware errors
  2005-02-17  5:06             ` Douglas Gilbert
@ 2005-02-17 15:20               ` Alan Stern
  0 siblings, 0 replies; 12+ messages in thread
From: Alan Stern @ 2005-02-17 15:20 UTC (permalink / raw)
  To: Douglas Gilbert
  Cc: James Bottomley, Martin Peschke, Radovan Garabik,
	SCSI development list

On Thu, 17 Feb 2005, Douglas Gilbert wrote:

> ... after a bit further thought, a retry (arguably) is only
> needed when an unrecoverable (data) error is detected. If we
> assume the "info" field indicates an unrecoverable error
> then the following patch combines the processing of
> MEDIUM and HARDWARE ERROR sense keys without the need for
> a black list category.

Your assumption is not a good one.  In the particular case that led me to 
submit this patch, the device returns Valid = 0, with no "info" data.  
Nevertheless, retrying fixes the problem, whereas your patch won't help.
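
The point can be sketched in user-space C. The helper names
(fixed_sense_info(), info_based_policy()) and the DISP_* values are
invented for this example and only mimic the policy in the proposed
patch; they are not the kernel's scsi_get_sense_info_fld(). The layout
assumed is fixed-format sense data, where bit 7 of byte 0 is the VALID
bit and bytes 3..6 hold the INFORMATION field.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum disposition { DISP_SUCCESS, DISP_NEEDS_RETRY };

/*
 * Hypothetical helper: fetch the INFORMATION field from fixed-format
 * sense data. Returns 1 and fills *info only when the VALID bit is set.
 */
static int fixed_sense_info(const uint8_t *sense, size_t len, uint32_t *info)
{
	uint8_t code;

	assert(sense != NULL && info != NULL);

	if (len < 7)
		return 0;
	code = sense[0] & 0x7f;
	if (code != 0x70 && code != 0x71)
		return 0;		/* not fixed format */
	if (!(sense[0] & 0x80))
		return 0;		/* VALID = 0: no INFORMATION field */

	*info = ((uint32_t)sense[3] << 24) | ((uint32_t)sense[4] << 16) |
		((uint32_t)sense[5] << 8) | (uint32_t)sense[6];
	return 1;
}

/* Policy under discussion: retry only if an INFORMATION field is present. */
static enum disposition info_based_policy(const uint8_t *sense, size_t len)
{
	uint32_t info;

	return fixed_sense_info(sense, len, &info) ? DISP_NEEDS_RETRY
						   : DISP_SUCCESS;
}
```

For sense data with Valid = 0, such as a buffer beginning 0x70 0x00 0x04,
info_based_policy() yields DISP_SUCCESS, so the command is not retried.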

Alan Stern



* Re: [PATCH as468] Retry supposedly "unrecoverable" hardware errors
  2005-02-17 15:11             ` James Bottomley
@ 2005-02-18  0:49               ` Douglas Gilbert
  0 siblings, 0 replies; 12+ messages in thread
From: Douglas Gilbert @ 2005-02-18  0:49 UTC (permalink / raw)
  To: James Bottomley
  Cc: Alan Stern, Martin Peschke, Radovan Garabik, SCSI Mailing List

James Bottomley wrote:
> On Thu, 2005-02-17 at 14:27 +1000, Douglas Gilbert wrote:
> 
>>Recent SPC-3 and SBC-2 drafts treat the sense keys of
>>MEDIUM ERROR and HARDWARE ERROR in a similar way.
>>Both can return an "info" field which has the same
>>meaning (lba of first failure). The distinction is that
>>MEDIUM ERROR is a little more precise (at least for
>>magnetic rotating media) **. For flash ram the distinction
>>is moot.
> 
> 
> My copy of SPC-3 (r21d) still defined HARDWARE ERROR in Table 27 as
> 
> HARDWARE ERROR: Indicates that the device server detected a
> non-recoverable hardware failure (e.g., controller failure, device
> failure, or parity error) while performing the command or during a
> self test.
> 
> which looks pretty non-retryable to me ... where does it say that the
> error might be retryable?

James,
The definition of MEDIUM ERROR from the same table:
"Indicates that the command terminated with a non-recoverable
error condition that may have been caused by a flaw in the
medium or an error in the recorded data. This sense key may
also be returned if the device server is unable to
distinguish between a flaw in the medium and a specific
hardware failure (i.e. sense key 4h)". Sense key "4h" is
HARDWARE ERROR.

I interpret that as SPC-3 saying MEDIUM ERROR and
HARDWARE ERROR may both report non-recoverable errors.
Also note that MEDIUM ERROR, HARDWARE ERROR and RECOVERED
ERROR can return an "actual retry count" in their additional
sense data.

SBC-2 (rev 16) makes little distinction between
the two sense keys for "unrecovered read errors": table 4 shows
either can be used. It also says on page 19: "When
an unrecovered read error is reported the information field
of the sense data shall contain the LBA of the unrecovered
logical block."

Nothing that I can see links an "unrecovered (read) error" with
the application client retrying the same command in either draft.
If "actual retry count" is > 1 in the sense key specific field
then that implies the device has already tried several times.
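
A minimal sketch in C of reading that count, assuming fixed-format sense
data in which bit 7 of byte 15 is SKSV and bytes 16..17 carry the actual
retry count for the RECOVERED ERROR, MEDIUM ERROR and HARDWARE ERROR
sense keys. The function name fixed_sense_retry_count() is hypothetical
and is not an existing kernel helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: extract the "actual retry count" from fixed-format
 * sense data. Returns 1 and fills *count when the field is present.
 */
static int fixed_sense_retry_count(const uint8_t *sense, size_t len,
				   unsigned int *count)
{
	uint8_t code, key;

	assert(sense != NULL && count != NULL);

	if (len < 18)
		return 0;		/* sense-key specific bytes missing */
	code = sense[0] & 0x7f;
	if (code != 0x70 && code != 0x71)
		return 0;		/* not fixed format */

	key = sense[2] & 0x0f;
	if (key != 0x01 && key != 0x03 && key != 0x04)
		return 0;		/* count only defined for these keys */
	if (!(sense[15] & 0x80))
		return 0;		/* SKSV = 0: field not valid */

	*count = ((unsigned int)sense[16] << 8) | sense[17];
	return 1;
}
```

A caller could use the returned count to decide whether another retry from
the host is likely to help, which is the policy question raised here.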

SSC-3 (for tape drives) also allows MEDIUM ERROR or HARDWARE ERROR
to indicate an unrecovered read error (rev 1c, table 2). For tape
drives, retrying the same command is probably not appropriate. [I
note that st and sg set their 'max_retries' to 0 to inhibit this.]
MMC-5 only mentions the HARDWARE ERROR sense key for a self
diagnostic failure.

This analysis leads me to question why retries are instigated
from the mid level and not the sd driver (and perhaps sr driver
as well). If retries were left to sd, it should not instigate
them when the device indicates a reasonable number of retries
have already taken place, unless it can change some other factor
or is instructed to by some parameter to sd.

As Alan Stern points out, my patch fails the reality
test. The device in question obviously required a retry when
it returned a HARDWARE ERROR sense key (but perhaps the
reason was not an unrecovered error or it was not reported
properly).

Doug Gilbert

