[PATCH] scsi_lib.c: continue after MEDIUM

linux-ide.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
@ 2007-01-31  0:47 Mark Lord
  2007-01-31  1:12 ` [PATCH] RESEND " Mark Lord
  2007-01-31  1:16 ` [PATCH] " James Bottomley
  0 siblings, 2 replies; 34+ messages in thread
From: Mark Lord @ 2007-01-31  0:47 UTC (permalink / raw)
  To: linux-kernel, IDE/ATA development list, James Bottomley

In ancient kernels, the SCSI disk code used to continue after
encountering a MEDIUM_ERROR.  It would "complete" the good
sectors before the error, fail the bad sector/block, and then
continue with the rest of the request.

Kernels since about 2.6.16 or so have been broken in this regard.
They "complete" the good sectors before the error,
and then fail the entire remaining portions of the request.

This is very risky behaviour, as a request is often a merge
of several bios, and just because one application hits a bad sector
is no reason to pretend that (for example) an adjacent directly lookup also failed.

This patch fixes the behaviour to be similar to what we had originally.

When a bad sector is encounted, SCSI will now work around it again,
failing *only* the bad sector itself.

Signed-off-by:  Mark Lord <mlord@pobox.com>
---
diff -u --recursive --new-file --exclude-from=linux_17//Documentation/dontdiff old/drivers/scsi/scsi_lib.c linux/drivers/scsi/scsi_lib.c
--- old/drivers/scsi/scsi_lib.c	2007-01-30 13:58:05.000000000 -0500
+++ linux/drivers/scsi/scsi_lib.c	2007-01-30 18:30:01.000000000 -0500
@@ -865,6 +865,12 @@
 	 */
 	if (sense_valid && !sense_deferred) {
 		switch (sshdr.sense_key) {
+		case MEDIUM_ERROR:
+			// Bad sector.  Fail it, and then continue the rest of the request:
+			if (scsi_end_request(cmd, 0, cmd->device->sector_size, 1) == NULL) {
+				cmd->retries = 0;	// go around again..
+				return;
+			}
 		case UNIT_ATTENTION:
 			if (cmd->device->removable) {
 				/* Detected disc change.  Set a bit

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH] RESEND scsi_lib.c: continue after MEDIUM_ERROR
  2007-01-31  0:47 [PATCH] scsi_lib.c: continue after MEDIUM_ERROR Mark Lord
@ 2007-01-31  1:12 ` Mark Lord
  2007-01-31  1:16 ` [PATCH] " James Bottomley
  1 sibling, 0 replies; 34+ messages in thread
From: Mark Lord @ 2007-01-31  1:12 UTC (permalink / raw)
  To: linux-kernel, IDE/ATA development list, James Bottomley,
	linux-scsi

Fixed for 80-columns, and copying linux-scsi this time.

In ancient kernels, the SCSI disk code used to continue after
encountering a MEDIUM_ERROR.  It would "complete" the good
sectors before the error, fail the bad sector/block, and then
continue with the rest of the request.

Kernels since about 2.6.16 or so have been broken in this regard.
They "complete" the good sectors before the error,
and then fail the entire remaining portions of the request.

This is very risky behaviour, as a request is often a merge
of several bios, and just because one application hits a bad sector
is no reason to pretend that (for example) an adjacent directly lookup also failed.

This patch fixes the behaviour to be similar to what we had originally.

When a bad sector is encounted, SCSI will now work around it again,
failing *only* the bad sector itself.

Signed-off-by:  Mark Lord <mlord@pobox.com>
---
--- old/drivers/scsi/scsi_lib.c	2007-01-30 20:06:15.000000000 -0500
+++ linux/drivers/scsi/scsi_lib.c	2007-01-30 20:06:59.000000000 -0500
@@ -865,6 +865,13 @@
 	 */
 	if (sense_valid && !sense_deferred) {
 		switch (sshdr.sense_key) {
+		case MEDIUM_ERROR:
+			/* Bad sector. Fail it, and continue on with the rest */
+			if (scsi_end_request(cmd, 0,
+					cmd->device->sector_size, 1) == NULL) {
+				cmd->retries = 0;	/* go around again.. */
+				return;
+			}
 		case UNIT_ATTENTION:
 			if (cmd->device->removable) {
 				/* Detected disc change.  Set a bit

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
  2007-01-31  0:47 [PATCH] scsi_lib.c: continue after MEDIUM_ERROR Mark Lord
  2007-01-31  1:12 ` [PATCH] RESEND " Mark Lord
@ 2007-01-31  1:16 ` James Bottomley
  2007-01-31  1:36   ` Mark Lord
                     ` (2 more replies)
  1 sibling, 3 replies; 34+ messages in thread
From: James Bottomley @ 2007-01-31  1:16 UTC (permalink / raw)
  To: Mark Lord; +Cc: linux-kernel, IDE/ATA development list, linux-scsi

First off, please send SCSI patches to the SCSI list:
<linux-scsi@vger.kernel.org>

On Tue, 2007-01-30 at 19:47 -0500, Mark Lord wrote:
> In ancient kernels, the SCSI disk code used to continue after
> encountering a MEDIUM_ERROR.  It would "complete" the good
> sectors before the error, fail the bad sector/block, and then
> continue with the rest of the request.
> 
> Kernels since about 2.6.16 or so have been broken in this regard.
> They "complete" the good sectors before the error,
> and then fail the entire remaining portions of the request.

What was the commit that introduced the change? ... I have a vague
memory of it being deliberate.

> This is very risky behaviour, as a request is often a merge
> of several bios, and just because one application hits a bad sector
> is no reason to pretend that (for example) an adjacent directly lookup also failed.
> 
> This patch fixes the behaviour to be similar to what we had originally.
> 
> When a bad sector is encounted, SCSI will now work around it again,
> failing *only* the bad sector itself.

Erm, but the corollary is that if we get a large read failure because of
a bad track, you're going to try and chunk up it a sector at a time
forcing an individual error for each sector is going to annoy some
people ... particularly removable medium ones which return this error if
the medium isn't present ... Are you sure this is really what we want to
do?

> Signed-off-by:  Mark Lord <mlord@pobox.com>
> ---
> diff -u --recursive --new-file --exclude-from=linux_17//Documentation/dontdiff old/drivers/scsi/scsi_lib.c linux/drivers/scsi/scsi_lib.c
> --- old/drivers/scsi/scsi_lib.c	2007-01-30 13:58:05.000000000 -0500
> +++ linux/drivers/scsi/scsi_lib.c	2007-01-30 18:30:01.000000000 -0500
> @@ -865,6 +865,12 @@
>  	 */
>  	if (sense_valid && !sense_deferred) {
>  		switch (sshdr.sense_key) {
> +		case MEDIUM_ERROR:
> +			// Bad sector.  Fail it, and then continue the rest of the request:
> +			if (scsi_end_request(cmd, 0, cmd->device->sector_size, 1) == NULL) {

The sense key may have come with additional information  I think we want
to parse that (if it exists) rather than just blindly failing the first
sector of the request.

> +				cmd->retries = 0;	// go around again..
> +				return;
> +			}

This would drop through to the UNIT_ATTENTION case if scsi_end_request()
fails ... I don't think that's correct.

>  		case UNIT_ATTENTION:
>  			if (cmd->device->removable) {
>  				/* Detected disc change.  Set a bit

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
  2007-01-31  1:16 ` [PATCH] " James Bottomley
@ 2007-01-31  1:36   ` Mark Lord
       [not found]   ` <311601c90701301725n53d25a74g652b7ca3bfc64c56@mail.gmail.com>
  2007-02-01 20:02   ` Mark Lord
  2 siblings, 0 replies; 34+ messages in thread
From: Mark Lord @ 2007-01-31  1:36 UTC (permalink / raw)
  To: James Bottomley; +Cc: linux-kernel, IDE/ATA development list, linux-scsi

James Bottomley wrote:
> First off, please send SCSI patches to the SCSI list:
> <linux-scsi@vger.kernel.org>

Fixed already, thanks!

>> This patch fixes the behaviour to be similar to what we had originally.
>>
>> When a bad sector is encounted, SCSI will now work around it again,
>> failing *only* the bad sector itself.
> 
> Erm, but the corollary is that if we get a large read failure because of
> a bad track, you're going to try and chunk up it a sector at a time

That's better than the huge data-loss scenario that we currently
have for single-sector errors.  MUCH better.

> forcing an individual error for each sector is going to annoy some
> people ... particularly removable medium ones which return this error if
> the medium isn't present ... Are you sure this is really what we want to
> do?

No, for removed-medium everything just fails right away.
This patch is *only* for media errors, not any other failures.

Cheers

^ permalink raw reply	[flat|nested] 34+ messages in thread

[parent not found: <311601c90701301725n53d25a74g652b7ca3bfc64c56@mail.gmail.com>]

* Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
       [not found]   ` <311601c90701301725n53d25a74g652b7ca3bfc64c56@mail.gmail.com>
@ 2007-01-31  1:41     ` Mark Lord
  2007-01-31  3:20       ` Ric Wheeler
  2007-01-31  9:30       ` Jeff Garzik
  0 siblings, 2 replies; 34+ messages in thread
From: Mark Lord @ 2007-01-31  1:41 UTC (permalink / raw)
  To: Eric D. Mudama
  Cc: James Bottomley, linux-kernel, IDE/ATA development list,
	linux-scsi

Eric D. Mudama wrote:
>
> Actually, it's possibly worse, since each failure in libata will 
> generate 3-4 retries.  With existing ATA error recovery in the drives, 
> that's about 3 seconds per retry on average, or 12 seconds per failure.  
> Multiply that by the number of blocks past the error to complete the 
> request..

It really beats the alternative of a forced reboot
due to, say, superblock I/O failing because it happened
to get merged with an unrelated I/O which then failed..
Etc..

Definitely an improvement.

The number of retries is an entirely separate issue.
If we really care about it, then we should fix SD_MAX_RETRIES.

The current value of 5 is *way* too high.  It should be zero or one.

Cheers

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
  2007-01-31  1:41     ` Mark Lord
@ 2007-01-31  3:20       ` Ric Wheeler
  2007-01-31  4:21         ` James Bottomley
                           ` (2 more replies)
  2007-01-31  9:30       ` Jeff Garzik
  1 sibling, 3 replies; 34+ messages in thread
From: Ric Wheeler @ 2007-01-31  3:20 UTC (permalink / raw)
  To: Mark Lord
  Cc: Eric D. Mudama, James Bottomley, linux-kernel,
	IDE/ATA development list, linux-scsi



Mark Lord wrote:

> Eric D. Mudama wrote:
>
>>
>> Actually, it's possibly worse, since each failure in libata will 
>> generate 3-4 retries.  With existing ATA error recovery in the 
>> drives, that's about 3 seconds per retry on average, or 12 seconds 
>> per failure.  Multiply that by the number of blocks past the error to 
>> complete the request..
>
>
> It really beats the alternative of a forced reboot
> due to, say, superblock I/O failing because it happened
> to get merged with an unrelated I/O which then failed..
> Etc..
>
> Definitely an improvement.
>
> The number of retries is an entirely separate issue.
> If we really care about it, then we should fix SD_MAX_RETRIES.
>
> The current value of 5 is *way* too high.  It should be zero or one.
>
> Cheers
>
I think that drives retry enough, we should leave retry at zero for 
normal (non-removable) drives. Should this  be a policy we can set like 
we do with NCQ queue depth via /sys ?

We need to be able to layer things like MD on top of normal drive errors 
in a way that will produce a system that provides reasonable response 
time despite any possible IO error on a single component.  Another case 
that we end up doing on a regular basis is drive recovery. Errors need 
to be limited in scope to just the impacted area and dispatched up to 
the application layer as quickly as we can so that you don't spend days 
watching a copy of  huge drive (think 750GB or more) ;-)

ric


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
  2007-01-31  3:20       ` Ric Wheeler
@ 2007-01-31  4:21         ` James Bottomley
  2007-01-31 15:13           ` Mark Lord
  2007-01-31  5:09         ` Douglas Gilbert
  2007-01-31 15:08         ` Mark Lord
  2 siblings, 1 reply; 34+ messages in thread
From: James Bottomley @ 2007-01-31  4:21 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Mark Lord, Eric D. Mudama, linux-kernel, IDE/ATA development list,
	linux-scsi

On Tue, 2007-01-30 at 22:20 -0500, Ric Wheeler wrote:
> Mark Lord wrote:
> > The number of retries is an entirely separate issue.
> > If we really care about it, then we should fix SD_MAX_RETRIES.
> >
> > The current value of 5 is *way* too high.  It should be zero or one.
> >
> > Cheers
> >
> I think that drives retry enough, we should leave retry at zero for 
> normal (non-removable) drives. Should this  be a policy we can set like 
> we do with NCQ queue depth via /sys ?

I don't disagree that it should be settable.  However, retries occur for
other reasons than failures inside the device.  The most standard ones
are unit attentions generated because of other activity (target reset
etc).  The key to the problem is retrying only operations that are
genuinely retryable, which the mid-layer doesn't do such a good job on.

> We need to be able to layer things like MD on top of normal drive errors 
> in a way that will produce a system that provides reasonable response 
> time despite any possible IO error on a single component.  Another case 
> that we end up doing on a regular basis is drive recovery. Errors need 
> to be limited in scope to just the impacted area and dispatched up to 
> the application layer as quickly as we can so that you don't spend days 
> watching a copy of  huge drive (think 750GB or more) ;-)

For the MD case, this is what REQ_FAILFAST is for.

James



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
  2007-01-31  4:21         ` James Bottomley
@ 2007-01-31 15:13           ` Mark Lord
  2007-01-31 15:22             ` Mark Lord
  2007-01-31 15:24             ` James Bottomley
  0 siblings, 2 replies; 34+ messages in thread
From: Mark Lord @ 2007-01-31 15:13 UTC (permalink / raw)
  To: James Bottomley
  Cc: Ric Wheeler, Eric D. Mudama, linux-kernel,
	IDE/ATA development list, linux-scsi, dougg

James Bottomley wrote:
>
> For the MD case, this is what REQ_FAILFAST is for.

I cannot find where SCSI honours that flag.  James?

And for that matter, even when I patch SCSI so that it *does* honour it,
I don't actually see the flag making it into the SCSI layer from above.

And I don't see where/how the block layer takes care when considering
merge FAILFAST/READA requests with non FAILFAST/READA requests.
To me, it looks perfectly happy to add non-FAILFAST/READA bios
to a FAILFAST request, risking data loss if a lower-layer decides
to honour the FAILFAST/READA flags.

So it's a pretty Good Thing(tm) that SCSI doesn't currently honour it. ;)

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
  2007-01-31 15:13           ` Mark Lord
@ 2007-01-31 15:22             ` Mark Lord
  2007-01-31 15:24             ` James Bottomley
  1 sibling, 0 replies; 34+ messages in thread
From: Mark Lord @ 2007-01-31 15:22 UTC (permalink / raw)
  To: James Bottomley
  Cc: Ric Wheeler, Eric D. Mudama, linux-kernel,
	IDE/ATA development list, linux-scsi, dougg

Mark Lord wrote:
> James Bottomley wrote:
>>
>> For the MD case, this is what REQ_FAILFAST is for.
> 
> I cannot find where SCSI honours that flag.  James?

Scratch that thought.. SCSI honours it in scsi_end_request().

But I'm not certain that the block layer handles it correctly,
at least not in the 2.6.16/2.6.18 kernel that I'm working with today.

Cheers

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
  2007-01-31 15:13           ` Mark Lord
  2007-01-31 15:22             ` Mark Lord
@ 2007-01-31 15:24             ` James Bottomley
  1 sibling, 0 replies; 34+ messages in thread
From: James Bottomley @ 2007-01-31 15:24 UTC (permalink / raw)
  To: Mark Lord
  Cc: Ric Wheeler, Eric D. Mudama, linux-kernel,
	IDE/ATA development list, linux-scsi, dougg

On Wed, 2007-01-31 at 10:13 -0500, Mark Lord wrote:
> James Bottomley wrote:
> >
> > For the MD case, this is what REQ_FAILFAST is for.

> I cannot find where SCSI honours that flag.  James?

Er, it's in scsi_error.c:scsi_decide_disposition():

      maybe_retry:

	/* we requeue for retry because the error was retryable, and
	 * the request was not marked fast fail.  Note that above,
	 * even if the request is marked fast fail, we still requeue
	 * for queue congestion conditions (QUEUE_FULL or BUSY) */
	if ((++scmd->retries) <= scmd->allowed
	    && !blk_noretry_request(scmd->request)) {
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
		return NEEDS_RETRY;
	} else {
		/*
		 * no more retries - report this one back to upper level.
		 */
		return SUCCESS;
	}

> And for that matter, even when I patch SCSI so that it *does* honour it,
> I don't actually see the flag making it into the SCSI layer from above.
> 
> And I don't see where/how the block layer takes care when considering
> merge FAILFAST/READA requests with non FAILFAST/READA requests.
> To me, it looks perfectly happy to add non-FAILFAST/READA bios
> to a FAILFAST request, risking data loss if a lower-layer decides
> to honour the FAILFAST/READA flags.
> 
> So it's a pretty Good Thing(tm) that SCSI doesn't currently honour it. ;)

James

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
  2007-01-31  3:20       ` Ric Wheeler
  2007-01-31  4:21         ` James Bottomley
@ 2007-01-31  5:09         ` Douglas Gilbert
  2007-01-31 15:08         ` Mark Lord
  2 siblings, 0 replies; 34+ messages in thread
From: Douglas Gilbert @ 2007-01-31  5:09 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Mark Lord, Eric D. Mudama, James Bottomley, linux-kernel,
	IDE/ATA development list, linux-scsi

Ric Wheeler wrote:
> 
> 
> Mark Lord wrote:
> 
>> Eric D. Mudama wrote:
>>
>>>
>>> Actually, it's possibly worse, since each failure in libata will
>>> generate 3-4 retries.  With existing ATA error recovery in the
>>> drives, that's about 3 seconds per retry on average, or 12 seconds
>>> per failure.  Multiply that by the number of blocks past the error to
>>> complete the request..
>>
>>
>> It really beats the alternative of a forced reboot
>> due to, say, superblock I/O failing because it happened
>> to get merged with an unrelated I/O which then failed..
>> Etc..
>>
>> Definitely an improvement.
>>
>> The number of retries is an entirely separate issue.
>> If we really care about it, then we should fix SD_MAX_RETRIES.
>>
>> The current value of 5 is *way* too high.  It should be zero or one.
>>
>> Cheers
>>
> I think that drives retry enough, we should leave retry at zero for
> normal (non-removable) drives. Should this  be a policy we can set like
> we do with NCQ queue depth via /sys ?

The transport might also want a say. I see ABORTED COMMAND
errors often enough with SAS (e.g. due to expander congestion)
to warrant at least one retry (which works in my testing).
SATA disks behind SAS infrastructure would also be
susceptible to the same "random" failures.

Transport Layer Retries (TLR) in SAS should remove this class
of transport errors but only SAS tape drives support TLR as
far as I know.

Doug Gilbert

> We need to be able to layer things like MD on top of normal drive errors
> in a way that will produce a system that provides reasonable response
> time despite any possible IO error on a single component.  Another case
> that we end up doing on a regular basis is drive recovery. Errors need
> to be limited in scope to just the impacted area and dispatched up to
> the application layer as quickly as we can so that you don't spend days
> watching a copy of  huge drive (think 750GB or more) ;-)
> 
> ric



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
  2007-01-31  3:20       ` Ric Wheeler
  2007-01-31  4:21         ` James Bottomley
  2007-01-31  5:09         ` Douglas Gilbert
@ 2007-01-31 15:08         ` Mark Lord
  2007-01-31 15:23           ` Alan
  2 siblings, 1 reply; 34+ messages in thread
From: Mark Lord @ 2007-01-31 15:08 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Eric D. Mudama, James Bottomley, linux-kernel,
	IDE/ATA development list, linux-scsi, dougg

Ric Wheeler wrote:
> Mark Lord wrote:
>> Eric D. Mudama wrote:
>>> Actually, it's possibly worse, since each failure in libata will 
>>> generate 3-4 retries. 

(note: libata does *not* generate retries for medium errors;
 the looping is driven by the SCSI mid-layer code).

>> It really beats the alternative of a forced reboot
>> due to, say, superblock I/O failing because it happened
>> to get merged with an unrelated I/O which then failed..
>> Etc..
>>
>> Definitely an improvement.
>>
>> The number of retries is an entirely separate issue.
>> If we really care about it, then we should fix SD_MAX_RETRIES.
>>
>> The current value of 5 is *way* too high.  It should be zero or one.
..
> I think that drives retry enough, we should leave retry at zero for 
> normal (non-removable) drives. Should this  be a policy we can set like 
> we do with NCQ queue depth via /sys ?

Or perhaps we could have the mid-layer always "early-exit"
without retries for "MEDIUM_ERROR", and still do retries for the rest.

When libata reports a MEDIUM_ERROR to us, we *know* it's non-recoverable,
as the drive itself has already done internal retries (libata uses the
"with retry" ATA opcodes for this).

But meanwhile, we still have the original issue too, where a single stray
bad sector can blow a system out of the water, because the mid-layer
currently aborts everything after it from a large merged request.

Thus the original patch from this thread.  :)

Cheers

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
  2007-01-31 15:08         ` Mark Lord
@ 2007-01-31 15:23           ` Alan
  2007-01-31 16:35             ` Ric Wheeler
  2007-01-31 17:57             ` Mark Lord
  0 siblings, 2 replies; 34+ messages in thread
From: Alan @ 2007-01-31 15:23 UTC (permalink / raw)
  To: Mark Lord
  Cc: Ric Wheeler, Eric D. Mudama, James Bottomley, linux-kernel,
	IDE/ATA development list, linux-scsi, dougg

> When libata reports a MEDIUM_ERROR to us, we *know* it's non-recoverable,
> as the drive itself has already done internal retries (libata uses the
> "with retry" ATA opcodes for this).

This depends on the firmware. Some of the "raid firmware" drives don't
appear to do retries in firmware.

> But meanwhile, we still have the original issue too, where a single stray
> bad sector can blow a system out of the water, because the mid-layer
> currently aborts everything after it from a large merged request.
> 
> Thus the original patch from this thread.  :)

Agreed

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
  2007-01-31 15:23           ` Alan
@ 2007-01-31 16:35             ` Ric Wheeler
  2007-01-31 17:57             ` Mark Lord
  1 sibling, 0 replies; 34+ messages in thread
From: Ric Wheeler @ 2007-01-31 16:35 UTC (permalink / raw)
  To: Alan
  Cc: Mark Lord, Eric D. Mudama, James Bottomley, linux-kernel,
	IDE/ATA development list, linux-scsi, dougg



Alan wrote:
>> When libata reports a MEDIUM_ERROR to us, we *know* it's non-recoverable,
>> as the drive itself has already done internal retries (libata uses the
>> "with retry" ATA opcodes for this).
> 
> This depends on the firmware. Some of the "raid firmware" drives don't
> appear to do retries in firmware.

I think that even for these devices, the actual drives behind the controller 
will do retries. In any case, it would be reasonable to be able to set this 
retry/no-retry via /sys to handle exceptional cases...

> 
>> But meanwhile, we still have the original issue too, where a single stray
>> bad sector can blow a system out of the water, because the mid-layer
>> currently aborts everything after it from a large merged request.
>>
>> Thus the original patch from this thread.  :)
> 
> Agreed



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
  2007-01-31 15:23           ` Alan
  2007-01-31 16:35             ` Ric Wheeler
@ 2007-01-31 17:57             ` Mark Lord
  2007-01-31 18:13               ` James Bottomley
  1 sibling, 1 reply; 34+ messages in thread
From: Mark Lord @ 2007-01-31 17:57 UTC (permalink / raw)
  To: Alan
  Cc: Ric Wheeler, Eric D. Mudama, James Bottomley, linux-kernel,
	IDE/ATA development list, linux-scsi, dougg

Alan wrote:
>> When libata reports a MEDIUM_ERROR to us, we *know* it's non-recoverable,
>> as the drive itself has already done internal retries (libata uses the
>> "with retry" ATA opcodes for this).
> 
> This depends on the firmware. Some of the "raid firmware" drives don't
> appear to do retries in firmware.

One way to tell if this is true, is simply to time how long
the failed operation takes.  If the drive truly does not do retries,
then the media error should be reported more or less instantly
(assuming drive was already spun up).

If the failure takes more than a few hundred milliseconds to be reported,
or in this case 4-7 seconds typically, then we know the drive was doing
retries before it reported back.

I haven't seen any drive fail instantly yet.
Can anyone with those newfangled "RAID edition" drives try it
and report back?  Oh.. you'll need a way to create a bad sector.
I've got patches and a command-line utility for the job.

If your drive supports "WRITE UNCORRECTABLE" ("hdparm -I", w/latest hdparm),
then the patches aren't needed.

>> But meanwhile, we still have the original issue too, where a single stray
>> bad sector can blow a system out of the water, because the mid-layer
>> currently aborts everything after it from a large merged request.
>>
>> Thus the original patch from this thread.  :)
> 
> Agreed

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
  2007-01-31 17:57             ` Mark Lord
@ 2007-01-31 18:13               ` James Bottomley
  2007-01-31 18:37                 ` Mark Lord
  0 siblings, 1 reply; 34+ messages in thread
From: James Bottomley @ 2007-01-31 18:13 UTC (permalink / raw)
  To: Mark Lord
  Cc: Alan, Ric Wheeler, Eric D. Mudama, linux-kernel,
	IDE/ATA development list, linux-scsi, dougg

On Wed, 2007-01-31 at 12:57 -0500, Mark Lord wrote:
> Alan wrote:
> >> When libata reports a MEDIUM_ERROR to us, we *know* it's non-recoverable,
> >> as the drive itself has already done internal retries (libata uses the
> >> "with retry" ATA opcodes for this).
> > 
> > This depends on the firmware. Some of the "raid firmware" drives don't
> > appear to do retries in firmware.
> 
> One way to tell if this is true, is simply to time how long
> the failed operation takes.  If the drive truly does not do retries,
> then the media error should be reported more or less instantly
> (assuming drive was already spun up).

Well, the simpler way (and one we have a hope of implementing) is to
examine the ASC/ASCQ codes to see if the error is genuinely unretryable.

I seem to have dropped the ball on this one in that the scsi_error.c
pieces of this patch

http://marc.theaimsgroup.com/?l=linux-scsi&m=116485834119885

I thought I'd applied.  Apparently I didn't, so I'll go back and put
them in.

James



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
  2007-01-31 18:13               ` James Bottomley
@ 2007-01-31 18:37                 ` Mark Lord
  0 siblings, 0 replies; 34+ messages in thread
From: Mark Lord @ 2007-01-31 18:37 UTC (permalink / raw)
  To: James Bottomley
  Cc: Alan, Ric Wheeler, Eric D. Mudama, linux-kernel,
	IDE/ATA development list, linux-scsi, dougg

James Bottomley wrote:
> On Wed, 2007-01-31 at 12:57 -0500, Mark Lord wrote:
>> Alan wrote:
>>>> When libata reports a MEDIUM_ERROR to us, we *know* it's non-recoverable,
>>>> as the drive itself has already done internal retries (libata uses the
>>>> "with retry" ATA opcodes for this).
>>> This depends on the firmware. Some of the "raid firmware" drives don't
>>> appear to do retries in firmware.
>> One way to tell if this is true, is simply to time how long
>> the failed operation takes.  If the drive truly does not do retries,
>> then the media error should be reported more or less instantly
>> (assuming drive was already spun up).
> 
> Well, the simpler way (and one we have a hope of implementing) is to
> examine the ASC/ASCQ codes to see if the error is genuinely unretryable.

My suggestion above was not for a kernel fix,
but rather just as a way of determining if drives
which claim "no retries" actually do them or not.  :)

> I seem to have dropped the ball on this one in that the scsi_error.c
> pieces of this patch
> 
> http://marc.theaimsgroup.com/?l=linux-scsi&m=116485834119885
> 
> I thought I'd applied.  Apparently I didn't, so I'll go back and put
> them in.

Good.  That would be a useful supplement to the patch I posted here.

Cheers

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
  2007-01-31  1:41     ` Mark Lord
  2007-01-31  3:20       ` Ric Wheeler
@ 2007-01-31  9:30       ` Jeff Garzik
  2007-01-31 14:36         ` Ric Wheeler
  1 sibling, 1 reply; 34+ messages in thread
From: Jeff Garzik @ 2007-01-31  9:30 UTC (permalink / raw)
  To: Mark Lord, Eric D. Mudama
  Cc: James Bottomley, linux-kernel, IDE/ATA development list,
	linux-scsi

Mark Lord wrote:
> Eric D. Mudama wrote:
>>
>> Actually, it's possibly worse, since each failure in libata will 
>> generate 3-4 retries.  With existing ATA error recovery in the drives, 
>> that's about 3 seconds per retry on average, or 12 seconds per 
>> failure.  Multiply that by the number of blocks past the error to 
>> complete the request..
> 
> It really beats the alternative of a forced reboot
> due to, say, superblock I/O failing because it happened
> to get merged with an unrelated I/O which then failed..
> Etc..

FWIW -- speaking generally -- I think there are inevitable areas where 
libata error handling combined with SCSI error handling results in 
suboptimal error handling.

Just creating a list of "<this behavior> should be handled <this way>, 
but in reality is handled in <this silly way>" would be very helpful.

Error handling is tough to get right, because the code is exercised so 
infrequently.  Tejun has actually done an above-average job here, by 
making device probe, hotplug and other "exceptions" go through the 
libata EH code, thereby exercising the EH code more than one might 
normally assume.

Some errors in libata probably should not be retried more than once, 
when we have a definitive diagnosis.  Suggestions for improvements are 
welcome.

	Jeff

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
  2007-01-31  9:30       ` Jeff Garzik
@ 2007-01-31 14:36         ` Ric Wheeler
  2007-01-31 15:28           ` Douglas Gilbert
  0 siblings, 1 reply; 34+ messages in thread
From: Ric Wheeler @ 2007-01-31 14:36 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Mark Lord, Eric D. Mudama, James Bottomley, linux-kernel,
	IDE/ATA development list, linux-scsi



Jeff Garzik wrote:
> Mark Lord wrote:
>> Eric D. Mudama wrote:
>>>
>>> Actually, it's possibly worse, since each failure in libata will 
>>> generate 3-4 retries.  With existing ATA error recovery in the 
>>> drives, that's about 3 seconds per retry on average, or 12 seconds 
>>> per failure.  Multiply that by the number of blocks past the error to 
>>> complete the request..
>>
>> It really beats the alternative of a forced reboot
>> due to, say, superblock I/O failing because it happened
>> to get merged with an unrelated I/O which then failed..
>> Etc..
> 
> 
> FWIW -- speaking generally -- I think there are inevitable areas where 
> libata error handling combined with SCSI error handling results in 
> suboptimal error handling.
> 
> Just creating a list of "<this behavior> should be handled <this way>, 
> but in reality is handled in <this silly way>" would be very helpful.

I agree - Tejun has done a great job at giving us a great base. Next step is to 
get clarity on what the types of errors are and how to differentiate between 
them (and maybe how that would change by class of device?).

> 
> Error handling is tough to get right, because the code is exercised so 
> infrequently.  Tejun has actually done an above-average job here, by 
> making device probe, hotplug and other "exceptions" go through the 
> libata EH code, thereby exercising the EH code more than one might 
> normally assume.
> 
> Some errors in libata probably should not be retried more than once, 
> when we have a definitive diagnosis.  Suggestions for improvements are 
> welcome.
> 
>     Jeff

One thing that we find really useful is to inject real errors into devices. Mark 
has some patches that let us inject media errors, we also bring back failed 
drives and run them through testing and occasionally get to use analyzers, etc 
to inject odd ball errors.

Hopefully, we will get some time to brainstorm about this at the workshop,

ric

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
  2007-01-31 14:36         ` Ric Wheeler
@ 2007-01-31 15:28           ` Douglas Gilbert
  2007-01-31 15:38             ` Mark Lord
  0 siblings, 1 reply; 34+ messages in thread
From: Douglas Gilbert @ 2007-01-31 15:28 UTC (permalink / raw)
  To: ric
  Cc: Jeff Garzik, Mark Lord, Eric D. Mudama, James Bottomley,
	linux-kernel, IDE/ATA development list, linux-scsi

Ric Wheeler wrote:
> 
> 
> Jeff Garzik wrote:
>> Mark Lord wrote:
>>> Eric D. Mudama wrote:
>>>>
>>>> Actually, it's possibly worse, since each failure in libata will
>>>> generate 3-4 retries.  With existing ATA error recovery in the
>>>> drives, that's about 3 seconds per retry on average, or 12 seconds
>>>> per failure.  Multiply that by the number of blocks past the error
>>>> to complete the request..
>>>
>>> It really beats the alternative of a forced reboot
>>> due to, say, superblock I/O failing because it happened
>>> to get merged with an unrelated I/O which then failed..
>>> Etc..
>>
>>
>> FWIW -- speaking generally -- I think there are inevitable areas where
>> libata error handling combined with SCSI error handling results in
>> suboptimal error handling.
>>
>> Just creating a list of "<this behavior> should be handled <this way>,
>> but in reality is handled in <this silly way>" would be very helpful.
> 
> I agree - Tejun has done a great job at giving us a great base. Next
> step is to get clarity on what the types of errors are and how to
> differentiate between them (and maybe how that would change by class of
> device?).
> 
>>
>> Error handling is tough to get right, because the code is exercised so
>> infrequently.  Tejun has actually done an above-average job here, by
>> making device probe, hotplug and other "exceptions" go through the
>> libata EH code, thereby exercising the EH code more than one might
>> normally assume.
>>
>> Some errors in libata probably should not be retried more than once,
>> when we have a definitive diagnosis.  Suggestions for improvements are
>> welcome.
>>
>>     Jeff
> 
> One thing that we find really useful is to inject real errors into
> devices. Mark has some patches that let us inject media errors, we also
> bring back failed drives and run them through testing and occasionally
> get to use analyzers, etc to inject odd ball errors.

Ric,
Both ATA (ATA8-ACS) and SCSI (SBC-3) have recently added
command support to flag a block as "uncorrectable". There
is no need to send bad "long" data to it and suppress the
disk's automatic re-allocation logic.

In the case of ATA it is the WRITE UNCORRECTABLE command.
In the case of SCSI it is the WR_UNCOR bit in the WRITE
LONG command.

It seems that due to SAT any useful capability in the ATA
command set will soon appear in the corresponding SCSI
command set, if it is not already there.

Doug Gilbert

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
  2007-01-31 15:28           ` Douglas Gilbert
@ 2007-01-31 15:38             ` Mark Lord
  0 siblings, 0 replies; 34+ messages in thread
From: Mark Lord @ 2007-01-31 15:38 UTC (permalink / raw)
  To: dougg
  Cc: ric, Jeff Garzik, Eric D. Mudama, James Bottomley, linux-kernel,
	IDE/ATA development list, linux-scsi

Douglas Gilbert wrote:
>
> Ric,
> Both ATA (ATA8-ACS) and SCSI (SBC-3) have recently added
> command support to flag a block as "uncorrectable". There
> is no need to send bad "long" data to it and suppress the
> disk's automatic re-allocation logic.

That'll be useful in a couple of years, once drives that have it
become more common.  For now, though, we're hacking current drives
using READ/WRITE LONG commands, with a corresponding patch to libata
to allow for the longer "sector size" involved.

Having real bad sectors, exactly where we want them on the media,
sure does make testing / fixing the EH mechanisms a lot more feasible.

Cheers

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
  2007-01-31  1:16 ` [PATCH] " James Bottomley
  2007-01-31  1:36   ` Mark Lord
       [not found]   ` <311601c90701301725n53d25a74g652b7ca3bfc64c56@mail.gmail.com>
@ 2007-02-01 20:02   ` Mark Lord
  2007-02-01 21:55     ` James Bottomley
  2 siblings, 1 reply; 34+ messages in thread
From: Mark Lord @ 2007-02-01 20:02 UTC (permalink / raw)
  To: James Bottomley; +Cc: linux-kernel, IDE/ATA development list, linux-scsi

James Bottomley wrote:
> On Tue, 2007-01-30 at 19:47 -0500, Mark Lord wrote:
>> Kernels since about 2.6.16 or so have been broken in this regard.
>> They "complete" the good sectors before the error,
>> and then fail the entire remaining portions of the request.
> 
> What was the commit that introduced the change? ... I have a vague
> memory of it being deliberate.

I believe you made the first change in response to my prodding at the time,
when libata was not returning valid sense data (no LBA) for media errors.
The SCSI EH handling of that was rather poor at the time,
and so having it not retry the remaining sectors was actually
a very good fix at the time.

But now, libata *does* return valid sense data for LBA/DMA drives,
and the workaround from circa 2.6.16 is no longer the best we can do.
Now that we know which sector failed, we ought to be able to skip
over it, and continue with the rest of the merged request.

One thing that could be even better than the patch below,
would be to have it perhaps skip the entire bio that includes
the failed sector, rather than only the bad sector itself.

I think doing that might address most concerns expressed here.
Have you got an alternate suggestion, James?

..
>> Signed-off-by:  Mark Lord <mlord@pobox.com>
>> ---
>> diff -u --recursive --new-file --exclude-from=linux_17//Documentation/dontdiff old/drivers/scsi/scsi_lib.c linux/drivers/scsi/scsi_lib.c
>> --- old/drivers/scsi/scsi_lib.c	2007-01-30 13:58:05.000000000 -0500
>> +++ linux/drivers/scsi/scsi_lib.c	2007-01-30 18:30:01.000000000 -0500
>> @@ -865,6 +865,12 @@
>>  	 */
>>  	if (sense_valid && !sense_deferred) {
>>  		switch (sshdr.sense_key) {
>> +		case MEDIUM_ERROR:
>> +			// Bad sector.  Fail it, and then continue the rest of the request:
>> +			if (scsi_end_request(cmd, 0, cmd->device->sector_size, 1) == NULL) {
> 
> The sense key may have come with additional information  I think we want
> to parse that (if it exists) rather than just blindly failing the first
> sector of the request.
> 
>> +				cmd->retries = 0;	// go around again..
>> +				return;
>> +			}
> 
> This would drop through to the UNIT_ATTENTION case if scsi_end_request()
> fails ... I don't think that's correct.
> 
>>  		case UNIT_ATTENTION:
>>  			if (cmd->device->removable) {
>>  				/* Detected disc change.  Set a bit
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-ide" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
  2007-02-01 20:02   ` Mark Lord
@ 2007-02-01 21:55     ` James Bottomley
  2007-02-02  2:48       ` Mark Lord
  2007-02-02 12:20       ` Ric Wheeler
  0 siblings, 2 replies; 34+ messages in thread
From: James Bottomley @ 2007-02-01 21:55 UTC (permalink / raw)
  To: Mark Lord; +Cc: linux-kernel, IDE/ATA development list, linux-scsi

On Thu, 2007-02-01 at 15:02 -0500, Mark Lord wrote:
> I believe you made the first change in response to my prodding at the time,
> when libata was not returning valid sense data (no LBA) for media errors.
> The SCSI EH handling of that was rather poor at the time,
> and so having it not retry the remaining sectors was actually
> a very good fix at the time.
> 
> But now, libata *does* return valid sense data for LBA/DMA drives,
> and the workaround from circa 2.6.16 is no longer the best we can do.
> Now that we know which sector failed, we ought to be able to skip
> over it, and continue with the rest of the merged request.

We can ... the big concern with your approach, which you haven't
addressed is the time factor.  For most SCSI devices, returning a fatal
MEDIUM ERROR means we're out of remapping table, and also that there's
probably a bunch of sectors on the track that are now out.  Thus, there
are almost always multiple sector failures.  In linux, the average
request size on a filesystem is around 64-128kb; thats 128-256 sectors.
If we fail at the initial sector, we have to go through another 128-256
attempts, with the internal device retries, before we fail the entire
request.  Some devices can take a second or so for each read before they
finally give up and decide they really can't read the sector, so you're
looking at 2-5 minutes before the machine finally fails this one
request ... and much worse for devices that retry more times.

> One thing that could be even better than the patch below,
> would be to have it perhaps skip the entire bio that includes
> the failed sector, rather than only the bad sector itself.

Er ... define "skip over the bio".  A bio is simply a block
representation for a bunch of sg elements coming in to the elevator.
Mostly what we see in SCSI is a single bio per request, so skipping the
bio is really the current behaviour (to fail the rest of the request).

> I think doing that might address most concerns expressed here.
> Have you got an alternate suggestion, James?

James

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
  2007-02-01 21:55     ` James Bottomley
@ 2007-02-02  2:48       ` Mark Lord
  2007-02-02 12:20       ` Ric Wheeler
  1 sibling, 0 replies; 34+ messages in thread
From: Mark Lord @ 2007-02-02  2:48 UTC (permalink / raw)
  To: James Bottomley; +Cc: linux-kernel, IDE/ATA development list, linux-scsi

James Bottomley wrote:
> On Thu, 2007-02-01 at 15:02 -0500, Mark Lord wrote:
>..
>> One thing that could be even better than the patch below,
>> would be to have it perhaps skip the entire bio that includes
>> the failed sector, rather than only the bad sector itself.
> 
> Er ... define "skip over the bio".  A bio is simply a block
> representation for a bunch of sg elements coming in to the elevator.

Exactly.  Or rather, a block of sg_elements from a single point
of request, is it not?

> Mostly what we see in SCSI is a single bio per request, so skipping the
> bio is really the current behaviour (to fail the rest of the request).

Very good.  That's what it's supposed to do.

But if each request contained only a single bio, then all of Jens'
work on IO scheduling would be for nothing, n'est-ce pas?

In the case where a request consists of multiple bio's
which have been merged under a single request struct,
we really should give at least one attempt to each bio.

This way, in most cases, only the process that requested the
failed sector(s) will see an error, not the innocent victims
that happened to get merged onto the end.  Which could be very
critical stuff (or not -- it could be quite random).

So the time factor works out to one disk I/O timeout per failed bio.
That's what would have happened with the NOP scheduler anyway.

On the sytems I'm working with, I don't see huge numbers of bad sectors.
What they tend to show is just one or two bad sectors, widely scattered.

So:
>> I think doing that might address most concerns expressed here.
>> Have you got an alternate suggestion, James?

Cheers

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
  2007-02-01 21:55     ` James Bottomley
  2007-02-02  2:48       ` Mark Lord
@ 2007-02-02 12:20       ` Ric Wheeler
  2007-02-02 14:42         ` Alan
  2007-02-02 14:50         ` Alan
  1 sibling, 2 replies; 34+ messages in thread
From: Ric Wheeler @ 2007-02-02 12:20 UTC (permalink / raw)
  To: James Bottomley
  Cc: Mark Lord, linux-kernel, IDE/ATA development list, linux-scsi



James Bottomley wrote:

>On Thu, 2007-02-01 at 15:02 -0500, Mark Lord wrote:
>  
>
>>I believe you made the first change in response to my prodding at the time,
>>when libata was not returning valid sense data (no LBA) for media errors.
>>The SCSI EH handling of that was rather poor at the time,
>>and so having it not retry the remaining sectors was actually
>>a very good fix at the time.
>>
>>But now, libata *does* return valid sense data for LBA/DMA drives,
>>and the workaround from circa 2.6.16 is no longer the best we can do.
>>Now that we know which sector failed, we ought to be able to skip
>>over it, and continue with the rest of the merged request.
>>    
>>
>
>We can ... the big concern with your approach, which you haven't
>addressed is the time factor.  For most SCSI devices, returning a fatal
>MEDIUM ERROR means we're out of remapping table, and also that there's
>probably a bunch of sectors on the track that are now out.  Thus, there
>are almost always multiple sector failures.  In linux, the average
>request size on a filesystem is around 64-128kb; thats 128-256 sectors.
>If we fail at the initial sector, we have to go through another 128-256
>attempts, with the internal device retries, before we fail the entire
>request.  Some devices can take a second or so for each read before they
>finally give up and decide they really can't read the sector, so you're
>looking at 2-5 minutes before the machine finally fails this one
>request ... and much worse for devices that retry more times.
>  
>

This is not the case on a read error - we commonly see transient errors 
on reads from disks.  What we push our vendors to do is to try to keep 
the "worst case" response down to tens of seconds as they retry, etc 
internally with a device. When they take that long (and they do), adding 
retries up the stack can translate into minutes per sector.

The interesting point of this question is about the typically pattern of 
IO errors. On a read, it is safe to assume that you will have issues 
with some bounded numbers of adjacent sectors.

>  
>
>>One thing that could be even better than the patch below,
>>would be to have it perhaps skip the entire bio that includes
>>the failed sector, rather than only the bad sector itself.
>>    
>>
>
>Er ... define "skip over the bio".  A bio is simply a block
>representation for a bunch of sg elements coming in to the elevator.
>Mostly what we see in SCSI is a single bio per request, so skipping the
>bio is really the current behaviour (to fail the rest of the request).
>  
>
This is really a tricky one - what can happen when we fail merged IO 
requests is really unpredictable behavior up at the application level 
since the IO error might not be at all relevant to my part of the 
request.  Merging can produce a request that is much larger than any 
normal drive error.

I really like the idea of being able to set this kind of policy on a per 
drive instance since what you want here will change depending on what 
your system requirements are, what the system is trying to do (i.e., 
when trying to recover a failing but not dead yet disk, IO errors should 
be as quick as possible and we should choose an IO scheduler that does 
not combine IO's).

>  
>
>>I think doing that might address most concerns expressed here.
>>Have you got an alternate suggestion, James?
>>    
>>
>
>James
>  
>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
  2007-02-02 12:20       ` Ric Wheeler
@ 2007-02-02 14:42         ` Alan
  2007-02-02 14:53           ` James Bottomley
  2007-02-02 20:16           ` Douglas Gilbert
  2007-02-02 14:50         ` Alan
  1 sibling, 2 replies; 34+ messages in thread
From: Alan @ 2007-02-02 14:42 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: James Bottomley, Mark Lord, linux-kernel,
	IDE/ATA development list, linux-scsi

> The interesting point of this question is about the typically pattern of 
> IO errors. On a read, it is safe to assume that you will have issues 
> with some bounded numbers of adjacent sectors.

Which in theory you can get by asking the drive for the real sector size
from the ATA7 info. (We ought to dig this out more as its relevant for
partition layout too).

> I really like the idea of being able to set this kind of policy on a per 
> drive instance since what you want here will change depending on what 
> your system requirements are, what the system is trying to do (i.e., 
> when trying to recover a failing but not dead yet disk, IO errors should 
> be as quick as possible and we should choose an IO scheduler that does 
> not combine IO's).

That seems to be arguing for a bounded "live" time including retry run
time for a command. That's also more intuitive for real time work and for
end user setup. "Either work or fail within n seconds"


Alan

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
  2007-02-02 14:42         ` Alan
@ 2007-02-02 14:53           ` James Bottomley
  2007-02-02 16:16             ` Ric Wheeler
  2007-02-02 20:16           ` Douglas Gilbert
  1 sibling, 1 reply; 34+ messages in thread
From: James Bottomley @ 2007-02-02 14:53 UTC (permalink / raw)
  To: Alan
  Cc: Ric Wheeler, Mark Lord, linux-kernel, IDE/ATA development list,
	linux-scsi

On Fri, 2007-02-02 at 14:42 +0000, Alan wrote:
> > The interesting point of this question is about the typically pattern of 
> > IO errors. On a read, it is safe to assume that you will have issues 
> > with some bounded numbers of adjacent sectors.
> 
> Which in theory you can get by asking the drive for the real sector size
> from the ATA7 info. (We ought to dig this out more as its relevant for
> partition layout too).
> 
> > I really like the idea of being able to set this kind of policy on a per 
> > drive instance since what you want here will change depending on what 
> > your system requirements are, what the system is trying to do (i.e., 
> > when trying to recover a failing but not dead yet disk, IO errors should 
> > be as quick as possible and we should choose an IO scheduler that does 
> > not combine IO's).
> 
> That seems to be arguing for a bounded "live" time including retry run
> time for a command. That's also more intuitive for real time work and for
> end user setup. "Either work or fail within n seconds"

Actually, then I think perhaps we use the allowed retries for this ...

So you would fail a single sector and count it against the retries.
When you've done this allowed retries times, you fail the rest of the
request.

James



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
  2007-02-02 14:53           ` James Bottomley
@ 2007-02-02 16:16             ` Ric Wheeler
  0 siblings, 0 replies; 34+ messages in thread
From: Ric Wheeler @ 2007-02-02 16:16 UTC (permalink / raw)
  To: James Bottomley
  Cc: Alan, Mark Lord, linux-kernel, IDE/ATA development list,
	linux-scsi



James Bottomley wrote:
> On Fri, 2007-02-02 at 14:42 +0000, Alan wrote:
>   
>>> The interesting point of this question is about the typically pattern of 
>>> IO errors. On a read, it is safe to assume that you will have issues 
>>> with some bounded numbers of adjacent sectors.
>>>       
>> Which in theory you can get by asking the drive for the real sector size
>> from the ATA7 info. (We ought to dig this out more as its relevant for
>> partition layout too).
>>     

Actually, my point is that damage typically impacts a cluster of disk 
sectors that are adjacent. Think of a drive that has junk on the platter 
or a some such thing - the contamination is likely to be localized.
>>     
>>> I really like the idea of being able to set this kind of policy on a per 
>>> drive instance since what you want here will change depending on what 
>>> your system requirements are, what the system is trying to do (i.e., 
>>> when trying to recover a failing but not dead yet disk, IO errors should 
>>> be as quick as possible and we should choose an IO scheduler that does 
>>> not combine IO's).
>>>       
>> That seems to be arguing for a bounded "live" time including retry run
>> time for a command. That's also more intuitive for real time work and for
>> end user setup. "Either work or fail within n seconds"
>>     
>
> Actually, then I think perhaps we use the allowed retries for this ...
>   
I really am not a big retry fan for most modern drives - the drive will 
try really, really hard to complete an IO for us and multiple retries 
can just slow down the higher level application from recovering.
> So you would fail a single sector and count it against the retries.
> When you've done this allowed retries times, you fail the rest of the
> request.
>
> James
>
>   
I think that we need to play with some of these possible solutions on 
some  real-world bad drives and see how they react. 

We should definitely talk more about this at the workshop ;-)

ric


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
  2007-02-02 14:42         ` Alan
  2007-02-02 14:53           ` James Bottomley
@ 2007-02-02 20:16           ` Douglas Gilbert
  1 sibling, 0 replies; 34+ messages in thread
From: Douglas Gilbert @ 2007-02-02 20:16 UTC (permalink / raw)
  To: Alan
  Cc: Ric Wheeler, James Bottomley, Mark Lord, linux-kernel,
	IDE/ATA development list, linux-scsi

Alan wrote:
>> The interesting point of this question is about the typically pattern of 
>> IO errors. On a read, it is safe to assume that you will have issues 
>> with some bounded numbers of adjacent sectors.
> 
> Which in theory you can get by asking the drive for the real sector size
> from the ATA7 info. (We ought to dig this out more as its relevant for
> partition layout too).
> 
>> I really like the idea of being able to set this kind of policy on a per 
>> drive instance since what you want here will change depending on what 
>> your system requirements are, what the system is trying to do (i.e., 
>> when trying to recover a failing but not dead yet disk, IO errors should 
>> be as quick as possible and we should choose an IO scheduler that does 
>> not combine IO's).
> 
> That seems to be arguing for a bounded "live" time including retry run
> time for a command. That's also more intuitive for real time work and for
> end user setup. "Either work or fail within n seconds"

Which is more or less the "streaming" feature set in recent
ATA standards. [Alas, streaming and NCQ/TCQ can't be done
with the same access.] SCSI has its Read Write Error Recovery
mode page which doesn't have timeouts but does have Read
and Write Retry Counts amongst other fields that control
the amount (and indirectly the time) of attempted error
recovery.

Doug Gilbert



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
  2007-02-02 12:20       ` Ric Wheeler
  2007-02-02 14:42         ` Alan
@ 2007-02-02 14:50         ` Alan
  2007-02-02 16:06           ` Mark Lord
  1 sibling, 1 reply; 34+ messages in thread
From: Alan @ 2007-02-02 14:50 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: James Bottomley, Mark Lord, linux-kernel,
	IDE/ATA development list, linux-scsi

> your system requirements are, what the system is trying to do (i.e., 
> when trying to recover a failing but not dead yet disk, IO errors should 
> be as quick as possible and we should choose an IO scheduler that does 
> not combine IO's).

If this is the right strategy for disk recovery for a given type of
device then this ought to be an automatic strategy. Most end users will
not have the knowledge to frob about in sysfs, and if the bad sector hits
at the wrong moment a sensible automatic recovery strategy is going to do
the right thing faster than human intervention, which may be some hours
away.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
  2007-02-02 14:50         ` Alan
@ 2007-02-02 16:06           ` Mark Lord
  2007-02-02 19:49             ` Matt Mackall
  0 siblings, 1 reply; 34+ messages in thread
From: Mark Lord @ 2007-02-02 16:06 UTC (permalink / raw)
  To: Alan
  Cc: Ric Wheeler, James Bottomley, linux-kernel,
	IDE/ATA development list, linux-scsi

Alan wrote:
>
> If this is the right strategy for disk recovery for a given type of
> device then this ought to be an automatic strategy. Most end users will
> not have the knowledge to frob about in sysfs, and if the bad sector hits
> at the wrong moment a sensible automatic recovery strategy is going to do
> the right thing faster than human intervention, which may be some hours
> away.

I think something we seem to be discussing here are the opposite aims
of enterprise RAID (traditional SCSI market) versus desktop filesystems
(now the number one user of Linux SCSI, courtesy of libata).

With RAID, it's safe to suggest that a very fast, bounded exit from EH
is almost always what the customer / end-user wants, because the higher
level RAID management can then deal with the failed drive offline.

But for a desktop filesystem, failing out quickly and bulk-failing megabytes
around a couple of bad sectors is not good -- no redundancy, and this will
lead to unneeded data loss.

It's beginning to look like this needs to be run-time tuneable,
per block minor device (per-partition), through sysfs at a minimum.
The RAID tools could then automatically choose settings to bias more
towards an "instant exit" when errors are found, whereas a non-RAID
desktop could select a more reliable recovery strategy.

Right now, with Jame's patch (earlier in this thread), the whole scheme
is heavily weighted towards the RAID "instant exit" strategy, which is
making me quite nervous about the data on my laptop.

Cheers 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
  2007-02-02 16:06           ` Mark Lord
@ 2007-02-02 19:49             ` Matt Mackall
  2007-02-02 22:58               ` Mark Lord
  0 siblings, 1 reply; 34+ messages in thread
From: Matt Mackall @ 2007-02-02 19:49 UTC (permalink / raw)
  To: Mark Lord
  Cc: Alan, Ric Wheeler, James Bottomley, linux-kernel,
	IDE/ATA development list, linux-scsi

On Fri, Feb 02, 2007 at 11:06:19AM -0500, Mark Lord wrote:
> Alan wrote:
> >
> >If this is the right strategy for disk recovery for a given type of
> >device then this ought to be an automatic strategy. Most end users will
> >not have the knowledge to frob about in sysfs, and if the bad sector hits
> >at the wrong moment a sensible automatic recovery strategy is going to do
> >the right thing faster than human intervention, which may be some hours
> >away.
> 
> I think something we seem to be discussing here are the opposite aims
> of enterprise RAID (traditional SCSI market) versus desktop filesystems
> (now the number one user of Linux SCSI, courtesy of libata).
> 
> With RAID, it's safe to suggest that a very fast, bounded exit from EH
> is almost always what the customer / end-user wants, because the higher
> level RAID management can then deal with the failed drive offline.
> 
> But for a desktop filesystem, failing out quickly and bulk-failing megabytes
> around a couple of bad sectors is not good -- no redundancy, and this will
> lead to unneeded data loss.
> 
> It's beginning to look like this needs to be run-time tuneable,
> per block minor device (per-partition), through sysfs at a minimum.
> The RAID tools could then automatically choose settings to bias more
> towards an "instant exit" when errors are found, whereas a non-RAID
> desktop could select a more reliable recovery strategy.
> 
> Right now, with Jame's patch (earlier in this thread), the whole scheme
> is heavily weighted towards the RAID "instant exit" strategy, which is
> making me quite nervous about the data on my laptop.

Also worth considering is that spending minutes trying to reread
damaged sectors is likely to accelerate your death spiral. More data
may be recoverable if you give up quickly in a first pass, then go
back and manually retry damaged bits with smaller I/Os.

Another approach is to have separate retries per hard sector and
hard errors per MB at the block device level. We don't want to have
the same number of retries for a 64KB block as a 1MB block and we
certainly don't want to do 2K retries in a row.

So for instance, I could have the first set to 1 and the second set to
16. For a 256KB read, this gets scaled down to 4, which means we'll
retry each sector once and fail the whole I/O when we hit the 5th
sector error.

More reasonable number might be 0 and 1, meaning "don't do OS level
retries on sectors and fail the whole I/O on the second sector error
(or immediately for smaller reads)".

It might also be informative to add a kernel message reporting if a retry
(in the non-transient cases) ever actually succeeds. 

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
  2007-02-02 19:49             ` Matt Mackall
@ 2007-02-02 22:58               ` Mark Lord
  2007-02-02 23:07                 ` Matt Mackall
  0 siblings, 1 reply; 34+ messages in thread
From: Mark Lord @ 2007-02-02 22:58 UTC (permalink / raw)
  To: Matt Mackall
  Cc: Alan, Ric Wheeler, James Bottomley, linux-kernel,
	IDE/ATA development list, linux-scsi

Matt Mackall wrote:
> ..
> Also worth considering is that spending minutes trying to reread
> damaged sectors is likely to accelerate your death spiral. More data
> may be recoverable if you give up quickly in a first pass, then go
> back and manually retry damaged bits with smaller I/Os.

All good input.  But what was being debated here is not so much
the retrying of known-bad sectors, but rather what to do about
the kiBs or MiBs of sectors remaining in a merged request after
hitting a single bad sector mid-way.

Currently, SCSI just abandons the entire remaining workload.

Cheers

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
  2007-02-02 22:58               ` Mark Lord
@ 2007-02-02 23:07                 ` Matt Mackall
  0 siblings, 0 replies; 34+ messages in thread
From: Matt Mackall @ 2007-02-02 23:07 UTC (permalink / raw)
  To: Mark Lord
  Cc: Alan, Ric Wheeler, James Bottomley, linux-kernel,
	IDE/ATA development list, linux-scsi

On Fri, Feb 02, 2007 at 05:58:04PM -0500, Mark Lord wrote:
> Matt Mackall wrote:
> >..
> >Also worth considering is that spending minutes trying to reread
> >damaged sectors is likely to accelerate your death spiral. More data
> >may be recoverable if you give up quickly in a first pass, then go
> >back and manually retry damaged bits with smaller I/Os.
> 
> All good input.  But what was being debated here is not so much
> the retrying of known-bad sectors, but rather what to do about
> the kiBs or MiBs of sectors remaining in a merged request after
> hitting a single bad sector mid-way.

Yep, that's precisely what was addressed in the part you snipped. 

My main point being that what to do about the remaining workload
should be dependent on the size of the I/O. If we encounter errors on
sectors 4,5,6,7,8.. of a 1MB request, we should have a threshold for
giving up. It's not unreasonable for that threshold to be larger than
1, but it should not be 2048.

And if we do the I/O as four 256KB requests, we should have
approximately the same number of retries (assuming the whole region's
bad).

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2007-02-02 23:07 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-01-31  0:47 [PATCH] scsi_lib.c: continue after MEDIUM_ERROR Mark Lord
2007-01-31  1:12 ` [PATCH] RESEND " Mark Lord
2007-01-31  1:16 ` [PATCH] " James Bottomley
2007-01-31  1:36   ` Mark Lord
     [not found]   ` <311601c90701301725n53d25a74g652b7ca3bfc64c56@mail.gmail.com>
2007-01-31  1:41     ` Mark Lord
2007-01-31  3:20       ` Ric Wheeler
2007-01-31  4:21         ` James Bottomley
2007-01-31 15:13           ` Mark Lord
2007-01-31 15:22             ` Mark Lord
2007-01-31 15:24             ` James Bottomley
2007-01-31  5:09         ` Douglas Gilbert
2007-01-31 15:08         ` Mark Lord
2007-01-31 15:23           ` Alan
2007-01-31 16:35             ` Ric Wheeler
2007-01-31 17:57             ` Mark Lord
2007-01-31 18:13               ` James Bottomley
2007-01-31 18:37                 ` Mark Lord
2007-01-31  9:30       ` Jeff Garzik
2007-01-31 14:36         ` Ric Wheeler
2007-01-31 15:28           ` Douglas Gilbert
2007-01-31 15:38             ` Mark Lord
2007-02-01 20:02   ` Mark Lord
2007-02-01 21:55     ` James Bottomley
2007-02-02  2:48       ` Mark Lord
2007-02-02 12:20       ` Ric Wheeler
2007-02-02 14:42         ` Alan
2007-02-02 14:53           ` James Bottomley
2007-02-02 16:16             ` Ric Wheeler
2007-02-02 20:16           ` Douglas Gilbert
2007-02-02 14:50         ` Alan
2007-02-02 16:06           ` Mark Lord
2007-02-02 19:49             ` Matt Mackall
2007-02-02 22:58               ` Mark Lord
2007-02-02 23:07                 ` Matt Mackall

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).