* [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
@ 2007-01-31 0:47 Mark Lord
2007-01-31 1:12 ` [PATCH] RESEND " Mark Lord
2007-01-31 1:16 ` [PATCH] " James Bottomley
0 siblings, 2 replies; 34+ messages in thread
From: Mark Lord @ 2007-01-31 0:47 UTC (permalink / raw)
To: linux-kernel, IDE/ATA development list, James Bottomley
In ancient kernels, the SCSI disk code used to continue after
encountering a MEDIUM_ERROR. It would "complete" the good
sectors before the error, fail the bad sector/block, and then
continue with the rest of the request.
Kernels since about 2.6.16 or so have been broken in this regard.
They "complete" the good sectors before the error,
and then fail the entire remaining portions of the request.
This is very risky behaviour, as a request is often a merge
of several bios, and just because one application hits a bad sector
is no reason to pretend that (for example) an adjacent directly lookup also failed.
This patch fixes the behaviour to be similar to what we had originally.
When a bad sector is encounted, SCSI will now work around it again,
failing *only* the bad sector itself.
Signed-off-by: Mark Lord <mlord@pobox.com>
---
diff -u --recursive --new-file --exclude-from=linux_17//Documentation/dontdiff old/drivers/scsi/scsi_lib.c linux/drivers/scsi/scsi_lib.c
--- old/drivers/scsi/scsi_lib.c 2007-01-30 13:58:05.000000000 -0500
+++ linux/drivers/scsi/scsi_lib.c 2007-01-30 18:30:01.000000000 -0500
@@ -865,6 +865,12 @@
*/
if (sense_valid && !sense_deferred) {
switch (sshdr.sense_key) {
+ case MEDIUM_ERROR:
+ // Bad sector. Fail it, and then continue the rest of the request:
+ if (scsi_end_request(cmd, 0, cmd->device->sector_size, 1) == NULL) {
+ cmd->retries = 0; // go around again..
+ return;
+ }
case UNIT_ATTENTION:
if (cmd->device->removable) {
/* Detected disc change. Set a bit
^ permalink raw reply [flat|nested] 34+ messages in thread* [PATCH] RESEND scsi_lib.c: continue after MEDIUM_ERROR 2007-01-31 0:47 [PATCH] scsi_lib.c: continue after MEDIUM_ERROR Mark Lord @ 2007-01-31 1:12 ` Mark Lord 2007-01-31 1:16 ` [PATCH] " James Bottomley 1 sibling, 0 replies; 34+ messages in thread From: Mark Lord @ 2007-01-31 1:12 UTC (permalink / raw) To: linux-kernel, IDE/ATA development list, James Bottomley, linux-scsi Fixed for 80-columns, and copying linux-scsi this time. In ancient kernels, the SCSI disk code used to continue after encountering a MEDIUM_ERROR. It would "complete" the good sectors before the error, fail the bad sector/block, and then continue with the rest of the request. Kernels since about 2.6.16 or so have been broken in this regard. They "complete" the good sectors before the error, and then fail the entire remaining portions of the request. This is very risky behaviour, as a request is often a merge of several bios, and just because one application hits a bad sector is no reason to pretend that (for example) an adjacent directly lookup also failed. This patch fixes the behaviour to be similar to what we had originally. When a bad sector is encounted, SCSI will now work around it again, failing *only* the bad sector itself. Signed-off-by: Mark Lord <mlord@pobox.com> --- --- old/drivers/scsi/scsi_lib.c 2007-01-30 20:06:15.000000000 -0500 +++ linux/drivers/scsi/scsi_lib.c 2007-01-30 20:06:59.000000000 -0500 @@ -865,6 +865,13 @@ */ if (sense_valid && !sense_deferred) { switch (sshdr.sense_key) { + case MEDIUM_ERROR: + /* Bad sector. Fail it, and continue on with the rest */ + if (scsi_end_request(cmd, 0, + cmd->device->sector_size, 1) == NULL) { + cmd->retries = 0; /* go around again.. */ + return; + } case UNIT_ATTENTION: if (cmd->device->removable) { /* Detected disc change. Set a bit ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR 2007-01-31 0:47 [PATCH] scsi_lib.c: continue after MEDIUM_ERROR Mark Lord 2007-01-31 1:12 ` [PATCH] RESEND " Mark Lord @ 2007-01-31 1:16 ` James Bottomley 2007-01-31 1:36 ` Mark Lord ` (2 more replies) 1 sibling, 3 replies; 34+ messages in thread From: James Bottomley @ 2007-01-31 1:16 UTC (permalink / raw) To: Mark Lord; +Cc: linux-kernel, IDE/ATA development list, linux-scsi First off, please send SCSI patches to the SCSI list: <linux-scsi@vger.kernel.org> On Tue, 2007-01-30 at 19:47 -0500, Mark Lord wrote: > In ancient kernels, the SCSI disk code used to continue after > encountering a MEDIUM_ERROR. It would "complete" the good > sectors before the error, fail the bad sector/block, and then > continue with the rest of the request. > > Kernels since about 2.6.16 or so have been broken in this regard. > They "complete" the good sectors before the error, > and then fail the entire remaining portions of the request. What was the commit that introduced the change? ... I have a vague memory of it being deliberate. > This is very risky behaviour, as a request is often a merge > of several bios, and just because one application hits a bad sector > is no reason to pretend that (for example) an adjacent directly lookup also failed. > > This patch fixes the behaviour to be similar to what we had originally. > > When a bad sector is encounted, SCSI will now work around it again, > failing *only* the bad sector itself. Erm, but the corollary is that if we get a large read failure because of a bad track, you're going to try and chunk up it a sector at a time forcing an individual error for each sector is going to annoy some people ... particularly removable medium ones which return this error if the medium isn't present ... Are you sure this is really what we want to do? > Signed-off-by: Mark Lord <mlord@pobox.com> > --- > diff -u --recursive --new-file --exclude-from=linux_17//Documentation/dontdiff old/drivers/scsi/scsi_lib.c linux/drivers/scsi/scsi_lib.c > --- old/drivers/scsi/scsi_lib.c 2007-01-30 13:58:05.000000000 -0500 > +++ linux/drivers/scsi/scsi_lib.c 2007-01-30 18:30:01.000000000 -0500 > @@ -865,6 +865,12 @@ > */ > if (sense_valid && !sense_deferred) { > switch (sshdr.sense_key) { > + case MEDIUM_ERROR: > + // Bad sector. Fail it, and then continue the rest of the request: > + if (scsi_end_request(cmd, 0, cmd->device->sector_size, 1) == NULL) { The sense key may have come with additional information I think we want to parse that (if it exists) rather than just blindly failing the first sector of the request. > + cmd->retries = 0; // go around again.. > + return; > + } This would drop through to the UNIT_ATTENTION case if scsi_end_request() fails ... I don't think that's correct. > case UNIT_ATTENTION: > if (cmd->device->removable) { > /* Detected disc change. Set a bit ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR 2007-01-31 1:16 ` [PATCH] " James Bottomley @ 2007-01-31 1:36 ` Mark Lord [not found] ` <311601c90701301725n53d25a74g652b7ca3bfc64c56@mail.gmail.com> 2007-02-01 20:02 ` Mark Lord 2 siblings, 0 replies; 34+ messages in thread From: Mark Lord @ 2007-01-31 1:36 UTC (permalink / raw) To: James Bottomley; +Cc: linux-kernel, IDE/ATA development list, linux-scsi James Bottomley wrote: > First off, please send SCSI patches to the SCSI list: > <linux-scsi@vger.kernel.org> Fixed already, thanks! >> This patch fixes the behaviour to be similar to what we had originally. >> >> When a bad sector is encounted, SCSI will now work around it again, >> failing *only* the bad sector itself. > > Erm, but the corollary is that if we get a large read failure because of > a bad track, you're going to try and chunk up it a sector at a time That's better than the huge data-loss scenario that we currently have for single-sector errors. MUCH better. > forcing an individual error for each sector is going to annoy some > people ... particularly removable medium ones which return this error if > the medium isn't present ... Are you sure this is really what we want to > do? No, for removed-medium everything just fails right away. This patch is *only* for media errors, not any other failures. Cheers ^ permalink raw reply [flat|nested] 34+ messages in thread
[parent not found: <311601c90701301725n53d25a74g652b7ca3bfc64c56@mail.gmail.com>]
* Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR [not found] ` <311601c90701301725n53d25a74g652b7ca3bfc64c56@mail.gmail.com> @ 2007-01-31 1:41 ` Mark Lord 2007-01-31 3:20 ` Ric Wheeler 2007-01-31 9:30 ` Jeff Garzik 0 siblings, 2 replies; 34+ messages in thread From: Mark Lord @ 2007-01-31 1:41 UTC (permalink / raw) To: Eric D. Mudama Cc: James Bottomley, linux-kernel, IDE/ATA development list, linux-scsi Eric D. Mudama wrote: > > Actually, it's possibly worse, since each failure in libata will > generate 3-4 retries. With existing ATA error recovery in the drives, > that's about 3 seconds per retry on average, or 12 seconds per failure. > Multiply that by the number of blocks past the error to complete the > request.. It really beats the alternative of a forced reboot due to, say, superblock I/O failing because it happened to get merged with an unrelated I/O which then failed.. Etc.. Definitely an improvement. The number of retries is an entirely separate issue. If we really care about it, then we should fix SD_MAX_RETRIES. The current value of 5 is *way* too high. It should be zero or one. Cheers ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR 2007-01-31 1:41 ` Mark Lord @ 2007-01-31 3:20 ` Ric Wheeler 2007-01-31 4:21 ` James Bottomley ` (2 more replies) 2007-01-31 9:30 ` Jeff Garzik 1 sibling, 3 replies; 34+ messages in thread From: Ric Wheeler @ 2007-01-31 3:20 UTC (permalink / raw) To: Mark Lord Cc: Eric D. Mudama, James Bottomley, linux-kernel, IDE/ATA development list, linux-scsi Mark Lord wrote: > Eric D. Mudama wrote: > >> >> Actually, it's possibly worse, since each failure in libata will >> generate 3-4 retries. With existing ATA error recovery in the >> drives, that's about 3 seconds per retry on average, or 12 seconds >> per failure. Multiply that by the number of blocks past the error to >> complete the request.. > > > It really beats the alternative of a forced reboot > due to, say, superblock I/O failing because it happened > to get merged with an unrelated I/O which then failed.. > Etc.. > > Definitely an improvement. > > The number of retries is an entirely separate issue. > If we really care about it, then we should fix SD_MAX_RETRIES. > > The current value of 5 is *way* too high. It should be zero or one. > > Cheers > I think that drives retry enough, we should leave retry at zero for normal (non-removable) drives. Should this be a policy we can set like we do with NCQ queue depth via /sys ? We need to be able to layer things like MD on top of normal drive errors in a way that will produce a system that provides reasonable response time despite any possible IO error on a single component. Another case that we end up doing on a regular basis is drive recovery. Errors need to be limited in scope to just the impacted area and dispatched up to the application layer as quickly as we can so that you don't spend days watching a copy of huge drive (think 750GB or more) ;-) ric ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR 2007-01-31 3:20 ` Ric Wheeler @ 2007-01-31 4:21 ` James Bottomley 2007-01-31 15:13 ` Mark Lord 2007-01-31 5:09 ` Douglas Gilbert 2007-01-31 15:08 ` Mark Lord 2 siblings, 1 reply; 34+ messages in thread From: James Bottomley @ 2007-01-31 4:21 UTC (permalink / raw) To: Ric Wheeler Cc: Mark Lord, Eric D. Mudama, linux-kernel, IDE/ATA development list, linux-scsi On Tue, 2007-01-30 at 22:20 -0500, Ric Wheeler wrote: > Mark Lord wrote: > > The number of retries is an entirely separate issue. > > If we really care about it, then we should fix SD_MAX_RETRIES. > > > > The current value of 5 is *way* too high. It should be zero or one. > > > > Cheers > > > I think that drives retry enough, we should leave retry at zero for > normal (non-removable) drives. Should this be a policy we can set like > we do with NCQ queue depth via /sys ? I don't disagree that it should be settable. However, retries occur for other reasons than failures inside the device. The most standard ones are unit attentions generated because of other activity (target reset etc). The key to the problem is retrying only operations that are genuinely retryable, which the mid-layer doesn't do such a good job on. > We need to be able to layer things like MD on top of normal drive errors > in a way that will produce a system that provides reasonable response > time despite any possible IO error on a single component. Another case > that we end up doing on a regular basis is drive recovery. Errors need > to be limited in scope to just the impacted area and dispatched up to > the application layer as quickly as we can so that you don't spend days > watching a copy of huge drive (think 750GB or more) ;-) For the MD case, this is what REQ_FAILFAST is for. James ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR 2007-01-31 4:21 ` James Bottomley @ 2007-01-31 15:13 ` Mark Lord 2007-01-31 15:22 ` Mark Lord 2007-01-31 15:24 ` James Bottomley 0 siblings, 2 replies; 34+ messages in thread From: Mark Lord @ 2007-01-31 15:13 UTC (permalink / raw) To: James Bottomley Cc: Ric Wheeler, Eric D. Mudama, linux-kernel, IDE/ATA development list, linux-scsi, dougg James Bottomley wrote: > > For the MD case, this is what REQ_FAILFAST is for. I cannot find where SCSI honours that flag. James? And for that matter, even when I patch SCSI so that it *does* honour it, I don't actually see the flag making it into the SCSI layer from above. And I don't see where/how the block layer takes care when considering merge FAILFAST/READA requests with non FAILFAST/READA requests. To me, it looks perfectly happy to add non-FAILFAST/READA bios to a FAILFAST request, risking data loss if a lower-layer decides to honour the FAILFAST/READA flags. So it's a pretty Good Thing(tm) that SCSI doesn't currently honour it. ;) ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR 2007-01-31 15:13 ` Mark Lord @ 2007-01-31 15:22 ` Mark Lord 2007-01-31 15:24 ` James Bottomley 1 sibling, 0 replies; 34+ messages in thread From: Mark Lord @ 2007-01-31 15:22 UTC (permalink / raw) To: James Bottomley Cc: Ric Wheeler, Eric D. Mudama, linux-kernel, IDE/ATA development list, linux-scsi, dougg Mark Lord wrote: > James Bottomley wrote: >> >> For the MD case, this is what REQ_FAILFAST is for. > > I cannot find where SCSI honours that flag. James? Scratch that thought.. SCSI honours it in scsi_end_request(). But I'm not certain that the block layer handles it correctly, at least not in the 2.6.16/2.6.18 kernel that I'm working with today. Cheers ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR 2007-01-31 15:13 ` Mark Lord 2007-01-31 15:22 ` Mark Lord @ 2007-01-31 15:24 ` James Bottomley 1 sibling, 0 replies; 34+ messages in thread From: James Bottomley @ 2007-01-31 15:24 UTC (permalink / raw) To: Mark Lord Cc: Ric Wheeler, Eric D. Mudama, linux-kernel, IDE/ATA development list, linux-scsi, dougg On Wed, 2007-01-31 at 10:13 -0500, Mark Lord wrote: > James Bottomley wrote: > > > > For the MD case, this is what REQ_FAILFAST is for. > I cannot find where SCSI honours that flag. James? Er, it's in scsi_error.c:scsi_decide_disposition(): maybe_retry: /* we requeue for retry because the error was retryable, and * the request was not marked fast fail. Note that above, * even if the request is marked fast fail, we still requeue * for queue congestion conditions (QUEUE_FULL or BUSY) */ if ((++scmd->retries) <= scmd->allowed && !blk_noretry_request(scmd->request)) { ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ return NEEDS_RETRY; } else { /* * no more retries - report this one back to upper level. */ return SUCCESS; } > And for that matter, even when I patch SCSI so that it *does* honour it, > I don't actually see the flag making it into the SCSI layer from above. > > And I don't see where/how the block layer takes care when considering > merge FAILFAST/READA requests with non FAILFAST/READA requests. > To me, it looks perfectly happy to add non-FAILFAST/READA bios > to a FAILFAST request, risking data loss if a lower-layer decides > to honour the FAILFAST/READA flags. > > So it's a pretty Good Thing(tm) that SCSI doesn't currently honour it. ;) James ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR 2007-01-31 3:20 ` Ric Wheeler 2007-01-31 4:21 ` James Bottomley @ 2007-01-31 5:09 ` Douglas Gilbert 2007-01-31 15:08 ` Mark Lord 2 siblings, 0 replies; 34+ messages in thread From: Douglas Gilbert @ 2007-01-31 5:09 UTC (permalink / raw) To: Ric Wheeler Cc: Mark Lord, Eric D. Mudama, James Bottomley, linux-kernel, IDE/ATA development list, linux-scsi Ric Wheeler wrote: > > > Mark Lord wrote: > >> Eric D. Mudama wrote: >> >>> >>> Actually, it's possibly worse, since each failure in libata will >>> generate 3-4 retries. With existing ATA error recovery in the >>> drives, that's about 3 seconds per retry on average, or 12 seconds >>> per failure. Multiply that by the number of blocks past the error to >>> complete the request.. >> >> >> It really beats the alternative of a forced reboot >> due to, say, superblock I/O failing because it happened >> to get merged with an unrelated I/O which then failed.. >> Etc.. >> >> Definitely an improvement. >> >> The number of retries is an entirely separate issue. >> If we really care about it, then we should fix SD_MAX_RETRIES. >> >> The current value of 5 is *way* too high. It should be zero or one. >> >> Cheers >> > I think that drives retry enough, we should leave retry at zero for > normal (non-removable) drives. Should this be a policy we can set like > we do with NCQ queue depth via /sys ? The transport might also want a say. I see ABORTED COMMAND errors often enough with SAS (e.g. due to expander congestion) to warrant at least one retry (which works in my testing). SATA disks behind SAS infrastructure would also be susceptible to the same "random" failures. Transport Layer Retries (TLR) in SAS should remove this class of transport errors but only SAS tape drives support TLR as far as I know. Doug Gilbert > We need to be able to layer things like MD on top of normal drive errors > in a way that will produce a system that provides reasonable response > time despite any possible IO error on a single component. Another case > that we end up doing on a regular basis is drive recovery. Errors need > to be limited in scope to just the impacted area and dispatched up to > the application layer as quickly as we can so that you don't spend days > watching a copy of huge drive (think 750GB or more) ;-) > > ric ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR 2007-01-31 3:20 ` Ric Wheeler 2007-01-31 4:21 ` James Bottomley 2007-01-31 5:09 ` Douglas Gilbert @ 2007-01-31 15:08 ` Mark Lord 2007-01-31 15:23 ` Alan 2 siblings, 1 reply; 34+ messages in thread From: Mark Lord @ 2007-01-31 15:08 UTC (permalink / raw) To: Ric Wheeler Cc: Eric D. Mudama, James Bottomley, linux-kernel, IDE/ATA development list, linux-scsi, dougg Ric Wheeler wrote: > Mark Lord wrote: >> Eric D. Mudama wrote: >>> Actually, it's possibly worse, since each failure in libata will >>> generate 3-4 retries. (note: libata does *not* generate retries for medium errors; the looping is driven by the SCSI mid-layer code). >> It really beats the alternative of a forced reboot >> due to, say, superblock I/O failing because it happened >> to get merged with an unrelated I/O which then failed.. >> Etc.. >> >> Definitely an improvement. >> >> The number of retries is an entirely separate issue. >> If we really care about it, then we should fix SD_MAX_RETRIES. >> >> The current value of 5 is *way* too high. It should be zero or one. .. > I think that drives retry enough, we should leave retry at zero for > normal (non-removable) drives. Should this be a policy we can set like > we do with NCQ queue depth via /sys ? Or perhaps we could have the mid-layer always "early-exit" without retries for "MEDIUM_ERROR", and still do retries for the rest. When libata reports a MEDIUM_ERROR to us, we *know* it's non-recoverable, as the drive itself has already done internal retries (libata uses the "with retry" ATA opcodes for this). But meanwhile, we still have the original issue too, where a single stray bad sector can blow a system out of the water, because the mid-layer currently aborts everything after it from a large merged request. Thus the original patch from this thread. :) Cheers ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR 2007-01-31 15:08 ` Mark Lord @ 2007-01-31 15:23 ` Alan 2007-01-31 16:35 ` Ric Wheeler 2007-01-31 17:57 ` Mark Lord 0 siblings, 2 replies; 34+ messages in thread From: Alan @ 2007-01-31 15:23 UTC (permalink / raw) To: Mark Lord Cc: Ric Wheeler, Eric D. Mudama, James Bottomley, linux-kernel, IDE/ATA development list, linux-scsi, dougg > When libata reports a MEDIUM_ERROR to us, we *know* it's non-recoverable, > as the drive itself has already done internal retries (libata uses the > "with retry" ATA opcodes for this). This depends on the firmware. Some of the "raid firmware" drives don't appear to do retries in firmware. > But meanwhile, we still have the original issue too, where a single stray > bad sector can blow a system out of the water, because the mid-layer > currently aborts everything after it from a large merged request. > > Thus the original patch from this thread. :) Agreed ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR 2007-01-31 15:23 ` Alan @ 2007-01-31 16:35 ` Ric Wheeler 2007-01-31 17:57 ` Mark Lord 1 sibling, 0 replies; 34+ messages in thread From: Ric Wheeler @ 2007-01-31 16:35 UTC (permalink / raw) To: Alan Cc: Mark Lord, Eric D. Mudama, James Bottomley, linux-kernel, IDE/ATA development list, linux-scsi, dougg Alan wrote: >> When libata reports a MEDIUM_ERROR to us, we *know* it's non-recoverable, >> as the drive itself has already done internal retries (libata uses the >> "with retry" ATA opcodes for this). > > This depends on the firmware. Some of the "raid firmware" drives don't > appear to do retries in firmware. I think that even for these devices, the actual drives behind the controller will do retries. In any case, it would be reasonable to be able to set this retry/no-retry via /sys to handle exceptional cases... > >> But meanwhile, we still have the original issue too, where a single stray >> bad sector can blow a system out of the water, because the mid-layer >> currently aborts everything after it from a large merged request. >> >> Thus the original patch from this thread. :) > > Agreed ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR 2007-01-31 15:23 ` Alan 2007-01-31 16:35 ` Ric Wheeler @ 2007-01-31 17:57 ` Mark Lord 2007-01-31 18:13 ` James Bottomley 1 sibling, 1 reply; 34+ messages in thread From: Mark Lord @ 2007-01-31 17:57 UTC (permalink / raw) To: Alan Cc: Ric Wheeler, Eric D. Mudama, James Bottomley, linux-kernel, IDE/ATA development list, linux-scsi, dougg Alan wrote: >> When libata reports a MEDIUM_ERROR to us, we *know* it's non-recoverable, >> as the drive itself has already done internal retries (libata uses the >> "with retry" ATA opcodes for this). > > This depends on the firmware. Some of the "raid firmware" drives don't > appear to do retries in firmware. One way to tell if this is true, is simply to time how long the failed operation takes. If the drive truly does not do retries, then the media error should be reported more or less instantly (assuming drive was already spun up). If the failure takes more than a few hundred milliseconds to be reported, or in this case 4-7 seconds typically, then we know the drive was doing retries before it reported back. I haven't seen any drive fail instantly yet. Can anyone with those newfangled "RAID edition" drives try it and report back? Oh.. you'll need a way to create a bad sector. I've got patches and a command-line utility for the job. If your drive supports "WRITE UNCORRECTABLE" ("hdparm -I", w/latest hdparm), then the patches aren't needed. >> But meanwhile, we still have the original issue too, where a single stray >> bad sector can blow a system out of the water, because the mid-layer >> currently aborts everything after it from a large merged request. >> >> Thus the original patch from this thread. :) > > Agreed ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR 2007-01-31 17:57 ` Mark Lord @ 2007-01-31 18:13 ` James Bottomley 2007-01-31 18:37 ` Mark Lord 0 siblings, 1 reply; 34+ messages in thread From: James Bottomley @ 2007-01-31 18:13 UTC (permalink / raw) To: Mark Lord Cc: Alan, Ric Wheeler, Eric D. Mudama, linux-kernel, IDE/ATA development list, linux-scsi, dougg On Wed, 2007-01-31 at 12:57 -0500, Mark Lord wrote: > Alan wrote: > >> When libata reports a MEDIUM_ERROR to us, we *know* it's non-recoverable, > >> as the drive itself has already done internal retries (libata uses the > >> "with retry" ATA opcodes for this). > > > > This depends on the firmware. Some of the "raid firmware" drives don't > > appear to do retries in firmware. > > One way to tell if this is true, is simply to time how long > the failed operation takes. If the drive truly does not do retries, > then the media error should be reported more or less instantly > (assuming drive was already spun up). Well, the simpler way (and one we have a hope of implementing) is to examine the ASC/ASCQ codes to see if the error is genuinely unretryable. I seem to have dropped the ball on this one in that the scsi_error.c pieces of this patch http://marc.theaimsgroup.com/?l=linux-scsi&m=116485834119885 I thought I'd applied. Apparently I didn't, so I'll go back and put them in. James ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR 2007-01-31 18:13 ` James Bottomley @ 2007-01-31 18:37 ` Mark Lord 0 siblings, 0 replies; 34+ messages in thread From: Mark Lord @ 2007-01-31 18:37 UTC (permalink / raw) To: James Bottomley Cc: Alan, Ric Wheeler, Eric D. Mudama, linux-kernel, IDE/ATA development list, linux-scsi, dougg James Bottomley wrote: > On Wed, 2007-01-31 at 12:57 -0500, Mark Lord wrote: >> Alan wrote: >>>> When libata reports a MEDIUM_ERROR to us, we *know* it's non-recoverable, >>>> as the drive itself has already done internal retries (libata uses the >>>> "with retry" ATA opcodes for this). >>> This depends on the firmware. Some of the "raid firmware" drives don't >>> appear to do retries in firmware. >> One way to tell if this is true, is simply to time how long >> the failed operation takes. If the drive truly does not do retries, >> then the media error should be reported more or less instantly >> (assuming drive was already spun up). > > Well, the simpler way (and one we have a hope of implementing) is to > examine the ASC/ASCQ codes to see if the error is genuinely unretryable. My suggestion above was not for a kernel fix, but rather just as a way of determining if drives which claim "no retries" actually do them or not. :) > I seem to have dropped the ball on this one in that the scsi_error.c > pieces of this patch > > http://marc.theaimsgroup.com/?l=linux-scsi&m=116485834119885 > > I thought I'd applied. Apparently I didn't, so I'll go back and put > them in. Good. That would be a useful supplement to the patch I posted here. Cheers ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR 2007-01-31 1:41 ` Mark Lord 2007-01-31 3:20 ` Ric Wheeler @ 2007-01-31 9:30 ` Jeff Garzik 2007-01-31 14:36 ` Ric Wheeler 1 sibling, 1 reply; 34+ messages in thread From: Jeff Garzik @ 2007-01-31 9:30 UTC (permalink / raw) To: Mark Lord, Eric D. Mudama Cc: James Bottomley, linux-kernel, IDE/ATA development list, linux-scsi Mark Lord wrote: > Eric D. Mudama wrote: >> >> Actually, it's possibly worse, since each failure in libata will >> generate 3-4 retries. With existing ATA error recovery in the drives, >> that's about 3 seconds per retry on average, or 12 seconds per >> failure. Multiply that by the number of blocks past the error to >> complete the request.. > > It really beats the alternative of a forced reboot > due to, say, superblock I/O failing because it happened > to get merged with an unrelated I/O which then failed.. > Etc.. FWIW -- speaking generally -- I think there are inevitable areas where libata error handling combined with SCSI error handling results in suboptimal error handling. Just creating a list of "<this behavior> should be handled <this way>, but in reality is handled in <this silly way>" would be very helpful. Error handling is tough to get right, because the code is exercised so infrequently. Tejun has actually done an above-average job here, by making device probe, hotplug and other "exceptions" go through the libata EH code, thereby exercising the EH code more than one might normally assume. Some errors in libata probably should not be retried more than once, when we have a definitive diagnosis. Suggestions for improvements are welcome. Jeff ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR 2007-01-31 9:30 ` Jeff Garzik @ 2007-01-31 14:36 ` Ric Wheeler 2007-01-31 15:28 ` Douglas Gilbert 0 siblings, 1 reply; 34+ messages in thread From: Ric Wheeler @ 2007-01-31 14:36 UTC (permalink / raw) To: Jeff Garzik Cc: Mark Lord, Eric D. Mudama, James Bottomley, linux-kernel, IDE/ATA development list, linux-scsi Jeff Garzik wrote: > Mark Lord wrote: >> Eric D. Mudama wrote: >>> >>> Actually, it's possibly worse, since each failure in libata will >>> generate 3-4 retries. With existing ATA error recovery in the >>> drives, that's about 3 seconds per retry on average, or 12 seconds >>> per failure. Multiply that by the number of blocks past the error to >>> complete the request.. >> >> It really beats the alternative of a forced reboot >> due to, say, superblock I/O failing because it happened >> to get merged with an unrelated I/O which then failed.. >> Etc.. > > > FWIW -- speaking generally -- I think there are inevitable areas where > libata error handling combined with SCSI error handling results in > suboptimal error handling. > > Just creating a list of "<this behavior> should be handled <this way>, > but in reality is handled in <this silly way>" would be very helpful. I agree - Tejun has done a great job at giving us a great base. Next step is to get clarity on what the types of errors are and how to differentiate between them (and maybe how that would change by class of device?). > > Error handling is tough to get right, because the code is exercised so > infrequently. Tejun has actually done an above-average job here, by > making device probe, hotplug and other "exceptions" go through the > libata EH code, thereby exercising the EH code more than one might > normally assume. > > Some errors in libata probably should not be retried more than once, > when we have a definitive diagnosis. Suggestions for improvements are > welcome. > > Jeff One thing that we find really useful is to inject real errors into devices. Mark has some patches that let us inject media errors, we also bring back failed drives and run them through testing and occasionally get to use analyzers, etc to inject odd ball errors. Hopefully, we will get some time to brainstorm about this at the workshop, ric ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR 2007-01-31 14:36 ` Ric Wheeler @ 2007-01-31 15:28 ` Douglas Gilbert 2007-01-31 15:38 ` Mark Lord 0 siblings, 1 reply; 34+ messages in thread From: Douglas Gilbert @ 2007-01-31 15:28 UTC (permalink / raw) To: ric Cc: Jeff Garzik, Mark Lord, Eric D. Mudama, James Bottomley, linux-kernel, IDE/ATA development list, linux-scsi Ric Wheeler wrote: > > > Jeff Garzik wrote: >> Mark Lord wrote: >>> Eric D. Mudama wrote: >>>> >>>> Actually, it's possibly worse, since each failure in libata will >>>> generate 3-4 retries. With existing ATA error recovery in the >>>> drives, that's about 3 seconds per retry on average, or 12 seconds >>>> per failure. Multiply that by the number of blocks past the error >>>> to complete the request.. >>> >>> It really beats the alternative of a forced reboot >>> due to, say, superblock I/O failing because it happened >>> to get merged with an unrelated I/O which then failed.. >>> Etc.. >> >> >> FWIW -- speaking generally -- I think there are inevitable areas where >> libata error handling combined with SCSI error handling results in >> suboptimal error handling. >> >> Just creating a list of "<this behavior> should be handled <this way>, >> but in reality is handled in <this silly way>" would be very helpful. > > I agree - Tejun has done a great job at giving us a great base. Next > step is to get clarity on what the types of errors are and how to > differentiate between them (and maybe how that would change by class of > device?). > >> >> Error handling is tough to get right, because the code is exercised so >> infrequently. Tejun has actually done an above-average job here, by >> making device probe, hotplug and other "exceptions" go through the >> libata EH code, thereby exercising the EH code more than one might >> normally assume. >> >> Some errors in libata probably should not be retried more than once, >> when we have a definitive diagnosis. Suggestions for improvements are >> welcome. >> >> Jeff > > One thing that we find really useful is to inject real errors into > devices. Mark has some patches that let us inject media errors, we also > bring back failed drives and run them through testing and occasionally > get to use analyzers, etc to inject odd ball errors. Ric, Both ATA (ATA8-ACS) and SCSI (SBC-3) have recently added command support to flag a block as "uncorrectable". There is no need to send bad "long" data to it and suppress the disk's automatic re-allocation logic. In the case of ATA it is the WRITE UNCORRECTABLE command. In the case of SCSI it is the WR_UNCOR bit in the WRITE LONG command. It seems that due to SAT any useful capability in the ATA command set will soon appear in the corresponding SCSI command set, if it is not already there. Doug Gilbert ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR 2007-01-31 15:28 ` Douglas Gilbert @ 2007-01-31 15:38 ` Mark Lord 0 siblings, 0 replies; 34+ messages in thread From: Mark Lord @ 2007-01-31 15:38 UTC (permalink / raw) To: dougg Cc: ric, Jeff Garzik, Eric D. Mudama, James Bottomley, linux-kernel, IDE/ATA development list, linux-scsi Douglas Gilbert wrote: > > Ric, > Both ATA (ATA8-ACS) and SCSI (SBC-3) have recently added > command support to flag a block as "uncorrectable". There > is no need to send bad "long" data to it and suppress the > disk's automatic re-allocation logic. That'll be useful in a couple of years, once drives that have it become more common. For now, though, we're hacking current drives using READ/WRITE LONG commands, with a corresponding patch to libata to allow for the longer "sector size" involved. Having real bad sectors, exactly where we want them on the media, sure does make testing / fixing the EH mechanisms a lot more feasible. Cheers ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR 2007-01-31 1:16 ` [PATCH] " James Bottomley 2007-01-31 1:36 ` Mark Lord [not found] ` <311601c90701301725n53d25a74g652b7ca3bfc64c56@mail.gmail.com> @ 2007-02-01 20:02 ` Mark Lord 2007-02-01 21:55 ` James Bottomley 2 siblings, 1 reply; 34+ messages in thread From: Mark Lord @ 2007-02-01 20:02 UTC (permalink / raw) To: James Bottomley; +Cc: linux-kernel, IDE/ATA development list, linux-scsi James Bottomley wrote: > On Tue, 2007-01-30 at 19:47 -0500, Mark Lord wrote: >> Kernels since about 2.6.16 or so have been broken in this regard. >> They "complete" the good sectors before the error, >> and then fail the entire remaining portions of the request. > > What was the commit that introduced the change? ... I have a vague > memory of it being deliberate. I believe you made the first change in response to my prodding at the time, when libata was not returning valid sense data (no LBA) for media errors. The SCSI EH handling of that was rather poor at the time, and so having it not retry the remaining sectors was actually a very good fix at the time. But now, libata *does* return valid sense data for LBA/DMA drives, and the workaround from circa 2.6.16 is no longer the best we can do. Now that we know which sector failed, we ought to be able to skip over it, and continue with the rest of the merged request. One thing that could be even better than the patch below, would be to have it perhaps skip the entire bio that includes the failed sector, rather than only the bad sector itself. I think doing that might address most concerns expressed here. Have you got an alternate suggestion, James? .. >> Signed-off-by: Mark Lord <mlord@pobox.com> >> --- >> diff -u --recursive --new-file --exclude-from=linux_17//Documentation/dontdiff old/drivers/scsi/scsi_lib.c linux/drivers/scsi/scsi_lib.c >> --- old/drivers/scsi/scsi_lib.c 2007-01-30 13:58:05.000000000 -0500 >> +++ linux/drivers/scsi/scsi_lib.c 2007-01-30 18:30:01.000000000 -0500 >> @@ -865,6 +865,12 @@ >> */ >> if (sense_valid && !sense_deferred) { >> switch (sshdr.sense_key) { >> + case MEDIUM_ERROR: >> + // Bad sector. Fail it, and then continue the rest of the request: >> + if (scsi_end_request(cmd, 0, cmd->device->sector_size, 1) == NULL) { > > The sense key may have come with additional information I think we want > to parse that (if it exists) rather than just blindly failing the first > sector of the request. > >> + cmd->retries = 0; // go around again.. >> + return; >> + } > > This would drop through to the UNIT_ATTENTION case if scsi_end_request() > fails ... I don't think that's correct. > >> case UNIT_ATTENTION: >> if (cmd->device->removable) { >> /* Detected disc change. Set a bit > > - > To unsubscribe from this list: send the line "unsubscribe linux-ide" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR 2007-02-01 20:02 ` Mark Lord @ 2007-02-01 21:55 ` James Bottomley 2007-02-02 2:48 ` Mark Lord 2007-02-02 12:20 ` Ric Wheeler 0 siblings, 2 replies; 34+ messages in thread From: James Bottomley @ 2007-02-01 21:55 UTC (permalink / raw) To: Mark Lord; +Cc: linux-kernel, IDE/ATA development list, linux-scsi On Thu, 2007-02-01 at 15:02 -0500, Mark Lord wrote: > I believe you made the first change in response to my prodding at the time, > when libata was not returning valid sense data (no LBA) for media errors. > The SCSI EH handling of that was rather poor at the time, > and so having it not retry the remaining sectors was actually > a very good fix at the time. > > But now, libata *does* return valid sense data for LBA/DMA drives, > and the workaround from circa 2.6.16 is no longer the best we can do. > Now that we know which sector failed, we ought to be able to skip > over it, and continue with the rest of the merged request. We can ... the big concern with your approach, which you haven't addressed is the time factor. For most SCSI devices, returning a fatal MEDIUM ERROR means we're out of remapping table, and also that there's probably a bunch of sectors on the track that are now out. Thus, there are almost always multiple sector failures. In linux, the average request size on a filesystem is around 64-128kb; thats 128-256 sectors. If we fail at the initial sector, we have to go through another 128-256 attempts, with the internal device retries, before we fail the entire request. Some devices can take a second or so for each read before they finally give up and decide they really can't read the sector, so you're looking at 2-5 minutes before the machine finally fails this one request ... and much worse for devices that retry more times. > One thing that could be even better than the patch below, > would be to have it perhaps skip the entire bio that includes > the failed sector, rather than only the bad sector itself. Er ... define "skip over the bio". A bio is simply a block representation for a bunch of sg elements coming in to the elevator. Mostly what we see in SCSI is a single bio per request, so skipping the bio is really the current behaviour (to fail the rest of the request). > I think doing that might address most concerns expressed here. > Have you got an alternate suggestion, James? James ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR 2007-02-01 21:55 ` James Bottomley @ 2007-02-02 2:48 ` Mark Lord 2007-02-02 12:20 ` Ric Wheeler 1 sibling, 0 replies; 34+ messages in thread From: Mark Lord @ 2007-02-02 2:48 UTC (permalink / raw) To: James Bottomley; +Cc: linux-kernel, IDE/ATA development list, linux-scsi James Bottomley wrote: > On Thu, 2007-02-01 at 15:02 -0500, Mark Lord wrote: >.. >> One thing that could be even better than the patch below, >> would be to have it perhaps skip the entire bio that includes >> the failed sector, rather than only the bad sector itself. > > Er ... define "skip over the bio". A bio is simply a block > representation for a bunch of sg elements coming in to the elevator. Exactly. Or rather, a block of sg_elements from a single point of request, is it not? > Mostly what we see in SCSI is a single bio per request, so skipping the > bio is really the current behaviour (to fail the rest of the request). Very good. That's what it's supposed to do. But if each request contained only a single bio, then all of Jens' work on IO scheduling would be for nothing, n'est-ce pas? In the case where a request consists of multiple bio's which have been merged under a single request struct, we really should give at least one attempt to each bio. This way, in most cases, only the process that requested the failed sector(s) will see an error, not the innocent victims that happened to get merged onto the end. Which could be very critical stuff (or not -- it could be quite random). So the time factor works out to one disk I/O timeout per failed bio. That's what would have happened with the NOP scheduler anyway. On the sytems I'm working with, I don't see huge numbers of bad sectors. What they tend to show is just one or two bad sectors, widely scattered. So: >> I think doing that might address most concerns expressed here. >> Have you got an alternate suggestion, James? Cheers ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR 2007-02-01 21:55 ` James Bottomley 2007-02-02 2:48 ` Mark Lord @ 2007-02-02 12:20 ` Ric Wheeler 2007-02-02 14:42 ` Alan 2007-02-02 14:50 ` Alan 1 sibling, 2 replies; 34+ messages in thread From: Ric Wheeler @ 2007-02-02 12:20 UTC (permalink / raw) To: James Bottomley Cc: Mark Lord, linux-kernel, IDE/ATA development list, linux-scsi James Bottomley wrote: >On Thu, 2007-02-01 at 15:02 -0500, Mark Lord wrote: > > >>I believe you made the first change in response to my prodding at the time, >>when libata was not returning valid sense data (no LBA) for media errors. >>The SCSI EH handling of that was rather poor at the time, >>and so having it not retry the remaining sectors was actually >>a very good fix at the time. >> >>But now, libata *does* return valid sense data for LBA/DMA drives, >>and the workaround from circa 2.6.16 is no longer the best we can do. >>Now that we know which sector failed, we ought to be able to skip >>over it, and continue with the rest of the merged request. >> >> > >We can ... the big concern with your approach, which you haven't >addressed is the time factor. For most SCSI devices, returning a fatal >MEDIUM ERROR means we're out of remapping table, and also that there's >probably a bunch of sectors on the track that are now out. Thus, there >are almost always multiple sector failures. In linux, the average >request size on a filesystem is around 64-128kb; thats 128-256 sectors. >If we fail at the initial sector, we have to go through another 128-256 >attempts, with the internal device retries, before we fail the entire >request. Some devices can take a second or so for each read before they >finally give up and decide they really can't read the sector, so you're >looking at 2-5 minutes before the machine finally fails this one >request ... and much worse for devices that retry more times. > > This is not the case on a read error - we commonly see transient errors on reads from disks. What we push our vendors to do is to try to keep the "worst case" response down to tens of seconds as they retry, etc internally with a device. When they take that long (and they do), adding retries up the stack can translate into minutes per sector. The interesting point of this question is about the typically pattern of IO errors. On a read, it is safe to assume that you will have issues with some bounded numbers of adjacent sectors. > > >>One thing that could be even better than the patch below, >>would be to have it perhaps skip the entire bio that includes >>the failed sector, rather than only the bad sector itself. >> >> > >Er ... define "skip over the bio". A bio is simply a block >representation for a bunch of sg elements coming in to the elevator. >Mostly what we see in SCSI is a single bio per request, so skipping the >bio is really the current behaviour (to fail the rest of the request). > > This is really a tricky one - what can happen when we fail merged IO requests is really unpredictable behavior up at the application level since the IO error might not be at all relevant to my part of the request. Merging can produce a request that is much larger than any normal drive error. I really like the idea of being able to set this kind of policy on a per drive instance since what you want here will change depending on what your system requirements are, what the system is trying to do (i.e., when trying to recover a failing but not dead yet disk, IO errors should be as quick as possible and we should choose an IO scheduler that does not combine IO's). > > >>I think doing that might address most concerns expressed here. >>Have you got an alternate suggestion, James? >> >> > >James > > ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR 2007-02-02 12:20 ` Ric Wheeler @ 2007-02-02 14:42 ` Alan 2007-02-02 14:53 ` James Bottomley 2007-02-02 20:16 ` Douglas Gilbert 2007-02-02 14:50 ` Alan 1 sibling, 2 replies; 34+ messages in thread From: Alan @ 2007-02-02 14:42 UTC (permalink / raw) To: Ric Wheeler Cc: James Bottomley, Mark Lord, linux-kernel, IDE/ATA development list, linux-scsi > The interesting point of this question is about the typically pattern of > IO errors. On a read, it is safe to assume that you will have issues > with some bounded numbers of adjacent sectors. Which in theory you can get by asking the drive for the real sector size from the ATA7 info. (We ought to dig this out more as its relevant for partition layout too). > I really like the idea of being able to set this kind of policy on a per > drive instance since what you want here will change depending on what > your system requirements are, what the system is trying to do (i.e., > when trying to recover a failing but not dead yet disk, IO errors should > be as quick as possible and we should choose an IO scheduler that does > not combine IO's). That seems to be arguing for a bounded "live" time including retry run time for a command. That's also more intuitive for real time work and for end user setup. "Either work or fail within n seconds" Alan ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR 2007-02-02 14:42 ` Alan @ 2007-02-02 14:53 ` James Bottomley 2007-02-02 16:16 ` Ric Wheeler 2007-02-02 20:16 ` Douglas Gilbert 1 sibling, 1 reply; 34+ messages in thread From: James Bottomley @ 2007-02-02 14:53 UTC (permalink / raw) To: Alan Cc: Ric Wheeler, Mark Lord, linux-kernel, IDE/ATA development list, linux-scsi On Fri, 2007-02-02 at 14:42 +0000, Alan wrote: > > The interesting point of this question is about the typically pattern of > > IO errors. On a read, it is safe to assume that you will have issues > > with some bounded numbers of adjacent sectors. > > Which in theory you can get by asking the drive for the real sector size > from the ATA7 info. (We ought to dig this out more as its relevant for > partition layout too). > > > I really like the idea of being able to set this kind of policy on a per > > drive instance since what you want here will change depending on what > > your system requirements are, what the system is trying to do (i.e., > > when trying to recover a failing but not dead yet disk, IO errors should > > be as quick as possible and we should choose an IO scheduler that does > > not combine IO's). > > That seems to be arguing for a bounded "live" time including retry run > time for a command. That's also more intuitive for real time work and for > end user setup. "Either work or fail within n seconds" Actually, then I think perhaps we use the allowed retries for this ... So you would fail a single sector and count it against the retries. When you've done this allowed retries times, you fail the rest of the request. James ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR 2007-02-02 14:53 ` James Bottomley @ 2007-02-02 16:16 ` Ric Wheeler 0 siblings, 0 replies; 34+ messages in thread From: Ric Wheeler @ 2007-02-02 16:16 UTC (permalink / raw) To: James Bottomley Cc: Alan, Mark Lord, linux-kernel, IDE/ATA development list, linux-scsi James Bottomley wrote: > On Fri, 2007-02-02 at 14:42 +0000, Alan wrote: > >>> The interesting point of this question is about the typically pattern of >>> IO errors. On a read, it is safe to assume that you will have issues >>> with some bounded numbers of adjacent sectors. >>> >> Which in theory you can get by asking the drive for the real sector size >> from the ATA7 info. (We ought to dig this out more as its relevant for >> partition layout too). >> Actually, my point is that damage typically impacts a cluster of disk sectors that are adjacent. Think of a drive that has junk on the platter or a some such thing - the contamination is likely to be localized. >> >>> I really like the idea of being able to set this kind of policy on a per >>> drive instance since what you want here will change depending on what >>> your system requirements are, what the system is trying to do (i.e., >>> when trying to recover a failing but not dead yet disk, IO errors should >>> be as quick as possible and we should choose an IO scheduler that does >>> not combine IO's). >>> >> That seems to be arguing for a bounded "live" time including retry run >> time for a command. That's also more intuitive for real time work and for >> end user setup. "Either work or fail within n seconds" >> > > Actually, then I think perhaps we use the allowed retries for this ... > I really am not a big retry fan for most modern drives - the drive will try really, really hard to complete an IO for us and multiple retries can just slow down the higher level application from recovering. > So you would fail a single sector and count it against the retries. > When you've done this allowed retries times, you fail the rest of the > request. > > James > > I think that we need to play with some of these possible solutions on some real-world bad drives and see how they react. We should definitely talk more about this at the workshop ;-) ric ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR 2007-02-02 14:42 ` Alan 2007-02-02 14:53 ` James Bottomley @ 2007-02-02 20:16 ` Douglas Gilbert 1 sibling, 0 replies; 34+ messages in thread From: Douglas Gilbert @ 2007-02-02 20:16 UTC (permalink / raw) To: Alan Cc: Ric Wheeler, James Bottomley, Mark Lord, linux-kernel, IDE/ATA development list, linux-scsi Alan wrote: >> The interesting point of this question is about the typically pattern of >> IO errors. On a read, it is safe to assume that you will have issues >> with some bounded numbers of adjacent sectors. > > Which in theory you can get by asking the drive for the real sector size > from the ATA7 info. (We ought to dig this out more as its relevant for > partition layout too). > >> I really like the idea of being able to set this kind of policy on a per >> drive instance since what you want here will change depending on what >> your system requirements are, what the system is trying to do (i.e., >> when trying to recover a failing but not dead yet disk, IO errors should >> be as quick as possible and we should choose an IO scheduler that does >> not combine IO's). > > That seems to be arguing for a bounded "live" time including retry run > time for a command. That's also more intuitive for real time work and for > end user setup. "Either work or fail within n seconds" Which is more or less the "streaming" feature set in recent ATA standards. [Alas, streaming and NCQ/TCQ can't be done with the same access.] SCSI has its Read Write Error Recovery mode page which doesn't have timeouts but does have Read and Write Retry Counts amongst other fields that control the amount (and indirectly the time) of attempted error recovery. Doug Gilbert ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR 2007-02-02 12:20 ` Ric Wheeler 2007-02-02 14:42 ` Alan @ 2007-02-02 14:50 ` Alan 2007-02-02 16:06 ` Mark Lord 1 sibling, 1 reply; 34+ messages in thread From: Alan @ 2007-02-02 14:50 UTC (permalink / raw) To: Ric Wheeler Cc: James Bottomley, Mark Lord, linux-kernel, IDE/ATA development list, linux-scsi > your system requirements are, what the system is trying to do (i.e., > when trying to recover a failing but not dead yet disk, IO errors should > be as quick as possible and we should choose an IO scheduler that does > not combine IO's). If this is the right strategy for disk recovery for a given type of device then this ought to be an automatic strategy. Most end users will not have the knowledge to frob about in sysfs, and if the bad sector hits at the wrong moment a sensible automatic recovery strategy is going to do the right thing faster than human intervention, which may be some hours away. ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR 2007-02-02 14:50 ` Alan @ 2007-02-02 16:06 ` Mark Lord 2007-02-02 19:49 ` Matt Mackall 0 siblings, 1 reply; 34+ messages in thread From: Mark Lord @ 2007-02-02 16:06 UTC (permalink / raw) To: Alan Cc: Ric Wheeler, James Bottomley, linux-kernel, IDE/ATA development list, linux-scsi Alan wrote: > > If this is the right strategy for disk recovery for a given type of > device then this ought to be an automatic strategy. Most end users will > not have the knowledge to frob about in sysfs, and if the bad sector hits > at the wrong moment a sensible automatic recovery strategy is going to do > the right thing faster than human intervention, which may be some hours > away. I think something we seem to be discussing here are the opposite aims of enterprise RAID (traditional SCSI market) versus desktop filesystems (now the number one user of Linux SCSI, courtesy of libata). With RAID, it's safe to suggest that a very fast, bounded exit from EH is almost always what the customer / end-user wants, because the higher level RAID management can then deal with the failed drive offline. But for a desktop filesystem, failing out quickly and bulk-failing megabytes around a couple of bad sectors is not good -- no redundancy, and this will lead to unneeded data loss. It's beginning to look like this needs to be run-time tuneable, per block minor device (per-partition), through sysfs at a minimum. The RAID tools could then automatically choose settings to bias more towards an "instant exit" when errors are found, whereas a non-RAID desktop could select a more reliable recovery strategy. Right now, with Jame's patch (earlier in this thread), the whole scheme is heavily weighted towards the RAID "instant exit" strategy, which is making me quite nervous about the data on my laptop. Cheers ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR 2007-02-02 16:06 ` Mark Lord @ 2007-02-02 19:49 ` Matt Mackall 2007-02-02 22:58 ` Mark Lord 0 siblings, 1 reply; 34+ messages in thread From: Matt Mackall @ 2007-02-02 19:49 UTC (permalink / raw) To: Mark Lord Cc: Alan, Ric Wheeler, James Bottomley, linux-kernel, IDE/ATA development list, linux-scsi On Fri, Feb 02, 2007 at 11:06:19AM -0500, Mark Lord wrote: > Alan wrote: > > > >If this is the right strategy for disk recovery for a given type of > >device then this ought to be an automatic strategy. Most end users will > >not have the knowledge to frob about in sysfs, and if the bad sector hits > >at the wrong moment a sensible automatic recovery strategy is going to do > >the right thing faster than human intervention, which may be some hours > >away. > > I think something we seem to be discussing here are the opposite aims > of enterprise RAID (traditional SCSI market) versus desktop filesystems > (now the number one user of Linux SCSI, courtesy of libata). > > With RAID, it's safe to suggest that a very fast, bounded exit from EH > is almost always what the customer / end-user wants, because the higher > level RAID management can then deal with the failed drive offline. > > But for a desktop filesystem, failing out quickly and bulk-failing megabytes > around a couple of bad sectors is not good -- no redundancy, and this will > lead to unneeded data loss. > > It's beginning to look like this needs to be run-time tuneable, > per block minor device (per-partition), through sysfs at a minimum. > The RAID tools could then automatically choose settings to bias more > towards an "instant exit" when errors are found, whereas a non-RAID > desktop could select a more reliable recovery strategy. > > Right now, with Jame's patch (earlier in this thread), the whole scheme > is heavily weighted towards the RAID "instant exit" strategy, which is > making me quite nervous about the data on my laptop. Also worth considering is that spending minutes trying to reread damaged sectors is likely to accelerate your death spiral. More data may be recoverable if you give up quickly in a first pass, then go back and manually retry damaged bits with smaller I/Os. Another approach is to have separate retries per hard sector and hard errors per MB at the block device level. We don't want to have the same number of retries for a 64KB block as a 1MB block and we certainly don't want to do 2K retries in a row. So for instance, I could have the first set to 1 and the second set to 16. For a 256KB read, this gets scaled down to 4, which means we'll retry each sector once and fail the whole I/O when we hit the 5th sector error. More reasonable number might be 0 and 1, meaning "don't do OS level retries on sectors and fail the whole I/O on the second sector error (or immediately for smaller reads)". It might also be informative to add a kernel message reporting if a retry (in the non-transient cases) ever actually succeeds. -- Mathematics is the supreme nostalgia of our time. ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR 2007-02-02 19:49 ` Matt Mackall @ 2007-02-02 22:58 ` Mark Lord 2007-02-02 23:07 ` Matt Mackall 0 siblings, 1 reply; 34+ messages in thread From: Mark Lord @ 2007-02-02 22:58 UTC (permalink / raw) To: Matt Mackall Cc: Alan, Ric Wheeler, James Bottomley, linux-kernel, IDE/ATA development list, linux-scsi Matt Mackall wrote: > .. > Also worth considering is that spending minutes trying to reread > damaged sectors is likely to accelerate your death spiral. More data > may be recoverable if you give up quickly in a first pass, then go > back and manually retry damaged bits with smaller I/Os. All good input. But what was being debated here is not so much the retrying of known-bad sectors, but rather what to do about the kiBs or MiBs of sectors remaining in a merged request after hitting a single bad sector mid-way. Currently, SCSI just abandons the entire remaining workload. Cheers ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR 2007-02-02 22:58 ` Mark Lord @ 2007-02-02 23:07 ` Matt Mackall 0 siblings, 0 replies; 34+ messages in thread From: Matt Mackall @ 2007-02-02 23:07 UTC (permalink / raw) To: Mark Lord Cc: Alan, Ric Wheeler, James Bottomley, linux-kernel, IDE/ATA development list, linux-scsi On Fri, Feb 02, 2007 at 05:58:04PM -0500, Mark Lord wrote: > Matt Mackall wrote: > >.. > >Also worth considering is that spending minutes trying to reread > >damaged sectors is likely to accelerate your death spiral. More data > >may be recoverable if you give up quickly in a first pass, then go > >back and manually retry damaged bits with smaller I/Os. > > All good input. But what was being debated here is not so much > the retrying of known-bad sectors, but rather what to do about > the kiBs or MiBs of sectors remaining in a merged request after > hitting a single bad sector mid-way. Yep, that's precisely what was addressed in the part you snipped. My main point being that what to do about the remaining workload should be dependent on the size of the I/O. If we encounter errors on sectors 4,5,6,7,8.. of a 1MB request, we should have a threshold for giving up. It's not unreasonable for that threshold to be larger than 1, but it should not be 2048. And if we do the I/O as four 256KB requests, we should have approximately the same number of retries (assuming the whole region's bad). -- Mathematics is the supreme nostalgia of our time. ^ permalink raw reply [flat|nested] 34+ messages in thread
end of thread, other threads:[~2007-02-02 23:07 UTC | newest]
Thread overview: 34+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-01-31 0:47 [PATCH] scsi_lib.c: continue after MEDIUM_ERROR Mark Lord
2007-01-31 1:12 ` [PATCH] RESEND " Mark Lord
2007-01-31 1:16 ` [PATCH] " James Bottomley
2007-01-31 1:36 ` Mark Lord
[not found] ` <311601c90701301725n53d25a74g652b7ca3bfc64c56@mail.gmail.com>
2007-01-31 1:41 ` Mark Lord
2007-01-31 3:20 ` Ric Wheeler
2007-01-31 4:21 ` James Bottomley
2007-01-31 15:13 ` Mark Lord
2007-01-31 15:22 ` Mark Lord
2007-01-31 15:24 ` James Bottomley
2007-01-31 5:09 ` Douglas Gilbert
2007-01-31 15:08 ` Mark Lord
2007-01-31 15:23 ` Alan
2007-01-31 16:35 ` Ric Wheeler
2007-01-31 17:57 ` Mark Lord
2007-01-31 18:13 ` James Bottomley
2007-01-31 18:37 ` Mark Lord
2007-01-31 9:30 ` Jeff Garzik
2007-01-31 14:36 ` Ric Wheeler
2007-01-31 15:28 ` Douglas Gilbert
2007-01-31 15:38 ` Mark Lord
2007-02-01 20:02 ` Mark Lord
2007-02-01 21:55 ` James Bottomley
2007-02-02 2:48 ` Mark Lord
2007-02-02 12:20 ` Ric Wheeler
2007-02-02 14:42 ` Alan
2007-02-02 14:53 ` James Bottomley
2007-02-02 16:16 ` Ric Wheeler
2007-02-02 20:16 ` Douglas Gilbert
2007-02-02 14:50 ` Alan
2007-02-02 16:06 ` Mark Lord
2007-02-02 19:49 ` Matt Mackall
2007-02-02 22:58 ` Mark Lord
2007-02-02 23:07 ` Matt Mackall
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).