Re: [PATCH] [RFC] sd: make error handling more robust

public inbox for linux-scsi@vger.kernel.org
 help / color / mirror / Atom feed

* Re: [PATCH] [RFC] sd: make error handling more robust
@ 2008-02-01  1:24 Luben Tuikov
  2008-02-01 11:53 ` Luben Tuikov
  2008-02-01 15:46 ` Tony Battersby
  0 siblings, 2 replies; 12+ messages in thread
From: Luben Tuikov @ 2008-02-01  1:24 UTC (permalink / raw)
  To: James Bottomley, linux-scsi, Tony Battersby

[-- Attachment #1: Type: text/plain, Size: 5400 bytes --]

--- On Thu, 1/31/08, Tony Battersby <tonyb@cybernetics.com> wrote:

> From: Tony Battersby <tonyb@cybernetics.com>
> Subject: [PATCH] [RFC] sd: make error handling more robust
> To: "James Bottomley" <James.Bottomley@HansenPartnership.com>, "Luben Tuikov" <ltuikov@yahoo.com>, linux-scsi@vger.kernel.org
> Date: Thursday, January 31, 2008, 1:31 PM
> I have a RAID that returns a medium error on a read command.
>  The
> "information bytes valid" bit is set in the sense
> data, but the
> information bytes are zero:
> 
> CDB: 28 00 02 B0 62 00 00 00 02 00
> Status:	02 (CHECK CONDITION)
> Sense data:
> F0 00 03 00 | 00 00 00 0A | 00 00 00 00 | 00 00 00 00
> 00 00

This indicates either a problem with the target
or the LLDD. VALID should be 0 if the bad LBA is
invalid.  Now since VALID is 1, bad LBA is considered
valid, with value of 0.

Bad LBA is always greater than or equal to start LBA,
i.e.:
    Bad LBA >= Start LBA.
The opposite is an impossibility since reading off
the end of the "platter" doesn't wrap around through
LBA 0 (which would be crazy).

By having VALID = 1 and Bad LBA = 0, and I'm assuming
Start LBA > 0 (from the situation you describe),
then good bytes is non-sensical.  This is NOT due
to the way sd_done() behaves, but due to the invalid/
broken sense data returned/emulated by the LU/LLDD.

> For HARDWARE_ERROR/MEDIUM_ERROR, sd_done assumes that the
> sense
> information bytes contain the sector in error without
> sanity-checking
> the value, so it ends up returning a completely bogus
> good_bytes value
> greater than xfer_size.

This is due to the fact that the LU/LLDD returned
sense data with VALID = 1, and so Bad LBA is assumed
correct.  This is as per spec. At this moment given
the situation you describe it seems that the LU/LLDD
is behaving out of spec.

As to sanity checking, given the inequality above,
and your peculiar target/LLDD returning broken
sense data, a sanity check would be the following:

   if (bad_lba < start_lba)
         goto out;

Added immediately before the computation of
good_bytes.

> This in turn causes the medium
> error to be
> ignored and corrupt data to be returned to the user
> application.  This
> problem was introduced by the patch "[SCSI]
> sd/scsi_lib simplify
> sd_rw_intr and scsi_io_completion" in 2.6.18-rc1; with
> kernels 2.6.17
> and before, the application correctly gets an I/O error
> instead.  This
> patch fixes this particular problem by checking that
> bad_lba falls
> within a sensible range before using it.
> 
> For a read command that returns
> HARDWARE_ERROR/MEDIUM_ERROR, I also
> added the ability to calculate the amount of good data
> returned using
> the data transfer residual if set by the LLDD.  If the both
> the sense
> information bytes and the data transfer residual are valid,
> then they
> are used to sanity-check each other.
> 
> The following code in sd_done doesn't seem right to me:
> 
>        if (driver_byte(result) != DRIVER_SENSE &&
>            (!sense_valid || sense_deferred))
>                 goto out;
> 
> It would make more sense to use || rather than &&
> for this test.

That's very easy to verify. Simply negate the
test and see if what you get makes sense. Case
in point:

Negate original:
    if (driver_byte(result) == DRIVER_SENSE ||
        (sense_valid && sense_defered))
            Inspect sense.

Negate your proposed change "&&" -> "||":
    if (driver_byte(result) == DRIVER_SENSE &&
         (sense_valid || sense_deferred))
                Inspect sense.

The reason to use the former, rather the latter,
is that LLDDs don't always set the driver_byte
in the result, and may only set the status byte.
E.g. to indicate CHECK CONDITION and then using
scsi_command_normalize_sense() to find out if
sense data is present.

>  Even
> better, this patch moves the check up higher so that the
> sense buffer
> isn't even accessed unless driver_byte(result) &
> DRIVER_SENSE.
> 
> Finally, for the case of RECOVERED_ERROR/NO_SENSE, this
> patch adds a
> check of the data transfer residual before assuming that
> the command
> completed successfully.

You cannot do this for NO SENSE SK.  You can do
that for RECOVERED ERROR SK. The way to do this
is to move "case RECOVERED_ERROR:" just above
"case HARDWARE_ERROR:".

> I would like comments on the following:
> 
> sd_done doesn't check the data transfer residual for
> commands that
> complete successfully.  In the unlikely case that the drive
> returns good
> status without transferring all the data (due to a SCSI bus
> problem or
> disk firmware bug), the error won't be detected.  I
> figured that it was
> more likely for a LLDD to set resid incorrectly than for
> this unlikely
> problem to happen, so I didn't add a check for it. 
> Agreed?
> 
> Is the new is_sd_cmnd_read_or_write() function necessary? 
> The original
> code appears to use blk_fs_request(SCpnt->request) to
> determine if a
> command is read or write.  Should I drop
> is_sd_cmnd_read_or_write() and
> use blk_fs_request() instead, or it it OK like this?
> 
> Does anyone object to the new data transfer residual checks
> for
> non-read/write commands that return
> RECOVERED_ERROR/NO_SENSE?  Should I
> just drop this part of the patch for simplicity?

It seems to me, your patch is really debugging
a misbehaved device server at the RAID device
(or misbehaved LLDD).

Try the attached patch and see what you get.

   Luben

[-- Attachment #2: sd.patch.txt --]
[-- Type: text/plain, Size: 987 bytes --]

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index a69b155..383a3a8 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -934,6 +934,10 @@ static int sd_done(struct scsi_cmnd *SCpnt)
 		goto out;
 
 	switch (sshdr.sense_key) {
+	case RECOVERED_ERROR:
+		scsi_print_sense("sd", SCpnt);
+		SCpnt->result = 0;
+		/* fallthrough */
 	case HARDWARE_ERROR:
 	case MEDIUM_ERROR:
 		if (!blk_fs_request(SCpnt->request))
@@ -968,14 +972,14 @@ static int sd_done(struct scsi_cmnd *SCpnt)
 		/* This computation should always be done in terms of
 		 * the resolution of the device's medium.
 		 */
+		if (bad_lba < start_lba)
+			goto out;
 		good_bytes = (bad_lba - start_lba)*SCpnt->device->sector_size;
 		break;
-	case RECOVERED_ERROR:
 	case NO_SENSE:
 		/* Inform the user, but make sure that it's not treated
 		 * as a hard error.
 		 */
-		scsi_print_sense("sd", SCpnt);
 		SCpnt->result = 0;
 		memset(SCpnt->sense_buffer, 0, SCSI_SENSE_BUFFERSIZE);
 		good_bytes = xfer_size;

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH] [RFC] sd: make error handling more robust
  2008-02-01  1:24 [PATCH] [RFC] sd: make error handling more robust Luben Tuikov
@ 2008-02-01 11:53 ` Luben Tuikov
  2008-02-01 13:44   ` Salyzyn, Mark
  2008-02-01 15:46 ` Tony Battersby
  1 sibling, 1 reply; 12+ messages in thread
From: Luben Tuikov @ 2008-02-01 11:53 UTC (permalink / raw)
  To: James Bottomley, linux-scsi, Tony Battersby

--- On Thu, 1/31/08, Luben Tuikov <ltuikov@yahoo.com> wrote:
> Negate original:
>     if (driver_byte(result) == DRIVER_SENSE ||
>         (sense_valid && sense_defered))
>             Inspect sense.
> 
> Negate your proposed change "&&" ->
> "||":
>     if (driver_byte(result) == DRIVER_SENSE &&
>          (sense_valid || sense_deferred))
>                 Inspect sense.

Should read:

Negate original:
     if (driver_byte(result) == DRIVER_SENSE ||
         (sense_valid && !sense_deferred))
             Inspect sense.

Negate your proposed change "&&" -> "||":
     if (driver_byte(result) == DRIVER_SENSE &&
         (sense_valid || !sense_deferred))
             Inspect sense.

  Luben


^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: [PATCH] [RFC] sd: make error handling more robust
  2008-02-01 11:53 ` Luben Tuikov
@ 2008-02-01 13:44   ` Salyzyn, Mark
  2008-02-01 16:15     ` Tony Battersby
  0 siblings, 1 reply; 12+ messages in thread
From: Salyzyn, Mark @ 2008-02-01 13:44 UTC (permalink / raw)
  To: linux-scsi@vger.kernel.org
  Cc: Mike Snitzer, ltuikov@yahoo.com, James Bottomley, Tony Battersby

Serendipity, been working an issue for the past week where error reporting from aacraid's HARDWARE_ERROR when an array was marked DEAD in the Adapter was not propagating to MD. The trigger for this investigation was "AACRAID driver broken in 2.6.22.x (and beyond?) [WAS: Re: 2.6.22.16 MD raid1 doesn't mark removed disk faulty, MD thread goes UN]" as authored my Mike Snitzer.

I had attacked this yesterday by setting the deferred bit (0xF1) rather than invalidating the sense data (0x70) in the LLD (Patch to the aacraid would be forthcoming following some testing). The Valid bit in the SCSI layer sometimes (too many kernels, I am talking generically) throws away all the content and keys of the sense data, at least deferred appears to report that the info does not accompany the sense data.

The change needs to be SCSI compliant and possibly endure back-porting to other kernels. So where are we going to put this, in each respective LLD (invalidating info, but not the keys), in sd.c (where the bad_lba is calculated) or in scsi_error.c (where the info field is picked up)?

Sincerely -- Mark Salyzyn

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] [RFC] sd: make error handling more robust
  2008-02-01 13:44   ` Salyzyn, Mark
@ 2008-02-01 16:15     ` Tony Battersby
  0 siblings, 0 replies; 12+ messages in thread
From: Tony Battersby @ 2008-02-01 16:15 UTC (permalink / raw)
  To: Salyzyn, Mark
  Cc: linux-scsi@vger.kernel.org, Mike Snitzer, ltuikov@yahoo.com,
	James Bottomley

Salyzyn, Mark wrote:
> Serendipity, been working an issue for the past week where error reporting from aacraid's HARDWARE_ERROR when an array was marked DEAD in the Adapter was not propagating to MD. The trigger for this investigation was "AACRAID driver broken in 2.6.22.x (and beyond?) [WAS: Re: 2.6.22.16 MD raid1 doesn't mark removed disk faulty, MD thread goes UN]" as authored my Mike Snitzer.
>
> I had attacked this yesterday by setting the deferred bit (0xF1) rather than invalidating the sense data (0x70) in the LLD (Patch to the aacraid would be forthcoming following some testing). The Valid bit in the SCSI layer sometimes (too many kernels, I am talking generically) throws away all the content and keys of the sense data, at least deferred appears to report that the info does not accompany the sense data.
>
> The change needs to be SCSI compliant and possibly endure back-porting to other kernels. So where are we going to put this, in each respective LLD (invalidating info, but not the keys), in sd.c (where the bad_lba is calculated) or in scsi_error.c (where the info field is picked up)?
>
> Sincerely -- Mark Salyzyn
>
>   
Have you tried setting VALID=1 and DEFERRED=0 (0xF0), with the sense
information bytes containing the LBA of the first requested sector from
the read/write CDB?  Note that if you support arrays > 2 TB, you may
have to use the newer descriptor sense format to return sector numbers
>= 2^32.

Tony


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] [RFC] sd: make error handling more robust
  2008-02-01  1:24 [PATCH] [RFC] sd: make error handling more robust Luben Tuikov
  2008-02-01 11:53 ` Luben Tuikov
@ 2008-02-01 15:46 ` Tony Battersby
  2008-02-01 16:09   ` James Bottomley
  2008-02-01 20:06   ` Luben Tuikov
  1 sibling, 2 replies; 12+ messages in thread
From: Tony Battersby @ 2008-02-01 15:46 UTC (permalink / raw)
  To: ltuikov; +Cc: James Bottomley, linux-scsi

Luben Tuikov wrote:
> This is due to the fact that the LU/LLDD returned
> sense data with VALID = 1, and so Bad LBA is assumed
> correct.  This is as per spec. At this moment given
> the situation you describe it seems that the LU/LLDD
> is behaving out of spec.
>
>   
I agree that my RAID (an external SCSI RAID connected via LSI Ultra320
HBA) is out-of-spec.  It should set VALID = 0 if the sense information
bytes are invalid.  I also agree that my patch is geared toward making
the kernel more robust when faced with out-of-spec devices.

> As to sanity checking, given the inequality above,
> and your peculiar target/LLDD returning broken
> sense data, a sanity check would be the following:
>
>    if (bad_lba < start_lba)
>          goto out;
>
> Added immediately before the computation of
> good_bytes.
>
>   
I agree that the simple check that you suggest would handle the
particular case of my out-of-spec RAID.  However, since we are
sanity-checking bad_lba, why not also add a check for the upper bound
too, in case there are other disks that are out-of-spec in a different
way?  I want to guarantee that sd_done() always returns a valid
good_bytes value regardless of what the sense data contains.

>> The following code in sd_done doesn't seem right to me:
>>
>>        if (driver_byte(result) != DRIVER_SENSE &&
>>            (!sense_valid || sense_deferred))
>>                 goto out;
>>
>> It would make more sense to use || rather than &&
>> for this test.
>>     
>
> That's very easy to verify. Simply negate the
> test and see if what you get makes sense. Case
> in point:
>
> Negate original:
>     if (driver_byte(result) == DRIVER_SENSE ||
>         (sense_valid && !sense_defered))
>             Inspect sense.
>
> Negate your proposed change "&&" -> "||":
>     if (driver_byte(result) == DRIVER_SENSE &&
>          (sense_valid && !sense_deferred))
>                 Inspect sense.
>
> The reason to use the former, rather the latter,
> is that LLDDs don't always set the driver_byte
> in the result, and may only set the status byte.
> E.g. to indicate CHECK CONDITION and then using
> scsi_command_normalize_sense() to find out if
> sense data is present.
>
>   
(Note: I edited the logic tests in the quote above.)

Here is the problem I see with the original: suppose the driver did
retrieve the sense data and set driver_byte(result) to DRIVER_SENSE, but
the sense data is invalid (corrupted on the wire, garbled by PCI,
etc.).  Then scsi_command_normalize_sense() will return sense_valid =
0.  In that case, the original code will check the invalid sense data
instead of bailing out (driver_byte(result) == DRIVER_SENSE, sense_valid
== 0, sense_deferred == 0).  Since scsi_command_normalize_sense()
memsets the scsi_sense_hdr to 0, this error will be treated like
SENSE_NO_SENSE, and sd_done will return good_bytes = xfer_size.

If LLDDs don't always set the driver byte, and we can't trust that
"driver_byte(result) & DRIVER_SENSE" is set if sense data was retrieved,
then I think the correct thing to do is to drop the check for it
entirely.  So we would have:

	if (result) {
		sense_valid = scsi_command_normalize_sense(SCpnt, &sshdr);
		if (sense_valid)
			sense_deferred = scsi_sense_is_deferred(&sshdr);
	}
	...
	if (!sense_valid || sense_deferred)
		goto out;

>> Finally, for the case of RECOVERED_ERROR/NO_SENSE, this
>> patch adds a
>> check of the data transfer residual before assuming that
>> the command
>> completed successfully.
>>     
>
> You cannot do this for NO SENSE SK.  You can do
> that for RECOVERED ERROR SK. The way to do this
> is to move "case RECOVERED_ERROR:" just above
> "case HARDWARE_ERROR:".
>
>   
I think you may have misunderstood what I wrote.  I am checking the
"data transfer residual", which is set by the LLDD to indicate how much
data was actually transferred over the wire to or from the device.  It
has nothing to do with sense data.  The data transfer residual (if
supported by the LLDD) should always be valid for any sense key, and
even also for commands that complete successfully.  For example, on a
command that is supposed to write 128 KB, if the LLDD reports that it
only transferred 64 KB to the drive, and the drive returns CHECK
CONDITION, RECOVERED ERROR with VALID = 0, then we know that the drive
could not have written more than 64 KB, since the LLDD never sent it all
the data.

The only reason not to use the data transfer residual is if you are
worried that there are LLDDs that don't calculate it correctly.  I can
vouch for a few LLDDs (aic7xxx, sym53c8xx, mptspi, mptfc, qla2xxx, and
iscsi_tcp).  I intentionally designed my patch to use the data transfer
residual conservatively in case a LLDD got it wrong, so I am not too
worried about it.

Also, I disagree about treating recovered error like hardware/medium
error.  Recovered error is supposed to mean "the last command completed
successfully, with some recovery action performed by the device
server".  The other errors mean that the command didn't complete
successfully.  If VALID=1, then the sense information bytes indicate
"the unsigned logical block address associated with the sense key".  So
for hardware/medium error, the sense info might indicate the bad LBA,
but for recovered error, the sense info might indicate the recovered
LBA.  So I think that it would be a mistake to treat them the same.

>    Luben
>   

Thanks for your comments.

Tony Battersby
Cybernetics

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] [RFC] sd: make error handling more robust
  2008-02-01 15:46 ` Tony Battersby
@ 2008-02-01 16:09   ` James Bottomley
  2008-02-01 20:06   ` Luben Tuikov
  1 sibling, 0 replies; 12+ messages in thread
From: James Bottomley @ 2008-02-01 16:09 UTC (permalink / raw)
  To: Tony Battersby; +Cc: ltuikov, linux-scsi

On Fri, 2008-02-01 at 10:46 -0500, Tony Battersby wrote:
> Luben Tuikov wrote:
> > This is due to the fact that the LU/LLDD returned
> > sense data with VALID = 1, and so Bad LBA is assumed
> > correct.  This is as per spec. At this moment given
> > the situation you describe it seems that the LU/LLDD
> > is behaving out of spec.
> >
> >   
> I agree that my RAID (an external SCSI RAID connected via LSI Ultra320
> HBA) is out-of-spec.  It should set VALID = 0 if the sense information
> bytes are invalid.  I also agree that my patch is geared toward making
> the kernel more robust when faced with out-of-spec devices.
> 
> > As to sanity checking, given the inequality above,
> > and your peculiar target/LLDD returning broken
> > sense data, a sanity check would be the following:
> >
> >    if (bad_lba < start_lba)
> >          goto out;
> >
> > Added immediately before the computation of
> > good_bytes.
> >
> >   
> I agree that the simple check that you suggest would handle the
> particular case of my out-of-spec RAID.  However, since we are
> sanity-checking bad_lba, why not also add a check for the upper bound
> too, in case there are other disks that are out-of-spec in a different
> way?  I want to guarantee that sd_done() always returns a valid
> good_bytes value regardless of what the sense data contains.

Actually, no, that's not the way we work.  The initial presumption has
to be that a SCSI device follows the spec it's advertising it follows.
If we have to second guess everything going on to and coming off the
wire on the grounds the device might not be following the spec we'd have
no subsystem left.

The way we handle out of spec devices varies, usually by blacklists or
flags.  In this case, since the path isn't hot it's only error, adding
an inline check looks fine.  However, making the check Byzantine "just
in case" isn't the way to do this.  The check should be the minimum that
catches the out of spec problem

i.e., the course of action almost precisely what Luben is advocating.
Could you please try his patch and see if it fixes your problem?

> >> The following code in sd_done doesn't seem right to me:
> >>
> >>        if (driver_byte(result) != DRIVER_SENSE &&
> >>            (!sense_valid || sense_deferred))
> >>                 goto out;
> >>
> >> It would make more sense to use || rather than &&
> >> for this test.
> >>     
> >
> > That's very easy to verify. Simply negate the
> > test and see if what you get makes sense. Case
> > in point:
> >
> > Negate original:
> >     if (driver_byte(result) == DRIVER_SENSE ||
> >         (sense_valid && !sense_defered))
> >             Inspect sense.
> >
> > Negate your proposed change "&&" -> "||":
> >     if (driver_byte(result) == DRIVER_SENSE &&
> >          (sense_valid && !sense_deferred))
> >                 Inspect sense.
> >
> > The reason to use the former, rather the latter,
> > is that LLDDs don't always set the driver_byte
> > in the result, and may only set the status byte.
> > E.g. to indicate CHECK CONDITION and then using
> > scsi_command_normalize_sense() to find out if
> > sense data is present.
> >
> >   
> (Note: I edited the logic tests in the quote above.)
> 
> Here is the problem I see with the original: suppose the driver did
> retrieve the sense data and set driver_byte(result) to DRIVER_SENSE, but
> the sense data is invalid (corrupted on the wire, garbled by PCI,
> etc.).  Then scsi_command_normalize_sense() will return sense_valid =
> 0.  In that case, the original code will check the invalid sense data
> instead of bailing out (driver_byte(result) == DRIVER_SENSE, sense_valid
> == 0, sense_deferred == 0).  Since scsi_command_normalize_sense()
> memsets the scsi_sense_hdr to 0, this error will be treated like
> SENSE_NO_SENSE, and sd_done will return good_bytes = xfer_size.

We solve the problem at hand, not the problem we think we could get.
Wire corruption detection is done elsewhere, usually by parity, crc or
other transport techniques.

> If LLDDs don't always set the driver byte, 

By the time you get to this code, the DRIVER_SENSE is *always* set and
the sense is *always* collected (if available).

> and we can't trust that
> "driver_byte(result) & DRIVER_SENSE" is set if sense data was retrieved,
> then I think the correct thing to do is to drop the check for it
> entirely.  So we would have:
> 
> 	if (result) {
> 		sense_valid = scsi_command_normalize_sense(SCpnt, &sshdr);
> 		if (sense_valid)
> 			sense_deferred = scsi_sense_is_deferred(&sshdr);
> 	}
> 	...
> 	if (!sense_valid || sense_deferred)
> 		goto out;
> 
> 
> 
> >> Finally, for the case of RECOVERED_ERROR/NO_SENSE, this
> >> patch adds a
> >> check of the data transfer residual before assuming that
> >> the command
> >> completed successfully.
> >>     
> >
> > You cannot do this for NO SENSE SK.  You can do
> > that for RECOVERED ERROR SK. The way to do this
> > is to move "case RECOVERED_ERROR:" just above
> > "case HARDWARE_ERROR:".
> >
> >   
> I think you may have misunderstood what I wrote.  I am checking the
> "data transfer residual", which is set by the LLDD to indicate how much
> data was actually transferred over the wire to or from the device.  It
> has nothing to do with sense data.  The data transfer residual (if
> supported by the LLDD) should always be valid for any sense key, and
> even also for commands that complete successfully.  For example, on a
> command that is supposed to write 128 KB, if the LLDD reports that it
> only transferred 64 KB to the drive, and the drive returns CHECK
> CONDITION, RECOVERED ERROR with VALID = 0, then we know that the drive
> could not have written more than 64 KB, since the LLDD never sent it all
> the data.

residuals are not universally supported by LLDs and shouldn't be relied
upon in error handling.

> The only reason not to use the data transfer residual is if you are
> worried that there are LLDDs that don't calculate it correctly.  I can
> vouch for a few LLDDs (aic7xxx, sym53c8xx, mptspi, mptfc, qla2xxx, and
> iscsi_tcp).  I intentionally designed my patch to use the data transfer
> residual conservatively in case a LLDD got it wrong, so I am not too
> worried about it.
> 
> Also, I disagree about treating recovered error like hardware/medium
> error.  Recovered error is supposed to mean "the last command completed
> successfully, with some recovery action performed by the device
> server".  The other errors mean that the command didn't complete
> successfully.  If VALID=1, then the sense information bytes indicate
> "the unsigned logical block address associated with the sense key".  So
> for hardware/medium error, the sense info might indicate the bad LBA,
> but for recovered error, the sense info might indicate the recovered
> LBA.  So I think that it would be a mistake to treat them the same.

I agree on this point.  The sense code does indicate the entire command
completed successfully, but there might have been an internal problem
with a flagged sector.  There's no real value to saying we only
completed up to the flagged sector, when in fact everything should have
been completed.

James



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] [RFC] sd: make error handling more robust
  2008-02-01 15:46 ` Tony Battersby
  2008-02-01 16:09   ` James Bottomley
@ 2008-02-01 20:06   ` Luben Tuikov
  2008-02-01 21:02     ` Tony Battersby
  1 sibling, 1 reply; 12+ messages in thread
From: Luben Tuikov @ 2008-02-01 20:06 UTC (permalink / raw)
  To: Tony Battersby; +Cc: James Bottomley, linux-scsi

--- On Fri, 2/1/08, Tony Battersby <tonyb@cybernetics.com> wrote:
> 
> Also, I disagree about treating recovered error like
> hardware/medium
> error.  Recovered error is supposed to mean "the last
> command completed
> successfully, with some recovery action performed by the
> device
> server".

Which then means that you agree with
commit 03aba2f7.

But the definition of RECOVERED ERROR immediately
after what you quoted, adds:
   "Details may be determined by examining the
additional sense bytes and the INFORMATION field."

Thus the patch I sent to you for you to try on
your hardware.

     Luben


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] [RFC] sd: make error handling more robust
  2008-02-01 20:06   ` Luben Tuikov
@ 2008-02-01 21:02     ` Tony Battersby
  2008-02-02  0:49       ` Luben Tuikov
  0 siblings, 1 reply; 12+ messages in thread
From: Tony Battersby @ 2008-02-01 21:02 UTC (permalink / raw)
  To: ltuikov; +Cc: James Bottomley, linux-scsi

Luben Tuikov wrote:
> --- On Fri, 2/1/08, Tony Battersby <tonyb@cybernetics.com> wrote:
>   
>> Also, I disagree about treating recovered error like
>> hardware/medium
>> error.  Recovered error is supposed to mean "the last
>> command completed
>> successfully, with some recovery action performed by the
>> device
>> server".
>>     
>
> Which then means that you agree with
> commit 03aba2f7.
>
>   

I disagree only with this part of the commit:

-                       good_bytes = (error_sector - SCpnt->request->sector) << 9;
-                       if (good_bytes < 0 || good_bytes >= this_count)
-                               good_bytes = 0;

So it removed the sanity-check on good_bytes, which broke error handling
for my out-of-spec RAID.  My patch adds the check back, only doing it
before the multiplication by the sector size rather than after.  That is
also why I wanted to add an upper-bound check, to make sure that sd_done
never returned good_bytes > xfer_size, but no one else agreed with that
level of paranoia.

> But the definition of RECOVERED ERROR immediately
> after what you quoted, adds:
>    "Details may be determined by examining the
> additional sense bytes and the INFORMATION field."
>
>   

I guess the question is: if a disk drive returns RECOVERED ERROR with
info_valid=1 and the sector number in the sense bytes, does that mean
that the disk completed the command successfully and transferred all the
data (and is reporting the sector number for information logging
purposes only), or does it mean that it stopped reading or writing at
the sector indicated in the sense data?  I can't really say for sure, so
I will leave the debate to others.

BTW, your patch will result in sd_done returning good_bytes == 0 for the
case where sense_key == RECOVERED ERROR && info_valid == 0, which I
think is probably wrong.  In this case I would return good_bytes == 0
for hardware/medium error and good_bytes == xfer_size for recovered error.

> Thus the patch I sent to you for you to try on
> your hardware.
>
>   
My hardware isn't returning "recovered error" or "no sense" sense keys;
I was just trying to improve the handling of these cases while I was
looking at the function.  Thus, there is no point for me to test your
full patch.  My problem is now solved with the simplified patch I
already posted.  If you want to push for the RECOVERED ERROR change,
then go right ahead with your own patch, but I'm done.

Tony

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] [RFC] sd: make error handling more robust
  2008-02-01 21:02     ` Tony Battersby
@ 2008-02-02  0:49       ` Luben Tuikov
  2008-02-04 14:34         ` Tony Battersby
  0 siblings, 1 reply; 12+ messages in thread
From: Luben Tuikov @ 2008-02-02  0:49 UTC (permalink / raw)
  To: Tony Battersby; +Cc: James Bottomley, linux-scsi

--- On Fri, 2/1/08, Tony Battersby <tonyb@cybernetics.com> wrote:
> 
> I disagree only with this part of the commit:
> 
> -                       good_bytes = (error_sector -
> SCpnt->request->sector) << 9;
> -                       if (good_bytes < 0 || good_bytes
> >= this_count)
> -                               good_bytes = 0;
> 
> So it removed the sanity-check on good_bytes, which broke
> error handling
> for my out-of-spec RAID.

Tony, I explained to you the inequality this
is based on, and gave you a patch which tested
for it (hunk #2, 2 lines) and it also tried to
accommodate your complaint about RECOVERED ERROR
(hunk #1, 3 lines), residual rather, which James addressed and apparently(?) I misunderstood.

You then took hunk #2 (2 lines) of the patch I sent
you and submitted it as your own, and then I acked
"your" patch.

I think it would've been much clearer if you had
singled out the problems you were seeing with your
HW and sent a single problem with a single patch per
single email.

> My patch adds the check back,
> only doing it
> before the multiplication by the sector size rather than
> after.  That is
> also why I wanted to add an upper-bound check, to make sure
> that sd_done
> never returned good_bytes > xfer_size, but no one else
> agreed with that
> level of paranoia.

I think James' email explained why. He also gave an
alternative way to implement one-off behaviour.

> I guess the question is: if a disk drive returns RECOVERED
> ERROR with
> info_valid=1 and the sector number in the sense bytes, does
> that mean
> that the disk completed the command successfully and
> transferred all the
> data (and is reporting the sector number for information
> logging
> purposes only), or does it mean that it stopped reading or
> writing at
> the sector indicated in the sense data?

The former.  It is what 03aba2f7 implements.

>  I can't really
> say for sure, so
> I will leave the debate to others.

The answer is in SBC3 and also in 03aba2f7.

> BTW, your patch will result in sd_done returning good_bytes
> == 0 for the
> case where sense_key == RECOVERED ERROR &&
> info_valid == 0, which I
> think is probably wrong.

Tony, I tried to accommodate your complaints
in the email that started this thread, and thus
the patch that we asked you to try.  RECOVERED
ERROR is handled as per spec in 03aba2f7.

RECOVERED ERROR, info_valid = X, commit
03aba2f7 will return good_bytes = xfer_size,
as per SBC3.

> In this case I would return
> good_bytes == 0
> for hardware/medium error and good_bytes == xfer_size for
> recovered error.

For HARDWARE/MEDIUM ERROR, good_bytes is computed,
depending on sense data.
It is 0 <= good_bytes < xfer_size (latter strict inequality).

For RECOVERED ERROR, good_bytes = xfer_size,
as you can see in 03aba2f7.

> My hardware isn't returning "recovered error"
> or "no sense" sense keys;

Sorry, I assumed you were since you were talking
about it and your original patch touched it.

> I was just trying to improve the handling of these cases
> while I was
> looking at the function.
> Thus, there is no point for me to
> test your
> full patch.  My problem is now solved with the simplified
> patch I
> already posted.  If you want to push for the RECOVERED
> ERROR change,
> then go right ahead with your own patch, but I'm done.

No, I'm not pushing for anything.  I already
acked "your" patch.  I was merely trying to
accommodate the concerns you had, as I explained
above.

03aba2f7 stands with "your" simplified patch.

If you still think there is something wrong
with drivers/scsi/sd.c, a kernel log with a 
spec reference and a patch from you would be
most welcome I assure you.

   Luben

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] [RFC] sd: make error handling more robust
  2008-02-02  0:49       ` Luben Tuikov
@ 2008-02-04 14:34         ` Tony Battersby
  2008-02-04 21:02           ` Luben Tuikov
  0 siblings, 1 reply; 12+ messages in thread
From: Tony Battersby @ 2008-02-04 14:34 UTC (permalink / raw)
  To: ltuikov; +Cc: James Bottomley, linux-scsi

Luben Tuikov wrote:
> You then took hunk #2 (2 lines) of the patch I sent
> you and submitted it as your own, and then I acked
> "your" patch.
>   
I _really_ _really_ hope that you don't believe that I am trying to take
credit for your work. If you take another look, my original patch had
the following hunk:

+
+		/* Make sure that bad_lba is one of the sectors that the
+		 * command was trying to access.
+		 */
+		if (bad_lba < start_lba ||
+		    bad_lba >= start_lba + xfer_size / sector_size)
+			goto out;
+


Your response patch had the following hunk:

+		if (bad_lba < start_lba)
+			goto out;


So I don't feel that it was dishonest for me to submit this as "my"
work. If you were offended, then I apologize.

> I think it would've been much clearer if you had
> singled out the problems you were seeing with your
> HW and sent a single problem with a single patch per
> single email.
>
>   
Agreed. Sometimes it is difficult to predict when something that seems
so straightforward will generate so much controversy.

Tony


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] [RFC] sd: make error handling more robust
  2008-02-04 14:34         ` Tony Battersby
@ 2008-02-04 21:02           ` Luben Tuikov
  0 siblings, 0 replies; 12+ messages in thread
From: Luben Tuikov @ 2008-02-04 21:02 UTC (permalink / raw)
  To: Tony Battersby; +Cc: James Bottomley, linux-scsi

--- On Mon, 2/4/08, Tony Battersby <tonyb@cybernetics.com> wrote:
> I _really_ _really_ hope that you don't believe that I
> am trying to take
> credit for your work. If you take another look, my original
> patch had
> the following hunk:
> 
> +
> +		/* Make sure that bad_lba is one of the sectors that the
> +		 * command was trying to access.
> +		 */
> +		if (bad_lba < start_lba ||
> +		    bad_lba >= start_lba + xfer_size / sector_size)
> +			goto out;
> +
> 
> 
> Your response patch had the following hunk:
> 
> +		if (bad_lba < start_lba)
> +			goto out;
> 
> 
> So I don't feel that it was dishonest for me to submit
> this as "my"
> work. If you were offended, then I apologize.

Oh, no, of course not.  The most important thing is
if it works for you and fixes your problem and makes
your customers happy (or you if you're a customer).

> > I think it would've been much clearer if you had
> > singled out the problems you were seeing with your
> > HW and sent a single problem with a single patch per
> > single email.
> >
> >   
> Agreed. Sometimes it is difficult to predict when something
> that seems
> so straightforward will generate so much controversy.

Nah, maybe a couple of misunderstandings (email tends to
do that), but it's all good.

I think it would've been so much better for everyone if
the RAID vendor had simply fixed their code to not
set VALID when INFORMATION is not valid (spec behaviour).
Since the bug lies in their code, that would've been
the proper course of action.  Instead, every other OS
which uses that RAID HW would have to adjust to this
RAID FW bug (if they haven't already).  Oh, well.

   Luben

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH] [RFC] sd: make error handling more robust
@ 2008-01-31 21:31 Tony Battersby
  0 siblings, 0 replies; 12+ messages in thread
From: Tony Battersby @ 2008-01-31 21:31 UTC (permalink / raw)
  To: James Bottomley, Luben Tuikov, linux-scsi

I have a RAID that returns a medium error on a read command.  The
"information bytes valid" bit is set in the sense data, but the
information bytes are zero:

CDB: 28 00 02 B0 62 00 00 00 02 00
Status:	02 (CHECK CONDITION)
Sense data:
F0 00 03 00 | 00 00 00 0A | 00 00 00 00 | 00 00 00 00
00 00

For HARDWARE_ERROR/MEDIUM_ERROR, sd_done assumes that the sense
information bytes contain the sector in error without sanity-checking
the value, so it ends up returning a completely bogus good_bytes value
greater than xfer_size.  This in turn causes the medium error to be
ignored and corrupt data to be returned to the user application.  This
problem was introduced by the patch "[SCSI] sd/scsi_lib simplify
sd_rw_intr and scsi_io_completion" in 2.6.18-rc1; with kernels 2.6.17
and before, the application correctly gets an I/O error instead.  This
patch fixes this particular problem by checking that bad_lba falls
within a sensible range before using it.

For a read command that returns HARDWARE_ERROR/MEDIUM_ERROR, I also
added the ability to calculate the amount of good data returned using
the data transfer residual if set by the LLDD.  If the both the sense
information bytes and the data transfer residual are valid, then they
are used to sanity-check each other.

The following code in sd_done doesn't seem right to me:

       if (driver_byte(result) != DRIVER_SENSE &&
           (!sense_valid || sense_deferred))
                goto out;

It would make more sense to use || rather than && for this test.  Even
better, this patch moves the check up higher so that the sense buffer
isn't even accessed unless driver_byte(result) & DRIVER_SENSE.

Finally, for the case of RECOVERED_ERROR/NO_SENSE, this patch adds a
check of the data transfer residual before assuming that the command
completed successfully.

I would like comments on the following:

sd_done doesn't check the data transfer residual for commands that
complete successfully.  In the unlikely case that the drive returns good
status without transferring all the data (due to a SCSI bus problem or
disk firmware bug), the error won't be detected.  I figured that it was
more likely for a LLDD to set resid incorrectly than for this unlikely
problem to happen, so I didn't add a check for it.  Agreed?

Is the new is_sd_cmnd_read_or_write() function necessary?  The original
code appears to use blk_fs_request(SCpnt->request) to determine if a
command is read or write.  Should I drop is_sd_cmnd_read_or_write() and
use blk_fs_request() instead, or it it OK like this?

Does anyone object to the new data transfer residual checks for
non-read/write commands that return RECOVERED_ERROR/NO_SENSE?  Should I
just drop this part of the patch for simplicity?

--- linux-2.6.24-git9/drivers/scsi/sd.c.orig	2008-01-31 16:24:04.000000000 -0500
+++ linux-2.6.24-git9/drivers/scsi/sd.c	2008-01-31 16:24:51.000000000 -0500
@@ -916,6 +916,29 @@ static struct block_device_operations sd
 	.revalidate_disk	= sd_revalidate_disk,
 };
 
+static int is_sd_cmnd_read_or_write(struct scsi_cmnd *cmd)
+{
+	int is_rw;
+
+	switch (cmd->cmnd[0]) {
+	case READ_6 :
+	case WRITE_6 :
+	case READ_10 :
+	case WRITE_10 :
+	case READ_12 :
+	case WRITE_12 :
+	case READ_16 :
+	case WRITE_16 :
+		is_rw = 1;
+		break;
+
+	default :
+		is_rw = 0;
+	}
+
+	return is_rw;
+}
+
 /**
  *	sd_done - bottom half handler: called when the lower level
  *	driver has completed (successfully or otherwise) a scsi command.
@@ -928,14 +951,16 @@ static int sd_done(struct scsi_cmnd *SCp
 	int result = SCpnt->result;
 	unsigned int xfer_size = scsi_bufflen(SCpnt);
  	unsigned int good_bytes = result ? 0 : xfer_size;
- 	u64 start_lba = SCpnt->request->sector;
+	unsigned int sector_size;
+	u64 start_lba;
  	u64 bad_lba;
 	struct scsi_sense_hdr sshdr;
 	int sense_valid = 0;
 	int sense_deferred = 0;
 	int info_valid;
+	int resid;
 
-	if (result) {
+	if (driver_byte(result) & DRIVER_SENSE) {
 		sense_valid = scsi_command_normalize_sense(SCpnt, &sshdr);
 		if (sense_valid)
 			sense_deferred = scsi_sense_is_deferred(&sshdr);
@@ -951,23 +976,53 @@ static int sd_done(struct scsi_cmnd *SCp
 						   sshdr.ascq));
 	}
 #endif
-	if (driver_byte(result) != DRIVER_SENSE &&
-	    (!sense_valid || sense_deferred))
+	if (!sense_valid || sense_deferred)
 		goto out;
 
+	sector_size = SCpnt->device->sector_size;
+
+	resid = scsi_get_resid(SCpnt);
+	if (resid < 0)
+		resid = 0;
+	else if ((unsigned) resid > xfer_size)
+		resid = xfer_size;
+
 	switch (sshdr.sense_key) {
 	case HARDWARE_ERROR:
 	case MEDIUM_ERROR:
-		if (!blk_fs_request(SCpnt->request))
+		if (!is_sd_cmnd_read_or_write(SCpnt))
 			goto out;
+
+		/* For read commands, use the data transfer residual (if
+		 * supported by the LLDD) to calculate the amount of good data
+		 * actually returned.  This doesn't work for write commands,
+		 * since the drive may accept the data into its buffer, but
+		 * then not write it to the medium.  Assume that resid == 0
+		 * means that the LLDD didn't set it, since if the drive
+		 * really returned all the data, then it shouldn't have
+		 * returned an error also.
+		 */
+		if ((SCpnt->sc_data_direction == DMA_FROM_DEVICE) &&
+		    (resid != 0) &&
+		    (sector_size != 0)) {
+			good_bytes = xfer_size - resid;
+			good_bytes -= good_bytes % sector_size;
+			/* Check the sense data also.
+			 */
+		}
+
+		/* The drive may return the LBA of the sector with the error in
+		 * the sense information bytes.
+		 */
 		info_valid = scsi_get_sense_info_fld(SCpnt->sense_buffer,
 						     SCSI_SENSE_BUFFERSIZE,
 						     &bad_lba);
 		if (!info_valid)
 			goto out;
-		if (xfer_size <= SCpnt->device->sector_size)
+		if (xfer_size <= sector_size)
 			goto out;
-		switch (SCpnt->device->sector_size) {
+		start_lba = SCpnt->request->sector;
+		switch (sector_size) {
 		case 256:
 			start_lba <<= 1;
 			break;
@@ -987,13 +1042,72 @@ static int sd_done(struct scsi_cmnd *SCp
 			goto out;
 			break;
 		}
+
+		/* Make sure that bad_lba is one of the sectors that the
+		 * command was trying to access.
+		 */
+		if (bad_lba < start_lba ||
+		    bad_lba >= start_lba + xfer_size / sector_size)
+			goto out;
+
 		/* This computation should always be done in terms of
 		 * the resolution of the device's medium.
 		 */
-		good_bytes = (bad_lba - start_lba)*SCpnt->device->sector_size;
+		good_bytes = (bad_lba - start_lba) * sector_size;
+
+		/* Check the computed value against the amount of data the
+		 * command actually transferred.
+		 */
+		if (good_bytes > xfer_size - resid) {
+			good_bytes = xfer_size - resid;
+			good_bytes -= good_bytes % sector_size;
+		}
 		break;
 	case RECOVERED_ERROR:
 	case NO_SENSE:
+		if (is_sd_cmnd_read_or_write(SCpnt)) {
+			/* Read/write commands: if all data transferred, then
+			 * consider the command to have completed successfully.
+			 * Otherwise, calculate good_bytes based on the actual
+			 * data transfer length rounded down to the nearest
+			 * sector.
+			 */
+			if (resid != 0) {
+				good_bytes = xfer_size - resid;
+				if (sector_size != 0)
+					good_bytes -= good_bytes % sector_size;
+				goto out;
+			}
+		} else {
+			switch (SCpnt->sc_data_direction) {
+			case DMA_TO_DEVICE:
+				/* Data-out commands (e.g. mode select): if all
+				 * data transferred, then consider the command
+				 * to have completed successfully.  Otherwise,
+				 * consider it an error.
+				 */
+				if (resid != 0)
+					goto out;
+				break;
+			case DMA_FROM_DEVICE:
+				/* Data-in commands (e.g. mode sense): if some
+				 * data transferred, then consider the command
+				 * to have completed successfully.  If no data
+				 * transferred, then consider it an error.
+				 * Note that it is normal for data-in commands
+				 * to transfer less than requested.
+				 */
+				if ((resid != 0) && (resid == xfer_size))
+					goto out;
+				break;
+			default:
+				/* Non-data commands: consider the command to
+				 * have completed successfully.
+				 */
+				break;
+			}
+		}
+
 		/* Inform the user, but make sure that it's not treated
 		 * as a hard error.
 		 */



^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2008-02-04 21:02 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-02-01  1:24 [PATCH] [RFC] sd: make error handling more robust Luben Tuikov
2008-02-01 11:53 ` Luben Tuikov
2008-02-01 13:44   ` Salyzyn, Mark
2008-02-01 16:15     ` Tony Battersby
2008-02-01 15:46 ` Tony Battersby
2008-02-01 16:09   ` James Bottomley
2008-02-01 20:06   ` Luben Tuikov
2008-02-01 21:02     ` Tony Battersby
2008-02-02  0:49       ` Luben Tuikov
2008-02-04 14:34         ` Tony Battersby
2008-02-04 21:02           ` Luben Tuikov
  -- strict thread matches above, loose matches on Subject: below --
2008-01-31 21:31 Tony Battersby

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox