[PATCH 4/4] 2.4 SCSI error handling fixes

public inbox for linux-scsi@vger.kernel.org

* [PATCH 4/4] 2.4 SCSI error handling fixes
@ 2002-09-12 18:18 Russell King
  2002-09-23 15:54 ` James Bottomley
  0 siblings, 1 reply; 3+ messages in thread
From: Russell King @ 2002-09-12 18:18 UTC
  To: linux-scsi

I've been chasing a few SCSI problems today, and here are 4 patches to
address the issues I found.  These patches are against 2.4.19.

My main aim here was to solve the bad behaviour with a medium error.
Some problems were found in my HBA driver, but some were found in
the error handling code.

Problems found were:

a) retrying a command with stale or invalid command information.
   (Patches 1 and 2)
b) reporting the wrong command on IO error.
   (Patch 3)
c) drives behaving badly in error condition, reporting success without
   data transfer.
   (Patch 4)

Results of my investigation into these issues can be found in my lkml
messages at:

  http://marc.theaimsgroup.com/?l=linux-kernel&m=103184937214222&w=2

Patch 4 is the one I'm least happy about; we shouldn't unconditionally
set the FUA bit in READ10 commands.  It seems that we need to set this
bit only after receiving an error, and only when we try to read the
remaining blocks which we believe are good.
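
A minimal sketch of that idea, assuming a hypothetical helper
build_read10() that is not part of sd.c: the caller decides per command
whether FUA is set, instead of setting it unconditionally.  The opcode
and the bit position (bit 3 of CDB byte 1) match the READ10 layout used
in the patch below.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define READ_10_OPCODE 0x28       /* SCSI READ(10) operation code */
#define CDB_FUA        (1u << 3)  /* Force Unit Access: bit 3 of CDB byte 1 */

/*
 * Hypothetical helper (illustration only): fill in a 10-byte READ(10)
 * CDB.  FUA is set only when the caller asks for it, e.g. when
 * re-reading the blocks that follow a reported medium error.
 */
static void build_read10(uint8_t cdb[10], uint32_t block, uint16_t count,
                         int force_unit_access)
{
	assert(cdb != NULL);
	memset(cdb, 0, 10);

	cdb[0] = READ_10_OPCODE;
	if (force_unit_access)
		cdb[1] |= CDB_FUA;

	/* Logical block address, big-endian, in bytes 2..5 */
	cdb[2] = (uint8_t)(block >> 24);
	cdb[3] = (uint8_t)(block >> 16);
	cdb[4] = (uint8_t)(block >> 8);
	cdb[5] = (uint8_t)block;

	/* Transfer length in blocks, big-endian, in bytes 7..8 */
	cdb[7] = (uint8_t)(count >> 8);
	cdb[8] = (uint8_t)count;
}
```

A normal read would call build_read10(cdb, block, count, 0), so the
drive's cache is still used; only the post-error read would pass 1.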

Patch 4: set FUA bit for READ10/WRITE10 commands.

*** NOTE *** This needs a better solution.  Setting the
FUA bit bypasses the drive's cache altogether.  You should
read the message at the URL above *before* commenting for
a detailed explanation of the problem, and why I've had
to make this change.

--- orig/drivers/scsi/sd.c	Mon Aug  5 13:31:25 2002
+++ linux/drivers/scsi/sd.c	Thu Sep 12 17:55:55 2002
@@ -399,6 +399,7 @@
 			this_count = 0xffff;

 		SCpnt->cmnd[0] += READ_10 - READ_6;
+		SCpnt->cmnd[1] |= 1 << 3; /* Set FUA --rmk */
 		SCpnt->cmnd[2] = (unsigned char) (block >> 24) & 0xff;
 		SCpnt->cmnd[3] = (unsigned char) (block >> 16) & 0xff;
 		SCpnt->cmnd[4] = (unsigned char) (block >> 8) & 0xff;

-- 
Russell King (rmk@arm.linux.org.uk)                The developer of ARM Linux
             http://www.arm.linux.org.uk/personal/aboutme.html


* Re: [PATCH 4/4] 2.4 SCSI error handling fixes
  2002-09-12 18:18 [PATCH 4/4] 2.4 SCSI error handling fixes Russell King
@ 2002-09-23 15:54 ` James Bottomley
  2002-09-23 18:57   ` Russell King
  0 siblings, 1 reply; 3+ messages in thread
From: James Bottomley @ 2002-09-23 15:54 UTC
  To: Russell King; +Cc: linux-scsi

rmk@arm.linux.org.uk said:
> *** NOTE *** This needs a better solution.  Setting the FUA bit
> bypasses the drive's cache altogether.  You should read the message at
> the URL above *before* commenting for a detailed explanation of the
> problem, and why I've had to make this change.

I read the message.  I think your drive is out of spec in its behaviour.

Patches 1-3 look fine. Patch 4 would kill our performance on huge cache RAID 
arrays (guess what most benchmarks are done on).

I think the correct (but much more invasive) fix for this is to return all 
MEDIUM errors immediately to the upper level driver (sd or sr) and let them do 
error recovery (i.e. determine whether retry should be done).  Once the upper 
level drivers do the retry, you can reliably add the FUA bit to the command 
for the retry.  Thus we'd only use FUA in the retry case and normal caching 
performance shouldn't be impacted.

Do you want to try doing it this way?

James


* Re: [PATCH 4/4] 2.4 SCSI error handling fixes
  2002-09-23 15:54 ` James Bottomley
@ 2002-09-23 18:57   ` Russell King
  0 siblings, 0 replies; 3+ messages in thread
From: Russell King @ 2002-09-23 18:57 UTC
  To: James Bottomley; +Cc: linux-scsi

On Mon, Sep 23, 2002 at 11:54:27AM -0400, James Bottomley wrote:
> rmk@arm.linux.org.uk said:
> > *** NOTE *** This needs a better solution.  Setting the FUA bit
> > bypasses the drive's cache altogether.  You should read the message at
> > the URL above *before* commenting for a detailed explanation of the
> > problem, and why I've had to make this change.
> 
> I read the message.  I think your drive is out of spec in its behaviour.

I do too.  But do you want to take a chance that other drives don't behave
the same way?

> Patches 1-3 look fine. Patch 4 would kill our performance on huge cache RAID 
> arrays (guess what most benchmarks are done on).
> 
> I think the correct (but much more invasive) fix for this is to return all 
> MEDIUM errors immediately to the upper level driver (sd or sr) and let them do 
> error recovery (i.e. determine whether retry should be done).  Once the upper 
> level drivers do the retry, you can reliably add the FUA bit to the command 
> for the retry.  Thus we'd only use FUA in the retry case and normal caching 
> performance shouldn't be impacted.
> 
> Do you want to try doing it this way?

No, because I don't know enough about the upper layers.

I thought I made the above crystal clear as well in my patch explanations.
Also, I believe David Brownell found a way to achieve the desired result
without killing performance.

I'm trying to sort out the error handling at the moment, which appears to
be rather bogus.  It simply doesn't work, even for _simple_ unintelligent
HBA drivers.

For example, if I force a permanent parity error to one selected device
and insert the HBA driver, the error handling kicks in and totally screws
the SCSI subsystem - it leaves requests behind in the driver and expects
a mere request for abort (not reset) to clear them.  The only thing that
can clear a stuck bus from an initiator is a bus reset.  The scsi error
handling doesn't do that.  It just expects to be able to queue another
command and have everything work.
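
To illustrate the kind of escalation this implies, here is a rough
sketch.  The enum and eh_escalate() are hypothetical names made up for
this example; this is not the 2.4.19 error handler, only one way a
failed abort could lead to progressively stronger recovery steps,
including the bus reset mentioned above.

```c
#include <assert.h>

/* Hypothetical recovery steps, ordered from least to most disruptive. */
enum eh_step {
	EH_ABORT,		/* ask the HBA driver to abort the command */
	EH_DEVICE_RESET,	/* reset the one target */
	EH_BUS_RESET,		/* reset the whole bus (clears a stuck bus) */
	EH_HOST_RESET,		/* reset the adapter */
	EH_OFFLINE		/* give up: mark the device dead */
};

/*
 * Given the step that just failed, return the next one to try.
 * The command is not requeued until some step has succeeded.
 */
static enum eh_step eh_escalate(enum eh_step failed)
{
	assert(failed >= EH_ABORT && failed <= EH_OFFLINE);

	switch (failed) {
	case EH_ABORT:		return EH_DEVICE_RESET;
	case EH_DEVICE_RESET:	return EH_BUS_RESET;
	case EH_BUS_RESET:	return EH_HOST_RESET;
	default:		return EH_OFFLINE;
	}
}
```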

Another bug I've found is that if we declare a device dead, we tell the
upper levels that the command completed successfully, even if it went
through all the error handling and failed!

Anyway, I'm currently trying to sort out these messes in my current 2.4.19
tree.  Hopefully I'll be able to get the Linux error handling into some
sort of usable state in the coming months.

-- 
Russell King (rmk@arm.linux.org.uk)                The developer of ARM Linux
             http://www.arm.linux.org.uk/personal/aboutme.html
