* Questions about [SCSI] modify scsi to handle new fail fast flags.
@ 2009-03-13 15:37 Konrad Rzeszutek
2009-03-13 16:24 ` James Bottomley
From: Konrad Rzeszutek @ 2009-03-13 15:37 UTC
To: Mike Christie, linux-scsi
Hey Mike,
I was having a problem with one of the SATA controllers at work and traced it
down to the enclosure. But during that time I found out that the error handling
of bios issued to the SCSI block device vs. the multipath block driver differs.
The "scsi_eh_flush_done_q" function is the one that controls whether the I/Os
should be retried or returned. Previously it would check the
bio->flags to see if the REQ_FAILFAST* attributes were set, but nowadays
it is more discriminating and, depending on host_byte(scmd->result), figures
out whether it needs to check for the REQ_FAILFAST* attribute.
The end result is that bios mapped through multipath are handled with the same
logic as those issued to the SCSI block device. Compared to the kernels in
RHEL 5.3 and SLES 10, the behaviour is different.
So my question is: is that OK? From the perspective of dm-mpath it looks as
if the device driver now does the retry/re-issue and ignores the FAILFAST
attribute, instead of allowing dm-mpath to be the ...umm "brain" behind this.
Oh, the git commit that causes this behavior is this one:
commit 4a27446f3e39b06c28d1c8e31d33a5340826ed5c
Author: Mike Christie <michaelc@cs.wisc.edu>
Date: Tue Aug 19 18:45:31 2008 -0500
[SCSI] modify scsi to handle new fail fast flags.
This checks the errors the scsi-ml determined were retryable
and returns if we should fast fail it based on the request
fail fast flags.
Without the patch, drivers like lpfc, qla2xxx and fcoe would return
DID_ERROR for what they determine is a temporary communication problem.
There is no loss of connectivity at that time and the driver thinks
that it would be fast to retry at the driver level. SCSI-ml will however
see fast fail on the request and DID_ERROR and will fast fail the io.
This will then cause dm-multipath to fail the path and possibly switch
target controllers when we should be retrying at the scsi layer.
We were also fast failing device errors to dm-multipath when,
unless the scsi_dh modules think otherwise, we want to retry at
the scsi layer, because multipath can only retry the IO like scsi
should have done. multipath is a little dumber though because it
does not know what the error was for and assumes that it should fail
the paths.
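
As a rough illustration of the change described above, the following user-space sketch contrasts the old decision (any fail fast flag ends the retry) with the new one (only the flag matching the error class ends it). It is not the kernel code: the names FF_DEV, FF_TRANSPORT, FF_DRIVER, the err_seen values and both functions are hypothetical stand-ins, and the mapping from error to flag is an assumption made for the example.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the per-class fail fast request flags. */
enum ff_flag {
	FF_DEV       = 1 << 0,
	FF_TRANSPORT = 1 << 1,
	FF_DRIVER    = 1 << 2,
};

/* Illustrative error outcomes; these are not the kernel's host byte values. */
enum err_seen {
	SEEN_NONE,
	SEEN_BUS_BUSY,      /* assumed to count as a transport problem */
	SEEN_DRIVER_ERROR,  /* e.g. a driver reporting a temporary problem */
	SEEN_DEVICE_CHECK,  /* device answered with a retryable error */
};

/* Old behaviour as described in the mail: any fail fast flag stops the retry. */
static bool old_noretry(unsigned int flags, enum err_seen err)
{
	(void)err; /* the kind of error was not consulted */
	return flags != 0;
}

/* New behaviour: only the flag matching the error class stops the retry. */
static bool new_noretry(unsigned int flags, enum err_seen err)
{
	switch (err) {
	case SEEN_BUS_BUSY:
		return (flags & FF_TRANSPORT) != 0;
	case SEEN_DRIVER_ERROR:
		return (flags & FF_DRIVER) != 0;
	case SEEN_DEVICE_CHECK:
		return (flags & FF_DEV) != 0;
	default:
		return false;
	}
}
```

With a request that carries only the transport flag, old_noretry() gives up on a driver error straight away, while new_noretry() lets the SCSI layer retry it and only gives up for the transport case.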
* Re: Questions about [SCSI] modify scsi to handle new fail fast flags.
2009-03-13 15:37 Questions about [SCSI] modify scsi to handle new fail fast flags Konrad Rzeszutek
@ 2009-03-13 16:24 ` James Bottomley
From: James Bottomley @ 2009-03-13 16:24 UTC
To: Konrad Rzeszutek; +Cc: Mike Christie, linux-scsi
On Fri, 2009-03-13 at 11:37 -0400, Konrad Rzeszutek wrote:
> Hey Mike,
>
> I was having a problem with one of the SATA controllers at work and traced it
> down to the enclosure. But during that time I found out that the error handling
> of bios issued to the SCSI block device vs. the multipath block driver differs.
>
> The "scsi_eh_flush_done_q" function is the one that controls whether the I/Os
> should be retried or returned. Previously it would check the
> bio->flags to see if the REQ_FAILFAST* attributes were set, but nowadays
> it is more discriminating and, depending on host_byte(scmd->result), figures
> out whether it needs to check for the REQ_FAILFAST* attribute.
>
> The end result is that bios mapped through multipath are handled with the same
> logic as those issued to the SCSI block device. Compared to the kernels in
> RHEL 5.3 and SLES 10, the behaviour is different.
>
> So my question is: is that OK? From the perspective of dm-mpath it looks as
> if the device driver now does the retry/re-issue and ignores the FAILFAST
> attribute, instead of allowing dm-mpath to be the ...umm "brain" behind this.
I can try: SCSI errors fall into three categories:
     1. Driver or Host (as in kmalloc failed while the driver was
        sending the packet out or the driver firmware failed and needs
        resetting).
     2. Transport errors (Parity/CRC error on the bus, transport failed
        to connect to end device etc)
     3. Device errors (device actually responded with an error, like
        MEDIUM_ERROR or NOT_READY).
You can see that errors in category 3 (Device) aren't going to change if
you simply change path, so we should be doing normal error handling.
Likewise, errors in category 1 (Driver or Host) are in the same bucket;
worse, some errors in this category (like firmware failure) need SCSI
error handling to fix: if we switch path they never recover and your
availability reduces.
The idea of the changes was to separate known transport errors (which
could be fixed by switching paths) from other errors which couldn't and
should still be handled via normal eh procedures (previously we just
switched paths for every retryable error and never got around to
actually seeing if the SCSI error handler could fix it).
James
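
The three categories in the reply above can be sketched as a small lookup. This is only an illustration of the reasoning, not code from the SCSI mid-layer; the enum names and pick_recovery() are hypothetical.

```c
/* Hypothetical names for the three error categories listed in the reply. */
enum err_category {
	ERR_DRIVER_OR_HOST = 1, /* e.g. allocation failure, firmware needs reset */
	ERR_TRANSPORT      = 2, /* e.g. CRC error, cannot reach the end device */
	ERR_DEVICE         = 3, /* device itself answered with an error */
};

/* Where recovery should be attempted first. */
enum recovery {
	RECOVER_SCSI_EH,     /* normal SCSI error handling and retry */
	RECOVER_SWITCH_PATH, /* let multipath try another path */
};

/*
 * Only transport errors are expected to go away on a different path;
 * driver/host and device errors stay with normal error handling.
 */
static enum recovery pick_recovery(enum err_category cat)
{
	switch (cat) {
	case ERR_TRANSPORT:
		return RECOVER_SWITCH_PATH;
	case ERR_DRIVER_OR_HOST:
	case ERR_DEVICE:
	default:
		return RECOVER_SCSI_EH;
	}
}
```

The default case deliberately falls back to SCSI error handling, which matches the point that switching paths for an error it cannot fix only reduces availability.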