fastfail operation and retries

public inbox for linux-scsi@vger.kernel.org

* fastfail operation and retries
@ 2005-04-19 17:19 Andreas Herrmann
  2005-04-21 16:42 ` Patrick Mansfield
  0 siblings, 1 reply; 7+ messages in thread
From: Andreas Herrmann @ 2005-04-19 17:19 UTC
  To: Linux SCSI

Hi,

I have question(s) regarding the fastfail operation of the SCSI stack.

Performing multipath-tests with an IBM ESS I encountered problems.
During certain operations on an ESS (quiesce/resume and such) requests
on all paths fail temporarily with a data underrun (resid is set in
the FCP-response).  In another situation abort sequences happen (see
FC-FS).

In both cases it is not a path failure but the device (ESS) reports
error conditions temporarily (some seconds).

Now on error on the first path the multipath layer initiates failover
to other available path(s) where requests will immediately fail.

Using linux-2.4 and LVM such problems did not occur. There were
enough retries (5 for each path) to handle such situations.

Now if the FASTFAIL flag is set the SCSI stack prevents retries for
failed SCSI commands.

Problem is that the multipath layer cannot distinguish between path
and device failures (and won't do any retries for the failed request
on the same path anyway).

How can an lld force the SCSI stack to retry a failed scsi-command
(without using DID_REQUEUE or DID_IMM_RETRY, neither of which changes
the retry counter)?

What about a DID_FORCE_RETRY?  Or is there any outlook on when there
will be a better interface between the SCSI stack and the multipath
layer to properly handle retries?
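
A minimal sketch of the retry semantics being asked about, as a toy
model and not kernel code. Every name below (the TOY_ prefix, the
struct, the function) is hypothetical; in particular TOY_DID_FORCE_RETRY
only stands in for the proposed DID_FORCE_RETRY, which does not exist.
The assumption is that an immediate retry never touches the retry
counter, while a forced retry is counted against the budget even when
fastfail is set:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: none of these names are kernel symbols. */
enum toy_host_status {
    TOY_DID_ERROR,       /* ordinary failure: counted retry unless fastfail */
    TOY_DID_IMM_RETRY,   /* requeue at once, retry counter untouched */
    TOY_DID_FORCE_RETRY  /* proposed code: counted retry even with fastfail */
};

enum toy_disposition { TOY_FAIL, TOY_RETRY };

struct toy_cmd {
    int retries;   /* counted retries so far */
    int allowed;   /* retry budget, e.g. 5 per path as under linux-2.4 */
    bool fastfail; /* request carries the FASTFAIL flag */
};

static enum toy_disposition toy_decide(struct toy_cmd *cmd,
                                       enum toy_host_status st)
{
    assert(cmd->allowed >= 0);

    if (st == TOY_DID_IMM_RETRY)
        return TOY_RETRY;           /* never consumes the budget */

    if (st == TOY_DID_ERROR && cmd->fastfail)
        return TOY_FAIL;            /* fastfail: hand the error up at once */

    if (cmd->retries < cmd->allowed) {
        cmd->retries++;             /* counted retry */
        return TOY_RETRY;
    }
    return TOY_FAIL;                /* budget exhausted */
}
```

With a budget of 2 and fastfail set, a plain error fails immediately,
while the forced variant is retried twice and then fails, so a device
that recovers within a few seconds would get a chance on the same path.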

Regards,

Andreas


* Re: fastfail operation and retries
@ 2005-04-20  8:10 Andreas Herrmann
  0 siblings, 0 replies; 7+ messages in thread
From: Andreas Herrmann @ 2005-04-20  8:10 UTC
  To: 由渊霞; +Cc: Linux SCSI

On 20.04.2005 03:17, 由渊霞 <yxyou@yahoo.com.cn> wrote:
 
> what multipath are you using? Software, or hardware,
> or both?

We are using udm with evms (Linux on zSeries).
Hardware setup is:
- switched fabric FC-SAN,
- 4 paths to each FC-LUN on the ESS 800

All 4 paths are "failing fast" during operations on
the ESS and our stress test tool encounters I/O errors.


Regards,

Andreas



* Re: fastfail operation and retries
  2005-04-19 17:19 fastfail operation and retries Andreas Herrmann
@ 2005-04-21 16:42 ` Patrick Mansfield
  2005-04-21 19:54   ` Lars Marowsky-Bree
  0 siblings, 1 reply; 7+ messages in thread
From: Patrick Mansfield @ 2005-04-21 16:42 UTC
  To: Andreas Herrmann; +Cc: Linux SCSI, dm-devel

On Tue, Apr 19, 2005 at 07:19:53PM +0200, Andreas Herrmann wrote:
> Hi,
> 
> I have question(s) regarding the fastfail operation of the SCSI stack.
> 
> Performing multipath-tests with an IBM ESS I encountered problems.
> During certain operations on an ESS (quiesce/resume and such) requests
> on all paths fail temporarily with a data underrun (resid is set in
> the FCP-response).  In another situation abort sequences happen (see
> FC-FS).
> 
> In both cases it is not a path failure but the device (ESS) reports
> error conditions temporarily (some seconds).
> 
> Now on error on the first path the multipath layer initiates failover
> to other available path(s) where requests will immediately fail.
> 
> Using linux-2.4 and LVM such problems did not occur. There were
> enough retries (5 for each path) to handle such situations.
> 
> Now if the FASTFAIL flag is set the SCSI stack prevents retries for
> failed SCSI commands.
> 
> Problem is that the multipath layer cannot distinguish between path
> and device failures (and won't do any retries for the failed request
> on the same path anyway).
> 
> How can an lld force the SCSI stack to retry a failed scsi-command
> (without using DID_REQUEUE or DID_IMM_RETRY, which both do not change
> the retry counter).
> 
> What about a DID_FORCE_RETRY ?  Or is there any outlook when there
> will be a better interface between the SCSI stack and the multipath
> layer to properly handle retries.

We need a patch like Mike Christie had, this:

http://marc.theaimsgroup.com/?l=linux-kernel&m=107961883914541&w=2

The scsi core should decode the sense data and pass up the result, then dm
need not decode sense data, and we don't need sense data passed around via
the block layer.

scsi core could be changed to handle device specific decoding via sense
tables that can be modified via sysfs, similar to devinfo code (well,
devinfo still lacks a sysfs interface).

For ESS, you probably also need the BLIST_RETRY_HWERROR that is in
current 2.6.12 rc.

-- Patrick Mansfield


* Re: Re: fastfail operation and retries
  2005-04-21 16:42 ` Patrick Mansfield
@ 2005-04-21 19:54   ` Lars Marowsky-Bree
  2005-04-21 22:13     ` Patrick Mansfield
  0 siblings, 1 reply; 7+ messages in thread
From: Lars Marowsky-Bree @ 2005-04-21 19:54 UTC
  To: device-mapper development, Andreas Herrmann; +Cc: Linux SCSI

On 2005-04-21T09:42:05, Patrick Mansfield <patmans@us.ibm.com> wrote:

> On Tue, Apr 19, 2005 at 07:19:53PM +0200, Andreas Herrmann wrote:
> > Hi,
> > 
> > I have question(s) regarding the fastfail operation of the SCSI stack.
> > 
> > Performing multipath-tests with an IBM ESS I encountered problems.
> > During certain operations on an ESS (quiesce/resume and such) requests
> > on all paths fail temporarily with a data underrun (resid is set in
> > the FCP-response).  In another situation abort sequences happen (see
> > FC-FS).
> > 
> > In both cases it is not a path failure but the device (ESS) reports
> > error conditions temporarily (some seconds).
> > 
> > Now on error on the first path the multipath layer initiates failover
> > to other available path(s) where requests will immediately fail.
> > 
> > Using linux-2.4 and LVM such problems did not occur. There were
> > enough retries (5 for each path) to handle such situations.
> > 
> > Now if the FASTFAIL flag is set the SCSI stack prevents retries for
> > failed SCSI commands.
> > 
> > Problem is that the multipath layer cannot distinguish between path
> > and device failures (and won't do any retries for the failed request
> > on the same path anyway).
> > 
> > How can an lld force the SCSI stack to retry a failed scsi-command
> > (without using DID_REQUEUE or DID_IMM_RETRY, which both do not change
> > the retry counter).
> > 
> > What about a DID_FORCE_RETRY ?  Or is there any outlook when there
> > will be a better interface between the SCSI stack and the multipath
> > layer to properly handle retries.
> 
> We need a patch like Mike Christie had, this:
> 
> http://marc.theaimsgroup.com/?l=linux-kernel&m=107961883914541&w=2
> 
> The scsi core should decode the sense data and pass up the result, then dm
> need not decode sense data, and we don't need sense data passed around via
> the block layer.

The most recent udm patchset has a patch by Jens Axboe and myself to
pass up sense data / error codes in the bio so the dm mpath module can
deal with it.  

The only remaining issue is that the SCSI midlayer generates just a single
"EIO" code, for timeouts as well; however, that pretty much means it's a
transport error, because if it was a media error, we'd be getting sense
data ;-)

Together with the "queue_if_no_path" feature flag for dm-mpath that
should do what you need to handle this (arguably broken) array
behaviour: It'll queue until the error goes away and multipathd retests
and reactivates the paths. That ought to work, but given that I don't
have an IBM ESS accessible, please confirm that.
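
As a rough illustration of the queueing behaviour described above, here
is a toy model in C. It is not dm-mpath code: the struct, the fixed-size
queue and both functions are hypothetical, and only the name of the
"queue_if_no_path" flag comes from this thread. The assumption is that
I/O arriving while no path is usable is held rather than failed, and is
released once a path is reinstated:

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_QUEUED 16

/* Toy multipath state; all names are illustrative only. */
struct toy_mpath {
    int usable_paths;            /* paths currently believed to be good */
    int queue_if_no_path;        /* 1 = hold I/O when no path, 0 = fail it */
    int queued[TOY_MAX_QUEUED];  /* ids of held I/Os */
    size_t nr_queued;
};

enum toy_result { TOY_DISPATCHED, TOY_QUEUED, TOY_FAILED };

/* Decide what happens to one incoming I/O. */
static enum toy_result toy_map_io(struct toy_mpath *m, int io_id)
{
    assert(m->nr_queued <= TOY_MAX_QUEUED);

    if (m->usable_paths > 0)
        return TOY_DISPATCHED;

    if (m->queue_if_no_path && m->nr_queued < TOY_MAX_QUEUED) {
        m->queued[m->nr_queued++] = io_id;  /* hold until a path returns */
        return TOY_QUEUED;
    }
    return TOY_FAILED;                      /* error goes up to the caller */
}

/* Called when the path checker reinstates a path.
 * Returns the number of held I/Os that are released for resubmission. */
static size_t toy_reinstate_path(struct toy_mpath *m)
{
    size_t released = m->nr_queued;

    m->usable_paths++;
    m->nr_queued = 0;
    return released;
}
```

Without the flag the same I/O would return TOY_FAILED, which corresponds
to the I/O errors seen by the stress test when all four paths fail fast.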

It is possible that to fully support them a dm mpath hardware handler
(like for the EMC CX family) might be required, too.

(For easier testing, you'll find that all this functionality is
available in the latest SLES9 SP2 betas, to which you ought to have
access at IBM, and the kernels are also available via
ftp://ftp.suse.com/pub/projects/kernel/kotd/.)

> scsi core could be changed to handle device specific decoding via sense
> tables that can be modified via sysfs, similar to devinfo code (well,
> devinfo still lacks a sysfs interface).

dm-mpath's capabilities go a bit beyond just the error decoding (which
for generic devices is also provided for in a generic
dm_scsi_err_handler()); for example you can code special initialization
commands and behaviour an array might need.

Maybe this could indeed be abstracted further to download the command
and/or specific decoding tables from user-space via sysfs or configfs by
a generic user-space customizable dm-hw-handler-generic.[ch] plugin; I
think patches are being accepted ;-)


Sincerely,
    Lars Marowsky-Brée <lmb@suse.de>

-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business


* Re: Re: fastfail operation and retries
  2005-04-21 19:54   ` Lars Marowsky-Bree
@ 2005-04-21 22:13     ` Patrick Mansfield
  2005-04-21 22:52       ` Lars Marowsky-Bree
  0 siblings, 1 reply; 7+ messages in thread
From: Patrick Mansfield @ 2005-04-21 22:13 UTC
  To: Lars Marowsky-Bree
  Cc: device-mapper development, Linux SCSI, Andreas Herrmann

On Thu, Apr 21, 2005 at 09:54:35PM +0200, Lars Marowsky-Bree wrote:

> > We need a patch like Mike Christie had, this:
> > 
> > http://marc.theaimsgroup.com/?l=linux-kernel&m=107961883914541&w=2
> > 
> > The scsi core should decode the sense data and pass up the result, then dm
> > need not decode sense data, and we don't need sense data passed around via
> > the block layer.
> 
> The most recent udm patchset has a patch by Jens Axboe and myself to
> pass up sense data / error codes in the bio so the dm mpath module can
> deal with it.  

But the scmd->result is not passed back.

If we passed it back there would be enough information available, but then
you still need to add the same decoding as already found in scsi core
(scsi_decide_disposition and more).

Better to decode the error once, and then pass that data back to the
blk layer.

> Only issue still is that the SCSI midlayer does only generate a single
> "EIO" code also for timeouts; however, that pretty much means it's a
> transport error, because if it was a media error, we'd be getting sense
> data ;-)

How does lack of sense data imply that there was no media/device error? A
timeout could be a failure anywhere, in the transport or because of
target/media/LUN problems. Or not a real error at all, just a busy device
or too short a timeout setting.

Currently scsi core does not fastfail timeouts ...

Does path checker take paths permanently offline after multiple failures?

If a timeout causes a path failure (means today that scsi core already
retried the command), and path checker re-enables the path (for example,
path checker can send a test unit ready with no failure; this also means
scsi core has already retried the command), this could lead to retrying
that IO (or even another IO) and hitting a timeout again on that path.

Also a SCSI failure (command made it to the media/device, but got some
error) can happen without sense data, like any SCSI errors other than
a CHECK_CONDITION that are not requeued by scsi core (see
scsi_decide_disposition switch cases for status_byte(scmd->result)).
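
To make the point about errors without sense data concrete, here is a
small stand-alone sketch of how the result word splits into its fields.
The shifts and masks follow the status_byte/msg_byte/host_byte/
driver_byte macros as I recall them from kernels of this era, so treat
the exact values as an assumption; the toy_ names are hypothetical and
exist only so the snippet compiles outside the kernel:

```c
#include <assert.h>

/* Assumed layout of the 32-bit result word:
 * bits 1..5 status (shifted), 8..15 message, 16..23 host, 24..31 driver. */
static unsigned int toy_status_byte(unsigned int result)
{
    return (result >> 1) & 0x1f;
}

static unsigned int toy_msg_byte(unsigned int result)
{
    return (result >> 8) & 0xff;
}

static unsigned int toy_host_byte(unsigned int result)
{
    return (result >> 16) & 0xff;
}

static unsigned int toy_driver_byte(unsigned int result)
{
    return (result >> 24) & 0xff;
}

#define TOY_DID_NO_CONNECT  0x01  /* host byte: could not reach the target */
#define TOY_CHECK_CONDITION 0x01  /* shifted status; raw SAM status is 0x02 */
#define TOY_BUSY            0x04  /* shifted status; raw SAM status is 0x08 */

/* Sense data can only be expected when the device returned CHECK CONDITION. */
static int toy_has_sense(unsigned int result)
{
    return toy_status_byte(result) == TOY_CHECK_CONDITION;
}
```

A command that reached the device and came back BUSY, or one that never
left the host (DID_NO_CONNECT), both fail toy_has_sense(), which is why
the absence of sense data says little about where the error happened.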

It's probably OK to just fail the path for all driver/transport errors
(and non-sense errors) even if they are retryable: path checker will just
re-enable the path (maybe immediately). But, we end up with different and
potentially significant behaviour for some error cases with/without fastfail.

So though I don't like the approach: distinguishing timeouts or ensuring
that path checker won't continually reenable a path might be good enough,
as long as there are no other error cases (driver or SCSI) that could lead
to long lasting failures.

> > scsi core could be changed to handle device specific decoding via sense
> > tables that can be modified via sysfs, similar to devinfo code (well,
> > devinfo still lacks a sysfs interface).
> 
> dm-mpath's capabilities go a bit beyond just the error decoding (which
> for generic devices is also provided for in a generic
> dm_scsi_err_handler()); for example you can code special initialization
> commands and behaviour an array might need.

Yes, but that doesn't mean we should decode SCSI sense or scsi core
errors (i.e. scmd->result) in dm space.

Also, non-scsi drivers would like to use dm multipath, like DASD. Using
extended blk errors allows simpler support for such devices and drivers.

-- Patrick Mansfield


* Re: Re: fastfail operation and retries
  2005-04-21 22:13     ` Patrick Mansfield
@ 2005-04-21 22:52       ` Lars Marowsky-Bree
  2005-04-22  0:22         ` Patrick Mansfield
  0 siblings, 1 reply; 7+ messages in thread
From: Lars Marowsky-Bree @ 2005-04-21 22:52 UTC
  To: Patrick Mansfield; +Cc: device-mapper development, Linux SCSI, Andreas Herrmann

On 2005-04-21T15:13:16, Patrick Mansfield <patmans@us.ibm.com> wrote:

> > The most recent udm patchset has a patch by Jens Axboe and myself to
> > pass up sense data / error codes in the bio so the dm mpath module can
> > deal with it.  
> But the scmd->result is not passed back.

Bear with me and my limited knowledge of the SCSI midlayer for a
second: What additional benefit would this provide over sense
key/asc/ascq & the error parameter in the bio end_io path?

> Better to decode the error once, and then pass that data back to the
> blk layer.

Decoding is device specific. So is the handling of path initialization
and others. I'd rather have this consolidated in one module, than have
parts of it in the mid-layer and other parts in the multipath code.

Could this be handled by a module in the mid-layer which receives
commands from the DM multipath layers above, and pass appropriate flags
back up? Probably. (I think this is what you're suggesting.) But
frankly, I prefer the current approach, which works. I don't see a real
benefit in your architecture, besides spreading things out further.

> > Only issue still is that the SCSI midlayer does only generate a single
> > "EIO" code also for timeouts; however, that pretty much means it's a
> > transport error, because if it was a media error, we'd be getting sense
> > data ;-)
> How does lack of sense data imply that there was no media/device error?

It does not always imply that. Note the "pretty much ... ;-)".

The one thing which could be improved here is that I'm not sure if an
EIO w/o sense data from the SCSI mid-layer always corresponds to a
timeout. Could we get EIO also for other errors?

However, as you correctly state later, it's pretty safe to treat such
errors as a "path error" and retry elsewhere, because if it was a false
failure, the path checker will reinstate soonish.

> timeout could be a failure anywhere, in the transport or because of
> target/media/LUN problems. Or not a real error at all, just a busy device
> or too short a timeout setting.

Well, the not real errors might benefit from the IO being retried on
another path though.

> Does path checker take paths permanently offline after multiple failures?

The path checker lives in user-space, and that's policy ;-) So, from the
kernel perspective, it doesn't matter. User-space currently does not
'permanently' fail paths, but it could be modified to do so if it goes
up/down at a too high rate, basically dampening for stability.  Patches
welcome.
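
One way such dampening could look, purely as a sketch: count failures
per path inside a time window and stop reinstating once a threshold is
reached. This is not multipathd code; the struct, the threshold and the
windowing scheme are all hypothetical assumptions made for illustration.

```c
#include <assert.h>

/* Hypothetical per-path flap history kept by a user-space checker. */
struct toy_path_history {
    int flaps;      /* failures seen in the current window */
    int max_flaps;  /* policy threshold before the path stays down */
};

/* Record one failure.
 * Returns 1 if the checker may still reinstate the path, 0 if the path
 * should stay failed until the window expires. */
static int toy_record_failure(struct toy_path_history *h)
{
    assert(h->max_flaps > 0);
    h->flaps++;
    return h->flaps < h->max_flaps;
}

/* Start a new observation window (called from a timer in a real daemon). */
static void toy_window_expired(struct toy_path_history *h)
{
    h->flaps = 0;
}
```

With max_flaps set to 3, the third failure inside one window keeps the
path down; after toy_window_expired() the path may be reinstated again.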

> So though I don't like the approach: distinguishing timeouts or ensuring
> that path checker won't continually reenable a path might be good enough,
> as long as there are no other error cases (driver or SCSI) that could lead
> to long lasting failures.

That's essentially what is being done. However, there's some more
special cases (like a storage array telling us that that service
processor is no longer active and we should switch not to another path
on the same, but to the other SP; which we model in dm-mpath via
different priority groups and causing a PG switch), and some errors
translate to errors being immediately propagated upwards (media error,
illegal request, data protect and some others; again, this might include
specific handling based on the storage being addressed), because for
these retrying on another path (or switching service processors) doesn't
make any sense or might be even harmful.
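
The three outcomes described above can be sketched as a small classifier.
This is a simplified illustration and not the dm-mpath error handler:
the sense key values are the standard SPC ones, but the function, the
enum and the sp_inactive flag (standing in for array-specific decoding
of "this service processor is no longer active") are hypothetical.

```c
#include <assert.h>

/* Standard SCSI sense keys (SPC). */
#define SK_NOT_READY       0x02
#define SK_MEDIUM_ERROR    0x03
#define SK_ILLEGAL_REQUEST 0x05
#define SK_UNIT_ATTENTION  0x06
#define SK_DATA_PROTECT    0x07

enum toy_action {
    TOY_RETRY_OTHER_PATH,  /* treat as path error, retry elsewhere */
    TOY_SWITCH_PG,         /* switch to the other priority group */
    TOY_PROPAGATE          /* hand the error straight up */
};

/* sp_inactive is a hypothetical result of hardware-specific decoding. */
static enum toy_action toy_classify(int have_sense, unsigned char sense_key,
                                    int sp_inactive)
{
    if (!have_sense)
        return TOY_RETRY_OTHER_PATH;  /* no sense: assume a path problem */

    if (sp_inactive)
        return TOY_SWITCH_PG;         /* other paths to the same SP won't help */

    switch (sense_key) {
    case SK_MEDIUM_ERROR:
    case SK_ILLEGAL_REQUEST:
    case SK_DATA_PROTECT:
        return TOY_PROPAGATE;         /* retrying elsewhere is pointless */
    default:
        return TOY_RETRY_OTHER_PATH;
    }
}
```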

> Yes, but that doesn't mean we should decode SCSI sense or scsi core error
> errors (i.e. scmd->result) in dm space.

This happens in the SCSI layer; dm-mpath only sees already 'decoded'
sense key/asc/ascq.

> Also, non-scsi drivers would like to use dm multipath, like DASD. Using
> extended blk errors allows simpler support for such devices and drivers.

Sure. The bi_error field introduced by Axboe's patch has flags detailing
what kind of error information is available - it's either ERRNO
(basically, the current "error"), SENSE (for certain scsi requests,
where sense is available), and could be extended to include a DASD
class, and then be complemented by a dm-dasd module for hw-specific
handling for any other specific needs they might have.

Can you sketch/summarize your suggested design in more detail? That
would be helpful for me, because I missed parts of the earlier
discussion.

Sincerely,
    Lars Marowsky-Brée <lmb@suse.de>

-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business


* Re: Re: fastfail operation and retries
  2005-04-21 22:52       ` Lars Marowsky-Bree
@ 2005-04-22  0:22         ` Patrick Mansfield
  0 siblings, 0 replies; 7+ messages in thread
From: Patrick Mansfield @ 2005-04-22  0:22 UTC
  To: Lars Marowsky-Bree
  Cc: device-mapper development, Linux SCSI, Andreas Herrmann

On Fri, Apr 22, 2005 at 12:52:56AM +0200, Lars Marowsky-Bree wrote:
> On 2005-04-21T15:13:16, Patrick Mansfield <patmans@us.ibm.com> wrote:
> 
> > > The most recent udm patchset has a patch by Jens Axboe and myself to
> > > pass up sense data / error codes in the bio so the dm mpath module can
> > > deal with it.  
> > But the scmd->result is not passed back.
> 
> Bear with me and my limited knowledge of the SCSI midlayer for a
> second: What additional benefit would this provide over sense
> key/asc/ascq & the error parameter in the bio end_io path?

So we can mark a path in dm as failed or not; then dm won't mark a path
failed for driver, transport, or other retryable errors.

As noted, this might not lead to user-visible effects, since retryable
errors _should_ end up failing and then re-enabling the path, but that
could lead to problems. But the code paths will be cleaner.

> > Better to decode the error once, and then pass that data back to the
> > blk layer.
> 
> Decoding is device specific. So is the handling of path initialization
> and others. I'd rather have this consolidated in one module, than have
> parts of it in the mid-layer and other parts in the multipath code.

Me too, but I'm arguing to decode them in scsi core.

> Could this be handled by a module in the mid-layer which receives
> commands from the DM multipath layers above, and pass appropriate flags
> back up? Probably. (I think this is what you're suggesting.) But
> frankly, I prefer the current approach, which works. I don't see a real
> benefit in your architecture, besides spreading things out further.

scsi core (I don't like calling it midlayer) could have a module or such.

The same decoding that is being put into dm hardware modules should
also be in scsi core. That is, when running such hardware without dm
multipath (single pathed or just stupidly) we still want the decoding of
the sense data, especially for retryable errors.

> The one thing which could be improved here is that I'm not sure if an
> EIO w/o sense data from the SCSI mid-layer always corresponds to a
> timeout. Could we get EIO also for other errors?

You should be getting EIO for all IO failures, timeout or not. For example
a cable pull returns DID_NO_CONNECT (for at least qlogic, and maybe for
emulex), it's decoded in scsi_decide_disposition, and scsi core calls
scsi_end_request(x, 0, x, x), and then calls end_that_request_chunk(x, 0,
x) and that sets error to EIO.

> However, as you correctly state later, it's pretty safe to treat such
> errors as a "path error" and retry elsewhere, because if it was a false
> failure, the path checker will reinstate soonish.
> 
> > timeout could be a failure anywhere, in the transport or because of
> > target/media/LUN problems. Or not a real error at all, just a busy device
> > or too short a timeout setting.
> 
> Well, the not real errors might benefit from the IO being retried on
> another path though.

Yes.

> > Does path checker take paths permanently offline after multiple failures?
> 
> The path checker lives in user-space, and that's policy ;-) So, from the
> kernel perspective, it doesn't matter. User-space currently does not
> 'permanently' fail paths, but it could be modified to do so if it goes
> up/down at a too high rate, basically dampening for stability.  Patches
> welcome.
> 
> > So though I don't like the approach: distinguishing timeouts or ensuring
> > that path checker won't continually reenable a path might be good enough,
> > as long as there are no other error cases (driver or SCSI) that could lead
> > to long lasting failures.
> 
> That's essentially what is being done. However, there's some more
> special cases (like a storage array telling us that that service
> processor is no longer active and we should switch not to another path
> on the same, but to the other SP; which we model in dm-mpath via
> different priority groups and causing a PG switch), and some errors
> translate to errors being immediately propagated upwards (media error,
> illegal request, data protect and some others; again, this might include
> specific handling based on the storage being addressed), because for
> these retrying on another path (or switching service processors) doesn't
> make any sense or might be even harmful.

Yes ... I'm familiar with such hardware.

> > Yes, but that doesn't mean we should decode SCSI sense or scsi core error
> > errors (i.e. scmd->result) in dm space.
> 
> This happens in the SCSI layer; dm-mpath only sees already 'decoded'
> sense key/asc/ascq.

But that data is not decoded, dm has to look at the sense value etc. Some
of that must overlap with the code in scsi core.

> > Also, non-scsi drivers would like to use dm multipath, like DASD. Using
> > extended blk errors allows simpler support for such devices and drivers.
> 
> Sure. The bi_error field introduced by Axboe's patch has flags detailing
> what kind of error information is available - it's either ERRNO
> (basically, the current "error"), SENSE (for certain scsi requests,
> where sense is available), and could be extended to include a DASD
> class, and then be complemented by a dm-dasd module for hw-specific
> handling for any other specific needs they might have.
> 
> Can you sketch/summarize your suggested design in more detail? That
> would be helpful for me, because I missed parts of the earlier
> discussion.

I can try forward porting Mike C's patch ... what we need on top of the
bi_error is to pass back the bi_error when calling end_that_request_first,
instead of a boolean 0/1 for uptodate, pass a BIO_ERROR_xxx. And set
bio->bi_error and/or just pass it back in bio_endio().

The errors could be:
	
	BIO_SUCCESS = 0,
	BIO_ERROR_ERR,
	BIO_ERROR_RETRY,
	BIO_ERROR_DEV_FAILURE,
	BIO_ERROR_DEV_RETRY,
	BIO_ERROR_TRNSPT_FAILURE,
	BIO_ERROR_TRNSPT_RETRY,
	BIO_ERROR_TIMEOUT,

And maybe (for non-failure failover case, when an SP is no longer active)
a BIO_ERROR_TRNSPT_INACTIVE or ?.

These somewhat match Mike C's values, he had:

+	BLK_SUCCESS,
+	BLK_ERR,		/* Generic error like -EIO */
+	BLK_FATAL_DEV,		/* Fatal device error */
+	BLK_FATAL_TRNSPT,	/* Fatal transport error */
+	BLK_FATAL_DRV,		/* Fatal driver error */
+	BLK_RETRY_DEV,		/* Device error, I/O may be retried */
+	BLK_RETRY_TRNSPT,	/* Transport error, I/O may be retried */
+	BLK_RETRY_DRV,		/* Driver error, I/O may be retried */

AFAICT, the only need for a _DRV as in Mike's patch was to handle the
-EWOULDBLOCK (can't find this in current source though ..), so we might
need only a BIO_ERROR_RETRY.

And then in dm:

	BIO_SUCCESS:		complete IO with no failure
	BIO_ERROR_RETRY: 	never makes it to dm
	BIO_ERROR_ERR:		treat as a failed IO
	BIO_ERROR_DEV_FAILURE:	failed IO
	BIO_ERROR_DEV_RETRY:	retry on any path
	BIO_ERROR_TRNSPT_FAILURE:	fail path
	BIO_ERROR_TRNSPT_RETRY:	retry on any path
	BIO_ERROR_TIMEOUT:	hard to handle, needs to retry on another
				path, but mark this path as potentially
				failing. For now, it could just fail the
				path (then we are in the same situation as
				today).
	BIO_ERROR_TRNSPT_INACTIVE:	failover ...

Non-dm users treat non BIO_SUCCESS results as IO failures (we should not
return retry errors unless fast fail is set).

"retry on any path" would normally use a different path if one is
available.
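
The table above can be sketched as a single switch. This is only an
illustration of the proposal, not existing code: the BIO_ERROR_xxx names
are taken from the list in this mail, while the dm_action enum and both
functions are hypothetical. Mapping BIO_ERROR_RETRY to a failed IO is an
added defensive assumption, since that code is meant never to reach dm.

```c
#include <assert.h>

/* Proposed extended block-layer result codes (names from the list above). */
enum bio_error {
    BIO_SUCCESS = 0,
    BIO_ERROR_ERR,
    BIO_ERROR_RETRY,
    BIO_ERROR_DEV_FAILURE,
    BIO_ERROR_DEV_RETRY,
    BIO_ERROR_TRNSPT_FAILURE,
    BIO_ERROR_TRNSPT_RETRY,
    BIO_ERROR_TIMEOUT,
    BIO_ERROR_TRNSPT_INACTIVE
};

/* Hypothetical actions a multipath target could take. */
enum dm_action {
    DM_COMPLETE_OK,     /* complete IO with no failure */
    DM_FAIL_IO,         /* fail the IO upwards */
    DM_RETRY_ANY_PATH,  /* retry, preferably on a different path */
    DM_FAIL_PATH,       /* mark this path failed */
    DM_FAILOVER         /* e.g. switch to the other service processor */
};

static enum dm_action dm_action_for(enum bio_error err)
{
    switch (err) {
    case BIO_SUCCESS:
        return DM_COMPLETE_OK;
    case BIO_ERROR_ERR:
    case BIO_ERROR_DEV_FAILURE:
        return DM_FAIL_IO;
    case BIO_ERROR_DEV_RETRY:
    case BIO_ERROR_TRNSPT_RETRY:
        return DM_RETRY_ANY_PATH;
    case BIO_ERROR_TRNSPT_FAILURE:
    case BIO_ERROR_TIMEOUT:           /* for now: same as today, fail the path */
        return DM_FAIL_PATH;
    case BIO_ERROR_TRNSPT_INACTIVE:
        return DM_FAILOVER;
    case BIO_ERROR_RETRY:             /* should never reach dm (assumption) */
        return DM_FAIL_IO;
    }
    return DM_FAIL_IO;                /* unknown code: be conservative */
}

/* Non-dm users treat anything but success as an IO failure. */
static int non_dm_is_failure(enum bio_error err)
{
    return err != BIO_SUCCESS;
}
```

Keeping the decision in one switch means a new code such as
BIO_ERROR_TRNSPT_INACTIVE only needs one extra case in dm, while non-dm
callers keep working unchanged through the success/failure test.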

We still need a scsi vendor specific decoder, I'd volunteered to do that
work before but there were no responses last time I brought it up (on
linux-scsi).

-- Patrick Mansfield

