* fastfail operation and retries
@ 2005-04-19 17:19 Andreas Herrmann
2005-04-21 16:42 ` Patrick Mansfield
0 siblings, 1 reply; 17+ messages in thread
From: Andreas Herrmann @ 2005-04-19 17:19 UTC (permalink / raw)
To: Linux SCSI
Hi,
I have question(s) regarding the fastfail operation of the SCSI stack.
Performing multipath tests with an IBM ESS, I encountered problems.
During certain operations on an ESS (quiesce/resume and such) requests
on all paths fail temporarily with a data underrun (resid is set in
the FCP response). In another situation abort sequences happen (see
FC-FS).
In both cases it is not a path failure; the device (ESS) reports
error conditions temporarily (for some seconds).
Now on an error on the first path the multipath layer initiates failover
to other available path(s), where requests will immediately fail.
Using linux-2.4 and LVM such problems did not occur. There were
enough retries (5 for each path) to handle such situations.
Now if the FASTFAIL flag is set, the SCSI stack prevents retries for
failed SCSI commands.
The problem is that the multipath layer cannot distinguish between path
and device failures (and won't do any retries for the failed request
on the same path anyway).
How can an LLD force the SCSI stack to retry a failed SCSI command
(without using DID_REQUEUE or DID_IMM_RETRY, neither of which changes
the retry counter)?
What about a DID_FORCE_RETRY? Or is there any outlook on when there
will be a better interface between the SCSI stack and the multipath
layer to properly handle retries?
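As an illustrative sketch (editor's mock, not kernel code; every name here is invented), the retry-accounting asymmetry described above can be pictured like this: an ordinary retry consumes the command's retry budget and eventually fails it, while a DID_REQUEUE/DID_IMM_RETRY style requeue leaves the counter untouched:

```c
/* Editor's mock of the retry-accounting asymmetry: ordinary retries
 * consume the budget (cf. scmd->retries vs. scmd->allowed), while a
 * DID_REQUEUE-style requeue does no accounting at all. Illustrative
 * names only, not kernel code. */
#include <assert.h>

struct mock_cmd {
    int retries;  /* retries used so far */
    int allowed;  /* retry budget */
};

enum verdict { V_RETRY, V_REQUEUE, V_FAIL };

/* Ordinary retry path: bumps the counter, fails once exhausted. */
static enum verdict mock_retry(struct mock_cmd *c)
{
    if (c->retries >= c->allowed)
        return V_FAIL;
    c->retries++;
    return V_RETRY;
}

/* DID_REQUEUE-style requeue: no accounting, so the stack cannot tell
 * how often the command has already been resent. */
static enum verdict mock_requeue(struct mock_cmd *c)
{
    (void)c;
    return V_REQUEUE;
}
```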
Regards,
Andreas
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: fastfail operation and retries
2005-04-19 17:19 fastfail operation and retries Andreas Herrmann
@ 2005-04-21 16:42 ` Patrick Mansfield
2005-04-21 19:54 ` Lars Marowsky-Bree
0 siblings, 1 reply; 17+ messages in thread
From: Patrick Mansfield @ 2005-04-21 16:42 UTC (permalink / raw)
To: Andreas Herrmann; +Cc: Linux SCSI, dm-devel
On Tue, Apr 19, 2005 at 07:19:53PM +0200, Andreas Herrmann wrote:
> Hi,
>
> I have question(s) regarding the fastfail operation of the SCSI stack.
>
> Performing multipath tests with an IBM ESS, I encountered problems.
> During certain operations on an ESS (quiesce/resume and such) requests
> on all paths fail temporarily with a data underrun (resid is set in
> the FCP response). In another situation abort sequences happen (see
> FC-FS).
>
> In both cases it is not a path failure; the device (ESS) reports
> error conditions temporarily (for some seconds).
>
> Now on an error on the first path the multipath layer initiates failover
> to other available path(s), where requests will immediately fail.
>
> Using linux-2.4 and LVM such problems did not occur. There were
> enough retries (5 for each path) to handle such situations.
>
> Now if the FASTFAIL flag is set, the SCSI stack prevents retries for
> failed SCSI commands.
>
> The problem is that the multipath layer cannot distinguish between path
> and device failures (and won't do any retries for the failed request
> on the same path anyway).
>
> How can an LLD force the SCSI stack to retry a failed SCSI command
> (without using DID_REQUEUE or DID_IMM_RETRY, neither of which changes
> the retry counter)?
>
> What about a DID_FORCE_RETRY? Or is there any outlook on when there
> will be a better interface between the SCSI stack and the multipath
> layer to properly handle retries?
We need a patch like the one Mike Christie had:
http://marc.theaimsgroup.com/?l=linux-kernel&m=107961883914541&w=2
The scsi core should decode the sense data and pass up the result; then dm
need not decode sense data, and we don't need sense data passed around via
the block layer.
scsi core could be changed to handle device specific decoding via sense
tables that can be modified via sysfs, similar to devinfo code (well,
devinfo still lacks a sysfs interface).
For ESS, you probably also need the BLIST_RETRY_HWERROR flag that is in
the current 2.6.12-rc.
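As an editor's sketch, a devinfo-style flag lookup can be pictured as a static vendor/model table consulted at scan time. The bit value and the `lookup_flags` helper below are invented for illustration (not copied from drivers/scsi/scsi_devinfo.c); the vendor/model pair shown is the ESS's usual identity (IBM 2105), but verify against the real table:

```c
/* Editor's sketch of a devinfo-style blacklist lookup; flag value and
 * helper are illustrative, not kernel source. */
#include <assert.h>
#include <string.h>

#define BLIST_RETRY_HWERROR 0x400000u  /* illustrative bit value */

struct devinfo_entry {
    const char *vendor;
    const char *model;
    unsigned int flags;
};

/* The ESS identifies as vendor "IBM", model "2105" (check the real table). */
static const struct devinfo_entry devinfo_table[] = {
    { "IBM", "2105", BLIST_RETRY_HWERROR },
};

static unsigned int lookup_flags(const char *vendor, const char *model)
{
    size_t i;
    for (i = 0; i < sizeof(devinfo_table) / sizeof(devinfo_table[0]); i++)
        if (!strcmp(devinfo_table[i].vendor, vendor) &&
            !strcmp(devinfo_table[i].model, model))
            return devinfo_table[i].flags;
    return 0;  /* no entry: no special behaviour */
}
```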
-- Patrick Mansfield
* Re: Re: fastfail operation and retries
2005-04-21 16:42 ` Patrick Mansfield
@ 2005-04-21 19:54 ` Lars Marowsky-Bree
2005-04-21 22:13 ` Patrick Mansfield
0 siblings, 1 reply; 17+ messages in thread
From: Lars Marowsky-Bree @ 2005-04-21 19:54 UTC (permalink / raw)
To: device-mapper development, Andreas Herrmann; +Cc: Linux SCSI
On 2005-04-21T09:42:05, Patrick Mansfield <patmans@us.ibm.com> wrote:
> On Tue, Apr 19, 2005 at 07:19:53PM +0200, Andreas Herrmann wrote:
> > Hi,
> >
> > I have question(s) regarding the fastfail operation of the SCSI stack.
> >
> > Performing multipath tests with an IBM ESS, I encountered problems.
> > During certain operations on an ESS (quiesce/resume and such) requests
> > on all paths fail temporarily with a data underrun (resid is set in
> > the FCP response). In another situation abort sequences happen (see
> > FC-FS).
> >
> > In both cases it is not a path failure; the device (ESS) reports
> > error conditions temporarily (for some seconds).
> >
> > Now on an error on the first path the multipath layer initiates failover
> > to other available path(s), where requests will immediately fail.
> >
> > Using linux-2.4 and LVM such problems did not occur. There were
> > enough retries (5 for each path) to handle such situations.
> >
> > Now if the FASTFAIL flag is set, the SCSI stack prevents retries for
> > failed SCSI commands.
> >
> > The problem is that the multipath layer cannot distinguish between path
> > and device failures (and won't do any retries for the failed request
> > on the same path anyway).
> >
> > How can an LLD force the SCSI stack to retry a failed SCSI command
> > (without using DID_REQUEUE or DID_IMM_RETRY, neither of which changes
> > the retry counter)?
> >
> > What about a DID_FORCE_RETRY? Or is there any outlook on when there
> > will be a better interface between the SCSI stack and the multipath
> > layer to properly handle retries?
>
> We need a patch like the one Mike Christie had:
>
> http://marc.theaimsgroup.com/?l=linux-kernel&m=107961883914541&w=2
>
> The scsi core should decode the sense data and pass up the result; then dm
> need not decode sense data, and we don't need sense data passed around via
> the block layer.
The most recent udm patchset has a patch by Jens Axboe and myself to
pass up sense data / error codes in the bio so the dm mpath module can
deal with it.
The only remaining issue is that the SCSI midlayer generates only a single
"EIO" code, even for timeouts; however, that pretty much means it's a
transport error, because if it were a media error, we'd be getting sense
data ;-)
Together with the "queue_if_no_path" feature flag for dm-mpath that
should do what you need to handle this (arguably broken) array
behaviour: It'll queue until the error goes away and multipathd retests
and reactivates the paths. That ought to work, but given that I don't
have an IBM ESS accessible, please confirm that.
It is possible that to fully support them a dm mpath hardware handler
(like for the EMC CX family) might be required, too.
(For easier testing, you'll find that all this functionality is
available in the latest SLES9 SP2 betas, to which you ought to have
access at IBM, and the kernels are also available via
ftp://ftp.suse.com/pub/projects/kernel/kotd/.)
> scsi core could be changed to handle device specific decoding via sense
> tables that can be modified via sysfs, similar to devinfo code (well,
> devinfo still lacks a sysfs interface).
dm-mpath's capabilities go a bit beyond just the error decoding (which
for generic devices is also provided for in a generic
dm_scsi_err_handler()); for example you can code special initialization
commands and behaviour an array might need.
Maybe this could indeed be abstracted further to download the command
and/or specific decoding tables from user-space via sysfs or configfs by
a generic user-space customizable dm-hw-handler-generic.[ch] plugin; I
think patches are being accepted ;-)
Sincerely,
Lars Marowsky-Brée <lmb@suse.de>
--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business
* Re: Re: fastfail operation and retries
2005-04-21 19:54 ` Lars Marowsky-Bree
@ 2005-04-21 22:13 ` Patrick Mansfield
2005-04-21 22:52 ` Lars Marowsky-Bree
0 siblings, 1 reply; 17+ messages in thread
From: Patrick Mansfield @ 2005-04-21 22:13 UTC (permalink / raw)
To: Lars Marowsky-Bree
Cc: device-mapper development, Linux SCSI, Andreas Herrmann
On Thu, Apr 21, 2005 at 09:54:35PM +0200, Lars Marowsky-Bree wrote:
> > We need a patch like the one Mike Christie had:
> >
> > http://marc.theaimsgroup.com/?l=linux-kernel&m=107961883914541&w=2
> >
> > The scsi core should decode the sense data and pass up the result; then dm
> > need not decode sense data, and we don't need sense data passed around via
> > the block layer.
>
> The most recent udm patchset has a patch by Jens Axboe and myself to
> pass up sense data / error codes in the bio so the dm mpath module can
> deal with it.
But the scmd->result is not passed back.
If we passed it back, there would be enough information available, but then
you would still need to add the same decoding as is already found in scsi
core (scsi_decide_disposition and more).
Better to decode the error once, and then pass that data back to the
blk layer.
> The only remaining issue is that the SCSI midlayer generates only a single
> "EIO" code, even for timeouts; however, that pretty much means it's a
> transport error, because if it were a media error, we'd be getting sense
> data ;-)
How does lack of sense data imply that there was no media/device error? A
timeout could be a failure anywhere, in the transport or because of
target/media/LUN problems. Or not a real error at all, just a busy device
or too short a timeout setting.
Currently scsi core does not fastfail timeouts ...
Does path checker take paths permanently offline after multiple failures?
If a timeout causes a path failure (which today means scsi core has already
retried the command), and the path checker re-enables the path (for example,
the path checker can send a TEST UNIT READY that does not fail; this also
means scsi core has already retried the command), this could lead to retrying
that IO (or even another IO) and hitting a timeout again on that path.
Also, a SCSI failure (the command made it to the media/device but got some
error) can happen without sense data, like any SCSI error other than a
CHECK_CONDITION that is not requeued by scsi core (see the
scsi_decide_disposition switch cases for status_byte(scmd->result)).
It's probably OK to just fail the path for all driver/transport errors
(and non-sense errors) even if they are retryable: the path checker will just
re-enable the path (maybe immediately). But we end up with potentially
significant behavioural differences for some error cases with/without fastfail.
So though I don't like the approach: distinguishing timeouts or ensuring
that path checker won't continually reenable a path might be good enough,
as long as there are no other error cases (driver or SCSI) that could lead
to long lasting failures.
> > scsi core could be changed to handle device specific decoding via sense
> > tables that can be modified via sysfs, similar to devinfo code (well,
> > devinfo still lacks a sysfs interface).
>
> dm-mpath's capabilities go a bit beyond just the error decoding (which
> for generic devices is also provided for in a generic
> dm_scsi_err_handler()); for example you can code special initialization
> commands and behaviour an array might need.
Yes, but that doesn't mean we should decode SCSI sense or scsi core error
errors (i.e. scmd->result) in dm space.
Also, non-scsi drivers would like to use dm multipath, like DASD. Using
extended blk errors allows simpler support for such devices and drivers.
-- Patrick Mansfield
* Re: Re: fastfail operation and retries
2005-04-21 22:13 ` Patrick Mansfield
@ 2005-04-21 22:52 ` Lars Marowsky-Bree
2005-04-22 0:22 ` Patrick Mansfield
0 siblings, 1 reply; 17+ messages in thread
From: Lars Marowsky-Bree @ 2005-04-21 22:52 UTC (permalink / raw)
To: Patrick Mansfield; +Cc: device-mapper development, Linux SCSI, Andreas Herrmann
On 2005-04-21T15:13:16, Patrick Mansfield <patmans@us.ibm.com> wrote:
> > The most recent udm patchset has a patch by Jens Axboe and myself to
> > pass up sense data / error codes in the bio so the dm mpath module can
> > deal with it.
> But the scmd->result is not passed back.
Bear with me and my limited knowledge of the SCSI midlayer for a
second: What additional benefit would this provide over sense
key/asc/ascq & the error parameter in the bio end_io path?
> Better to decode the error once, and then pass that data back to the
> blk layer.
Decoding is device specific. So is the handling of path initialization
and others. I'd rather have this consolidated in one module, than have
parts of it in the mid-layer and other parts in the multipath code.
Could this be handled by a module in the mid-layer which receives
commands from the DM multipath layers above and passes appropriate flags
back up? Probably. (I think this is what you're suggesting.) But
frankly, I prefer the current approach, which works. I don't see a real
benefit in your architecture, besides spreading things out further.
> > The only remaining issue is that the SCSI midlayer generates only a single
> > "EIO" code, even for timeouts; however, that pretty much means it's a
> > transport error, because if it were a media error, we'd be getting sense
> > data ;-)
> How does lack of sense data imply that there was no media/device error?
It does not always imply that. Note the "pretty much ... ;-)".
The one thing which could be improved here is that I'm not sure if an
EIO w/o sense data from the SCSI mid-layer always corresponds to a
timeout. Could we get EIO also for other errors?
However, as you correctly state later, it's pretty safe to treat such
errors as a "path error" and retry elsewhere, because if it was a false
failure, the path checker will reinstate soonish.
> timeout could be a failure anywhere, in the transport or because of
> target/media/LUN problems. Or not a real error at all, just a busy device
> or too short a timeout setting.
Well, the not-real errors might benefit from the IO being retried on
another path, though.
> Does path checker take paths permanently offline after multiple failures?
The path checker lives in user-space, and that's policy ;-) So, from the
kernel perspective, it doesn't matter. User-space currently does not
'permanently' fail paths, but it could be modified to do so if it goes
up/down at a too high rate, basically dampening for stability. Patches
welcome.
> So though I don't like the approach: distinguishing timeouts or ensuring
> that path checker won't continually reenable a path might be good enough,
> as long as there are no other error cases (driver or SCSI) that could lead
> to long lasting failures.
That's essentially what is being done. However, there are some more
special cases (like a storage array telling us that the service
processor is no longer active and we should switch not to another path
on the same SP, but to the other SP; which we model in dm-mpath via
different priority groups and by causing a PG switch), and some errors
translate to errors being immediately propagated upwards (media error,
illegal request, data protect and some others; again, this might include
specific handling based on the storage being addressed), because for
these, retrying on another path (or switching service processors) doesn't
make any sense or might even be harmful.
> Yes, but that doesn't mean we should decode SCSI sense or scsi core error
> errors (i.e. scmd->result) in dm space.
This happens in the SCSI layer; dm-mpath only sees already 'decoded'
sense key/asc/ascq.
> Also, non-scsi drivers would like to use dm multipath, like DASD. Using
> extended blk errors allows simpler support for such devices and drivers.
Sure. The bi_error field introduced by Axboe's patch has flags detailing
what kind of error information is available - either ERRNO
(basically, the current "error") or SENSE (for certain scsi requests,
where sense is available) - and could be extended to include a DASD
class, and then be complemented by a dm-dasd module for hw-specific
handling of any other specific needs they might have.
Can you sketch/summarize your suggested design in more detail? That
would be helpful for me, because I missed parts of the earlier
discussion.
Sincerely,
Lars Marowsky-Brée <lmb@suse.de>
--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business
* Re: Re: fastfail operation and retries
2005-04-21 22:52 ` Lars Marowsky-Bree
@ 2005-04-22 0:22 ` Patrick Mansfield
0 siblings, 0 replies; 17+ messages in thread
From: Patrick Mansfield @ 2005-04-22 0:22 UTC (permalink / raw)
To: Lars Marowsky-Bree
Cc: device-mapper development, Linux SCSI, Andreas Herrmann
On Fri, Apr 22, 2005 at 12:52:56AM +0200, Lars Marowsky-Bree wrote:
> On 2005-04-21T15:13:16, Patrick Mansfield <patmans@us.ibm.com> wrote:
>
> > > The most recent udm patchset has a patch by Jens Axboe and myself to
> > > pass up sense data / error codes in the bio so the dm mpath module can
> > > deal with it.
> > But the scmd->result is not passed back.
>
> Bear with me and my limited knowledge of the SCSI midlayer for a
> second: What additional benefit would this provide over sense
> key/asc/ascq & the error parameter in the bio end_io path?
So we can mark a path in dm as failed or not; then dm won't mark a path
failed for driver, transport, or other retryable errors.
As noted, this might not lead to user-visible effects, since retryable
errors _should_ end up failing and then re-enabling the path, but it
could lead to problems. Either way, the code paths will be cleaner.
> > Better to decode the error once, and then pass that data back to the
> > blk layer.
>
> Decoding is device specific. So is the handling of path initialization
> and others. I'd rather have this consolidated in one module, than have
> parts of it in the mid-layer and other parts in the multipath code.
Me too, but I'm arguing to decode them in scsi core.
> Could this be handled by a module in the mid-layer which receives
> commands from the DM multipath layers above and passes appropriate flags
> back up? Probably. (I think this is what you're suggesting.) But
> frankly, I prefer the current approach, which works. I don't see a real
> benefit in your architecture, besides spreading things out further.
scsi core (I don't like calling it midlayer) could have a module or such.
The same decoding that is being put into dm hardware modules should
also be in scsi core. That is, when running such hardware without dm
multipath (single pathed or just stupidly) we still want the decoding of
the sense data, especially for retryable errors.
> The one thing which could be improved here is that I'm not sure if an
> EIO w/o sense data from the SCSI mid-layer always corresponds to a
> timeout. Could we get EIO also for other errors?
You should be getting EIO for all IO failures, timeout or not. For example,
a cable pull returns DID_NO_CONNECT (for at least qlogic, and maybe for
emulex); it's decoded in scsi_decide_disposition, and scsi core calls
scsi_end_request(x, 0, x, x), which then calls end_that_request_chunk(x, 0,
x), and that sets the error to EIO.
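The collapse described here can be mocked in a few lines (editor's sketch; the constants are invented stand-ins, not the real DID_* values from scsi.h): distinct causes all surface to the block layer as the same bare -EIO.

```c
/* Editor's mock of the error collapse described above: a cable pull
 * (DID_NO_CONNECT), a timeout, and any other host-byte failure all
 * reach the block layer as one undifferentiated -EIO. Constants are
 * invented for the sketch. */
#include <assert.h>

enum { MOCK_DID_OK, MOCK_DID_NO_CONNECT, MOCK_DID_TIME_OUT };

#define MOCK_EIO 5

/* Stand-in for the scsi_end_request -> end_that_request_chunk path. */
static int mock_blk_error(int host_byte)
{
    if (host_byte == MOCK_DID_OK)
        return 0;
    /* cable pull, timeout, anything else: one undifferentiated code */
    return -MOCK_EIO;
}
```

The point of the sketch is that the caller cannot tell the two failure causes apart from the return value alone.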
> However, as you correctly state later, it's pretty safe to treat such
> errors as a "path error" and retry elsewhere, because if it was a false
> failure, the path checker will reinstate soonish.
>
> > timeout could be a failure anywhere, in the transport or because of
> > target/media/LUN problems. Or not a real error at all, just a busy device
> > or too short a timeout setting.
>
> Well, the not-real errors might benefit from the IO being retried on
> another path, though.
Yes.
> > Does path checker take paths permanently offline after multiple failures?
>
> The path checker lives in user-space, and that's policy ;-) So, from the
> kernel perspective, it doesn't matter. User-space currently does not
> 'permanently' fail paths, but it could be modified to do so if it goes
> up/down at a too high rate, basically dampening for stability. Patches
> welcome.
>
> > So though I don't like the approach: distinguishing timeouts or ensuring
> > that path checker won't continually reenable a path might be good enough,
> > as long as there are no other error cases (driver or SCSI) that could lead
> > to long lasting failures.
>
> That's essentially what is being done. However, there are some more
> special cases (like a storage array telling us that the service
> processor is no longer active and we should switch not to another path
> on the same SP, but to the other SP; which we model in dm-mpath via
> different priority groups and by causing a PG switch), and some errors
> translate to errors being immediately propagated upwards (media error,
> illegal request, data protect and some others; again, this might include
> specific handling based on the storage being addressed), because for
> these, retrying on another path (or switching service processors) doesn't
> make any sense or might even be harmful.
Yes ... I'm familiar with such hardware.
> > Yes, but that doesn't mean we should decode SCSI sense or scsi core error
> > errors (i.e. scmd->result) in dm space.
>
> This happens in the SCSI layer; dm-mpath only sees already 'decoded'
> sense key/asc/ascq.
But that data is not decoded; dm has to look at the sense values etc. Some
of that must overlap with the code in scsi core.
> > Also, non-scsi drivers would like to use dm multipath, like DASD. Using
> > extended blk errors allows simpler support for such devices and drivers.
>
> Sure. The bi_error field introduced by Axboe's patch has flags detailing
> what kind of error information is available - either ERRNO
> (basically, the current "error") or SENSE (for certain scsi requests,
> where sense is available) - and could be extended to include a DASD
> class, and then be complemented by a dm-dasd module for hw-specific
> handling of any other specific needs they might have.
>
> Can you sketch/summarize your suggested design in more detail? That
> would be helpful for me, because I missed parts of the earlier
> discussion.
I can try forward porting Mike C's patch ... what we need on top of
bi_error is to pass back the bi_error when calling end_that_request_first:
instead of a boolean 0/1 for uptodate, pass a BIO_ERROR_xxx. And set
bio->bi_error and/or just pass it back in bio_endio().
The errors could be:
BIO_SUCCESS = 0,
BIO_ERROR_ERR,
BIO_ERROR_RETRY,
BIO_ERROR_DEV_FAILURE,
BIO_ERROR_DEV_RETRY,
BIO_ERROR_TRNSPT_FAILURE,
BIO_ERROR_TRNSPT_RETRY,
BIO_ERROR_TIMEOUT,
And maybe, for the non-failure failover case (when an SP is no longer
active), a BIO_ERROR_TRNSPT_INACTIVE or similar.
These somewhat match Mike C's values; he had:
+ BLK_SUCCESS,
+ BLK_ERR, /* Generic error like -EIO */
+ BLK_FATAL_DEV, /* Fatal driver error */
+ BLK_FATAL_TRNSPT, /* Fatal transport error */
+ BLK_FATAL_DRV, /* Fatal driver error */
+ BLK_RETRY_DEV, /* Device error, I/O may be retried */
+ BLK_RETRY_TRNSPT, /* Transport error, I/O may retried */
+ BLK_RETRY_DRV, /* Driver error, I/O may be retried */
AFAICT, the only need for a _DRV as in Mike's patch was to handle
-EWOULDBLOCK (can't find this in current source though ..), so we might
need only a BIO_ERROR_RETRY.
And then in dm:
BIO_SUCCESS: complete IO with no failure
BIO_ERROR_RETRY: never makes it to dm
BIO_ERROR_ERR: treat as a failed IO
BIO_ERROR_DEV_FAILURE: failed IO
BIO_ERROR_DEV_RETRY: retry on any path
BIO_ERROR_TRNSPT_FAILURE: fail path
BIO_ERROR_TRNSPT_RETRY: retry on any path
BIO_ERROR_TIMEOUT: hard to handle, needs to retry on another
path, but mark this path as potentially
failing. For now, it could just fail the
path (then we are in the same situation as
today).
BIO_ERROR_TRNSPT_INACTIVE: failover ...
Non-dm users treat non BIO_SUCCESS results as IO failures (we should not
return retry errors unless fast fail is set).
"retry on any path" would normally use a different path if one is
available.
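A compilable sketch of the mapping above (editor's illustration): the enum bio_error names come from this proposal, while the dm_action names and dm_dispose() are invented just to show the shape of the dm-side switch.

```c
/* Editor's sketch of the dm-side disposition table proposed above.
 * enum bio_error names are from the mail; dm_action names and the
 * dispose function are illustrative. */
#include <assert.h>

enum bio_error {
    BIO_SUCCESS = 0,
    BIO_ERROR_ERR,
    BIO_ERROR_RETRY,
    BIO_ERROR_DEV_FAILURE,
    BIO_ERROR_DEV_RETRY,
    BIO_ERROR_TRNSPT_FAILURE,
    BIO_ERROR_TRNSPT_RETRY,
    BIO_ERROR_TIMEOUT,
    BIO_ERROR_TRNSPT_INACTIVE,
};

enum dm_action {
    DM_COMPLETE, DM_FAIL_IO, DM_RETRY_ANY_PATH, DM_FAIL_PATH, DM_SWITCH_PG,
};

static enum dm_action dm_dispose(enum bio_error err)
{
    switch (err) {
    case BIO_SUCCESS:
        return DM_COMPLETE;
    case BIO_ERROR_DEV_RETRY:
    case BIO_ERROR_TRNSPT_RETRY:
        return DM_RETRY_ANY_PATH;
    case BIO_ERROR_TRNSPT_FAILURE:
    case BIO_ERROR_TIMEOUT:          /* "for now, just fail the path" */
        return DM_FAIL_PATH;
    case BIO_ERROR_TRNSPT_INACTIVE:  /* SP failover */
        return DM_SWITCH_PG;
    case BIO_ERROR_RETRY:            /* should never reach dm */
    case BIO_ERROR_ERR:
    case BIO_ERROR_DEV_FAILURE:
    default:
        return DM_FAIL_IO;
    }
}
```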
We still need a scsi vendor-specific decoder; I'd volunteered to do that
work before, but there were no responses the last time I brought it up (on
linux-scsi).
-- Patrick Mansfield
* RE: Re: fastfail operation and retries
@ 2005-04-21 21:02 goggin, edward
0 siblings, 0 replies; 17+ messages in thread
From: goggin, edward @ 2005-04-21 21:02 UTC (permalink / raw)
To: 'Lars Marowsky-Bree', device-mapper development,
Andreas Herrmann
Cc: Linux SCSI
On Thursday, April 21, 2005 3:55 PM, Lars Marowsky-Bree wrote:
> Together with the "queue_if_no_path" feature flag for dm-mpath that
> should do what you need to handle this (arguably broken) array
> behaviour: It'll queue until the error goes away and
> multipathd retests
> and reactivates the paths. That ought to work, but given that I don't
> have an IBM ESS accessible, please confirm that.
Depending on the "queue_if_no_path" feature has the current undesirable
side-effect of requiring intervention by the user space multipath components
to reinstate at least one of the paths to a usable state in the multipath
target driver. This dependency currently creates the potential for deadlock
scenarios, since neither the user space multipath components nor the kernel,
for that matter, are currently architected to avoid them.
I think for now it may be better to try to avoid having to fail a path if it
is possible that an io error is not path related.
* RE: Re: fastfail operation and retries
@ 2005-04-21 21:31 goggin, edward
2005-04-21 21:49 ` Lars Marowsky-Bree
0 siblings, 1 reply; 17+ messages in thread
From: goggin, edward @ 2005-04-21 21:31 UTC (permalink / raw)
To: 'Lars Marowsky-Bree', device-mapper development,
Andreas Herrmann
Cc: Linux SCSI
> -----Original Message-----
> From: linux-scsi-owner@vger.kernel.org
> [mailto:linux-scsi-owner@vger.kernel.org] On Behalf Of Lars
> Marowsky-Bree
> Sent: Thursday, April 21, 2005 5:19 PM
> To: device-mapper development; Andreas Herrmann
> Cc: Linux SCSI
> Subject: Re: [dm-devel] Re: fastfail operation and retries
>
> On 2005-04-21T17:02:44, "goggin, edward" <egoggin@emc.com> wrote:
>
> > Depending on the "queue_if_no_path" feature has the current undesirable
> > side-effect of requiring intervention by the user space multipath
> > components to reinstate at least one of the paths to a usable state in
> > the multipath target driver. This dependency currently creates the
> > potential for deadlock scenarios, since neither the user space multipath
> > components nor the kernel, for that matter, are currently architected to
> > avoid them.
>
> multipath-tools is, to a certain degree, architected to avoid them. And
> the kernel is meant to be, too - there's bugs and known FIXME's, but
> those are just bugs and we're taking patches gladly ;-)
>
> > I think for now it may be better to try to avoid having to fail a path
> > if it is possible that an io error is not path related.
>
> No. Basically every time out error creates a "dunno why" error right now
> - could be the storage system itself, could be the network in between.
>
I was really thinking of the code where the sense key/asc/ascq makes it
into the bio.
> A failover to another path is the obvious remedy; take for example the
> CX series where even if it's not the path, it's the SP, and failing over
> to the other SP will cure the problem.
>
> If the storage at least rejects the IO with a specific error code, it
> can be worked around by a specific hw handler which doesn't fail the
> path but just causes the IO to be queued and retried; that's a pretty
> simple hardware handler to write.
I agree we and likely other storage vendors could do a better job here.
But that said, the multipathing code could also avoid failing the path
just because an io error occurred on that path. Instead, this could be
the sole responsibility of path testing (from user space) which could
reduce the likelihood of media errors being confused with path
connectivity ones.
>
> But quite frankly, storage subsystems which _reject_ all IO for a given
> time are just broken for reliable configurations. What good are they in
> multipath configurations if they fail _all_ paths at the same time? How
> can they even dare claim redundancy? We can build more or less smelly
> kludges around them, but it remains a problem to be fixed at the storage
> subsystem level IMNSHO.
I agree that it's unfortunate that the CLARiiON is failing all paths
during NDU, even for a restricted amount of time. Even so, it must
be dealt with as is.
>
>
> Sincerely,
> Lars Marowsky-Brée <lmb@suse.de>
>
> --
> High Availability & Clustering
> SUSE Labs, Research and Development
> SUSE LINUX Products GmbH - A Novell Business
* Re: Re: fastfail operation and retries
2005-04-21 21:31 goggin, edward
@ 2005-04-21 21:49 ` Lars Marowsky-Bree
0 siblings, 0 replies; 17+ messages in thread
From: Lars Marowsky-Bree @ 2005-04-21 21:49 UTC (permalink / raw)
To: device-mapper development, Andreas Herrmann; +Cc: Linux SCSI
On 2005-04-21T17:31:46, "goggin, edward" <egoggin@emc.com> wrote:
> > No. Basically every time out error creates a "dunno why" error right
> > now - could be the storage system itself, could be the network in
> > between.
> >
> I was really thinking of the code where the sense key/asc/ascq makes
> it into the bio.
We don't get sense data for transport errors and certain storage
failures, though.
> I agree we and likely other storage vendors could do a better job
> here. But that said, the multipathing code could also avoid failing
> the path just because an io error occurred on that path. Instead,
> this could be the sole responsibility of path testing (from user
> space) which could reduce the likelihood of media errors being
> confused with path connectivity ones.
If we can't differentiate in the kernel where we have the IO error
details available, then how would user-space? You're not solving the
problem ;-)
> I agree that it's unfortunate that the CLARiiON is failing all paths
> during NDU, even for a restricted amount of time. Even so, it must
> be dealt with as is.
It does? According to my documentation, for the CX family, the FC4700(-2),
and likely the Symmetrix, NDU is a rolling update, so one
Service Processor always remains accessible, with enough delay in between
that path retesting will have re-enabled the path.
We get a 02/04/03 Path Not Ready error code for this case, which the
dm-emc.c handler translates into an immediate switch_pg.
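The check described here amounts to matching one sense-key/ASC/ASCQ triple. As an editor's sketch (the function and action names below are invented, not the actual dm-emc.c code):

```c
/* Editor's sketch of the 02/04/03 ("Path Not Ready") sense check that
 * is described as triggering a priority-group switch. Names are
 * illustrative, not dm-emc.c source. */
#include <assert.h>

enum emc_action { EMC_PASS_ERROR, EMC_SWITCH_PG };

static enum emc_action emc_check_sense(unsigned char sense_key,
                                       unsigned char asc,
                                       unsigned char ascq)
{
    /* 02/04/03: NOT READY, LUN not ready -- fail over to the other SP */
    if (sense_key == 0x02 && asc == 0x04 && ascq == 0x03)
        return EMC_SWITCH_PG;
    return EMC_PASS_ERROR;  /* let normal error handling proceed */
}
```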
In fact, the user-space testing code will receive pre-notification of a
pending NDU by the LUN Operations field being set to 1, which will cause
user-space to flag that path as down, even if there's no in-flight IO.
This combined ought to cover the NDU case pretty well and is implemented
already. (And supposedly works in SLES9 SP2 beta3.)
According to my docs, the only EMC array which does fail all paths
during a software update (by doing a "Warm Reboot") is a FC4500 array.
Not sure whether this also includes the AX-series, though, my doc
doesn't mention it. The FC4500 might not respond to IO for up to 50
seconds, in which case queue_if_no_path and user-space retesting
provide adequate (as good as possible) coverage to reinstate the paths.
(The fact that no write/reads complete should automatically throttle the
IO, too; however, this might not be true for certain write patterns, and
in particular async IO (how could we possibly throttle _that_?). IO
throttling in this case remains a problem which we might need to
address.)
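As a toy model of the queue_if_no_path semantics (an illustrative sketch only, not the dm-mpath implementation; all names here are made up):

```python
# Toy model of dm-mpath's queue_if_no_path behaviour (illustrative only).
# With the flag set, IO issued while all paths are failed is held on a
# queue instead of erroring, and is reissued once a path is reinstated.

class MultipathDevice:
    def __init__(self, paths, queue_if_no_path=True):
        self.healthy = set(paths)      # paths currently usable
        self.queue_if_no_path = queue_if_no_path
        self.pending = []              # IO held while no path is up
        self.completed = []
        self.errored = []

    def fail_path(self, path):
        self.healthy.discard(path)

    def reinstate_path(self, path):
        self.healthy.add(path)
        # flush IO that was queued during the all-paths-down window
        for io in self.pending:
            self.completed.append((io, path))
        self.pending.clear()

    def submit(self, io):
        if self.healthy:
            path = sorted(self.healthy)[0]   # trivial path selector
            self.completed.append((io, path))
        elif self.queue_if_no_path:
            self.pending.append(io)          # hold, don't error
        else:
            self.errored.append(io)          # fastfail: error upward

# A transient all-paths-down window (e.g. an array software update):
dev = MultipathDevice(["sda", "sdb"])
dev.fail_path("sda"); dev.fail_path("sdb")
dev.submit("io-1")                 # queued, not errored
dev.reinstate_path("sdb")          # retest succeeded; queued IO completes
```
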
I guess you get what you pay for: The arrays which _do_ have this
misbehaviour _will_ be problematic in certain configurations; putting
swap on them comes to mind.
As this allows EMC and other vendors to sell their higher end arrays, I
can't see how you could possibly complain ;-)
I stand by my point that any array which does have this behaviour does
not qualify as high-end storage.
Sincerely,
Lars Marowsky-Brée <lmb@suse.de>
--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business
* Re: [dm-devel] Re: fastfail operation and retries
@ 2005-04-21 21:33 Andreas Herrmann
2005-04-21 22:24 ` Lars Marowsky-Bree
0 siblings, 1 reply; 17+ messages in thread
From: Andreas Herrmann @ 2005-04-21 21:33 UTC (permalink / raw)
To: Lars Marowsky-Bree
Cc: device-mapper development, Linux SCSI, linux-scsi-owner
On 21.04.2005 21:54, Lars Marowsky-Bree <lmb@suse.de> wrote:
> On 2005-04-21T09:42:05, Patrick Mansfield <patmans@us.ibm.com> wrote:
> > On Tue, Apr 19, 2005 at 07:19:53PM +0200, Andreas Herrmann wrote:
<snip>
> >
> > We need a patch like Mike Christie had, this:
> >
> > http://marc.theaimsgroup.com/?l=linux-kernel&m=107961883914541&w=2
> >
> > The scsi core should decode the sense data and pass up the result,
> > then dm need not decode sense data, and we don't need sense data
> > passed around via the block layer.
> The most recent udm patchset has a patch by Jens Axboe and myself to
> pass up sense data / error codes in the bio so the dm mpath module can
> deal with it.
> Only issue still is that the SCSI midlayer only generates a single
> "EIO" code, also for timeouts; however, that pretty much means it's a
> transport error, because if it was a media error, we'd be getting sense
> data ;-)
Well, there are various situations when all paths to the ESS are
"temporarily unavailable". In some cases TASK_SET_FULL/BUSY is
reported as it should be. In other cases we just encounter data
underruns or exchange sequences are aborted and finally it might be
that requests just time out. BTW, it is not only ESS where I have seen
such (broken) behaviour.
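The distinction above (sense data present versus a bare timeout/EIO) can be sketched as a rough classifier. This is illustrative only, with made-up function names, not the midlayer's actual code:

```python
# Rough sketch of the error-classification heuristic discussed above
# (hypothetical, not the SCSI midlayer's code): if a failed command came
# back with valid sense data, the device itself answered, so it is likely
# a media/device-level condition; a bare EIO or timeout with no sense
# points at the transport.

MEDIUM_ERROR, HARDWARE_ERROR = 0x3, 0x4   # SCSI sense keys

def classify_failure(sense_key=None, asc=None, ascq=None):
    if sense_key is None:
        return "transport"            # no sense: timeout / link error
    if sense_key in (MEDIUM_ERROR, HARDWARE_ERROR):
        return "media"                # device reported a real error
    if (sense_key, asc, ascq) == (0x2, 0x4, 0x3):
        return "path-not-ready"       # the 02/04/03 case cited earlier
    return "retryable"                # e.g. NOT READY, UNIT ATTENTION

print(classify_failure())                    # transport
print(classify_failure(0x3, 0x11, 0x0))      # media
print(classify_failure(0x2, 0x4, 0x3))       # path-not-ready
```
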
> Together with the "queue_if_no_path" feature flag for dm-mpath that
> should do what you need to handle this (arguably broken) array
> behaviour: It'll queue until the error goes away and multipathd retests
> and reactivates the paths. That ought to work, but given that I don't
> have an IBM ESS accessible, please confirm that.
Sounds good. Will make some tests using the "queue_if_no_path" feature.
> It is possible that to fully support them a dm mpath hardware handler
> (like for the EMC CX family) might be required, too.
For the time being I hope "queue_if_no_path" feature is sufficient
to successfully pass our tests ;-)
> (For easier testing, you'll find that all this functionality is
> available in the latest SLES9 SP2 betas, to which you ought to have
> access at IBM, and the kernels are also available via
> ftp://ftp.suse.com/pub/projects/kernel/kotd/.)
> > scsi core could be changed to handle device specific decoding via
> > sense tables that can be modified via sysfs, similar to devinfo code
> > (well, devinfo still lacks a sysfs interface).
> dm-mpath's capabilities go a bit beyond just the error decoding (which
> for generic devices is also provided for in a generic
> dm_scsi_err_handler()); for example you can code special initialization
> commands and behaviour an array might need.
> Maybe this could indeed be abstracted further to download the command
> and/or specific decoding tables from user-space via sysfs or configfs by
> a generic user-space customizable dm-hw-handler-generic.[ch] plugin; I
> think patches are being accepted ;-)
Thanks for the information.
Regards,
Andreas
* Re: Re: fastfail operation and retries
2005-04-21 21:33 [dm-devel] " Andreas Herrmann
@ 2005-04-21 22:24 ` Lars Marowsky-Bree
2005-04-22 19:13 ` Lan
0 siblings, 1 reply; 17+ messages in thread
From: Lars Marowsky-Bree @ 2005-04-21 22:24 UTC (permalink / raw)
To: device-mapper development; +Cc: Linux SCSI, aherrman
On 2005-04-21T23:33:57, Andreas Herrmann <aherrman@de.ibm.com> wrote:
> Well, there are various situations when all paths to the ESS are
> "temporarily unavailable". In some cases TASK_SET_FULL/BUSY is
> reported as it should be.
Not sure whether this sense data is decoded and handled correctly in
dm-mpath yet. I don't have detailed specs, nor a feature request to
allocate time to work on making sure it really does. I recommend that
someone at IBM takes the real specs for the ESS and makes sure that it
all works, by a combination of the right defaults in the multipath-tools
hwtable and, if need be, a dm-ess plugin to handle this.
This would be much appreciated.
> underruns or exchange sequences are aborted and finally it might be
> that requests just time out. BTW, it is not only ESS where I have seen
> such (broken) behaviour.
Well, what can I say. Broken behaviour needs to be documented and worked
around, but obviously only as far as that is possible.
> > It is possible that to fully support them a dm mpath hardware handler
> > (like for the EMC CX family) might be required, too.
> For the time being I hope "queue_if_no_path" feature is sufficient
> to successfully pass our tests ;-)
If it is sufficient, you might at least wish to update the
multipath-tools hwtable entry so that it is automagically set for your
arrays.
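For illustration, such a default could look roughly like the following multipath.conf fragment. The vendor/product strings and option names here are placeholders (the IBM ESS identifies as a 2105, but verify against your array and your multipath-tools version before relying on this):

```
devices {
	device {
		vendor   "IBM"
		product  "2105*"                 # placeholder pattern for the ESS
		features "1 queue_if_no_path"    # queue IO while all paths are down
		path_checker tur                 # TEST UNIT READY based retesting
	}
}
```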
Sincerely,
Lars Marowsky-Brée <lmb@suse.de>
--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business
* Re: Re: fastfail operation and retries
2005-04-21 22:24 ` Lars Marowsky-Bree
@ 2005-04-22 19:13 ` Lan
2005-04-25 23:56 ` [dm-devel] " Tim Pepper
2005-04-26 9:55 ` Lars Marowsky-Bree
0 siblings, 2 replies; 17+ messages in thread
From: Lan @ 2005-04-22 19:13 UTC (permalink / raw)
To: device-mapper development; +Cc: Linux SCSI, aherrman
On 4/21/05, Lars Marowsky-Bree <lmb@suse.de> wrote:
> On 2005-04-21T23:33:57, Andreas Herrmann <aherrman@de.ibm.com> wrote:
>
> > Well, there are various situations when all paths to the ESS are
> > "temporarily unavailable". In some cases TASK_SET_FULL/BUSY is
> > reported as it should be.
>
> Not sure whether this sense data is decoded and handled correctly in
> dm-mpath yet. I don't have detailed specs, nor a feature request to
> allocate time to work on making sure it really does. I recommend that
> someone at IBM takes the real specs for the ESS and makes sure that it
> all works, by a combination of the right defaults in the multipath-tools
> hwtable and, if need be, a dm-ess plugin to handle this.
>
> This would be much appreciated.
>
Please correct me if my assumption is wrong, but I would think that
transient errors are expected, especially in a SAN, from both the
fabric and media. A storage device may have to return retryable status
conditions at certain points, and that such retryable conditions are
not necessarily specific to a storage device. For example, a
QUEUE_FULL or BUSY, implying that the device is congested. Wouldn't
most storage devices reasonably expect I/O failed due to this
condition will be retried? [Such a congestion handling mechanism, I
would think, would not have to be storage-specific, although the
policy for handling congestion might be?] So in order to deal with
transient conditions given that the failfast flag is set,
queue_if_no_path must be used; I'm not sure why any dm-multipath
storage users would not want to turn on queue_if_no_path by default?
As far as I know, ESS does not require any special handling of special
sense information, besides various sense data status conditions that
it expects would be retried. (Aren't data underruns also an expected
retryable condition?). I'm not so familiar with all the various
possible transport and media errors/conditions, but I would think that
most could/would want to be handled generically by storage devices
(which is why the scsi core has generic error handling i'd imagine).
But I agree that more testing should be done with ESS and its spec to
verify that a special dm-ess error handler is actually not needed.
And at the least, a hw entry should be added to dm to turn on
queue_if_no_path by default for ESS, and any other necessary defaults.
Although, it seems we need to add to multipath-tools the ability to set
a timeout limit on how long an I/O is queued and retried (otherwise in
a permanent failure, I think the I/O could be queued for quite a
while, e.g. until the system runs out of memory).
Also, what do you think about allowing a configurable threshold on I/O
failures in dm-multipath before deciding to set a path dead; 1 is
kinda low, and has no tolerance at all for transient errors. I think
it will lessen the dependency on waiting for multipath-tools to
reinstate a path that has been set dead due to a transient condition.
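A sketch of that threshold idea (hypothetical; dm-mpath at the time fails a path on the first error, and the class and names below are made up):

```python
# Sketch of a configurable fail-threshold as proposed above (illustrative
# only, not dm-mpath code). A path is only marked dead after `threshold`
# consecutive errors; any success resets the count, so isolated transient
# errors are tolerated.

class PathHealth:
    def __init__(self, threshold=3):
        self.threshold = threshold
        self.errors = 0
        self.dead = False

    def io_failed(self):
        self.errors += 1
        if self.errors >= self.threshold:
            self.dead = True          # give up: mark the path dead

    def io_succeeded(self):
        self.errors = 0               # transient blip forgiven

p = PathHealth(threshold=3)
p.io_failed(); p.io_failed()          # two transient errors
p.io_succeeded()                      # recovered: counter resets
p.io_failed()
print(p.dead)                         # never saw 3 errors in a row
```

The trade-off Lars raises still applies: the higher the threshold, the longer a genuinely dead path keeps receiving IO before it is failed over.
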
Thanks!
Lan
* Re: [dm-devel] Re: fastfail operation and retries
2005-04-22 19:13 ` Lan
@ 2005-04-25 23:56 ` Tim Pepper
2005-04-27 14:44 ` Lars Marowsky-Bree
2005-04-26 9:55 ` Lars Marowsky-Bree
1 sibling, 1 reply; 17+ messages in thread
From: Tim Pepper @ 2005-04-25 23:56 UTC (permalink / raw)
To: tranlan; +Cc: device-mapper development, Linux SCSI, aherrman
On 4/22/05, Lan <transter@gmail.com> wrote:
>
> queue_if_no_path must be used; I'm not sure why any dm-multipath
> storage users would not want to turn on queue_if_no_path by default?
What protection is there against long term queueing and running the
machine out of memory?
* Re: Re: fastfail operation and retries
2005-04-25 23:56 ` [dm-devel] " Tim Pepper
@ 2005-04-27 14:44 ` Lars Marowsky-Bree
2005-04-27 22:57 ` Tim Pepper
0 siblings, 1 reply; 17+ messages in thread
From: Lars Marowsky-Bree @ 2005-04-27 14:44 UTC (permalink / raw)
To: Tim Pepper, device-mapper development, tranlan; +Cc: Linux SCSI, aherrman
On 2005-04-25T16:56:56, Tim Pepper <tpepper@gmail.com> wrote:
> > queue_if_no_path must be used; I'm not sure why any dm-multipath
> > storage users would not want to turn on queue_if_no_path by default?
> What protection is there against long term queueing and running the
> machine out of memory?
User-space needs to take action and tell us when to stop queuing.
Sincerely,
Lars Marowsky-Brée <lmb@suse.de>
--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business
* Re: Re: fastfail operation and retries
2005-04-27 14:44 ` Lars Marowsky-Bree
@ 2005-04-27 22:57 ` Tim Pepper
2005-05-03 11:11 ` Lars Marowsky-Bree
0 siblings, 1 reply; 17+ messages in thread
From: Tim Pepper @ 2005-04-27 22:57 UTC (permalink / raw)
To: Lars Marowsky-Bree
Cc: device-mapper development, tranlan, Linux SCSI, aherrman
On 4/27/05, Lars Marowsky-Bree <lmb@suse.de> wrote:
> User-space needs to take action and tell us when to stop queuing.
Is there any risk of priority inversion? I can't think of a specific
issue beyond the userspace daemon process simply not existing that
wouldn't hopefully settle out over time. I haven't looked closely
at this aspect of 2.6, but it used to be easy to get/keep the cpu busy
enough flushing IO to disk to hurt userspace response times (fibre
pulls during heavy buffered filesystem IO effectively DoSing the
machine for a long period). If that sort of thing is still possible,
it seems risky to rely on a userspace application for
timely/meaningful recovery of the resources consumed by the IO.
* Re: Re: fastfail operation and retries
2005-04-27 22:57 ` Tim Pepper
@ 2005-05-03 11:11 ` Lars Marowsky-Bree
0 siblings, 0 replies; 17+ messages in thread
From: Lars Marowsky-Bree @ 2005-05-03 11:11 UTC (permalink / raw)
To: Tim Pepper, device-mapper development; +Cc: tranlan, Linux SCSI, aherrman
On 2005-04-27T15:57:09, Tim Pepper <tpepper@gmail.com> wrote:
> > User-space needs to take action and tell us when to stop queuing.
> Is there any risk of priority inversion?
That risk of course always exists (and it'd exist in the kernel too).
The code in question needs to be audited to make sure this case is
taken care of.
Sincerely,
Lars Marowsky-Brée <lmb@suse.de>
--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business
"Ignorance more frequently begets confidence than does knowledge"
-- Charles Darwin
* Re: Re: fastfail operation and retries
2005-04-22 19:13 ` Lan
2005-04-25 23:56 ` [dm-devel] " Tim Pepper
@ 2005-04-26 9:55 ` Lars Marowsky-Bree
1 sibling, 0 replies; 17+ messages in thread
From: Lars Marowsky-Bree @ 2005-04-26 9:55 UTC (permalink / raw)
To: tranlan, device-mapper development; +Cc: Linux SCSI, aherrman
On 2005-04-22T12:13:53, Lan <transter@gmail.com> wrote:
> Although, it seems we need to add to multipath-tools the ability to set
> a timeout limit on how long an I/O is queued and retried (otherwise in
> a permanent failure, I think the I/O could be queued for quite a
> while, e.g. until the system runs out of memory).
This can actually be implemented in user-space. If the paths stay down
for N seconds, remove the queue_if_no_path feature flag, and all IO will
be failed.
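That user-space policy might look roughly like this (hypothetical daemon logic, not multipathd code; in practice the "remove the flag" step would be something like a `dmsetup message <map> 0 fail_if_no_path`):

```python
# Sketch of the user-space no-path timeout described above (illustrative
# only). If every path has been down for longer than `timeout` seconds,
# queue_if_no_path should be cleared so queued IO is failed back to the
# upper layers instead of accumulating indefinitely.

def no_path_watchdog(down_since, now, timeout=60.0):
    """Return True when queue_if_no_path should be removed.

    down_since -- monotonic timestamp when the last path failed,
                  or None if at least one path is still healthy.
    now        -- current monotonic timestamp.
    """
    if down_since is None:
        return False                  # paths available: keep queuing
    return (now - down_since) > timeout

print(no_path_watchdog(None, 100.0))  # a path is up: keep queuing
print(no_path_watchdog(10.0, 50.0))   # down 40s < 60s: keep queuing
print(no_path_watchdog(10.0, 80.0))   # down 70s > 60s: fail the IO
```
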
> Also, what do you think about allowing a configurable threshold on I/O
> failures in dm-multipath before deciding to set a path dead; 1 is
> kinda low, and has no tolerance at all for transient errors.
That might be a good idea.
Note however that DM mpath already distinguishes between path failures
and media failures for example: A media failure will not cause a path to
be failed.
And there's also a trade-off: As long as the path is not failed, it'll
receive more IO, which, if it doesn't turn out to be a transient error,
we will need to wait on to fail, and then requeue and retry
somewhere else. This causes delays.
Failing the path on the first error potentially attributable to the
transport will cause an immediate retry on another path though; and if
it turns out to be a transient error, the path will be returned into
operation within a couple of seconds by user-space.
> I think it will lessen the dependency on waiting for multipath-tools
> to reinstate a path that has been set dead due to a transient
> condition.
True, but this is actually by current design, because we want to
redirect IO to healthy paths as quickly as possible.
Sincerely,
Lars Marowsky-Brée <lmb@suse.de>
--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business
* RE: Re: fastfail operation and retries
@ 2005-04-21 22:01 goggin, edward
2005-04-21 22:16 ` Lars Marowsky-Bree
0 siblings, 1 reply; 17+ messages in thread
From: goggin, edward @ 2005-04-21 22:01 UTC (permalink / raw)
To: 'Lars Marowsky-Bree', device-mapper development,
Andreas Herrmann
Cc: Linux SCSI
> -----Original Message-----
> From: linux-scsi-owner@vger.kernel.org
> [mailto:linux-scsi-owner@vger.kernel.org] On Behalf Of Lars
> Marowsky-Bree
> Sent: Thursday, April 21, 2005 5:50 PM
> To: device-mapper development; Andreas Herrmann
> Cc: Linux SCSI
> Subject: Re: [dm-devel] Re: fastfail operation and retries
>
> On 2005-04-21T17:31:46, "goggin, edward" <egoggin@emc.com> wrote:
>
> > > No. Basically every time out error creates a "dunno why"
> error right
> > > now - could be the storage system itself, could be the network in
> > > between.
> > >
> > I was really thinking of the code where the sense key/asc/ascq makes
> > it into the bio.
>
> We don't get sense data for transport errors and certain storage
> failures, though.
>
> > I agree we and likely other storage vendors could do a better job
> > here. But that said, the multipathing code could also avoid failing
> > the path just because an io error occurred on that path. Instead,
> > this could be the sole responsibility of path testing (from user
> > space) which could reduce the likelihood of media errors being
> > confused with path connectivity ones.
>
> If we can't differentiate in the kernel where we have the IO error
> details available, then how would user-space? You're not solving the
> problem ;-)
Maybe not completely, but at least an inquiry of page 83 will not trip
over media errors. Also, why use a different test for determining path
success than the one used for path failure?
>
> > I agree that it's unfortunate that the CLARiiON is failing all paths
> > during NDU, even for a restricted amount of time. Even so, it must
> > be dealt with as is.
>
> It does? According to my documentation, the CX-family, the FC4700(-2)
> and likely the Symmetrix NDU is a rolling update, so that always one
> Service-Processor remains accessible, with enough delay in
> between them
> that path retesting will have reenabled the path.
>
> We get an 02/04/03 Path Not Ready error code for this case,
> which in the
> dm-emc.c handler is translated to an immediate switch_pg.
>
> In fact, the user-space testing code will receive
> pre-notification of a
> pending NDU by the LUN Operations field being set to 1, which
> will cause
> user-space to flag that path as down, even if there's no in-flight IO.
>
> This combined ought to cover the NDU case pretty well and is
> implemented
> already. (And supposedly works in SLES9 SP2 beta3.)
>
> According to my docs, the only EMC array which does fail all paths
> during a software update (by doing a "Warm Reboot") is a FC4500 array.
> Not sure whether this also includes the AX-series, though, my doc
> doesn't mention it. The FC4500 might not respond to IO for up to 50
> seconds; in which case the queue_if_no_path and user-space retesting
> provides adequate (as good as possible) coverage to reinstate
> the paths.
I am seeing an all-paths-down time period whenever I perform an NDU
for a CX300 while running 1 (async write-behind) dd thread per
mapped device for 16 mapped devices.
>
> (The fact that no write/reads complete should automatically
> throttle the
> IO, too; however, this might not be true for certain write
> patterns, and
> in particular async IO (how could we possibly throttle _that_?). IO
> throttling in this case remains a problem which we might need to
> address.)
This is the problem I am referring to.
>
> I guess you get what you pay for: The arrays which _do_ have this
> misbehaviour _will_ be problematic in certain configurations; putting
> swap on them comes to mind.
>
> As this allows EMC and other vendors to sell their higher end
> arrays, I
> can't see how you could possibly complain ;-)
>
> I stand by my point that any array which does have this behaviour does
> not qualify as high-end storage.
>
>
> Sincerely,
> Lars Marowsky-Brée <lmb@suse.de>
>
> --
> High Availability & Clustering
> SUSE Labs, Research and Development
> SUSE LINUX Products GmbH - A Novell Business
* Re: Re: fastfail operation and retries
2005-04-21 22:01 goggin, edward
@ 2005-04-21 22:16 ` Lars Marowsky-Bree
0 siblings, 0 replies; 17+ messages in thread
From: Lars Marowsky-Bree @ 2005-04-21 22:16 UTC (permalink / raw)
To: device-mapper development, Andreas Herrmann; +Cc: Linux SCSI
On 2005-04-21T18:01:04, "goggin, edward" <egoggin@emc.com> wrote:
> > If we can't differentiate in the kernel where we have the IO error
> > details available, then how would user-space? You're not solving the
> > problem ;-)
> Maybe not completely, but at least an inquiry of page 83 will not trip
> over media errors. Also, why use a different test for determining path
> success than the one used for path failure?
If the kernel sees an error, it needs to take action. It has immediate
knowledge of the error, while the further user-space diagnosis (or even
further in-kernel diagnosis; where this is actually implemented doesn't
matter) obviously lags behind.
I think the aim is to immediately react and re-route IO to reduce the
interruption to upper layers. In principle, if we have healthy paths,
rerouting is always safe; only if we know for sure it's a media error
(as indicated by appropriate sense data) do we immediately report IO
error to upper layers, or switch pgs instead of failing the path etc.
This is a pessimistic approach: take a potentially failed path out of
service asap.
What also happens though is that an event is sent to user-space, and
user-space "immediately" retests the path, and if it finds it healthy,
will reinstate it.
I believe this is correct behaviour.
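Sketched as a tiny state machine (illustrative only, not the actual dm-mpath/multipathd code; the class and method names are made up):

```python
# Sketch of the pessimistic policy described above: on any ambiguous IO
# error the path is failed immediately and IO rerouted; a user-space
# retest then either reinstates it (transient error) or leaves it dead.

ACTIVE, FAILED = "active", "failed"

class Path:
    def __init__(self):
        self.state = ACTIVE

    def io_error(self, is_media_error=False):
        # Media errors (identified by clear sense data) are reported
        # upward and do NOT fail the path; anything ambiguous fails it.
        if not is_media_error:
            self.state = FAILED       # take out of service asap

    def retest(self, healthy):
        # user-space path checker (e.g. TUR, or an INQUIRY of page 0x83)
        if self.state == FAILED and healthy:
            self.state = ACTIVE       # transient: back within seconds

p = Path()
p.io_error(is_media_error=True)       # media error: path stays active
print(p.state)
p.io_error()                          # ambiguous error: fail pessimistically
print(p.state)
p.retest(healthy=True)                # checker finds it fine: reinstate
print(p.state)
```
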
> > According to my docs, the only EMC array which does fail all paths
> > during a software update (by doing a "Warm Reboot") is a FC4500 array.
> > Not sure whether this also includes the AX-series, though, my doc
> > doesn't mention it. The FC4500 might not respond to IO for upto 50
> > seconds; in which case the queue_if_no_path and user-space retesting
> > provides adequate (as good as possible) coverage to reinstate
> > the paths.
>
> I am seeing an all-paths-down time period whenever I perform an NDU
> for a CX300 while running 1 (async write behind) dd thread per
> mapped device for 16 mapped devices.
Are you already running the code with the sense data decoding enabled,
for example a _very_ recent SLES9 SP2 beta kernel (basically, as of a
couple hours ago) or one with all patches applied from the multipath
bugzilla + multipath-tools pre18, and are you connected to both SPs?
If not, it's possible that that combo kernel didn't correctly handle
that case, because it didn't know about triggering a switch_pg etc.
And, if the CX300 indeed fails all paths during NDU at the same time, it
is behaving contrary to the published CX-series specification; in which
case it is an EMC (and not ours! ;-) bug and needs to be fixed in the
firmware ;-)
> > (The fact that no write/reads complete should automatically throttle
> > the IO, too; however, this might not be true for certain write
> patterns, and in particular async IO (how could we possibly throttle
> > _that_?). IO throttling in this case remains a problem which we
> > might need to address.)
> This is the problem I am refering to.
Well, I don't think so. This is an additional problem, but not one you
should be running into.
Sincerely,
Lars Marowsky-Brée <lmb@suse.de>
--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business
Thread overview: 17+ messages
2005-04-19 17:19 fastfail operation and retries Andreas Herrmann
2005-04-21 16:42 ` Patrick Mansfield
2005-04-21 19:54 ` Lars Marowsky-Bree
2005-04-21 22:13 ` Patrick Mansfield
2005-04-21 22:52 ` Lars Marowsky-Bree
2005-04-22 0:22 ` Patrick Mansfield
-- strict thread matches above, loose matches on Subject: below --
2005-04-21 21:02 goggin, edward
2005-04-21 21:31 goggin, edward
2005-04-21 21:49 ` Lars Marowsky-Bree
2005-04-21 21:33 [dm-devel] " Andreas Herrmann
2005-04-21 22:24 ` Lars Marowsky-Bree
2005-04-22 19:13 ` Lan
2005-04-25 23:56 ` [dm-devel] " Tim Pepper
2005-04-27 14:44 ` Lars Marowsky-Bree
2005-04-27 22:57 ` Tim Pepper
2005-05-03 11:11 ` Lars Marowsky-Bree
2005-04-26 9:55 ` Lars Marowsky-Bree
2005-04-21 22:01 goggin, edward
2005-04-21 22:16 ` Lars Marowsky-Bree