* fastfail operation and retries @ 2005-04-19 17:19 Andreas Herrmann 2005-04-21 16:42 ` Patrick Mansfield 0 siblings, 1 reply; 7+ messages in thread
From: Andreas Herrmann @ 2005-04-19 17:19 UTC (permalink / raw)
To: Linux SCSI

Hi,

I have question(s) regarding the fastfail operation of the SCSI stack.

Performing multipath tests with an IBM ESS I encountered problems. During certain operations on an ESS (quiesce/resume and such), requests on all paths fail temporarily with a data underrun (resid is set in the FCP response). In another situation, abort sequences happen (see FC-FS).

In both cases it is not a path failure; the device (ESS) reports error conditions temporarily (for some seconds).

Now, on an error on the first path, the multipath layer initiates failover to the other available path(s), where requests will immediately fail as well.

Using linux-2.4 and LVM such problems did not occur: there were enough retries (5 for each path) to handle such situations. Now, if the FASTFAIL flag is set, the SCSI stack prevents retries for failed SCSI commands.

The problem is that the multipath layer cannot distinguish between path and device failures (and won't do any retries for the failed request on the same path anyway).

How can an LLD force the SCSI stack to retry a failed SCSI command (without using DID_REQUEUE or DID_IMM_RETRY, which both do not change the retry counter)? What about a DID_FORCE_RETRY? Or is there any outlook on when there will be a better interface between the SCSI stack and the multipath layer to properly handle retries?

Regards,
Andreas

^ permalink raw reply [flat|nested] 7+ messages in thread
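The retry-counter semantics being asked about can be sketched in plain user-space C. The DID_IMM_RETRY and DID_REQUEUE values mirror 2.6-era scsi.h; DID_FORCE_RETRY, its numeric value, and the fake_cmd struct are hypothetical illustrations, not real kernel code:

```c
#include <assert.h>

/* Host-byte codes as in 2.6-era scsi.h; DID_FORCE_RETRY is the
 * hypothetical new code proposed above, not a real kernel constant. */
#define DID_IMM_RETRY   0x0c
#define DID_REQUEUE     0x0d
#define DID_FORCE_RETRY 0x1f    /* hypothetical */

struct fake_cmd {
    int retries;            /* retries used so far */
    int allowed;            /* retry budget */
    unsigned int result;    /* host byte in bits 16-23, as in scmd->result */
};

enum disp { NEEDS_RETRY, ADD_TO_MLQUEUE, FAILED };

/* Sketch of the distinction in question: DID_IMM_RETRY / DID_REQUEUE
 * requeue the command without touching the retry counter, while the
 * proposed DID_FORCE_RETRY would consume it, so a retry loop stays
 * bounded even with fastfail set. */
static enum disp decide(struct fake_cmd *cmd)
{
    switch ((cmd->result >> 16) & 0xff) {
    case DID_IMM_RETRY:
    case DID_REQUEUE:
        return ADD_TO_MLQUEUE;      /* requeued; counter unchanged */
    case DID_FORCE_RETRY:
        if (cmd->retries < cmd->allowed) {
            cmd->retries++;
            return NEEDS_RETRY;     /* retried; counter consumed */
        }
        return FAILED;              /* retry budget exhausted */
    default:
        return FAILED;
    }
}
```

The point of the sketch is only the counter behaviour: the first two codes can requeue forever, the third fails hard after `allowed` attempts.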
* Re: fastfail operation and retries 2005-04-19 17:19 fastfail operation and retries Andreas Herrmann @ 2005-04-21 16:42 ` Patrick Mansfield 2005-04-21 19:54 ` Lars Marowsky-Bree 0 siblings, 1 reply; 7+ messages in thread
From: Patrick Mansfield @ 2005-04-21 16:42 UTC (permalink / raw)
To: Andreas Herrmann; +Cc: Linux SCSI, dm-devel

On Tue, Apr 19, 2005 at 07:19:53PM +0200, Andreas Herrmann wrote:

> [...]
We need a patch like Mike Christie had, this:

http://marc.theaimsgroup.com/?l=linux-kernel&m=107961883914541&w=2

The scsi core should decode the sense data and pass up the result; then dm need not decode sense data, and we don't need sense data passed around via the block layer.

The scsi core could be changed to handle device-specific decoding via sense tables that can be modified via sysfs, similar to the devinfo code (well, devinfo still lacks a sysfs interface).

For the ESS, you probably also need the BLIST_RETRY_HWERROR that is in the current 2.6.12-rc.

-- Patrick Mansfield
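The kind of sense-key/ASC/ASCQ table lookup described here can be sketched in user-space C. Everything below is illustrative: the disposition names, the table format, and the specific ESS entries are assumptions, not the actual devinfo interface or real ESS behaviour (the NOT READY / UNIT ATTENTION / MEDIUM ERROR sense keys themselves are standard SCSI values):

```c
#include <assert.h>
#include <stddef.h>

/* Dispositions the core could hand back to upper layers (illustrative). */
enum disposition { DISP_FAIL, DISP_RETRY_PATH, DISP_RETRY_ANY };

struct sense_entry {
    unsigned char key, asc, ascq;   /* 0xff in asc/ascq matches any value */
    enum disposition disp;
};

/* Hypothetical per-device table; the idea in the mail is that real code
 * would let this be modified via sysfs, like devinfo. */
static const struct sense_entry ess_table[] = {
    { 0x02, 0x04, 0x01, DISP_RETRY_ANY },  /* NOT READY, becoming ready */
    { 0x06, 0xff, 0xff, DISP_RETRY_ANY },  /* any UNIT ATTENTION: retry */
    { 0x03, 0xff, 0xff, DISP_FAIL },       /* MEDIUM ERROR: hard failure */
};

static enum disposition decode_sense(const struct sense_entry *tbl, size_t n,
                                     unsigned char key, unsigned char asc,
                                     unsigned char ascq)
{
    for (size_t i = 0; i < n; i++) {
        if (tbl[i].key != key) continue;
        if (tbl[i].asc != 0xff && tbl[i].asc != asc) continue;
        if (tbl[i].ascq != 0xff && tbl[i].ascq != ascq) continue;
        return tbl[i].disp;
    }
    return DISP_FAIL;   /* unknown sense: fail conservatively */
}
```

Decoding once in the core, against a per-device table like this, is exactly what would spare dm from re-implementing sense parsing per hardware handler.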
* Re: Re: fastfail operation and retries 2005-04-21 16:42 ` Patrick Mansfield @ 2005-04-21 19:54 ` Lars Marowsky-Bree 2005-04-21 22:13 ` Patrick Mansfield 0 siblings, 1 reply; 7+ messages in thread
From: Lars Marowsky-Bree @ 2005-04-21 19:54 UTC (permalink / raw)
To: device-mapper development, Andreas Herrmann; +Cc: Linux SCSI

On 2005-04-21T09:42:05, Patrick Mansfield <patmans@us.ibm.com> wrote:

> [...]
> > We need a patch like Mike Christie had, this:
> >
> > http://marc.theaimsgroup.com/?l=linux-kernel&m=107961883914541&w=2
> >
> > The scsi core should decode the sense data and pass up the result, then dm
> > need not decode sense data, and we don't need sense data passed around via
> > the block layer.

The most recent udm patchset has a patch by Jens Axboe and myself to pass up sense data / error codes in the bio so the dm-mpath module can deal with it.

The only remaining issue is that the SCSI midlayer generates only a single "EIO" code, even for timeouts; however, that pretty much means it's a transport error, because if it were a media error, we'd be getting sense data ;-)

Together with the "queue_if_no_path" feature flag for dm-mpath, that should do what you need to handle this (arguably broken) array behaviour: it'll queue until the error goes away and multipathd retests and reactivates the paths. That ought to work, but given that I don't have an IBM ESS accessible, please confirm that. It is possible that to fully support them a dm-mpath hardware handler (like the one for the EMC CX family) might be required, too.

(For easier testing, you'll find that all this functionality is available in the latest SLES9 SP2 betas, to which you ought to have access at IBM; the kernels are also available via ftp://ftp.suse.com/pub/projects/kernel/kotd/.)

> scsi core could be changed to handle device specific decoding via sense
> tables that can be modified via sysfs, similar to devinfo code (well,
> devinfo still lacks a sysfs interface).

dm-mpath's capabilities go a bit beyond just the error decoding (which for generic devices is also provided by a generic dm_scsi_err_handler()); for example, you can code special initialization commands and behaviour an array might need.
Maybe this could indeed be abstracted further, to download the commands and/or specific decoding tables from user space via sysfs or configfs through a generic, user-space-customizable dm-hw-handler-generic.[ch] plugin; I think patches are being accepted ;-)

Sincerely,
Lars Marowsky-Brée <lmb@suse.de>

--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business
* Re: Re: fastfail operation and retries 2005-04-21 19:54 ` Lars Marowsky-Bree @ 2005-04-21 22:13 ` Patrick Mansfield 2005-04-21 22:52 ` Lars Marowsky-Bree 0 siblings, 1 reply; 7+ messages in thread
From: Patrick Mansfield @ 2005-04-21 22:13 UTC (permalink / raw)
To: Lars Marowsky-Bree
Cc: device-mapper development, Linux SCSI, Andreas Herrmann

On Thu, Apr 21, 2005 at 09:54:35PM +0200, Lars Marowsky-Bree wrote:

> > We need a patch like Mike Christie had, this:
> >
> > http://marc.theaimsgroup.com/?l=linux-kernel&m=107961883914541&w=2
> >
> > The scsi core should decode the sense data and pass up the result, then dm
> > need not decode sense data, and we don't need sense data passed around via
> > the block layer.
>
> The most recent udm patchset has a patch by Jens Axboe and myself to
> pass up sense data / error codes in the bio so the dm mpath module can
> deal with it.

But the scmd->result is not passed back. If we passed it back there would be enough information available, but then you would still need to add the same decoding as already found in the scsi core (scsi_decide_disposition and more). Better to decode the error once, and then pass that data back to the blk layer.

> Only issue still is that the SCSI midlayer does only generate a single
> "EIO" code also for timeouts; however, that pretty much means it's a
> transport error, because if it was a media error, we'd be getting sense
> data ;-)

How does lack of sense data imply that there was no media/device error? A timeout could be a failure anywhere: in the transport, or because of target/media/LUN problems. Or not a real error at all, just a busy device or too short a timeout setting. Currently the scsi core does not fastfail timeouts ...

Does the path checker take paths permanently offline after multiple failures?
If a timeout causes a path failure (which today means that the scsi core has already retried the command), and the path checker re-enables the path (for example, the path checker can send a TEST UNIT READY with no failure; this also means the scsi core has already retried the command), this could lead to retrying that IO (or even another IO) and hitting a timeout again on that path.

Also, a SCSI failure (the command made it to the media/device, but got some error) can happen without sense data, like any SCSI errors other than a CHECK_CONDITION that are not requeued by the scsi core (see the scsi_decide_disposition switch cases for status_byte(scmd->result)).

It's probably OK to just fail the path for all driver/transport errors (and non-sense errors) even if they are retryable: the path checker will just re-enable the path (maybe immediately). But we end up with different and potentially significant behaviour for some error cases with/without fastfail.

So though I don't like the approach: distinguishing timeouts, or ensuring that the path checker won't continually re-enable a path, might be good enough, as long as there are no other error cases (driver or SCSI) that could lead to long-lasting failures.

> > scsi core could be changed to handle device specific decoding via sense
> > tables that can be modified via sysfs, similar to devinfo code (well,
> > devinfo still lacks a sysfs interface).
>
> dm-path's capabilities go a bit beyond just the error decoding (which
> for generic devices is also provided for in a generic
> dm_scsi_err_handler()); for example you can code special initialization
> commands and behaviour an array might need.

Yes, but that doesn't mean we should decode SCSI sense or scsi core errors (i.e. scmd->result) in dm space. Also, non-SCSI drivers would like to use dm multipath, like DASD. Using extended blk errors allows simpler support for such devices and drivers.

-- Patrick Mansfield
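The driver/transport-versus-device distinction discussed above hinges on the host byte packed into scmd->result. A minimal user-space sketch, using DID_* values as defined in 2.6-era scsi.h (the classifier itself, and its choice of which codes count as path errors, is an illustrative assumption, not the actual scsi core policy):

```c
#include <assert.h>

/* scmd->result packs four byte-fields; the host byte (bits 16-23)
 * carries the DID_* transport/driver verdict. */
#define HOST_BYTE(result) (((result) >> 16) & 0xff)

/* A few DID_* host codes as defined in 2.6-era include/scsi/scsi.h. */
#define DID_OK          0x00
#define DID_NO_CONNECT  0x01
#define DID_BUS_BUSY    0x02
#define DID_TIME_OUT    0x03
#define DID_ERROR       0x07

/* Illustrative classifier: should a multipath layer treat this result
 * as a transport-level (path) failure rather than a device failure?
 * The selection of codes here is an assumption for the example. */
static int is_transport_error(unsigned int result)
{
    switch (HOST_BYTE(result)) {
    case DID_NO_CONNECT:    /* e.g. cable pull */
    case DID_TIME_OUT:
        return 1;
    default:
        return 0;
    }
}
```

Passing back something like this classification (rather than a bare EIO) is what would let dm fail only the path for transport errors while failing the IO for genuine device errors.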
* Re: Re: fastfail operation and retries 2005-04-21 22:13 ` Patrick Mansfield @ 2005-04-21 22:52 ` Lars Marowsky-Bree 2005-04-22 0:22 ` Patrick Mansfield 0 siblings, 1 reply; 7+ messages in thread
From: Lars Marowsky-Bree @ 2005-04-21 22:52 UTC (permalink / raw)
To: Patrick Mansfield; +Cc: device-mapper development, Linux SCSI, Andreas Herrmann

On 2005-04-21T15:13:16, Patrick Mansfield <patmans@us.ibm.com> wrote:

> > The most recent udm patchset has a patch by Jens Axboe and myself to
> > pass up sense data / error codes in the bio so the dm mpath module can
> > deal with it.
>
> But the scmd->result is not passed back.

Bear with me and my limited knowledge of the SCSI midlayer for a second: what additional benefit would this provide over the sense key/asc/ascq and the error parameter in the bio end_io path?

> Better to decode the error once, and then pass that data back to the
> blk layer.

Decoding is device specific. So is the handling of path initialization and other things. I'd rather have this consolidated in one module than have parts of it in the mid-layer and other parts in the multipath code.

Could this be handled by a module in the mid-layer which receives commands from the dm multipath layers above and passes appropriate flags back up? Probably. (I think this is what you're suggesting.) But frankly, I prefer the current approach, which works. I don't see a real benefit in your architecture, besides spreading things out further.

> > Only issue still is that the SCSI midlayer does only generate a single
> > "EIO" code also for timeouts; however, that pretty much means it's a
> > transport error, because if it was a media error, we'd be getting sense
> > data ;-)
>
> How does lack of sense data imply that there was no media/device error?

It does not always imply that. Note the "pretty much ... ;-)". The one thing which could be improved here is that I'm not sure whether an EIO without sense data from the SCSI mid-layer always corresponds to a timeout.
Could we get EIO for other errors as well? However, as you correctly state later, it's pretty safe to treat such errors as a "path error" and retry elsewhere, because if it was a false failure, the path checker will reinstate the path soonish.

> timeout could be a failure anywhere, in the transport or because of
> target/media/LUN problems. Or not a real error at all, just a busy device
> or too short a timeout setting.

Well, the not-real errors might benefit from the IO being retried on another path, though.

> Does path checker take paths permanently offline after multiple failures?

The path checker lives in user space, and that's policy ;-) So, from the kernel perspective, it doesn't matter. User space currently does not 'permanently' fail paths, but it could be modified to do so if a path goes up/down at too high a rate, basically dampening for stability. Patches welcome.

> So though I don't like the approach: distinguishing timeouts or ensuring
> that path checker won't continually reenable a path might be good enough,
> as long as there are no other error cases (driver or SCSI) that could lead
> to long lasting failures.

That's essentially what is being done. However, there are some more special cases (like a storage array telling us that the service processor is no longer active and that we should switch not to another path on the same SP but to the other SP, which we model in dm-mpath via different priority groups and by causing a PG switch), and some errors translate to errors being immediately propagated upwards (media error, illegal request, data protect and some others; again, this might include specific handling based on the storage being addressed), because for these, retrying on another path (or switching service processors) doesn't make any sense or might even be harmful.

> Yes, but that doesn't mean we should decode SCSI sense or scsi core error
> errors (i.e. scmd->result) in dm space.

This happens in the SCSI layer; dm-mpath only sees already 'decoded' sense key/asc/ascq.
> Also, non-scsi drivers would like to use dm multipath, like DASD. Using
> extended blk errors allows simpler support for such devices and drivers.

Sure. The bi_error field introduced by Axboe's patch has flags detailing what kind of error information is available: it's either ERRNO (basically, the current "error") or SENSE (for certain SCSI requests, where sense is available), and it could be extended to include a DASD class and then be complemented by a dm-dasd module for hw-specific handling of any other specific needs they might have.

Can you sketch/summarize your suggested design in more detail? That would be helpful for me, because I missed parts of the earlier discussion.

Sincerely,
Lars Marowsky-Brée <lmb@suse.de>

--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business
* Re: Re: fastfail operation and retries 2005-04-21 22:52 ` Lars Marowsky-Bree @ 2005-04-22 0:22 ` Patrick Mansfield 0 siblings, 0 replies; 7+ messages in thread
From: Patrick Mansfield @ 2005-04-22 0:22 UTC (permalink / raw)
To: Lars Marowsky-Bree
Cc: device-mapper development, Linux SCSI, Andreas Herrmann

On Fri, Apr 22, 2005 at 12:52:56AM +0200, Lars Marowsky-Bree wrote:

> On 2005-04-21T15:13:16, Patrick Mansfield <patmans@us.ibm.com> wrote:
>
> > > The most recent udm patchset has a patch by Jens Axboe and myself to
> > > pass up sense data / error codes in the bio so the dm mpath module can
> > > deal with it.
> >
> > But the scmd->result is not passed back.
>
> Bear with me and my limitted knowledge of the SCSI midlayer for a
> second: What additional benefit would this provide over sense
> key/asc/ascq & the error parameter in the bio end_io path?

So we can mark a path in dm as failed or not; then dm won't mark a path failed for driver, transport, or other retryable errors. As noted, this might not lead to user-visible effects, since retryable errors _should_ end up failing and then re-enabling the path, but that could lead to problems. But the code paths will be cleaner.

> > Better to decode the error once, and then pass that data back to the
> > blk layer.
>
> Decoding is device specific. So is the handling of path initialization
> and others. I'd rather have this consolidated in one module, than have
> parts of it in the mid-layer and other parts in the multipath code.

Me too, but I'm arguing to decode them in the scsi core.

> Could this be handled by a module in the mid-layer which receives
> commands from the DM multipath layers above, and pass appropriate flags
> back up? Probably. (I think this is what you're suggesting.) But
> frankly, I prefer the current approach, which works. I don't see a real
> benefit in your architecture, besides spreading things out further.

The scsi core (I don't like calling it the midlayer) could have a module or such.
The same decoding that is being put into dm hardware modules should also be in the scsi core. That is, when running such hardware without dm multipath (single-pathed, or just stupidly), we still want the decoding of the sense data, especially for retryable errors.

> The one thing which could be improved here is that I'm not sure if an
> EIO w/o sense data from the SCSI mid-layer always corresponds to a
> timeout. Could we get EIO also for other errors?

You should be getting EIO for all IO failures, timeout or not. For example, a cable pull returns DID_NO_CONNECT (for at least qlogic, and maybe for emulex); it's decoded in scsi_decide_disposition, and the scsi core calls scsi_end_request(x, 0, x, x), which then calls end_that_request_chunk(x, 0, x), and that sets the error to EIO.

> However, as you correctly state later, it's pretty safe to treat such
> errors as a "path error" and retry elsewhere, because if it was a false
> failure, the path checker will reinstate soonish.
>
> > timeout could be a failure anywhere, in the transport or because of
> > target/media/LUN problems. Or not a real error at all, just a busy device
> > or too short a timeout setting.
>
> Well, the not real errors might benefit from the IO being retried on
> another path though.

Yes.

> > Does path checker take paths permanently offline after multiple failures?
>
> The path checker lives in user-space, and that's policy ;-) So, from the
> kernel perspective, it doesn't matter. User-space currently does not
> 'permanently' fail paths, but it could be modified to do so if it goes
> up/down at a too high rate, basically dampening for stability. Patches
> welcome.
>
> > So though I don't like the approach: distinguishing timeouts or ensuring
> > that path checker won't continually reenable a path might be good enough,
> > as long as there are no other error cases (driver or SCSI) that could lead
> > to long lasting failures.
>
> That's essentially what is being done.
> However, there's some more
> special cases (like a storage array telling us that that service
> processor is no longer active and we should switch not to another path
> on the same, but to the other SP; which we model in dm-mpath via
> different priority groups and causing a PG switch), and some errors
> translate to errors being immediately propagated upwards (media error,
> illegal request, data protect and some others; again, this might include
> specific handling based on the storage being addressed), because for
> these retrying on another path (or switching service processors) doesn't
> make any sense or might be even harmful.

Yes ... I'm familiar with such hardware.

> > Yes, but that doesn't mean we should decode SCSI sense or scsi core error
> > errors (i.e. scmd->result) in dm space.
>
> This happens in the SCSI layer; dm-mpath only sees already 'decoded'
> sense key/asc/ascq.

But that data is not decoded; dm has to look at the sense value etc. Some of that must overlap with the code in the scsi core.

> > Also, non-scsi drivers would like to use dm multipath, like DASD. Using
> > extended blk errors allows simpler support for such devices and drivers.
>
> Sure. The bi_error field introduced by Axboe's patch has flags detailing
> what kind of error information is available - it's either ERRNO
> (basically, the current "error"), SENSE (for certain scsi requests,
> where sense is available), and could be extended to include a DASD
> class, and then be complemented by a dm-dasd module for hw-specific
> handling for any other specific needs they might have.
>
> Can you sketch/summarize your suggested design in more detail? That
> would be helpful for me, because I missed parts of the earlier
> discussion.

I can try forward-porting Mike C's patch ... what we need on top of the bi_error is to pass back the bi_error when calling end_that_request_first: instead of a boolean 0/1 for uptodate, pass a BIO_ERROR_xxx.
And set bio->bi_error and/or just pass it back in bio_endio(). The errors could be:

    BIO_SUCCESS = 0,
    BIO_ERROR_ERR,
    BIO_ERROR_RETRY,
    BIO_ERROR_DEV_FAILURE,
    BIO_ERROR_DEV_RETRY,
    BIO_ERROR_TRNSPT_FAILURE,
    BIO_ERROR_TRNSPT_RETRY,
    BIO_ERROR_TIMEOUT,

And maybe (for the non-failure failover case, when an SP is no longer active) a BIO_ERROR_TRNSPT_INACTIVE or so.

These somewhat match Mike C's values; he had:

+	BLK_SUCCESS,
+	BLK_ERR,		/* Generic error like -EIO */
+	BLK_FATAL_DEV,		/* Fatal driver error */
+	BLK_FATAL_TRNSPT,	/* Fatal transport error */
+	BLK_FATAL_DRV,		/* Fatal driver error */
+	BLK_RETRY_DEV,		/* Device error, I/O may be retried */
+	BLK_RETRY_TRNSPT,	/* Transport error, I/O may retried */
+	BLK_RETRY_DRV,		/* Driver error, I/O may be retried */

AFAICT, the only need for a _DRV as in Mike's patch was to handle the -EWOULDBLOCK (I can't find this in the current source though ...), so we might need only a BIO_ERROR_RETRY.

And then in dm:

    BIO_SUCCESS: complete the IO with no failure
    BIO_ERROR_RETRY: never makes it to dm
    BIO_ERROR_ERR: treat as a failed IO
    BIO_ERROR_DEV_FAILURE: failed IO
    BIO_ERROR_DEV_RETRY: retry on any path
    BIO_ERROR_TRNSPT_FAILURE: fail the path
    BIO_ERROR_TRNSPT_RETRY: retry on any path
    BIO_ERROR_TIMEOUT: hard to handle; needs to retry on another path, but mark
        this path as potentially failing. For now, it could just fail the path
        (then we are in the same situation as today).
    BIO_ERROR_TRNSPT_INACTIVE: failover ...

Non-dm users treat non-BIO_SUCCESS results as IO failures (we should not return retry errors unless fastfail is set). "Retry on any path" would normally use a different path if one is available.

We still need a scsi vendor-specific decoder; I'd volunteered to do that work before, but there were no responses the last time I brought it up (on linux-scsi).

-- Patrick Mansfield
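The proposed code-to-action mapping can be written down as a small C sketch. The enum follows the BIO_ERROR_xxx names proposed in the mail; the mpath_action names and the dispatch function are illustrative assumptions about how dm-mpath would consume them, not existing kernel code:

```c
#include <assert.h>

/* Proposed extended block-layer error codes (names from the proposal). */
enum bio_error {
    BIO_SUCCESS = 0,
    BIO_ERROR_ERR,
    BIO_ERROR_RETRY,
    BIO_ERROR_DEV_FAILURE,
    BIO_ERROR_DEV_RETRY,
    BIO_ERROR_TRNSPT_FAILURE,
    BIO_ERROR_TRNSPT_RETRY,
    BIO_ERROR_TIMEOUT,
    BIO_ERROR_TRNSPT_INACTIVE,
};

/* Hypothetical actions dm-mpath could take for each code. */
enum mpath_action {
    MP_COMPLETE,        /* complete the IO with no failure */
    MP_FAIL_IO,         /* propagate the failure upwards */
    MP_RETRY_ANY_PATH,  /* retry, preferring a different path */
    MP_FAIL_PATH,       /* mark this path failed, retry elsewhere */
    MP_FAILOVER,        /* switch priority group (SP no longer active) */
};

/* Dispatch per the mapping in the mail; BIO_ERROR_RETRY never reaches
 * dm, so it falls through to MP_FAIL_IO here only as a safety default. */
static enum mpath_action mpath_dispatch(enum bio_error err)
{
    switch (err) {
    case BIO_SUCCESS:               return MP_COMPLETE;
    case BIO_ERROR_DEV_RETRY:
    case BIO_ERROR_TRNSPT_RETRY:    return MP_RETRY_ANY_PATH;
    case BIO_ERROR_TRNSPT_FAILURE:
    case BIO_ERROR_TIMEOUT:         return MP_FAIL_PATH;  /* timeout: fail path for now */
    case BIO_ERROR_TRNSPT_INACTIVE: return MP_FAILOVER;
    default:                        return MP_FAIL_IO;    /* ERR, DEV_FAILURE, ... */
    }
}
```

Written this way, the device/transport split is explicit: device-retryable errors stay on any path, transport errors cost the path, and only the inactive-SP case triggers a priority-group switch.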
* Re: fastfail operation and retries
@ 2005-04-20 8:10 Andreas Herrmann
0 siblings, 0 replies; 7+ messages in thread
From: Andreas Herrmann @ 2005-04-20 8:10 UTC (permalink / raw)
To: 由渊霞; +Cc: Linux SCSI
由渊霞 <yxyou@yahoo.com.cn> wrote:
20.04.2005 03:17
> what multipath are you using? Software, or hardware,
> or both?
We are using udm with evms (Linux on zSeries).
Hardware setup is:
- switched fabric FC-SAN,
- 4 paths to each FC-LUN on the ESS 800
All 4 paths are "failing fast" during operations on
the ESS, and our stress-test tool encounters I/O errors.
Regards,
Andreas
^ permalink raw reply [flat|nested] 7+ messages in thread