public inbox for linux-scsi@vger.kernel.org
* Re: blk_abort_queue on failed paths?
       [not found] <448b15030906021555j4e476193kcf69e019992dc592@mail.gmail.com>
@ 2009-06-03 21:39 ` Mike Christie
  2009-06-04 17:18   ` [dm-devel] " Mike Anderson
  2009-06-04 18:09   ` Mike Christie
  0 siblings, 2 replies; 8+ messages in thread
From: Mike Christie @ 2009-06-03 21:39 UTC (permalink / raw)
  To: device-mapper development, SCSI Mailing List, Mike Anderson

adding linux-scsi and Mike Anderson

David Strand wrote:
> After updating to kernel 2.6.28 I found that when I performed some
> cable break testing during device i/o, I would get unwanted device or
> host resets. Ultimately I traced it back to this patch:
> 
> http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.29.y.git;a=commit;h=224cb3e981f1b2f9f93dbd49eaef505d17d894c2
> 
> The call to blk_abort_queue causes the block layer to call
> scsi_times_out for pending i/o, which can (or will) ultimately lead to
> device, and/or bus and/or host resets, which of course cause all the
> other devices significant disruption.
> 

What driver were you using? I just did a workaround for qla4xxx for 
this (I have not posted it yet). I added a scsi_times_out handler to the 
driver so that if the IO failed due to a transport problem then the scsi 
eh does not run.

FC drivers already use fc_timed_out, but I think that will not work. The 
FC driver could fail the IO and then call fc_remote_port_delete. So the 
failed IO could hit dm-mpath.c, and that could call into 
scsi_times_out (which for FC drivers calls into fc_timed_out), but the 
fc_remote_port_delete has not been done yet, so the port_state is still 
online and that kicks off the scsi eh.

For transport errors I do not think blk_abort_queue is needed anymore, 
at least for scsi drivers. For FC almost every driver supports the 
terminate_rport_io callback (only mptfc does not), so you can set the 
fast io fail tmo to make sure all IO is failed quickly. For iSCSI, we 
have the replacement/recovery_timeout. And for SAS, I think there is a 
timeout or the device/target/port is deleted, right?


> What was the reason for this change? I searched through my email from
> this mailing list and could not find a discussion about it.


It seems like it would only make sense to call blk_abort_queue for maybe 
some block drivers (does cciss or dasd need it?) or maybe for device 
errors. But it seems to be broken for the common multipath use cases.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [dm-devel] blk_abort_queue on failed paths?
  2009-06-03 21:39 ` blk_abort_queue on failed paths? Mike Christie
@ 2009-06-04 17:18   ` Mike Anderson
  2009-06-04 17:56     ` Mike Christie
  2009-06-04 18:09   ` Mike Christie
  1 sibling, 1 reply; 8+ messages in thread
From: Mike Anderson @ 2009-06-04 17:18 UTC (permalink / raw)
  To: Mike Christie; +Cc: device-mapper development, SCSI Mailing List

Mike Christie <michaelc@cs.wisc.edu> wrote:
> adding linux-scsi and Mike Anderson
>
> David Strand wrote:
>> After updating to kernel 2.6.28 I found that when I performed some
>> cable break testing during device i/o, I would get unwanted device or
>> host resets. Ultimately I traced it back to this patch:
>>
>> http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.29.y.git;a=commit;h=224cb3e981f1b2f9f93dbd49eaef505d17d894c2
>>
>> The call to blk_abort_queue causes the block layer to call
>> scsi_times_out for pending i/o, which can (or will) ultimately lead to
>> device, and/or bus and/or host resets, which of course cause all the
>> other devices significant disruption.
>>
>
> What driver were you using? I just did a work around for qla4xxx for  
> this (have not posted it yet). I added a scsi_times_out handler to the  
> driver so that if the IO was failed to a transport problem then the eh  
> does not run.
>
> FC drivers already use fc_timed_out, but I think that will not work. The  
> FC driver could fail the IO then call fc_remote_port_delete. So the  
> failed IO could hit dm-mpath.c and that could call into the  
> scsi_times_out (which for fc drivers call into fc_timed_out) but the  
> fc_remote_port_delete has not been done yet, so the port_state is still  
> online so that kicks off the scsi eh.
>

For HA link transport failure cases the waking of scsi_eh should not
matter. For tgt link transport failures the waking of scsi_eh is not good.
In previous test runs with added debugging I only saw a few cases of going
into the abort routines, but maybe my test configs were not complete (the
timing of the workqueues running will alter the outcome also). I will look
into this more. The originally described failure case of getting host
resets is not good though, and I would like to understand how we get this
far.

> For transport errors I do not think blk_abort_queue is needed anymore -  
> at least for scsi drivers. For FC almost every driver supports the  
> terminate_rport_io call back (just mptfc does not), so you can set the  
> fast io fail tmo to make sure all IO is failed quickly. For iscsi, we  
> have the replacement/recovery_timeout. And for SAS, I think there is a  
> timeout or the device/target/port is deleted, right?
>
>

Yes. (I believe there is an edge case that others have discussed in the
past: path checkers or other requests without the fast_fail flag set may
wait until devloss.)

>> What was the reason for this change? I searched through my email from
>> this mailing list and could not find a discussion about it.
>
>
> It seems like it would only make sense to call blk_abort_queue for maybe  
> some block drivers (does cciss or dasd need it) or maybe for device  
> errors. But it seems to be broken for the common multipath use cases.

One usage is to handle the case of slow multipath failover where devices
are still responsive on the transport but are not completing IOs. We can
see a very long delay depending on the IO timeout value vs. the queue
depth of the target.

If this failure case is perceived to be minor or to be causing side
effects, we could restrict this behavior to a multipath.conf parameter.
Another option would be to refresh your old patch for getting extended
result info, allowing the deactivate-path code to only run in certain
cases.

-andmike
--
Michael Anderson
andmike@linux.vnet.ibm.com


* Re: blk_abort_queue on failed paths?
  2009-06-04 17:18   ` [dm-devel] " Mike Anderson
@ 2009-06-04 17:56     ` Mike Christie
  2009-06-04 18:02       ` [dm-devel] " Mike Christie
  2009-06-05  8:28       ` Mike Anderson
  0 siblings, 2 replies; 8+ messages in thread
From: Mike Christie @ 2009-06-04 17:56 UTC (permalink / raw)
  To: Mike Anderson; +Cc: device-mapper development, SCSI Mailing List

Mike Anderson wrote:
> Mike Christie <michaelc@cs.wisc.edu> wrote:
>> adding linux-scsi and Mike Anderson
>>
>> David Strand wrote:
>>> After updating to kernel 2.6.28 I found that when I performed some
>>> cable break testing during device i/o, I would get unwanted device or
>>> host resets. Ultimately I traced it back to this patch:
>>>
>>> http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.29.y.git;a=commit;h=224cb3e981f1b2f9f93dbd49eaef505d17d894c2
>>>
>>> The call to blk_abort_queue causes the block layer to call
>>> scsi_times_out for pending i/o, which can (or will) ultimately lead to
>>> device, and/or bus and/or host resets, which of course cause all the
>>> other devices significant disruption.
>>>
>> What driver were you using? I just did a work around for qla4xxx for  
>> this (have not posted it yet). I added a scsi_times_out handler to the  
>> driver so that if the IO was failed to a transport problem then the eh  
>> does not run.
>>
>> FC drivers already use fc_timed_out, but I think that will not work. The  
>> FC driver could fail the IO then call fc_remote_port_delete. So the  
>> failed IO could hit dm-mpath.c and that could call into the  
>> scsi_times_out (which for fc drivers call into fc_timed_out) but the  
>> fc_remote_port_delete has not been done yet, so the port_state is still  
>> online so that kicks off the scsi eh.
>>
> 
> For HA link transport failure cases the waking of scsi_eh should not


What is a HA link transport failure?


> matter. For tgt link transport failures the waking of scsi_eh is not good.
> Previous test runs with added debug I only saw a few case of going into the
> abort routines, but maybe my test configs where not complete (timing of
> the workqueues running will alter the outcome also). I will look into this


I think going into the abort routines is still bad. If we are in the scsi 
eh then all IO on that host is stopped. So if you had two ports coming 
off that host, and just one path is bad, now we cannot send IO on the 
other path until the scsi eh is done running. This could be quick, but 
for FC drivers we also do not just send an abort right away. If we have 
transitioned the port state to blocked by this time, then drivers wait 
for the port state to transition like this:

static void
qla2x00_block_error_handler(struct scsi_cmnd *cmnd)
{
        struct Scsi_Host *shost = cmnd->device->host;
        struct fc_rport *rport = starget_to_rport(scsi_target(cmnd->device));
        unsigned long flags;

        spin_lock_irqsave(shost->host_lock, flags);
        while (rport->port_state == FC_PORTSTATE_BLOCKED) {
                spin_unlock_irqrestore(shost->host_lock, flags);
                msleep(1000);
                spin_lock_irqsave(shost->host_lock, flags);
        }
        spin_unlock_irqrestore(shost->host_lock, flags);
        return;
}

So we are stuck in the scsi eh until the dev loss tmo fires. There is 
a similar problem for some iscsi drivers.



> more. The original described failure case of getting host resets is not
> good though and would like to understand how we get this far.
> 
>> For transport errors I do not think blk_abort_queue is needed anymore -  
>> at least for scsi drivers. For FC almost every driver supports the  
>> terminate_rport_io call back (just mptfc does not), so you can set the  
>> fast io fail tmo to make sure all IO is failed quickly. For iscsi, we  
>> have the replacement/recovery_timeout. And for SAS, I think there is a  
>> timeout or the device/target/port is deleted, right?
>>
>>
> 
> Yes. (I believe there is an end case that others have discussed in the past
> that path checkers or other requests without the fast_fail flag set may
> wait until devloss).

That is not really there any more. Set the fast io fail tmo and IO is 
failed before dev loss.

The exceptions are mptfc (it does not have a terminate rport io 
callback) and the scsi eh case like the one above, where the scsi eh 
starts up and then the port is deleted (so we miss the fc_timed_out 
check) and then drivers block until the port state transitions.

> 
>>> What was the reason for this change? I searched through my email from
>>> this mailing list and could not find a discussion about it.
>>
>> It seems like it would only make sense to call blk_abort_queue for maybe  
>> some block drivers (does cciss or dasd need it) or maybe for device  
>> errors. But it seems to be broken for the common multipath use cases.
> 
> One usage is to handle the case of slow multipath failover where devices
> are still responsive on the transport, but are not completing IOs. We can
> see a very long delay depending on IO timeout value vs. queue depth of the
> target.

I did not get that part. What component is bad? If you change paths, 
don't you just send IO to the same device? Is this that dasd setup? Or 
does device above mean the target controller, or can you access a 
different logical unit through different ports on some multipath setups 
(some sort of clustering magic)?

And also for this problem, what type of failure is it? Are drivers 
returning a DID_* error for this? Or is it some scsi error?


* Re: [dm-devel] blk_abort_queue on failed paths?
  2009-06-04 17:56     ` Mike Christie
@ 2009-06-04 18:02       ` Mike Christie
  2009-06-05  8:28       ` Mike Anderson
  1 sibling, 0 replies; 8+ messages in thread
From: Mike Christie @ 2009-06-04 18:02 UTC (permalink / raw)
  To: device-mapper development; +Cc: Mike Anderson, SCSI Mailing List

Mike Christie wrote:
> Mike Anderson wrote:
>> Mike Christie <michaelc@cs.wisc.edu> wrote:
>>> adding linux-scsi and Mike Anderson
>>>
>>> David Strand wrote:
>>>> After updating to kernel 2.6.28 I found that when I performed some
>>>> cable break testing during device i/o, I would get unwanted device or
>>>> host resets. Ultimately I traced it back to this patch:
>>>>
>>>> http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.29.y.git;a=commit;h=224cb3e981f1b2f9f93dbd49eaef505d17d894c2 
>>>>
>>>>
>>>> The call to blk_abort_queue causes the block layer to call
>>>> scsi_times_out for pending i/o, which can (or will) ultimately lead to
>>>> device, and/or bus and/or host resets, which of course cause all the
>>>> other devices significant disruption.
>>>>
>>> What driver were you using? I just did a work around for qla4xxx for  
>>> this (have not posted it yet). I added a scsi_times_out handler to 
>>> the  driver so that if the IO was failed to a transport problem then 
>>> the eh  does not run.
>>>
>>> FC drivers already use fc_timed_out, but I think that will not work. 
>>> The  FC driver could fail the IO then call fc_remote_port_delete. So 
>>> the  failed IO could hit dm-mpath.c and that could call into the  
>>> scsi_times_out (which for fc drivers call into fc_timed_out) but the  
>>> fc_remote_port_delete has not been done yet, so the port_state is 
>>> still  online so that kicks off the scsi eh.
>>>
>>
>> For HA link transport failure cases the waking of scsi_eh should not
> 
> 
> What is a HA link transport failure?
> 
> 
>> matter. For tgt link transport failures the waking of scsi_eh is not 
>> good.
>> Previous test runs with added debug I only saw a few case of going 
>> into the
>> abort routines, but maybe my test configs where not complete (timing of
>> the workqueues running will alter the outcome also). I will look into 
>> this
> 
> 
> I think going into the abort routines is still bad. If are in the scsi 
> eh then all IO on that host is stopped. So if you had two ports coming 
> on that host, and if just one path is bad, now we cannot send IO on the 
> other path until the scsi eh is done running. This could be quick, but 
> for FC drivers we also do not just send an abort right away. If we have 
> transitioned the port state to blocked by this time, then drivers wait 
> for the port state to transition like this:
> 
> static void
> qla2x00_block_error_handler(struct scsi_cmnd *cmnd)
> {
>         struct Scsi_Host *shost = cmnd->device->host;
>         struct fc_rport *rport = 
> starget_to_rport(scsi_target(cmnd->device));
>         unsigned long flags;
> 
>         spin_lock_irqsave(shost->host_lock, flags);
>         while (rport->port_state == FC_PORTSTATE_BLOCKED) {
>                 spin_unlock_irqrestore(shost->host_lock, flags);
>                 msleep(1000);
>                 spin_lock_irqsave(shost->host_lock, flags);
>         }
>         spin_unlock_irqrestore(shost->host_lock, flags);
>         return;
> }
> 

Oh yeah, about this code: is it right? Maybe we only want to wait for 
min(time until the port state transition (dev loss tmo or port 
re-addition), fast io fail tmo firing)?

It would still be a wait, but at least a shorter one.


* Re: blk_abort_queue on failed paths?
  2009-06-03 21:39 ` blk_abort_queue on failed paths? Mike Christie
  2009-06-04 17:18   ` [dm-devel] " Mike Anderson
@ 2009-06-04 18:09   ` Mike Christie
  2009-06-04 20:35     ` [dm-devel] " David Strand
  2009-06-05  7:56     ` Mike Anderson
  1 sibling, 2 replies; 8+ messages in thread
From: Mike Christie @ 2009-06-04 18:09 UTC (permalink / raw)
  To: device-mapper development, SCSI Mailing List, Mike Anderson

Mike Christie wrote:
> adding linux-scsi and Mike Anderson
> 
> David Strand wrote:
>> After updating to kernel 2.6.28 I found that when I performed some
>> cable break testing during device i/o, I would get unwanted device or
>> host resets. Ultimately I traced it back to this patch:
>>
>> http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.29.y.git;a=commit;h=224cb3e981f1b2f9f93dbd49eaef505d17d894c2 
>>
>>
>> The call to blk_abort_queue causes the block layer to call
>> scsi_times_out for pending i/o, which can (or will) ultimately lead to
>> device, and/or bus and/or host resets, which of course cause all the
>> other devices significant disruption.
>>
> 
> What driver were you using? 

Oh yeah, I do not think this should happen in new kernels if the driver 
is failing the IO with DID_TRANSPORT_DISRUPTED when it is deleting the 
rport. That should cause the IO to be requeued and wait for fast io fail 
to fire.

Maybe we just need to convert some more drivers?


* Re: [dm-devel] blk_abort_queue on failed paths?
  2009-06-04 18:09   ` Mike Christie
@ 2009-06-04 20:35     ` David Strand
  2009-06-05  7:56     ` Mike Anderson
  1 sibling, 0 replies; 8+ messages in thread
From: David Strand @ 2009-06-04 20:35 UTC (permalink / raw)
  To: Mike Christie; +Cc: device-mapper development, SCSI Mailing List, Mike Anderson

>> What driver were you using?
>
> Oh yeah, I do not think this should happen in new kernels if the driver is
> failing the IO with DID_TRANSPORT_DISRUPTED when it is deleting the rport.
> That should cause the IO to requeue and wait for fast io fail to fire.
>
> Maybe we just need to convert some more drivers?

I am using the 2.6.29 mptfc driver, which I believe relies on
scsi_transport_fc.c, which does not appear to support
DID_TRANSPORT_DISRUPTED.


* Re: [dm-devel] blk_abort_queue on failed paths?
  2009-06-04 18:09   ` Mike Christie
  2009-06-04 20:35     ` [dm-devel] " David Strand
@ 2009-06-05  7:56     ` Mike Anderson
  1 sibling, 0 replies; 8+ messages in thread
From: Mike Anderson @ 2009-06-05  7:56 UTC (permalink / raw)
  To: device-mapper development; +Cc: SCSI Mailing List

Mike Christie <michaelc@cs.wisc.edu> wrote:
> Mike Christie wrote:
>> adding linux-scsi and Mike Anderson
>>
>> David Strand wrote:
>>> After updating to kernel 2.6.28 I found that when I performed some
>>> cable break testing during device i/o, I would get unwanted device or
>>> host resets. Ultimately I traced it back to this patch:
>>>
>>> http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.29.y.git;a=commit;h=224cb3e981f1b2f9f93dbd49eaef505d17d894c2
>>>
>>> The call to blk_abort_queue causes the block layer to call
>>> scsi_times_out for pending i/o, which can (or will) ultimately lead to
>>> device, and/or bus and/or host resets, which of course cause all the
>>> other devices significant disruption.
>>>
>>
>> What driver were you using? 
>
> Oh yeah, I do not think this should happen in new kernels if the driver  
> is failing the IO with DID_TRANSPORT_DISRUPTED when it is deleting the  
> rport. That should cause the IO to requeue and wait for fast io fail to  
> fire.
>
> Maybe we just need to convert some more drivers?

Yes, I am seeing this in my test runs using a DS4K storage device and the
RDAC device handler.
"Jun  5 00:39:58 elm3c244 kernel: [  873.180267] sd 1:0:0:1: [sdd] Result:
hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK"

-andmike
--
Michael Anderson
andmike@linux.vnet.ibm.com


* Re: blk_abort_queue on failed paths?
  2009-06-04 17:56     ` Mike Christie
  2009-06-04 18:02       ` [dm-devel] " Mike Christie
@ 2009-06-05  8:28       ` Mike Anderson
  1 sibling, 0 replies; 8+ messages in thread
From: Mike Anderson @ 2009-06-05  8:28 UTC (permalink / raw)
  To: Mike Christie; +Cc: device-mapper development, SCSI Mailing List

Mike Christie <michaelc@cs.wisc.edu> wrote:
> Mike Anderson wrote:
>> Mike Christie <michaelc@cs.wisc.edu> wrote:
>>> adding linux-scsi and Mike Anderson
>>>
>>> David Strand wrote:
>>>> After updating to kernel 2.6.28 I found that when I performed some
>>>> cable break testing during device i/o, I would get unwanted device or
>>>> host resets. Ultimately I traced it back to this patch:
>>>>
>>>> http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.29.y.git;a=commit;h=224cb3e981f1b2f9f93dbd49eaef505d17d894c2
>>>>
>>>> The call to blk_abort_queue causes the block layer to call
>>>> scsi_times_out for pending i/o, which can (or will) ultimately lead to
>>>> device, and/or bus and/or host resets, which of course cause all the
>>>> other devices significant disruption.
>>>>
>>> What driver were you using? I just did a work around for qla4xxx for  
>>> this (have not posted it yet). I added a scsi_times_out handler to 
>>> the  driver so that if the IO was failed to a transport problem then 
>>> the eh  does not run.
>>>
>>> FC drivers already use fc_timed_out, but I think that will not work. 
>>> The  FC driver could fail the IO then call fc_remote_port_delete. So 
>>> the  failed IO could hit dm-mpath.c and that could call into the   
>>> scsi_times_out (which for fc drivers call into fc_timed_out) but the  
>>> fc_remote_port_delete has not been done yet, so the port_state is 
>>> still  online so that kicks off the scsi eh.
>>>
>>
>> For HA link transport failure cases the waking of scsi_eh should not
>
>
> What is a HA link transport failure?
>
>

I was just trying to differentiate between failures of the host bus
adapter to switch link vs. the switch to target link. In the host bus
adapter to switch link failure case the waking of scsi_eh has less
impact, but that does not add much to this discussion, as we were
talking about the impact on the other targets on the host.

>> matter. For tgt link transport failures the waking of scsi_eh is not good.
>> Previous test runs with added debug I only saw a few case of going into the
>> abort routines, but maybe my test configs where not complete (timing of
>> the workqueues running will alter the outcome also). I will look into this
>
>
> I think going into the abort routines is still bad. If are in the scsi  
> eh then all IO on that host is stopped. So if you had two ports coming  
> on that host, and if just one path is bad, now we cannot send IO on the  
> other path until the scsi eh is done running. This could be quick, but  
> for FC drivers we also do not just send an abort right away. If we have  
> transitioned the port state to blocked by this time, then drivers wait  
> for the port state to transition like this:
>
> static void
> qla2x00_block_error_handler(struct scsi_cmnd *cmnd)
> {
>         struct Scsi_Host *shost = cmnd->device->host;
>         struct fc_rport *rport =  
> starget_to_rport(scsi_target(cmnd->device));
>         unsigned long flags;
>
>         spin_lock_irqsave(shost->host_lock, flags);
>         while (rport->port_state == FC_PORTSTATE_BLOCKED) {
>                 spin_unlock_irqrestore(shost->host_lock, flags);
>                 msleep(1000);
>                 spin_lock_irqsave(shost->host_lock, flags);
>         }
>         spin_unlock_irqrestore(shost->host_lock, flags);
>         return;
> }
>
> So we are stuck in the scsi eh until the dev loss timeo fires. There is  
> a similar problem for some iscsi drivers.
>
>
>
>> more. The original described failure case of getting host resets is not
>> good though and would like to understand how we get this far.
>>
>>> For transport errors I do not think blk_abort_queue is needed anymore 
>>> -  at least for scsi drivers. For FC almost every driver supports the 
>>>  terminate_rport_io call back (just mptfc does not), so you can set 
>>> the  fast io fail tmo to make sure all IO is failed quickly. For 
>>> iscsi, we  have the replacement/recovery_timeout. And for SAS, I 
>>> think there is a  timeout or the device/target/port is deleted, 
>>> right?
>>>
>>>
>>
>> Yes. (I believe there is an end case that others have discussed in the past
>> that path checkers or other requests without the fast_fail flag set may
>> wait until devloss).
>
> That is not really there any more. Set the fast io fail tmo and IO is  
> failed before dev loss.
>
> The exceptions are for mptfc (does not have a terminate rport io  
> callback) and for the scsi eh case like above where the scsi eh starts  
> up then the port is deleted (so we miss the fc_timed_out check) and then  
> drivers block until the port state transistions.
>
>>
>>>> What was the reason for this change? I searched through my email from
>>>> this mailing list and could not find a discussion about it.
>>>
>>> It seems like it would only make sense to call blk_abort_queue for 
>>> maybe  some block drivers (does cciss or dasd need it) or maybe for 
>>> device  errors. But it seems to be broken for the common multipath 
>>> use cases.
>>
>> One usage is to handle the case of slow multipath failover where devices
>> are still responsive on the transport, but are not completing IOs. We can
>> see a very long delay depending on IO timeout value vs. queue depth of the
>> target.
>
> I did not get that part. What component is bad? If you change paths,  
> don't you just send IO to the same device? Is this that dasd setup? Or  
> does device above mean the target controller or can you access a  
> different logical unit through different ports on some multipath setups  
> (some sort of clustering magic?)?
>

The bad component would be the target controller, or one unit in a peer
to peer configuration.

One example is peer to peer connected storage systems when the primary
fails over. Another example would be a dual controller configuration
where one controller is timing out IO but the backend storage is ok.

> And also for this problem, what type of failure is it? Are drivers  
> returning a DID_* error for this? Or is it some scsi error?

SCSI timeouts. We fail back the timed-out IO, but the IOs in flight on
the failed path will take (target queue depth * IO timeout) time for each
batch to be timed out and retried on the good path.

-andmike
--
Michael Anderson
andmike@linux.vnet.ibm.com


end of thread, other threads:[~2009-06-05  8:28 UTC | newest]

Thread overview: 8+ messages
     [not found] <448b15030906021555j4e476193kcf69e019992dc592@mail.gmail.com>
2009-06-03 21:39 ` blk_abort_queue on failed paths? Mike Christie
2009-06-04 17:18   ` [dm-devel] " Mike Anderson
2009-06-04 17:56     ` Mike Christie
2009-06-04 18:02       ` [dm-devel] " Mike Christie
2009-06-05  8:28       ` Mike Anderson
2009-06-04 18:09   ` Mike Christie
2009-06-04 20:35     ` [dm-devel] " David Strand
2009-06-05  7:56     ` Mike Anderson
