* path priority group and path state
@ 2005-02-10 16:48 goggin, edward
  2005-02-12 10:23 ` Christophe Varoqui
  0 siblings, 1 reply; 10+ messages in thread
From: goggin, edward @ 2005-02-10 16:48 UTC (permalink / raw)
  To: 'dm-devel@redhat.com'; +Cc: 'christophe varoqui'

The multipath utility is relying on having at least one block
read/write I/O be serviced through a multipath mapped
device in order to show one of the path priority groups in
an active state.  While I can see the semantic correctness
in this claim since the priority group is not yet initialized,
is this what is intended?  Why show both the single priority
group of an active-active storage system using a multibus
path grouping policy and the non-active priority group of an
active-passive storage system using a priority path grouping
policy both as "enabled" when the actual readiness of each
differs quite significantly?

Also, multipath will not set a path to a failed state until the
first block read/write I/O to that path fails.  This approach
can be misleading while monitoring path health via
"multipath -l".  Why not have multipath(8) fail paths known to
fail path testing?  Waiting instead for block I/O requests to
fail lessens the responsiveness of the product to path failures.
Also, the failed paths of enabled, but non-active path priority
groups will not have their path state updated for possibly a
very long time -- and this seems very misleading.



--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


* Re: path priority group and path state
  2005-02-10 16:48 goggin, edward
@ 2005-02-12 10:23 ` Christophe Varoqui
  0 siblings, 0 replies; 10+ messages in thread
From: Christophe Varoqui @ 2005-02-12 10:23 UTC (permalink / raw)
  To: device-mapper development

goggin, edward wrote:

>The multipath utility is relying on having at least one block
>read/write I/O be serviced through a multipath mapped
>device in order to show one of the path priority groups in
>an active state.  While I can see the semantic correctness
>in this claim since the priority group is not yet initialized,
>is this what is intended? 
>
In fact, the multipath tool shares the same checker with the daemon.

By design, the tool does not rely on the path status the daemon could
provide: given the check interval, we can't assume the daemon's path
states accurately reflect current reality. Since the tool is in charge
of loading new maps, I fear it could load erroneous ones if it relied
on outdated info.

Maybe I'm paranoid, but I'm still convinced it's a safe bet to do so.

> Why show both the single priority
>group of an active-active storage system using a multibus
>path grouping policy and the non-active priority group of an
>active-passive storage system using a priority path grouping
>policy both as "enabled" when the actual readiness of each
>differs quite significantly?
>  
>
We don't have many choices there. The device mapper declares 3 PG
states: active, enabled, disabled.
How would you map these states onto the 2 scenarios you mention?

>Also, multipath will not set a path to a failed state until the
>first block read/write I/O to that path fails.  This approach
>can be misleading while monitoring path health via
>"multipath -l".  Why not have multipath(8) fail paths known to
>fail path testing?  Waiting instead for block I/O requests to
>fail lessens the responsiveness of the product to path failures.
>Also, the failed paths of enabled, but non-active path priority
>groups will not have their path state updated for possibly a
>very long time -- and this seems very misleading.
>
>  
>
Maybe I'm overlooking something, but to my knowledge "multipath -l" gets
the path status from devinfo.c, which in turn calls pp->checkfn(),
i.e. the same checker the daemon uses.


* Re: path priority group and path state
@ 2005-02-14 15:58 goggin, edward
  2005-02-14 22:28 ` Christophe Varoqui
  0 siblings, 1 reply; 10+ messages in thread
From: goggin, edward @ 2005-02-14 15:58 UTC (permalink / raw)
  To: 'dm-devel@redhat.com'

On Sat, 12 Feb 2005 11:23:50 +0100 Christophe Varoqui wrote:
> 
> >The multipath utility is relying on having at least one block
> >read/write I/O be serviced through a multipath mapped
> >device in order to show one of the path priority groups in
> >an active state.  While I can see the semantic correctness
> >in this claim since the priority group is not yet initialized,
> >is this what is intended? 
> >
> In fact, the multipath tool shares the same checker with the daemon.
> 
> It is intended the tool doesn't rely on the path status the daemon
> could provide, because, the check interval being what it is, we can't
> assume the daemon path status are an accurate representation of the
> current reality. The tool being in charge to load new maps, I fear it
> could load erroneous ones if relying on outdated info.
> 
> Maybe I'm paranoid, but I'm still convinced it's a safe bet to do so.

I see your approach -- wanting to avoid failing paths which previously
failed a path test but are now in an active state.  Would the inaccuracy
be due to delays in the invocation of multipath from multipathd in the
event of a failed path test?  Wouldn't multipath repeat the path test as
part of discovering the SAN?  Won't there always be a non-zero delay
between detecting a path failure (whether from a failed I/O in the
kernel or a failed path test in user space) and actually updating the
multipath kernel state to reflect that failure?  At some point during
that window the path could become usable again (it was physically
restored), yet it won't be used once its path status is updated to
failed.

I see the real cost of not failing paths from path testing results but
instead waiting for actual failed I/Os as a lack of responsiveness to
path failures.

> 
> > Why show both the single priority
> >group of an active-active storage system using a multibus
> >path grouping policy and the non-active priority group of an
> >active-passive storage system using a priority path grouping
> >policy both as "enabled" when the actual readiness of each
> >differs quite significantly?
> >  
> >
> We don't have so many choices there. The device mapper declares 3 PG 
> states : active, enabled, disabled.
> How would you map these states upon the 2 scenarii you mention ?
> 

As much as is reasonably possible, I would like to always know which
path priority group will be used by the next I/O -- even when none of
the priority groups have been initialized and therefore all of them
have an "enabled" path priority group state.  Looks like "first" will
tell me that, but it is not updated on "multipath -l".

> >Also, multipath will not set a path to a failed state until the
> >first block read/write I/O to that path fails.  This approach
> >can be misleading while monitoring path health via
> >"multipath -l".  Why not have multipath(8) fail paths known to
> >fail path testing?  Waiting instead for block I/O requests to
> >fail lessens the responsiveness of the product to path failures.
> >Also, the failed paths of enabled, but non-active path priority
> >groups will not have their path state updated for possibly a
> >very long time -- and this seems very misleading.
> >
> >  
> >
> Maybe I'm overseeing something, but to my knowledge "multipath -l"
> gets the paths status from devinfo.c, which in turn switches to
> pp->checkfn() ... ie the same checker the daemon uses.

I'm just wondering if multipathd could invoke multipath to fail paths
from user space in addition to reinstating them?  It seems that both
multipathd/main.c:checkerloop() and multipath/main.c:reinstate_paths()
will only initiate a kernel path state transition from PSTATE_FAILED to
PSTATE_ACTIVE, not the other way around.  The transition from
PSTATE_ACTIVE to PSTATE_FAILED requires a failed I/O, since it is
initiated only by the kernel code itself in the event of an I/O
failure on a multipath target device.
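
To make the asymmetry concrete, here is a minimal sketch of who
initiates which transition, as I read the code.  Only the PSTATE_ACTIVE
and PSTATE_FAILED names come from the sources; the numeric values, the
"initiator" enum and who_initiates() are hypothetical and exist only
for illustration.

```c
/* Path states as named above; the numeric values are arbitrary here. */
enum pstate { PSTATE_FAILED, PSTATE_ACTIVE };

/* Hypothetical labels for the component that starts a transition. */
enum initiator {
	INIT_NONE,               /* no state change requested */
	INIT_KERNEL_IO_ERROR,    /* kernel, after a failed I/O */
	INIT_USERSPACE_CHECKER   /* checkerloop() / reinstate_paths() */
};

/* Who initiates the from -> to transition today, per the observation above. */
enum initiator who_initiates(enum pstate from, enum pstate to)
{
	if (from == PSTATE_ACTIVE && to == PSTATE_FAILED)
		return INIT_KERNEL_IO_ERROR;
	if (from == PSTATE_FAILED && to == PSTATE_ACTIVE)
		return INIT_USERSPACE_CHECKER;
	return INIT_NONE;
}
```

The question is whether user space could also be allowed to start the
first of these two transitions.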

One could expand this approach to proactively fail (and immediately
schedule for testing) all paths associated with common bus components
(for SCSI, initiator and/or target).  The goal being not only to avoid
failing I/O for all but all-paths-down use cases, but to also avoid
long time-out driven delays and high path testing overhead for large
SANs in the process of doing so.


* Re: path priority group and path state
  2005-02-14 15:58 goggin, edward
@ 2005-02-14 22:28 ` Christophe Varoqui
  0 siblings, 0 replies; 10+ messages in thread
From: Christophe Varoqui @ 2005-02-14 22:28 UTC (permalink / raw)
  To: device-mapper development

goggin, edward wrote:

>On Sat, 12 Feb 2005 11:23:50 +0100 Christophe Varoqui wrote:
>  
>
>>>The multipath utility is relying on having at least one block
>>>read/write I/O be serviced through a multipath mapped
>>>device in order to show one of the path priority groups in
>>>an active state.  While I can see the semantic correctness
>>>in this claim since the priority group is not yet initialized,
>>>is this what is intended? 
>>>
>>>      
>>>
>>In fact, the multipath tool shares the same checker with the daemon.
>>
>>It is intended the tool doesn't rely on the path status the daemon
>>could provide, because, the check interval being what it is, we can't
>>assume the daemon path status are an accurate representation of the
>>current reality. The tool being in charge to load new maps, I fear it
>>could load erroneous ones if relying on outdated info.
>>
>>Maybe I'm paranoid, but I'm still convinced it's a safe bet to do so.
>>    
>>
>
>I see your approach -- wanting to avoid failing paths which previously
>failed a path test but are now in an active state.  Would the inaccuracy
>be due to delays in the invocation of multipath from multipathd in the
>event of a failed path test?  
>
0) multipath can be triggered upon checker events, yes, but also
1) on map events (paths failed by DM upon I/O)
2) by hotplug/udev, upon device add/remove
3) by the administrator

In any of these scenarios the daemon can hold invalid path state info:

0) Say paths A, B, C form the multipath M. The daemon checker loop kicks
in, A is checked and shows a transition. multipath is executed for M
before B and C are re-checked. If their actual status has changed too,
and multipath asks the daemon about path states, the daemon will answer
with the previous, obsolete states, so the tool will build a wrong map.

1) A path failed by the DM will show up as such in the map status
string, but it doesn't trigger an immediate checker loop run. So the
tool kicks in while the daemon holds obsolete path states: the tool
can't reasonably ask the daemon about path states there.

2) A device just added has no checker, so there is no way the tool can
ask for its state there. The tool execution finishes with a signal sent
to the daemon, which rediscovers paths and instantiates a new checker
for the new path.

3) Last but not least, if the admin takes the pains to run the tool
himself, he certainly doesn't trust the daemon :)
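
Scenario 0 can be modelled in a few lines. This is only a sketch of the
race, not daemon code: struct path_model, daemon_check() and
stale_paths() are hypothetical names, and "up/down" stands in for
whatever the real checker returns.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one path: daemon's last recorded state vs. reality. */
struct path_model {
	bool cached_up;  /* state stored by the daemon at its last check */
	bool actual_up;  /* what a fresh checker run would report now */
};

/*
 * The daemon checker visits one path and refreshes its cached state.
 * Returns true if a transition was seen (which would trigger multipath).
 */
bool daemon_check(struct path_model *p)
{
	bool changed = (p->cached_up != p->actual_up);

	p->cached_up = p->actual_up;
	return changed;
}

/* Count the paths whose cached state no longer matches reality. */
size_t stale_paths(const struct path_model *paths, size_t n)
{
	size_t stale = 0;

	for (size_t i = 0; i < n; i++)
		if (paths[i].cached_up != paths[i].actual_up)
			stale++;
	return stale;
}
```

If A and B both went down and only A has been re-checked when the
transition on A fires multipath, stale_paths() is still non-zero: a
tool asking the daemon at that moment would build M from B's obsolete
state.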

There can be room for improvement in responsiveness anyway. Caching the
uids, for example, but there you lose uid change detection, lacking
an event-driven detection for that.
If you have suggestions, please post them.

>Wouldn't multipath repeat the path test as
>part of discovering the SAN?  Wont there always be a non-zero time delay
>between detecting a path failure (whether that be from a failed I/O in
>the kernel or a failed path test in user space) and actually updating the
>multipath kernel state to reflect that failure where sometime during that
>time period the path could actually be used again (it was physically
>restored) but it wont be after its path status is updated to failed?
>
>I see the real cost of not failing paths from path testing results but
>instead waiting for actual failed I/Os as a lack of responsiveness to
>path failures.
>
>  
>
We can't fail paths in the secondary path group of an
asymmetric-controller-driven multipathed LU, because I need those paths
ready to take I/O in case of a PG switch. That is, unless you want to
impose the need for a hardware handler module on every asymmetric
controller out there.

>>>Why show both the single priority
>>>group of an active-active storage system using a multibus
>>>path grouping policy and the non-active priority group of an
>>>active-passive storage system using a priority path grouping
>>>policy both as "enabled" when the actual readiness of each
>>>differs quite significantly?
>>> 
>>>
>>>      
>>>
>>We don't have so many choices there. The device mapper declares 3 PG 
>>states : active, enabled, disabled.
>>How would you map these states upon the 2 scenarii you mention ?
>>
>>    
>>
>
>As much as is reasonably possible, I would like to always know which
>path priority group will be used by the next I/O -- even when none of
>the priority groups have been initialized and therefore all of them
>have an "enabled" path priority group state.  Looks like "first" will
>tell me that, but it is not updated on "multipath -l".
>
>  
>
Not updated? Can you elaborate?
To me, this info is fetched from the map table string upon each exec ...

>>>Also, multipath will not set a path to a failed state until the
>>>first block read/write I/O to that path fails.  This approach
>>>can be misleading while monitoring path health via
>>>"multipath -l".  Why not have multipath(8) fail paths known to
>>>fail path testing?  Waiting instead for block I/O requests to
>>>fail lessens the responsiveness of the product to path failures.
>>>Also, the failed paths of enabled, but non-active path priority
>>>groups will not have their path state updated for possibly a
>>>very long time -- and this seems very misleading.
>>>
>>> 
>>>
>>>      
>>>
>>Maybe I'm overseeing something, but to my knowledge "multipath -l"
>>gets the paths status from devinfo.c, which in turn switches to
>>pp->checkfn() ... ie the same checker the daemon uses.
>>    
>>
>
>I'm just wondering if multipathd could invoke multipath to fail paths
>from user space in addition to reinstating them?  Seems like both
>multipathd/main.c:checkerloop() and multipath/main.c:/reinstate_paths()
>will only initiate a kernel path state transition from PSTATE_FAILED to
>PSTATE_ACTIVE but not the other way around.  The state transition from
>PSTATE_ACTIVE to PSTATE_FAILED requires a failed I/O since this state
>is initiated only from the kernel code itself in the event of an I/O
>failure on a multipath target device.
>
>One could expand this approach to proactively fail (and immediately
>schedule for testing) all paths associated with common bus components
>(for SCSI, initiator and/or target).  The goal being not only to avoid
>failing I/O for all but all-paths-down use cases, but to also avoid
>long time-out driven delays and high path testing overhead for large
>SANs in the process of doing so.
>
>  
>
It is commonly accepted that those timeouts are set to 0 in a
multipathed SAN.
Have you experienced real problems here, or is it just a theory?

I particularly fear the "proactively fail all paths associated with a
component" idea, as this may lead to dramatic errors like: "I'm so
smart, I failed all paths for this multipath, now the FS is remounted
read-only, but wait, in fact you can use this path as it is up after all"

Regards,
cvaroqui


* RE: path priority group and path state
@ 2005-02-15 17:59 Caushik, Ramesh
  2005-02-15 21:35 ` Christophe Varoqui
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Caushik, Ramesh @ 2005-02-15 17:59 UTC (permalink / raw)
  To: device-mapper development

Given that some of the problems I am noticing in my testing relate to a
mismatch between the path state recorded by the driver and by the
daemon, I thought I would chime in with my questions / observations.

My setup consists of a dual-port qla2312 controller connected to a JBOD
through an FC switch, thus creating 2 paths, A and B, to the drive. I
have all the paths in one PG using the round-robin selector with "queue
if no path" set. I run a bonnie++ transfer to the mounted drive, and
then pull out the path A connection. When the transfer switches to path
B I reinsert A, then after a little while pull out B, and repeat this a
few times. Sometimes the transfer just hangs and the log messages
indicate the driver is queueing the I/O (both paths are marked faulty).

This is what seems to happen. When the cable on path A is pulled out,
the controller receives a "LOOP DOWN" on that port and ALSO a "LIP
RESET" on path B. This causes I/O on both paths to return a SCSI error,
so both paths are set faulty (some of the in-flight I/O on path B fails
as a result of the LIP RESET). However, when the daemon checker loop
wakes up and tests the path (via checkfn), path B returns OK, and since
the daemon will reconfigure the paths only if newstate != oldstate, it
does not reconfigure the path. As a result, we end up in a situation
where the driver marks path B as faulty due to an I/O error on the path
and waits for the daemon to reconfigure it, while the daemon does not
reconfigure path B because the checkfn does not detect a state change.

First of all, please tell me if this analysis is correct. If it is, then
my suggestion is for the daemon checker loop to reinstate the path any
time there is a mismatch between the path state in the driver and that
returned by the checkfn, and not just based on the newstate != oldstate
check. I am in the process of coding this up to see if it will fix the
problem. Meanwhile I would much appreciate any comments or suggestions
on this. Thanks,

Ramesh.
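
P.S. A minimal sketch of the two decision rules, to make the difference
explicit. This is not the actual checker loop: the enum and both
function names are hypothetical, and driver_state stands for whatever
path state the daemon would read back from the driver.

```c
#include <stdbool.h>

/* Hypothetical two-valued path state used only for this sketch. */
enum path_state { PATH_DOWN, PATH_UP };

/* Current rule as described above: act only when the checker result changes. */
bool should_reinstate_on_change(enum path_state oldstate,
				enum path_state newstate)
{
	return newstate != oldstate && newstate == PATH_UP;
}

/*
 * Suggested rule: also reinstate when the checker finds the path healthy
 * but the driver still has it marked as failed.
 */
bool should_reinstate_on_mismatch(enum path_state oldstate,
				  enum path_state newstate,
				  enum path_state driver_state)
{
	if (newstate != PATH_UP)
		return false;
	return newstate != oldstate || driver_state == PATH_DOWN;
}
```

For path B in my test, oldstate and newstate are both PATH_UP while the
driver holds PATH_DOWN, so the first rule does nothing and the second
one reinstates the path.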

        

* Re: path priority group and path state
  2005-02-15 17:59 path priority group and path state Caushik, Ramesh
@ 2005-02-15 21:35 ` Christophe Varoqui
  2005-02-17 20:25 ` Christophe Varoqui
  2005-02-20 22:45 ` Christophe Varoqui
  2 siblings, 0 replies; 10+ messages in thread
From: Christophe Varoqui @ 2005-02-15 21:35 UTC (permalink / raw)
  To: device-mapper development

Caushik, Ramesh wrote:

>Given that some of the problems I am noticing in my testing relates to
>mismatch between the path state recorded by the driver and the daemon, I
>thought I will chime in with my questions / observations.
>   
>My setup consists of a dual port qla2312 controller connected to a JBOD
>through a FC switch thus creating 2 paths A & B to the drive. I have all
>the paths in one PG using round-robin selector and "queue if no path"
>set. I run a bonnie++ transfer to the mounted drive, and then pull out
>the path A connection. When the transfer switches to path B I reinsert A
>and then after a little while pull out B and repeat this a few times.
>Sometimes the transfer just hangs and the log messages indicate the
>driver is queueing the i/o (both paths are marked faulty). This is what
>seems to happen. When the cable on path  A is pulled out the controller
>receives a "LOOP DOWN" on that port and ALSO a "LIP RESET" on path B.
>This causes i/o on both paths to return SCSI error and so both paths are
>set faulty (some of the in-flight i/o on path B fails as a result of the
>LIP RESET). However when the daemon checker loop wakes up and tests the
>path (via checkfn) path B returns OK, and since the daemon will
>reconfigure the paths only if newstate != oldstate it does not
>reconfigure the path. As a result, we end up with a situation where the
>driver marks path B as faulty due to i/o error in the path, and waits
>for the daemon to reconfigure the path, while the daemon does not
>reconfigure path B because the checkfn does not detect a state change.
>First of all please tell me if this analyses is correct. If it is then
>my suggestion is for the daemon checker loop to reinstate the path
>anytime the there is a mismatch between the path state in the driver and
>that returned by the checkfn, and not just based on the newstate !=
>oldstate check. I am in the process of coding this up to see if it will
>fix the problem. Meanwhile I would much appreciate any comments or
>suggestions on this. Thanks,
>
>Ramesh.
>
Agreed : this is a real hole in the design.
Suggested solution seems sane.

Thanks,
cvaroqui


* Re: path priority group and path state
  2005-02-15 17:59 path priority group and path state Caushik, Ramesh
  2005-02-15 21:35 ` Christophe Varoqui
@ 2005-02-17 20:25 ` Christophe Varoqui
  2005-02-20 22:45 ` Christophe Varoqui
  2 siblings, 0 replies; 10+ messages in thread
From: Christophe Varoqui @ 2005-02-17 20:25 UTC (permalink / raw)
  To: device-mapper development

Caushik, Ramesh wrote:

>Given that some of the problems I am noticing in my testing relates to
>mismatch between the path state recorded by the driver and the daemon, I
>thought I will chime in with my questions / observations.
>   
>My setup consists of a dual port qla2312 controller connected to a JBOD
>through a FC switch thus creating 2 paths A & B to the drive. I have all
>the paths in one PG using round-robin selector and "queue if no path"
>set. I run a bonnie++ transfer to the mounted drive, and then pull out
>the path A connection. When the transfer switches to path B I reinsert A
>and then after a little while pull out B and repeat this a few times.
>Sometimes the transfer just hangs and the log messages indicate the
>driver is queueing the i/o (both paths are marked faulty). This is what
>seems to happen. When the cable on path  A is pulled out the controller
>receives a "LOOP DOWN" on that port and ALSO a "LIP RESET" on path B.
>This causes i/o on both paths to return SCSI error and so both paths are
>set faulty (some of the in-flight i/o on path B fails as a result of the
>LIP RESET). However when the daemon checker loop wakes up and tests the
>path (via checkfn) path B returns OK, and since the daemon will
>reconfigure the paths only if newstate != oldstate it does not
>reconfigure the path. As a result, we end up with a situation where the
>driver marks path B as faulty due to i/o error in the path, and waits
>for the daemon to reconfigure the path, while the daemon does not
>reconfigure path B because the checkfn does not detect a state change.
>First of all please tell me if this analyses is correct. If it is then
>my suggestion is for the daemon checker loop to reinstate the path
>anytime the there is a mismatch between the path state in the driver and
>that returned by the checkfn, and not just based on the newstate !=
>oldstate check. I am in the process of coding this up to see if it will
>fix the problem. Meanwhile I would much appreciate any comments or
>suggestions on this. Thanks,
>
>Ramesh.
>
> 
>
Actually, I'd rather see the DM move to the netlink generic event model
and have the daemon catch path failure events.
Those messages would be more explicit and contain the path major/minor
info, which would enable the daemon to set the path's state to failed
upon arrival. That should close the design hole more elegantly.
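
Roughly, the daemon side could look like the sketch below. No such
event exists in the DM today, so everything here is an assumption:
struct fail_event, struct tracked_path and handle_fail_event() are
hypothetical, and the netlink receive code is left out entirely.

```c
#include <stddef.h>

/* Hypothetical payload of a "path failed" event sent by the DM. */
struct fail_event {
	unsigned int dev_major;
	unsigned int dev_minor;
};

/* Hypothetical per-path record kept by the daemon. */
struct tracked_path {
	unsigned int dev_major;
	unsigned int dev_minor;
	int failed;  /* 1 once the daemon knows the DM failed this path */
};

/*
 * Mark the path named by the event as failed in the daemon's own table.
 * Returns 1 if a matching path was found, 0 otherwise.
 */
int handle_fail_event(struct tracked_path *paths, size_t n,
		      const struct fail_event *ev)
{
	for (size_t i = 0; i < n; i++) {
		if (paths[i].dev_major == ev->dev_major &&
		    paths[i].dev_minor == ev->dev_minor) {
			paths[i].failed = 1;
			return 1;
		}
	}
	return 0;
}
```

With the daemon's state set to failed on arrival, the next checker run
that finds the path up would see a real newstate != oldstate
transition and reinstate it.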

Alasdair, al, care to comment about this DM evolution ?

Regards,
cvaroqui


* Re: path priority group and path state
  2005-02-15 17:59 path priority group and path state Caushik, Ramesh
  2005-02-15 21:35 ` Christophe Varoqui
  2005-02-17 20:25 ` Christophe Varoqui
@ 2005-02-20 22:45 ` Christophe Varoqui
  2 siblings, 0 replies; 10+ messages in thread
From: Christophe Varoqui @ 2005-02-20 22:45 UTC (permalink / raw)
  To: ramesh.caushik; +Cc: device-mapper development

Please test 
http://christophe.varoqui.free.fr/multipath-tools/multipath-tools-0.4.3-pre3.tar.bz2
It should close the design hole you noted here.

regards,
cvaroqui

Caushik, Ramesh wrote:

>Given that some of the problems I am noticing in my testing relate to
>mismatch between the path state recorded by the driver and the daemon, I
>thought I would chime in with my questions / observations.
>   
>My setup consists of a dual port qla2312 controller connected to a JBOD
>through a FC switch thus creating 2 paths A & B to the drive. I have all
>the paths in one PG using round-robin selector and "queue if no path"
>set. I run a bonnie++ transfer to the mounted drive, and then pull out
>the path A connection. When the transfer switches to path B I reinsert A
>and then after a little while pull out B and repeat this a few times.
>Sometimes the transfer just hangs and the log messages indicate the
>driver is queueing the i/o (both paths are marked faulty). This is what
>seems to happen. When the cable on path  A is pulled out the controller
>receives a "LOOP DOWN" on that port and ALSO a "LIP RESET" on path B.
>This causes i/o on both paths to return SCSI error and so both paths are
>set faulty (some of the in-flight i/o on path B fails as a result of the
>LIP RESET). However when the daemon checker loop wakes up and tests the
>path (via checkfn) path B returns OK, and since the daemon will
>reconfigure the paths only if newstate != oldstate it does not
>reconfigure the path. As a result, we end up with a situation where the
>driver marks path B as faulty due to i/o error in the path, and waits
>for the daemon to reconfigure the path, while the daemon does not
>reconfigure path B because the checkfn does not detect a state change.
>First of all please tell me if this analysis is correct. If it is then
>my suggestion is for the daemon checker loop to reinstate the path
>any time there is a mismatch between the path state in the driver and
>that returned by the checkfn, and not just based on the newstate !=
>oldstate check. I am in the process of coding this up to see if it will
>fix the problem. Meanwhile I would much appreciate any comments or
>suggestions on this. Thanks,
>  
>
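
The difference between the two rules discussed in the quoted report can
be sketched as follows. This is a minimal Python model, not the
multipathd checker loop; the function names and the "up"/"down" state
strings are assumptions made for this example.

```python
def daemon_acts_current(oldstate, newstate):
    """Rule described in the report: act only when the checker result changes."""
    return newstate != oldstate


def daemon_acts_proposed(kernel_state, newstate):
    """Proposed rule: act whenever the driver's view disagrees with the checker."""
    return newstate != kernel_state


# Path B in the report: the checker saw it up before and sees it up again,
# but the driver marked it down after the in-flight I/O errors.
oldstate, newstate, kernel_state = "up", "up", "down"
print(daemon_acts_current(oldstate, newstate))       # no state change seen
print(daemon_acts_proposed(kernel_state, newstate))  # mismatch detected
```

Under the first rule path B is never reinstated, which matches the hang
with queued I/O described above; under the second the mismatch is
enough to trigger a reinstate.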


* Re: path priority group and path state
@ 2005-02-22 22:59 goggin, edward
  0 siblings, 0 replies; 10+ messages in thread
From: goggin, edward @ 2005-02-22 22:59 UTC (permalink / raw)
  To: 'dm-devel@redhat.com'

On Mon, 14 Feb 2005 23:28:55 +0100, Christophe Varoqui wrote

...

> >As much as is reasonably possible, I would like to always know which
> >path priority group will be used by the next I/O -- even when none of
> >the priority groups have been initialized and therefore all of them
> >have an "enabled" path priority group state.  Looks like "first" will
> >tell me that, but it is not updated on "multipath -l".
>
>  
>
>Not updated ? Can you elaborate ?
>To me, this info is fetched in the map table string upon each exec ...

I'm sorry for the confusion.  I wasn't very clear at all.

When the active path group of an active/passive array is changed
external to multipath (either a SAN utility or multipath running
on a different node in a multi-node cluster), sometimes the wrong
path group is shown to be "active" and sometimes both multipath
priority groups are shown as "enabled".  I think the former
condition occurs between the time the active path group is changed
externally and the time of the first block I/O to the multipath
mapped device.  I think the latter condition occurs between the
time multipath changes the active group back to the highest priority
path group and the first block I/O to the multipath mapped device.

I think that the former condition could be addressed by validating
with the storage array that the highest priority path group is
actually the current active path group whenever path health is
checked.  Yet, if they are different, it is not always clear that
the right thing to do is to trespass back to the highest priority
group. 

I think the latter condition could be addressed by always initializing
a multipath path group immediately whenever the path group is either
initially set up or changed instead of waiting for the first block i/o. 


> There can be room for improvement in responsiveness anyway.
> Caching the uids for example, but there you lose on uid change
> detection, lacking an event-driven detection for that.
> If you have suggestions, please post them.
> 

My suggestion is based on (1) periodic testing of physical paths not
logical ones, (2) immediately placing all logical paths associated
with failed physical components into a bypassed state whether the
failure was detected by i/o failure or path test, (3) prioritizing
the testing of bypassed paths, and (4) failing logical paths which
fail path tests.  For SCSI, a physical path would be defined as a
unique combination of initiator and target.  These components would
need to be identified and associated with all of the logical paths
(LU specific) which utilize the component.

A similar approach can also be taken to improve the responsiveness to
the restoration of physical path components.
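
A minimal Python sketch of points (1) to (3), assuming each logical
path is described by a dict with hypothetical "name", "initiator" and
"target" keys and that path states are plain strings. None of these
names come from multipath-tools; they are only for illustration.

```python
from collections import defaultdict


def physical_key(path):
    """A physical path is the unique (initiator, target) combination."""
    return (path["initiator"], path["target"])


def group_by_physical_path(logical_paths):
    """Associate each physical path with the logical (LU specific) paths using it."""
    groups = defaultdict(list)
    for path in logical_paths:
        groups[physical_key(path)].append(path["name"])
    return dict(groups)


def bypass_physical_path(logical_paths, failed_name, states):
    """Bypass every active logical path sharing the failed path's physical path."""
    by_name = {p["name"]: p for p in logical_paths}
    key = physical_key(by_name[failed_name])
    updated = dict(states)
    for path in logical_paths:
        if physical_key(path) == key and updated[path["name"]] == "active":
            updated[path["name"]] = "bypassed"
    return updated


def checker_order(states):
    """Order paths for testing so that bypassed paths are tested first."""
    return sorted(states, key=lambda name: states[name] != "bypassed")
```

One failure, whether seen through an i/o error or a path test, then
takes all logical paths on the same initiator/target pair out of
rotation at once, and those paths go to the front of the test queue.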

> >I'm just wondering if multipathd could invoke multipath to fail paths
> >from user space in addition to reinstating them?  Seems like both
> >multipathd/main.c:checkerloop() and multipath/main.c:/reinstate_paths()
> >will only initiate a kernel path state transition from PSTATE_FAILED to
> >PSTATE_ACTIVE but not the other way around.  The state transition from
> >PSTATE_ACTIVE to PSTATE_FAILED requires a failed I/O since this state
> >is initiated only from the kernel code itself in the event of an I/O
> >failure on a multipath target device.
> >
> >One could expand this approach to proactively fail (and immediately
> >schedule for testing) all paths associated with common bus components
> >(for SCSI, initiator and/or target).  The goal being not only to avoid
> >failing I/O for all but all-paths-down use cases, but to also avoid
> >long time-out driven delays and high path testing overhead for large
> >SANs in the process of doing so.
> >
> >  
> >
> It is commonly accepted that those timeouts are set to 0 in a multipathed SAN.
> Have you experienced real problems here, or is it just a theory ?
>
 
Qlogic FC HBA people have long warned EMC PowerPath multipath developers
NOT to mess around with the timeout values of their SCSI commands.  As
a result we have been burdened with 30-60 second SCSI command timeout
values for SCSI commands sent to multipathed devices in a SAN.
My understanding was that they needed this time in order to deal
convincingly with target-side FC disconnect failures.  Things may
certainly have changed to allow getting rid of this huge timeout,
though this is far from certain to me.

> I particularly fear the "proactively fail all paths associated to a
> component", as this may lead to dramatic errors like : "I'm so smart, I
> failed all paths for this multipath, now the FS is remounted read-only
> but wait, in fact you can use this path as it is up after all"
>

One can deal with that concern by not actually failing these paths, but
putting them into a "bypassed" state whereby (1) they are skipped (similar
to how bypassed path groups are skipped) in all cases unless there are no
other path choices (instead of failing the i/o) and (2) each of these paths
is immediately scheduled for testing ahead of all others.  But since this
testing is done in user space, we would still need the capability to fail
these paths from user space once their failed state is verified through
path testing.  I am still not understanding your logic behind not wanting
to do that.
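
To make the "bypassed" proposal concrete, here is a minimal Python
sketch of the selection and test-result handling described above. The
state strings and function names are assumptions made for this example
and do not correspond to kernel or multipath-tools code.

```python
def select_path(states):
    """Pick a path for the next I/O.

    Active paths are preferred; bypassed paths are used only when no
    active path remains; failed paths are never used.
    """
    active = [name for name, state in states.items() if state == "active"]
    if active:
        return active[0]
    bypassed = [name for name, state in states.items() if state == "bypassed"]
    if bypassed:
        return bypassed[0]
    return None  # all paths down


def apply_test_result(states, name, test_passed):
    """Resolve a bypassed path from user space once its path test completes."""
    updated = dict(states)
    if updated[name] == "bypassed":
        updated[name] = "active" if test_passed else "failed"
    return updated
```

A wrongly bypassed path stays usable as a last resort and returns to
"active" after one successful test, while a path whose test fails is
failed from user space without waiting for an I/O error.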


* Re: path priority group and path state
       [not found] <20050221170006.F1A087364A@hormel.redhat.com>
@ 2005-03-03 18:16 ` Lan
  0 siblings, 0 replies; 10+ messages in thread
From: Lan @ 2005-03-03 18:16 UTC (permalink / raw)
  To: Christophe Varoqui; +Cc: dm-devel

While testing failover/failback with multipath-tools-0.4.2 and IBM ESS
2105800 and 2105E20 storage, I was also seeing problems with failback
not working because recovered paths were not getting reclaimed
correctly. I have tried using multipath-tools-0.4.3-pre3.tar.bz2, and
now failback is working!   :)

I was using a dual-port QLA2342 HBA connected to ESS 2105800 and ESS
2105E20 storage through a FC switch, so 4 paths per LUN. Use dd to run
I/O on all 4 paths to a LUN. Disable a port on the switch, wait for
I/O to failover to remaining 2 paths (which works fine!), reenable the
port, and immediately paths are reclaimed and I/O resumes on all 4
paths. It's great! Thanks!

Configlet used: (similar for the 2105800)
devices {
        device {
                vendor "IBM "
                product "2105E20 "
                path_grouping_policy group_by_serial
                features        "1 queue_if_no_path"
                getuid_callout "/sbin/scsi_id -g -s /block/%n"
                path_checker tur
        }
}


lan

> Date: Sun, 20 Feb 2005 23:45:11 +0100
> From: Christophe Varoqui <christophe.varoqui@free.fr>
> Subject: Re: [dm-devel] path priority group and path state
> To: ramesh.caushik@intel.com
> Cc: device-mapper development <dm-devel@redhat.com>
> Message-ID: <421912F7.5000305@free.fr>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
> 
> Please test
> http://christophe.varoqui.free.fr/multipath-tools/multipath-tools-0.4.3-pre3.tar.bz2
> It should close the design hole you noted here.
> 
> regards,
> cvaroqui
> 
> Caushik, Ramesh wrote:
> 
> >Given that some of the problems I am noticing in my testing relate to
> >mismatch between the path state recorded by the driver and the daemon, I
> >thought I would chime in with my questions / observations.
> >
> >My setup consists of a dual port qla2312 controller connected to a JBOD
> >through a FC switch thus creating 2 paths A & B to the drive. I have all
> >the paths in one PG using round-robin selector and "queue if no path"
> >set. I run a bonnie++ transfer to the mounted drive, and then pull out
> >the path A connection. When the transfer switches to path B I reinsert A
> >and then after a little while pull out B and repeat this a few times.
> >Sometimes the transfer just hangs and the log messages indicate the
> >driver is queueing the i/o (both paths are marked faulty). This is what
> >seems to happen. When the cable on path  A is pulled out the controller
> >receives a "LOOP DOWN" on that port and ALSO a "LIP RESET" on path B.
> >This causes i/o on both paths to return SCSI error and so both paths are
> >set faulty (some of the in-flight i/o on path B fails as a result of the
> >LIP RESET). However when the daemon checker loop wakes up and tests the
> >path (via checkfn) path B returns OK, and since the daemon will
> >reconfigure the paths only if newstate != oldstate it does not
> >reconfigure the path. As a result, we end up with a situation where the
> >driver marks path B as faulty due to i/o error in the path, and waits
> >for the daemon to reconfigure the path, while the daemon does not
> >reconfigure path B because the checkfn does not detect a state change.
> >First of all please tell me if this analysis is correct. If it is then
> >my suggestion is for the daemon checker loop to reinstate the path
> >any time there is a mismatch between the path state in the driver and
> >that returned by the checkfn, and not just based on the newstate !=
> >oldstate check. I am in the process of coding this up to see if it will
> >fix the problem. Meanwhile I would much appreciate any comments or
> >suggestions on this. Thanks,
> >
> >


Thread overview: 10+ messages
2005-02-15 17:59 path priority group and path state Caushik, Ramesh
2005-02-15 21:35 ` Christophe Varoqui
2005-02-17 20:25 ` Christophe Varoqui
2005-02-20 22:45 ` Christophe Varoqui
     [not found] <20050221170006.F1A087364A@hormel.redhat.com>
2005-03-03 18:16 ` Lan
  -- strict thread matches above, loose matches on Subject: below --
2005-02-22 22:59 goggin, edward
2005-02-14 15:58 goggin, edward
2005-02-14 22:28 ` Christophe Varoqui
2005-02-10 16:48 goggin, edward
2005-02-12 10:23 ` Christophe Varoqui
