* LSF: Multipathing and path checking question
From: Mike Christie @ 2009-04-16 22:59 UTC
  To: device-mapper development, SCSI Mailing List

Hey,

For this topic:

-----------------------
Next-Gen Multipathing
---------------------
Dr. Hannes Reinecke

......

Should path checkers use sd->state to check for errors or availability?
----------------------

What was decided?

Could this problem be fixed or helped if multipath tools always set the
fast io fail tmo for FC or the replacement_timeout for iscsi?

If those are set then IO in the blocked queue and in the driver will get
failed after fast io fail tmo/replacement_timeout seconds (the driver has
to implement a terminate rport IO callback, and currently only mptfc does
not). So at that point, do we want to fail the path?
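
For FC, setting it is just a sysfs write against the rport. A minimal
sketch (the rport name below is a placeholder; this assumes the standard
fc_remote_ports sysfs layout):

/* Sketch: set the fast io fail tmo for one FC rport via sysfs.
 * The rport name is a placeholder. */
#include <stdio.h>

int main(void)
{
        const char *attr = "/sys/class/fc_remote_ports/"
                           "rport-0:0-1/fast_io_fail_tmo";
        FILE *f = fopen(attr, "w");

        if (!f) {
                perror(attr);
                return 1;
        }
        fprintf(f, "5\n");      /* fail blocked IO after 5 seconds */
        return fclose(f) ? 1 : 0;
}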

Or are people thinking that we want to fail the path when the problem is
initially detected, for example when the LLD deletes the rport for FC?



Also for this one:
-----------------------
How to communicate that a device went away:
1) send event to udev (uses netlink)
-----------------------

Is this an event for when dev_loss_tmo fires, or for when the LLD first
detects something like a link down (or any event it might block the rport
for), or for when the fast io fail tmo fires (when the fc class is going
to fail running IO and incoming IO)? Or would we have events for all of
them?


* Re: [dm-devel] LSF: Multipathing and path checking question
From: Hannes Reinecke @ 2009-04-17  7:50 UTC
  To: device-mapper development; +Cc: SCSI Mailing List

Hi Mike,

Mike Christie wrote:
> Hey,
> 
> For this topic:
> 
> -----------------------
> Next-Gen Multipathing
> ---------------------
> Dr. Hannes Reinecke
> 
> ......
> 
> Should path checkers use sd->state to check for errors or availability?
> ----------------------
> 
> What was decided?
> 
> Could this problem be fixed or helped if multipath tools always set the
> fast io fail tmo for FC or the replacement_timeout for iscsi?
> 
No, I already do this for FC (should be checking the replacement_timeout, too ...)

> If those are set then IO in the blocked queue and in the driver will get
> failed after fast io fail tmo/replacement_timeout seconds (the driver has
> to implement a terminate rport IO callback, and currently only mptfc does
> not). So at that point, do we want to fail the path?
> 
> Or are people thinking that we want to fail the path when the problem is
> initially detected, for example when the LLD deletes the rport for FC?
> 
Well, the idea is the following:

The primary purpose of the path checkers is to check the availability of
the paths (my, that was easy :-).

And the main problem we have with the path checkers is that they are using
actual SCSI commands to determine this, thereby picking up unrelated errors
(disk errors, delayed responses due to blocked-path behaviour or error
handling, etc). So we have to invest quite a bit of logic to separate the 'true' path
condition from unrelated errors, simply because we're checking at the wrong
level; the path state is maintained by the transport layer, not by the
SCSI layer.

So the suggestion here is to check the transport layer for the path states
and do away with the existing path_checker SG_IO mechanism.

The secondary use of the path checkers (determining inactive paths) will have
to be delegated to the priority callouts, which then have to arrange the
paths correctly.

FC Transport already maintains an attribute for the path state, and even
sends netlink events if and when this attribute changes. For iSCSI I have
to defer to your superior knowledge; of course it would be easiest if
iSCSI could send out the very same message FC does.
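
(As a rough sketch of the receiving side, something like the below; this
assumes the NETLINK_SCSITRANSPORT protocol and struct fc_nl_event from
the kernel's scsi_netlink headers, which may need to be copied from the
kernel tree. It binds to all multicast groups and filters on the message
type instead, since the group numbering is easy to get wrong:)

/* Sketch: listen for the FC transport events sent by
 * fc_host_post_event().  Header availability and constants are
 * assumptions; check against your kernel tree. */
#include <stdio.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <scsi/scsi_netlink.h>
#include <scsi/scsi_netlink_fc.h>

int main(void)
{
        struct sockaddr_nl sa = {
                .nl_family = AF_NETLINK,
                .nl_groups = 0xffffffff,  /* all groups; filter below */
        };
        char buf[4096];
        int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_SCSITRANSPORT);

        if (fd < 0 || bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
                perror("netlink");
                return 1;
        }
        for (;;) {
                ssize_t len = recv(fd, buf, sizeof(buf), 0);
                struct nlmsghdr *nlh = (struct nlmsghdr *)buf;

                while (len > 0 && NLMSG_OK(nlh, len)) {
                        if (nlh->nlmsg_type == SCSI_TRANSPORT_MSG) {
                                struct fc_nl_event *ev = NLMSG_DATA(nlh);

                                /* event_code: FCH_EVT_LINKDOWN etc. */
                                printf("host%u: event 0x%x data 0x%x\n",
                                       ev->host_no, ev->event_code,
                                       ev->event_data);
                        }
                        nlh = NLMSG_NEXT(nlh, len);
                }
        }
        return 0;
}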

> 
> 
> Also for this one:
> -----------------------
> How to communicate that a device went away:
> 1) send event to udev (uses netlink)
> -----------------------
> 
> Is this an event for when dev_loss_tmo fires, or for when the LLD first
> detects something like a link down (or any event it might block the rport
> for), or for when the fast io fail tmo fires (when the fc class is going
> to fail running IO and incoming IO)? Or would we have events for all of
> them?
> 
Currently the event is sent when the device itself is removed from sysfs.
And only then can we actually update the path maps and (possibly) change
to another path. We cannot do anything when the path is blocked (ie when
dev_loss_tmo is active) as we require this interval to capture jitter on
the line.

So we have this state diagram:

sdev state:   RUNNING  <-> BLOCKED -> CANCEL
mpath state:  path up  <-> <stall> -> path down / remove from map

Notice the '<stall>' here; we cannot check the path state when the
sdev is blocked as all I/O will be queued. And also note that we
now lump two different multipath path states together; a path down
is basically always followed immediately by a path remove event.

However, when all paths are down (and queue_if_no_path is active) we might
run into a deadlock when a path comes back, as we might not have enough
memory to actually create the required structures.

The idea was to modify the state machine so that fast_fail_io_tmo is
made mandatory; it transitions the sdev into an intermediate state
'DISABLED' and sends out a netlink message.

sdev state:   RUNNING <-> BLOCKED <-> DISABLED -> CANCEL
mpath state:  path up <-> <stall> <-> path down -> remove from map

This will allow us to switch paths early, ie when the sdev moves into
the 'DISABLED' state. But the path structures themselves are still alive,
so when a path comes back between 'DISABLED' and 'CANCEL' we won't
have an issue reconnecting it. And we could even allow setting
dev_loss_tmo to infinity, thereby simulating the 'old' behaviour.
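
In code terms the proposal would look roughly like this (illustrative
only; the state names are hypothetical and the real enum
scsi_device_state differs):

/* Sketch of the proposed transitions; names are made up. */
enum sdev_path_state {
        PATH_RUNNING,   /* mpath: path up */
        PATH_BLOCKED,   /* mpath: <stall>, I/O queued (jitter window) */
        PATH_DISABLED,  /* new: fast_fail_io fired -> path down */
        PATH_CANCEL,    /* dev_loss_tmo fired -> remove from map */
};

/* RUNNING <-> BLOCKED <-> DISABLED -> CANCEL */
static int transition_ok(enum sdev_path_state from, enum sdev_path_state to)
{
        switch (from) {
        case PATH_RUNNING:
                return to == PATH_BLOCKED;
        case PATH_BLOCKED:
                return to == PATH_RUNNING || to == PATH_DISABLED;
        case PATH_DISABLED:
                return to == PATH_BLOCKED || to == PATH_CANCEL;
        default:
                return 0;       /* CANCEL is terminal */
        }
}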

However, this proposal didn't go through.

Instead it was proposed to do away with the unlimited queue_if_no_path
setting and _always_ have a timeout there, so that the machine is able
to recover after a certain period of time.

I still like my original proposal, though.

Maybe we can do the EU referendum thing and just ask again and again
until everyone becomes tired of it and just says 'yes' to get rid
of this issue ...

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)


* Re: [dm-devel] LSF: Multipathing and path checking question
From: Mike Christie @ 2009-04-17 14:55 UTC
  To: Hannes Reinecke; +Cc: device-mapper development, SCSI Mailing List

Hannes Reinecke wrote:
> 
> FC Transport already maintains an attribute for the path state, and even
> sends netlink events if and when this attribute changes. For iSCSI I have

Are you referring to fc_host_post_event? Is it the same thing we talked
about last year, where you wanted events? Is this in multipath tools now
or just in the SLES ones?

For something like FCH_EVT_LINKDOWN, are you going to fail the path at 
that time or when would the multipath path be marked failed?



> to defer to your superior knowledge; of course it would be easiest if
> iSCSI could send out the very same message FC does.

We can do something like fc_host_event_code for iscsi.

Question on what you are needing:

Do you mean you want to make fc_host_event_code more generic (there are 
some FC specific ones like lip_reset)? Put them in scsi-ml and send from 
a new netlink group that just sends these events?

Or do you just want something similar from iscsi? iscsi would hook into
the netlink code using scsi_netlink.c and then send ISCSIH_EVT_LINKUP,
ISCSIH_EVT_LINKDOWN, etc.

What do the FCH_EVT_PORT_* ones mean?



> 
> The idea was to modify the state machine so that fast_fail_io_tmo is
> made mandatory; it transitions the sdev into an intermediate state
> 'DISABLED' and sends out a netlink message.


Above when you said, "No, I already do this for FC (should be checking 
the replacement_timeout, too ...)", did you mean that you have multipath 
tools always setting fast io fail now?

For iscsi the replacement_timeout is always set already. If you are
going to add some code so multipath tools sets this, I can make iscsi
allow the replacement_timeout to be set from sysfs like is done for
FC's fast io fail.



> 
> sdev state:   RUNNING <-> BLOCKED <-> DISABLED -> CANCEL
> mpath state:  path up <-> <stall> <-> path down -> remove from map
> 
> This will allow us to switch paths early, ie when the sdev moves into
> the 'DISABLED' state. But the path structures themselves are still alive,
> so when a path comes back between 'DISABLED' and 'CANCEL' we won't
> have an issue reconnecting it. And we could even allow setting
> dev_loss_tmo to infinity, thereby simulating the 'old' behaviour.
> 
> However, this proposal didn't go through.

You got my hopes up for a solution in the long explanation, then you 
destroyed them :)


Was the reason people did not like this because of the scsi device 
lifetime issue?


I think we still want someone to set the fast io fail tmo for users when
multipath is being used, because we want IO out of the queues and
drivers and sent to the multipath layer before dev_loss_tmo if
dev_loss_tmo is still going to be a lot longer. fast io fail tmo is
usually less than 5 or 10 seconds, while it seems users still set
dev_loss_tmo to minutes.


Can't the transport layers just send two events?
1. On the initial link down when the port/session is blocked.
2. When the fast io fail tmo fires.

Today, instead of #2, the Red Hat multipath tools guy and I were talking 
about doing a probe with SG_IO. For example we would send down a path 
tester IO and then wait for it to be failed with DID_TRANSPORT_FAILFAST.
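
Something like this minimal sketch (the device node is a placeholder;
DID_TRANSPORT_FAILFAST is 0x0f in the kernel's scsi.h):

/* Sketch: path-tester probe via SG_IO.  Sends a TEST UNIT READY and
 * checks whether the transport failed it with DID_TRANSPORT_FAILFAST. */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>

#define DID_TRANSPORT_FAILFAST 0x0f     /* from the kernel's scsi.h */

int main(void)
{
        unsigned char cdb[6] = { 0 };   /* TEST UNIT READY */
        unsigned char sense[32];
        struct sg_io_hdr hdr;
        int fd = open("/dev/sdc", O_RDONLY | O_NONBLOCK);

        if (fd < 0) {
                perror("open");
                return 1;
        }
        memset(&hdr, 0, sizeof(hdr));
        hdr.interface_id = 'S';
        hdr.cmdp = cdb;
        hdr.cmd_len = sizeof(cdb);
        hdr.sbp = sense;
        hdr.mx_sb_len = sizeof(sense);
        hdr.dxfer_direction = SG_DXFER_NONE;
        hdr.timeout = 30000;            /* milliseconds */

        if (ioctl(fd, SG_IO, &hdr) < 0) {
                perror("SG_IO");
                return 1;
        }
        if (hdr.host_status == DID_TRANSPORT_FAILFAST)
                printf("path failed: transport gave up\n");
        else
                printf("host_status 0x%x\n", hdr.host_status);
        return 0;
}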

Or for #2 if we cannot have a new event, can we send a transport level 
bsg request? For iscsi this would be a nop. For FC, I am not sure what 
it would be?


* Re: LSF: Multipathing and path checking question
From: Mike Christie @ 2009-04-17 15:21 UTC
  To: Hannes Reinecke; +Cc: device-mapper development, SCSI Mailing List

Oops, I mashed two topics together. See below.

Mike Christie wrote:
> Hannes Reinecke wrote:
>>
>> FC Transport already maintains an attribute for the path state, and even
>> sends netlink events if and when this attribute changes. For iSCSI I have
> 
> Are you referring to fc_host_post_event? Is it the same thing we talked 
> about last year, where you wanted events? Is this in multipath tools now 
> or just in the SLES ones?
> 
> For something like FCH_EVT_LINKDOWN, are you going to fail the path at 
> that time or when would the multipath path be marked failed?
> 

I was asking this because it seems we have people constantly filing
bugzillas saying they did not want the path to be marked failed for
short-lived problems.

There was the problem where we might get DID_ERROR for a temporarily
dropped frame. That would be fixed by just listening to transport events
like you explained.

But then I thought there was the case where if we get a linkdown then 
linkup within a couple seconds, we would not want to transition the 
multipath path state.

So below while you were talking about when to remove the device, I was 
talking about when to mark the path failed.



> 
> You got my hopes up for a solution in the long explanation, then you 
> destroyed them :)
> 
> 
> Was the reason people did not like this because of the scsi device 
> lifetime issue?
> 
> 
> I think we still want someone to set the fast io fail tmo for users when
> multipath is being used, because we want IO out of the queues and
> drivers and sent to the multipath layer before dev_loss_tmo if
> dev_loss_tmo is still going to be a lot longer. fast io fail tmo is
> usually less than 5 or 10 seconds, while it seems users still set
> dev_loss_tmo to minutes.
> 
> 
> Can't the transport layers just send two events?
> 1. On the initial link down when the port/session is blocked.
> 2. When the fast io fail tmo fires.


So for #2, I just want a way to figure out when the transport is giving 
up on executing IO and is going to fail everything. At that time, I was 
thinking we want to mark the path failed.

I guess if multipath tools is going to set fast io fail, it could also 
use that as its down timer to decide when to fail the path and not have 
to send SG IO or a bsg transport command.


> 
> Today, instead of #2, the Red Hat multipath tools guy and I were talking 
> about doing a probe with SG_IO. For example we would send down a path 
> tester IO and then wait for it to be failed with DID_TRANSPORT_FAILFAST.
> 
> Or for #2 if we cannot have a new event, can we send a transport level 
> bsg request? For iscsi this would be a nop. For FC, I am not sure what 
> it would be?


* Re: [dm-devel] LSF: Multipathing and path checking question
From: Hannes Reinecke @ 2009-04-20  7:59 UTC
  To: Mike Christie; +Cc: device-mapper development, SCSI Mailing List

Hi Mike,

Mike Christie wrote:
> Hannes Reinecke wrote:
>>
>> FC Transport already maintains an attribute for the path state, and even
>> sends netlink events if and when this attribute changes. For iSCSI I have
> 
> Are you referring to fc_host_post_event? Is it the same thing we talked
> about last year, where you wanted events? Is this in multipath tools now
> or just in the SLES ones?
> 
Yep, that's the thing.

> For something like FCH_EVT_LINKDOWN, are you going to fail the path at
> that time or when would the multipath path be marked failed?
> 
This is just a notification that the path has gone down. Fast fail / dev_loss_tmo
still applies, ie the path won't get switched at that point.

> 
> 
>> to defer to your superior knowledge; of course it would be easiest if
>> iSCSI could send out the very same message FC does.
> 
> We can do something like fc_host_event_code for iscsi.
> 
Oh, that'll be grand.

> Question on what you are needing:
> 
> Do you mean you want to make fc_host_event_code more generic (there are
> some FC specific ones like lip_reset)? Put them in scsi-ml and send from
> a new netlink group that just sends these events?
> 
> Or do you just want something similar from iscsi? iscsi would hook into
> the netlink code using scsi_netlink.c and then send ISCSIH_EVT_LINKUP,
> ISCSIH_EVT_LINKDOWN, etc.
> 
Well, actually, I don't care. It's just that if we go with the
proposal we'll have to fix up all transports to present the path state
to userspace, preferably with both netlink events and sysfs attributes.

The actual implementation might well be transport-specific.

> What do the FCH_EVT_PORT_* ones mean?
> 
FC stuff methinks. James S. should know better.

> 
> 
>>
>> The idea was to modify the state machine so that fast_fail_io_tmo is
>> made mandatory; it transitions the sdev into an intermediate state
>> 'DISABLED' and sends out a netlink message.
> 
> 
> Above when you said, "No, I already do this for FC (should be checking
> the replacement_timeout, too ...)", did you mean that you have multipath
> tools always setting fast io fail now?
> 
Yes, quite so. Look at
git://git.kernel.org/pub/scm/linux/kernel/git/hare/multipath-tools
branch sles11
for details.

> For iscsi the replacement_timeout is always set already. If you are
> going to add some code so multipath tools sets this, I can make iscsi
> allow the replacement_timeout to be set from sysfs like is done for
> FC's fast io fail.
> 
Oh, that would be awesome. Currently I think we have a mismatch / race
condition between iSCSI and multipathing, where ERL in iSCSI actually
counteracts multipathing. But I'll be investigating that one shortly.

> 
> 
>>
>> sdev state:   RUNNING <-> BLOCKED <-> DISABLED -> CANCEL
>> mpath state:  path up <-> <stall> <-> path down -> remove from map
>>
>> This will allow us to switch paths early, ie when the sdev moves into
>> the 'DISABLED' state. But the path structures themselves are still alive,
>> so when a path comes back between 'DISABLED' and 'CANCEL' we won't
>> have an issue reconnecting it. And we could even allow setting
>> dev_loss_tmo to infinity, thereby simulating the 'old' behaviour.
>>
>> However, this proposal didn't go through.
> 
> You got my hopes up for a solution in the long explanation, then you
> destroyed them :)
> 
Yes, same here. I really thought this to be a sensible proposal, but
then the discussion veered off into queue_if_no_path handling.

> 
> Was the reason people did not like this because of the scsi device
> lifetime issue?
> 
> 
> I think we still want someone to set the fast io fail tmo for users when
> multipath is being used, because we want IO out of the queues and
> drivers and sent to the multipath layer before dev_loss_tmo if
> dev_loss_tmo is still going to be a lot longer. fast io fail tmo is
> usually less than 5 or 10 seconds, while it seems users still set
> dev_loss_tmo to minutes.
> 
Exactly. The point here is that with the current implementation we basically
_cannot_ return 'path down' anymore, as the path is either blocked (during
which time all I/O is stalled) or failed completely (ie in state 'CANCEL').
That is a bit of a detriment, and we actually run into quite some contention
when the path is removed, as we have to kill all I/O, fail over paths, remove
stale paths, update device-mapper tables, etc.

If we decouple this by having the midlayer always return 'DID_TRANSPORT_DISRUPTED'
after fast_fail_io, we would be able to kill all I/O and switch paths gracefully.
Path removal and the device-mapper table update would then be done later on,
when dev_loss_tmo triggers.


> 
> Can't the transport layers just send two events?
> 1. On the initial link down when the port/session is blocked.
> 2. When the fast io fail tmo fires.
> 
Yes, that would be a good start.

> Today, instead of #2, the Red Hat multipath tools guy and I were talking
> about doing a probe with SG_IO. For example we would send down a path
> tester IO and then wait for it to be failed with DID_TRANSPORT_FAILFAST.
> 
No, this is exactly what you cannot do. SG_IO will be stalled when the
sdev is BLOCKED and will only return a result _after_ the sdev transitions
_out_ of the BLOCKED state.
Translated to FC this means that whenever dev_loss_tmo is _active_ (!)
no I/O will be sent out, nor will any I/O result be returned to userland.

Hence using SG_IO for a path checker is a bad idea here.
Hence my proposal.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)


* Re: [dm-devel] LSF: Multipathing and path checking question
From: Hannes Reinecke @ 2009-04-20  8:19 UTC
  To: Mike Christie; +Cc: device-mapper development, SCSI Mailing List

Hi Mike,

Mike Christie wrote:
> Oops, I mashed two topics together. See below.
> 
> Mike Christie wrote:
>> Hannes Reinecke wrote:
>>>
>>> FC Transport already maintains an attribute for the path state, and even
>>> sends netlink events if and when this attribute changes. For iSCSI I
>>> have
>>
>> Are you referring to fc_host_post_event? Is it the same thing we talked
>> about last year, where you wanted events? Is this in multipath tools
>> now or just in the SLES ones?
>>
>> For something like FCH_EVT_LINKDOWN, are you going to fail the path at
>> that time or when would the multipath path be marked failed?
>>
> 
> I was asking this because it seems we have people constantly filing
> bugzillas saying they did not want the path to be marked failed for
> short-lived problems.
> 
> There was the problem where we might get DID_ERROR for a temporarily
> dropped frame. That would be fixed by just listening to transport
> events like you explained.
> 
> But then I thought there was the case where if we get a linkdown then
> linkup within a couple seconds, we would not want to transition the
> multipath path state.
> 
> So below while you were talking about when to remove the device, I was
> talking about when to mark the path failed.
> 
> 
I have the same bugzillas, too :-)

My proposal is to handle this in several stages:

- path fails
-> Send out netlink event
-> start dev_loss_tmo and fast_fail_io timer
-> fast_fail_io timer triggers: Abort all outstanding I/O with
   DID_TRANSPORT_DISRUPTED, return DID_TRANSPORT_FAILFAST for
   any future I/O, and send out netlink event.
-> dev_loss_tmo timer triggers: Remove sdev and clean up rport.
   netlink event is sent implicitly by removing the sdev.

Multipath would then interact with this sequence by:

- Upon receiving 'path failed' event: mark path as 'ghost' or 'blocked',
  ie no I/O is currently possible and will be queued (no path switch yet).
- Upon receiving 'fast_fail_io' event: switch paths and resubmit queued I/Os
- Upon receiving 'path removed' event: remove path from internal structures,
  update multipath maps etc.

The time between 'path failed' and 'fast_fail_io triggers' would then be
able to capture any jitter / intermittent failures. Between 
'fast_fail_io triggers' and 'path removed' the path would be held in some
sort of 'limbo' in case it comes back again, eg for maintenance/SP update
etc. And we can even increase this one to rather long timespans (eg hours)
to give the admin enough time for a manual intervention.

I still like this proposal as it makes multipath interaction far cleaner.
And we can do away with path checkers completely here.
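
Hypothetically the multipathd side would then boil down to a dispatch
along these lines (every name below is made up for illustration):

#include <stdio.h>

enum path_event { EVT_PATH_FAILED, EVT_FAST_FAIL_IO, EVT_PATH_REMOVED };

struct path { const char *dev; };

static void handle_path_event(struct path *pp, enum path_event evt)
{
        switch (evt) {
        case EVT_PATH_FAILED:
                /* jitter window: hold I/O, no path switch yet */
                printf("%s: marked ghost/blocked\n", pp->dev);
                break;
        case EVT_FAST_FAIL_IO:
                /* transport gave up: switch paths, resubmit queued I/O */
                printf("%s: failed, switching paths\n", pp->dev);
                break;
        case EVT_PATH_REMOVED:
                /* dev_loss_tmo fired: drop the path, update the maps */
                printf("%s: removed from map\n", pp->dev);
                break;
        }
}

int main(void)
{
        struct path p = { "sdc" };

        handle_path_event(&p, EVT_PATH_FAILED);
        handle_path_event(&p, EVT_FAST_FAIL_IO);
        handle_path_event(&p, EVT_PATH_REMOVED);
        return 0;
}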

> 
>>
>> You got my hopes up for a solution in the long explanation, then
>> you destroyed them :)
>>
>>
>> Was the reason people did not like this because of the scsi device
>> lifetime issue?
>>
>>
>> I think we still want someone to set the fast io fail tmo for users
>> when multipath is being used, because we want IO out of the queues and
>> drivers and sent to the multipath layer before dev_loss_tmo if
>> dev_loss_tmo is still going to be a lot longer. fast io fail tmo is
>> usually less than 5 or 10 seconds, while it seems users still set
>> dev_loss_tmo to minutes.
>>
>>
>> Can't the transport layers just send two events?
>> 1. On the initial link down when the port/session is blocked.
>> 2. When the fast io fail tmo fires.
> 
> 
> So for #2, I just want a way to figure out when the transport is giving
> up on executing IO and is going to fail everything. At that time, I was
> thinking we want to mark the path failed.
> 
See above. Exactly my proposal.

> I guess if multipath tools is going to set fast io fail, it could also
> use that as its down timer to decide when to fail the path and not have
> to send SG IO or a bsg transport command.
> 
But that's a bit of out-guessing the midlayer, no?
We're instructing the midlayer to fail all I/O at one point; so it makes
far more sense to me to have the midlayer telling us when this is going
to happen instead of trying to figure this one out ourselves.

For starters we just should send a netlink event when fast_fail_io has
fired. We could easily integrate that one in multipathd and would gain
an instant benefit from that as we can switch paths in advance.
Next step would be to implement an additional sdev state which would
return 'DID_TRANSPORT_FASTFAIL' for any 'normal' I/O; it would be
inserted between 'RUNNING' and 'CANCEL'.
Transition would be possible between 'RUNNING' and 'FASTFAIL', but
it would only be possible to transition into 'CANCEL' from 'FASTFAIL'.

Oh, and of course we have to persuade Eric Moore et al to implement
fast_fail_io into mptfc ...

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)


* Re: LSF: Multipathing and path checking question
From: Mike Christie @ 2009-04-20 19:10 UTC
  To: Hannes Reinecke; +Cc: device-mapper development, SCSI Mailing List

Hannes Reinecke wrote:
>> Today, instead of #2, the Red Hat multipath tools guy and I were talking
>> about doing a probe with SG_IO. For example we would send down a path
>> tester IO and then wait for it to be failed with DID_TRANSPORT_FAILFAST.
>>
> No, this is exactly what you cannot do. SG_IO will be stalled when the
> sdev is BLOCKED and will only return a result _after_ the sdev transitions
> _out_ of the BLOCKED state.
> Translated to FC this means that whenever dev_loss_tmo is _active_ (!)
> no I/O will be sent out, nor will any I/O result be returned to userland.
> 

That is not true anymore. When fast io fail fires, the sdev and rport 
will be blocked, but the fc class will call into the LLD to have it 
fail any IO still running in the driver. The FC class will then fail any 
IO in the block queues, and then it will also fail any new IO sent to it.

With your patch to have multipath-tools set fast io fail for multipath, 
then we should always get the IO failed before dev_loss_tmo fires.


* Re: LSF: Multipathing and path checking question
From: Mike Christie @ 2009-04-20 19:23 UTC
  To: Hannes Reinecke; +Cc: device-mapper development, SCSI Mailing List

Hannes Reinecke wrote:
> Hi Mike,
> 
> Mike Christie wrote:
>> Oops, I mashed two topics together. See below.
>>
>> Mike Christie wrote:
>>> Hannes Reinecke wrote:
>>>> FC Transport already maintains an attribute for the path state, and even
>>>> sends netlink events if and when this attribute changes. For iSCSI I
>>>> have
>>> Are you referring to fc_host_post_event? Is it the same thing we talked
>>> about last year, where you wanted events? Is this in multipath tools
>>> now or just in the SLES ones?
>>>
>>> For something like FCH_EVT_LINKDOWN, are you going to fail the path at
>>> that time or when would the multipath path be marked failed?
>>>
>> I was asking this because it seems we have people constantly filing
>> bugzillas saying they did not want the path to be marked failed for
>> short-lived problems.
>>
>> There was the problem where we might get DID_ERROR for a temporarily
>> dropped frame. That would be fixed by just listening to transport
>> events like you explained.
>>
>> But then I thought there was the case where if we get a linkdown then
>> linkup within a couple seconds, we would not want to transition the
>> multipath path state.
>>
>> So below while you were talking about when to remove the device, I was
>> talking about when to mark the path failed.
>>
>>
> I have the same bugzillas, too :-)
> 
> My proposal is to handle this in several stages:
> 
> - path fails
> -> Send out netlink event
> -> start dev_loss_tmo and fast_fail_io timer
> -> fast_fail_io timer triggers: Abort all outstanding I/O with
>    DID_TRANSPORT_DISRUPTED, return DID_TRANSPORT_FAILFAST for
>    any future I/O, and send out netlink event.


This is almost done. The IOs are failed. There is no netlink event yet.



> -> dev_loss_tmo timer triggers: Remove sdev and clean up rport.
>    netlink event is sent implicitly by removing the sdev.
> 
> Multipath would then interact with this sequence by:
> 
> - Upon receiving 'path failed' event: mark path as 'ghost' or 'blocked',
>   ie no I/O is currently possible and will be queued (no path switch yet).
> - Upon receiving 'fast_fail_io' event: switch paths and resubmit queued I/Os
> - Upon receiving 'path removed' event: remove path from internal structures,
>   update multipath maps etc.
> 
> The time between 'path failed' and 'fast_fail_io triggers' would then be
> able to capture any jitter / intermittent failures. Between 
> 'fast_fail_io triggers' and 'path removed' the path would be held in some
> sort of 'limbo' in case it comes back again, eg for maintenance/SP update
> etc. And we can even increase this one to rather long timespans (eg hours)
> to give the admin enough time for a manual intervention.
> 
> I still like this proposal as it makes multipath interaction far cleaner.
> And we can do away with path checkers completely here.
> 
>>> You got my hopes up for a solution in the long explanation, then
>>> you destroyed them :)
>>>
>>>
>>> Was the reason people did not like this because of the scsi device
>>> lifetime issue?
>>>
>>>
>>> I think we still want someone to set the fast io fail tmo for users
>>> when multipath is being used, because we want IO out of the queues and
>>> drivers and sent to the multipath layer before dev_loss_tmo if
>>> dev_loss_tmo is still going to be a lot longer. fast io fail tmo is
>>> usually less than 5 or 10 seconds, while it seems users still set
>>> dev_loss_tmo to minutes.
>>>
>>>
>>> Can't the transport layers just send two events?
>>> 1. On the initial link down when the port/session is blocked.
>>> 2. When the fast io fail tmo fires.
>>
>> So for #2, I just want a way to figure out when the transport is giving
>> up on executing IO and is going to fail everything. At that time, I was
>> thinking we want to mark the path failed.
>>
> See above. Exactly my proposal.
> 
>> I guess if multipath tools is going to set fast io fail, it could also
>> use that as its down timer to decide when to fail the path and not have
>> to send SG IO or a bsg transport command.
>>
> But that's a bit of out-guessing the midlayer, no?


Yeah, agreed. Just brainstorming.



> We're instructing the midlayer to fail all I/O at one point; so it makes
> far more sense to me to have the midlayer telling us when this is going
> to happen instead of trying to figure this one out ourselves.
> 
> For starters we just should send a netlink event when fast_fail_io has
> fired. We could easily integrate that one in multipathd and would gain
> an instant benefit from that as we can switch paths in advance.
> Next step would be to implement an additional sdev state which would
> return 'DID_TRANSPORT_FASTFAIL' for any 'normal' I/O; it would be
> inserted between 'RUNNING' and 'CANCEL'.
> Transition would be possible between 'RUNNING' and 'FASTFAIL', but
> it would only be possible to transition into 'CANCEL' from 'FASTFAIL'.
> 


Yeah, a new sdev state might be nice. Right now this state is handled by 
the classes. For iscsi and FC the port/session will be in 
blocked/ISCSI_SESSION_FAILED. Then internally the classes are deciding 
what to do with IO in the *_chkready functions.
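
Roughly, the FC one looks like this today (a from-memory sketch, not
the exact code; see fc_remote_port_chkready() in
include/scsi/scsi_transport_fc.h, which also checks the rport roles):

static int rport_chkready(struct fc_rport *rport)
{
        switch (rport->port_state) {
        case FC_PORTSTATE_ONLINE:
                return 0;                       /* I/O may proceed */
        case FC_PORTSTATE_BLOCKED:
                if (rport->flags & FC_RPORT_FAST_FAIL_TIMEDOUT)
                        return DID_TRANSPORT_FAILFAST << 16;
                return DID_IMM_RETRY << 16;     /* requeue, still blocked */
        default:
                return DID_NO_CONNECT << 16;    /* port is gone */
        }
}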




> Oh, and of course we have to persuade Eric Moore et al to implement
> fast_fail_io into mptfc ...

Yeah, the last holdout, not counting the old qlogic driver.

But actually in the current code, if you just set the fast io fail tmo,
all IO in the block queues and any incoming IO will get failed. It is
sort of partial support. Even if you cannot kill IO in the driver
because you do not have the terminate rport IO callback, you can at least
get the queues cleared, so that IO does not sit in there.


* Re: [dm-devel] LSF: Multipathing and path checking question
From: Mike Christie @ 2009-04-20 19:28 UTC
  To: Hannes Reinecke; +Cc: device-mapper development, SCSI Mailing List

Mike Christie wrote:
> Hannes Reinecke wrote:
>>> Today, instead of #2, the Red Hat multipath tools guy and I were talking
>>> about doing a probe with SG_IO. For example we would send down a path
>>> tester IO and then wait for it to be failed with DID_TRANSPORT_FAILFAST.
>>>
>> No, this is exactly what you cannot do. SG_IO will be stalled when the
>> sdev is BLOCKED and will only return a result _after_ the sdev
>> transitions
>> _out_ of the BLOCKED state.
>> Translated to FC this means that whenever dev_loss_tmo is _active_ (!)
>> no I/O will be sent out, nor will any I/O result be returned to
>> userland.
>>
> 
> That is not true anymore. When fast io fail fires, the sdev and rport 
> will be blocked, but the fc class will call into the LLD to have it 

I miswrote that. The rport will show the blocked state, but when the fast
io fail tmo fires, fc_terminate_rport_io will unblock the sdev, the fc
class chkready will fail any IO sent to it, and of course
terminate_rport_io will fail IO in the driver like I said below. And
you no longer need a terminate_rport_io callback to have the fast io
fail tmo. If you set that timer, at least IO in the block queue and
new IO will be failed.


> fail any IO still running in the driver. The FC class will then fail any 
> IO in the block queues, and then it will also fail any new IO sent to it.
> 
> With your patch to have multipath-tools set fast io fail for multipath, 
> then we should always get the IO failed before dev_loss_tmo fires.



* Re: LSF: Multipathing and path checking question
From: Mike Christie @ 2009-04-20 23:02 UTC
  To: device-mapper development; +Cc: SCSI Mailing List

Mike Christie wrote:
>> For starters we just should send a netlink event when fast_fail_io has
>> fired. We could easily integrate that one in multipathd and would gain
>> an instant benefit from that as we can switch paths in advance.
>> Next step would be to implement an additional sdev state which would
>> return 'DID_TRANSPORT_FASTFAIL' for any 'normal' I/O; it would be
>> inserted between 'RUNNING' and 'CANCEL'.
>> Transition would be possible between 'RUNNING' and 'FASTFAIL', but
>> it would only be possible to transition into 'CANCEL' from 'FASTFAIL'.
>>
> 
> 
> Yeah, a new sdev state might be nice. Right now this state is handled by 
> the classes. For iscsi and FC the port/session will be in 
> blocked/ISCSI_SESSION_FAILED. Then internally the classes are deciding 
> what to do with IO in the *_chkready functions.
> 
> 


How about setting the device to the offline state for this case where
fast_io_fail has fired but the dev_loss_tmo has not yet fired? As far
as failing IO goes, we get the same result: scsi-ml would fail the
incoming IO instead of it getting to the class _chkready functions, but
the scsi device state would indicate that it cannot execute IO, which
might be nice for users.

Can we not do this because offline only means that the scsi-eh has put
the device offline after failing to recover it, or is it more generic,
covering any time the device cannot execute IO?


* Re: [dm-devel] LSF: Multipathing and path checking question
From: Hannes Reinecke @ 2009-04-21  7:04 UTC
  To: Mike Christie; +Cc: device-mapper development, SCSI Mailing List

Hi Mike,

Mike Christie wrote:
> Mike Christie wrote:
>> Hannes Reinecke wrote:
>>>> Today, instead of #2, the Red Hat multipath tools guy and I were
>>>> talking
>>>> about doing a probe with SG_IO. For example we would send down a path
>>>> tester IO and then wait for it to be failed with
>>>> DID_TRANSPORT_FAILFAST.
>>>>
>>> No, this is exactly what you cannot do. SG_IO will be stalled when the
>>> sdev is BLOCKED and will only return a result _after_ the sdev
>>> transitions
>>> _out_ of the BLOCKED state.
>>> Translated to FC this means that whenever dev_loss_tmo is _active_ (!)
>>> no I/O will be sent out, nor will any I/O result be returned to
>>> userland.
>>>
>>
>> That is not true anymore. When fast io fail fires, the sdev and rport
>> will be blocked, but the fc class will call into the LLD to have it 
> 
> I miswrote that. The rport will show the blocked state, but when the fast
> io fail tmo fires, fc_terminate_rport_io will unblock the sdev, the fc
> class chkready will fail any IO sent to it, and of course
> terminate_rport_io will fail IO in the driver like I said below. And
> you no longer need a terminate_rport_io callback to have the fast io
> fail tmo. If you set that timer, at least IO in the block queue and
> new IO will be failed.
> 
Indeed, I didn't look closely enough. Ok, so I/O will be failed after
terminate_rport_io. 

So that means we can just implement a new netlink message after
terminate_rport_io to inform the multipath daemon about this change.

And, of course, we _really_ should introduce a new sdev state here.
Having the sdev set to 'RUNNING' but having all I/O failed in the
transport class is just a quirky behaviour which is bound to cause
trouble.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)


* Re: [dm-devel] LSF: Multipathing and path checking question
From: Hannes Reinecke @ 2009-04-21  7:26 UTC
  To: Mike Christie; +Cc: device-mapper development, SCSI Mailing List

Mike Christie wrote:
> Mike Christie wrote:
>>> For starters we just should send a netlink event when fast_fail_io has
>>> fired. We could easily integrate that one in multipathd and would gain
>>> an instant benefit from that as we can switch paths in advance.
>>> Next step would be to implement an additional sdev state which would
>>> return 'DID_TRANSPORT_FASTFAIL' for any 'normal' I/O; it would be
>>> inserted between 'RUNNING' and 'CANCEL'.
>>> Transition would be possible between 'RUNNING' and 'FASTFAIL', but
>>> it would only be possible to transition into 'CANCEL' from 'FASTFAIL'.
>>>
>>
>>
>> Yeah, a new sdev state might be nice. Right now this state is handled
>> by the classes. For iscsi and FC the port/session will be in
>> blocked/ISCSI_SESSION_FAILED. Then internally the classes are
>> deciding what to do with IO in the *_chkready functions.
>>
>>
> 
> 
> How about setting the device to the offline state for this case where
> fast_io_fail has fired but the dev_loss_tmo has not yet fired? As far
> as failing IO goes, we get the same result: scsi-ml would fail the
> incoming IO instead of it getting to the class _chkready functions, but
> the scsi device state would indicate that it cannot execute IO, which
> might be nice for users.
> 
Ah, no. OFFLINE is a dead-end state out of which we cannot transition
from inside the kernel. I'd prefer a new state here.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)

