[PATCH] scsi_transport_fc: handle transient error on multipath environment

public inbox for linux-scsi@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH] scsi_transport_fc: handle transient error on multipath environment
@ 2010-02-12  8:09 Tomohiro Kusumi
  2010-02-12 15:27 ` James Smart
  2010-02-12 17:46 ` Mike Christie
  0 siblings, 2 replies; 6+ messages in thread
From: Tomohiro Kusumi @ 2010-02-12  8:09 UTC (permalink / raw)
  To: linux-scsi, michaelc, James.Bottomley; +Cc: linux-kernel, kusumi.tomohiro

Hi

We've been working on SCSI-FC for enterprise system using MD/DMMP.
In enterprise system, response time from disk is important factor,
thus it is important for multipathd to quickly discard current path and
failover to secondary RAID disk if any problem with disk I/O is detected.
In order to switch to alternative path as quick as possible, multipathd
should quickly recognize phenomenon such as fibre channel link down,
no response from disk, etc.

In the past, we've posted a patch that reduces response time from disk,
although it was a trial patch since there wasn't good framework to
implement those features. We did it in block layer and that wasn't
a good choice I guess.
http://marc.info/?l=linux-kernel&m=109598324018681&w=2

But in the recent SCSI driver, transport layer for each lower level
interface is getting bigger and better which I think is a good platform
to implement them. As far as I know, Mr. Mike Christie has already been
working on fast io fail timeout feature for fibre channel transport layer,
and that enables userland multipathd quickly guess that the path is down
when fibre channel linkdown occured on LLD like lpfc. This patch is a
simple additional feature to what Mike has been working on.

This is what I'm trying to do.
1. If SCSI command had timed out, I assume it's time to failover to the
   secondary disk without error recovery. Let's call it transient error.
2. Schedule fc_rport_recover_transient_error from fc_timed_out using work
   queue if the feature is enabled. Also, make fc_timed_out return
   BLK_EH_HANDLED so as not to wake up error handler kernel thread.
3. That workqueue calls transport template function recover_transient_error
   if LLD implements it. Otherwise, it simply calls fc_remote_port_delete
   and delete fibre channel remote port that corresponds to the SCSI target
   device that caused transient error.
4. Once fc_remote_port_delete is called, it removes the remote port and
   take care of existing and incoming I/O just like when fibre channel
   linkdown occured.
5. If fast io fail timeout is enabled, multipathd can quickly recognize
   disk I/O problem and make dm-mpath driver failover to secondary disk.
   Even if fast io fail timeout is disabled, multipathd can recognize it
   anyway after dev loss timeout expired.

In the current SCSI mid layer driver, SCSI command timeout wakes up error
handler kernel thread which takes quite long time depending on the imple-
mentation of LLD. Although waking up SCSI error handler is right thing to
do in most cases, I think it is not suitable for multipath environment
with requirement of quick response. Enabling recover_transient_feature
might help those who don't want recovery operation, but just quick failover.

Thanks,
Tomohiro Kusumi

Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@jp.fujitsu.com>
---

diff -aNur linux-2.6.33-rc7.org/drivers/scsi/scsi_transport_fc.c linux-2.6.33-rc7/drivers/scsi/scsi_transport_fc.c
--- linux-2.6.33-rc7.org/drivers/scsi/scsi_transport_fc.c	2010-02-07 07:17:12.000000000 +0900
+++ linux-2.6.33-rc7/drivers/scsi/scsi_transport_fc.c	2010-02-12 15:16:05.985439982 +0900
@@ -284,13 +284,14 @@
 static void fc_timeout_deleted_rport(struct work_struct *work);
 static void fc_timeout_fail_rport_io(struct work_struct *work);
 static void fc_scsi_scan_rport(struct work_struct *work);
+static void fc_rport_recover_transient_error(struct work_struct *work);

 /*
  * Attribute counts pre object type...
  * Increase these values if you add attributes
  */
 #define FC_STARGET_NUM_ATTRS 	3
-#define FC_RPORT_NUM_ATTRS	10
+#define FC_RPORT_NUM_ATTRS	11
 #define FC_VPORT_NUM_ATTRS	9
 #define FC_HOST_NUM_ATTRS	22

@@ -935,6 +936,39 @@
 static FC_DEVICE_ATTR(rport, fast_io_fail_tmo, S_IRUGO | S_IWUSR,
 	show_fc_rport_fast_io_fail_tmo, store_fc_rport_fast_io_fail_tmo);

+/*
+ * recover_transient_error attribute
+ */
+static ssize_t
+show_fc_rport_recover_transient_error (struct device *dev,
+				struct device_attribute *attr, char *buf)
+{
+	struct fc_rport *rport = transport_class_to_rport(dev);
+	return snprintf(buf, 20, "%d\n", rport->recover_transient_error);
+}
+
+static ssize_t
+store_fc_rport_recover_transient_error(struct device *dev,
+				struct device_attribute *attr, const char *buf,
+				size_t count)
+{
+	int val;
+	char *cp;
+	struct fc_rport *rport = transport_class_to_rport(dev);
+
+	if ((rport->port_state == FC_PORTSTATE_BLOCKED) ||
+	    (rport->port_state == FC_PORTSTATE_DELETED) ||
+	    (rport->port_state == FC_PORTSTATE_NOTPRESENT))
+		return -EBUSY;
+	val = simple_strtoul(buf, &cp, 0);
+	if ((*cp && (*cp != '\n')) || val < 0 || val > 1)
+		return -EINVAL;
+	rport->recover_transient_error = val;
+	return count;
+}
+static FC_DEVICE_ATTR(rport, recover_transient_error, S_IRUGO | S_IWUSR,
+	show_fc_rport_recover_transient_error,
+	store_fc_rport_recover_transient_error);

 /*
  * FC SCSI Target Attribute Management
@@ -1953,6 +1987,13 @@
 {
 	struct fc_rport *rport = starget_to_rport(scsi_target(scmd->device));

+	if (rport->recover_transient_error) {
+		fc_queue_work(scmd->device->host, &rport->rport_te_work);
+		scmd->result = ((scmd->result & 0xFF00FFFF) |
+				(DID_TRANSPORT_DISRUPTED << 16));
+		return BLK_EH_HANDLED;
+	}
+
 	if (rport->port_state == FC_PORTSTATE_BLOCKED)
 		return BLK_EH_RESET_TIMER;

@@ -2156,6 +2197,7 @@
 	SETUP_PRIVATE_RPORT_ATTRIBUTE_RD(port_state);
 	SETUP_PRIVATE_RPORT_ATTRIBUTE_RD(scsi_target_id);
 	SETUP_PRIVATE_RPORT_ATTRIBUTE_RW(fast_io_fail_tmo);
+	SETUP_PRIVATE_RPORT_ATTRIBUTE_RW(recover_transient_error);

 	BUG_ON(count > FC_RPORT_NUM_ATTRS);

@@ -2506,6 +2548,7 @@
 	INIT_WORK(&rport->scan_work, fc_scsi_scan_rport);
 	INIT_WORK(&rport->stgt_delete_work, fc_starget_delete);
 	INIT_WORK(&rport->rport_delete_work, fc_rport_final_delete);
+	INIT_WORK(&rport->rport_te_work, fc_rport_recover_transient_error);

 	spin_lock_irqsave(shost->host_lock, flags);

@@ -3157,6 +3200,36 @@
 }

 /**
+ * fc_rport_recover_transient_error - Transient error recovery handler
+ * @work:	contained by rport that corresponds to SCSI device to
+ * recover from transient error.
+ *
+ * Description: This routine is used when transient error
+ * (such as no response from disk) is detected. If FC LLD does not have
+ * implementation of recover_transient_error this routine simply deletes
+ * that port just like when FC linkdown is detected. In multipath envir-
+ * onment, if fast io fail timeout is enabled, multipathd can quickly
+ * guess that there is a problem with disk I/O and try to deactivate
+ * current path.
+ */
+static void
+fc_rport_recover_transient_error(struct work_struct *work)
+{
+	struct fc_rport *rport =
+		container_of(work, struct fc_rport, rport_te_work);
+	struct Scsi_Host *shost = rport_to_shost(rport);
+	struct fc_internal *i = to_fc_internal(shost->transportt);
+
+	if (rport->port_state == FC_PORTSTATE_BLOCKED)
+		return;
+
+	if (i->f->recover_transient_error)
+		i->f->recover_transient_error(rport);
+	else
+		fc_remote_port_delete(rport);
+}
+
+/**
  * fc_block_scsi_eh - Block SCSI eh thread for blocked fc_rport
  * @cmnd: SCSI command that scsi_eh is trying to recover
  *
diff -aNur linux-2.6.33-rc7.org/include/scsi/scsi_transport_fc.h linux-2.6.33-rc7/include/scsi/scsi_transport_fc.h
--- linux-2.6.33-rc7.org/include/scsi/scsi_transport_fc.h	2010-02-07 07:17:12.000000000 +0900
+++ linux-2.6.33-rc7/include/scsi/scsi_transport_fc.h	2010-02-11 22:28:38.988369450 +0900
@@ -336,6 +336,7 @@
 	enum fc_port_state port_state;	/* Will only be ONLINE or UNKNOWN */
 	u32 scsi_target_id;
 	u32 fast_io_fail_tmo;
+	int recover_transient_error;

 	/* exported data */
 	void *dd_data;			/* Used for driver-specific storage */
@@ -351,6 +352,7 @@
  	struct delayed_work fail_io_work;
  	struct work_struct stgt_delete_work;
 	struct work_struct rport_delete_work;
+	struct work_struct rport_te_work;
 	struct request_queue *rqst_q;	/* bsg support */
 } __attribute__((aligned(sizeof(unsigned long))));

@@ -648,6 +650,7 @@

 	void    (*dev_loss_tmo_callbk)(struct fc_rport *);
 	void	(*terminate_rport_io)(struct fc_rport *);
+	void	(*recover_transient_error)(struct fc_rport *);

 	void	(*set_vport_symbolic_name)(struct fc_vport *);
 	int  	(*vport_create)(struct fc_vport *, bool);

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] scsi_transport_fc: handle transient error on multipath environment
  2010-02-12  8:09 [PATCH] scsi_transport_fc: handle transient error on multipath environment Tomohiro Kusumi
@ 2010-02-12 15:27 ` James Smart
  2010-04-01  8:42   ` Tomohiro Kusumi
  2010-02-12 17:46 ` Mike Christie
  1 sibling, 1 reply; 6+ messages in thread
From: James Smart @ 2010-02-12 15:27 UTC (permalink / raw)
  To: Tomohiro Kusumi
  Cc: linux-scsi@vger.kernel.org, michaelc@cs.wisc.edu,
	James.Bottomley@suse.de, linux-kernel@vger.kernel.org

Tomohiro Kusumi wrote:
> Hi
> 
> We've been working on SCSI-FC for enterprise system using MD/DMMP.
> In enterprise system, response time from disk is important factor,
> thus it is important for multipathd to quickly discard current path and
> failover to secondary RAID disk if any problem with disk I/O is detected.
> In order to switch to alternative path as quick as possible, multipathd
> should quickly recognize phenomenon such as fibre channel link down,
> no response from disk, etc.
> 
> In the past, we've posted a patch that reduces response time from disk,
> although it was a trial patch since there wasn't good framework to
> implement those features. We did it in block layer and that wasn't
> a good choice I guess.
> http://marc.info/?l=linux-kernel&m=109598324018681&w=2
> 
> But in the recent SCSI driver, transport layer for each lower level
> interface is getting bigger and better which I think is a good platform
> to implement them. As far as I know, Mr. Mike Christie has already been
> working on fast io fail timeout feature for fibre channel transport layer,
> and that enables userland multipathd quickly guess that the path is down
> when fibre channel linkdown occured on LLD like lpfc. This patch is a
> simple additional feature to what Mike has been working on.
> 
> This is what I'm trying to do.
> 1. If SCSI command had timed out, I assume it's time to failover to the
>    secondary disk without error recovery. Let's call it transient error.

Link down is an indication of path connectivity loss, and connectivity loss is
one one of the tasks of the transport - to isolate the upper layers from
transient loss.  Mike's addition was appropriate as it changed the way i/o was
dealt with while in one of the transient loss states.

But interpretation of an i/o completion status is a very different matter. The
transport/LLDD shouldn't be making any inferences based on i/o completion
state. That's for upper layers who better know the device and the task at hand
to decide. The transport is simply tracking connectivity status *as driven by
the LLDD*.

So, although I can understand that you would like to use latency as a path
quality issue, I don't agree with making the transport be the one making a
failover policy, even if the feature is optional. Failover policy choice is
for the multipathing software.

Can you give me a reason why it is not addressed in multipathing layers ? Why
isn't the upper layer monitoring latency, which doesn't have to be an i/o
timeout, not tracked in the multipathing software.  The additional advantage
of doing this (at the right level) is that this failover due to latency on a
path, would apply to all transports.

> 2. Schedule fc_rport_recover_transient_error from fc_timed_out using work
>    queue if the feature is enabled. Also, make fc_timed_out return
>    BLK_EH_HANDLED so as not to wake up error handler kernel thread.
> 3. That workqueue calls transport template function recover_transient_error
>    if LLD implements it. Otherwise, it simply calls fc_remote_port_delete
>    and delete fibre channel remote port that corresponds to the SCSI target
>    device that caused transient error.

In order to agree to such a patch, I would need to know, very clearly, what an
LLDD is supposed to do in a "transient error" handler.  This was unspecified.

I have a hard time agreeing with a default policy that says, just because a
single i/o timed out, the entire target topology tree should be torn down. Due
to the reasons for a timeout, it may require more than 1 before a pattern
exists that says it should be considered "bad".  Mostly though - the topology
tree is there to represent the connectivity on the FC fabric *as seen by the
LLDD* and largely tracks to the LLDD discovery and login state.  Asynchronous
teardown of this tree by an i/o timeout can leave a mismatch in the transport
vs LLDD on the rport state (perhaps causing other errors) as well as forcing a
condition where OS tools/admins viewing the sysfs tree - see a colored view of
what the fabric connectivity actually is.

> 4. Once fc_remote_port_delete is called, it removes the remote port and
>    take care of existing and incoming I/O just like when fibre channel
>    linkdown occured.

Additionally, I think it's very odd to have a single i/o, which timed out,
kill all other i/o's to all luns on that target. Given array implementations
that may make lun relationships vary greatly (with preferred paths,
distributed controller implementations), this is too broad a scope to imply.

All of this is solved if you deal with it at the "device" level in the
multipathing software.

> 5. If fast io fail timeout is enabled, multipathd can quickly recognize
>    disk I/O problem and make dm-mpath driver failover to secondary disk.
>    Even if fast io fail timeout is disabled, multipathd can recognize it
>    anyway after dev loss timeout expired.
> 
> In the current SCSI mid layer driver, SCSI command timeout wakes up error
> handler kernel thread which takes quite long time depending on the imple-
> mentation of LLD. Although waking up SCSI error handler is right thing to
> do in most cases, I think it is not suitable for multipath environment
> with requirement of quick response. Enabling recover_transient_feature
> might help those who don't want recovery operation, but just quick failover.

Then it hints the error handler should be fixed....

-- james s

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] scsi_transport_fc: handle transient error on multipath environment
  2010-02-12 15:27 ` James Smart
@ 2010-04-01  8:42   ` Tomohiro Kusumi
  0 siblings, 0 replies; 6+ messages in thread
From: Tomohiro Kusumi @ 2010-04-01  8:42 UTC (permalink / raw)
  To: James Smart
  Cc: linux-scsi@vger.kernel.org, michaelc@cs.wisc.edu,
	James.Bottomley@suse.de

Hi James,

Sorry for the late reply,
I basically agree to what you have mentioned.

> > We've been working on SCSI-FC for enterprise system using MD/DMMP.
> > In enterprise system, response time from disk is important factor,
> > thus it is important for multipathd to quickly discard current path and
> > failover to secondary RAID disk if any problem with disk I/O is detected.
> > In order to switch to alternative path as quick as possible, multipathd
> > should quickly recognize phenomenon such as fibre channel link down,
> > no response from disk, etc.
> > 
> > In the past, we've posted a patch that reduces response time from disk,
> > although it was a trial patch since there wasn't good framework to
> > implement those features. We did it in block layer and that wasn't
> > a good choice I guess.
> > http://marc.info/?l=linux-kernel&m=109598324018681&w=2
> > 
> > But in the recent SCSI driver, transport layer for each lower level
> > interface is getting bigger and better which I think is a good platform
> > to implement them. As far as I know, Mr. Mike Christie has already been
> > working on fast io fail timeout feature for fibre channel transport layer,
> > and that enables userland multipathd quickly guess that the path is down
> > when fibre channel linkdown occured on LLD like lpfc. This patch is a
> > simple additional feature to what Mike has been working on.
> > 
> > This is what I'm trying to do.
> > 1. If SCSI command had timed out, I assume it's time to failover to the
> >    secondary disk without error recovery. Let's call it transient error.
> 
> Link down is an indication of path connectivity loss, and connectivity loss is
> one one of the tasks of the transport - to isolate the upper layers from
> transient loss.  Mike's addition was appropriate as it changed the way i/o was
> dealt with while in one of the transient loss states.
>
> But interpretation of an i/o completion status is a very different matter. The
> transport/LLDD shouldn't be making any inferences based on i/o completion
> state. That's for upper layers who better know the device and the task at hand
> to decide. The transport is simply tracking connectivity status *as driven by
> the LLDD*.
> 
> So, although I can understand that you would like to use latency as a path
> quality issue, I don't agree with making the transport be the one making a
> failover policy, even if the feature is optional. Failover policy choice is
> for the multipathing software.
> 
> Can you give me a reason why it is not addressed in multipathing layers ? Why
> isn't the upper layer monitoring latency, which doesn't have to be an i/o
> timeout, not tracked in the multipathing software.  The additional advantage
> of doing this (at the right level) is that this failover due to latency on a
> path, would apply to all transports.

 Since scsi_transport.h says "eh_timed_out allows the transport to become
 involved when a scsi io timer fires", making action based on I/O completion
 state once made sense to me. Plus, how current fc transport code works when
 trying to delete remote port (blocks queue at first, and then unblock after
 fast fail tmo which make I/O fast fail in queuecommand) was good for what I
 was trying to do. But I do understand what you say, that decision should be
 made in upper layer.

> > 2. Schedule fc_rport_recover_transient_error from fc_timed_out using work
> >    queue if the feature is enabled. Also, make fc_timed_out return
> >    BLK_EH_HANDLED so as not to wake up error handler kernel thread.
> > 3. That workqueue calls transport template function recover_transient_error
> >    if LLD implements it. Otherwise, it simply calls fc_remote_port_delete
> >    and delete fibre channel remote port that corresponds to the SCSI target
> >    device that caused transient error.
> 
> In order to agree to such a patch, I would need to know, very clearly, what an
> LLDD is supposed to do in a "transient error" handler.  This was unspecified.

 I think the name "transient error" was difficult to understand. I meant to say
 transient error handler does whatever necessary in order to recover from transient
 error such as I/O latency. In multipath environment, it would be to quickly let
 multipath software recognize the broken path and discard it. In the case of fibre
 channel, deleting a remote port and avoiding unjam function would do it.

 Without such handler, normal scsi_times_out (unjam function) just takes too
 much time before it gets offlined and lets upper layer's multipath software
 discard it.

> I have a hard time agreeing with a default policy that says, just because a
> single i/o timed out, the entire target topology tree should be torn down. Due
> to the reasons for a timeout, it may require more than 1 before a pattern
> exists that says it should be considered "bad".  Mostly though - the topology
> tree is there to represent the connectivity on the FC fabric *as seen by the
> LLDD* and largely tracks to the LLDD discovery and login state.  Asynchronous
> teardown of this tree by an i/o timeout can leave a mismatch in the transport
> vs LLDD on the rport state (perhaps causing other errors) as well as forcing a
> condition where OS tools/admins viewing the sysfs tree - see a colored view of
> what the fabric connectivity actually is.
> 
> > 4. Once fc_remote_port_delete is called, it removes the remote port and
> >    take care of existing and incoming I/O just like when fibre channel
> >    linkdown occured.
> 
> Additionally, I think it's very odd to have a single i/o, which timed out,
> kill all other i/o's to all luns on that target. Given array implementations
> that may make lun relationships vary greatly (with preferred paths,
> distributed controller implementations), this is too broad a scope to imply.

 Agree, me myself thought it's odd to get rid of that remote port just because
 of a single I/O latency from fabric topology's point of view, although it did
 not have a problem after removing it (in my test environment).

> All of this is solved if you deal with it at the "device" level in the
> multipathing software.
> 
> 
> > 5. If fast io fail timeout is enabled, multipathd can quickly recognize
> >    disk I/O problem and make dm-mpath driver failover to secondary disk.
> >    Even if fast io fail timeout is disabled, multipathd can recognize it
> >    anyway after dev loss timeout expired.
> > 
> > In the current SCSI mid layer driver, SCSI command timeout wakes up error
> > handler kernel thread which takes quite long time depending on the imple-
> > mentation of LLD. Although waking up SCSI error handler is right thing to
> > do in most cases, I think it is not suitable for multipath environment
> > with requirement of quick response. Enabling recover_transient_feature
> > might help those who don't want recovery operation, but just quick failover.
> 
> Then it hints the error handler should be fixed....

 As you and Mike mentioned, implementing it on dm layer makes sense to me. 
 I think I'll post a patch to dm-devel and scsi mailing list before long.

Thanks,
Tomohiro Kusumi



(2010/02/13 0:27), James Smart wrote:
> Tomohiro Kusumi wrote:
>> Hi
>>
>> We've been working on SCSI-FC for enterprise system using MD/DMMP.
>> In enterprise system, response time from disk is important factor,
>> thus it is important for multipathd to quickly discard current path and
>> failover to secondary RAID disk if any problem with disk I/O is detected.
>> In order to switch to alternative path as quick as possible, multipathd
>> should quickly recognize phenomenon such as fibre channel link down,
>> no response from disk, etc.
>>
>> In the past, we've posted a patch that reduces response time from disk,
>> although it was a trial patch since there wasn't good framework to
>> implement those features. We did it in block layer and that wasn't
>> a good choice I guess.
>> http://marc.info/?l=linux-kernel&m=109598324018681&w=2
>>
>> But in the recent SCSI driver, transport layer for each lower level
>> interface is getting bigger and better which I think is a good platform
>> to implement them. As far as I know, Mr. Mike Christie has already been
>> working on fast io fail timeout feature for fibre channel transport layer,
>> and that enables userland multipathd quickly guess that the path is down
>> when fibre channel linkdown occured on LLD like lpfc. This patch is a
>> simple additional feature to what Mike has been working on.
>>
>> This is what I'm trying to do.
>> 1. If SCSI command had timed out, I assume it's time to failover to the
>>     secondary disk without error recovery. Let's call it transient error.
> 
> Link down is an indication of path connectivity loss, and connectivity loss is
> one one of the tasks of the transport - to isolate the upper layers from
> transient loss.  Mike's addition was appropriate as it changed the way i/o was
> dealt with while in one of the transient loss states.
> 
> But interpretation of an i/o completion status is a very different matter. The
> transport/LLDD shouldn't be making any inferences based on i/o completion
> state. That's for upper layers who better know the device and the task at hand
> to decide. The transport is simply tracking connectivity status *as driven by
> the LLDD*.
> 
> So, although I can understand that you would like to use latency as a path
> quality issue, I don't agree with making the transport be the one making a
> failover policy, even if the feature is optional. Failover policy choice is
> for the multipathing software.
> 
> Can you give me a reason why it is not addressed in multipathing layers ? Why
> isn't the upper layer monitoring latency, which doesn't have to be an i/o
> timeout, not tracked in the multipathing software.  The additional advantage
> of doing this (at the right level) is that this failover due to latency on a
> path, would apply to all transports.
> 
> 
>> 2. Schedule fc_rport_recover_transient_error from fc_timed_out using work
>>     queue if the feature is enabled. Also, make fc_timed_out return
>>     BLK_EH_HANDLED so as not to wake up error handler kernel thread.
>> 3. That workqueue calls transport template function recover_transient_error
>>     if LLD implements it. Otherwise, it simply calls fc_remote_port_delete
>>     and delete fibre channel remote port that corresponds to the SCSI target
>>     device that caused transient error.
> 
> In order to agree to such a patch, I would need to know, very clearly, what an
> LLDD is supposed to do in a "transient error" handler.  This was unspecified.
> 
> I have a hard time agreeing with a default policy that says, just because a
> single i/o timed out, the entire target topology tree should be torn down. Due
> to the reasons for a timeout, it may require more than 1 before a pattern
> exists that says it should be considered "bad".  Mostly though - the topology
> tree is there to represent the connectivity on the FC fabric *as seen by the
> LLDD* and largely tracks to the LLDD discovery and login state.  Asynchronous
> teardown of this tree by an i/o timeout can leave a mismatch in the transport
> vs LLDD on the rport state (perhaps causing other errors) as well as forcing a
> condition where OS tools/admins viewing the sysfs tree - see a colored view of
> what the fabric connectivity actually is.
> 
>> 4. Once fc_remote_port_delete is called, it removes the remote port and
>>     take care of existing and incoming I/O just like when fibre channel
>>     linkdown occured.
> 
> Additionally, I think it's very odd to have a single i/o, which timed out,
> kill all other i/o's to all luns on that target. Given array implementations
> that may make lun relationships vary greatly (with preferred paths,
> distributed controller implementations), this is too broad a scope to imply.
> 
> All of this is solved if you deal with it at the "device" level in the
> multipathing software.
> 
> 
>> 5. If fast io fail timeout is enabled, multipathd can quickly recognize
>>     disk I/O problem and make dm-mpath driver failover to secondary disk.
>>     Even if fast io fail timeout is disabled, multipathd can recognize it
>>     anyway after dev loss timeout expired.
>>
>> In the current SCSI mid layer driver, SCSI command timeout wakes up error
>> handler kernel thread which takes quite long time depending on the imple-
>> mentation of LLD. Although waking up SCSI error handler is right thing to
>> do in most cases, I think it is not suitable for multipath environment
>> with requirement of quick response. Enabling recover_transient_feature
>> might help those who don't want recovery operation, but just quick failover.
> 
> Then it hints the error handler should be fixed....
> 
> -- james s
> 
> 
> 



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] scsi_transport_fc: handle transient error on multipath environment
  2010-02-12  8:09 [PATCH] scsi_transport_fc: handle transient error on multipath environment Tomohiro Kusumi
  2010-02-12 15:27 ` James Smart
@ 2010-02-12 17:46 ` Mike Christie
  2010-02-12 18:03   ` Mike Christie
  1 sibling, 1 reply; 6+ messages in thread
From: Mike Christie @ 2010-02-12 17:46 UTC (permalink / raw)
  To: Tomohiro Kusumi; +Cc: linux-scsi, James.Bottomley, linux-kernel

On 02/12/2010 02:09 AM, Tomohiro Kusumi wrote:
> @@ -1953,6 +1987,13 @@
>   {
>   	struct fc_rport *rport = starget_to_rport(scsi_target(scmd->device));
> 
> +	if (rport->recover_transient_error) {
> +		fc_queue_work(scmd->device->host,&rport->rport_te_work);
> +		scmd->result = ((scmd->result&  0xFF00FFFF) |
> +				(DID_TRANSPORT_DISRUPTED<<  16));
> +		return BLK_EH_HANDLED;
> +	}
> +
>   	if (rport->port_state == FC_PORTSTATE_BLOCKED)
>   		return BLK_EH_RESET_TIMER;

- For the link down case you mentioned, would we see that the rport is
blocked here then we would return RESET_TIMER. If the fast_io_fail tmo
is set, then that would fail io quickly upwards (the fast io fail timo
would probably fire before the cmd even timed out).

What transport problems are you seeing where the rport is not blocked
and the scsi cmd timer fires? Would it be mostly buggy switches or
something like that?

- Maybe you want to instead hook something into the dm-mutlipath's
request (no more bios like in 2004 :)). Can you set a timer on that
level of request. If that times out then, dm-multipath could do
something like call blk_abort_queue.

I think the problem with blk_abort_queue might be that it stops all IO
to the entire host where you probably just want to work on the remote
port/path. For that you could call something like
recover_transient_error. Maybe it could just be a call to
terminate_rport_io from a workqueue though.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] scsi_transport_fc: handle transient error on multipath environment
  2010-02-12 17:46 ` Mike Christie
@ 2010-02-12 18:03   ` Mike Christie
  2010-04-01  8:42     ` Tomohiro Kusumi
  0 siblings, 1 reply; 6+ messages in thread
From: Mike Christie @ 2010-02-12 18:03 UTC (permalink / raw)
  To: Tomohiro Kusumi; +Cc: linux-scsi, James.Bottomley, linux-kernel

On 02/12/2010 11:46 AM, Mike Christie wrote:
> - Maybe you want to instead hook something into the dm-mutlipath's
> request (no more bios like in 2004 :)). Can you set a timer on that
> level of request. If that times out then, dm-multipath could do
> something like call blk_abort_queue.

Some more detail. I was thinking maybe you could stack the timeout
handlers like is done for request_fn handlers or maybe the scsi cmd
would use the upper layer's timer somehow. Not sure... but at the least
I think we would not want both a scsi request and dm request timers
running at the same time.

Then for the error handling and timeout handling, most FC drivers have a
terminate_rport_io which works without having to block the entire host.
Those drivers could implement a newer eh where instead of firing the
code in scsi_error.c when a cmd times out, it would run
terminate_rport_io from some workqueue.

new dm request timed out()
	-> scsi_timed_out
		-> fc_timed_out()
			{
				run new eh from workqueue();
			}


new_eh()
	/* no new cmds should be started until we figure out what is going on */
	block rport()
	/* releases cmds upwards so they can run while we try to figure out
what is going on */
	terminate_rport_io()
	/* check if devices are ok */
	send_tur()
	if (tur failed)
		start old scsi_error.c code to unjam us.
	else
		/* everything looks ok so let IO run to this path again */
		unblock rport()


> 
> I think the problem with blk_abort_queue might be that it stops all IO
> to the entire host where you probably just want to work on the remote
> port/path. For that you could call something like
> recover_transient_error. Maybe it could just be a call to
> terminate_rport_io from a workqueue though.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] scsi_transport_fc: handle transient error on multipath environment
  2010-02-12 18:03   ` Mike Christie
@ 2010-04-01  8:42     ` Tomohiro Kusumi
  0 siblings, 0 replies; 6+ messages in thread
From: Tomohiro Kusumi @ 2010-04-01  8:42 UTC (permalink / raw)
  To: Mike Christie; +Cc: linux-scsi, James.Bottomley, James Smart

Hi Mike,

Sorry for the late reply, and thanks for giving me advice.

> On 02/12/2010 11:46 AM, Mike Christie wrote:

> What transport problems are you seeing where the rport is not blocked
> and the scsi cmd timer fires? Would it be mostly buggy switches or
> something like that?

 When a disk is broken (broken FC switch is possible though) there is a case
 that SCSI command times out without blocking fibre channel remote port which
 simply means no response (no hardware interrupt on I/O completion) from the disk.
 In the current SCSI driver, that ends up fc_timed_out returning BLK_EH_NOT_HANDLED,
 and as a result it takes long time to get through unjam function of error handler.
 The older scsi driver has no transport timeout, but it basically does the same.

 The previous patch was to avoid unjam and let multipath software quickly discard
 the broken path, but as you and James had mentioned dm layer should handle I/O
 latency as path quality issue, not transport.

> > - Maybe you want to instead hook something into the dm-mutlipath's
> > request (no more bios like in 2004 :)). Can you set a timer on that
> > level of request. If that times out then, dm-multipath could do
> > something like call blk_abort_queue.
>
> Some more detail. I was thinking maybe you could stack the timeout
> handlers like is done for request_fn handlers or maybe the scsi cmd
> would use the upper layer's timer somehow. Not sure... but at the least
> I think we would not want both a scsi request and dm request timers
> running at the same time.

 I've been reading device mapper and dmsetup command code, trying to understand
 what you were saying. It seems there are two ways to do it in device mapper layer.
 One is to simply modify dm-mpath target driver, and let request_queue override the
 timeout handler and use different one instead of normal scsi timeout handler. 
 Another is to make something like dm-timeout target driver which stacks timeout
 functionality on top of another device mapper. First one looks much easier, but
 stacking another functionality on top of dm-mpath has more flexibility (if it's
 possible).

> Then for the error handling and timeout handling, most FC drivers have a
> terminate_rport_io which works without having to block the entire host.
> Those drivers could implement a newer eh where instead of firing the
> code in scsi_error.c when a cmd times out, it would run
> terminate_rport_io from some workqueue.
>
> new dm request timed out()
> 	-> scsi_timed_out
> 		-> fc_timed_out()
> 			{
> 				run new eh from workqueue();
> 			}
> 
> new_eh()
> 	/* no new cmds should be started until we figure out what is going on */
> 	block rport()
> 	/* releases cmds upwards so they can run while we try to figure out
> what is going on */
> 	terminate_rport_io()
> 	/* check if devices are ok */
> 	send_tur()
> 	if (tur failed)
> 		start old scsi_error.c code to unjam us.
> 	else
> 		/* everything looks ok so let IO run to this path again */
> 		unblock rport()

 Agree, even if I take care of I/O latency issue in dm timeout, I still have to
 add different timeout handler in SCSI/FC layer, and that might end up removing
 remote port like what I did in the previous patch (and I agree removing a fibre
 channel remote port just because of a single I/O latency is somewhat odd).

> > 
> > I think the problem with blk_abort_queue might be that it stops all IO
> > to the entire host where you probably just want to work on the remote
> > port/path. For that you could call something like
> > recover_transient_error. Maybe it could just be a call to
> > terminate_rport_io from a workqueue though.

 I think I'll post a patch to dm-devel and scsi mailing list before long.

Thanks,
Tomohiro Kusumi



(2010/02/13 3:03), Mike Christie wrote:
> On 02/12/2010 11:46 AM, Mike Christie wrote:
>> - Maybe you want to instead hook something into the dm-mutlipath's
>> request (no more bios like in 2004 :)). Can you set a timer on that
>> level of request. If that times out then, dm-multipath could do
>> something like call blk_abort_queue.
> 
> Some more detail. I was thinking maybe you could stack the timeout
> handlers like is done for request_fn handlers or maybe the scsi cmd
> would use the upper layer's timer somehow. Not sure... but at the least
> I think we would not want both a scsi request and dm request timers
> running at the same time.
> 
> Then for the error handling and timeout handling, most FC drivers have a
> terminate_rport_io which works without having to block the entire host.
> Those drivers could implement a newer eh where instead of firing the
> code in scsi_error.c when a cmd times out, it would run
> terminate_rport_io from some workqueue.
> 
> new dm request timed out()
> 	->  scsi_timed_out
> 		->  fc_timed_out()
> 			{
> 				run new eh from workqueue();
> 			}
> 
> 
> new_eh()
> 	/* no new cmds should be started until we figure out what is going on */
> 	block rport()
> 	/* releases cmds upwards so they can run while we try to figure out
> what is going on */
> 	terminate_rport_io()
> 	/* check if devices are ok */
> 	send_tur()
> 	if (tur failed)
> 		start old scsi_error.c code to unjam us.
> 	else
> 		/* everything looks ok so let IO run to this path again */
> 		unblock rport()
> 
> 
>>
>> I think the problem with blk_abort_queue might be that it stops all IO
>> to the entire host where you probably just want to work on the remote
>> port/path. For that you could call something like
>> recover_transient_error. Maybe it could just be a call to
>> terminate_rport_io from a workqueue though.
> 
> 



^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2010-04-01  8:42 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-02-12  8:09 [PATCH] scsi_transport_fc: handle transient error on multipath environment Tomohiro Kusumi
2010-02-12 15:27 ` James Smart
2010-04-01  8:42   ` Tomohiro Kusumi
2010-02-12 17:46 ` Mike Christie
2010-02-12 18:03   ` Mike Christie
2010-04-01  8:42     ` Tomohiro Kusumi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox