linux-scsi.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] Make scsi error recovery play nice with devices blocked by transport
@ 2005-12-16 23:58 Michael Reed
  2005-12-28 16:04 ` James Smart
  0 siblings, 1 reply; 8+ messages in thread
From: Michael Reed @ 2005-12-16 23:58 UTC (permalink / raw)
  To: linux-scsi; +Cc: James.Smart, James Bottomley, Christoph Hellwig, Jeremy Higdon

[-- Attachment #1: Type: text/plain, Size: 1165 bytes --]

As no one has commented, I'm reposting this rfc as a patch.  I've
been using it for all my testing during development of the LSI MPT Fusion
fc transport attribute patch and it has shown no ill effect.

--

Error recovery doesn't interact very well with fc targets which
have been blocked by the fc transport.  Error recovery continues
to attempt to recover the target and ends up marking the fc target
offline.  Once offline, if the target returns before the remote port is
removed, commands which could have been successfully reissued instead
are completed with an error status due to the offline status
of the target.

This patch makes a couple of hopefully minor tweaks to the error
recovery logic to work better with targets which have been blocked
by the transport.

First, if the target is blocked and error recovery gives up, don't
put the device offline.  Either the transport will delete the target
thus disposing of any queued requests or it will unblock the target and
requests will be reissued.

Second, if a device is blocked, queue up commands being flushed from
the done queue for retry instead of completing them with an error.

Thanks,
 Mike Reed


[-- Attachment #2: scsi_fc_recovery.patch --]
[-- Type: text/x-patch, Size: 2164 bytes --]

diff -ru linux-2.6.15-rc5-git6-patched/drivers/scsi/scsi_error.c linux-2.6.15-rc5-git6-patched-mdr/drivers/scsi/scsi_error.c
--- linux-2.6.15-rc5-git6-patched/drivers/scsi/scsi_error.c	2005-12-16 16:48:19.000000000 -0600
+++ linux-2.6.15-rc5-git6-patched-mdr/drivers/scsi/scsi_error.c	2005-12-16 17:42:07.000000000 -0600
@@ -1130,10 +1130,14 @@
 	struct scsi_cmnd *scmd, *next;
 
 	list_for_each_entry_safe(scmd, next, work_q, eh_entry) {
-		sdev_printk(KERN_INFO, scmd->device,
-			    "scsi: Device offlined - not"
-			    " ready after error recovery\n");
-		scsi_device_set_state(scmd->device, SDEV_OFFLINE);
+		/* if blocked, transport will provide final device disposition */
+		if (!scsi_device_blocked(scmd->device)) {
+			sdev_printk(KERN_INFO, scmd->device,
+				    "scsi: Device offlined - not"
+				    " ready after error recovery\n");
+			scsi_device_set_state(scmd->device, SDEV_OFFLINE);
+		}
+
 		if (scmd->eh_eflags & SCSI_EH_CANCEL_CMD) {
 			/*
 			 * FIXME: Handle lost cmds.
@@ -1460,9 +1464,10 @@
 
 	list_for_each_entry_safe(scmd, next, done_q, eh_entry) {
 		list_del_init(&scmd->eh_entry);
-		if (scsi_device_online(scmd->device) &&
+		if (scsi_device_blocked(scmd->device) ||
+		    (scsi_device_online(scmd->device) &&
 		    !blk_noretry_request(scmd->request) &&
-		    (++scmd->retries < scmd->allowed)) {
+		    (++scmd->retries < scmd->allowed))) {
 			SCSI_LOG_ERROR_RECOVERY(3, printk("%s: flush"
 							  " retry cmd: %p\n",
 							  current->comm,
diff -ru linux-2.6.15-rc5-git6-patched/include/scsi/scsi_device.h linux-2.6.15-rc5-git6-patched-mdr/include/scsi/scsi_device.h
--- linux-2.6.15-rc5-git6-patched/include/scsi/scsi_device.h	2005-12-05 12:39:40.000000000 -0600
+++ linux-2.6.15-rc5-git6-patched-mdr/include/scsi/scsi_device.h	2005-12-16 17:42:07.000000000 -0600
@@ -275,6 +275,11 @@
 			    int data_direction, void *buffer, unsigned bufflen,
 			    struct scsi_sense_hdr *, int timeout, int retries);
 
+static inline int scsi_device_blocked(struct scsi_device *sdev)
+{
+	return sdev->sdev_state == SDEV_BLOCK;
+}
+
 static inline unsigned int sdev_channel(struct scsi_device *sdev)
 {
 	return sdev->channel;

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] Make scsi error recovery play nice with devices blocked by transport
  2005-12-16 23:58 [PATCH] Make scsi error recovery play nice with devices blocked by transport Michael Reed
@ 2005-12-28 16:04 ` James Smart
  2006-01-06 21:33   ` Michael Reed
  0 siblings, 1 reply; 8+ messages in thread
From: James Smart @ 2005-12-28 16:04 UTC (permalink / raw)
  To: Michael Reed
  Cc: linux-scsi, James Bottomley, Christoph Hellwig, Jeremy Higdon

This patch looks ok. It's certainly an issue that we (Emulex) have been
fighting, with the answer being : always ensure that the device i/o timeout
is 2x (+ some fudge?) the devloss_tmo. As we recommend setting the i/o
timeout to 60 seconds for enterprise arrays, and as devloss_tmo defaults
to 30, this generally worked.

One other point though that should be addressed. This doesn't affect the
successive "bigger hammer" recovery steps in scsi_eh_ready_devs().  We
should have the function recognize that the scsi_eh_stu(),
scsi_eh_bus_device_reset(), scsi_eh_bus_reset() functions can fail due to
a cable pull (e.g. devices being blocked), which is ok, and does not
require the next level of reset. Causing a host reset because of a
temporary cable pull, or a bus reset (all targets on a bus to be reset)
because an unrelated one temporarily went away - are not good things.

-- james s


Michael Reed wrote:
> As no one has commented, I'm reposting this rfc as a patch.  I've
> been using it for all my testing during development of the LSI MPT Fusion
> fc transport attribute patch and it has shown no ill effect.
> 
> --
> 
> Error recovery doesn't interact very well with fc targets which
> have been blocked by the fc transport.  Error recovery continues
> to attempt to recover the target and ends up marking the fc target
> offline.  Once offline, if the target returns before the remote port is
> removed, commands which could have been successfully reissued instead
> are completed with an error status due to the offline status
> of the target.
> 
> This patch makes a couple of hopefully minor tweaks to the error
> recovery logic to work better with targets which have been blocked
> by the transport.
> 
> First, if the target is blocked and error recovery gives up, don't
> put the device offline.  Either the transport will delete the target
> thus disposing of any queued requests or it will unblock the target and
> requests will be reissued.
> 
> Second, if a device is blocked, queue up commands being flushed from
> the done queue for retry instead of completing them with an error.
> 
> Thanks,
>  Mike Reed
> 
> 
> 
> ------------------------------------------------------------------------
> 
> diff -ru linux-2.6.15-rc5-git6-patched/drivers/scsi/scsi_error.c linux-2.6.15-rc5-git6-patched-mdr/drivers/scsi/scsi_error.c
> --- linux-2.6.15-rc5-git6-patched/drivers/scsi/scsi_error.c	2005-12-16 16:48:19.000000000 -0600
> +++ linux-2.6.15-rc5-git6-patched-mdr/drivers/scsi/scsi_error.c	2005-12-16 17:42:07.000000000 -0600
> @@ -1130,10 +1130,14 @@
>  	struct scsi_cmnd *scmd, *next;
>  
>  	list_for_each_entry_safe(scmd, next, work_q, eh_entry) {
> -		sdev_printk(KERN_INFO, scmd->device,
> -			    "scsi: Device offlined - not"
> -			    " ready after error recovery\n");
> -		scsi_device_set_state(scmd->device, SDEV_OFFLINE);
> +		/* if blocked, transport will provide final device disposition */
> +		if (!scsi_device_blocked(scmd->device)) {
> +			sdev_printk(KERN_INFO, scmd->device,
> +				    "scsi: Device offlined - not"
> +				    " ready after error recovery\n");
> +			scsi_device_set_state(scmd->device, SDEV_OFFLINE);
> +		}
> +
>  		if (scmd->eh_eflags & SCSI_EH_CANCEL_CMD) {
>  			/*
>  			 * FIXME: Handle lost cmds.
> @@ -1460,9 +1464,10 @@
>  
>  	list_for_each_entry_safe(scmd, next, done_q, eh_entry) {
>  		list_del_init(&scmd->eh_entry);
> -		if (scsi_device_online(scmd->device) &&
> +		if (scsi_device_blocked(scmd->device) ||
> +		    (scsi_device_online(scmd->device) &&
>  		    !blk_noretry_request(scmd->request) &&
> -		    (++scmd->retries < scmd->allowed)) {
> +		    (++scmd->retries < scmd->allowed))) {
>  			SCSI_LOG_ERROR_RECOVERY(3, printk("%s: flush"
>  							  " retry cmd: %p\n",
>  							  current->comm,
> diff -ru linux-2.6.15-rc5-git6-patched/include/scsi/scsi_device.h linux-2.6.15-rc5-git6-patched-mdr/include/scsi/scsi_device.h
> --- linux-2.6.15-rc5-git6-patched/include/scsi/scsi_device.h	2005-12-05 12:39:40.000000000 -0600
> +++ linux-2.6.15-rc5-git6-patched-mdr/include/scsi/scsi_device.h	2005-12-16 17:42:07.000000000 -0600
> @@ -275,6 +275,11 @@
>  			    int data_direction, void *buffer, unsigned bufflen,
>  			    struct scsi_sense_hdr *, int timeout, int retries);
>  
> +static inline int scsi_device_blocked(struct scsi_device *sdev)
> +{
> +	return sdev->sdev_state == SDEV_BLOCK;
> +}
> +
>  static inline unsigned int sdev_channel(struct scsi_device *sdev)
>  {
>  	return sdev->channel;

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] Make scsi error recovery play nice with devices blocked by transport
  2005-12-28 16:04 ` James Smart
@ 2006-01-06 21:33   ` Michael Reed
  2006-01-09 15:01     ` James Smart
  0 siblings, 1 reply; 8+ messages in thread
From: Michael Reed @ 2006-01-06 21:33 UTC (permalink / raw)
  To: James.Smart; +Cc: linux-scsi, James Bottomley, Christoph Hellwig, Jeremy Higdon



James Smart wrote:
> This patch looks ok. It's certainly an issue that we (Emulex) have been
> fighting, with the answer being : always ensure that the device i/o timeout
> is 2x (+ some fudge?) the devloss_tmo. As we recommend setting the i/o
> timeout to 60 seconds for enterprise arrays, and as devloss_tmo defaults
> to 30, this generally worked.
> 
> One other point though that should be addressed. This doesn't affect the
> successive "bigger hammer" recovery steps in scsi_eh_ready_devs().  We
> should have the function recognize that the scsi_eh_stu(),
> scsi_eh_bus_device_reset(), scsi_eh_bus_reset() functions can fail due to
> a cable pull (e.g. devices being blocked), which is ok, and does not
> require the next level of reset. Causing a host reset because of a
> temporary cable pull, or a bus reset (all targets on a bus to be reset)
> because an unrelated one temporarily went away - are not good things.

Well, a device can go away for reasons other than a cable pull.
Sometimes (in my experience with SGI's driver for qlogic fc cards)
harder resets can bring it back whereas waiting won't.  Think firmware
problems....

I think letting the harder resets happen is a good thing (or at least
not a bad thing) as long as recovery waits for the driver to report that
the drive is gone (offline).

Mike

> 
> -- james s
> 
> 
> Michael Reed wrote:
>>As no one has commented, I'm reposting this rfc as a patch.  I've
>>been using it for all my testing during development of the LSI MPT Fusion
>>fc transport attribute patch and it has shown no ill effect.
>>
>>--
>>
>>Error recovery doesn't interact very well with fc targets which
>>have been blocked by the fc transport.  Error recovery continues
>>to attempt to recover the target and ends up marking the fc target
>>offline.  Once offline, if the target returns before the remote port is
>>removed, commands which could have been successfully reissued instead
>>are completed with an error status due to the offline status
>>of the target.
>>
>>This patch makes a couple of hopefully minor tweaks to the error
>>recovery logic to work better with targets which have been blocked
>>by the transport.
>>
>>First, if the target is blocked and error recovery gives up, don't
>>put the device offline.  Either the transport will delete the target
>>thus disposing of any queued requests or it will unblock the target and
>>requests will be reissued.
>>
>>Second, if a device is blocked, queue up commands being flushed from
>>the done queue for retry instead of completing them with an error.
>>
>>Thanks,
>> Mike Reed
>>
>>
>>
>>------------------------------------------------------------------------
>>
>>diff -ru linux-2.6.15-rc5-git6-patched/drivers/scsi/scsi_error.c linux-2.6.15-rc5-git6-patched-mdr/drivers/scsi/scsi_error.c
>>--- linux-2.6.15-rc5-git6-patched/drivers/scsi/scsi_error.c	2005-12-16 16:48:19.000000000 -0600
>>+++ linux-2.6.15-rc5-git6-patched-mdr/drivers/scsi/scsi_error.c	2005-12-16 17:42:07.000000000 -0600
>>@@ -1130,10 +1130,14 @@
>> 	struct scsi_cmnd *scmd, *next;
>> 
>> 	list_for_each_entry_safe(scmd, next, work_q, eh_entry) {
>>-		sdev_printk(KERN_INFO, scmd->device,
>>-			    "scsi: Device offlined - not"
>>-			    " ready after error recovery\n");
>>-		scsi_device_set_state(scmd->device, SDEV_OFFLINE);
>>+		/* if blocked, transport will provide final device disposition */
>>+		if (!scsi_device_blocked(scmd->device)) {
>>+			sdev_printk(KERN_INFO, scmd->device,
>>+				    "scsi: Device offlined - not"
>>+				    " ready after error recovery\n");
>>+			scsi_device_set_state(scmd->device, SDEV_OFFLINE);
>>+		}
>>+
>> 		if (scmd->eh_eflags & SCSI_EH_CANCEL_CMD) {
>> 			/*
>> 			 * FIXME: Handle lost cmds.
>>@@ -1460,9 +1464,10 @@
>> 
>> 	list_for_each_entry_safe(scmd, next, done_q, eh_entry) {
>> 		list_del_init(&scmd->eh_entry);
>>-		if (scsi_device_online(scmd->device) &&
>>+		if (scsi_device_blocked(scmd->device) ||
>>+		    (scsi_device_online(scmd->device) &&
>> 		    !blk_noretry_request(scmd->request) &&
>>-		    (++scmd->retries < scmd->allowed)) {
>>+		    (++scmd->retries < scmd->allowed))) {
>> 			SCSI_LOG_ERROR_RECOVERY(3, printk("%s: flush"
>> 							  " retry cmd: %p\n",
>> 							  current->comm,
>>diff -ru linux-2.6.15-rc5-git6-patched/include/scsi/scsi_device.h linux-2.6.15-rc5-git6-patched-mdr/include/scsi/scsi_device.h
>>--- linux-2.6.15-rc5-git6-patched/include/scsi/scsi_device.h	2005-12-05 12:39:40.000000000 -0600
>>+++ linux-2.6.15-rc5-git6-patched-mdr/include/scsi/scsi_device.h	2005-12-16 17:42:07.000000000 -0600
>>@@ -275,6 +275,11 @@
>> 			    int data_direction, void *buffer, unsigned bufflen,
>> 			    struct scsi_sense_hdr *, int timeout, int retries);
>> 
>>+static inline int scsi_device_blocked(struct scsi_device *sdev)
>>+{
>>+	return sdev->sdev_state == SDEV_BLOCK;
>>+}
>>+
>> static inline unsigned int sdev_channel(struct scsi_device *sdev)
>> {
>> 	return sdev->channel;
> 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] Make scsi error recovery play nice with devices blocked by transport
  2006-01-06 21:33   ` Michael Reed
@ 2006-01-09 15:01     ` James Smart
  2006-01-09 15:39       ` James Bottomley
  0 siblings, 1 reply; 8+ messages in thread
From: James Smart @ 2006-01-09 15:01 UTC (permalink / raw)
  To: Michael Reed
  Cc: linux-scsi, James Bottomley, Christoph Hellwig, Jeremy Higdon



Michael Reed wrote:
> 
> James Smart wrote:
>> This patch looks ok. It's certainly an issue that we (Emulex) have been
>> fighting, with the answer being : always ensure that the device i/o timeout
>> is 2x (+ some fudge?) the devloss_tmo. As we recommend setting the i/o
>> timeout to 60 seconds for enterprise arrays, and as devloss_tmo defaults
>> to 30, this generally worked.
>>
>> One other point though that should be addressed. This doesn't affect the
>> successive "bigger hammer" recovery steps in scsi_eh_ready_devs().  We
>> should have the function recognize that the scsi_eh_stu(),
>> scsi_eh_bus_device_reset(), scsi_eh_bus_reset() functions can fail due to
>> a cable pull (e.g. devices being blocked), which is ok, and does not
>> require the next level of reset. Causing a host reset because of a
>> temporary cable pull, or a bus reset (all targets on a bus to be reset)
>> because an unrelated one temporarily went away - are not good things.
> 
> Well, a device can go away for reasons other than a cable pull.
> Sometimes (in my experience with SGI's driver for qlogic fc cards)
> harder resets can bring it back whereas waiting won't.  Think firmware
> problems....

True - which is why you let this error path kick in when the device is not
in the blocked state.

> I think letting the harder resets happen is a good thing (or at least
> not a bad thing) as long as recovery waits for the driver to report that
> the drive is gone (offline).

Well, in thinking through this further after my initial reply...

I think we really do want to leave scsi_eh_ready_devs() logic with the bigger
hammer steps alone. Ultimately, they are trying to regain the resources for an
i/o that is trying to be killed but the LLDD (or device) isn't cooperating.
I still believe in not resetting everyone just because a device is temporarily
blocked. However, we need to intercept it at a earlier point... Ultimately,
to reach this path, it starts with an i/o timing out, and the eh_abort handler
failing. In Emulex's case, we are planning on never failing the eh_abort
handler if we're in this temporarily blocked state, even at the expense of a
long wait. This is actually too much to ask of an LLDD - and is hokey. The
logic really should be to intercept the timeout handler, note that the device
is blocked, and delay the abort request until the device has been given a
chance to return (e.g. just restart the i/o abort timer for the amount of 
devloss_tmo that remains). Otherwise, we're always guaranteeing a failure from
the abort handlers (for i/o and device) as there's no device to talk to.

This should remove the need for your if-blocked test in scsi_error.c,
replacing it with the logic in the i/o timeout handler.

-- james s


> 
> Mike
> 
>> -- james s
>>
>>
>> Michael Reed wrote:
>>> As no one has commented, I'm reposting this rfc as a patch.  I've
>>> been using it for all my testing during development of the LSI MPT Fusion
>>> fc transport attribute patch and it has shown no ill effect.
>>>
>>> --
>>>
>>> Error recovery doesn't interact very well with fc targets which
>>> have been blocked by the fc transport.  Error recovery continues
>>> to attempt to recover the target and ends up marking the fc target
>>> offline.  Once offline, if the target returns before the remote port is
>>> removed, commands which could have been successfully reissued instead
>>> are completed with an error status due to the offline status
>>> of the target.
>>>
>>> This patch makes a couple of hopefully minor tweaks to the error
>>> recovery logic to work better with targets which have been blocked
>>> by the transport.
>>>
>>> First, if the target is blocked and error recovery gives up, don't
>>> put the device offline.  Either the transport will delete the target
>>> thus disposing of any queued requests or it will unblock the target and
>>> requests will be reissued.
>>>
>>> Second, if a device is blocked, queue up commands being flushed from
>>> the done queue for retry instead of completing them with an error.
>>>
>>> Thanks,
>>> Mike Reed
>>>
>>>
>>>
>>> ------------------------------------------------------------------------
>>>
>>> diff -ru linux-2.6.15-rc5-git6-patched/drivers/scsi/scsi_error.c linux-2.6.15-rc5-git6-patched-mdr/drivers/scsi/scsi_error.c
>>> --- linux-2.6.15-rc5-git6-patched/drivers/scsi/scsi_error.c	2005-12-16 16:48:19.000000000 -0600
>>> +++ linux-2.6.15-rc5-git6-patched-mdr/drivers/scsi/scsi_error.c	2005-12-16 17:42:07.000000000 -0600
>>> @@ -1130,10 +1130,14 @@
>>> 	struct scsi_cmnd *scmd, *next;
>>>
>>> 	list_for_each_entry_safe(scmd, next, work_q, eh_entry) {
>>> -		sdev_printk(KERN_INFO, scmd->device,
>>> -			    "scsi: Device offlined - not"
>>> -			    " ready after error recovery\n");
>>> -		scsi_device_set_state(scmd->device, SDEV_OFFLINE);
>>> +		/* if blocked, transport will provide final device disposition */
>>> +		if (!scsi_device_blocked(scmd->device)) {
>>> +			sdev_printk(KERN_INFO, scmd->device,
>>> +				    "scsi: Device offlined - not"
>>> +				    " ready after error recovery\n");
>>> +			scsi_device_set_state(scmd->device, SDEV_OFFLINE);
>>> +		}
>>> +
>>> 		if (scmd->eh_eflags & SCSI_EH_CANCEL_CMD) {
>>> 			/*
>>> 			 * FIXME: Handle lost cmds.
>>> @@ -1460,9 +1464,10 @@
>>>
>>> 	list_for_each_entry_safe(scmd, next, done_q, eh_entry) {
>>> 		list_del_init(&scmd->eh_entry);
>>> -		if (scsi_device_online(scmd->device) &&
>>> +		if (scsi_device_blocked(scmd->device) ||
>>> +		    (scsi_device_online(scmd->device) &&
>>> 		    !blk_noretry_request(scmd->request) &&
>>> -		    (++scmd->retries < scmd->allowed)) {
>>> +		    (++scmd->retries < scmd->allowed))) {
>>> 			SCSI_LOG_ERROR_RECOVERY(3, printk("%s: flush"
>>> 							  " retry cmd: %p\n",
>>> 							  current->comm,
>>> diff -ru linux-2.6.15-rc5-git6-patched/include/scsi/scsi_device.h linux-2.6.15-rc5-git6-patched-mdr/include/scsi/scsi_device.h
>>> --- linux-2.6.15-rc5-git6-patched/include/scsi/scsi_device.h	2005-12-05 12:39:40.000000000 -0600
>>> +++ linux-2.6.15-rc5-git6-patched-mdr/include/scsi/scsi_device.h	2005-12-16 17:42:07.000000000 -0600
>>> @@ -275,6 +275,11 @@
>>> 			    int data_direction, void *buffer, unsigned bufflen,
>>> 			    struct scsi_sense_hdr *, int timeout, int retries);
>>>
>>> +static inline int scsi_device_blocked(struct scsi_device *sdev)
>>> +{
>>> +	return sdev->sdev_state == SDEV_BLOCK;
>>> +}
>>> +
>>> static inline unsigned int sdev_channel(struct scsi_device *sdev)
>>> {
>>> 	return sdev->channel;
> 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] Make scsi error recovery play nice with devices blocked by transport
  2006-01-09 15:01     ` James Smart
@ 2006-01-09 15:39       ` James Bottomley
  2006-01-13 19:29         ` Michael Reed
  0 siblings, 1 reply; 8+ messages in thread
From: James Bottomley @ 2006-01-09 15:39 UTC (permalink / raw)
  To: James.Smart; +Cc: Michael Reed, linux-scsi, Christoph Hellwig, Jeremy Higdon

On Mon, 2006-01-09 at 10:01 -0500, James Smart wrote:
> > I think letting the harder resets happen is a good thing (or at least
> > not a bad thing) as long as recovery waits for the driver to report that
> > the drive is gone (offline).
> 
> Well, in thinking through this further after my initial reply...
> 
> I think we really do want to leave scsi_eh_ready_devs() logic with the bigger
> hammer steps alone. Ultimately, they are trying to regain the resources for an
> i/o that is trying to be killed but the LLDD (or device) isn't cooperating.
> I still believe in not resetting everyone just because a device is temporarily
> blocked. However, we need to intercept it at a earlier point... Ultimately,
> to reach this path, it starts with an i/o timing out, and the eh_abort handler
> failing. In Emulex's case, we are planning on never failing the eh_abort
> handler if we're in this temporarily blocked state, even at the expense of a
> long wait. This is actually too much to ask of an LLDD - and is hokey. The
> logic really should be to intercept the timeout handler, note that the device
> is blocked, and delay the abort request until the device has been given a
> chance to return (e.g. just restart the i/o abort timer for the amount of 
> devloss_tmo that remains). Otherwise, we're always guaranteeing a failure from
> the abort handlers (for i/o and device) as there's no device to talk to.
> 
> This should remove the need for your if-blocked test in scsi_error.c,
> replacing it with the logic in the i/o timeout handler.

Actually, there is another thing you can do even earlier:  implement
scsi_eh_timer_return() in the host template (probably with a generic
routine from the fc class).  This would allow you to hold off the
timeout at least for the length of the user specified timeout and all
the retries.  Probably the routine would simply check to see if the
device is in a devloss timeout and if it is return EH_RESET_TIMER;
otherwise return EH_NOT_HANDLED.

James



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] Make scsi error recovery play nice with devices blocked by transport
  2006-01-09 15:39       ` James Bottomley
@ 2006-01-13 19:29         ` Michael Reed
  2006-01-13 19:38           ` Christoph Hellwig
  0 siblings, 1 reply; 8+ messages in thread
From: Michael Reed @ 2006-01-13 19:29 UTC (permalink / raw)
  To: James Bottomley
  Cc: James.Smart, linux-scsi, Christoph Hellwig, Jeremy Higdon,
	Gary Hagensen

I'm planning on working on this once the fusion fc_transport code finally
makes it in.  What do you see as a deadline for code addressing this problem?

Perhaps someone else already has something to address this in progress?
It's a significant issue with our fibre channel customers.

James Bottomley wrote:
> On Mon, 2006-01-09 at 10:01 -0500, James Smart wrote:
>>>I think letting the harder resets happen is a good thing (or at least
>>>not a bad thing) as long as recovery waits for the driver to report that
>>>the drive is gone (offline).
>>Well, in thinking through this further after my initial reply...
>>
>>I think we really do want to leave scsi_eh_ready_devs() logic with the bigger
>>hammer steps alone. Ultimately, they are trying to regain the resources for an
>>i/o that is trying to be killed but the LLDD (or device) isn't cooperating.
>>I still believe in not resetting everyone just because a device is temporarily
>>blocked. However, we need to intercept it at a earlier point... Ultimately,
>>to reach this path, it starts with an i/o timing out, and the eh_abort handler
>>failing. In Emulex's case, we are planning on never failing the eh_abort
>>handler if we're in this temporarily blocked state, even at the expense of a
>>long wait. This is actually too much to ask of an LLDD - and is hokey. The
>>logic really should be to intercept the timeout handler, note that the device
>>is blocked, and delay the abort request until the device has been given a
>>chance to return (e.g. just restart the i/o abort timer for the amount of 
>>devloss_tmo that remains). Otherwise, we're always guaranteeing a failure from
>>the abort handlers (for i/o and device) as there's no device to talk to.
>>
>>This should remove the need for your if-blocked test in scsi_error.c,
>>replacing it with the logic in the i/o timeout handler.

This makes sense.  While there are times when the big hammer might
have results, properly operating firmware should in make that the exception.

> 
> Actually, there is another thing you can do even earlier:  implement
> scsi_eh_timer_return() in the host template (probably with a generic
> routine from the fc class).  This would allow you to hold off the
> timeout at least for the length of the user specified timeout and all
> the retries.  Probably the routine would simply check to see if the
> device is in a devloss timeout and if it is return EH_RESET_TIMER;
> otherwise return EH_NOT_HANDLED.

We have to be able to handle the case that error recovery gets started
before the device is blocked by the transport as well as the other way
around.  Not being fluent in the timeout / error handling code, do you
see this suggestion being able to handle both cases?  SMOP?

Other ideas?

Thanks,
 Mike


> 
> James
> 
> 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] Make scsi error recovery play nice with devices blocked by transport
  2006-01-13 19:29         ` Michael Reed
@ 2006-01-13 19:38           ` Christoph Hellwig
  2006-01-13 19:50             ` Michael Reed
  0 siblings, 1 reply; 8+ messages in thread
From: Christoph Hellwig @ 2006-01-13 19:38 UTC (permalink / raw)
  To: Michael Reed
  Cc: James Bottomley, James.Smart, linux-scsi, Christoph Hellwig,
	Jeremy Higdon, Gary Hagensen

On Fri, Jan 13, 2006 at 01:29:56PM -0600, Michael Reed wrote:
> I'm planning on working on this once the fusion fc_transport code finally
> makes it in.  What do you see as a deadline for code addressing this problem?
> 
> Perhaps someone else already has something to address this in progress?
> It's a significant issue with our fibre channel customers.

There's not at all.  Sunday is feature deadline for 2.6.16, but we'll
happily accept everything good for 2.6.17.  OTOH this probably falls
into fixes that can go in after -rc1 anyway.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] Make scsi error recovery play nice with devices blocked by transport
  2006-01-13 19:38           ` Christoph Hellwig
@ 2006-01-13 19:50             ` Michael Reed
  0 siblings, 0 replies; 8+ messages in thread
From: Michael Reed @ 2006-01-13 19:50 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: James Bottomley, James.Smart, linux-scsi, Jeremy Higdon,
	Gary Hagensen



Christoph Hellwig wrote:
> On Fri, Jan 13, 2006 at 01:29:56PM -0600, Michael Reed wrote:
>>I'm planning on working on this once the fusion fc_transport code finally
>>makes it in.  What do you see as a deadline for code addressing this problem?
>>
>>Perhaps someone else already has something to address this in progress?
>>It's a significant issue with our fibre channel customers.
> 
> There's not at all.  Sunday is feature deadline for 2.6.16, but we'll
> happily accept everything good for 2.6.17.  OTOH this probably falls
> into fixes that can go in after -rc1 anyway.
> 

I was hoping that was the case.  It's pretty broken as it is today.

Mike

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2006-01-13 19:50 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2005-12-16 23:58 [PATCH] Make scsi error recovery play nice with devices blocked by transport Michael Reed
2005-12-28 16:04 ` James Smart
2006-01-06 21:33   ` Michael Reed
2006-01-09 15:01     ` James Smart
2006-01-09 15:39       ` James Bottomley
2006-01-13 19:29         ` Michael Reed
2006-01-13 19:38           ` Christoph Hellwig
2006-01-13 19:50             ` Michael Reed

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).