Re: [PATCH] libsas: fix error handling

public inbox for linux-scsi@vger.kernel.org
 help / color / mirror / Atom feed

* Re: [PATCH] libsas: fix error handling
@ 2008-02-22 12:19 Luben Tuikov
  0 siblings, 0 replies; 5+ messages in thread
From: Luben Tuikov @ 2008-02-22 12:19 UTC (permalink / raw)
  To: James Bottomley; +Cc: linux-scsi

--- On Thu, 2/21/08, James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
> Look, Luben; I don't mind you making an arse of
> yourself on the mailing
> lists; that's about par for the course nowadays.

No, you just did in this thread by pretending to be any kind of
authority or have an expertise on how to fix this problem, while in
fact to properly address it, one really needs knowledge of the
protocol, how the HW works and how the sequencer works, and
a protocol trace of this specific problem.  And then fixing the
problem after it's been fully understood is yet another feat.

It's unfortunate that you feel the need to attack me and call
me names.  You need to self-examine yourself and figure out
and fix your hostile attitude.

I've mostly stopped exposing your complete and utter incompetence
in SCSI infrastructure and left the weed grow, but calling me names
on public lists is very unprofessional.

Speaking of "arse", a recent example of one of your "pearls" is this:
http://marc.info/?l=linux-ide&m=120291905515015&w=2
"However, I'm happy to be proven wrong ... anyone on this thread is
 welcome to come up with a userland enclosure infrastructure.  Once it
 does everything the in-kernel one does (which is really about the
 minimal possible set), I'll be glad to erase the in-kernel one."

You introduce unnecessary code, the vendors tell you they don't
agree with you, you respond with "I'm happy to be proven wrong" in spite
of everyone's opinion differing yours.  You clearly decide to do what
you think is right without listening to vendors or people in the
industry.  Also, you completely ignore sg_ses(8).

There are other examples, this one is just more recent.

> However, failing to report serious bugs you knew about in
> your code, and
> have known about for two years, is tantamount to sabotage
> ... especially

In 2005, you decided to take a working code out of a GIT tree,
change it privately, and then submit it to GIT as new. Since of
course GIT history is immutable, there is no way to sync back to the
solution I had for users to inspect, evaluate and compare to your
derived code, since you took it out of GIT and changed it privately
before submitting it back to the kernel (I had a GIT repo with
already integrated working code, synced daily to Linus' kernel).  Then
a lot of people started asking me: "It doesn't work" and "SATA
doesn't work", etc, and I had to explain over and over again
what's happened.  So please do not talk about sabotage.

You changed the code quite a bit, to the point where patches
between what I maintain and what you decided to fork off
and make it your own play-ground were different and patches wouldn't
apply cleanly or at all.

For this thread, I've looked at the "error handling" code of what's
in the kernel for aic94xx and your version of the "libsas" and
frankly, I cannot submit patches to something which completely does
NOT follow the Sequencer Interface Spec for AIC94xx series of chips.
For example, handling of REQ_TASK_ABORT -- its implementation is
quite naive and out of spec, i.e. already broken.  I cannot submit
patches to something that I know is already broken, which had
been changed from a per-spec code version.

You know, people are still asking me to sync up to kernel
x.y.z the SAS/SATL+aic94xx I maintain, because they tell me "have been
running it for 2 years on production hardware [i.e. IBM] and no
problems and I just need to upgrade the kernel".

Will you also attack me for having a SATL as part of my SAS code
which I didn't submit to the kernel?  Do you know how much code
is out there that vendors don't bother to try to integrate because
of your attitude such as this?

> when you've been reading the bug reports on the mailing
> list.  I have to
> wonder how many more such bugs you've left in the code
> for users to
> find.

I didn't "leave" any bugs.  Bugs are just part of the development
cycle.  You should know this.

aic94xx is an old product, most customers have upgraded.
The ones who still have IBM servers with this chip, have had the
need to upgrade their kernels for various other reasons and since the
in-kernel solution has failed them, have requested to run my code.

> > BTW, the problem you're "debugging" may
> require a protocol
> > trace and a sequencer update by Adaptec.
> 
> I think you're fully aware that my test rig consists of
> a couple of SAS
> cards and drives donated by IBM and two expanders donated
> by LSI, plus a bit of equipment scavenged from Intel.

No, I'm not aware of this.  I have absolutely no interest in what your
"test rig" consists of.

Here is a question though: why do you even have SAS HW?  You don't
work for a SAS HW vendor and you don't have expertise in SAS
or SCSI in general.  Do you have hardware for each and every
LLDD that's in the kernel?

> I have no access to sophisticated tools like protocol analyzers.

See above.  You don't need to have access to such things unless
you worked for a vendor, and your job is to make your employer's
customers happy, by looking at traces everyday and fixing bugs.
And this itself isn't an easy matter, you need to talk to the
HW engineers, understand the protocol, understand the hardware,
how it implements the protocol, maybe need to read RTL code, all
things which are really out of your domain.

Now if you don't have access to such things, don't give "expert"
advice, and leave it to the respective vendor to get involved.
They have the equipment and the interest to debug their customers'
problems.  Then those vendors would submit patches for you
to integrate.

>  However, thanks for the advice and help.

Hey, no problem, anything for you James.

    Luben

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH] libsas: fix error handling
@ 2008-02-20 16:07 James Bottomley
  2008-02-21  5:51 ` Luben Tuikov
  2008-02-22 19:46 ` James Bottomley
  0 siblings, 2 replies; 5+ messages in thread
From: James Bottomley @ 2008-02-20 16:07 UTC (permalink / raw)
  To: linux-scsi

The libsas error handler has two fairly fatal bugs

1. scsi_sas_task_done calls scsi_eh_finish_cmd() too early.  This
   happens if the task completes after it has been aborted but before
   the error handler starts up.  Because scsi_eh_finish_cmd()
   decrements host_failed and adds the task to the done list, the
   error handler start check (host_failed == host_busy) never passes
   and the eh never starts.

2. The multiple task completion paths sas_scsi_clear_queue_... all
   simply delete the task from the error queue.  This causes it to
   disappear into the ether, since a command must be placed on the
   done queue to be finished off by the error handler.  This behaviour
   causes the HBA to hang on pending commands.

Fix 1. by removing the SAS_TASK_STATE_ABORTED check and calling
->scsi_done() unconditionally (it is a nop if the timer has fired).
This keeps the task in the error handling queue until the eh starts.

Fix 2. by making sure every task goes through task complete followed
by scsi_eh_finish_cmd().

Tested this by firing resets across a disk running a hammer test (now
it actually survives without hanging the system)

Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
---
 drivers/scsi/libsas/sas_scsi_host.c |   53 +++++++++++++++++++----------------
 1 files changed, 29 insertions(+), 24 deletions(-)

diff --git a/drivers/scsi/libsas/sas_scsi_host.c b/drivers/scsi/libsas/sas_scsi_host.c
index f869fba..b656e29 100644
--- a/drivers/scsi/libsas/sas_scsi_host.c
+++ b/drivers/scsi/libsas/sas_scsi_host.c
@@ -51,8 +51,6 @@ static void sas_scsi_task_done(struct sas_task *task)
 {
 	struct task_status_struct *ts = &task->task_status;
 	struct scsi_cmnd *sc = task->uldd_task;
-	struct sas_ha_struct *sas_ha = SHOST_TO_SAS_HA(sc->device->host);
-	unsigned ts_flags = task->task_state_flags;
 	int hs = 0, stat = 0;
 
 	if (unlikely(!sc)) {
@@ -120,11 +118,7 @@ static void sas_scsi_task_done(struct sas_task *task)
 	sc->result = (hs << 16) | stat;
 	list_del_init(&task->list);
 	sas_free_task(task);
-	/* This is very ugly but this is how SCSI Core works. */
-	if (ts_flags & SAS_TASK_STATE_ABORTED)
-		scsi_eh_finish_cmd(sc, &sas_ha->eh_done_q);
-	else
-		sc->scsi_done(sc);
+	sc->scsi_done(sc);
 }
 
 static enum task_attribute sas_scsi_get_task_attr(struct scsi_cmnd *cmd)
@@ -255,13 +249,27 @@ out:
 	return res;
 }
 
+static void sas_eh_finish_cmd(struct scsi_cmnd *cmd)
+{
+	struct sas_task *task = TO_SAS_TASK(cmd);
+	struct sas_ha_struct *sas_ha = SHOST_TO_SAS_HA(cmd->device->host);
+
+	/* First off call task_done.  However, task will
+	 * be free'd after this */
+	task->task_done(task);
+	/* now finish the command and move it on to the error
+	 * handler done list, this also takes it off the
+	 * error handler pending list */
+	scsi_eh_finish_cmd(cmd, &sas_ha->eh_done_q);
+}
+
 static void sas_scsi_clear_queue_lu(struct list_head *error_q, struct scsi_cmnd *my_cmd)
 {
 	struct scsi_cmnd *cmd, *n;
 
 	list_for_each_entry_safe(cmd, n, error_q, eh_entry) {
 		if (cmd == my_cmd)
-			list_del_init(&cmd->eh_entry);
+			sas_eh_finish_cmd(cmd);
 	}
 }
 
@@ -274,7 +282,7 @@ static void sas_scsi_clear_queue_I_T(struct list_head *error_q,
 		struct domain_device *x = cmd_to_domain_dev(cmd);
 
 		if (x == dev)
-			list_del_init(&cmd->eh_entry);
+			sas_eh_finish_cmd(cmd);
 	}
 }
 
@@ -288,7 +296,7 @@ static void sas_scsi_clear_queue_port(struct list_head *error_q,
 		struct asd_sas_port *x = dev->port;
 
 		if (x == port)
-			list_del_init(&cmd->eh_entry);
+			sas_eh_finish_cmd(cmd);
 	}
 }
 
@@ -528,14 +536,14 @@ Again:
 		case TASK_IS_DONE:
 			SAS_DPRINTK("%s: task 0x%p is done\n", __FUNCTION__,
 				    task);
-			task->task_done(task);
+			sas_eh_finish_cmd(cmd);
 			if (need_reset)
 				try_to_reset_cmd_device(shost, cmd);
 			continue;
 		case TASK_IS_ABORTED:
 			SAS_DPRINTK("%s: task 0x%p is aborted\n",
 				    __FUNCTION__, task);
-			task->task_done(task);
+			sas_eh_finish_cmd(cmd);
 			if (need_reset)
 				try_to_reset_cmd_device(shost, cmd);
 			continue;
@@ -547,7 +555,7 @@ Again:
 					    "recovered\n",
 					    SAS_ADDR(task->dev),
 					    cmd->device->lun);
-				task->task_done(task);
+				sas_eh_finish_cmd(cmd);
 				if (need_reset)
 					try_to_reset_cmd_device(shost, cmd);
 				sas_scsi_clear_queue_lu(work_q, cmd);
@@ -562,7 +570,7 @@ Again:
 			if (tmf_resp == TMF_RESP_FUNC_COMPLETE) {
 				SAS_DPRINTK("I_T %016llx recovered\n",
 					    SAS_ADDR(task->dev->sas_addr));
-				task->task_done(task);
+				sas_eh_finish_cmd(cmd);
 				if (need_reset)
 					try_to_reset_cmd_device(shost, cmd);
 				sas_scsi_clear_queue_I_T(work_q, task->dev);
@@ -577,7 +585,7 @@ Again:
 				if (res == TMF_RESP_FUNC_COMPLETE) {
 					SAS_DPRINTK("clear nexus port:%d "
 						    "succeeded\n", port->id);
-					task->task_done(task);
+					sas_eh_finish_cmd(cmd);
 					if (need_reset)
 						try_to_reset_cmd_device(shost, cmd);
 					sas_scsi_clear_queue_port(work_q,
@@ -591,10 +599,10 @@ Again:
 				if (res == TMF_RESP_FUNC_COMPLETE) {
 					SAS_DPRINTK("clear nexus ha "
 						    "succeeded\n");
-					task->task_done(task);
+					sas_eh_finish_cmd(cmd);
 					if (need_reset)
 						try_to_reset_cmd_device(shost, cmd);
-					goto out;
+					goto clear_q;
 				}
 			}
 			/* If we are here -- this means that no amount
@@ -606,21 +614,18 @@ Again:
 				    SAS_ADDR(task->dev->sas_addr),
 				    cmd->device->lun);
 
-			task->task_done(task);
+			sas_eh_finish_cmd(cmd);
 			if (need_reset)
 				try_to_reset_cmd_device(shost, cmd);
 			goto clear_q;
 		}
 	}
-out:
 	return list_empty(work_q);
 clear_q:
 	SAS_DPRINTK("--- Exit %s -- clear_q\n", __FUNCTION__);
-	list_for_each_entry_safe(cmd, n, work_q, eh_entry) {
-		struct sas_task *task = TO_SAS_TASK(cmd);
-		list_del_init(&cmd->eh_entry);
-		task->task_done(task);
-	}
+	list_for_each_entry_safe(cmd, n, work_q, eh_entry)
+		sas_eh_finish_cmd(cmd);
+
 	return list_empty(work_q);
 }
 
-- 
1.5.4.1




^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH] libsas: fix error handling
  2008-02-20 16:07 James Bottomley
@ 2008-02-21  5:51 ` Luben Tuikov
  2008-02-21 15:33   ` James Bottomley
  2008-02-22 19:46 ` James Bottomley
  1 sibling, 1 reply; 5+ messages in thread
From: Luben Tuikov @ 2008-02-21  5:51 UTC (permalink / raw)
  To: linux-scsi, James Bottomley

A similar change went into the SAS/SATL layer I support
shortly after I posted the version which you forked off.
The git date of the commit is: Tue Feb 7 22:14:54 2006 -0800.

BTW, the problem you're "debugging" may require a protocol
trace and a sequencer update by Adaptec.

    Luben

--- On Wed, 2/20/08, James Bottomley <James.Bottomley@HansenPartnership.com> wrote:

> From: James Bottomley <James.Bottomley@HansenPartnership.com>
> Subject: [PATCH] libsas: fix error handling
> To: "linux-scsi" <linux-scsi@vger.kernel.org>
> Date: Wednesday, February 20, 2008, 8:07 AM
> The libsas error handler has two fairly fatal bugs
> 
> 1. scsi_sas_task_done calls scsi_eh_finish_cmd() too early.
>  This
>    happens if the task completes after it has been aborted
> but before
>    the error handler starts up.  Because
> scsi_eh_finish_cmd()
>    decrements host_failed and adds the task to the done
> list, the
>    error handler start check (host_failed == host_busy)
> never passes
>    and the eh never starts.
> 
> 2. The multiple task completion paths
> sas_scsi_clear_queue_... all
>    simply delete the task from the error queue.  This
> causes it to
>    disappear into the ether, since a command must be placed
> on the
>    done queue to be finished off by the error handler. 
> This behaviour
>    causes the HBA to hang on pending commands.
> 
> Fix 1. by removing the SAS_TASK_STATE_ABORTED check and
> calling
> ->scsi_done() unconditionally (it is a nop if the timer
> has fired).
> This keeps the task in the error handling queue until the
> eh starts.
> 
> Fix 2. by making sure every task goes through task complete
> followed
> by scsi_eh_finish_cmd().
> 
> Tested this by firing resets across a disk running a hammer
> test (now
> it actually survives without hanging the system)
> 
> Signed-off-by: James Bottomley
> <James.Bottomley@HansenPartnership.com>
> ---
>  drivers/scsi/libsas/sas_scsi_host.c |   53
> +++++++++++++++++++----------------
>  1 files changed, 29 insertions(+), 24 deletions(-)
> 
> diff --git a/drivers/scsi/libsas/sas_scsi_host.c
> b/drivers/scsi/libsas/sas_scsi_host.c
> index f869fba..b656e29 100644
> --- a/drivers/scsi/libsas/sas_scsi_host.c
> +++ b/drivers/scsi/libsas/sas_scsi_host.c
> @@ -51,8 +51,6 @@ static void sas_scsi_task_done(struct
> sas_task *task)
>  {
>  	struct task_status_struct *ts =
> &task->task_status;
>  	struct scsi_cmnd *sc = task->uldd_task;
> -	struct sas_ha_struct *sas_ha =
> SHOST_TO_SAS_HA(sc->device->host);
> -	unsigned ts_flags = task->task_state_flags;
>  	int hs = 0, stat = 0;
>  
>  	if (unlikely(!sc)) {
> @@ -120,11 +118,7 @@ static void sas_scsi_task_done(struct
> sas_task *task)
>  	sc->result = (hs << 16) | stat;
>  	list_del_init(&task->list);
>  	sas_free_task(task);
> -	/* This is very ugly but this is how SCSI Core works. */
> -	if (ts_flags & SAS_TASK_STATE_ABORTED)
> -		scsi_eh_finish_cmd(sc, &sas_ha->eh_done_q);
> -	else
> -		sc->scsi_done(sc);
> +	sc->scsi_done(sc);
>  }
>  
>  static enum task_attribute sas_scsi_get_task_attr(struct
> scsi_cmnd *cmd)
> @@ -255,13 +249,27 @@ out:
>  	return res;
>  }
>  
> +static void sas_eh_finish_cmd(struct scsi_cmnd *cmd)
> +{
> +	struct sas_task *task = TO_SAS_TASK(cmd);
> +	struct sas_ha_struct *sas_ha =
> SHOST_TO_SAS_HA(cmd->device->host);
> +
> +	/* First off call task_done.  However, task will
> +	 * be free'd after this */
> +	task->task_done(task);
> +	/* now finish the command and move it on to the error
> +	 * handler done list, this also takes it off the
> +	 * error handler pending list */
> +	scsi_eh_finish_cmd(cmd, &sas_ha->eh_done_q);
> +}
> +
>  static void sas_scsi_clear_queue_lu(struct list_head
> *error_q, struct scsi_cmnd *my_cmd)
>  {
>  	struct scsi_cmnd *cmd, *n;
>  
>  	list_for_each_entry_safe(cmd, n, error_q, eh_entry) {
>  		if (cmd == my_cmd)
> -			list_del_init(&cmd->eh_entry);
> +			sas_eh_finish_cmd(cmd);
>  	}
>  }
>  
> @@ -274,7 +282,7 @@ static void
> sas_scsi_clear_queue_I_T(struct list_head *error_q,
>  		struct domain_device *x = cmd_to_domain_dev(cmd);
>  
>  		if (x == dev)
> -			list_del_init(&cmd->eh_entry);
> +			sas_eh_finish_cmd(cmd);
>  	}
>  }
>  
> @@ -288,7 +296,7 @@ static void
> sas_scsi_clear_queue_port(struct list_head *error_q,
>  		struct asd_sas_port *x = dev->port;
>  
>  		if (x == port)
> -			list_del_init(&cmd->eh_entry);
> +			sas_eh_finish_cmd(cmd);
>  	}
>  }
>  
> @@ -528,14 +536,14 @@ Again:
>  		case TASK_IS_DONE:
>  			SAS_DPRINTK("%s: task 0x%p is done\n",
> __FUNCTION__,
>  				    task);
> -			task->task_done(task);
> +			sas_eh_finish_cmd(cmd);
>  			if (need_reset)
>  				try_to_reset_cmd_device(shost, cmd);
>  			continue;
>  		case TASK_IS_ABORTED:
>  			SAS_DPRINTK("%s: task 0x%p is aborted\n",
>  				    __FUNCTION__, task);
> -			task->task_done(task);
> +			sas_eh_finish_cmd(cmd);
>  			if (need_reset)
>  				try_to_reset_cmd_device(shost, cmd);
>  			continue;
> @@ -547,7 +555,7 @@ Again:
>  					    "recovered\n",
>  					    SAS_ADDR(task->dev),
>  					    cmd->device->lun);
> -				task->task_done(task);
> +				sas_eh_finish_cmd(cmd);
>  				if (need_reset)
>  					try_to_reset_cmd_device(shost, cmd);
>  				sas_scsi_clear_queue_lu(work_q, cmd);
> @@ -562,7 +570,7 @@ Again:
>  			if (tmf_resp == TMF_RESP_FUNC_COMPLETE) {
>  				SAS_DPRINTK("I_T %016llx recovered\n",
>  					    SAS_ADDR(task->dev->sas_addr));
> -				task->task_done(task);
> +				sas_eh_finish_cmd(cmd);
>  				if (need_reset)
>  					try_to_reset_cmd_device(shost, cmd);
>  				sas_scsi_clear_queue_I_T(work_q, task->dev);
> @@ -577,7 +585,7 @@ Again:
>  				if (res == TMF_RESP_FUNC_COMPLETE) {
>  					SAS_DPRINTK("clear nexus port:%d "
>  						    "succeeded\n", port->id);
> -					task->task_done(task);
> +					sas_eh_finish_cmd(cmd);
>  					if (need_reset)
>  						try_to_reset_cmd_device(shost, cmd);
>  					sas_scsi_clear_queue_port(work_q,
> @@ -591,10 +599,10 @@ Again:
>  				if (res == TMF_RESP_FUNC_COMPLETE) {
>  					SAS_DPRINTK("clear nexus ha "
>  						    "succeeded\n");
> -					task->task_done(task);
> +					sas_eh_finish_cmd(cmd);
>  					if (need_reset)
>  						try_to_reset_cmd_device(shost, cmd);
> -					goto out;
> +					goto clear_q;
>  				}
>  			}
>  			/* If we are here -- this means that no amount
> @@ -606,21 +614,18 @@ Again:
>  				    SAS_ADDR(task->dev->sas_addr),
>  				    cmd->device->lun);
>  
> -			task->task_done(task);
> +			sas_eh_finish_cmd(cmd);
>  			if (need_reset)
>  				try_to_reset_cmd_device(shost, cmd);
>  			goto clear_q;
>  		}
>  	}
> -out:
>  	return list_empty(work_q);
>  clear_q:
>  	SAS_DPRINTK("--- Exit %s -- clear_q\n",
> __FUNCTION__);
> -	list_for_each_entry_safe(cmd, n, work_q, eh_entry) {
> -		struct sas_task *task = TO_SAS_TASK(cmd);
> -		list_del_init(&cmd->eh_entry);
> -		task->task_done(task);
> -	}
> +	list_for_each_entry_safe(cmd, n, work_q, eh_entry)
> +		sas_eh_finish_cmd(cmd);
> +
>  	return list_empty(work_q);
>  }
>  
> -- 
> 1.5.4.1
> 
> 
> 
> -
> To unsubscribe from this list: send the line
> "unsubscribe linux-scsi" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at 
> http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] libsas: fix error handling
  2008-02-21  5:51 ` Luben Tuikov
@ 2008-02-21 15:33   ` James Bottomley
  0 siblings, 0 replies; 5+ messages in thread
From: James Bottomley @ 2008-02-21 15:33 UTC (permalink / raw)
  To: ltuikov; +Cc: linux-scsi

On Wed, 2008-02-20 at 21:51 -0800, Luben Tuikov wrote:
> A similar change went into the SAS/SATL layer I support
> shortly after I posted the version which you forked off.
> The git date of the commit is: Tue Feb 7 22:14:54 2006 -0800.

Look, Luben; I don't mind you making an arse of yourself on the mailing
lists; that's about par for the course nowadays.

However, failing to report serious bugs you knew about in your code, and
have known about for two years, is tantamount to sabotage ... especially
when you've been reading the bug reports on the mailing list.  I have to
wonder how many more such bugs you've left in the code for users to
find.

> BTW, the problem you're "debugging" may require a protocol
> trace and a sequencer update by Adaptec.

I think you're fully aware that my test rig consists of a couple of SAS
cards and drives donated by IBM and two expanders donated by LSI, plus a
bit of equipment scavenged from Intel.  I have no access to
sophisticated tools like protocol analyzers.  However, thanks for the
advice and help.

James

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] libsas: fix error handling
  2008-02-20 16:07 James Bottomley
  2008-02-21  5:51 ` Luben Tuikov
@ 2008-02-22 19:46 ` James Bottomley
  1 sibling, 0 replies; 5+ messages in thread
From: James Bottomley @ 2008-02-22 19:46 UTC (permalink / raw)
  To: linux-scsi

On Wed, 2008-02-20 at 10:07 -0600, James Bottomley wrote:
> The libsas error handler has two fairly fatal bugs

A bug in this one was picked up in testing.  There's still a double
completion race from the time a task is aborted to the time the eh
actually starts.  If the task returns in that interval it will get
completed twice, so we key off the SAS_TASK_STATE_ABORTED to prevent
this (or at least narrow the race window to being almost untriggerable).

James

---

Subject: [SCSI] libsas: fix error handling

The libsas error handler has two fairly fatal bugs

1. scsi_sas_task_done calls scsi_eh_finish_cmd() too early.  This
   happens if the task completes after it has been aborted but before
   the error handler starts up.  Because scsi_eh_finish_cmd()
   decrements host_failed and adds the task to the done list, the
   error handler start check (host_failed == host_busy) never passes
   and the eh never starts.

2. The multiple task completion paths sas_scsi_clear_queue_... all
   simply delete the task from the error queue.  This causes it to
   disappear into the ether, since a command must be placed on the
   done queue to be finished off by the error handler.  This behaviour
   causes the HBA to hang on pending commands.

Fix 1. by moving the SAS_TASK_STATE_ABORTED check to an exit clause at
the top of the routine and calling ->scsi_done() unconditionally (it
is a nop if the timer has fired).  This keeps the task in the error
handling queue until the eh starts.

Fix 2. by making sure every task goes through task complete followed
by scsi_eh_finish_cmd().

Tested this by firing resets across a disk running a hammer test (now
it actually survives without hanging the system)

Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
---
 drivers/scsi/libsas/sas_scsi_host.c |   65 ++++++++++++++++++++++-------------
 1 files changed, 41 insertions(+), 24 deletions(-)

diff --git a/drivers/scsi/libsas/sas_scsi_host.c b/drivers/scsi/libsas/sas_scsi_host.c
index 583d249..b796d99 100644
--- a/drivers/scsi/libsas/sas_scsi_host.c
+++ b/drivers/scsi/libsas/sas_scsi_host.c
@@ -53,10 +53,14 @@ static void sas_scsi_task_done(struct sas_task *task)
 {
 	struct task_status_struct *ts = &task->task_status;
 	struct scsi_cmnd *sc = task->uldd_task;
-	struct sas_ha_struct *sas_ha = SHOST_TO_SAS_HA(sc->device->host);
-	unsigned ts_flags = task->task_state_flags;
 	int hs = 0, stat = 0;
 
+	if (unlikely(task->task_state_flags & SAS_TASK_STATE_ABORTED)) {
+		/* Aborted tasks will be completed by the error handler */
+		SAS_DPRINTK("task done but aborted\n");
+		return;
+	}
+
 	if (unlikely(!sc)) {
 		SAS_DPRINTK("task_done called with non existing SCSI cmnd!\n");
 		list_del_init(&task->list);
@@ -122,11 +126,7 @@ static void sas_scsi_task_done(struct sas_task *task)
 	sc->result = (hs << 16) | stat;
 	list_del_init(&task->list);
 	sas_free_task(task);
-	/* This is very ugly but this is how SCSI Core works. */
-	if (ts_flags & SAS_TASK_STATE_ABORTED)
-		scsi_eh_finish_cmd(sc, &sas_ha->eh_done_q);
-	else
-		sc->scsi_done(sc);
+	sc->scsi_done(sc);
 }
 
 static enum task_attribute sas_scsi_get_task_attr(struct scsi_cmnd *cmd)
@@ -257,13 +257,33 @@ out:
 	return res;
 }
 
+static void sas_eh_finish_cmd(struct scsi_cmnd *cmd)
+{
+	struct sas_task *task = TO_SAS_TASK(cmd);
+	struct sas_ha_struct *sas_ha = SHOST_TO_SAS_HA(cmd->device->host);
+
+	/* remove the aborted task flag to allow the task to be
+	 * completed now. At this point, we only get called following
+	 * an actual abort of the task, so we should be guaranteed not
+	 * to be racing with any completions from the LLD (hence we
+	 * don't need the task state lock to clear the flag) */
+	task->task_state_flags &= ~SAS_TASK_STATE_ABORTED;
+	/* Now call task_done.  However, task will be free'd after
+	 * this */
+	task->task_done(task);
+	/* now finish the command and move it on to the error
+	 * handler done list, this also takes it off the
+	 * error handler pending list */
+	scsi_eh_finish_cmd(cmd, &sas_ha->eh_done_q);
+}
+
 static void sas_scsi_clear_queue_lu(struct list_head *error_q, struct scsi_cmnd *my_cmd)
 {
 	struct scsi_cmnd *cmd, *n;
 
 	list_for_each_entry_safe(cmd, n, error_q, eh_entry) {
 		if (cmd == my_cmd)
-			list_del_init(&cmd->eh_entry);
+			sas_eh_finish_cmd(cmd);
 	}
 }
 
@@ -276,7 +296,7 @@ static void sas_scsi_clear_queue_I_T(struct list_head *error_q,
 		struct domain_device *x = cmd_to_domain_dev(cmd);
 
 		if (x == dev)
-			list_del_init(&cmd->eh_entry);
+			sas_eh_finish_cmd(cmd);
 	}
 }
 
@@ -290,7 +310,7 @@ static void sas_scsi_clear_queue_port(struct list_head *error_q,
 		struct asd_sas_port *x = dev->port;
 
 		if (x == port)
-			list_del_init(&cmd->eh_entry);
+			sas_eh_finish_cmd(cmd);
 	}
 }
 
@@ -530,14 +550,14 @@ Again:
 		case TASK_IS_DONE:
 			SAS_DPRINTK("%s: task 0x%p is done\n", __FUNCTION__,
 				    task);
-			task->task_done(task);
+			sas_eh_finish_cmd(cmd);
 			if (need_reset)
 				try_to_reset_cmd_device(shost, cmd);
 			continue;
 		case TASK_IS_ABORTED:
 			SAS_DPRINTK("%s: task 0x%p is aborted\n",
 				    __FUNCTION__, task);
-			task->task_done(task);
+			sas_eh_finish_cmd(cmd);
 			if (need_reset)
 				try_to_reset_cmd_device(shost, cmd);
 			continue;
@@ -549,7 +569,7 @@ Again:
 					    "recovered\n",
 					    SAS_ADDR(task->dev),
 					    cmd->device->lun);
-				task->task_done(task);
+				sas_eh_finish_cmd(cmd);
 				if (need_reset)
 					try_to_reset_cmd_device(shost, cmd);
 				sas_scsi_clear_queue_lu(work_q, cmd);
@@ -564,7 +584,7 @@ Again:
 			if (tmf_resp == TMF_RESP_FUNC_COMPLETE) {
 				SAS_DPRINTK("I_T %016llx recovered\n",
 					    SAS_ADDR(task->dev->sas_addr));
-				task->task_done(task);
+				sas_eh_finish_cmd(cmd);
 				if (need_reset)
 					try_to_reset_cmd_device(shost, cmd);
 				sas_scsi_clear_queue_I_T(work_q, task->dev);
@@ -579,7 +599,7 @@ Again:
 				if (res == TMF_RESP_FUNC_COMPLETE) {
 					SAS_DPRINTK("clear nexus port:%d "
 						    "succeeded\n", port->id);
-					task->task_done(task);
+					sas_eh_finish_cmd(cmd);
 					if (need_reset)
 						try_to_reset_cmd_device(shost, cmd);
 					sas_scsi_clear_queue_port(work_q,
@@ -593,10 +613,10 @@ Again:
 				if (res == TMF_RESP_FUNC_COMPLETE) {
 					SAS_DPRINTK("clear nexus ha "
 						    "succeeded\n");
-					task->task_done(task);
+					sas_eh_finish_cmd(cmd);
 					if (need_reset)
 						try_to_reset_cmd_device(shost, cmd);
-					goto out;
+					goto clear_q;
 				}
 			}
 			/* If we are here -- this means that no amount
@@ -608,21 +628,18 @@ Again:
 				    SAS_ADDR(task->dev->sas_addr),
 				    cmd->device->lun);
 
-			task->task_done(task);
+			sas_eh_finish_cmd(cmd);
 			if (need_reset)
 				try_to_reset_cmd_device(shost, cmd);
 			goto clear_q;
 		}
 	}
-out:
 	return list_empty(work_q);
 clear_q:
 	SAS_DPRINTK("--- Exit %s -- clear_q\n", __FUNCTION__);
-	list_for_each_entry_safe(cmd, n, work_q, eh_entry) {
-		struct sas_task *task = TO_SAS_TASK(cmd);
-		list_del_init(&cmd->eh_entry);
-		task->task_done(task);
-	}
+	list_for_each_entry_safe(cmd, n, work_q, eh_entry)
+		sas_eh_finish_cmd(cmd);
+
 	return list_empty(work_q);
 }
 
-- 
1.5.3.7




^ permalink raw reply related	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2008-02-22 19:46 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-02-22 12:19 [PATCH] libsas: fix error handling Luben Tuikov
  -- strict thread matches above, loose matches on Subject: below --
2008-02-20 16:07 James Bottomley
2008-02-21  5:51 ` Luben Tuikov
2008-02-21 15:33   ` James Bottomley
2008-02-22 19:46 ` James Bottomley

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox