public inbox for linux-scsi@vger.kernel.org
 help / color / mirror / Atom feed
* Re: [PATCH] libsas: fix error handling
@ 2008-02-22 12:19 Luben Tuikov
  0 siblings, 0 replies; 5+ messages in thread
From: Luben Tuikov @ 2008-02-22 12:19 UTC (permalink / raw)
  To: James Bottomley; +Cc: linux-scsi

--- On Thu, 2/21/08, James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
> Look, Luben; I don't mind you making an arse of
> yourself on the mailing
> lists; that's about par for the course nowadays.

No, you just did in this thread by pretending to be any kind of
authority or have an expertise on how to fix this problem, while in
fact to properly address it, one really needs knowledge of the
protocol, how the HW works and how the sequencer works, and
a protocol trace of this specific problem.  And then fixing the
problem after it's been fully understood is yet another feat.

It's unfortunate that you feel the need to attack me and call
me names.  You need to self-examine yourself and figure out
and fix your hostile attitude.

I've mostly stopped exposing your complete and utter incompetence
in SCSI infrastructure and left the weed grow, but calling me names
on public lists is very unprofessional.

Speaking of "arse", a recent example of one of your "pearls" is this:
http://marc.info/?l=linux-ide&m=120291905515015&w=2
"However, I'm happy to be proven wrong ... anyone on this thread is
 welcome to come up with a userland enclosure infrastructure.  Once it
 does everything the in-kernel one does (which is really about the
 minimal possible set), I'll be glad to erase the in-kernel one."

You introduce unnecessary code, the vendors tell you they don't
agree with you, you respond with "I'm happy to be proven wrong" in spite
of everyone's opinion differing yours.  You clearly decide to do what
you think is right without listening to vendors or people in the
industry.  Also, you completely ignore sg_ses(8).

There are other examples, this one is just more recent.

> However, failing to report serious bugs you knew about in
> your code, and
> have known about for two years, is tantamount to sabotage
> ... especially

In 2005, you decided to take a working code out of a GIT tree,
change it privately, and then submit it to GIT as new. Since of
course GIT history is immutable, there is no way to sync back to the
solution I had for users to inspect, evaluate and compare to your
derived code, since you took it out of GIT and changed it privately
before submitting it back to the kernel (I had a GIT repo with
already integrated working code, synced daily to Linus' kernel).  Then
a lot of people started asking me: "It doesn't work" and "SATA
doesn't work", etc, and I had to explain over and over again
what's happened.  So please do not talk about sabotage.

You changed the code quite a bit, to the point where patches
between what I maintain and what you decided to fork off
and make it your own play-ground were different and patches wouldn't
apply cleanly or at all.

For this thread, I've looked at the "error handling" code of what's
in the kernel for aic94xx and your version of the "libsas" and
frankly, I cannot submit patches to something which completely does
NOT follow the Sequencer Interface Spec for AIC94xx series of chips.
For example, handling of REQ_TASK_ABORT -- its implementation is
quite naive and out of spec, i.e. already broken.  I cannot submit
patches to something that I know is already broken, which had
been changed from a per-spec code version.

You know, people are still asking me to sync up to kernel
x.y.z the SAS/SATL+aic94xx I maintain, because they tell me "have been
running it for 2 years on production hardware [i.e. IBM] and no
problems and I just need to upgrade the kernel".

Will you also attack me for having a SATL as part of my SAS code
which I didn't submit to the kernel?  Do you know how much code
is out there that vendors don't bother to try to integrate because
of your attitude such as this?

> when you've been reading the bug reports on the mailing
> list.  I have to
> wonder how many more such bugs you've left in the code
> for users to
> find.

I didn't "leave" any bugs.  Bugs are just part of the development
cycle.  You should know this.

aic94xx is an old product, most customers have upgraded.
The ones who still have IBM servers with this chip, have had the
need to upgrade their kernels for various other reasons and since the
in-kernel solution has failed them, have requested to run my code.

> > BTW, the problem you're "debugging" may
> require a protocol
> > trace and a sequencer update by Adaptec.
> 
> I think you're fully aware that my test rig consists of
> a couple of SAS
> cards and drives donated by IBM and two expanders donated
> by LSI, plus a bit of equipment scavenged from Intel.

No, I'm not aware of this.  I have absolutely no interest in what your
"test rig" consists of.

Here is a question though: why do you even have SAS HW?  You don't
work for a SAS HW vendor and you don't have expertise in SAS
or SCSI in general.  Do you have hardware for each and every
LLDD that's in the kernel?

> I have no access to sophisticated tools like protocol analyzers.

See above.  You don't need to have access to such things unless
you worked for a vendor, and your job is to make your employer's
customers happy, by looking at traces everyday and fixing bugs.
And this itself isn't an easy matter, you need to talk to the
HW engineers, understand the protocol, understand the hardware,
how it implements the protocol, maybe need to read RTL code, all
things which are really out of your domain.

Now if you don't have access to such things, don't give "expert"
advice, and leave it to the respective vendor to get involved.
They have the equipment and the interest to debug their customers'
problems.  Then those vendors would submit patches for you
to integrate.

>  However, thanks for the advice and help.

Hey, no problem, anything for you James.

    Luben


^ permalink raw reply	[flat|nested] 5+ messages in thread
* [PATCH] libsas: fix error handling
@ 2008-02-20 16:07 James Bottomley
  2008-02-21  5:51 ` Luben Tuikov
  2008-02-22 19:46 ` James Bottomley
  0 siblings, 2 replies; 5+ messages in thread
From: James Bottomley @ 2008-02-20 16:07 UTC (permalink / raw)
  To: linux-scsi

The libsas error handler has two fairly fatal bugs

1. scsi_sas_task_done calls scsi_eh_finish_cmd() too early.  This
   happens if the task completes after it has been aborted but before
   the error handler starts up.  Because scsi_eh_finish_cmd()
   decrements host_failed and adds the task to the done list, the
   error handler start check (host_failed == host_busy) never passes
   and the eh never starts.

2. The multiple task completion paths sas_scsi_clear_queue_... all
   simply delete the task from the error queue.  This causes it to
   disappear into the ether, since a command must be placed on the
   done queue to be finished off by the error handler.  This behaviour
   causes the HBA to hang on pending commands.

Fix 1. by removing the SAS_TASK_STATE_ABORTED check and calling
->scsi_done() unconditionally (it is a nop if the timer has fired).
This keeps the task in the error handling queue until the eh starts.

Fix 2. by making sure every task goes through task complete followed
by scsi_eh_finish_cmd().

Tested this by firing resets across a disk running a hammer test (now
it actually survives without hanging the system)

Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
---
 drivers/scsi/libsas/sas_scsi_host.c |   53 +++++++++++++++++++----------------
 1 files changed, 29 insertions(+), 24 deletions(-)

diff --git a/drivers/scsi/libsas/sas_scsi_host.c b/drivers/scsi/libsas/sas_scsi_host.c
index f869fba..b656e29 100644
--- a/drivers/scsi/libsas/sas_scsi_host.c
+++ b/drivers/scsi/libsas/sas_scsi_host.c
@@ -51,8 +51,6 @@ static void sas_scsi_task_done(struct sas_task *task)
 {
 	struct task_status_struct *ts = &task->task_status;
 	struct scsi_cmnd *sc = task->uldd_task;
-	struct sas_ha_struct *sas_ha = SHOST_TO_SAS_HA(sc->device->host);
-	unsigned ts_flags = task->task_state_flags;
 	int hs = 0, stat = 0;
 
 	if (unlikely(!sc)) {
@@ -120,11 +118,7 @@ static void sas_scsi_task_done(struct sas_task *task)
 	sc->result = (hs << 16) | stat;
 	list_del_init(&task->list);
 	sas_free_task(task);
-	/* This is very ugly but this is how SCSI Core works. */
-	if (ts_flags & SAS_TASK_STATE_ABORTED)
-		scsi_eh_finish_cmd(sc, &sas_ha->eh_done_q);
-	else
-		sc->scsi_done(sc);
+	sc->scsi_done(sc);
 }
 
 static enum task_attribute sas_scsi_get_task_attr(struct scsi_cmnd *cmd)
@@ -255,13 +249,27 @@ out:
 	return res;
 }
 
+static void sas_eh_finish_cmd(struct scsi_cmnd *cmd)
+{
+	struct sas_task *task = TO_SAS_TASK(cmd);
+	struct sas_ha_struct *sas_ha = SHOST_TO_SAS_HA(cmd->device->host);
+
+	/* First off call task_done.  However, task will
+	 * be free'd after this */
+	task->task_done(task);
+	/* now finish the command and move it on to the error
+	 * handler done list, this also takes it off the
+	 * error handler pending list */
+	scsi_eh_finish_cmd(cmd, &sas_ha->eh_done_q);
+}
+
 static void sas_scsi_clear_queue_lu(struct list_head *error_q, struct scsi_cmnd *my_cmd)
 {
 	struct scsi_cmnd *cmd, *n;
 
 	list_for_each_entry_safe(cmd, n, error_q, eh_entry) {
 		if (cmd == my_cmd)
-			list_del_init(&cmd->eh_entry);
+			sas_eh_finish_cmd(cmd);
 	}
 }
 
@@ -274,7 +282,7 @@ static void sas_scsi_clear_queue_I_T(struct list_head *error_q,
 		struct domain_device *x = cmd_to_domain_dev(cmd);
 
 		if (x == dev)
-			list_del_init(&cmd->eh_entry);
+			sas_eh_finish_cmd(cmd);
 	}
 }
 
@@ -288,7 +296,7 @@ static void sas_scsi_clear_queue_port(struct list_head *error_q,
 		struct asd_sas_port *x = dev->port;
 
 		if (x == port)
-			list_del_init(&cmd->eh_entry);
+			sas_eh_finish_cmd(cmd);
 	}
 }
 
@@ -528,14 +536,14 @@ Again:
 		case TASK_IS_DONE:
 			SAS_DPRINTK("%s: task 0x%p is done\n", __FUNCTION__,
 				    task);
-			task->task_done(task);
+			sas_eh_finish_cmd(cmd);
 			if (need_reset)
 				try_to_reset_cmd_device(shost, cmd);
 			continue;
 		case TASK_IS_ABORTED:
 			SAS_DPRINTK("%s: task 0x%p is aborted\n",
 				    __FUNCTION__, task);
-			task->task_done(task);
+			sas_eh_finish_cmd(cmd);
 			if (need_reset)
 				try_to_reset_cmd_device(shost, cmd);
 			continue;
@@ -547,7 +555,7 @@ Again:
 					    "recovered\n",
 					    SAS_ADDR(task->dev),
 					    cmd->device->lun);
-				task->task_done(task);
+				sas_eh_finish_cmd(cmd);
 				if (need_reset)
 					try_to_reset_cmd_device(shost, cmd);
 				sas_scsi_clear_queue_lu(work_q, cmd);
@@ -562,7 +570,7 @@ Again:
 			if (tmf_resp == TMF_RESP_FUNC_COMPLETE) {
 				SAS_DPRINTK("I_T %016llx recovered\n",
 					    SAS_ADDR(task->dev->sas_addr));
-				task->task_done(task);
+				sas_eh_finish_cmd(cmd);
 				if (need_reset)
 					try_to_reset_cmd_device(shost, cmd);
 				sas_scsi_clear_queue_I_T(work_q, task->dev);
@@ -577,7 +585,7 @@ Again:
 				if (res == TMF_RESP_FUNC_COMPLETE) {
 					SAS_DPRINTK("clear nexus port:%d "
 						    "succeeded\n", port->id);
-					task->task_done(task);
+					sas_eh_finish_cmd(cmd);
 					if (need_reset)
 						try_to_reset_cmd_device(shost, cmd);
 					sas_scsi_clear_queue_port(work_q,
@@ -591,10 +599,10 @@ Again:
 				if (res == TMF_RESP_FUNC_COMPLETE) {
 					SAS_DPRINTK("clear nexus ha "
 						    "succeeded\n");
-					task->task_done(task);
+					sas_eh_finish_cmd(cmd);
 					if (need_reset)
 						try_to_reset_cmd_device(shost, cmd);
-					goto out;
+					goto clear_q;
 				}
 			}
 			/* If we are here -- this means that no amount
@@ -606,21 +614,18 @@ Again:
 				    SAS_ADDR(task->dev->sas_addr),
 				    cmd->device->lun);
 
-			task->task_done(task);
+			sas_eh_finish_cmd(cmd);
 			if (need_reset)
 				try_to_reset_cmd_device(shost, cmd);
 			goto clear_q;
 		}
 	}
-out:
 	return list_empty(work_q);
 clear_q:
 	SAS_DPRINTK("--- Exit %s -- clear_q\n", __FUNCTION__);
-	list_for_each_entry_safe(cmd, n, work_q, eh_entry) {
-		struct sas_task *task = TO_SAS_TASK(cmd);
-		list_del_init(&cmd->eh_entry);
-		task->task_done(task);
-	}
+	list_for_each_entry_safe(cmd, n, work_q, eh_entry)
+		sas_eh_finish_cmd(cmd);
+
 	return list_empty(work_q);
 }
 
-- 
1.5.4.1




^ permalink raw reply related	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2008-02-22 19:46 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-02-22 12:19 [PATCH] libsas: fix error handling Luben Tuikov
  -- strict thread matches above, loose matches on Subject: below --
2008-02-20 16:07 James Bottomley
2008-02-21  5:51 ` Luben Tuikov
2008-02-21 15:33   ` James Bottomley
2008-02-22 19:46 ` James Bottomley

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox