public inbox for linux-scsi@vger.kernel.org
 help / color / mirror / Atom feed
* Need help with another aic94xx sequencer problem
@ 2006-08-22 20:28 James Bottomley
  2006-08-22 20:44 ` Tarte, Robert
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: James Bottomley @ 2006-08-22 20:28 UTC (permalink / raw)
  To: Tarte, Robert, Hammer, Jack; +Cc: linux-scsi

While abusing my sas topology (to try to get it to give me errors) I
came across this one with STP tasks:

When the aic94xx loses a task in sas_execute_tasks(), the timeout fires
and wakes the waiter (this leaves the task set pending and aborted).  In
response, sas_execute_task() tries to call lldd_abort_task()
(asd_abort_task()) on it.  Here we panic failing the BUG_ON(!list_empty)
check in aic94xx_hwi.h:asd_ascb_free().

What happens is that the abort comes back with TF_TMF_NO_CTX + 0xFF00
from the sequencer, which asd_abort_task() treats as success and then
panics because the original task is still active.

Either the abort function or the sequencer code is clearly wrong, but
not having access to the sequencer to look, I can't tell.  What should 
this return mean from the sequencer?  My suspicion is that it means the
STP task abort isn't actually formulated properly.

James



^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: Need help with another aic94xx sequencer problem
  2006-08-22 20:28 Need help with another aic94xx sequencer problem James Bottomley
@ 2006-08-22 20:44 ` Tarte, Robert
  2006-08-22 20:53 ` Mike Anderson
  2006-08-24  8:58 ` Luben Tuikov
  2 siblings, 0 replies; 9+ messages in thread
From: Tarte, Robert @ 2006-08-22 20:44 UTC (permalink / raw)
  To: James Bottomley, Hammer, Jack; +Cc: linux-scsi

Hi James,

Jack will be helping these issues in the future.  I doubt there is a
problem with the sequencer, it is shared with many other drivers.  I
think that you are probably correct that either the abort is probably
not being formulated or (more likely?) the response code from the abort
is not being interpreted correctly.  We will get back to you.

p.s. For some reason this email ended up in both my and Jack's spam
folders.  I'll have to be careful about dumping my spam :-)

Rob

> -----Original Message-----
> From: James Bottomley [mailto:James.Bottomley@HansenPartnership.com]
> Sent: Tuesday, August 22, 2006 1:29 PM
> To: Tarte, Robert; Hammer, Jack
> Cc: linux-scsi
> Subject: Need help with another aic94xx sequencer problem
> 
> While abusing my sas topology (to try to get it to give me errors) I
> came across this one with STP tasks:
> 
> When the aic94xx loses a task in sas_execute_tasks(), the timeout
fires
> and wakes the waiter (this leaves the task set pending and aborted).
In
> response, sas_execute_task() tries to call lldd_abort_task()
> (asd_abort_task()) on it.  Here we panic failing the
BUG_ON(!list_empty)
> check in aic94xx_hwi.h:asd_ascb_free().
> 
> What happens is that the abort comes back with TF_TMF_NO_CTX + 0xFF00
> from the sequencer, which asd_abort_task() treats as success and then
> panics because the original task is still active.
> 
> Either the abort function or the sequencer code is clearly wrong, but
> not having access to the sequencer to look, I can't tell.  What should
> this return mean from the sequencer?  My suspicion is that it means
the
> STP task abort isn't actually formulated properly.
> 
> James
> 


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Need help with another aic94xx sequencer problem
  2006-08-22 20:28 Need help with another aic94xx sequencer problem James Bottomley
  2006-08-22 20:44 ` Tarte, Robert
@ 2006-08-22 20:53 ` Mike Anderson
  2006-08-22 22:44   ` James Bottomley
  2006-08-24  8:58 ` Luben Tuikov
  2 siblings, 1 reply; 9+ messages in thread
From: Mike Anderson @ 2006-08-22 20:53 UTC (permalink / raw)
  To: James Bottomley; +Cc: Tarte, Robert, Hammer, Jack, linux-scsi

James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
> While abusing my sas topology (to try to get it to give me errors) I
> came across this one with STP tasks:
> 
> When the aic94xx loses a task in sas_execute_tasks(), the timeout fires
> and wakes the waiter (this leaves the task set pending and aborted).  In
> response, sas_execute_task() tries to call lldd_abort_task()
> (asd_abort_task()) on it.  Here we panic failing the BUG_ON(!list_empty)
> check in aic94xx_hwi.h:asd_ascb_free().
> 
> What happens is that the abort comes back with TF_TMF_NO_CTX + 0xFF00
> from the sequencer, which asd_abort_task() treats as success and then
> panics because the original task is still active.
> 
> Either the abort function or the sequencer code is clearly wrong, but
> not having access to the sequencer to look, I can't tell.  What should 
> this return mean from the sequencer?  My suspicion is that it means the
> STP task abort isn't actually formulated properly.
> 

Does this help any? While Alexis and I where working on a expander timeout
issue the abort was never working for us. I compared the adp abort and the
aic94xx abort code and made these changes. This appears to make the abort
work for us now. A few lines of the changes are not related to the abort.
YMMV, a better solution would be to know the exact format of the abort.

-andmike
--
Michael Anderson
andmike@us.ibm.com

[EXPERIMENTAL PATCH]
This patch is a port of some of the abort_task and timeout changes present
in the adp driver, but not present in the aic94xx driver.

Signed-off-by: Mike Anderson <andmike@us.ibm.com>

 drivers/scsi/aic94xx/aic94xx_sas.h |    2 +-
 drivers/scsi/aic94xx/aic94xx_tmf.c |    7 ++++---
 2 files changed, 5 insertions(+), 4 deletions(-)

Index: aic94xx-sas-2.6-patched/drivers/scsi/aic94xx/aic94xx_sas.h
===================================================================
--- aic94xx-sas-2.6-patched.orig/drivers/scsi/aic94xx/aic94xx_sas.h	2006-07-14 14:54:31.000000000 -0700
+++ aic94xx-sas-2.6-patched/drivers/scsi/aic94xx/aic94xx_sas.h	2006-07-14 15:19:38.000000000 -0700
@@ -777,7 +777,7 @@ struct asd_phy {
 
 /* COMINIT timer */
 #define ASD_TEN_MILLISEC_TIMEOUT  0x2710
-#define ASD_COMINIT_TIMEOUT ASD_TEN_MILLISEC_TIMEOUT
+#define ASD_COMINIT_TIMEOUT 	  0x000F4240
 
 /* 1 sec */
 #define ASD_SMP_RCV_TIMEOUT       0x000F4240
Index: aic94xx-sas-2.6-patched/drivers/scsi/aic94xx/aic94xx_tmf.c
===================================================================
--- aic94xx-sas-2.6-patched.orig/drivers/scsi/aic94xx/aic94xx_tmf.c	2006-06-23 11:12:01.000000000 -0700
+++ aic94xx-sas-2.6-patched/drivers/scsi/aic94xx/aic94xx_tmf.c	2006-07-14 15:17:52.000000000 -0700
@@ -375,6 +375,7 @@ int asd_abort_task(struct sas_task *task
 	case SAS_PROTO_SSP:
 		scb->abort_task.proto_conn_rate  = (1 << 4); /* SSP */
 		scb->abort_task.proto_conn_rate |= task->dev->linkrate;
+		scb->abort_task.flags |= (1U << 2);
 		break;
 	case SAS_PROTO_SMP:
 		break;
@@ -399,9 +400,9 @@ int asd_abort_task(struct sas_task *task
 	scb->abort_task.sister_scb = cpu_to_le16(0xFFFF);
 	scb->abort_task.conn_handle = cpu_to_le16(
 		(u16)(unsigned long)task->dev->lldd_dev);
-	scb->abort_task.retry_count = 1;
+	scb->abort_task.retry_count = 3;
 	scb->abort_task.index = cpu_to_le16((u16)tascb->tc_index);
-	scb->abort_task.itnl_to = cpu_to_le16(ITNL_TIMEOUT_CONST);
+/*	scb->abort_task.itnl_to = cpu_to_le16(ITNL_TIMEOUT_CONST); */
 
 	res = asd_enqueue_internal(ascb, asd_tmf_tasklet_complete,
 				   asd_tmf_timedout);
@@ -534,7 +535,7 @@ static int asd_initiate_ssp_tmf(struct d
 	scb->ssp_tmf.conn_handle= cpu_to_le16((u16)(unsigned long)
 					      dev->lldd_dev);
 	scb->ssp_tmf.retry_count = 1;
-	scb->ssp_tmf.itnl_to = cpu_to_le16(ITNL_TIMEOUT_CONST);
+
 	if (tmf == TMF_QUERY_TASK)
 		scb->ssp_tmf.index = cpu_to_le16(index);
 

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Need help with another aic94xx sequencer problem
  2006-08-22 20:53 ` Mike Anderson
@ 2006-08-22 22:44   ` James Bottomley
  2006-08-22 23:58     ` Mike Anderson
  0 siblings, 1 reply; 9+ messages in thread
From: James Bottomley @ 2006-08-22 22:44 UTC (permalink / raw)
  To: Mike Anderson; +Cc: Tarte, Robert, Hammer, Jack, linux-scsi

On Tue, 2006-08-22 at 13:53 -0700, Mike Anderson wrote:
> Does this help any? While Alexis and I where working on a expander timeout
> issue the abort was never working for us. I compared the adp abort and the
> aic94xx abort code and made these changes. This appears to make the abort
> work for us now. A few lines of the changes are not related to the abort.
> YMMV, a better solution would be to know the exact format of the abort.

Actually, no.  I still seem to get the same problem (at least it BUGs in
the same place ... I haven't dug down to see if I'm getting the same
return value).

James



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Need help with another aic94xx sequencer problem
  2006-08-22 22:44   ` James Bottomley
@ 2006-08-22 23:58     ` Mike Anderson
  2006-08-24  9:08       ` Luben Tuikov
  0 siblings, 1 reply; 9+ messages in thread
From: Mike Anderson @ 2006-08-22 23:58 UTC (permalink / raw)
  To: James Bottomley; +Cc: Tarte, Robert, Hammer, Jack, linux-scsi

James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
> On Tue, 2006-08-22 at 13:53 -0700, Mike Anderson wrote:
> > Does this help any? While Alexis and I where working on a expander timeout
> > issue the abort was never working for us. I compared the adp abort and the
> > aic94xx abort code and made these changes. This appears to make the abort
> > work for us now. A few lines of the changes are not related to the abort.
> > YMMV, a better solution would be to know the exact format of the abort.
> 
> Actually, no.  I still seem to get the same problem (at least it BUGs in
> the same place ... I haven't dug down to see if I'm getting the same
> return value).

Well the adp driver had a comment that the 0x1D error code means that it
cannot find the command in its execution queue as it already has sent the
command to the target(if I mapped this right between the two drivers).
What looks odd is in the adp driver that move to a higher level of
recovery (i.e., lun reset) if they receive this code, but in the aic94xx
we mark the task TMF_RESP_FUNC_COMPLETE which appears wrong as you found
out with the BUG_ON.

-andmike
--
Michael Anderson
andmike@us.ibm.com

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Need help with another aic94xx sequencer problem
  2006-08-22 20:28 Need help with another aic94xx sequencer problem James Bottomley
  2006-08-22 20:44 ` Tarte, Robert
  2006-08-22 20:53 ` Mike Anderson
@ 2006-08-24  8:58 ` Luben Tuikov
  2006-08-24 14:22   ` James Bottomley
  2 siblings, 1 reply; 9+ messages in thread
From: Luben Tuikov @ 2006-08-24  8:58 UTC (permalink / raw)
  To: James Bottomley, Tarte, Robert, Hammer, Jack; +Cc: linux-scsi

--- James Bottomley <James.Bottomley@HansenPartnership.com> wrote:

> While abusing my sas topology (to try to get it to give me errors) I
> came across this one with STP tasks:
> 
> When the aic94xx loses a task in sas_execute_tasks(), the timeout fires
> and wakes the waiter (this leaves the task set pending and aborted).  In
> response, sas_execute_task() tries to call lldd_abort_task()
> (asd_abort_task()) on it.  Here we panic failing the BUG_ON(!list_empty)
> check in aic94xx_hwi.h:asd_ascb_free().
> 
> What happens is that the abort comes back with TF_TMF_NO_CTX + 0xFF00
> from the sequencer, which asd_abort_task() treats as success and then
> panics because the original task is still active.

The sequencer response is correct but incomplete.  The interpretation
is a bad compromize between SSP and STP.

TF_TMF_NO_CTX should be interpreted as TMF_RESP_FUNC_ESUPP to get
"best of both worlds", and then upper layers would do the proper
thing depending on the protocol.

Of course your version of my code and my version of my code differ
somewhat.

The fix in the sequencer is messy as is the fix in software.

    Luben


> 
> Either the abort function or the sequencer code is clearly wrong, but
> not having access to the sequencer to look, I can't tell.  What should 
> this return mean from the sequencer?  My suspicion is that it means the
> STP task abort isn't actually formulated properly.
> 
> James
> 
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Need help with another aic94xx sequencer problem
  2006-08-22 23:58     ` Mike Anderson
@ 2006-08-24  9:08       ` Luben Tuikov
  0 siblings, 0 replies; 9+ messages in thread
From: Luben Tuikov @ 2006-08-24  9:08 UTC (permalink / raw)
  To: Mike Anderson, James Bottomley; +Cc: Tarte, Robert, Hammer, Jack, linux-scsi

--- Mike Anderson <andmike@us.ibm.com> wrote:
> James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
> > On Tue, 2006-08-22 at 13:53 -0700, Mike Anderson wrote:
> > > Does this help any? While Alexis and I where working on a expander timeout
> > > issue the abort was never working for us. I compared the adp abort and the
> > > aic94xx abort code and made these changes. This appears to make the abort
> > > work for us now. A few lines of the changes are not related to the abort.
> > > YMMV, a better solution would be to know the exact format of the abort.
> > 
> > Actually, no.  I still seem to get the same problem (at least it BUGs in
> > the same place ... I haven't dug down to see if I'm getting the same
> > return value).
> 
> Well the adp driver had a comment that the 0x1D error code means that it
> cannot find the command in its execution queue as it already has sent the
> command to the target(if I mapped this right between the two drivers).
> What looks odd is in the adp driver that move to a higher level of
> recovery (i.e., lun reset) if they receive this code, but in the aic94xx
> we mark the task TMF_RESP_FUNC_COMPLETE which appears wrong as you found
> out with the BUG_ON.

There is much more to that. If you followed my previous email... SSP is
handled differently so you cannot just do LU Reset.  Depening on other
factors, you may or you may not.  If you always did then you're "error
recovery thrashing", and performance would suffer, so you might as well
use SATA...

    Luben


> 
> -andmike
> --
> Michael Anderson
> andmike@us.ibm.com
> -
> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Need help with another aic94xx sequencer problem
  2006-08-24  8:58 ` Luben Tuikov
@ 2006-08-24 14:22   ` James Bottomley
  2006-08-28  3:26     ` James Bottomley
  0 siblings, 1 reply; 9+ messages in thread
From: James Bottomley @ 2006-08-24 14:22 UTC (permalink / raw)
  To: ltuikov; +Cc: Tarte, Robert, Hammer, Jack, linux-scsi

On Thu, 2006-08-24 at 01:58 -0700, Luben Tuikov wrote:
> The sequencer response is correct but incomplete.  The interpretation
> is a bad compromize between SSP and STP.
> 
> TF_TMF_NO_CTX should be interpreted as TMF_RESP_FUNC_ESUPP to get
> "best of both worlds", and then upper layers would do the proper
> thing depending on the protocol.

OK ... I can do that easily; that should cause the error handler to try
with a bigger hammer.

The next problem is that this actual issue is occurring in
sas_execute_task() which takes no further action if an abort fails.  I
can easily wire in a next level ... would you recommend CLEAR_TASK_SET,
or move straight on to LOGICAL_UNIT_RESET?

James


> Of course your version of my code and my version of my code differ
> somewhat.
> 
> The fix in the sequencer is messy as is the fix in software.
> 


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Need help with another aic94xx sequencer problem
  2006-08-24 14:22   ` James Bottomley
@ 2006-08-28  3:26     ` James Bottomley
  0 siblings, 0 replies; 9+ messages in thread
From: James Bottomley @ 2006-08-28  3:26 UTC (permalink / raw)
  To: ltuikov; +Cc: Tarte, Robert, Hammer, Jack, linux-scsi

On Thu, 2006-08-24 at 09:22 -0500, James Bottomley wrote:
> On Thu, 2006-08-24 at 01:58 -0700, Luben Tuikov wrote:
> > The sequencer response is correct but incomplete.  The interpretation
> > is a bad compromize between SSP and STP.
> > 
> > TF_TMF_NO_CTX should be interpreted as TMF_RESP_FUNC_ESUPP to get
> > "best of both worlds", and then upper layers would do the proper
> > thing depending on the protocol.
> 
> OK ... I can do that easily; that should cause the error handler to try
> with a bigger hammer.

So, let's try this as the first order fix.

James

Index: BUILD-2.6/drivers/scsi/aic94xx/aic94xx_tmf.c
===================================================================
--- BUILD-2.6.orig/drivers/scsi/aic94xx/aic94xx_tmf.c	2006-08-26 09:35:05.000000000 -0500
+++ BUILD-2.6/drivers/scsi/aic94xx/aic94xx_tmf.c	2006-08-26 09:37:54.000000000 -0500
@@ -462,13 +462,15 @@ int asd_abort_task(struct sas_task *task
 		if (task->task_state_flags & SAS_TASK_STATE_DONE)
 			res = TMF_RESP_FUNC_COMPLETE;
 		spin_unlock_irqrestore(&task->task_state_lock, flags);
-		goto out_done; break;
+		goto out_done;
 	case TF_TMF_NO_TAG + 0xFF00:
 	case TF_TMF_TAG_FREE + 0xFF00: /* the tag is in the free list */
-	case TF_TMF_NO_CTX + 0xFF00: /* not in seq, or proto != SSP */
 	case TF_TMF_NO_CONN_HANDLE + 0xFF00: /* no such device */
 		res = TMF_RESP_FUNC_COMPLETE;
-		goto out_done; break;
+		goto out_done;
+	case TF_TMF_NO_CTX + 0xFF00: /* not in seq, or proto != SSP */
+		res = TMF_RESP_FUNC_ESUPP;
+		goto out;
 	}
 out_done:
 	if (res == TMF_RESP_FUNC_COMPLETE) {
@@ -558,10 +560,12 @@ static int asd_initiate_ssp_tmf(struct d
 		break;
 	case TF_TMF_NO_TAG + 0xFF00:
 	case TF_TMF_TAG_FREE + 0xFF00: /* the tag is in the free list */
-	case TF_TMF_NO_CTX + 0xFF00: /* not in seq, or proto != SSP */
 	case TF_TMF_NO_CONN_HANDLE + 0xFF00: /* no such device */
 		res = TMF_RESP_FUNC_COMPLETE;
 		break;
+	case TF_TMF_NO_CTX + 0xFF00: /* not in seq, or proto != SSP */
+		res = TMF_RESP_FUNC_ESUPP;
+		break;
 	default:
 		ASD_DPRINTK("%s: converting result 0x%x to TMF_RESP_FUNC_FAILED\n",
 			    __FUNCTION__, res);



^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2006-08-28  3:26 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-08-22 20:28 Need help with another aic94xx sequencer problem James Bottomley
2006-08-22 20:44 ` Tarte, Robert
2006-08-22 20:53 ` Mike Anderson
2006-08-22 22:44   ` James Bottomley
2006-08-22 23:58     ` Mike Anderson
2006-08-24  9:08       ` Luben Tuikov
2006-08-24  8:58 ` Luben Tuikov
2006-08-24 14:22   ` James Bottomley
2006-08-28  3:26     ` James Bottomley

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox