public inbox for linux-scsi@vger.kernel.org
* [Bug 9775]  HOST_MSG_LOOP invalid SCB ff
@ 2008-01-18 22:27 James Bottomley
  0 siblings, 0 replies; 9+ messages in thread
From: James Bottomley @ 2008-01-18 22:27 UTC (permalink / raw)
  To: bugme-daemon; +Cc: linux-scsi


> Latest working kernel version:
> Earliest failing kernel version: 
> Distribution: Gentoo
> Hardware Environment: ML150G3, (2Core cpu, 64Bit)  AHA3944AUWD card, Storagetek
> L80 +2x DLT8000
> Software Environment: gentoo
> Problem Description: kernel panic 
> 
> Steps to reproduce:
> Panic if the L80 is powered up when the kernel boots. 100% on any failing
> kernel.
> Not all kernels fail but most do.
> Git bisect across Linus's tree did not produce a convincing patch.
> Originally filed here: http://bugs.gentoo.org/show_bug.cgi?id=200708
> I have joined the linux-scsi list and will
> 
> The event that brought the problem to light was the installation of a
> secondhand Storagetek L80
> tape library. This has two DLT8000 drives on a HV-Differential bus.
> This needed a special card, an Adaptec 3944AUWD.
> The kernel I was running at that time was 2.6.22-gentoo-r8.
> It worked fine. Then when -r9 came out and this error manifested, the
> assumption
> was that -r9 was broken.
> 
> I no longer think this to be the case.
> 
> I think they are _ALL_ broken, possibly going way back toward the start of the
> 2.6 series.
> I think that the bug may or may not manifest depending on the internal layout
> of data in the kernel
> --A true heisenbug--
> 
> All that the git bisect did was to change the internal layout, not add/remove a
> bad patch.
> 
> This explains why I could take the 2.6.23.8 kernel and compile for SMP and have
> it fail.
> Compile it for UP and have it work. Initially I thought that meant a locking or
> race issue.
> Now I think it was just another case of altering the internal kernel layout.

Actually, I'd investigate either your tapes or the SCSI bus.

The message is produced deep in the heart of the aic7xxx driver.  It
happens when the driver gets reselected with a tag that doesn't exist.
However, in this case, I think your device is untagged, in which case
this is some handling issue with SCB_LIST_NULL (the value 0xff).

James
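
[Editor's illustration, not part of James's message.] The failure mode described above can be modelled in a few lines of C: a tag is looked up in a table of outstanding commands, and the reserved value SCB_LIST_NULL (0xff) never matches a real command, so the lookup yields NULL. Only SCB_LIST_NULL and its value come from the thread; the struct layout, the table size, and the names softc, scb, and lookup_scb are hypothetical and are not the aic7xxx driver's actual definitions.

```c
#include <stddef.h>
#include <stdint.h>

#define SCB_LIST_NULL 0xff /* reserved "no SCB" value named in the thread */
#define MAX_SCBS      255  /* assumption for this sketch: tags 0..254 are usable */

/* Hypothetical stand-in for a SCSI control block. */
struct scb {
    uint8_t tag;
};

/* Hypothetical stand-in for per-controller state. */
struct softc {
    struct scb *by_tag[MAX_SCBS]; /* NULL entry: no outstanding command for that tag */
};

/*
 * Return the SCB registered for a tag, or NULL when the tag is the
 * reserved SCB_LIST_NULL value or has no outstanding command.
 */
static struct scb *lookup_scb(const struct softc *sc, uint8_t tag)
{
    if (tag == SCB_LIST_NULL)
        return NULL;
    return sc->by_tag[tag];
}
```

A caller that treats a NULL result as fatal will stop on any reselection that carries 0xff, which matches the "invalid SCB ff" text in the bug title.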




* [Bug 9775] HOST_MSG_LOOP invalid SCB ff
       [not found] <bug-9775-11613@http.bugzilla.kernel.org/>
  2008-01-18 22:28 ` bugme-daemon
@ 2008-01-18 22:35 ` bugme-daemon
  2008-01-18 22:36 ` bugme-daemon
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 9+ messages in thread
From: bugme-daemon @ 2008-01-18 22:35 UTC (permalink / raw)
  To: linux-scsi

http://bugzilla.kernel.org/show_bug.cgi?id=9775





------- Comment #6 from john@mib-infotech.co.nz  2008-01-18 14:35 -------
Thanks, I've just done some more testing.
There are no tapes in the drives.
Normally, there is the L80 and a DLT8000 on channel B
and a DLT8000 on channel A

Both busses have external terminators.

If Ch B is used alone the system is fine!
If Ch A is used alone it will fail.

If you are thinking of some hardware problem, it's possible to boot with the
L80 off, cause the SCSI bus to rescan and have everything work fine.
Regards,
john


-- 
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


* [Bug 9775] HOST_MSG_LOOP invalid SCB ff
       [not found] <bug-9775-11613@http.bugzilla.kernel.org/>
  2008-01-18 22:28 ` bugme-daemon
  2008-01-18 22:35 ` bugme-daemon
@ 2008-01-18 22:36 ` bugme-daemon
  2008-02-09  2:52 ` bugme-daemon
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 9+ messages in thread
From: bugme-daemon @ 2008-01-18 22:36 UTC (permalink / raw)
  To: linux-scsi

http://bugzilla.kernel.org/show_bug.cgi?id=9775





------- Comment #7 from john@mib-infotech.co.nz  2008-01-18 14:36 -------
Duh! I mean boot with it off, power it up and rescan.
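
[Editor's illustration, not part of John's comment.] The thread does not say how the rescan was triggered. One common way on Linux is to write the wildcard string "- - -" (all channels, targets, and LUNs) to a host's sysfs scan attribute, typically /sys/class/scsi_host/hostN/scan. The sketch below assumes that method; the function name request_rescan is hypothetical, and the host number has to be looked up on the machine in question.

```c
#include <stdio.h>

/*
 * Write the wildcard "- - -" to the given scan attribute.
 * Returns 0 on success and -1 if the file cannot be opened or written.
 */
static int request_rescan(const char *scan_path)
{
    FILE *f = fopen(scan_path, "w");
    if (f == NULL)
        return -1;

    int ok = fputs("- - -\n", f) >= 0;

    /* With buffered I/O a write error may only surface at close time. */
    if (fclose(f) != 0)
        ok = 0;

    return ok ? 0 : -1;
}
```

Called as request_rescan("/sys/class/scsi_host/host0/scan") with root privileges, this would ask host 0 to rescan; "host0" is only an example.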



* [Bug 9775] HOST_MSG_LOOP invalid SCB ff
       [not found] <bug-9775-11613@http.bugzilla.kernel.org/>
                   ` (2 preceding siblings ...)
  2008-01-18 22:36 ` bugme-daemon
@ 2008-02-09  2:52 ` bugme-daemon
  2008-02-12 21:55   ` James Bottomley
  2008-02-09  2:54 ` bugme-daemon
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 9+ messages in thread
From: bugme-daemon @ 2008-02-09  2:52 UTC (permalink / raw)
  To: linux-scsi

http://bugzilla.kernel.org/show_bug.cgi?id=9775





------- Comment #8 from john@mib-infotech.co.nz  2008-02-08 18:52 -------
Ok, I've spent some time trying different combinations of devices.

Against kernel 2.6.24
T0 is Quantum DLT8000 ID0
T1 is Quantum DLT8000 ID1
MTX     is STK L80  ID 15
Terminators A, B

Channel A        Channel B        Result
T0,T1,MTX,B      Nil              Crash
Nil              T0,T1,MTX,B      Parity Error in Data-in Phase
Nil              T0,MTX,B         Ok, Tar test ok, MTX ok
Nil              T1,MTX,B         Ok, Tar test ok, MTX ok
-- Both drives work ok
T1,MTX,B         Nil              Ok, skipped tests
T1,MTX,A         Nil              Ok, skipped tests
T0,MTX,B         Nil              Crash
T0,MTX,A         Nil              Crash
-- Not the terminator

-- Test on two channels
T0,MTX,A         T1,B             Crash
T1,B             T0,MTX,A         Parity Error in Data-in Phase

It really doesn't like three devices, on two busses or one.



* [Bug 9775] HOST_MSG_LOOP invalid SCB ff
       [not found] <bug-9775-11613@http.bugzilla.kernel.org/>
                   ` (3 preceding siblings ...)
  2008-02-09  2:52 ` bugme-daemon
@ 2008-02-09  2:54 ` bugme-daemon
  2008-02-12 21:56 ` bugme-daemon
  2008-02-17  2:40 ` bugme-daemon
  6 siblings, 0 replies; 9+ messages in thread
From: bugme-daemon @ 2008-02-09  2:54 UTC (permalink / raw)
  To: linux-scsi

http://bugzilla.kernel.org/show_bug.cgi?id=9775





------- Comment #9 from john@mib-infotech.co.nz  2008-02-08 18:54 -------
Wrap around doesn't help..

I've also tried the 'old' AIC78XX driver.
That driver hangs even with no devices attached.

So now what?

--john



* Re: [Bug 9775] HOST_MSG_LOOP invalid SCB ff
  2008-02-09  2:52 ` bugme-daemon
@ 2008-02-12 21:55   ` James Bottomley
  0 siblings, 0 replies; 9+ messages in thread
From: James Bottomley @ 2008-02-12 21:55 UTC (permalink / raw)
  To: bugme-daemon, john; +Cc: linux-scsi

On Fri, 2008-02-08 at 18:52 -0800, bugme-daemon@bugzilla.kernel.org
wrote:
> Ok, I've spent some time trying different combinations of devices.
> 
> Against kernel 2.6.24
> T0 is Quantum DLT8000 ID0
> T1 is Quantum DLT8000 ID1
> MTX     is STK L80  ID 15
> Terminators A, B
> 
> Channel A        Channel B        Result
> T0,T1,MTX,B      Nil              Crash
> Nil              T0,T1,MTX,B      Parity Error in Data-in Phase
> Nil              T0,MTX,B         Ok, Tar test ok, MTX ok
> Nil              T1,MTX,B         Ok, Tar test ok, MTX ok
> -- Both drives work ok
> T1,MTX,B         Nil              Ok, skipped tests
> T1,MTX,A         Nil              Ok, skipped tests
> T0,MTX,B         Nil              Crash
> T0,MTX,A         Nil              Crash
> -- Not the terminator
>
> -- Test on two channels
> T0,MTX,A         T1,B             Crash
> T1,B             T0,MTX,A         Parity Error in Data-in Phase
> 
> It really doesn't like three devices, on two busses or one.

Well, I still think you have some type of bus instability, but that said
we need to get rid of the panic.

I'm afraid this is going to be a long process.  For the first attempt,
let's see if this is an unsolicited msgin ... it looks like the driver
handling for those is wrong.  Can you try this patch?

Thanks,

James

---

diff --git a/drivers/scsi/aic7xxx/aic7xxx_core.c b/drivers/scsi/aic7xxx/aic7xxx_core.c
index 6d2ae64..64e62ce 100644
--- a/drivers/scsi/aic7xxx/aic7xxx_core.c
+++ b/drivers/scsi/aic7xxx/aic7xxx_core.c
@@ -695,15 +695,16 @@ ahc_handle_seqint(struct ahc_softc *ahc, u_int intstat)
 			scb_index = ahc_inb(ahc, SCB_TAG);
 			scb = ahc_lookup_scb(ahc, scb_index);
 			if (devinfo.role == ROLE_INITIATOR) {
-				if (scb == NULL)
-					panic("HOST_MSG_LOOP with "
-					      "invalid SCB %x\n", scb_index);
+				if (bus_phase == P_MESGOUT) {
+					if (scb == NULL)
+						panic("HOST_MSG_LOOP with "
+						      "invalid SCB %x\n",
+						      scb_index);
 
-				if (bus_phase == P_MESGOUT)
 					ahc_setup_initiator_msgout(ahc,
 								   &devinfo,
 								   scb);
-				else {
+				} else {
 					ahc->msg_type =
 					    MSG_TYPE_INITIATOR_MSGIN;
 					ahc->msgin_index = 0;
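
[Editor's illustration, not part of James's message.] The control-flow change in the patch above can be summarised with a small standalone model. Before the patch, a missing SCB panics in either message phase; after it, only the message-out path requires an SCB, so an unsolicited message-in with no matching SCB proceeds to the message-in handling instead of panicking. The enum and function names below are hypothetical, and the model covers only the initiator-role branch shown in the diff.

```c
#include <stdbool.h>

/* Hypothetical names; the driver uses P_MESGOUT and friends. */
enum bus_phase { PHASE_MESGOUT, PHASE_MESGIN };
enum action { ACT_PANIC, ACT_SETUP_MSGOUT, ACT_START_MSGIN };

/* Before the patch: a missing SCB is fatal regardless of bus phase. */
static enum action decide_before(bool have_scb, enum bus_phase phase)
{
    if (!have_scb)
        return ACT_PANIC;
    return phase == PHASE_MESGOUT ? ACT_SETUP_MSGOUT : ACT_START_MSGIN;
}

/* After the patch: only the message-out path needs a valid SCB. */
static enum action decide_after(bool have_scb, enum bus_phase phase)
{
    if (phase == PHASE_MESGOUT)
        return have_scb ? ACT_SETUP_MSGOUT : ACT_PANIC;
    return ACT_START_MSGIN;
}
```

The only input for which the two functions differ is a message-in phase with no SCB, which is the unsolicited msgin case the patch is probing for.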




* [Bug 9775] HOST_MSG_LOOP invalid SCB ff
       [not found] <bug-9775-11613@http.bugzilla.kernel.org/>
                   ` (5 preceding siblings ...)
  2008-02-12 21:56 ` bugme-daemon
@ 2008-02-17  2:40 ` bugme-daemon
  6 siblings, 0 replies; 9+ messages in thread
From: bugme-daemon @ 2008-02-17  2:40 UTC (permalink / raw)
  To: linux-scsi

http://bugzilla.kernel.org/show_bug.cgi?id=9775


john@mib-infotech.co.nz changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |CLOSED
         Resolution|                            |CODE_FIX




------- Comment #11 from john@mib-infotech.co.nz  2008-02-16 18:40 -------
Thanks James,
I've spent an afternoon rebooting now and finally discovered I had a faulty
external SCSI cable.

Initial tests suggest it's ok.

However I remain perplexed. The problem initially manifested when I upgraded my
kernel, not when I diddled with my hardware.

This now seems to have fixed udev bug
http://bugs.gentoo.org/show_bug.cgi?id=200437

as well

how bizarre!

Thanks for your help everyone.

Regards
John


