* [Bug 9775] HOST_MSG_LOOP invalid SCB ff
@ 2008-01-18 22:27 James Bottomley
0 siblings, 0 replies; 9+ messages in thread
From: James Bottomley @ 2008-01-18 22:27 UTC (permalink / raw)
To: bugme-daemon; +Cc: linux-scsi
> Latest working kernel version:
> Earliest failing kernel version:
> Distribution: Gentoo
> Hardware Environment: ML150G3, (2Core cpu, 64Bit) AHA3944AUWD card, Storagetek
> L80 +2x DLT8000
> Software Environment: gentoo
> Problem Description: kernel panic
>
> Steps to reproduce:
> Panic if the L80 is powered up when the kernel boots. 100% on any failing
> kernel.
> Not all kernels fail but most do.
> Git Bisect across linus's tree did not produce a convincing patch.
> Originally filed here: http://bugs.gentoo.org/show_bug.cgi?id=200708
> I have joined the linux-scsi list and will
>
> The event that brought the problem to light was the installation of a
> secondhand Storagetek L80 tape library. This has two DLT8000 drives on a
> HV-Differential bus. This needed a special card, an Adaptec 3944AUWD.
> The kernel I was running at that time was 2.6.22-gentoo-r8.
> It worked fine. Then when -r9 came out and this error manifested, the
> assumption was that -r9 was broken.
>
> I no longer think this to be the case.
>
> I think they are _ALL_ broken, possibly going way back toward the start of
> the 2.6 series. I think that the bug may or may not manifest depending on
> the internal layout of data in the kernel.
> --A true heisenbug--
>
> All that the git bisect did was to change the internal layout, not
> add/remove a bad patch.
>
> This explains why I could take the 2.6.23.8 kernel, compile it for SMP and
> have it fail, then compile it for UP and have it work. Initially I thought
> that meant a locking or race issue.
> Now I think it was just another case of altering the internal kernel layout.
Actually, I'd investigate either your tapes or the SCSI bus.
The message is produced deep in the heart of the aic7xxx driver. It
happens when the driver gets reselected with a tag that doesn't exist.
However, in this case, I think your device is untagged, in which case
this is some handling issue with SCB_LIST_NULL (the value 0xff).
James
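The failure James describes can be pictured with a toy lookup. This is only a sketch, not the aic7xxx source: `toy_scb`, `toy_lookup_scb` and `TOY_NUM_SCBS` are hypothetical names invented here, and only the sentinel value SCB_LIST_NULL (0xff) comes from the thread. It shows why a reselection that carries the sentinel, or a tag with no outstanding command, yields no SCB at all.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SCB_LIST_NULL 0xff	/* sentinel tag value named in the thread */
#define TOY_NUM_SCBS  255	/* hypothetical table size for this sketch */

/* Hypothetical stand-in for a pending command slot. */
struct toy_scb {
	uint8_t tag;
	int     in_use;
};

/*
 * Hypothetical lookup: map a tag read from the controller to a pending
 * command. Returns NULL when the tag is the 0xff sentinel or when no
 * command is outstanding under that tag.
 */
static struct toy_scb *
toy_lookup_scb(struct toy_scb *table, uint8_t tag)
{
	assert(table != NULL);
	if (tag == SCB_LIST_NULL)
		return NULL;		/* sentinel: no SCB at all */
	if (!table[tag].in_use)
		return NULL;		/* tag that doesn't exist */
	return &table[tag];
}
```

A caller that treats a NULL return as impossible is left with nothing but a fatal error path, which is the situation the panic message reports.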
* [Bug 9775] HOST_MSG_LOOP invalid SCB ff
[not found] <bug-9775-11613@http.bugzilla.kernel.org/>
@ 2008-01-18 22:28 ` bugme-daemon
2008-01-18 22:35 ` bugme-daemon
` (5 subsequent siblings)
6 siblings, 0 replies; 9+ messages in thread
From: bugme-daemon @ 2008-01-18 22:28 UTC (permalink / raw)
To: linux-scsi
http://bugzilla.kernel.org/show_bug.cgi?id=9775
------- Comment #5 from anonymous@kernel-bugs.osdl.org 2008-01-18 14:27 -------
Reply-To: James.Bottomley@HansenPartnership.com
> Latest working kernel version:
> Earliest failing kernel version:
> Distribution: Gentoo
> Hardware Environment: ML150G3, (2Core cpu, 64Bit) AHA3944AUWD card, Storagetek
> L80 +2x DLT8000
> Software Environment: gentoo
> Problem Description: kernel panic
>
> Steps to reproduce:
> Panic if the L80 is powered up when the kernel boots. 100% on any failing
> kernel.
> Not all kernels fail but most do.
> Git Bisect across linus's tree did not produce a convincing patch.
> Originally filed here: http://bugs.gentoo.org/show_bug.cgi?id=200708
> I have joined the linux-scsi list and will
>
> The event that brought the problem to light was the installation of a
> secondhand Storagetek L80 tape library. This has two DLT8000 drives on a
> HV-Differential bus. This needed a special card, an Adaptec 3944AUWD.
> The kernel I was running at that time was 2.6.22-gentoo-r8.
> It worked fine. Then when -r9 came out and this error manifested, the
> assumption was that -r9 was broken.
>
> I no longer think this to be the case.
>
> I think they are _ALL_ broken, possibly going way back toward the start of
> the 2.6 series. I think that the bug may or may not manifest depending on
> the internal layout of data in the kernel.
> --A true heisenbug--
>
> All that the git bisect did was to change the internal layout, not
> add/remove a bad patch.
>
> This explains why I could take the 2.6.23.8 kernel, compile it for SMP and
> have it fail, then compile it for UP and have it work. Initially I thought
> that meant a locking or race issue.
> Now I think it was just another case of altering the internal kernel layout.
Actually, I'd investigate either your tapes or the SCSI bus.
The message is produced deep in the heart of the aic7xxx driver. It
happens when the driver gets reselected with a tag that doesn't exist.
However, in this case, I think your device is untagged, in which case
this is some handling issue with SCB_LIST_NULL (the value 0xff).
James
--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
* [Bug 9775] HOST_MSG_LOOP invalid SCB ff
[not found] <bug-9775-11613@http.bugzilla.kernel.org/>
2008-01-18 22:28 ` [Bug 9775] HOST_MSG_LOOP invalid SCB ff bugme-daemon
@ 2008-01-18 22:35 ` bugme-daemon
2008-01-18 22:36 ` bugme-daemon
` (4 subsequent siblings)
6 siblings, 0 replies; 9+ messages in thread
From: bugme-daemon @ 2008-01-18 22:35 UTC (permalink / raw)
To: linux-scsi
http://bugzilla.kernel.org/show_bug.cgi?id=9775
------- Comment #6 from john@mib-infotech.co.nz 2008-01-18 14:35 -------
Thanks, I've just done some more testing.
There are no tapes in the drives.
Normally, there is the L80 and a DLT8000 on channel B
and a DLT8000 on channel A
Both busses have external terminators.
If Ch B is used alone the system is fine!
If Ch A is used alone it will fail.
If you are thinking of some hardware problem, it's possible to boot with the
L80 off, cause the SCSI bus to rescan, and have everything work fine.
Regards,
john
* [Bug 9775] HOST_MSG_LOOP invalid SCB ff
[not found] <bug-9775-11613@http.bugzilla.kernel.org/>
2008-01-18 22:28 ` [Bug 9775] HOST_MSG_LOOP invalid SCB ff bugme-daemon
2008-01-18 22:35 ` bugme-daemon
@ 2008-01-18 22:36 ` bugme-daemon
2008-02-09 2:52 ` bugme-daemon
` (3 subsequent siblings)
6 siblings, 0 replies; 9+ messages in thread
From: bugme-daemon @ 2008-01-18 22:36 UTC (permalink / raw)
To: linux-scsi
http://bugzilla.kernel.org/show_bug.cgi?id=9775
------- Comment #7 from john@mib-infotech.co.nz 2008-01-18 14:36 -------
Duh! I mean boot with it off, power it up and rescan.
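For readers wanting to repeat the "power it up and rescan" step: on Linux a rescan is normally requested by writing the wildcard string "- - -" (all channels, targets and LUNs) to a host's sysfs scan file, for example /sys/class/scsi_host/host0/scan. The sketch below assumes that interface; the host number is a guess for illustration, the thread does not say which command John actually used, and `toy_request_rescan` is a made-up helper name.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Write the wildcard "- - -" to the given scan file.
 * scan_path would typically be /sys/class/scsi_host/hostN/scan, where N
 * is the adapter's host number (an assumption here, not from the thread).
 * Returns 0 on success, -1 if the file could not be opened or written.
 */
static int
toy_request_rescan(const char *scan_path)
{
	FILE *f;
	int ok;

	assert(scan_path != NULL);
	f = fopen(scan_path, "w");
	if (f == NULL)
		return -1;
	ok = fputs("- - -\n", f) >= 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

Writing to the real sysfs file needs root. Taking the path as a parameter keeps the helper testable against an ordinary file.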
* [Bug 9775] HOST_MSG_LOOP invalid SCB ff
[not found] <bug-9775-11613@http.bugzilla.kernel.org/>
` (2 preceding siblings ...)
2008-01-18 22:36 ` bugme-daemon
@ 2008-02-09 2:52 ` bugme-daemon
2008-02-12 21:55 ` James Bottomley
2008-02-09 2:54 ` bugme-daemon
` (2 subsequent siblings)
6 siblings, 1 reply; 9+ messages in thread
From: bugme-daemon @ 2008-02-09 2:52 UTC (permalink / raw)
To: linux-scsi
http://bugzilla.kernel.org/show_bug.cgi?id=9775
------- Comment #8 from john@mib-infotech.co.nz 2008-02-08 18:52 -------
Ok, I've spent some time trying different combinations of devices.
Against kernel 2.6.24
T0 is Quantum DLT8000 ID0
T1 is Quantum DLT8000 ID1
MTX is STK L80 ID 15
Terminators A, B
Channel A      Channel B      Result
T0,T1,MTX,B    Nil            Crash
Nil            T0,T1,MTX,B    Parity Error in Data-in Phase
Nil            T0,MTX,B       Ok, Tar test ok, MTX ok
Nil            T1,MTX,B       Ok, Tar test ok, MTX ok
-- Both drives work ok
T1,MTX,B       Nil            Ok Skipped Tests
T1,MTX,A       Nil            Ok Skipped Tests
T0,MTX,B       Nil            Crash
T0,MTX,A       Nil            Crash
-- Not the terminator
-- Test on two channels
T0,MTX,A       T1,B           Crash
T1,B           T0,MTX,A       Parity Error in Data-in Phase
It really doesn't like three devices, on two busses or one.
* [Bug 9775] HOST_MSG_LOOP invalid SCB ff
[not found] <bug-9775-11613@http.bugzilla.kernel.org/>
` (3 preceding siblings ...)
2008-02-09 2:52 ` bugme-daemon
@ 2008-02-09 2:54 ` bugme-daemon
2008-02-12 21:56 ` bugme-daemon
2008-02-17 2:40 ` bugme-daemon
6 siblings, 0 replies; 9+ messages in thread
From: bugme-daemon @ 2008-02-09 2:54 UTC (permalink / raw)
To: linux-scsi
http://bugzilla.kernel.org/show_bug.cgi?id=9775
------- Comment #9 from john@mib-infotech.co.nz 2008-02-08 18:54 -------
Wrap around doesn't help..
I've also tried the 'old' AIC78XX driver.
That driver hangs even with no devices attached.
So now what?
--john
* Re: [Bug 9775] HOST_MSG_LOOP invalid SCB ff
2008-02-09 2:52 ` bugme-daemon
@ 2008-02-12 21:55 ` James Bottomley
0 siblings, 0 replies; 9+ messages in thread
From: James Bottomley @ 2008-02-12 21:55 UTC (permalink / raw)
To: bugme-daemon, john; +Cc: linux-scsi
On Fri, 2008-02-08 at 18:52 -0800, bugme-daemon@bugzilla.kernel.org
wrote:
> Ok, I've spent some time trying different combinations of devices.
>
> Against kernel 2.6.24
> T0 is Quantum DLT8000 ID0
> T1 is Quantum DLT8000 ID1
> MTX is STK L80 ID 15
> Terminators A, B
>
> Channel A      Channel B      Result
> T0,T1,MTX,B    Nil            Crash
> Nil            T0,T1,MTX,B    Parity Error in Data-in Phase
> Nil            T0,MTX,B       Ok, Tar test ok, MTX ok
> Nil            T1,MTX,B       Ok, Tar test ok, MTX ok
> -- Both drives work ok
> T1,MTX,B       Nil            Ok Skipped Tests
> T1,MTX,A       Nil            Ok Skipped Tests
> T0,MTX,B       Nil            Crash
> T0,MTX,A       Nil            Crash
> -- Not the terminator
>
> -- Test on two channels
> T0,MTX,A       T1,B           Crash
> T1,B           T0,MTX,A       Parity Error in Data-in Phase
>
> It really doesn't like three devices, on two busses or one.
Well, I still think you have some type of bus instability, but that said
we need to get rid of the panic.
I'm afraid this is going to be a long process. For the first attempt,
let's see if this is an unsolicited msgin ... it looks like the driver
handling for those is wrong. Can you try this patch?
Thanks,
James
---
diff --git a/drivers/scsi/aic7xxx/aic7xxx_core.c b/drivers/scsi/aic7xxx/aic7xxx_core.c
index 6d2ae64..64e62ce 100644
--- a/drivers/scsi/aic7xxx/aic7xxx_core.c
+++ b/drivers/scsi/aic7xxx/aic7xxx_core.c
@@ -695,15 +695,16 @@ ahc_handle_seqint(struct ahc_softc *ahc, u_int intstat)
 			scb_index = ahc_inb(ahc, SCB_TAG);
 			scb = ahc_lookup_scb(ahc, scb_index);
 			if (devinfo.role == ROLE_INITIATOR) {
-				if (scb == NULL)
-					panic("HOST_MSG_LOOP with "
-					      "invalid SCB %x\n", scb_index);
+				if (bus_phase == P_MESGOUT) {
+					if (scb == NULL)
+						panic("HOST_MSG_LOOP with "
+						      "invalid SCB %x\n",
+						      scb_index);
 
-				if (bus_phase == P_MESGOUT)
 					ahc_setup_initiator_msgout(ahc,
 								   &devinfo,
 								   scb);
-				else {
+				} else {
 					ahc->msg_type =
 					    MSG_TYPE_INITIATOR_MSGIN;
 					ahc->msgin_index = 0;
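The behavioural change in the patch above can be summarised with a small model. This is a sketch only: the `toy_` names and the two enums are hypothetical and stand in for the real driver state, and only the initiator-role branch touched by the hunk is modelled. Before the patch a missing SCB is fatal in every phase; after it, a missing SCB is fatal only in the message-out phase, so an unsolicited message-in no longer panics.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the bus phase and the driver's next step. */
enum toy_phase  { TOY_MESGOUT, TOY_MESGIN };
enum toy_action { TOY_SETUP_MSGOUT, TOY_SETUP_MSGIN, TOY_FATAL };

/* Before the patch: a NULL SCB is fatal whatever the bus phase. */
static enum toy_action
toy_dispatch_old(const void *scb, enum toy_phase phase)
{
	assert(phase == TOY_MESGOUT || phase == TOY_MESGIN);
	if (scb == NULL)
		return TOY_FATAL;
	return phase == TOY_MESGOUT ? TOY_SETUP_MSGOUT : TOY_SETUP_MSGIN;
}

/* After the patch: a NULL SCB is fatal only when a message must be sent. */
static enum toy_action
toy_dispatch_new(const void *scb, enum toy_phase phase)
{
	assert(phase == TOY_MESGOUT || phase == TOY_MESGIN);
	if (phase == TOY_MESGOUT)
		return scb == NULL ? TOY_FATAL : TOY_SETUP_MSGOUT;
	return TOY_SETUP_MSGIN;	/* message-in proceeds without an SCB */
}
```

The only input for which the two functions differ is a NULL SCB in the message-in phase, which is the unsolicited msgin case James wants to test.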
* [Bug 9775] HOST_MSG_LOOP invalid SCB ff
[not found] <bug-9775-11613@http.bugzilla.kernel.org/>
` (4 preceding siblings ...)
2008-02-09 2:54 ` bugme-daemon
@ 2008-02-12 21:56 ` bugme-daemon
2008-02-17 2:40 ` bugme-daemon
6 siblings, 0 replies; 9+ messages in thread
From: bugme-daemon @ 2008-02-12 21:56 UTC (permalink / raw)
To: linux-scsi
http://bugzilla.kernel.org/show_bug.cgi?id=9775
------- Comment #10 from anonymous@kernel-bugs.osdl.org 2008-02-12 13:56 -------
Reply-To: James.Bottomley@HansenPartnership.com
On Fri, 2008-02-08 at 18:52 -0800, bugme-daemon@bugzilla.kernel.org
wrote:
> Ok, I've spent some time trying different combinations of devices.
>
> Against kernel 2.6.24
> T0 is Quantum DLT8000 ID0
> T1 is Quantum DLT8000 ID1
> MTX is STK L80 ID 15
> Terminators A, B
>
> Channel A      Channel B      Result
> T0,T1,MTX,B    Nil            Crash
> Nil            T0,T1,MTX,B    Parity Error in Data-in Phase
> Nil            T0,MTX,B       Ok, Tar test ok, MTX ok
> Nil            T1,MTX,B       Ok, Tar test ok, MTX ok
> -- Both drives work ok
> T1,MTX,B       Nil            Ok Skipped Tests
> T1,MTX,A       Nil            Ok Skipped Tests
> T0,MTX,B       Nil            Crash
> T0,MTX,A       Nil            Crash
> -- Not the terminator
>
> -- Test on two channels
> T0,MTX,A       T1,B           Crash
> T1,B           T0,MTX,A       Parity Error in Data-in Phase
>
> It really doesn't like three devices, on two busses or one.
Well, I still think you have some type of bus instability, but that said
we need to get rid of the panic.
I'm afraid this is going to be a long process. For the first attempt,
let's see if this is an unsolicited msgin ... it looks like the driver
handling for those is wrong. Can you try this patch?
Thanks,
James
---
diff --git a/drivers/scsi/aic7xxx/aic7xxx_core.c b/drivers/scsi/aic7xxx/aic7xxx_core.c
index 6d2ae64..64e62ce 100644
--- a/drivers/scsi/aic7xxx/aic7xxx_core.c
+++ b/drivers/scsi/aic7xxx/aic7xxx_core.c
@@ -695,15 +695,16 @@ ahc_handle_seqint(struct ahc_softc *ahc, u_int intstat)
 			scb_index = ahc_inb(ahc, SCB_TAG);
 			scb = ahc_lookup_scb(ahc, scb_index);
 			if (devinfo.role == ROLE_INITIATOR) {
-				if (scb == NULL)
-					panic("HOST_MSG_LOOP with "
-					      "invalid SCB %x\n", scb_index);
+				if (bus_phase == P_MESGOUT) {
+					if (scb == NULL)
+						panic("HOST_MSG_LOOP with "
+						      "invalid SCB %x\n",
+						      scb_index);
 
-				if (bus_phase == P_MESGOUT)
 					ahc_setup_initiator_msgout(ahc,
 								   &devinfo,
 								   scb);
-				else {
+				} else {
 					ahc->msg_type =
 					    MSG_TYPE_INITIATOR_MSGIN;
 					ahc->msgin_index = 0;
* [Bug 9775] HOST_MSG_LOOP invalid SCB ff
[not found] <bug-9775-11613@http.bugzilla.kernel.org/>
` (5 preceding siblings ...)
2008-02-12 21:56 ` bugme-daemon
@ 2008-02-17 2:40 ` bugme-daemon
6 siblings, 0 replies; 9+ messages in thread
From: bugme-daemon @ 2008-02-17 2:40 UTC (permalink / raw)
To: linux-scsi
http://bugzilla.kernel.org/show_bug.cgi?id=9775
john@mib-infotech.co.nz changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |CLOSED
Resolution| |CODE_FIX
------- Comment #11 from john@mib-infotech.co.nz 2008-02-16 18:40 -------
Thanks James,
I've spent an afternoon rebooting now and finally discovered I had a faulty
external SCSI cable.
Initial tests suggest it's OK.
However I remain perplexed. The problem initially manifested when I upgraded my
kernel, not when I diddled with my hardware.
This now seems to have fixed udev bug
http://bugs.gentoo.org/show_bug.cgi?id=200437 as well.
How bizarre!
Thanks for your help everyone.
Regards
John