sym53c8xx parity errors on SuSE 9.1's hwscan?

* sym53c8xx parity errors on SuSE 9.1's hwscan?
@ 2004-09-22 23:16 Matthias Andree
From: Matthias Andree @ 2004-09-22 23:16 UTC
  To: linux-scsi; +Cc: Matthew Wilcox

Greetings,

SuSE Linux 9.1 (kernel 2.6.5 + SuSE patch set, but also 2.6.7 or a bit
milder in 2.6.9-rc2-mm1) uses some SuSE-specific "hwprobe" or "hwinfo"
tool to scan for hardware.

Whenever this tool, probing the PCI bus, hits my Tekram DC-390U or
DC-310U, the box logs a SCSI parity error, some timed out abort
messages, then the usual reset escalation; device reset (times out), bus
reset (times out), finally HBA reset (succeeds). The whole procedure
takes about 1 to 2 minutes before the bus is usable again.

If the machine is idle and has warm caches, I may occasionally see just
the parity error message, so it may look a bit like a race in sym53c8xx
or perhaps the hardware.

The problem seems to persist through current versions, although they can
_usually_ just fix up a phase error and continue immediately.

The problem did not exist when I was running 2.6.7 on SuSE 8.2.

I find it a bit intimidating that user-space (albeit with root
permissions) causes "SCSI" parity errors, and given the 2.6.9 logging
towards the end of the mail, I am wondering if SuSE's hwinfo stuff
triggers some race condition or manages to bypass the SCSI phase state
machine or if the probe confuses the chip. I haven't yet managed to
isolate (with strace) the cause.

Is there a useful debug setting for sym53c8xx that could shed some light
on what the user-space has attempted that led to the SCSI parity error?

Log from SuSE's 2.6.5-7.108-default during system boot-up:

Sep 22 15:15:09 merlin kernel: sym0: SCSI parity error detected: SCR1=3 DBC=50000000 SBCL=0
Sep 22 15:15:39 merlin kernel: sym0:1:0: ABORT operation started.
Sep 22 15:15:44 merlin kernel: sym0:1:0: ABORT operation timed-out.
Sep 22 15:15:44 merlin kernel: sym0:1:0: ABORT operation started.
Sep 22 15:15:49 merlin kernel: sym0:1:0: ABORT operation timed-out.
Sep 22 15:15:49 merlin kernel: sym0:1:0: DEVICE RESET operation started.
Sep 22 15:15:54 merlin kernel: sym0:1:0: DEVICE RESET operation timed-out.
Sep 22 15:15:54 merlin kernel: sym0:1:0: BUS RESET operation started.
Sep 22 15:15:54 merlin kernel: sym0: SCSI BUS reset detected.
Sep 22 15:15:54 merlin kernel: sym0: SCSI BUS has been reset.
Sep 22 15:15:54 merlin kernel: sym0:1:0: BUS RESET operation complete.

and a while later:

Sep 22 15:18:32 merlin kernel: sym0: SCSI parity error detected: SCR1=3 DBC=50000000 SBCL=0
Sep 22 15:18:32 merlin kernel: sym0: interrupted SCRIPT address not found.
Sep 22 15:18:32 merlin kernel: sym0: SCSI BUS reset detected.
Sep 22 15:18:32 merlin kernel: sym0: SCSI BUS has been reset.

2.6.9-rc2-mm1, although otherwise not useful for me (UDP networking),
logs this instead:

Sep 22 14:00:05 merlin kernel: sym0: SCSI parity error detected: SCR1=3 DBC=50000000 SBCL=0
Sep 22 14:00:05 merlin kernel: sym0: SCSI phase error fixup: CCB already dequeued.
Sep 22 14:00:05 merlin kernel: sym0: SCSI BUS reset detected.
Sep 22 14:00:05 merlin kernel: sym0: SCSI BUS has been reset.

-- 
Matthias Andree

Encrypted mail welcome: my GnuPG key ID is 0x052E7D95 (PGP/MIME preferred)

* Re: sym53c8xx parity errors on SuSE 9.1's hwscan?
@ 2004-09-23  2:39 Matthew Wilcox
From: Matthew Wilcox @ 2004-09-23  2:39 UTC
  To: Matthias Andree; +Cc: linux-scsi, Matthew Wilcox

On Thu, Sep 23, 2004 at 01:16:51AM +0200, Matthias Andree wrote:
> SuSE Linux 9.1 (kernel 2.6.5 + SuSE patch set, but also 2.6.7 or a bit
> milder in 2.6.9-rc2-mm1) uses some SuSE-specific "hwprobe" or "hwinfo"
> tool to scan for hardware.

Does anyone have the source?  I'd be interested to see what it's up to.

> I find it a bit intimidating that user-space (albeit with root
> permissions) causes "SCSI" parity errors, and given the 2.6.9 logging
> towards the end of the mail, I am wondering if SuSE's hwinfo stuff
> triggers some race condition or manages to bypass the SCSI phase state
> machine or if the probe confuses the chip. I haven't yet managed to
> isolate (with strace) the cause.

I have a suggestion.  If the probe attempts to size the BARs of the
chip, this is a destructive process that could well cause the chip to
start spewing errors and require a reset to work again.

> Is there a useful debug setting for sym53c8xx that could shed some light
> on what the user-space has attempted that led to the SCSI parity error?

The trouble is that I suspect the probe is completely bypassing the driver.
It might be worth instrumenting drivers/pci/proc.c to see if it's writing
to any of the BARs (particularly the second memory BAR, the one that's 8k).

-- 
"Next the statesmen will invent cheap lies, putting the blame upon 
the nation that is attacked, and every man will be glad of those
conscience-soothing falsities, and will diligently study them, and refuse
to examine any refutations of them; and thus he will by and by convince 
himself that the war is just, and will thank God for the better sleep 
he enjoys after this process of grotesque self-deception." -- Mark Twain

* Re: sym53c8xx parity errors on SuSE 9.1's hwscan?
@ 2004-09-23  7:42 Olaf Hering
From: Olaf Hering @ 2004-09-23  7:42 UTC
  To: Matthew Wilcox; +Cc: Matthias Andree, linux-scsi

 On Thu, Sep 23, Matthew Wilcox wrote:

> On Thu, Sep 23, 2004 at 01:16:51AM +0200, Matthias Andree wrote:
> > SuSE Linux 9.1 (kernel 2.6.5 + SuSE patch set, but also 2.6.7 or a bit
> > milder in 2.6.9-rc2-mm1) uses some SuSE-specific "hwprobe" or "hwinfo"
> > tool to scan for hardware.
> 
> Does anyone have the source?  I'd be interested to see what it's up to.

ftp.suse.com/pub/projects/kernel/kotd

-- 
USB is for mice, FireWire is for men!

sUse lINUX ag, nÜRNBERG
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

* Re: sym53c8xx parity errors on SuSE 9.1's hwscan?
@ 2004-09-23  8:51 Matthias Andree
From: Matthias Andree @ 2004-09-23  8:51 UTC
  To: Olaf Hering; +Cc: Matthew Wilcox, Matthias Andree, linux-scsi

Olaf Hering <olh@suse.de> writes:

>  On Thu, Sep 23, Matthew Wilcox wrote:
>
>> On Thu, Sep 23, 2004 at 01:16:51AM +0200, Matthias Andree wrote:
>> > SuSE Linux 9.1 (kernel 2.6.5 + SuSE patch set, but also 2.6.7 or a bit
>> > milder in 2.6.9-rc2-mm1) uses some SuSE-specific "hwprobe" or "hwinfo"
>> > tool to scan for hardware.
>> 
>> Does anyone have the source?  I'd be interested to see what it's up to.
>
> ftp.suse.com/pub/projects/kernel/kotd

Given that a vanilla 2.6.7 kernel also shows the problem, I presume
Matthew was interested in the source of hwinfo rather than the kernel
URL. Thanks anyway.

-- 
Matthias Andree

* Re: sym53c8xx parity errors on SuSE 9.1's hwscan?
@ 2004-09-23  8:45 Matthias Andree
From: Matthias Andree @ 2004-09-23  8:45 UTC
  To: Matthew Wilcox; +Cc: Matthias Andree, linux-scsi

Matthew Wilcox <matthew@wil.cx> writes:

> On Thu, Sep 23, 2004 at 01:16:51AM +0200, Matthias Andree wrote:
>> SuSE Linux 9.1 (kernel 2.6.5 + SuSE patch set, but also 2.6.7 or a bit
>> milder in 2.6.9-rc2-mm1) uses some SuSE-specific "hwprobe" or "hwinfo"
>> tool to scan for hardware.
>
> Does anyone have the source?  I'd be interested to see what it's up
> to.

It's GPL'd; get the source (505,597 bytes) from
ftp://ftp.suse.com/pub/suse/i386/9.1/suse/src/hwinfo-8.38-0.src.rpm

Just run dd reading from some SCSI drive and, at the same time, hwinfo
--all.  Don't do that if you need the machine in the following two
minutes...

While testing, I got one more variant today:

Sep 23 10:13:58 merlin kernel: sym0: SCSI parity error detected: SCR1=132 DBC=50000000 SBCL=0
Sep 23 10:13:58 merlin kernel: sym0:1: ERROR (81:0) (8-0-0) (10/9d/0) @ (mem 48000818:ffffffff).
Sep 23 10:13:58 merlin kernel: sym0: regdump: da 00 00 9d 47 10 01 07 00 08 81 00 80 00 0f 0a ff 90 e4 0d 02 ff ff ff.
Sep 23 10:13:58 merlin kernel: sym0: SCSI BUS reset detected.
Sep 23 10:13:58 merlin kernel: sym0: SCSI BUS has been reset.

> I have a suggestion.  If the probe attempts to size the BARs of the
> chip, this is a destructive process that could well cause the chip to
> start spewing errors and require a reset to work again.

The "BAR" stuff goes well over my head at this time.

-- 
Matthias Andree

* Re: sym53c8xx parity errors on SuSE 9.1's hwscan?
@ 2004-10-25  8:16 Matthias Andree
From: Matthias Andree @ 2004-10-25  8:16 UTC
  To: Matthew Wilcox; +Cc: Matthias Andree, linux-scsi

Matthew Wilcox <matthew@wil.cx> writes:

> On Thu, Sep 23, 2004 at 01:16:51AM +0200, Matthias Andree wrote:
>> SuSE Linux 9.1 (kernel 2.6.5 + SuSE patch set, but also 2.6.7 or a bit
>> milder in 2.6.9-rc2-mm1) uses some SuSE-specific "hwprobe" or "hwinfo"
>> tool to scan for hardware.
>
> Does anyone have the source?  I'd be interested to see what it's up
> to.

Matthew,

this thread apparently hasn't been followed up on in the last four weeks,
so I thought I'd ask whether you had a chance to make any progress on
this one.

-- 
Matthias Andree
