Linux PARISC architecture development
* [parisc-linux] PIM after c3k crash w/ VisEG PCI card
@ 2002-02-02 19:06 Helge Deller
  2002-02-03  6:57 ` Grant Grundler
  0 siblings, 1 reply; 2+ messages in thread
From: Helge Deller @ 2002-02-02 19:06 UTC
  To: parisc-linux

[-- Attachment #1: Type: text/plain, Size: 317 bytes --]

Hi all,

the attached file shows the PIM of a 64bit kernel after my 
machine crashed while trying to initialize the STI with a 
VisEG PCI card in PCI slot 2. So it's the same problem
as with a 32bit kernel.

Hopefully this log will be useful to someone who can help
me debug this problem.

TIA,
Helge



[-- Attachment #2: hpmc.txt --]
[-- Type: text/plain, Size: 8301 bytes --]

pim

PROCESSOR PIM INFORMATION

-----------------  Processor 0 HPMC Information ------------------

Timestamp = 
  Sat Feb  2 18:48:51 GMT 2002    (20:02:02:02:18:48:51)

HPMC Chassis Codes = 2cbf0  2500b  27821  2cbf4  2cbfc  

General Registers 0 - 31
00-03   0000000000000000  ffffffffffffffff  00000000001072a0  00000000004c5240
04-07   000000007fb70da8  00000000fa100000  00000000fa380000  00000000003376c8
08-11   0000000010494740  00000000000000c1  0000000010494740  000000001040084c
12-15   0000000010494740  00000000fb000000  0000000010494740  00000000f0400004
16-19   0000000010494740  00000000f000017c  0000000010494740  000000000804000e
20-23   00000000f0000174  000000008fb70ef0  0000000010494740  00000000004c5240
24-27   000000007fb70da8  00000000fa380000  00000000000000f0  00000000103ebd20
28-31   0000000000496788  0000000000000038  0000000000494958  0000000000000000

Control Registers 0 - 31
00-03   0000000000000000  0000000000000000  0000000000000000  0000000000000000
04-07   0000000000000000  0000000000000000  0000000000000000  0000000000000000
08-11   0000000000000000  0000000000000000  00000000000000c0  000000000000003f
12-15   0000000000000000  0000000000000000  0000000000104000  ffffffffffffffff
16-19   0000001c8128124b  0000000000000000  000000007fb31878  0000000048c20008
20-23   00000000a627ffe8  c0000000e0380004  000000ff0000ff08  8000000000000000
24-27   0000000000380000  0000000000380000  00000000ffffffff  00000000ffffffff
28-31   00000000ffffffff  00000000ffffffff  000000008fb70000  0000000010480000
Space Registers 0 - 7

00-03   00000000          00000000          00000000          00000000
04-07   00000000          00000000          00000000          00000000

IIA Space                    = 0x0000000000000000
IIA Offset                   = 0x000000007fb3187c
Check Type                   = 0x20000000
CPU State                    = 0x9e000004
Cache Check                  = 0x00000000
TLB Check                    = 0x00000000
Bus Check                    = 0x0030103b
Assists Check                = 0x00000000
Assist State                 = 0x00000000
Path Info                    = 0x00000000
System Responder Address     = 0x000000fffa380004
System Requestor Address     = 0xfffffffffffa0000

Floating-Point Registers 0 - 31
00-03   0000001f00000000  0000000000000000  0000000000000000  0000000000000000
04-07   00003d091037f038  000000788fb70000  00000000001c9c38  000000000800000f
08-11   00000000103ebd20  00000000ffffffff  00000000103ebd20  0000000010474e88
12-15   0000000010374cf8  0000000000000001  000000001016c2c8  000000000804000e
16-19   000000001037f038  0000000000000002  0000000000000002  000000001040084c
20-23   000000000804000e  00000000000000c1  0000000000000000  0000000000000000
24-27   834e0b5f0800000f  00000000103ebd20  00000000103ebd20  0000000000000000
28-31   0000000000000802  0000000000000000  0000000010158288  0000000000000002


'9000/785 B,C,J Workstation Unarchitected (per-CPU)', rev 1, 140 bytes:

Check Summary                = 0xcb81045028000000
Available Memory             = 0x0000000080000000
CPU Diagnose Register 2      = 0x0203000000802004
CPU Status Register 0        = 0x2420c20000000000
CPU Status Register 1        = 0x8002000000000000
SADD LOG                     = 0xaf115ebd36f73fff
Read Short LOG               = 0xc1a0f0fffa380004
ERROR_STATUS                 = 0x0000000000500050
MEM_ADDR                     = 0x000001ff3fffffff
MEM_SYND                     = 0x0000000000000000
MEM_ADDR_CORR                = 0x000000100000442f
MEM_SYND_CORR                = 0x8c008c0000008c00
RUN_DATA_HIGH                = 0xc1bff0fffed08040
RUN_DATA_LOW                 = 0xc1bff0fffed08040
RUN_CTRL                     = 0x0000021c00001418
RUN_ADDR                     = 0xc1bff0fffed08040
System Responder Path        = 0x00ffffff0a060200


HPMC PIM Analysis Information:

Timestamp = 
  Sat Feb  2 18:48:51 GMT 2002    (20:02:02:02:18:48:51)


'9000/785 B,C,J Workstation HPMC PIM Analysis (per-CPU)', rev 0, 1304 bytes:

A Data I/O Fetch Timeout occurred while CPU 0 was
requesting information from a device at the path 10/6/2/0 (PCI slot 2).


Memory/IO Controller Error Analysis Information:

There were multiple correctable memory errors.  See 'Memory Error Log Info'.

-----------------  Processor 0 LPMC Information ------------------

Check Type                   = 0x00000000
I/D Cache Parity Info        = 0x00000000
Cache Check                  = 0x00000000
TLB Check                    = 0x00000000
Bus Check                    = 0x00000000
Assists Check                = 0x00000000
Assist State                 = 0x00000000
Path Info                    = 0x00000000
System Responder Address     = 0x0000000000000000
System Requestor Address     = 0x0000000000000000


-----------------  Processor 0 TOC Information -------------------

General Registers 0 - 31
00-03   0000000000000000  0000000000000000  0000000000000000  0000000000000000
04-07   0000000000000000  0000000000000000  0000000000000000  0000000000000000
08-11   0000000000000000  0000000000000000  0000000000000000  0000000000000000
12-15   0000000000000000  0000000000000000  0000000000000000  0000000000000000
16-19   0000000000000000  0000000000000000  0000000000000000  0000000000000000
20-23   0000000000000000  0000000000000000  0000000000000000  0000000000000000
24-27   0000000000000000  0000000000000000  0000000000000000  0000000000000000
28-31   0000000000000000  0000000000000000  0000000000000000  0000000000000000

Control Registers 0 - 31
00-03   0000000000000000  0000000000000000  0000000000000000  0000000000000000
04-07   0000000000000000  0000000000000000  0000000000000000  0000000000000000
08-11   0000000000000000  0000000000000000  0000000000000000  0000000000000000
12-15   0000000000000000  0000000000000000  0000000000000000  0000000000000000
16-19   0000000000000000  0000000000000000  0000000000000000  0000000000000000
20-23   0000000000000000  0000000000000000  0000000000000000  0000000000000000
24-27   0000000000000000  0000000000000000  0000000000000000  0000000000000000
28-31   0000000000000000  0000000000000000  0000000000000000  0000000000000000
Space Registers 0 - 7

00-03   00000000          00000000          00000000          00000000
04-07   00000000          00000000          00000000          00000000

IIA Space                    = 0x0000000000000000
IIA Offset                   = 0x0000000000000000
CPU State                    = 0x00000000


Memory Error Log Information:

Timestamp = 
  Sat Feb  2 18:48:51 GMT 2002    (20:02:02:02:18:48:51)


'9000/785 B,C,J Workstation Memory Error Log', rev 0, 64 bytes:

 This log displays the contents of memory specific registers when the
 HPMC occurred.  If there are multiple memory errors, the order they are
 listed is not indicative of the order they occurred.

                                   Trans  Addr
   Memory Error Type(s)  OV  MID    ID    par  CP   DIMM       Runway Address
   --------------------  --  ---  -----  ----  --  -------  -------------------
1) Correctable Mem       1   0x0  0x10   na    na  01       0x       0000110bc0

                                                Syndrome
                                           ------------------
                                        1) 0x8c008c0000008c00

I/O Module Error Log Information:

Timestamp = 
  Sat Feb  2 18:48:51 GMT 2002    (20:02:02:02:18:48:51)


'9000/785 B,C,J Workstation IO Error Log', rev 0, 228 bytes:

 Rope     Word1        Word2            Word3
------ ------------ ------------ --------------------
   0    0x00000000   0x0e0cc009   0x00000000fed30048
   1    0x00000000   0x1e0cc009   0x00000000fed32048
   2    ----------   0x2e0cc009   ------------------
   3    ----------   0x3e0cc009   ------------------
   4    0x00000000   0x4e0cc009   0x00000000fed38048
   5    ----------   0x5e0cc009   ------------------
   6    0x0000e000   0x6e0cc009   0x00000000fa38003c
   7    ----------   0x7e0cc009   ------------------
Service Menu: Enter command >  

* Re: [parisc-linux] PIM after c3k crash w/ VisEG PCI card
  2002-02-02 19:06 [parisc-linux] PIM after c3k crash w/ VisEG PCI card Helge Deller
@ 2002-02-03  6:57 ` Grant Grundler
  0 siblings, 0 replies; 2+ messages in thread
From: Grant Grundler @ 2002-02-03  6:57 UTC
  To: Helge Deller; +Cc: parisc-linux

Helge Deller wrote:
> Hi all,
> 
> the attached file shows the PIM of a 64bit kernel after my 
> machine crashed while trying to initialize the STI with a 
> VisEG PCI card in PCI slot 2. So it's the same problem
> as with a 32bit kernel.
> 
> Hopefully this log will be useful to someone who can help
> me debug this problem.

I can try. My notes and thoughts are mixed in below;
much of the original text has been deleted.

> Timestamp = 
>   Sat Feb  2 18:48:51 GMT 2002    (20:02:02:02:18:48:51)

You should verify the timestamp actually matches the incident.
(This looks ok)
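
A minimal sketch of that check in Python, assuming the colon-separated
form is century:year:month:day:hour:minute:second. That field order is
inferred from the human-readable date printed next to it, not from any
firmware documentation, and parse_pim_timestamp is a hypothetical helper
name.

```python
from datetime import datetime


def parse_pim_timestamp(raw: str) -> datetime:
    """Parse a PIM timestamp, assumed to be CC:YY:MM:DD:hh:mm:ss."""
    cc, yy, month, day, hour, minute, second = (int(f) for f in raw.split(":"))
    return datetime(cc * 100 + yy, month, day, hour, minute, second)


# Value copied from the PIM dump above.
stamp = parse_pim_timestamp("20:02:02:02:18:48:51")

# Compare this against the time you remember the crash happening.
print(stamp.isoformat(sep=" "))
```

If the parsed time is far from when the machine actually died, the PIM
contents are probably left over from an earlier crash.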

> HPMC Chassis Codes = 2cbf0  2500b  27821  2cbf4  2cbfc  

Normally these are useful - if you have the magic decoder 
for them. I don't know what the first digit "2" means.
The cbf0/500b/7821/cbf4/cbfc look familiar.

Here's what I *think* these mean based on some *really old* notes:
cbf0: HPMC
500b: Bus Timeout
7821: 782x == Mem Correctable Err, 1 == DIMM 1

	This seems to match the "clear" text that was printed later.
	Sounds like an orthogonal problem. Perhaps swap DIMM 1 with
	one of the other DIMMS?

cbf4: invalid OS HPMC checksum - page zero OS entry ptr was invalid
cbfc: couldn't call OS HPMC handler

	So fixing this would probably help get more info to the console
	when it dies. Perhaps we should try to set up the console before
	enabling the OS HPMC handler?

	But no console means no output unless EARLY_BOOTUP_DEBUG
	is defined in pdc_cons.c.
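
The lookup above can be sketched as a small table. The meanings are only
the ones recalled from the old notes, not an official decoder, and the
leading digit is reported but left uninterpreted. decode_chassis_code
and KNOWN_CODES are hypothetical names.

```python
# Meanings as recalled in the notes above; treat them as unverified.
KNOWN_CODES = {
    0xCBF0: "HPMC",
    0x500B: "Bus Timeout",
    0xCBF4: "invalid OS HPMC checksum",
    0xCBFC: "couldn't call OS HPMC handler",
}


def decode_chassis_code(code: int) -> str:
    """Decode the low 16 bits of a chassis code using the table above."""
    low = code & 0xFFFF
    if low in KNOWN_CODES:
        return KNOWN_CODES[low]
    if low & 0xFFF0 == 0x7820:
        # 782x == Mem Correctable Err, last digit == DIMM number
        return f"Mem Correctable Err, DIMM {low & 0xF}"
    return "unknown"


# Codes copied from the "HPMC Chassis Codes" line of the PIM dump.
for code in (0x2CBF0, 0x2500B, 0x27821, 0x2CBF4, 0x2CBFC):
    # The leading digit (code >> 16) is printed but not interpreted.
    print(f"{code:05x}: prefix {code >> 16:x}, {decode_chassis_code(code)}")
```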


> General Registers 0 - 31
> 00-03   0000000000000000  ffffffffffffffff  00000000001072a0  00000000004c5240

I'm going to guess GR02 is a realmode address (matching
virtual addr would be 101072a0).
Or perhaps a "double" HPMC occurred?

First one happened in STI code and then the OS HPMC handler
tripped again when it tried to output?
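
For reference, the real-mode/virtual guess for GR02 above amounts to
adding a fixed offset. A minimal sketch, assuming the kernel maps
physical memory at virtual offset 0x10000000; that value is inferred
from the 001072a0 -> 101072a0 pairing above rather than read from the
kernel source, and the helper names are hypothetical.

```python
# Assumption: offset inferred from 001072a0 <-> 101072a0, not from kernel headers.
KERNEL_VIRT_OFFSET = 0x10000000


def real_to_virt(addr: int) -> int:
    """Map a real-mode (physical) address to the matching kernel virtual address."""
    return addr + KERNEL_VIRT_OFFSET


def virt_to_real(addr: int) -> int:
    """Inverse mapping, for comparing against addresses in System.map."""
    return addr - KERNEL_VIRT_OFFSET


gr02 = 0x00000000001072A0  # GR02 from the HPMC register dump
print(f"GR02 real {gr02:08x} -> virtual {real_to_virt(gr02):08x}")
```

The virtual form is the one to look up in the kernel's System.map when
checking which function GR02 points into.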


> IIA Space                    = 0x0000000000000000
> IIA Offset                   = 0x000000007fb3187c

Is this where STI gets loaded?
Looks like an awfully high address.
Artifact of no OS HPMC handler?

I'd hope STI would work the same on all boxes.

> Check Type                   = 0x20000000
> CPU State                    = 0x9e000004
> Cache Check                  = 0x00000000
> TLB Check                    = 0x00000000
> Bus Check                    = 0x0030103b
> Assists Check                = 0x00000000
> Assist State                 = 0x00000000
> Path Info                    = 0x00000000
> System Responder Address     = 0x000000fffa380004

Address CPU was trying to read.

> System Requestor Address     = 0xfffffffffffa0000

HPA of CPU that timed out.

...
> '9000/785 B,C,J Workstation Unarchitected (per-CPU)', rev 1, 140 bytes:
> 
> Check Summary                = 0xcb81045028000000
> Available Memory             = 0x0000000080000000
> CPU Diagnose Register 2      = 0x0203000000802004
> CPU Status Register 0        = 0x2420c20000000000
> CPU Status Register 1        = 0x8002000000000000
> SADD LOG                     = 0xaf115ebd36f73fff
> Read Short LOG               = 0xc1a0f0fffa380004
> ERROR_STATUS                 = 0x0000000000500050
> MEM_ADDR                     = 0x000001ff3fffffff
> MEM_SYND                     = 0x0000000000000000
> MEM_ADDR_CORR                = 0x000000100000442f
> MEM_SYND_CORR                = 0x8c008c0000008c00
> RUN_DATA_HIGH                = 0xc1bff0fffed08040
> RUN_DATA_LOW                 = 0xc1bff0fffed08040
> RUN_CTRL                     = 0x0000021c00001418
> RUN_ADDR                     = 0xc1bff0fffed08040
> System Responder Path        = 0x00ffffff0a060200

Much of this is actually interesting, but I think Read Short LOG
was the address of the most recent sub-cacheline read.
I'm not sure whether this covers only IO.

...
> A Data I/O Fetch Timeout occurred while CPU 0 was
> requesting information from a device at the path 10/6/2/0 (PCI slot 2).

Typical of two scenarios:
o The device wasn't initialized/enabled
  (i.e. the PCI CMD Bus Master and/or MMIO Enable bits were not set).

o Some bridge chip between the CPU and the PCI device was already fatal
  (e.g. DMA to an invalid address will cause Astro/U2 to go fatal
   because of an unresolved IO TLB fault).
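
The first scenario can be checked by looking at the card's PCI command
register. A minimal sketch: the bit masks are the standard PCI ones
(the same values Linux names PCI_COMMAND_IO, PCI_COMMAND_MEMORY and
PCI_COMMAND_MASTER), while the sample register values are made up for
illustration and describe_pci_command is a hypothetical helper.

```python
PCI_COMMAND_IO = 0x0001      # respond to I/O space accesses
PCI_COMMAND_MEMORY = 0x0002  # respond to memory (MMIO) space accesses
PCI_COMMAND_MASTER = 0x0004  # allow the device to master the bus (DMA)


def describe_pci_command(cmd: int) -> dict:
    """Report which enable bits are set in a PCI command register value."""
    return {
        "io": bool(cmd & PCI_COMMAND_IO),
        "mmio": bool(cmd & PCI_COMMAND_MEMORY),
        "bus_master": bool(cmd & PCI_COMMAND_MASTER),
    }


# Hypothetical values: 0x0000 is a card nobody enabled, 0x0006 has
# MMIO and bus mastering turned on.
for cmd in (0x0000, 0x0006):
    print(f"CMD={cmd:#06x}: {describe_pci_command(cmd)}")
```

A read from the card's MMIO range while the mmio bit is clear gets no
response from the device, which would be consistent with the fetch
timeout reported here.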

> Memory/IO Controller Error Analysis Information:
> 
> There were multiple correctable memory errors.  See 'Memory Error Log Info'.

I'm wondering if this is related. Do these happen without the VisEG
enabled too?
You can "ser clearpim", boot, build a kernel or something, reboot,
and check the PIM info again.

> -----------------  Processor 0 LPMC Information ------------------

FWIW, LPMC is typically for correctable memory errors.
I believe the OS gets notified of these since it may choose
to evacuate the memory page that's getting them.

>  This log displays the contents of memory specific registers when the
>  HPMC occurred.  If there are multiple memory errors, the order they are
>  listed is not indicative of the order they occurred.
> 
>                                    Trans  Addr
>    Memory Error Type(s)  OV  MID    ID    par  CP   DIMM       Runway Address
>    --------------------  --  ---  -----  ----  --  -------  -------------------
> 1) Correctable Mem       1   0x0  0x10   na    na  01       0x       0000110bc0

hmmm...is 00110bc0 a kernel address?
that's not far off from GR02.
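
To put a number on "not far off", here is a quick sketch of the
distance, assuming both values are physical addresses and a 4 KiB page
size (both are assumptions, not something the PIM states).

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages

gr02 = 0x001072A0         # GR02 from the HPMC register dump
runway_addr = 0x00110BC0  # Runway Address from the memory error log

distance = runway_addr - gr02
pages_apart = (runway_addr >> PAGE_SHIFT) - (gr02 >> PAGE_SHIFT)

print(f"distance: {distance:#x} bytes ({distance} decimal)")
print(f"pages apart: {pages_apart}")
```

The two addresses are close but not on the same page, so this hints at
kernel text or data without being conclusive.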

> '9000/785 B,C,J Workstation IO Error Log', rev 0, 228 bytes:
> 
>  Rope     Word1        Word2            Word3
> ------ ------------ ------------ --------------------
>    0    0x00000000   0x0e0cc009   0x00000000fed30048
>    1    0x00000000   0x1e0cc009   0x00000000fed32048
>    2    ----------   0x2e0cc009   ------------------
>    3    ----------   0x3e0cc009   ------------------
>    4    0x00000000   0x4e0cc009   0x00000000fed38048
>    5    ----------   0x5e0cc009   ------------------
>    6    0x0000e000   0x6e0cc009   0x00000000fa38003c
>    7    ----------   0x7e0cc009   ------------------

Rope 6 went fatal (0xe). I forget what Word3 is - the offending address?
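
A rough sketch of how to pick the odd rope out of that table: parse the
rows and flag any rope whose Word1 is nonzero. Treating a nonzero Word1
as the error indication is an assumption based on rope 6 being the only
row that differs; the helper names are hypothetical.

```python
# Rows copied from the IO Error Log above.
IO_ERROR_LOG = """\
   0    0x00000000   0x0e0cc009   0x00000000fed30048
   1    0x00000000   0x1e0cc009   0x00000000fed32048
   2    ----------   0x2e0cc009   ------------------
   3    ----------   0x3e0cc009   ------------------
   4    0x00000000   0x4e0cc009   0x00000000fed38048
   5    ----------   0x5e0cc009   ------------------
   6    0x0000e000   0x6e0cc009   0x00000000fa38003c
   7    ----------   0x7e0cc009   ------------------
"""


def parse_rope_log(text: str) -> dict:
    """Return {rope: (word1, word2, word3)}; dashed fields become None."""
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue
        words = [None if set(f) == {"-"} else int(f, 16) for f in fields[1:]]
        rows[int(fields[0])] = tuple(words)
    return rows


def suspicious_ropes(rows: dict) -> list:
    """Ropes with a nonzero Word1 (assumed here to mean an error was logged)."""
    return [rope for rope, (word1, _, _) in sorted(rows.items()) if word1]


rows = parse_rope_log(IO_ERROR_LOG)
print("ropes with nonzero Word1:", suspicious_ropes(rows))

# Compare rope 6's Word3 with the low 32 bits of the System Responder Address.
responder = 0x000000FFFA380004 & 0xFFFFFFFF
word3 = rows[6][2]
same_page = (word3 & ~0xFFF) == (responder & ~0xFFF)
print(f"Word3 {word3:#x} in same 4 KiB page as responder {responder:#x}: {same_page}")
```

Word3 for rope 6 lands in the same 4 KiB page as the address the CPU
was reading, which fits the "offending address" guess but does not
prove it.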


hth,
grant
