HPMC running CMake Nightly tests

linux-parisc.vger.kernel.org archive mirror

* HPMC running CMake Nightly tests
@ 2011-09-27  7:32 Rolf Eike Beer
  2011-10-12  4:32 ` Grant Grundler
  0 siblings, 1 reply; 6+ messages in thread
From: Rolf Eike Beer @ 2011-09-27  7:32 UTC (permalink / raw)
  To: linux-parisc

I'm running the CMake tests every night. This is the second time in a row
that my C3600 did not survive this. Since I was warned I connected a
serial console.

The first things are expected crashes from CMake as detecting a crashed
process is part of the tests. I wonder if these shouldn't be silenced as a
userspace crash could otherwise too easily be used to flood the logs.
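
One common way to keep such messages without letting them flood the log is rate limiting: allow a small burst per time window and count what was dropped. The following is only a minimal sketch of that idea, not the kernel's actual implementation; the class name `RateLimit` and the parameters `interval` and `burst` are hypothetical.

```python
class RateLimit:
    """Allow at most `burst` messages per `interval` seconds (hypothetical helper)."""

    def __init__(self, interval, burst):
        self.interval = interval
        self.burst = burst
        self.window_start = None
        self.printed = 0
        self.suppressed = 0

    def allow(self, now):
        # Start a new window when none is open or the current one has expired.
        if self.window_start is None or now - self.window_start >= self.interval:
            self.window_start = now
            self.printed = 0
        if self.printed < self.burst:
            self.printed += 1
            return True
        # Over the burst limit inside this window: drop and count the message.
        self.suppressed += 1
        return False


limiter = RateLimit(interval=5.0, burst=2)
# Five crash reports arriving at these (made-up) timestamps in seconds.
decisions = [limiter.allow(t) for t in (0.0, 0.1, 0.2, 0.3, 5.0)]
print(decisions, limiter.suppressed)
```

With these made-up timestamps the first two reports pass, the next two are dropped, and the report at 5.0 seconds opens a new window.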

do_page_fault() pid=16799 command='kwsysTestProces' type=15
address=0x00000000

     YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
PSW: 00000000000001101111111100001111 Not tainted
r00-03  0006ff0f 1041b000 00011ee3 fb46f4c0
r04-07  4072adc0 00000000 fb36b02c 0001cd60
r08-11  00000000 000e6e20 00000000 00000000
r12-15  00000000 000e59c0 000e2d24 ffffffff
r16-19  000e2d14 000e3e20 00000000 4072adc0
r20-23  1054f020 00000000 406427e4 ffffffff
r24-27  fffffff5 ffffffd3 4072cd24 0001b0e4
r28-31  00000000 00000001 fb46f500 4063192b
sr00-03  00000030 00000017 00000000 00000030
sr04-07  00000030 00000030 00000030 00000030

      VZOUICununcqcqcqcqcqcrmunTDVZOUI
FPSR: 00000000000000000000000000000000
FPER1: 00000000
fr00-03  0000000000000000 0000000000000000 0000000000000000 0000000000000000
fr04-07  41d3a02318ce6e4c 2800000000000000 0000000190000000 402e000000000000
fr08-11  000000601be80000 1059900010544330 0000000000000000 105fbd602fc260c8
fr12-15  41d3a01d2d9ab424 0000008110131d00 105b68001055eca8 1055eca810560cf0
fr16-19  0004000f1055f000 10131d0000000003 105b48c0105b48f4 000000371055f1e8
fr20-23  2fc30158105a4800 3b9aca001055f000 000000000000002f 00001c7418c00000
fr24-27  410b865000000000 3fe0000000000000 412e848000000000 1029341c102c3848
fr28-31  ffffffff000032a4 1055f1d010544000 0000000100000228 2fc302001011a264

IASQ: 00000030 00000030 IAOQ: 00011ee7 00011eeb
 IIR: 0f801280    ISR: 00000030  IOR: 00000000
 CPU:        0   CR30: 2ed5c000 CR31: ffffdffe
 ORIG_R28: 00000000
 IAOQ[0]: 00011ee7
 IAOQ[1]: 00011eeb
 RP(r2): 00011ee3

do_page_fault() pid=16827 command='kwsysTestProces' type=15
address=0x00000000

     YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
PSW: 00000000000001101111111100001111 Not tainted
r00-03  0006ff0f 1041b000 00011ee3 fb2a74c0
r04-07  4072adc0 00000000 fb61702c 0001cd60
r08-11  00000000 000e6e20 00000000 00000000
r12-15  00000000 000e59c0 000e2d24 ffffffff
r16-19  000e2d14 000e3e20 00000000 4072adc0
r20-23  1054f020 00000000 406427e4 ffffffff
r24-27  fffffff5 ffffffd3 4072cd24 0001b0e4
r28-31  00000000 00000001 fb2a7500 4063192b
sr00-03  00000032 00000017 00000000 00000032
sr04-07  00000032 00000032 00000032 00000032

      VZOUICununcqcqcqcqcqcrmunTDVZOUI
FPSR: 00000000000000000000000000000000
FPER1: 00000000
fr00-03  0000000000000000 0000000000000000 0000000000000000 0000000000000000
fr04-07  41d3a02319c7bbbf 2800000000000000 0000000190000000 403e000000000000
fr08-11  0000005c3e100000 1059900010544330 0000000000000000 105fbd602fc260c8
fr12-15  41d3a01d2d9ab424 0000008110131d00 105b68001055eca8 1055eca810560cf0
fr16-19  0004000f1055f000 10131d0000000003 105b48c0105b48f4 000000371055f1e8
fr20-23  2fc30158105a4800 3b9aca001055f000 000000000000002f 00001c7419c00000
fr24-27  40fd802000000000 3fe0000000000000 412e848000000000 1029341c102c3848
fr28-31  ffffffff000032a4 1055f1d010544000 0000000100000228 2fc302001011a264

IASQ: 00000032 00000032 IAOQ: 00011ee7 00011eeb
 IIR: 0f801280    ISR: 00000032  IOR: 00000000
 CPU:        0   CR30: 2095c000 CR31: ffffdffe
 ORIG_R28: 00000000
 IAOQ[0]: 00011ee7
 IAOQ[1]: 00011eeb
 RP(r2): 00011ee3
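
The PSW line in the two dumps above can be read against the legend line printed directly over it, one letter per bit. A minimal sketch of that lookup follows; it only pairs letters with bits by position and does not attach any meaning to the letters. The function name `psw_flags` is hypothetical.

```python
LEGEND = "YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI"


def psw_flags(bits, legend=LEGEND):
    """Return the legend letters whose bit is 1, in printed order."""
    if len(bits) != len(legend):
        raise ValueError("PSW and legend must have the same length")
    return [name for name, bit in zip(legend, bits) if bit == "1"]


# PSW value copied from the two page fault dumps (both are identical).
user_fault = psw_flags("00000000000001101111111100001111")
print("".join(user_fault))
```

The same function can be applied to any other PSW line from this kind of dump to see which positions differ.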

But then the machine got killed:

Backtrace:
 [<1030b9ec>] tulip_get_stats+0x34/0x5c
 [<1038ac20>] dev_get_stats+0x98/0xe8
 [<102946b4>] led_work_func+0x11c/0x310
 [<10145204>] process_one_work+0x120/0x3ac
 [<10147110>] worker_thread+0x174/0x338
 [<1014b0b4>] kthread+0x9c/0xa4
 [<10102c5c>] ret_from_kernel_thread+0x1c/0x24


High Priority Machine Check (HPMC): Code=1 regs=10551080 (Addr=00000000)

     YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
PSW: 00000000000001001111111100001110 Not tainted
r00-03  0004ff0e 105bf000 1030b9ec 2fc72000
r04-07  0000000f 00000000 00000000 00000000
r08-11  2fc72000 105bf600 2fea4208 7f000000
r12-15  2fea4210 105ba000 10544000 2fc2f408
r16-19  1041d1dc f000017c f0000174 2fea4210
r20-23  0099f055 0099f050 1030b9b8 00000000
r24-27  2ff57008 2fea4210 0004a040 10544000
r28-31  0004a040 f68e066d 2fea4400 1038ac20
sr00-03  00000000 00000000 00000000 00000017
sr04-07  00000000 00000000 00000000 00000000

IASQ: 00000000 00000000 IAOQ: 10284394 10284398
 IIR: 0f80109c    ISR: a627ffd0  IOR: 0204a040
 CPU:        0   CR30: 2fea4000 CR31: ffffdffe
 ORIG_R28: 00000000
 IAOQ[0]: ioread32+0xc/0x4c
 IAOQ[1]: ioread32+0x10/0x4c
 RP(r2): tulip_get_stats+0x34/0x5c
Backtrace:
 [<1030b9ec>] tulip_get_stats+0x34/0x5c
 [<1038ac20>] dev_get_stats+0x98/0xe8
 [<102946b4>] led_work_func+0x11c/0x310
 [<10145204>] process_one_work+0x120/0x3ac
 [<10147110>] worker_thread+0x174/0x338
 [<1014b0b4>] kthread+0x9c/0xa4
 [<10102c5c>] ret_from_kernel_thread+0x1c/0x24

Kernel panic - not syncing: High Priority Machine Check (HPMC)
Backtrace:
 [<1010edec>] panic+0x90/0x23c
 [<101143b8>] parisc_terminate+0xbc/0xd4
 [<1011458c>] handle_interruption+0x1bc/0x718
 [<10103078>] intr_check_sig+0x0/0x34
 [<10284398>] ioread32+0x10/0x4c
 [<103e8fc0>] bictcp_acked+0x0/0x228

I'm running 3.0.4 with d7dd2ff11b7fcd425aca5a875983c862d19a67ae reverted.

Any hints?

Eike


* Re: HPMC running CMake Nightly tests
  2011-09-27  7:32 HPMC running CMake Nightly tests Rolf Eike Beer
@ 2011-10-12  4:32 ` Grant Grundler
  2011-10-17  7:18   ` Rolf Eike Beer
  0 siblings, 1 reply; 6+ messages in thread
From: Grant Grundler @ 2011-10-12  4:32 UTC (permalink / raw)
  To: Rolf Eike Beer; +Cc: linux-parisc

On Tue, Sep 27, 2011 at 09:32:37AM +0200, Rolf Eike Beer wrote:
> I'm running the CMake tests every night. This is the second time in a row
> that my C3600 did not survive this. Since I was warned I connected a
> serial console.
...

> But then the machine got killed:
> 
> Backtrace:
>  [<1030b9ec>] tulip_get_stats+0x34/0x5c
>  [<1038ac20>] dev_get_stats+0x98/0xe8
>  [<102946b4>] led_work_func+0x11c/0x310
>  [<10145204>] process_one_work+0x120/0x3ac
>  [<10147110>] worker_thread+0x174/0x338
>  [<1014b0b4>] kthread+0x9c/0xa4
>  [<10102c5c>] ret_from_kernel_thread+0x1c/0x24
> 
> 
> High Priority Machine Check (HPMC): Code=1 regs=10551080 (Addr=00000000)
> 
>      YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
> PSW: 00000000000001001111111100001110 Not tainted
> r00-03  0004ff0e 105bf000 1030b9ec 2fc72000
> r04-07  0000000f 00000000 00000000 00000000
> r08-11  2fc72000 105bf600 2fea4208 7f000000
> r12-15  2fea4210 105ba000 10544000 2fc2f408
> r16-19  1041d1dc f000017c f0000174 2fea4210
> r20-23  0099f055 0099f050 1030b9b8 00000000
> r24-27  2ff57008 2fea4210 0004a040 10544000
> r28-31  0004a040 f68e066d 2fea4400 1038ac20
> sr00-03  00000000 00000000 00000000 00000017
> sr04-07  00000000 00000000 00000000 00000000
> 
> IASQ: 00000000 00000000 IAOQ: 10284394 10284398
>  IIR: 0f80109c    ISR: a627ffd0  IOR: 0204a040
>  CPU:        0   CR30: 2fea4000 CR31: ffffdffe
>  ORIG_R28: 00000000
>  IAOQ[0]: ioread32+0xc/0x4c

Usually the HPMC means tulip tried to read something
from MMIO space that didn't respond and this
resulted in a "Master Abort" (PCI bus controller
had to abort the transaction). On PCs that's not
fatal but is on many RISC architectures.

If you can decode the instruction pointer (ioread32+0x10) to figure out
which register is used to dereference the MMIO address, it would
be obvious what the offending address is - just to confirm the
pointer isn't pointing off into the weeds. It will be one of the
registers that contain a 0xfnnnnnnn address.
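
The IIR value in the dump (0f80109c) is the instruction word itself, so the register fields can be pulled out with a few shifts. The sketch below is a rough illustration only: the field layout (opcode, base, displacement or index, space, short-form flag, extension, modify, target, with bit 0 as the most significant bit) is an assumption taken from the PA-RISC 1.1 load/store format and should be checked against objdump output before relying on it. The function name `decode_short_load` is hypothetical.

```python
def decode_short_load(iir):
    """Split a 32-bit instruction word into assumed load-format fields."""
    return {
        "opcode": (iir >> 26) & 0x3F,      # bits 0-5 (bit 0 is the MSB)
        "base": (iir >> 21) & 0x1F,        # bits 6-10
        "im5_or_x": (iir >> 16) & 0x1F,    # bits 11-15
        "space": (iir >> 14) & 0x3,        # bits 16-17
        "short_form": (iir >> 12) & 0x1,   # bit 19
        "ext4": (iir >> 6) & 0xF,          # bits 22-25
        "modify": (iir >> 5) & 0x1,        # bit 26
        "target": iir & 0x1F,              # bits 27-31
    }


# IIR copied from the HPMC register dump.
fields = decode_short_load(0x0F80109C)
print(fields)
```

Under that assumed layout the word splits into opcode 0x03 with base register 28 and target register 28. Whether r28 in the dump still holds the address that was used at the time of the check is a separate question.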

It is also possible that something earlier already offended the SBA
("System Bus Adapter": it contains the IOMMU and memory controller)
by trying to DMA to an unmapped address. The SBA goes "fatal"
at that point, and the next MMIO read causes the CPU to
recognize the fatal state of the SBA. Decoding the HPMC (see
below) can help determine that.


>  IAOQ[1]: ioread32+0x10/0x4c
>  RP(r2): tulip_get_stats+0x34/0x5c
> Backtrace:
>  [<1030b9ec>] tulip_get_stats+0x34/0x5c
>  [<1038ac20>] dev_get_stats+0x98/0xe8
>  [<102946b4>] led_work_func+0x11c/0x310
>  [<10145204>] process_one_work+0x120/0x3ac
>  [<10147110>] worker_thread+0x174/0x338
>  [<1014b0b4>] kthread+0x9c/0xa4
>  [<10102c5c>] ret_from_kernel_thread+0x1c/0x24
> 
> Kernel panic - not syncing: High Priority Machine Check (HPMC)
> Backtrace:
>  [<1010edec>] panic+0x90/0x23c
>  [<101143b8>] parisc_terminate+0xbc/0xd4
>  [<1011458c>] handle_interruption+0x1bc/0x718
>  [<10103078>] intr_check_sig+0x0/0x34
>  [<10284398>] ioread32+0x10/0x4c
>  [<103e8fc0>] bictcp_acked+0x0/0x228
> 
> I'm running 3.0.4 with d7dd2ff11b7fcd425aca5a875983c862d19a67ae reverted.
> 
> Any hints?

Interrupt the boot process and collect the HPMC dump as described:
   http://www.parisc-linux.org/faq/kernelbug-howto.html

The output will include the offending address that the ioread32 was
trying to access to confirm the instruction was decoded correctly.
If anyone has access to the magic decoder ring, we might be able to tell more.

cheers,
grant


* Re: HPMC running CMake Nightly tests
  2011-10-12  4:32 ` Grant Grundler
@ 2011-10-17  7:18   ` Rolf Eike Beer
  2011-10-21  8:26     ` Rolf Eike Beer
  0 siblings, 1 reply; 6+ messages in thread
From: Rolf Eike Beer @ 2011-10-17  7:18 UTC (permalink / raw)
  To: linux-parisc

> On Tue, Sep 27, 2011 at 09:32:37AM +0200, Rolf Eike Beer wrote:
>> I'm running the CMake tests every night. This is the second time in a row
>> that my C3600 did not survive this. Since I was warned I connected a
>> serial console.
> ...
>
>> But then the machine got killed:
>>
>> Backtrace:
>>  [<1030b9ec>] tulip_get_stats+0x34/0x5c
>>  [<1038ac20>] dev_get_stats+0x98/0xe8
>>  [<102946b4>] led_work_func+0x11c/0x310
>>  [<10145204>] process_one_work+0x120/0x3ac
>>  [<10147110>] worker_thread+0x174/0x338
>>  [<1014b0b4>] kthread+0x9c/0xa4
>>  [<10102c5c>] ret_from_kernel_thread+0x1c/0x24
>>
>>
>> High Priority Machine Check (HPMC): Code=1 regs=10551080 (Addr=00000000)
>>
>>      YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
>> PSW: 00000000000001001111111100001110 Not tainted
>> r00-03  0004ff0e 105bf000 1030b9ec 2fc72000
>> r04-07  0000000f 00000000 00000000 00000000
>> r08-11  2fc72000 105bf600 2fea4208 7f000000
>> r12-15  2fea4210 105ba000 10544000 2fc2f408
>> r16-19  1041d1dc f000017c f0000174 2fea4210
>> r20-23  0099f055 0099f050 1030b9b8 00000000
>> r24-27  2ff57008 2fea4210 0004a040 10544000
>> r28-31  0004a040 f68e066d 2fea4400 1038ac20
>> sr00-03  00000000 00000000 00000000 00000017
>> sr04-07  00000000 00000000 00000000 00000000
>>
>> IASQ: 00000000 00000000 IAOQ: 10284394 10284398
>>  IIR: 0f80109c    ISR: a627ffd0  IOR: 0204a040
>>  CPU:        0   CR30: 2fea4000 CR31: ffffdffe
>>  ORIG_R28: 00000000
>>  IAOQ[0]: ioread32+0xc/0x4c
>
> Usually the HPMC means tulip tried to read something
> from MMIO space that didn't respond and this
> resulted in a "Master Abort" (PCI bus controller
> had to abort the transaction). On PCs that's not
> fatal but is on many RISC architectures.
>
> If you can decode the instruction pointer (ioread32+0x10) to figure out
> which register is used to dereference the MMIO address, it would
> be obvious what the offending address is - just to confirm the
> pointer isn't pointing off into the weeds. It will be one of the
> registers that contains a 0xfnnnnnnn address.

I will have a look.

> Interrupt the boot process and collect the HPMC dump as described:
>    http://www.parisc-linux.org/faq/kernelbug-howto.html
>
> The output will include the offending address that the ioread32 was
> trying to access to confirm the instruction was decoded correctly.
> If anyone has access to the magic decoder ring, we might be able to tell more.

-----------------  Processor 0 HPMC Information ------------------

Timestamp =
  Fri Oct  14 12:18:23 GMT 2011    (20:11:10:14:12:18:23)

HPMC Chassis Codes = 2cbf0  2500b  2cbfb

General Registers 0 - 31
00-03   0000000000000000  00000000105bf000  000000001030bbd4  000000002fc26000
04-07   000000000000000f  0000000000000000  0000000000000000  0000000000000000
08-11   000000002fc26000  00000000105bf600  000000002fc50208  000000007f000000
12-15   000000002fc50210  00000000105ba000  0000000010544000  000000002fc2e628
16-19   000000001041d1dc  00000000f000017c  00000000f0000174  000000002fc50210
20-23   000000000209f184  000000000209f17f  000000001030bba0  0000000000000000
24-27   000000000000f424  000000002fc50210  000000000004a040  0000000010544000
28-31   000000000004a040  0000000000000000  000000002fc50400  000000001038ae40

Control Registers 0 - 31
00-03   0000000000000000  0000000000000000  0000000000000000  0000000000000000
04-07   0000000000000000  0000000000000000  0000000000000000  0000000000000000
08-11   000000000000006e  0000000000000000  00000000000000c0  000000000000003d
12-15   0000000000000000  0000000000000000  0000000000102000  00000000fe000000
16-19   000044dd642070fc  0000000000000000  0000000010284504  000000000f80109c
20-23   00000000a627ffd0  000000000204a040  000000ff0004fc0e  0000000080000000
24-27   0000000000594000  000000011df90000  00000000fffff5f7  00000000fffffdfe
28-31   00000000fffff7f4  00000000fffff7f6  000000002fc50000  00000000ffffdffe

Space Registers 0 - 7
00-03   00000000          00000000          00000000          00000037
04-07   00000000          00000000          00000000          00000000

IIA Space                    = 0x0000000000000000
IIA Offset                   = 0x0000000010284508
Check Type                   = 0x20000000
CPU State                    = 0x9e000004
Cache Check                  = 0x00000000
TLB Check                    = 0x00000000
Bus Check                    = 0x0030103b
Assists Check                = 0x00000000
Assist State                 = 0x00000000
Path Info                    = 0x00000000
System Responder Address     = 0x000000fff4008040
System Requestor Address     = 0xfffffffffffa0000

Floating-Point Registers 0 - 31
00-03   0000001f00000000  0000000000000000  0000000000000000  0000000000000000
04-07   41bf636000000000  41bf636000000000  00000002625a0000  0000000000000000
08-11   0000000000000000  1059900010544330  0000000000000000  105fbd602fde70c8
12-15   ffffffffad401040  ffffddb6f5fc38f8  fffffffffdfc38d0  fffffffff5fc3ad0
16-19   ffffff8effffffff  ffffffcff5fc3ad0  ffffffb3f1dc38c0  ffffffff21041800
20-23   ffffffffa5401040  fffffffff5fc38d0  0000000000000000  0000000100000000
24-27   0000000000000000  0000000000090a6e  0000000000000015  1029358c102c3a38
28-31   ffffffff0000313d  1055f1d010544000  0000000100000228  2fc302001011a234

'9000/785 B,C,J Workstation Unarchitected (per-CPU)', rev 1, 140 bytes:

Check Summary                = 0xcb81041008000000
Available Memory             = 0x0000000020000000
CPU Diagnose Register 2      = 0x0301000000000004
CPU Status Register 0        = 0x2420c20000000000
CPU Status Register 1        = 0x8002000000000000
SADD LOG                     = 0x4b023fd9e8190951
Read Short LOG               = 0xc1af00fff4008040
ERROR_STATUS                 = 0x0000000000100010
MEM_ADDR                     = 0x000001ff3fffffff
MEM_SYND                     = 0x0000000000000000
MEM_ADDR_CORR                = 0x000001ff3fffffff
MEM_SYND_CORR                = 0x0000000000000000
RUN_DATA_HIGH                = 0xc1bff0fffed08040
RUN_DATA_LOW                 = 0xc1bff0fffed08040
RUN_CTRL                     = 0x0000021c00001418
RUN_ADDR                     = 0xc1bff0fffed08040
System Responder Path        = 0x00ffffff0a000c00


HPMC PIM Analysis Information:

Timestamp =
  Fri Oct  14 12:18:23 GMT 2011    (20:11:10:14:12:18:23)


'9000/785 B,C,J Workstation HPMC PIM Analysis (per-CPU)', rev 0, 1304 bytes:

A Data I/O Fetch Timeout occurred while CPU 0 was
requesting information from a device at the path 10/0/12/0 (built-in PCI
device).


Memory/IO Controller Error Analysis Information:

The Memory/IO Controller only observed the Broadcast Error.  It did not log
any additional information about the HPMC.

-----------------  Processor 0 LPMC Information ------------------

Check Type                   = 0x00000000
I/D Cache Parity Info        = 0x00000000
Cache Check                  = 0x00000000
TLB Check                    = 0x00000000
Bus Check                    = 0x00000000
Assists Check                = 0x00000000
Assist State                 = 0x00000000
Path Info                    = 0x00000000
System Responder Address     = 0x0000000000000000
System Requestor Address     = 0x0000000000000000


-----------------  Processor 0 TOC Information -------------------

General Registers 0 - 31
00-03   0000000000000000  0000000000000000  0000000000000000  0000000000000000
04-07   0000000000000000  0000000000000000  0000000000000000  0000000000000000
08-11   0000000000000000  0000000000000000  0000000000000000  0000000000000000
12-15   0000000000000000  0000000000000000  0000000000000000  0000000000000000
16-19   0000000000000000  0000000000000000  0000000000000000  0000000000000000
20-23   0000000000000000  0000000000000000  0000000000000000  0000000000000000
24-27   0000000000000000  0000000000000000  0000000000000000  0000000000000000
28-31   0000000000000000  0000000000000000  0000000000000000  0000000000000000

Control Registers 0 - 31
00-03   0000000000000000  0000000000000000  0000000000000000  0000000000000000
04-07   0000000000000000  0000000000000000  0000000000000000  0000000000000000
08-11   0000000000000000  0000000000000000  0000000000000000  0000000000000000
12-15   0000000000000000  0000000000000000  0000000000000000  0000000000000000
16-19   0000000000000000  0000000000000000  0000000000000000  0000000000000000
20-23   0000000000000000  0000000000000000  0000000000000000  0000000000000000
24-27   0000000000000000  0000000000000000  0000000000000000  0000000000000000
28-31   0000000000000000  0000000000000000  0000000000000000  0000000000000000

Space Registers 0 - 7
00-03   00000000          00000000          00000000          00000000
04-07   00000000          00000000          00000000          00000000

IIA Space                    = 0x0000000000000000
IIA Offset                   = 0x0000000000000000
CPU State                    = 0x00000000


I/O Module Error Log Information:

Timestamp =
  Fri Oct  14 12:18:23 GMT 2011    (20:11:10:14:12:18:23)


'9000/785 B,C,J Workstation IO Error Log', rev 0, 228 bytes:

 Rope     Word1        Word2            Word3
------ ------------ ------------
   0    0x00000000   0x0e0cc2a9   0x00000000fed30048
   1    0x00000000   0x1e0cc009   0x00000000fed32048
   2    ----------   0x2e0cc009   ------------------
   3    ----------   0x3e0cc009   ------------------
   4    0x00000000   0x4e0cc009   0x00000000fed38048
   5    ----------   0x5e0cc009   ------------------
   6    0x00000000   0x6e0cc009   0x00000000fed3c048
   7    ----------   0x7e0cc009   ------------------

Eike


* Re: HPMC running CMake Nightly tests
  2011-10-17  7:18   ` Rolf Eike Beer
@ 2011-10-21  8:26     ` Rolf Eike Beer
  2011-10-26 16:16       ` Grant Grundler
  0 siblings, 1 reply; 6+ messages in thread
From: Rolf Eike Beer @ 2011-10-21  8:26 UTC (permalink / raw)
  To: linux-parisc

>> On Tue, Sep 27, 2011 at 09:32:37AM +0200, Rolf Eike Beer wrote:
>>> I'm running the CMake tests every night. This is the second time in a row
>>> that my C3600 did not survive this. Since I was warned I connected a
>>> serial console.
>> ...
>>
>>> But then the machine got killed:
>>>
>>> Backtrace:
>>>  [<1030b9ec>] tulip_get_stats+0x34/0x5c
>>>  [<1038ac20>] dev_get_stats+0x98/0xe8
>>>  [<102946b4>] led_work_func+0x11c/0x310
>>>  [<10145204>] process_one_work+0x120/0x3ac
>>>  [<10147110>] worker_thread+0x174/0x338
>>>  [<1014b0b4>] kthread+0x9c/0xa4
>>>  [<10102c5c>] ret_from_kernel_thread+0x1c/0x24
>>>
>>>
>>> High Priority Machine Check (HPMC): Code=1 regs=10551080 (Addr=00000000)
>>>
>>>      YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
>>> PSW: 00000000000001001111111100001110 Not tainted
>>> r00-03  0004ff0e 105bf000 1030b9ec 2fc72000
>>> r04-07  0000000f 00000000 00000000 00000000
>>> r08-11  2fc72000 105bf600 2fea4208 7f000000
>>> r12-15  2fea4210 105ba000 10544000 2fc2f408
>>> r16-19  1041d1dc f000017c f0000174 2fea4210
>>> r20-23  0099f055 0099f050 1030b9b8 00000000
>>> r24-27  2ff57008 2fea4210 0004a040 10544000
>>> r28-31  0004a040 f68e066d 2fea4400 1038ac20
>>> sr00-03  00000000 00000000 00000000 00000017
>>> sr04-07  00000000 00000000 00000000 00000000
>>>
>>> IASQ: 00000000 00000000 IAOQ: 10284394 10284398
>>>  IIR: 0f80109c    ISR: a627ffd0  IOR: 0204a040
>>>  CPU:        0   CR30: 2fea4000 CR31: ffffdffe
>>>  ORIG_R28: 00000000
>>>  IAOQ[0]: ioread32+0xc/0x4c
>>
>> Usually the HPMC means tulip tried to read something
>> from MMIO space that didn't respond and this
>> resulted in a "Master Abort" (PCI bus controller
>> had to abort the transaction). On PCs that's not
>> fatal but is on many RISC architectures.
>>
>> If you can decode the instruction pointer (ioread32+0x10) to figure out
>> which register is used to dereference the MMIO address, it would
>> be obvious what the offending address is - just to confirm the
>> pointer isn't pointing off into the weeds. It will be one of the
>> registers that contains a 0xfnnnnnnn address.
>
> I will have a look.
>
>> Interrupt the boot process and collect the HPMC dump as described:
>>    http://www.parisc-linux.org/faq/kernelbug-howto.html
>>
>> The output will include the offending address that the ioread32 was
>> trying to access to confirm the instruction was decoded correctly.
>> If anyone has access to the magic decoder ring, we might be able to tell more.

Ok, I have another one. I removed all those parts that did not show any
errors or where the register contents were all zeros.

Timestamp =
  Thu Oct  20 09:05:52 GMT 2011    (20:11:10:20:09:05:52)

HPMC Chassis Codes = 2cbf0  2500b  2cbfb

General Registers 0 - 31
00-03   0000000000000000  00000000105bf000  000000001030bbd4  000000002fe46000
04-07   000000000000000f  0000000000000000  0000000000000000  0000000000000008
08-11   000000002fe46000  00000000105bf600  000000002fec8208  000000007f000000
12-15   000000002fec8210  00000000105ba000  0000000010544000  000000002fc2f408
16-19   000000001041d1dc  00000000f000017c  00000000f0000174  000000002fec8210
20-23   000000000108ce00  000000000108cdf3  000000001030bba0  0000000000000000
24-27   000000000000f424  000000002fec8210  000000000004a040  0000000010544000
28-31   000000000004a040  0000000000000000  000000002fec8400  000000001038ae40

Control Registers 0 - 31
00-03   0000000000000000  0000000000000000  0000000000000000  0000000000000000
04-07   0000000000000000  0000000000000000  0000000000000000  0000000000000000
08-11   000000000000004e  0000000000000000  00000000000000c0  000000000000003d
12-15   0000000000000000  0000000000000000  0000000000102000  00000000fe000000
16-19   0000230bfe918584  0000000000000000  0000000010284504  000000000f80109c
20-23   00000000a627ffd0  000000000204a040  000000ff0006fc0e  0000000080000000
24-27   0000000000594000  000000011ec4a000  00000000ffffffff  00000000ffffffff
28-31   00000000ffffffff  00000000ffffffff  000000002fec8000  00000000ffffffff

Space Registers 0 - 7
00-03   00000000          00000000          00000000          00000027
04-07   00000000          00000000          00000000          00000000

IIA Space                    = 0x0000000000000000
IIA Offset                   = 0x0000000010284508
Check Type                   = 0x20000000
CPU State                    = 0x9e000004
Cache Check                  = 0x00000000
TLB Check                    = 0x00000000
Bus Check                    = 0x0030103b
Assists Check                = 0x00000000
Assist State                 = 0x00000000
Path Info                    = 0x00000000
System Responder Address     = 0x000000fff4008040
System Requestor Address     = 0xfffffffffffa0000

Floating-Point Registers 0 - 31
00-03   0000001f00000000  0000000000000000  0000000000000000  0000000000000000
04-07   0000000a00000000  0000000000000000  0000000000000000  0000000049ba5e35
08-11   0000000000000000  1059900010544330  0000000000000000  105fbd602fe470c8
12-15   ffffffff00000000  0000000000000000  0000000000000000  0000000000000000
16-19   95380000ffffffff  8008000000000000  0010000000000000  9118000000000000
20-23   8108000000000000  8008000000000000  0000000000000000  0000000100000000
24-27   0000000000000000  0000000000090a6e  0000000000000015  0000000000000000
28-31   ffffffff0000313c  1055f1d010544000  0000000100000228  2fc302001011a234

'9000/785 B,C,J Workstation Unarchitected (per-CPU)', rev 1, 140 bytes:

Check Summary                = 0xcb81041008000000
Available Memory             = 0x0000000020000000
CPU Diagnose Register 2      = 0x0301000000000004
CPU Status Register 0        = 0x2420c20000000000
CPU Status Register 1        = 0x8002000000000000
SADD LOG                     = 0x4b023fd9e8190951
Read Short LOG               = 0xc1af00fff4008040
ERROR_STATUS                 = 0x0000000000100010
MEM_ADDR                     = 0x000001ff3fffffff
MEM_SYND                     = 0x0000000000000000
MEM_ADDR_CORR                = 0x000001ff3fffffff
MEM_SYND_CORR                = 0x0000000000000000
RUN_DATA_HIGH                = 0xc1bff0fffed08040
RUN_DATA_LOW                 = 0xc1bff0fffed08040
RUN_CTRL                     = 0x0000021c00001418
RUN_ADDR                     = 0xc1bff0fffed08040
System Responder Path        = 0x00ffffff0a000c00


HPMC PIM Analysis Information:

Timestamp =
  Thu Oct  20 09:05:52 GMT 2011    (20:11:10:20:09:05:52)


'9000/785 B,C,J Workstation HPMC PIM Analysis (per-CPU)', rev 0, 1304 bytes:

A Data I/O Fetch Timeout occurred while CPU 0 was
requesting information from a device at the path 10/0/12/0 (built-in PCI
device).


I/O Module Error Log Information:

Timestamp =
  Thu Oct  20 09:05:52 GMT 2011    (20:11:10:20:09:05:52)


'9000/785 B,C,J Workstation IO Error Log', rev 0, 228 bytes:

 Rope     Word1        Word2            Word3
------ ------------ ------------
   0    0x00000000   0x0e0cc2a9   0x00000000fed30048
   1    0x00000000   0x1e0cc009   0x00000000fed32048
   2    ----------   0x2e0cc009   ------------------
   3    ----------   0x3e0cc009   ------------------
   4    0x00000000   0x4e0cc009   0x00000000fed38048
   5    ----------   0x5e0cc009   ------------------
   6    0x00000000   0x6e0cc009   0x00000000fed3c048
   7    ----------   0x7e0cc009   ------------------

Eike


* Re: HPMC running CMake Nightly tests
  2011-10-21  8:26     ` Rolf Eike Beer
@ 2011-10-26 16:16       ` Grant Grundler
  2011-10-26 17:54         ` HPMC on network load (was: HPMC running CMake Nightly tests) Rolf Eike Beer
  0 siblings, 1 reply; 6+ messages in thread
From: Grant Grundler @ 2011-10-26 16:16 UTC (permalink / raw)
  To: Rolf Eike Beer; +Cc: linux-parisc

On Fri, Oct 21, 2011 at 10:26:57AM +0200, Rolf Eike Beer wrote:
> Ok, I have another one. I removed all those parts that did not show any
> errors or where the register contents were all zeros.
> 
> Timestamp =
>   Thu Oct  20 09:05:52 GMT 2011    (20:11:10:20:09:05:52)
...
> System Responder Address     = 0x000000fff4008040

This is the MMIO address that wasn't responding. Note that it's 40 bits.
The 32-bit address used by the OS is "F-extended" by HW (CPU I think).
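
The widening can be reproduced with simple arithmetic. This is a minimal sketch under two assumptions that are not confirmed by the thread: that "F-extension" means filling the new upper bits with ones, and that it only applies to addresses in the 0xFnnnnnnn range. The function name `f_extend` and the `width` parameter are hypothetical.

```python
def f_extend(addr32, width=40):
    """Widen a 32-bit address by filling the new upper bits with ones."""
    if not 0 <= addr32 <= 0xFFFFFFFF:
        raise ValueError("expected a 32-bit address")
    if addr32 >> 28 != 0xF:
        # Assumption: only 0xFnnnnnnn (I/O range) addresses are extended.
        return addr32
    upper = (1 << (width - 32)) - 1
    return (upper << 32) | addr32


# Low 32 bits of the System Responder Address from the PIM dump.
extended = f_extend(0xF4008040)
print(hex(extended))
```

Applied to 0xf4008040 this gives 0xfff4008040, which matches the System Responder Address in the dump.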


> System Requestor Address     = 0xfffffffffffa0000

Address of CPU that was requesting the MMIO address.

This is enough info to identify what I believe is the "victim".
It's not likely to be the root cause.

Historically, this type of HPMC happens because a device
attempted to DMA to an unmapped address and the IOMMU
went "fatal" (stopped routing traffic to PCI busses).


> '9000/785 B,C,J Workstation Unarchitected (per-CPU)', rev 1, 140 bytes:
> 
> Check Summary                = 0xcb81041008000000
> Available Memory             = 0x0000000020000000
> CPU Diagnose Register 2      = 0x0301000000000004
> CPU Status Register 0        = 0x2420c20000000000
> CPU Status Register 1        = 0x8002000000000000
> SADD LOG                     = 0x4b023fd9e8190951
> Read Short LOG               = 0xc1af00fff4008040
> ERROR_STATUS                 = 0x0000000000100010
> MEM_ADDR                     = 0x000001ff3fffffff
> MEM_SYND                     = 0x0000000000000000
> MEM_ADDR_CORR                = 0x000001ff3fffffff
> MEM_SYND_CORR                = 0x0000000000000000
> RUN_DATA_HIGH                = 0xc1bff0fffed08040
> RUN_DATA_LOW                 = 0xc1bff0fffed08040
> RUN_CTRL                     = 0x0000021c00001418
> RUN_ADDR                     = 0xc1bff0fffed08040
> System Responder Path        = 0x00ffffff0a000c00

This part could yield another clue if we had the magic decoder ring. :(


> HPMC PIM Analysis Information:
> 
> Timestamp =
>   Thu Oct  20 09:05:52 GMT 2011    (20:11:10:20:09:05:52)
> 
> 
> '9000/785 B,C,J Workstation HPMC PIM Analysis (per-CPU)', rev 0, 1304 bytes:
> 
> A Data I/O Fetch Timeout occurred while CPU 0 was
> requesting information from a device at the path 10/0/12/0 (built-in PCI
> device).

Doing "in io" at the BCH prompt should list all devices including 10/0/12/0
Google search is failing to find a posting with that content. :/


> '9000/785 B,C,J Workstation IO Error Log', rev 0, 228 bytes:
> 
>  Rope     Word1        Word2            Word3
> ------ ------------ ------------
>    0    0x00000000   0x0e0cc2a9   0x00000000fed30048
>    1    0x00000000   0x1e0cc009   0x00000000fed32048
>    2    ----------   0x2e0cc009   ------------------
>    3    ----------   0x3e0cc009   ------------------
>    4    0x00000000   0x4e0cc009   0x00000000fed38048
>    5    ----------   0x5e0cc009   ------------------
>    6    0x00000000   0x6e0cc009   0x00000000fed3c048
>    7    ----------   0x7e0cc009   ------------------

"HP c3750 | hp workstation c3700 and c3650 - service handbook" in a 
couple of different places says:
 "I/O Error log word 3 contains the error address"

I'm assuming this is just the last accessed address by that PCI bus.
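
A quick arithmetic check on the Word3 column: the logged values are evenly spaced by rope number. The sketch below only tests that observation; the base and stride are read off the log itself, not taken from any documentation, and the function name `rope_stride_matches` is hypothetical.

```python
# Word3 values copied from the IO Error Log, keyed by rope number.
word3 = {0: 0xFED30048, 1: 0xFED32048, 4: 0xFED38048, 6: 0xFED3C048}


def rope_stride_matches(log, base=0xFED30048, stride=0x2000):
    """True if every logged address equals base + rope * stride."""
    return all(addr == base + rope * stride for rope, addr in log.items())


consistent = rope_stride_matches(word3)
print(consistent)
```

If the check holds, the values look more like one fixed offset inside each rope's own register block than like addresses that depend on what the devices were doing.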

cheers,
grant


* HPMC on network load (was: HPMC running CMake Nightly tests)
  2011-10-26 16:16       ` Grant Grundler
@ 2011-10-26 17:54         ` Rolf Eike Beer
  0 siblings, 0 replies; 6+ messages in thread
From: Rolf Eike Beer @ 2011-10-26 17:54 UTC (permalink / raw)
  To: linux-parisc


Grant Grundler wrote:

> > HPMC PIM Analysis Information:
> > 
> > Timestamp =
> > 
> >   Thu Oct  20 09:05:52 GMT 2011    (20:11:10:20:09:05:52)
> > 
> > '9000/785 B,C,J Workstation HPMC PIM Analysis (per-CPU)', rev 0, 1304 bytes:
> > 
> > A Data I/O Fetch Timeout occurred while CPU 0 was
> > requesting information from a device at the path 10/0/12/0 (built-in PCI
> > device).
> 
> Doing "in io" at the BCH prompt should list all devices including 10/0/12/0
> Google search is failing to find a posting with that content. :/

IIRC it is the network card.

The last time I saw this was during "emerge --sync", which was hours away from 
the nightly CMake run. Since all traces point at the network card I think this 
really has nothing to do with CMake or CPU load at all.

Eike

