* HPMC running CMake Nightly tests
@ 2011-09-27 7:32 Rolf Eike Beer
2011-10-12 4:32 ` Grant Grundler
0 siblings, 1 reply; 6+ messages in thread
From: Rolf Eike Beer @ 2011-09-27 7:32 UTC
To: linux-parisc
I'm running the CMake tests every night. This is the second time in a row
that my C3600 did not survive this. Since I was warned I connected a
serial console.
The first entries are expected crashes from CMake, since detecting a crashed
process is part of the tests. I wonder whether these should be silenced, as a
userspace crash could otherwise be used far too easily to flood the logs.
do_page_fault() pid=16799 command='kwsysTestProces' type=15
address=0x00000000
YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
PSW: 00000000000001101111111100001111 Not tainted
r00-03 0006ff0f 1041b000 00011ee3 fb46f4c0
r04-07 4072adc0 00000000 fb36b02c 0001cd60
r08-11 00000000 000e6e20 00000000 00000000
r12-15 00000000 000e59c0 000e2d24 ffffffff
r16-19 000e2d14 000e3e20 00000000 4072adc0
r20-23 1054f020 00000000 406427e4 ffffffff
r24-27 fffffff5 ffffffd3 4072cd24 0001b0e4
r28-31 00000000 00000001 fb46f500 4063192b
sr00-03 00000030 00000017 00000000 00000030
sr04-07 00000030 00000030 00000030 00000030
VZOUICununcqcqcqcqcqcrmunTDVZOUI
FPSR: 00000000000000000000000000000000
FPER1: 00000000
fr00-03 0000000000000000 0000000000000000 0000000000000000 0000000000000000
fr04-07 41d3a02318ce6e4c 2800000000000000 0000000190000000 402e000000000000
fr08-11 000000601be80000 1059900010544330 0000000000000000 105fbd602fc260c8
fr12-15 41d3a01d2d9ab424 0000008110131d00 105b68001055eca8 1055eca810560cf0
fr16-19 0004000f1055f000 10131d0000000003 105b48c0105b48f4 000000371055f1e8
fr20-23 2fc30158105a4800 3b9aca001055f000 000000000000002f 00001c7418c00000
fr24-27 410b865000000000 3fe0000000000000 412e848000000000 1029341c102c3848
fr28-31 ffffffff000032a4 1055f1d010544000 0000000100000228 2fc302001011a264
IASQ: 00000030 00000030 IAOQ: 00011ee7 00011eeb
IIR: 0f801280 ISR: 00000030 IOR: 00000000
CPU: 0 CR30: 2ed5c000 CR31: ffffdffe
ORIG_R28: 00000000
IAOQ[0]: 00011ee7
IAOQ[1]: 00011eeb
RP(r2): 00011ee3
do_page_fault() pid=16827 command='kwsysTestProces' type=15
address=0x00000000
YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
PSW: 00000000000001101111111100001111 Not tainted
r00-03 0006ff0f 1041b000 00011ee3 fb2a74c0
r04-07 4072adc0 00000000 fb61702c 0001cd60
r08-11 00000000 000e6e20 00000000 00000000
r12-15 00000000 000e59c0 000e2d24 ffffffff
r16-19 000e2d14 000e3e20 00000000 4072adc0
r20-23 1054f020 00000000 406427e4 ffffffff
r24-27 fffffff5 ffffffd3 4072cd24 0001b0e4
r28-31 00000000 00000001 fb2a7500 4063192b
sr00-03 00000032 00000017 00000000 00000032
sr04-07 00000032 00000032 00000032 00000032
VZOUICununcqcqcqcqcqcrmunTDVZOUI
FPSR: 00000000000000000000000000000000
FPER1: 00000000
fr00-03 0000000000000000 0000000000000000 0000000000000000 0000000000000000
fr04-07 41d3a02319c7bbbf 2800000000000000 0000000190000000 403e000000000000
fr08-11 0000005c3e100000 1059900010544330 0000000000000000 105fbd602fc260c8
fr12-15 41d3a01d2d9ab424 0000008110131d00 105b68001055eca8 1055eca810560cf0
fr16-19 0004000f1055f000 10131d0000000003 105b48c0105b48f4 000000371055f1e8
fr20-23 2fc30158105a4800 3b9aca001055f000 000000000000002f 00001c7419c00000
fr24-27 40fd802000000000 3fe0000000000000 412e848000000000 1029341c102c3848
fr28-31 ffffffff000032a4 1055f1d010544000 0000000100000228 2fc302001011a264
IASQ: 00000032 00000032 IAOQ: 00011ee7 00011eeb
IIR: 0f801280 ISR: 00000032 IOR: 00000000
CPU: 0 CR30: 2095c000 CR31: ffffdffe
ORIG_R28: 00000000
IAOQ[0]: 00011ee7
IAOQ[1]: 00011eeb
RP(r2): 00011ee3
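The letter row printed above each PSW line labels the PSW bits column by column. As a minimal sketch (the helper name `psw_flags` is made up for illustration, and the column-by-column pairing is assumed from the way the two lines are printed), the flags that are set can be read off like this:

```python
# Label row copied from the dumps above; one label per PSW bit column.
PSW_LABELS = "YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI"


def psw_flags(bits: str) -> list[str]:
    """Return the labels whose column holds a '1' in the binary PSW string."""
    if len(bits) != len(PSW_LABELS):
        raise ValueError("expected a 32-character binary string")
    return [label for label, bit in zip(PSW_LABELS, bits) if bit == "1"]


# PSW value from the first userspace crash dump.
user_crash = psw_flags("00000000000001101111111100001111")
print("".join(user_crash))
```

Feeding it the PSW string from any later dump allows a quick comparison of which bits differ.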
But then the machine got killed:
Backtrace:
[<1030b9ec>] tulip_get_stats+0x34/0x5c
[<1038ac20>] dev_get_stats+0x98/0xe8
[<102946b4>] led_work_func+0x11c/0x310
[<10145204>] process_one_work+0x120/0x3ac
[<10147110>] worker_thread+0x174/0x338
[<1014b0b4>] kthread+0x9c/0xa4
[<10102c5c>] ret_from_kernel_thread+0x1c/0x24
High Priority Machine Check (HPMC): Code=1 regs=10551080 (Addr=00000000)
YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
PSW: 00000000000001001111111100001110 Not tainted
r00-03 0004ff0e 105bf000 1030b9ec 2fc72000
r04-07 0000000f 00000000 00000000 00000000
r08-11 2fc72000 105bf600 2fea4208 7f000000
r12-15 2fea4210 105ba000 10544000 2fc2f408
r16-19 1041d1dc f000017c f0000174 2fea4210
r20-23 0099f055 0099f050 1030b9b8 00000000
r24-27 2ff57008 2fea4210 0004a040 10544000
r28-31 0004a040 f68e066d 2fea4400 1038ac20
sr00-03 00000000 00000000 00000000 00000017
sr04-07 00000000 00000000 00000000 00000000
IASQ: 00000000 00000000 IAOQ: 10284394 10284398
IIR: 0f80109c ISR: a627ffd0 IOR: 0204a040
CPU: 0 CR30: 2fea4000 CR31: ffffdffe
ORIG_R28: 00000000
IAOQ[0]: ioread32+0xc/0x4c
IAOQ[1]: ioread32+0x10/0x4c
RP(r2): tulip_get_stats+0x34/0x5c
Backtrace:
[<1030b9ec>] tulip_get_stats+0x34/0x5c
[<1038ac20>] dev_get_stats+0x98/0xe8
[<102946b4>] led_work_func+0x11c/0x310
[<10145204>] process_one_work+0x120/0x3ac
[<10147110>] worker_thread+0x174/0x338
[<1014b0b4>] kthread+0x9c/0xa4
[<10102c5c>] ret_from_kernel_thread+0x1c/0x24
Kernel panic - not syncing: High Priority Machine Check (HPMC)
Backtrace:
[<1010edec>] panic+0x90/0x23c
[<101143b8>] parisc_terminate+0xbc/0xd4
[<1011458c>] handle_interruption+0x1bc/0x718
[<10103078>] intr_check_sig+0x0/0x34
[<10284398>] ioread32+0x10/0x4c
[<103e8fc0>] bictcp_acked+0x0/0x228
I'm running 3.0.4 with d7dd2ff11b7fcd425aca5a875983c862d19a67ae reverted.
Any hints?
Eike
* Re: HPMC running CMake Nightly tests
2011-09-27 7:32 HPMC running CMake Nightly tests Rolf Eike Beer
@ 2011-10-12 4:32 ` Grant Grundler
2011-10-17 7:18 ` Rolf Eike Beer
0 siblings, 1 reply; 6+ messages in thread
From: Grant Grundler @ 2011-10-12 4:32 UTC
To: Rolf Eike Beer; +Cc: linux-parisc
On Tue, Sep 27, 2011 at 09:32:37AM +0200, Rolf Eike Beer wrote:
> I'm running the CMake tests every night. This is the second time in a row
> that my C3600 did not survive this. Since I was warned I connected a
> serial console.
...
> But then the machine got killed:
>
> Backtrace:
> [<1030b9ec>] tulip_get_stats+0x34/0x5c
> [<1038ac20>] dev_get_stats+0x98/0xe8
> [<102946b4>] led_work_func+0x11c/0x310
> [<10145204>] process_one_work+0x120/0x3ac
> [<10147110>] worker_thread+0x174/0x338
> [<1014b0b4>] kthread+0x9c/0xa4
> [<10102c5c>] ret_from_kernel_thread+0x1c/0x24
>
>
> High Priority Machine Check (HPMC): Code=1 regs=10551080 (Addr=00000000)
>
> YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
> PSW: 00000000000001001111111100001110 Not tainted
> r00-03 0004ff0e 105bf000 1030b9ec 2fc72000
> r04-07 0000000f 00000000 00000000 00000000
> r08-11 2fc72000 105bf600 2fea4208 7f000000
> r12-15 2fea4210 105ba000 10544000 2fc2f408
> r16-19 1041d1dc f000017c f0000174 2fea4210
> r20-23 0099f055 0099f050 1030b9b8 00000000
> r24-27 2ff57008 2fea4210 0004a040 10544000
> r28-31 0004a040 f68e066d 2fea4400 1038ac20
> sr00-03 00000000 00000000 00000000 00000017
> sr04-07 00000000 00000000 00000000 00000000
>
> IASQ: 00000000 00000000 IAOQ: 10284394 10284398
> IIR: 0f80109c ISR: a627ffd0 IOR: 0204a040
> CPU: 0 CR30: 2fea4000 CR31: ffffdffe
> ORIG_R28: 00000000
> IAOQ[0]: ioread32+0xc/0x4c
Usually an HPMC like this means tulip tried to read something
from MMIO space that didn't respond, and this
resulted in a "Master Abort" (the PCI bus controller
had to abort the transaction). On PCs that's not
fatal, but it is on many RISC architectures.
If you can decode the instruction pointer (ioread32+0x10) to figure out
which register is used to dereference the MMIO address, it would
be obvious what the offending address is - just to confirm the
pointer isn't pointing off into the weeds. It will be one of the
registers that contain a 0xfnnnnnnn address.
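One way to attempt that decoding without a disassembler is to split the IIR value from the dump (0f80109c) into bit fields. The sketch below assumes the PA-RISC 1.1 layout for indexed and short-displacement memory instructions (6-bit opcode, 5-bit base, 5-bit index or immediate, 2-bit space, 4-bit extension, 5-bit target); that layout and the helper name `decode_short_load` are assumptions for illustration, not something taken from this thread, and should be checked against the architecture manual or objdump:

```python
def decode_short_load(iir: int) -> dict:
    """Split a 32-bit instruction word into the fields of the assumed
    indexed/short-displacement memory format.

    PA-RISC numbers bits from the most significant end, so bit 0 is the MSB.
    """
    def field(first: int, last: int) -> int:
        # Extract bits first..last (inclusive, MSB-first numbering).
        width = last - first + 1
        return (iir >> (31 - last)) & ((1 << width) - 1)

    return {
        "opcode": field(0, 5),
        "base": field(6, 10),          # general register holding the address
        "im5_or_index": field(11, 15),
        "space": field(16, 17),
        "short_form": field(19, 19),   # 1 = short displacement (assumed)
        "ext4": field(22, 25),         # 2 would be a word load (assumed)
        "target": field(27, 31),       # register receiving the loaded value
    }


fields = decode_short_load(0x0F80109C)  # IIR from the HPMC dump
for name, value in fields.items():
    print(f"{name:13s} = {value}")
```

If the layout assumption holds, the `base` field names the general register whose value should be looked up in the register dump.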
It's also possible that something earlier had already upset the SBA
("System Bus Adapter": it contains the IOMMU and the memory controller)
by trying to DMA to an unmapped address. The SBA goes "fatal"
at that point, and the next MMIO read causes the CPU to
recognize the fatal state of the SBA. Decoding the HPMC (see
below) can help determine that.
> IAOQ[1]: ioread32+0x10/0x4c
> RP(r2): tulip_get_stats+0x34/0x5c
> Backtrace:
> [<1030b9ec>] tulip_get_stats+0x34/0x5c
> [<1038ac20>] dev_get_stats+0x98/0xe8
> [<102946b4>] led_work_func+0x11c/0x310
> [<10145204>] process_one_work+0x120/0x3ac
> [<10147110>] worker_thread+0x174/0x338
> [<1014b0b4>] kthread+0x9c/0xa4
> [<10102c5c>] ret_from_kernel_thread+0x1c/0x24
>
> Kernel panic - not syncing: High Priority Machine Check (HPMC)
> Backtrace:
> [<1010edec>] panic+0x90/0x23c
> [<101143b8>] parisc_terminate+0xbc/0xd4
> [<1011458c>] handle_interruption+0x1bc/0x718
> [<10103078>] intr_check_sig+0x0/0x34
> [<10284398>] ioread32+0x10/0x4c
> [<103e8fc0>] bictcp_acked+0x0/0x228
>
> I'm running 3.0.4 with d7dd2ff11b7fcd425aca5a875983c862d19a67ae reverted.
>
> Any hints?
Interrupt the boot process and collect the HPMC dump as described:
http://www.parisc-linux.org/faq/kernelbug-howto.html
The output will include the offending address that the ioread32 was
trying to access to confirm the instruction was decoded correctly.
If anyone has access to the magic decoder ring, we might be able to tell more.
cheers,
grant
* Re: HPMC running CMake Nightly tests
2011-10-12 4:32 ` Grant Grundler
@ 2011-10-17 7:18 ` Rolf Eike Beer
2011-10-21 8:26 ` Rolf Eike Beer
0 siblings, 1 reply; 6+ messages in thread
From: Rolf Eike Beer @ 2011-10-17 7:18 UTC
To: linux-parisc
> On Tue, Sep 27, 2011 at 09:32:37AM +0200, Rolf Eike Beer wrote:
>> I'm running the CMake tests every night. This is the second time in a
>> row
>> that my C3600 did not survive this. Since I was warned I connected a
>> serial console.
> ...
>
>> But then the machine got killed:
>>
>> Backtrace:
>> [<1030b9ec>] tulip_get_stats+0x34/0x5c
>> [<1038ac20>] dev_get_stats+0x98/0xe8
>> [<102946b4>] led_work_func+0x11c/0x310
>> [<10145204>] process_one_work+0x120/0x3ac
>> [<10147110>] worker_thread+0x174/0x338
>> [<1014b0b4>] kthread+0x9c/0xa4
>> [<10102c5c>] ret_from_kernel_thread+0x1c/0x24
>>
>>
>> High Priority Machine Check (HPMC): Code=1 regs=10551080 (Addr=00000000)
>>
>> YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
>> PSW: 00000000000001001111111100001110 Not tainted
>> r00-03 0004ff0e 105bf000 1030b9ec 2fc72000
>> r04-07 0000000f 00000000 00000000 00000000
>> r08-11 2fc72000 105bf600 2fea4208 7f000000
>> r12-15 2fea4210 105ba000 10544000 2fc2f408
>> r16-19 1041d1dc f000017c f0000174 2fea4210
>> r20-23 0099f055 0099f050 1030b9b8 00000000
>> r24-27 2ff57008 2fea4210 0004a040 10544000
>> r28-31 0004a040 f68e066d 2fea4400 1038ac20
>> sr00-03 00000000 00000000 00000000 00000017
>> sr04-07 00000000 00000000 00000000 00000000
>>
>> IASQ: 00000000 00000000 IAOQ: 10284394 10284398
>> IIR: 0f80109c ISR: a627ffd0 IOR: 0204a040
>> CPU: 0 CR30: 2fea4000 CR31: ffffdffe
>> ORIG_R28: 00000000
>> IAOQ[0]: ioread32+0xc/0x4c
>
> Usually the HPMC means tulip tried to read something
> from MMIO space that didn't respond and this
> resulted in a "Master Abort" (PCI bus controller
> had to abort the transaction). On PCs that's not
> fatal but is on many RISC architectures.
>
> If you can decode the instruction pointer (ioread32+0x10) to figure out
> which register is used to dereference the MMIO address, it would
> be obvious what the offending address is - just to confirm the
> pointer isn't pointing off into the weeds. It will be one of the
> registers that contains a 0xfnnnnnnn address.
I will have a look.
> Interrupt the boot process and collect the HPMC dump as described:
> http://www.parisc-linux.org/faq/kernelbug-howto.html
>
> The output will include the offending address that the ioread32 was
> trying to access to confirm the instruction was decoded correctly.
> If anyone has access to the magic decoder ring, we might be able to tell
> more.
----------------- Processor 0 HPMC Information ------------------
Timestamp =
Fri Oct 14 12:18:23 GMT 2011 (20:11:10:14:12:18:23)
HPMC Chassis Codes = 2cbf0 2500b 2cbfb
General Registers 0 - 31
00-03 0000000000000000 00000000105bf000 000000001030bbd4
000000002fc26000
04-07 000000000000000f 0000000000000000 0000000000000000
0000000000000000
08-11 000000002fc26000 00000000105bf600 000000002fc50208
000000007f000000
12-15 000000002fc50210 00000000105ba000 0000000010544000
000000002fc2e628
16-19 000000001041d1dc 00000000f000017c 00000000f0000174
000000002fc50210
20-23 000000000209f184 000000000209f17f 000000001030bba0
0000000000000000
24-27 000000000000f424 000000002fc50210 000000000004a040
0000000010544000
28-31 000000000004a040 0000000000000000 000000002fc50400
000000001038ae40
Control Registers 0 - 31
00-03 0000000000000000 0000000000000000 0000000000000000
0000000000000000
04-07 0000000000000000 0000000000000000 0000000000000000
0000000000000000
08-11 000000000000006e 0000000000000000 00000000000000c0
000000000000003d
12-15 0000000000000000 0000000000000000 0000000000102000
00000000fe000000
16-19 000044dd642070fc 0000000000000000 0000000010284504
000000000f80109c
20-23 00000000a627ffd0 000000000204a040 000000ff0004fc0e
0000000080000000
24-27 0000000000594000 000000011df90000 00000000fffff5f7
00000000fffffdfe
28-31 00000000fffff7f4 00000000fffff7f6 000000002fc50000
00000000ffffdffe
Space Registers 0 - 7
00-03 00000000 00000000 00000000 00000037
04-07 00000000 00000000 00000000 00000000
IIA Space = 0x0000000000000000
IIA Offset = 0x0000000010284508
Check Type = 0x20000000
CPU State = 0x9e000004
Cache Check = 0x00000000
TLB Check = 0x00000000
Bus Check = 0x0030103b
Assists Check = 0x00000000
Assist State = 0x00000000
Path Info = 0x00000000
System Responder Address = 0x000000fff4008040
System Requestor Address = 0xfffffffffffa0000
Floating-Point Registers 0 - 31
00-03 0000001f00000000 0000000000000000 0000000000000000
0000000000000000
04-07 41bf636000000000 41bf636000000000 00000002625a0000
0000000000000000
08-11 0000000000000000 1059900010544330 0000000000000000
105fbd602fde70c8
12-15 ffffffffad401040 ffffddb6f5fc38f8 fffffffffdfc38d0
fffffffff5fc3ad0
16-19 ffffff8effffffff ffffffcff5fc3ad0 ffffffb3f1dc38c0
ffffffff21041800
20-23 ffffffffa5401040 fffffffff5fc38d0 0000000000000000
0000000100000000
24-27 0000000000000000 0000000000090a6e 0000000000000015
1029358c102c3a38
28-31 ffffffff0000313d 1055f1d010544000 0000000100000228
2fc302001011a234
'9000/785 B,C,J Workstation Unarchitected (per-CPU)', rev 1, 140 bytes:
Check Summary = 0xcb81041008000000
Available Memory = 0x0000000020000000
CPU Diagnose Register 2 = 0x0301000000000004
CPU Status Register 0 = 0x2420c20000000000
CPU Status Register 1 = 0x8002000000000000
SADD LOG = 0x4b023fd9e8190951
Read Short LOG = 0xc1af00fff4008040
ERROR_STATUS = 0x0000000000100010
MEM_ADDR = 0x000001ff3fffffff
MEM_SYND = 0x0000000000000000
MEM_ADDR_CORR = 0x000001ff3fffffff
MEM_SYND_CORR = 0x0000000000000000
RUN_DATA_HIGH = 0xc1bff0fffed08040
RUN_DATA_LOW = 0xc1bff0fffed08040
RUN_CTRL = 0x0000021c00001418
RUN_ADDR = 0xc1bff0fffed08040
System Responder Path = 0x00ffffff0a000c00
HPMC PIM Analysis Information:
Timestamp =
Fri Oct 14 12:18:23 GMT 2011 (20:11:10:14:12:18:23)
'9000/785 B,C,J Workstation HPMC PIM Analysis (per-CPU)', rev 0, 1304 bytes:
A Data I/O Fetch Timeout occurred while CPU 0 was
requesting information from a device at the path 10/0/12/0 (built-in PCI
device).
Memory/IO Controller Error Analysis Information:
The Memory/IO Controller only observed the Broadcast Error. It did not log
any additional information about the HPMC.
----------------- Processor 0 LPMC Information ------------------
Check Type = 0x00000000
I/D Cache Parity Info = 0x00000000
Cache Check = 0x00000000
TLB Check = 0x00000000
Bus Check = 0x00000000
Assists Check = 0x00000000
Assist State = 0x00000000
Path Info = 0x00000000
System Responder Address = 0x0000000000000000
System Requestor Address = 0x0000000000000000
----------------- Processor 0 TOC Information -------------------
General Registers 0 - 31
00-03 0000000000000000 0000000000000000 0000000000000000
0000000000000000
04-07 0000000000000000 0000000000000000 0000000000000000
0000000000000000
08-11 0000000000000000 0000000000000000 0000000000000000
0000000000000000
12-15 0000000000000000 0000000000000000 0000000000000000
0000000000000000
16-19 0000000000000000 0000000000000000 0000000000000000
0000000000000000
20-23 0000000000000000 0000000000000000 0000000000000000
0000000000000000
24-27 0000000000000000 0000000000000000 0000000000000000
0000000000000000
28-31 0000000000000000 0000000000000000 0000000000000000
0000000000000000
Control Registers 0 - 31
00-03 0000000000000000 0000000000000000 0000000000000000
0000000000000000
04-07 0000000000000000 0000000000000000 0000000000000000
0000000000000000
08-11 0000000000000000 0000000000000000 0000000000000000
0000000000000000
12-15 0000000000000000 0000000000000000 0000000000000000
0000000000000000
16-19 0000000000000000 0000000000000000 0000000000000000
0000000000000000
20-23 0000000000000000 0000000000000000 0000000000000000
0000000000000000
24-27 0000000000000000 0000000000000000 0000000000000000
0000000000000000
28-31 0000000000000000 0000000000000000 0000000000000000
0000000000000000
Space Registers 0 - 7
00-03 00000000 00000000 00000000 00000000
04-07 00000000 00000000 00000000 00000000
IIA Space = 0x0000000000000000
IIA Offset = 0x0000000000000000
CPU State = 0x00000000
I/O Module Error Log Information:
Timestamp =
Fri Oct 14 12:18:23 GMT 2011 (20:11:10:14:12:18:23)
'9000/785 B,C,J Workstation IO Error Log', rev 0, 228 bytes:
Rope Word1 Word2 Word3
------ ------------ ------------
0 0x00000000 0x0e0cc2a9 0x00000000fed30048
1 0x00000000 0x1e0cc009 0x00000000fed32048
2 ---------- 0x2e0cc009 ------------------
3 ---------- 0x3e0cc009 ------------------
4 0x00000000 0x4e0cc009 0x00000000fed38048
5 ---------- 0x5e0cc009 ------------------
6 0x00000000 0x6e0cc009 0x00000000fed3c048
7 ---------- 0x7e0cc009 ------------------
Eike
* Re: HPMC running CMake Nightly tests
2011-10-17 7:18 ` Rolf Eike Beer
@ 2011-10-21 8:26 ` Rolf Eike Beer
2011-10-26 16:16 ` Grant Grundler
0 siblings, 1 reply; 6+ messages in thread
From: Rolf Eike Beer @ 2011-10-21 8:26 UTC
To: linux-parisc
>> On Tue, Sep 27, 2011 at 09:32:37AM +0200, Rolf Eike Beer wrote:
>>> I'm running the CMake tests every night. This is the second time in a
>>> row
>>> that my C3600 did not survive this. Since I was warned I connected a
>>> serial console.
>> ...
>>
>>> But then the machine got killed:
>>>
>>> Backtrace:
>>> [<1030b9ec>] tulip_get_stats+0x34/0x5c
>>> [<1038ac20>] dev_get_stats+0x98/0xe8
>>> [<102946b4>] led_work_func+0x11c/0x310
>>> [<10145204>] process_one_work+0x120/0x3ac
>>> [<10147110>] worker_thread+0x174/0x338
>>> [<1014b0b4>] kthread+0x9c/0xa4
>>> [<10102c5c>] ret_from_kernel_thread+0x1c/0x24
>>>
>>>
>>> High Priority Machine Check (HPMC): Code=1 regs=10551080
>>> (Addr=00000000)
>>>
>>> YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
>>> PSW: 00000000000001001111111100001110 Not tainted
>>> r00-03 0004ff0e 105bf000 1030b9ec 2fc72000
>>> r04-07 0000000f 00000000 00000000 00000000
>>> r08-11 2fc72000 105bf600 2fea4208 7f000000
>>> r12-15 2fea4210 105ba000 10544000 2fc2f408
>>> r16-19 1041d1dc f000017c f0000174 2fea4210
>>> r20-23 0099f055 0099f050 1030b9b8 00000000
>>> r24-27 2ff57008 2fea4210 0004a040 10544000
>>> r28-31 0004a040 f68e066d 2fea4400 1038ac20
>>> sr00-03 00000000 00000000 00000000 00000017
>>> sr04-07 00000000 00000000 00000000 00000000
>>>
>>> IASQ: 00000000 00000000 IAOQ: 10284394 10284398
>>> IIR: 0f80109c ISR: a627ffd0 IOR: 0204a040
>>> CPU: 0 CR30: 2fea4000 CR31: ffffdffe
>>> ORIG_R28: 00000000
>>> IAOQ[0]: ioread32+0xc/0x4c
>>
>> Usually the HPMC means tulip tried to read something
>> from MMIO space that didn't respond and this
>> resulted in a "Master Abort" (PCI bus controller
>> had to abort the transaction). On PCs that's not
>> fatal but is on many RISC architectures.
>>
>> If you can decode the instruction pointer (ioread32+0x10) to figure out
>> which register is used to dereference the MMIO address, it would
>> be obvious what the offending address is - just to confirm the
>> pointer isn't pointing off into the weeds. It will be one of the
>> registers that contains a 0xfnnnnnnn address.
>
> I will have a look.
>
>> Interrupt the boot process and collect the HPMC dump as described:
>> http://www.parisc-linux.org/faq/kernelbug-howto.html
>>
>> The output will include the offending address that the ioread32 was
>> trying to access to confirm the instruction was decoded correctly.
>> If anyone has access to the magic decoder ring, we might be able to tell
>> more.
Ok, I have another one. I removed all those parts that did not show any
errors or where the register contents were all zeros.
Timestamp =
Thu Oct 20 09:05:52 GMT 2011 (20:11:10:20:09:05:52)
HPMC Chassis Codes = 2cbf0 2500b 2cbfb
General Registers 0 - 31
00-03 0000000000000000 00000000105bf000 000000001030bbd4
000000002fe46000
04-07 000000000000000f 0000000000000000 0000000000000000
0000000000000008
08-11 000000002fe46000 00000000105bf600 000000002fec8208
000000007f000000
12-15 000000002fec8210 00000000105ba000 0000000010544000
000000002fc2f408
16-19 000000001041d1dc 00000000f000017c 00000000f0000174
000000002fec8210
20-23 000000000108ce00 000000000108cdf3 000000001030bba0
0000000000000000
24-27 000000000000f424 000000002fec8210 000000000004a040
0000000010544000
28-31 000000000004a040 0000000000000000 000000002fec8400
000000001038ae40
Control Registers 0 - 31
00-03 0000000000000000 0000000000000000 0000000000000000
0000000000000000
04-07 0000000000000000 0000000000000000 0000000000000000
0000000000000000
08-11 000000000000004e 0000000000000000 00000000000000c0
000000000000003d
12-15 0000000000000000 0000000000000000 0000000000102000
00000000fe000000
16-19 0000230bfe918584 0000000000000000 0000000010284504
000000000f80109c
20-23 00000000a627ffd0 000000000204a040 000000ff0006fc0e
0000000080000000
24-27 0000000000594000 000000011ec4a000 00000000ffffffff
00000000ffffffff
28-31 00000000ffffffff 00000000ffffffff 000000002fec8000
00000000ffffffff
Space Registers 0 - 7
00-03 00000000 00000000 00000000 00000027
04-07 00000000 00000000 00000000 00000000
IIA Space = 0x0000000000000000
IIA Offset = 0x0000000010284508
Check Type = 0x20000000
CPU State = 0x9e000004
Cache Check = 0x00000000
TLB Check = 0x00000000
Bus Check = 0x0030103b
Assists Check = 0x00000000
Assist State = 0x00000000
Path Info = 0x00000000
System Responder Address = 0x000000fff4008040
System Requestor Address = 0xfffffffffffa0000
Floating-Point Registers 0 - 31
00-03 0000001f00000000 0000000000000000 0000000000000000
0000000000000000
04-07 0000000a00000000 0000000000000000 0000000000000000
0000000049ba5e35
08-11 0000000000000000 1059900010544330 0000000000000000
105fbd602fe470c8
12-15 ffffffff00000000 0000000000000000 0000000000000000
0000000000000000
16-19 95380000ffffffff 8008000000000000 0010000000000000
9118000000000000
20-23 8108000000000000 8008000000000000 0000000000000000
0000000100000000
24-27 0000000000000000 0000000000090a6e 0000000000000015
0000000000000000
28-31 ffffffff0000313c 1055f1d010544000 0000000100000228
2fc302001011a234
'9000/785 B,C,J Workstation Unarchitected (per-CPU)', rev 1, 140 bytes:
Check Summary = 0xcb81041008000000
Available Memory = 0x0000000020000000
CPU Diagnose Register 2 = 0x0301000000000004
CPU Status Register 0 = 0x2420c20000000000
CPU Status Register 1 = 0x8002000000000000
SADD LOG = 0x4b023fd9e8190951
Read Short LOG = 0xc1af00fff4008040
ERROR_STATUS = 0x0000000000100010
MEM_ADDR = 0x000001ff3fffffff
MEM_SYND = 0x0000000000000000
MEM_ADDR_CORR = 0x000001ff3fffffff
MEM_SYND_CORR = 0x0000000000000000
RUN_DATA_HIGH = 0xc1bff0fffed08040
RUN_DATA_LOW = 0xc1bff0fffed08040
RUN_CTRL = 0x0000021c00001418
RUN_ADDR = 0xc1bff0fffed08040
System Responder Path = 0x00ffffff0a000c00
HPMC PIM Analysis Information:
Timestamp =
Thu Oct 20 09:05:52 GMT 2011 (20:11:10:20:09:05:52)
'9000/785 B,C,J Workstation HPMC PIM Analysis (per-CPU)', rev 0, 1304 bytes:
A Data I/O Fetch Timeout occurred while CPU 0 was
requesting information from a device at the path 10/0/12/0 (built-in PCI
device).
I/O Module Error Log Information:
Timestamp =
Thu Oct 20 09:05:52 GMT 2011 (20:11:10:20:09:05:52)
'9000/785 B,C,J Workstation IO Error Log', rev 0, 228 bytes:
Rope Word1 Word2 Word3
------ ------------ ------------
0 0x00000000 0x0e0cc2a9 0x00000000fed30048
1 0x00000000 0x1e0cc009 0x00000000fed32048
2 ---------- 0x2e0cc009 ------------------
3 ---------- 0x3e0cc009 ------------------
4 0x00000000 0x4e0cc009 0x00000000fed38048
5 ---------- 0x5e0cc009 ------------------
6 0x00000000 0x6e0cc009 0x00000000fed3c048
7 ---------- 0x7e0cc009 ------------------
Eike
* Re: HPMC running CMake Nightly tests
2011-10-21 8:26 ` Rolf Eike Beer
@ 2011-10-26 16:16 ` Grant Grundler
2011-10-26 17:54 ` HPMC on network load (was: HPMC running CMake Nightly tests) Rolf Eike Beer
0 siblings, 1 reply; 6+ messages in thread
From: Grant Grundler @ 2011-10-26 16:16 UTC
To: Rolf Eike Beer; +Cc: linux-parisc
On Fri, Oct 21, 2011 at 10:26:57AM +0200, Rolf Eike Beer wrote:
> Ok, I have another one. I removed all those parts that did not show any
> errors or where the register contents were all zeros.
>
> Timestamp =
> Thu Oct 20 09:05:52 GMT 2011 (20:11:10:20:09:05:52)
...
> System Responder Address = 0x000000fff4008040
This is the MMIO address that wasn't responding. Note that it's 40 bits wide:
the 32-bit address used by the OS is "F-extended" by the HW (the CPU, I think).
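As a rough illustration of what "F-extended" would mean here, the sketch below widens a 32-bit address to 40 bits by filling the new upper bits with ones. The helper name `f_extend` and the rule that only addresses in the 0xFnnnnnnn range get extended are assumptions made for this example; the thread only states that the hardware does the extension.

```python
def f_extend(addr32: int, width: int = 40) -> int:
    """Widen a 32-bit address to `width` bits, filling the new upper bits
    with ones when the address lies in the 0xFnnnnnnn range (assumed rule)."""
    if not 0 <= addr32 <= 0xFFFFFFFF:
        raise ValueError("expected a 32-bit address")
    if addr32 >> 28 != 0xF:
        return addr32  # left untouched under this sketch's assumption
    upper_ones = ((1 << (width - 32)) - 1) << 32
    return addr32 | upper_ones


# Low 32 bits of the System Responder Address reported in the PIM dump.
print(hex(f_extend(0xF4008040)))
```

With the low 32 bits of the reported responder address as input, the result matches the 40-bit value in the dump, which is consistent with the description above.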
> System Requestor Address = 0xfffffffffffa0000
Address of CPU that was requesting the MMIO address.
This is enough info to identify what I believe is the "victim".
It's not likely to be the root cause.
Historically, this type of HPMC happens because a device
attempted to DMA to an unmapped address and the IOMMU
went "fatal" (stopped routing traffic to PCI busses).
> '9000/785 B,C,J Workstation Unarchitected (per-CPU)', rev 1, 140 bytes:
>
> Check Summary = 0xcb81041008000000
> Available Memory = 0x0000000020000000
> CPU Diagnose Register 2 = 0x0301000000000004
> CPU Status Register 0 = 0x2420c20000000000
> CPU Status Register 1 = 0x8002000000000000
> SADD LOG = 0x4b023fd9e8190951
> Read Short LOG = 0xc1af00fff4008040
> ERROR_STATUS = 0x0000000000100010
> MEM_ADDR = 0x000001ff3fffffff
> MEM_SYND = 0x0000000000000000
> MEM_ADDR_CORR = 0x000001ff3fffffff
> MEM_SYND_CORR = 0x0000000000000000
> RUN_DATA_HIGH = 0xc1bff0fffed08040
> RUN_DATA_LOW = 0xc1bff0fffed08040
> RUN_CTRL = 0x0000021c00001418
> RUN_ADDR = 0xc1bff0fffed08040
> System Responder Path = 0x00ffffff0a000c00
This part could yield another clue if we had the magic decoder ring. :(
> HPMC PIM Analysis Information:
>
> Timestamp =
> Thu Oct 20 09:05:52 GMT 2011 (20:11:10:20:09:05:52)
>
>
> '9000/785 B,C,J Workstation HPMC PIM Analysis (per-CPU)', rev 0, 1304 bytes:
>
> A Data I/O Fetch Timeout occurred while CPU 0 was
> requesting information from a device at the path 10/0/12/0 (built-in PCI
> device).
Doing "in io" at the BCH prompt should list all devices including 10/0/12/0
Google search is failing to find a posting with that content. :/
> '9000/785 B,C,J Workstation IO Error Log', rev 0, 228 bytes:
>
> Rope Word1 Word2 Word3
> ------ ------------ ------------
> 0 0x00000000 0x0e0cc2a9 0x00000000fed30048
> 1 0x00000000 0x1e0cc009 0x00000000fed32048
> 2 ---------- 0x2e0cc009 ------------------
> 3 ---------- 0x3e0cc009 ------------------
> 4 0x00000000 0x4e0cc009 0x00000000fed38048
> 5 ---------- 0x5e0cc009 ------------------
> 6 0x00000000 0x6e0cc009 0x00000000fed3c048
> 7 ---------- 0x7e0cc009 ------------------
"HP c3750 | hp workstation c3700 and c3650 - service handbook" in a
couple of different places says:
"I/O Error log word 3 contains the error address"
I'm assuming this is just the last accessed address by that PCI bus.
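To pull those word 3 values out of a saved PIM dump, a small parser is enough. This is only a sketch: the function name `word3_by_rope` is made up, the row layout (rope number followed by three columns) is taken from the table quoted above, and ropes whose word 3 column is all dashes are skipped.

```python
import re

# Rows copied from the IO Error Log quoted above; ropes that logged
# nothing show dashes in the Word1 and Word3 columns.
IO_ERROR_LOG = """\
0 0x00000000 0x0e0cc2a9 0x00000000fed30048
1 0x00000000 0x1e0cc009 0x00000000fed32048
2 ---------- 0x2e0cc009 ------------------
3 ---------- 0x3e0cc009 ------------------
4 0x00000000 0x4e0cc009 0x00000000fed38048
5 ---------- 0x5e0cc009 ------------------
6 0x00000000 0x6e0cc009 0x00000000fed3c048
7 ---------- 0x7e0cc009 ------------------
"""

# rope number, word 1, word 2, word 3
ROW = re.compile(r"^(\d+)\s+(\S+)\s+(0x[0-9a-fA-F]+)\s+(\S+)$")


def word3_by_rope(log: str) -> dict[int, int]:
    """Map rope number -> word 3 for every rope that logged a value."""
    result = {}
    for line in log.splitlines():
        match = ROW.match(line.strip())
        if match and match.group(4).startswith("0x"):
            result[int(match.group(1))] = int(match.group(4), 16)
    return result


for rope, address in word3_by_rope(IO_ERROR_LOG).items():
    print(f"rope {rope}: {address:#x}")
```

Comparing the output of two dumps taken at different times would show whether these addresses ever change between crashes.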
cheers,
grant
* HPMC on network load (was: HPMC running CMake Nightly tests)
2011-10-26 16:16 ` Grant Grundler
@ 2011-10-26 17:54 ` Rolf Eike Beer
0 siblings, 0 replies; 6+ messages in thread
From: Rolf Eike Beer @ 2011-10-26 17:54 UTC
To: linux-parisc
Grant Grundler wrote:
> > HPMC PIM Analysis Information:
> >
> > Timestamp =
> >
> > Thu Oct 20 09:05:52 GMT 2011 (20:11:10:20:09:05:52)
> >
> > '9000/785 B,C,J Workstation HPMC PIM Analysis (per-CPU)', rev 0, 1304
> > bytes:
> >
> > A Data I/O Fetch Timeout occurred while CPU 0 was
> > requesting information from a device at the path 10/0/12/0 (built-in PCI
> > device).
>
> Doing "in io" at the BCH prompt should list all devices including 10/0/12/0
> Google search is failing to find a posting with that content. :/
IIRC it is the network card.
The last time I saw this was during "emerge --sync", which was hours away from
the nightly CMake run. Since all traces point at the network card I think this
really has nothing to do with CMake or CPU load at all.
Eike