* HPMC running CMake Nightly tests @ 2011-09-27 7:32 Rolf Eike Beer 2011-10-12 4:32 ` Grant Grundler 0 siblings, 1 reply; 6+ messages in thread From: Rolf Eike Beer @ 2011-09-27 7:32 UTC (permalink / raw) To: linux-parisc I'm running the CMake tests every night. This is the second time in a row that my C3600 did not survive this. Since I was warned I connected a serial console. The first things are expected crashes from CMake as detecting a crashed process is part of the tests. I wonder if these shouldn't be silenced as a userspace crash could otherwise too easily be used to flood the logs. do_page_fault() pid=16799 command='kwsysTestProces' type=15 address=0x00000000 YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI PSW: 00000000000001101111111100001111 Not tainted r00-03 0006ff0f 1041b000 00011ee3 fb46f4c0 r04-07 4072adc0 00000000 fb36b02c 0001cd60 r08-11 00000000 000e6e20 00000000 00000000 r12-15 00000000 000e59c0 000e2d24 ffffffff r16-19 000e2d14 000e3e20 00000000 4072adc0 r20-23 1054f020 00000000 406427e4 ffffffff r24-27 fffffff5 ffffffd3 4072cd24 0001b0e4 r28-31 00000000 00000001 fb46f500 4063192b sr00-03 00000030 00000017 00000000 00000030 sr04-07 00000030 00000030 00000030 00000030 VZOUICununcqcqcqcqcqcrmunTDVZOUI FPSR: 00000000000000000000000000000000 FPER1: 00000000 fr00-03 0000000000000000 0000000000000000 0000000000000000 0000000000000000 fr04-07 41d3a02318ce6e4c 2800000000000000 0000000190000000 402e000000000000 fr08-11 000000601be80000 1059900010544330 0000000000000000 105fbd602fc260c8 fr12-15 41d3a01d2d9ab424 0000008110131d00 105b68001055eca8 1055eca810560cf0 fr16-19 0004000f1055f000 10131d0000000003 105b48c0105b48f4 000000371055f1e8 fr20-23 2fc30158105a4800 3b9aca001055f000 000000000000002f 00001c7418c00000 fr24-27 410b865000000000 3fe0000000000000 412e848000000000 1029341c102c3848 fr28-31 ffffffff000032a4 1055f1d010544000 0000000100000228 2fc302001011a264 IASQ: 00000030 00000030 IAOQ: 00011ee7 00011eeb IIR: 0f801280 ISR: 00000030 IOR: 00000000 CPU: 0 CR30: 2ed5c000 
CR31: ffffdffe ORIG_R28: 00000000 IAOQ[0]: 00011ee7 IAOQ[1]: 00011eeb RP(r2): 00011ee3 do_page_fault() pid=16827 command='kwsysTestProces' type=15 address=0x00000000 YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI PSW: 00000000000001101111111100001111 Not tainted r00-03 0006ff0f 1041b000 00011ee3 fb2a74c0 r04-07 4072adc0 00000000 fb61702c 0001cd60 r08-11 00000000 000e6e20 00000000 00000000 r12-15 00000000 000e59c0 000e2d24 ffffffff r16-19 000e2d14 000e3e20 00000000 4072adc0 r20-23 1054f020 00000000 406427e4 ffffffff r24-27 fffffff5 ffffffd3 4072cd24 0001b0e4 r28-31 00000000 00000001 fb2a7500 4063192b sr00-03 00000032 00000017 00000000 00000032 sr04-07 00000032 00000032 00000032 00000032 VZOUICununcqcqcqcqcqcrmunTDVZOUI FPSR: 00000000000000000000000000000000 FPER1: 00000000 fr00-03 0000000000000000 0000000000000000 0000000000000000 0000000000000000 fr04-07 41d3a02319c7bbbf 2800000000000000 0000000190000000 403e000000000000 fr08-11 0000005c3e100000 1059900010544330 0000000000000000 105fbd602fc260c8 fr12-15 41d3a01d2d9ab424 0000008110131d00 105b68001055eca8 1055eca810560cf0 fr16-19 0004000f1055f000 10131d0000000003 105b48c0105b48f4 000000371055f1e8 fr20-23 2fc30158105a4800 3b9aca001055f000 000000000000002f 00001c7419c00000 fr24-27 40fd802000000000 3fe0000000000000 412e848000000000 1029341c102c3848 fr28-31 ffffffff000032a4 1055f1d010544000 0000000100000228 2fc302001011a264 IASQ: 00000032 00000032 IAOQ: 00011ee7 00011eeb IIR: 0f801280 ISR: 00000032 IOR: 00000000 CPU: 0 CR30: 2095c000 CR31: ffffdffe ORIG_R28: 00000000 IAOQ[0]: 00011ee7 IAOQ[1]: 00011eeb RP(r2): 00011ee3 But then the machine got killed: Backtrace: [<1030b9ec>] tulip_get_stats+0x34/0x5c [<1038ac20>] dev_get_stats+0x98/0xe8 [<102946b4>] led_work_func+0x11c/0x310 [<10145204>] process_one_work+0x120/0x3ac [<10147110>] worker_thread+0x174/0x338 [<1014b0b4>] kthread+0x9c/0xa4 [<10102c5c>] ret_from_kernel_thread+0x1c/0x24 High Priority Machine Check (HPMC): Code=1 regs=10551080 (Addr=00000000) YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI 
PSW: 00000000000001001111111100001110 Not tainted r00-03 0004ff0e 105bf000 1030b9ec 2fc72000 r04-07 0000000f 00000000 00000000 00000000 r08-11 2fc72000 105bf600 2fea4208 7f000000 r12-15 2fea4210 105ba000 10544000 2fc2f408 r16-19 1041d1dc f000017c f0000174 2fea4210 r20-23 0099f055 0099f050 1030b9b8 00000000 r24-27 2ff57008 2fea4210 0004a040 10544000 r28-31 0004a040 f68e066d 2fea4400 1038ac20 sr00-03 00000000 00000000 00000000 00000017 sr04-07 00000000 00000000 00000000 00000000 IASQ: 00000000 00000000 IAOQ: 10284394 10284398 IIR: 0f80109c ISR: a627ffd0 IOR: 0204a040 CPU: 0 CR30: 2fea4000 CR31: ffffdffe ORIG_R28: 00000000 IAOQ[0]: ioread32+0xc/0x4c IAOQ[1]: ioread32+0x10/0x4c RP(r2): tulip_get_stats+0x34/0x5c Backtrace: [<1030b9ec>] tulip_get_stats+0x34/0x5c [<1038ac20>] dev_get_stats+0x98/0xe8 [<102946b4>] led_work_func+0x11c/0x310 [<10145204>] process_one_work+0x120/0x3ac [<10147110>] worker_thread+0x174/0x338 [<1014b0b4>] kthread+0x9c/0xa4 [<10102c5c>] ret_from_kernel_thread+0x1c/0x24 Kernel panic - not syncing: High Priority Machine Check (HPMC) Backtrace: [<1010edec>] panic+0x90/0x23c [<101143b8>] parisc_terminate+0xbc/0xd4 [<1011458c>] handle_interruption+0x1bc/0x718 [<10103078>] intr_check_sig+0x0/0x34 [<10284398>] ioread32+0x10/0x4c [<103e8fc0>] bictcp_acked+0x0/0x228 I'm running 3.0.4 with d7dd2ff11b7fcd425aca5a875983c862d19a67ae reverted. Any hints? Eike ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: HPMC running CMake Nightly tests
  2011-09-27  7:32 HPMC running CMake Nightly tests Rolf Eike Beer
@ 2011-10-12  4:32 ` Grant Grundler
  2011-10-17  7:18   ` Rolf Eike Beer
  0 siblings, 1 reply; 6+ messages in thread
From: Grant Grundler @ 2011-10-12  4:32 UTC (permalink / raw)
  To: Rolf Eike Beer; +Cc: linux-parisc

On Tue, Sep 27, 2011 at 09:32:37AM +0200, Rolf Eike Beer wrote:
> I'm running the CMake tests every night. This is the second time in a row
> that my C3600 did not survive this. Since I was warned I connected a
> serial console.
...
> But then the machine got killed:
>
> Backtrace:
>  [<1030b9ec>] tulip_get_stats+0x34/0x5c
>  [<1038ac20>] dev_get_stats+0x98/0xe8
>  [<102946b4>] led_work_func+0x11c/0x310
>  [<10145204>] process_one_work+0x120/0x3ac
>  [<10147110>] worker_thread+0x174/0x338
>  [<1014b0b4>] kthread+0x9c/0xa4
>  [<10102c5c>] ret_from_kernel_thread+0x1c/0x24
>
>
> High Priority Machine Check (HPMC): Code=1 regs=10551080 (Addr=00000000)
>
>      YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
> PSW: 00000000000001001111111100001110 Not tainted
> r00-03  0004ff0e 105bf000 1030b9ec 2fc72000
> r04-07  0000000f 00000000 00000000 00000000
> r08-11  2fc72000 105bf600 2fea4208 7f000000
> r12-15  2fea4210 105ba000 10544000 2fc2f408
> r16-19  1041d1dc f000017c f0000174 2fea4210
> r20-23  0099f055 0099f050 1030b9b8 00000000
> r24-27  2ff57008 2fea4210 0004a040 10544000
> r28-31  0004a040 f68e066d 2fea4400 1038ac20
> sr00-03  00000000 00000000 00000000 00000017
> sr04-07  00000000 00000000 00000000 00000000
>
> IASQ: 00000000 00000000 IAOQ: 10284394 10284398
>  IIR: 0f80109c    ISR: a627ffd0  IOR: 0204a040
>  CPU:        0   CR30: 2fea4000 CR31: ffffdffe
>  ORIG_R28: 00000000
>  IAOQ[0]: ioread32+0xc/0x4c

Usually the HPMC means tulip tried to read something
from MMIO space that didn't respond, and this
resulted in a "Master Abort" (the PCI bus controller
had to abort the transaction). On PCs that's not
fatal, but it is on many RISC architectures.

If you can decode the instruction pointer (ioread32+0x10) to figure out
which register is used to dereference the MMIO address, it would
be obvious what the offending address is - just to confirm the
pointer isn't pointing off into the weeds. It will be one of the
registers that contains a 0xfnnnnnnn address.

It is also possible that something earlier already offended the SBA
("System Bus Adapter": it has the IOMMU and memory controller in it) by
trying to DMA to an unmapped address. The SBA is "fatal" at that point, and
the next MMIO read causes the CPU to recognize the fatal state of the SBA.
Decoding the HPMC (see below) can help determine that.

> IAOQ[1]: ioread32+0x10/0x4c
>  RP(r2): tulip_get_stats+0x34/0x5c
> Backtrace:
>  [<1030b9ec>] tulip_get_stats+0x34/0x5c
>  [<1038ac20>] dev_get_stats+0x98/0xe8
>  [<102946b4>] led_work_func+0x11c/0x310
>  [<10145204>] process_one_work+0x120/0x3ac
>  [<10147110>] worker_thread+0x174/0x338
>  [<1014b0b4>] kthread+0x9c/0xa4
>  [<10102c5c>] ret_from_kernel_thread+0x1c/0x24
>
> Kernel panic - not syncing: High Priority Machine Check (HPMC)
> Backtrace:
>  [<1010edec>] panic+0x90/0x23c
>  [<101143b8>] parisc_terminate+0xbc/0xd4
>  [<1011458c>] handle_interruption+0x1bc/0x718
>  [<10103078>] intr_check_sig+0x0/0x34
>  [<10284398>] ioread32+0x10/0x4c
>  [<103e8fc0>] bictcp_acked+0x0/0x228
>
> I'm running 3.0.4 with d7dd2ff11b7fcd425aca5a875983c862d19a67ae reverted.
>
> Any hints?

Interrupt the boot process and collect the HPMC dump as described:
    http://www.parisc-linux.org/faq/kernelbug-howto.html

The output will include the offending address that the ioread32 was
trying to access, which confirms that the instruction was decoded correctly.
If anyone has access to the magic decoder ring, we might be able to tell more.

cheers,
grant

^ permalink raw reply	[flat|nested] 6+ messages in thread
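The decoding step suggested in the message above (finding which register the faulting instruction dereferences) can be sketched from the IIR value in the dump, 0f80109c. This is only a minimal sketch: it assumes the word follows the PA-RISC short-displacement load layout (6-bit major opcode, 5-bit base register, 5-bit immediate, 2-bit space selector, a 4-bit extension field, and a 5-bit target register). The function name decode_load_short is hypothetical, the immediate is returned raw rather than sign-decoded, and the layout should be checked against the architecture manual or a disassembler before relying on the result.

```python
def decode_load_short(iir: int) -> dict:
    """Split a 32-bit instruction word into the fields of the assumed
    PA-RISC short-displacement load format (major opcode 0x03).

    The field positions are an assumption of this sketch, not something
    stated in the thread.
    """
    if not 0 <= iir <= 0xFFFFFFFF:
        raise ValueError("IIR must be a 32-bit value")
    return {
        "opcode": (iir >> 26) & 0x3F,  # major opcode, top 6 bits
        "base": (iir >> 21) & 0x1F,    # general register used as base address
        "im5": (iir >> 16) & 0x1F,     # raw 5-bit immediate (not sign-decoded)
        "space": (iir >> 14) & 0x3,    # space register selector
        "ext4": (iir >> 6) & 0xF,      # extension field selecting the load width
        "target": iir & 0x1F,          # general register receiving the data
    }


if __name__ == "__main__":
    fields = decode_load_short(0x0F80109C)
    for name, value in fields.items():
        print(f"{name:>6} = {value}")
```

Under that assumed layout, the IIR from the dump gives base register 28 and target register 28. The general register of that number in the dump would then be the one to inspect, keeping in mind that a register which is also the load target may no longer hold the address.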
* Re: HPMC running CMake Nightly tests 2011-10-12 4:32 ` Grant Grundler @ 2011-10-17 7:18 ` Rolf Eike Beer 2011-10-21 8:26 ` Rolf Eike Beer 0 siblings, 1 reply; 6+ messages in thread From: Rolf Eike Beer @ 2011-10-17 7:18 UTC (permalink / raw) To: linux-parisc > On Tue, Sep 27, 2011 at 09:32:37AM +0200, Rolf Eike Beer wrote: >> I'm running the CMake tests every night. This is the second time in a >> row >> that my C3600 did not survive this. Since I was warned I connected a >> serial console. > ... > >> But then the machine got killed: >> >> Backtrace: >> [<1030b9ec>] tulip_get_stats+0x34/0x5c >> [<1038ac20>] dev_get_stats+0x98/0xe8 >> [<102946b4>] led_work_func+0x11c/0x310 >> [<10145204>] process_one_work+0x120/0x3ac >> [<10147110>] worker_thread+0x174/0x338 >> [<1014b0b4>] kthread+0x9c/0xa4 >> [<10102c5c>] ret_from_kernel_thread+0x1c/0x24 >> >> >> High Priority Machine Check (HPMC): Code=1 regs=10551080 (Addr=00000000) >> >> YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI >> PSW: 00000000000001001111111100001110 Not tainted >> r00-03 0004ff0e 105bf000 1030b9ec 2fc72000 >> r04-07 0000000f 00000000 00000000 00000000 >> r08-11 2fc72000 105bf600 2fea4208 7f000000 >> r12-15 2fea4210 105ba000 10544000 2fc2f408 >> r16-19 1041d1dc f000017c f0000174 2fea4210 >> r20-23 0099f055 0099f050 1030b9b8 00000000 >> r24-27 2ff57008 2fea4210 0004a040 10544000 >> r28-31 0004a040 f68e066d 2fea4400 1038ac20 >> sr00-03 00000000 00000000 00000000 00000017 >> sr04-07 00000000 00000000 00000000 00000000 >> >> IASQ: 00000000 00000000 IAOQ: 10284394 10284398 >> IIR: 0f80109c ISR: a627ffd0 IOR: 0204a040 >> CPU: 0 CR30: 2fea4000 CR31: ffffdffe >> ORIG_R28: 00000000 >> IAOQ[0]: ioread32+0xc/0x4c > > Usually the HMPC means tulip tried to read something > from MMIO space that didn't respond and this > resulted in a "Master Abort" (PCI bus controller > had to abort the transaction). On PCs that's not > fatal but is on many RISC architectures. 
> > If you can decode the instruction pointer (ioread32+0x10) to figure out > which register is used to dereference the MMIO address, it would > be obvious what the offending address is - just to confirm the > pointer isn't pointing off into the weeds. It will be one of the > registers that contains a 0xfnnnnnnn address. I will have a look. > Interrupt the boot process and collect the HPMC dump as described: > http://www.parisc-linux.org/faq/kernelbug-howto.html> > > The output will include the offending address that the ioread32 was > trying to access to confirm the instruction was decoded correctly. > If anyone has access to the magic decoder ring, we might be able to tell > more. ----------------- Processor 0 HPMC Information ------------------ Timestamp = Fri Oct 14 12:18:23 GMT 2011 (20:11:10:14:12:18:23) HPMC Chassis Codes = 2cbf0 2500b 2cbfb General Registers 0 - 31 00-03 0000000000000000 00000000105bf000 000000001030bbd4 000000002fc26000 04-07 000000000000000f 0000000000000000 0000000000000000 0000000000000000 08-11 000000002fc26000 00000000105bf600 000000002fc50208 000000007f000000 12-15 000000002fc50210 00000000105ba000 0000000010544000 000000002fc2e628 16-19 000000001041d1dc 00000000f000017c 00000000f0000174 000000002fc50210 20-23 000000000209f184 000000000209f17f 000000001030bba0 0000000000000000 24-27 000000000000f424 000000002fc50210 000000000004a040 0000000010544000 28-31 000000000004a040 0000000000000000 000000002fc50400 000000001038ae40 Control Registers 0 - 31 00-03 0000000000000000 0000000000000000 0000000000000000 0000000000000000 04-07 0000000000000000 0000000000000000 0000000000000000 0000000000000000 08-11 000000000000006e 0000000000000000 00000000000000c0 000000000000003d 12-15 0000000000000000 0000000000000000 0000000000102000 00000000fe000000 16-19 000044dd642070fc 0000000000000000 0000000010284504 000000000f80109c 20-23 00000000a627ffd0 000000000204a040 000000ff0004fc0e 0000000080000000 24-27 0000000000594000 000000011df90000 
00000000fffff5f7 00000000fffffdfe 28-31 00000000fffff7f4 00000000fffff7f6 000000002fc50000 00000000ffffdffe Space Registers 0 - 7 00-03 00000000 00000000 00000000 00000037 04-07 00000000 00000000 00000000 00000000 IIA Space = 0x0000000000000000 IIA Offset = 0x0000000010284508 Check Type = 0x20000000 CPU State = 0x9e000004 Cache Check = 0x00000000 TLB Check = 0x00000000 Bus Check = 0x0030103b Assists Check = 0x00000000 Assist State = 0x00000000 Path Info = 0x00000000 System Responder Address = 0x000000fff4008040 System Requestor Address = 0xfffffffffffa0000 Floating-Point Registers 0 - 31 00-03 0000001f00000000 0000000000000000 0000000000000000 0000000000000000 04-07 41bf636000000000 41bf636000000000 00000002625a0000 0000000000000000 08-11 0000000000000000 1059900010544330 0000000000000000 105fbd602fde70c8 12-15 ffffffffad401040 ffffddb6f5fc38f8 fffffffffdfc38d0 fffffffff5fc3ad0 16-19 ffffff8effffffff ffffffcff5fc3ad0 ffffffb3f1dc38c0 ffffffff21041800 20-23 ffffffffa5401040 fffffffff5fc38d0 0000000000000000 0000000100000000 24-27 0000000000000000 0000000000090a6e 0000000000000015 1029358c102c3a38 28-31 ffffffff0000313d 1055f1d010544000 0000000100000228 2fc302001011a234 '9000/785 B,C,J Workstation Unarchitected (per-CPU)', rev 1, 140 bytes: Check Summary = 0xcb81041008000000 Available Memory = 0x0000000020000000 CPU Diagnose Register 2 = 0x0301000000000004 CPU Status Register 0 = 0x2420c20000000000 CPU Status Register 1 = 0x8002000000000000 SADD LOG = 0x4b023fd9e8190951 Read Short LOG = 0xc1af00fff4008040 ERROR_STATUS = 0x0000000000100010 MEM_ADDR = 0x000001ff3fffffff MEM_SYND = 0x0000000000000000 MEM_ADDR_CORR = 0x000001ff3fffffff MEM_SYND_CORR = 0x0000000000000000 RUN_DATA_HIGH = 0xc1bff0fffed08040 RUN_DATA_LOW = 0xc1bff0fffed08040 RUN_CTRL = 0x0000021c00001418 RUN_ADDR = 0xc1bff0fffed08040 System Responder Path = 0x00ffffff0a000c00 HPMC PIM Analysis Information: Timestamp = Fri Oct 14 12:18:23 GMT 2011 (20:11:10:14:12:18:23) '9000/785 B,C,J Workstation HPMC PIM 
Analysis (per-CPU)', rev 0, 1304 bytes: A Data I/O Fetch Timeout occurred while CPU 0 was requesting information from a device at the path 10/0/12/0 (built-in PCI device). Memory/IO Controller Error Analysis Information: The Memory/IO Controller only observed the Broadcast Error. It did not log any additional information about the HPMC. ----------------- Processor 0 LPMC Information ------------------ Check Type = 0x00000000 I/D Cache Parity Info = 0x00000000 Cache Check = 0x00000000 TLB Check = 0x00000000 Bus Check = 0x00000000 Assists Check = 0x00000000 Assist State = 0x00000000 Path Info = 0x00000000 System Responder Address = 0x0000000000000000 System Requestor Address = 0x0000000000000000 ----------------- Processor 0 TOC Information ------------------- General Registers 0 - 31 00-03 0000000000000000 0000000000000000 0000000000000000 0000000000000000 04-07 0000000000000000 0000000000000000 0000000000000000 0000000000000000 08-11 0000000000000000 0000000000000000 0000000000000000 0000000000000000 12-15 0000000000000000 0000000000000000 0000000000000000 0000000000000000 16-19 0000000000000000 0000000000000000 0000000000000000 0000000000000000 20-23 0000000000000000 0000000000000000 0000000000000000 0000000000000000 24-27 0000000000000000 0000000000000000 0000000000000000 0000000000000000 28-31 0000000000000000 0000000000000000 0000000000000000 0000000000000000 Control Registers 0 - 31 00-03 0000000000000000 0000000000000000 0000000000000000 0000000000000000 04-07 0000000000000000 0000000000000000 0000000000000000 0000000000000000 08-11 0000000000000000 0000000000000000 0000000000000000 0000000000000000 12-15 0000000000000000 0000000000000000 0000000000000000 0000000000000000 16-19 0000000000000000 0000000000000000 0000000000000000 0000000000000000 20-23 0000000000000000 0000000000000000 0000000000000000 0000000000000000 24-27 0000000000000000 0000000000000000 0000000000000000 0000000000000000 28-31 0000000000000000 0000000000000000 0000000000000000 
0000000000000000 Space Registers 0 - 7 00-03 00000000 00000000 00000000 00000000 04-07 00000000 00000000 00000000 00000000 IIA Space = 0x0000000000000000 IIA Offset = 0x0000000000000000 CPU State = 0x00000000 I/O Module Error Log Information: Timestamp = Fri Oct 14 12:18:23 GMT 2011 (20:11:10:14:12:18:23) '9000/785 B,C,J Workstation IO Error Log', rev 0, 228 bytes: Rope Word1 Word2 Word3 ------ ------------ ------------ 0 0x00000000 0x0e0cc2a9 0x00000000fed30048 1 0x00000000 0x1e0cc009 0x00000000fed32048 2 ---------- 0x2e0cc009 ------------------ 3 ---------- 0x3e0cc009 ------------------ 4 0x00000000 0x4e0cc009 0x00000000fed38048 5 ---------- 0x5e0cc009 ------------------ 6 0x00000000 0x6e0cc009 0x00000000fed3c048 7 ---------- 0x7e0cc009 ------------------ Eike ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: HPMC running CMake Nightly tests 2011-10-17 7:18 ` Rolf Eike Beer @ 2011-10-21 8:26 ` Rolf Eike Beer 2011-10-26 16:16 ` Grant Grundler 0 siblings, 1 reply; 6+ messages in thread From: Rolf Eike Beer @ 2011-10-21 8:26 UTC (permalink / raw) To: linux-parisc >> On Tue, Sep 27, 2011 at 09:32:37AM +0200, Rolf Eike Beer wrote: >>> I'm running the CMake tests every night. This is the second time in a >>> row >>> that my C3600 did not survive this. Since I was warned I connected a >>> serial console. >> ... >> >>> But then the machine got killed: >>> >>> Backtrace: >>> [<1030b9ec>] tulip_get_stats+0x34/0x5c >>> [<1038ac20>] dev_get_stats+0x98/0xe8 >>> [<102946b4>] led_work_func+0x11c/0x310 >>> [<10145204>] process_one_work+0x120/0x3ac >>> [<10147110>] worker_thread+0x174/0x338 >>> [<1014b0b4>] kthread+0x9c/0xa4 >>> [<10102c5c>] ret_from_kernel_thread+0x1c/0x24 >>> >>> >>> High Priority Machine Check (HPMC): Code=1 regs=10551080 >>> (Addr=00000000) >>> >>> YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI >>> PSW: 00000000000001001111111100001110 Not tainted >>> r00-03 0004ff0e 105bf000 1030b9ec 2fc72000 >>> r04-07 0000000f 00000000 00000000 00000000 >>> r08-11 2fc72000 105bf600 2fea4208 7f000000 >>> r12-15 2fea4210 105ba000 10544000 2fc2f408 >>> r16-19 1041d1dc f000017c f0000174 2fea4210 >>> r20-23 0099f055 0099f050 1030b9b8 00000000 >>> r24-27 2ff57008 2fea4210 0004a040 10544000 >>> r28-31 0004a040 f68e066d 2fea4400 1038ac20 >>> sr00-03 00000000 00000000 00000000 00000017 >>> sr04-07 00000000 00000000 00000000 00000000 >>> >>> IASQ: 00000000 00000000 IAOQ: 10284394 10284398 >>> IIR: 0f80109c ISR: a627ffd0 IOR: 0204a040 >>> CPU: 0 CR30: 2fea4000 CR31: ffffdffe >>> ORIG_R28: 00000000 >>> IAOQ[0]: ioread32+0xc/0x4c >> >> Usually the HMPC means tulip tried to read something >> from MMIO space that didn't respond and this >> resulted in a "Master Abort" (PCI bus controller >> had to abort the transaction). On PCs that's not >> fatal but is on many RISC architectures. 
>> >> If you can decode the instruction pointer (ioread32+0x10) to figure out >> which register is used to dereference the MMIO address, it would >> be obvious what the offending address is - just to confirm the >> pointer isn't pointing off into the weeds. It will be one of the >> registers that contains a 0xfnnnnnnn address. > > I will have a look. > >> Interrupt the boot process and collect the HPMC dump as described: >> http://www.parisc-linux.org/faq/kernelbug-howto.html> >> >> The output will include the offending address that the ioread32 was >> trying to access to confirm the instruction was decoded correctly. >> If anyone has access to the magic decoder ring, we might be able to tell >> more. Ok, I have another one. I removed all those parts that did not show any errors or where the register contents were all zeros. Timestamp = Thu Oct 20 09:05:52 GMT 2011 (20:11:10:20:09:05:52) HPMC Chassis Codes = 2cbf0 2500b 2cbfb General Registers 0 - 31 00-03 0000000000000000 00000000105bf000 000000001030bbd4 000000002fe46000 04-07 000000000000000f 0000000000000000 0000000000000000 0000000000000008 08-11 000000002fe46000 00000000105bf600 000000002fec8208 000000007f000000 12-15 000000002fec8210 00000000105ba000 0000000010544000 000000002fc2f408 16-19 000000001041d1dc 00000000f000017c 00000000f0000174 000000002fec8210 20-23 000000000108ce00 000000000108cdf3 000000001030bba0 0000000000000000 24-27 000000000000f424 000000002fec8210 000000000004a040 0000000010544000 28-31 000000000004a040 0000000000000000 000000002fec8400 000000001038ae40 Control Registers 0 - 31 00-03 0000000000000000 0000000000000000 0000000000000000 0000000000000000 04-07 0000000000000000 0000000000000000 0000000000000000 0000000000000000 08-11 000000000000004e 0000000000000000 00000000000000c0 000000000000003d 12-15 0000000000000000 0000000000000000 0000000000102000 00000000fe000000 16-19 0000230bfe918584 0000000000000000 0000000010284504 000000000f80109c 20-23 00000000a627ffd0 000000000204a040 
000000ff0006fc0e 0000000080000000 24-27 0000000000594000 000000011ec4a000 00000000ffffffff 00000000ffffffff 28-31 00000000ffffffff 00000000ffffffff 000000002fec8000 00000000ffffffff Space Registers 0 - 7 00-03 00000000 00000000 00000000 00000027 04-07 00000000 00000000 00000000 00000000 IIA Space = 0x0000000000000000 IIA Offset = 0x0000000010284508 Check Type = 0x20000000 CPU State = 0x9e000004 Cache Check = 0x00000000 TLB Check = 0x00000000 Bus Check = 0x0030103b Assists Check = 0x00000000 Assist State = 0x00000000 Path Info = 0x00000000 System Responder Address = 0x000000fff4008040 System Requestor Address = 0xfffffffffffa0000 Floating-Point Registers 0 - 31 00-03 0000001f00000000 0000000000000000 0000000000000000 0000000000000000 04-07 0000000a00000000 0000000000000000 0000000000000000 0000000049ba5e35 08-11 0000000000000000 1059900010544330 0000000000000000 105fbd602fe470c8 12-15 ffffffff00000000 0000000000000000 0000000000000000 0000000000000000 16-19 95380000ffffffff 8008000000000000 0010000000000000 9118000000000000 20-23 8108000000000000 8008000000000000 0000000000000000 0000000100000000 24-27 0000000000000000 0000000000090a6e 0000000000000015 0000000000000000 28-31 ffffffff0000313c 1055f1d010544000 0000000100000228 2fc302001011a234 '9000/785 B,C,J Workstation Unarchitected (per-CPU)', rev 1, 140 bytes: Check Summary = 0xcb81041008000000 Available Memory = 0x0000000020000000 CPU Diagnose Register 2 = 0x0301000000000004 CPU Status Register 0 = 0x2420c20000000000 CPU Status Register 1 = 0x8002000000000000 SADD LOG = 0x4b023fd9e8190951 Read Short LOG = 0xc1af00fff4008040 ERROR_STATUS = 0x0000000000100010 MEM_ADDR = 0x000001ff3fffffff MEM_SYND = 0x0000000000000000 MEM_ADDR_CORR = 0x000001ff3fffffff MEM_SYND_CORR = 0x0000000000000000 RUN_DATA_HIGH = 0xc1bff0fffed08040 RUN_DATA_LOW = 0xc1bff0fffed08040 RUN_CTRL = 0x0000021c00001418 RUN_ADDR = 0xc1bff0fffed08040 System Responder Path = 0x00ffffff0a000c00 HPMC PIM Analysis Information: Timestamp = Thu Oct 20 
09:05:52 GMT 2011 (20:11:10:20:09:05:52) '9000/785 B,C,J Workstation HPMC PIM Analysis (per-CPU)', rev 0, 1304 bytes: A Data I/O Fetch Timeout occurred while CPU 0 was requesting information from a device at the path 10/0/12/0 (built-in PCI device). I/O Module Error Log Information: Timestamp = Thu Oct 20 09:05:52 GMT 2011 (20:11:10:20:09:05:52) '9000/785 B,C,J Workstation IO Error Log', rev 0, 228 bytes: Rope Word1 Word2 Word3 ------ ------------ ------------ 0 0x00000000 0x0e0cc2a9 0x00000000fed30048 1 0x00000000 0x1e0cc009 0x00000000fed32048 2 ---------- 0x2e0cc009 ------------------ 3 ---------- 0x3e0cc009 ------------------ 4 0x00000000 0x4e0cc009 0x00000000fed38048 5 ---------- 0x5e0cc009 ------------------ 6 0x00000000 0x6e0cc009 0x00000000fed3c048 7 ---------- 0x7e0cc009 ------------------ Eike ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: HPMC running CMake Nightly tests
  2011-10-21  8:26     ` Rolf Eike Beer
@ 2011-10-26 16:16       ` Grant Grundler
  2011-10-26 17:54         ` HPMC on network load (was: HPMC running CMake Nightly tests) Rolf Eike Beer
  0 siblings, 1 reply; 6+ messages in thread
From: Grant Grundler @ 2011-10-26 16:16 UTC (permalink / raw)
  To: Rolf Eike Beer; +Cc: linux-parisc

On Fri, Oct 21, 2011 at 10:26:57AM +0200, Rolf Eike Beer wrote:
> Ok, I have another one. I removed all those parts that did not show any
> errors or where the register contents were all zeros.
>
> Timestamp =
>   Thu Oct 20 09:05:52 GMT 2011 (20:11:10:20:09:05:52)
...
> System Responder Address         = 0x000000fff4008040

This is the MMIO address that wasn't responding. Note that it's 40 bits:
the 32-bit address used by the OS is "F-extended" by HW (the CPU, I think).

> System Requestor Address         = 0xfffffffffffa0000

This is the address of the CPU that was requesting the MMIO address.

This is enough info to identify what I believe is the "victim".
It's not likely to be the root cause. Historically, this type of
HPMC happens because a device attempted to DMA to an unmapped address
and the IOMMU went "fatal" (stopped routing traffic to PCI busses).

> '9000/785 B,C,J Workstation Unarchitected (per-CPU)', rev 1, 140 bytes:
>
> Check Summary                    = 0xcb81041008000000
> Available Memory                 = 0x0000000020000000
> CPU Diagnose Register 2          = 0x0301000000000004
> CPU Status Register 0            = 0x2420c20000000000
> CPU Status Register 1            = 0x8002000000000000
> SADD LOG                         = 0x4b023fd9e8190951
> Read Short LOG                   = 0xc1af00fff4008040
> ERROR_STATUS                     = 0x0000000000100010
> MEM_ADDR                         = 0x000001ff3fffffff
> MEM_SYND                         = 0x0000000000000000
> MEM_ADDR_CORR                    = 0x000001ff3fffffff
> MEM_SYND_CORR                    = 0x0000000000000000
> RUN_DATA_HIGH                    = 0xc1bff0fffed08040
> RUN_DATA_LOW                     = 0xc1bff0fffed08040
> RUN_CTRL                         = 0x0000021c00001418
> RUN_ADDR                         = 0xc1bff0fffed08040
> System Responder Path            = 0x00ffffff0a000c00

This part could yield another clue if we had the magic decoder ring. :(

> HPMC PIM Analysis Information:
>
> Timestamp =
>   Thu Oct 20 09:05:52 GMT 2011 (20:11:10:20:09:05:52)
>
>
> '9000/785 B,C,J Workstation HPMC PIM Analysis (per-CPU)', rev 0, 1304 bytes:
>
> A Data I/O Fetch Timeout occurred while CPU 0 was
> requesting information from a device at the path 10/0/12/0 (built-in PCI
> device).

Doing "in io" at the BCH prompt should list all devices, including 10/0/12/0.
A Google search is failing to find a posting with that content. :/

> '9000/785 B,C,J Workstation IO Error Log', rev 0, 228 bytes:
>
> Rope  Word1       Word2       Word3
> ----  ----------  ----------  ------------------
>  0    0x00000000  0x0e0cc2a9  0x00000000fed30048
>  1    0x00000000  0x1e0cc009  0x00000000fed32048
>  2    ----------  0x2e0cc009  ------------------
>  3    ----------  0x3e0cc009  ------------------
>  4    0x00000000  0x4e0cc009  0x00000000fed38048
>  5    ----------  0x5e0cc009  ------------------
>  6    0x00000000  0x6e0cc009  0x00000000fed3c048
>  7    ----------  0x7e0cc009  ------------------

"HP c3750 | hp workstation c3700 and c3650 - service handbook" says, in
a couple of different places:
    "I/O Error log word 3 contains the error address"

I'm assuming this is just the last address accessed by that PCI bus.

cheers,
grant

^ permalink raw reply	[flat|nested] 6+ messages in thread
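The "F-extension" mentioned in the message above can be illustrated with the System Responder Address from the dump. This is a minimal sketch under stated assumptions: the rule that only 32-bit addresses whose top nibble is 0xF get their upper bits filled with ones, the 40-bit default width, and the helper names f_extend and low32 are all illustrative and not taken from the thread or from any kernel API.

```python
def f_extend(addr32: int, phys_bits: int = 40) -> int:
    """Widen a 32-bit address to phys_bits by filling the upper bits
    with ones when the top nibble is 0xF (assumed rule for I/O space)."""
    if not 0 <= addr32 <= 0xFFFFFFFF:
        raise ValueError("expected a 32-bit address")
    if phys_bits < 32:
        raise ValueError("phys_bits must be at least 32")
    if (addr32 >> 28) == 0xF:
        upper_ones = ((1 << (phys_bits - 32)) - 1) << 32
        return upper_ones | addr32
    return addr32


def low32(addr: int) -> int:
    """Recover the 32-bit address the OS would have used."""
    return addr & 0xFFFFFFFF


if __name__ == "__main__":
    responder = 0x000000FFF4008040  # System Responder Address from the dump
    os_addr = low32(responder)
    print(hex(os_addr))             # 32-bit view of the MMIO address
    print(hex(f_extend(os_addr)))   # widened back to 40 bits
```

With these assumptions, the low 32 bits of the reported responder address are 0xf4008040, and widening that value to 40 bits gives back 0xfff4008040, matching the dump.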
* HPMC on network load (was: HPMC running CMake Nightly tests)
  2011-10-26 16:16       ` Grant Grundler
@ 2011-10-26 17:54         ` Rolf Eike Beer
  0 siblings, 0 replies; 6+ messages in thread
From: Rolf Eike Beer @ 2011-10-26 17:54 UTC (permalink / raw)
  To: linux-parisc

Grant Grundler wrote:

> > HPMC PIM Analysis Information:
> >
> > Timestamp =
> >
> >   Thu Oct 20 09:05:52 GMT 2011 (20:11:10:20:09:05:52)
> >
> > '9000/785 B,C,J Workstation HPMC PIM Analysis (per-CPU)', rev 0, 1304
> > bytes:
> >
> > A Data I/O Fetch Timeout occurred while CPU 0 was
> > requesting information from a device at the path 10/0/12/0 (built-in PCI
> > device).
>
> Doing "in io" at the BCH prompt should list all devices including 10/0/12/0
> Google search is failing to find a posting with that content. :/

IIRC it is the network card. The last time I saw this was during "emerge
--sync", which was hours away from the nightly CMake run. Since all traces
point at the network card, I think this really has nothing to do with CMake
or CPU load at all.

Eike

^ permalink raw reply	[flat|nested] 6+ messages in thread