Linux PARISC architecture development
* Re: [parisc-linux] N Class SMP pb ?  (follow up)
@ 2003-09-26 15:46 Joel Soete
  2003-09-26 16:08 ` Joel Soete
  2003-09-26 16:50 ` Grant Grundler
  0 siblings, 2 replies; 11+ messages in thread
From: Joel Soete @ 2003-09-26 15:46 UTC
  To: Grant Grundler; +Cc: parisc-linux

>Yes - 6 is ITLB miss and 15 is Data TLB miss.
...
>
>> handle_interruption(26, ...).
>
>26 is "Data Memory Access rights Trap".
>This sounds normal for Copy-On-Write.

Yes. To be sure, I just logged on to a b2k running the same kernel (except
for pdc support, but I already verified that it makes no difference to the
SMP crash on the N), and indeed it is normal to see many 6, 15 and 26
interruptions.
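
For reference, the interruption codes named in this thread can be kept in a
small lookup table. This is only a sketch: the table is deliberately limited
to the codes mentioned here (1, 6, 10, 15, 26), and the names trap_names[]
and trap_code_name() are made up for illustration, not taken from the kernel
source.

```c
#include <stddef.h>

/* One entry per interruption code discussed in this thread. */
struct trap_name {
    int code;
    const char *name;
};

static const struct trap_name trap_names[] = {
    {  1, "High-priority machine check (HPMC)" },
    {  6, "Instruction TLB miss fault" },
    { 10, "Privileged operation trap" },
    { 15, "Data TLB miss fault" },
    { 26, "Data memory access rights trap" },
};

/* Hypothetical helper: map a handle_interruption() code to a name. */
static const char *trap_code_name(int code)
{
    for (size_t i = 0; i < sizeof(trap_names) / sizeof(trap_names[0]); i++) {
        if (trap_names[i].code == code)
            return trap_names[i].name;
    }
    return "unknown (not mentioned in this thread)";
}
```

A debugging printk could then print trap_code_name(code) next to the raw
number, which makes a log full of 6, 15 and 26 easier to scan.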

>> SMP CALL FUNCTION TIMED OUT (CPU=1)
>
>The IPI handler will time out if the other CPU doesn't ack
>the function call within a second. This is bad.

On the contrary, this is the best message I have ever got to start an
analysis of this crash :))

>It means either other CPU never got the interrupt (locked up
>with I-bit off) or the "unstarted_count" isn't coherent between the CPUs.

hmm how could I verify this hypothesis?

>>
>> Could this be a pb with sync between cpu time ref?
>> (because timeout = jiffies + HZ)
>
>I don't think so since jiffies is a global.
>And it's always be measured on the same CPU.
Ok
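
To make the timeout logic discussed above concrete, here is a minimal
user-space sketch of "wait until every other CPU has acked, or give up after
jiffies + HZ". It is not the kernel's smp_call_function() code: struct sim,
tick() and wait_for_ack() are hypothetical names, HZ = 100 is an assumed tick
rate, and the second CPU is simulated by a flag. Only "unstarted_count" and
"timeout = jiffies + HZ" come from the mails.

```c
#include <stdbool.h>

#define HZ 100UL  /* assumed tick rate, for illustration only */

/* Wraparound-safe "a is before b" comparison on tick counters. */
static bool time_before(unsigned long a, unsigned long b)
{
    return (long)(a - b) < 0;
}

struct sim {
    unsigned long jiffies;   /* single global tick counter */
    int unstarted_count;     /* CPUs that have not acked the call yet */
    unsigned long ack_at;    /* tick at which the simulated CPU acks */
    bool ack_enabled;        /* false models a CPU stuck with the I-bit off */
};

/* Advance time by one tick and let the simulated CPU ack if it can. */
static void tick(struct sim *s)
{
    s->jiffies++;
    if (s->ack_enabled && s->jiffies == s->ack_at && s->unstarted_count > 0)
        s->unstarted_count--;
}

/* Returns 0 when all CPUs acked, -1 when the one-second timeout expired. */
static int wait_for_ack(struct sim *s)
{
    unsigned long timeout = s->jiffies + HZ;

    while (s->unstarted_count > 0 && time_before(s->jiffies, timeout))
        tick(s);

    return s->unstarted_count > 0 ? -1 : 0;
}
```

Because both the deadline and the comparison use the same counter, the sketch
only times out when the ack never arrives (or the count is never seen to
drop), which matches the two explanations given above.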
>
>> I have also a look for where this function is called but never see its
>> return code tested to launch a 'stack dump' and a stop of system?
>
>You need to find out who is using smp_call_function() and which function
>they are trying to invoke. I suspect it's coming from mm/slab.c but
>wouldn't know which of the three it might be.

Indeed, I don't find any other place where it is called. So I finally added
a printk in each function calling smp_call_function_all_cpus().

That allowed me to notice several calls to kmem_tune_cpucache() (7 exactly,
and no others), but I no longer get any 'SMP CALL FUNCTION TIMED OUT (CPU=1)'
:(
(I presume that, as previously, the system crashes before having the
opportunity to flush its buffer?)

What do you think?

Thanks a lot for help,
    Joel




-------------------------------------------------------------------------
L'Internet rapide, c'est pour tout le monde. Tiscali ADSL, 19,50 Euro
pendant 3 mois! http://reg.tiscali.be/default.asp?lg=fr 

* Re: [parisc-linux] N Class SMP pb ?  (follow up)
@ 2003-10-01  6:48 Joel Soete
  2003-10-01 17:20 ` Joel Soete
  0 siblings, 1 reply; 11+ messages in thread
From: Joel Soete @ 2003-10-01  6:48 UTC
  To: Grant Grundler; +Cc: parisc-linux

>>
>> In summary:
>> -------  Processor 1 HPMC Information - PDC Version: 41.28  ------
>
>Did you TOC the machine or did it HPMC?
>I was under the impression the SW had hung and one needed to TOC
>to regain control. TOC info is separate from HPMC info.

Exactly, but the TOC info only contains 0, so I supposed that the system
actually does an HPMC. But it does not seem to be handled by
handle_interruption(), since at its beginning I put a printk() which was
supposed to write the 'code' value?

to be more accurate:

[...]
    struct siginfo si;

    printk(KERN_ERR "%s(%d, ...).\n", __FUNCTION__, code);
    mdelay(100);
[...]

which allows me to read a lot of 6, 15 and 26 codes, but never 1?

>
>If it's in fact HPMC, then look at IOAQ/GR02 for both CPUs
>and see which functions they were executing in when HPMC occurred.

which were for cpu[1]:
GR[02] == rp = 000000001014dbf0

Func: zap_page_range, Off: 0xe0, Addr: 0x1014dbf0

    1014dbf0:  08 0e 02 5b   copy r14,dp
    1014dbf4:  03 c0 08 b4   mfctl tr6,r20
    1014dbf8:  4a 93 00 b0   ldw 58(r20),r19
    1014dbfc:  29 c5 20 00   addil b000,r14,%r1

[...]
Parse IAOQ = 0x000000001014dea0 for CPU[1]

Func: zap_page_range, Off: 0x390, Addr: 0x1014dea0

    1014dea0:  06 a0 52 00   pdtlb r0(sr1,r21)
    1014dea4:  37 39 3f ff   ldo -1(r25),r25
    1014dea8:  bf 33 3f e5   cmpb,*<> r19,r25,1014dea0 <zap_page_range+0x390>
    1014deac:  36 b5 20 00   ldo 1000(r21),r21

And for cpu[3]:
GR[02] == rp = 000000001010cdd0

Func: handle_interruption, Off: 0xb0, Addr: 0x1010cdd0

    1010cdd0:  08 05 02 5b   copy r5,dp
    1010cdd4:  02 00 08 b4   mfctl itmr,r20
    1010cdd8:  02 00 08 b3   mfctl itmr,r19
    1010cddc:  0a 93 04 33   sub r19,r20,r19
  ...

Parse IAOQ = 0x000000001010cde4 for CPU[3]

Func: handle_interruption, Off: 0xc4, Addr: 0x1010cde4

    1010cde0:  be 7c bf e5   cmpb,*>> ret0,r19,1010cdd8 <handle_interruption+0xb8>
    1010cde4:  08 00 02 40   nop
    1010cde8:  34 63 3f ff   ldo -1(r3),r3
    1010cdec:  ec 7f bf c5   cmpib,*<> -1,r3,1010cdd4 <handle_interruption+0xb4>
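
The "Func / Off / Addr" lines above come from resolving an address against a
symbol table. A minimal sketch of that lookup follows. The start addresses
are not quoted anywhere in the dump; they are derived here as Addr - Off
(for example 0x1014dbf0 - 0xe0 = 0x1014db10), and struct ksym and
ksym_lookup() are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

struct ksym {
    uint64_t start;     /* first address of the symbol */
    const char *name;
};

/* Sorted by start address; values derived from the dump as Addr - Off. */
static const struct ksym ksyms[] = {
    { 0x1010cd20, "handle_interruption" },
    { 0x1014db10, "zap_page_range" },
    { 0x105389c0, "__gp" },
};

/*
 * Return the last symbol starting at or below addr and store the offset
 * into *off. Without symbol sizes, any address above the last entry is
 * attributed to that entry, so treat large offsets with suspicion.
 */
static const struct ksym *ksym_lookup(uint64_t addr, uint64_t *off)
{
    const struct ksym *best = NULL;

    for (size_t i = 0; i < sizeof(ksyms) / sizeof(ksyms[0]); i++) {
        if (ksyms[i].start <= addr)
            best = &ksyms[i];
    }
    if (best != NULL && off != NULL)
        *off = addr - best->start;
    return best;
}
```

An address below the first symbol, such as the 0x8000 found in GR[31],
yields NULL, which corresponds to the "Not parsable address!" case.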


Am I wrong if I presume that the nop insn would be harmless on cpu[3], unlike
'pdtlb r0(sr1,r21)'? But I do not see any code 10 printed by printk(), and
that is the only exception it could raise: Privileged operation trap.

Thanks again,
    Joel




* Re: [parisc-linux] N Class SMP pb ?  (follow up)
@ 2003-09-30 16:31 Joel Soete
  2003-09-30 18:50 ` Grant Grundler
  0 siblings, 1 reply; 11+ messages in thread
From: Joel Soete @ 2003-09-30 16:31 UTC
  To: Grant Grundler; +Cc: parisc-linux

Hi Grant,

Here is the very last test I did yesterday with the additional mdelay(100):

>TOC the machine, "ser pim" and look at PSW in TOC Info for each CPU.
>bit 0 is the I-Bit IIRC.

In summary:
-------  Processor 1 HPMC Information - PDC Version: 41.28  ------
[...]
CPU State                    = 0x9e000004
[...]
CPU Diagnose Register 2      = 0x0301010800802004
CPU Status Register 0        = 0x2640c24000000000
CPU Status Register 1        = 0x8000200000000000
[...]
-------  Processor 3 HPMC Information - PDC Version: 41.28  ------
[...]
CPU State                    = 0x9e000004
[...]
CPU Diagnose Register 2      = 0x0301030800802004
CPU Status Register 0        = 0x3640c24000000000
CPU Status Register 1        = 0x80000000
0000000
[...]

All I bits (well, the lowest-weight PSW bit :) ) are indeed 0.
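
The check itself is a one-bit test. A minimal sketch, assuming (as said
above) that the I-bit is the least significant bit of the PSW word, i.e.
mask 0x1. PSW_I_MASK and psw_i_bit_set() are illustrative names, and the mask
should be confirmed against the architecture manual before relying on it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption from this thread: the I-bit is the lowest-weight PSW bit. */
#define PSW_I_MASK 0x1ULL

/* True when external interrupts are enabled in the given PSW value. */
static bool psw_i_bit_set(uint64_t psw)
{
    return (psw & PSW_I_MASK) != 0;
}
```

Any PSW value whose last hex digit is even has the bit clear, and any value
whose last hex digit is odd has it set.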


>Could be.
>Add mdelay(100) (or higher) after the lines of output you've added.
>This works if it's a functional problem that's not timing dependent.

Well, after a very long boot time the system finally crashes without any
panic reason??? (Shouldn't all interruptions be handled by handle_interruption?)

Just in case, here is a short PIM analysis:
-------  Processor 1 HPMC Information - PDC Version: 41.28  ------ 

GR of CPU[1]
00-03  0000000000000000  000000001041b018  000000001014dbf0  0000000000000000
04-07  0000000000008000  000000008d113c00  0000000040200000  0000000000008000
08-11  0000000000000000  000000008d2cd008  0000000080000000  00000000103fa2c8
12-15  0000000040180000  000000008d9a6280  00000000105389c0  0000000000000000
16-19  000000001045cf88  00000000103b6338  000000008d147010  ffffffffffffffff
20-23  00000000000001ff  0000000040178000  000000008d9a6280  0000000000088000
24-27  0000000040180000  0000000000000006  0000000040180000  00000000105389c0
28-31  0000000000000000  000000008d7ccef0  000000008d7ccf40  0000000000008000
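
Rows in this "NN-MM  value value value value" layout are easy to read back
into an array when the dump has to be post-processed. A minimal sketch,
assuming four hexadecimal values per row as shown here; parse_pim_row() is a
hypothetical helper, not part of any existing PIM tool.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Parse one "NN-MM  v0  v1  v2  v3" row and store the values in
 * regs[NN..MM]. Returns 4 on success, -1 if the line is not a
 * register row or the indices do not fit in regs[0..nregs-1].
 */
static int parse_pim_row(const char *line, uint64_t *regs, int nregs)
{
    int first, last;
    unsigned long long v[4];

    /* %d reads "08" as decimal 8; %llx reads the hex words. */
    if (sscanf(line, "%d-%d %llx %llx %llx %llx",
               &first, &last, &v[0], &v[1], &v[2], &v[3]) != 6)
        return -1;
    if (first < 0 || last != first + 3 || last >= nregs)
        return -1;

    for (int i = 0; i < 4; i++)
        regs[first + i] = (uint64_t)v[i];
    return 4;
}
```

Lines such as "Need much more work !!!" or the "GR[02] == rp = ..."
annotations do not match the pattern and are simply rejected with -1.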

GR[02] == rp = 000000001014dbf0

Func: zap_page_range, Off: 0xe0, Addr: 0x1014dbf0

    1014dbf0:	08 0e 02 5b 	copy r14,dp
    1014dbf4:	03 c0 08 b4 	mfctl tr6,r20
    1014dbf8:	4a 93 00 b0 	ldw 58(r20),r19
    1014dbfc:	29 c5 20 00 	addil b000,r14,%r1

GR[22] == t1(32bits) == arg4(64bits) = 000000008d9a6280

GR[21] == t2(32bits) == arg5(64bits) = 0000000040178000

GR[20] == t3(32bits) == arg6(64bits) = 00000000000001ff

GR[19] == t4(32bits) == arg7(64bits) = ffffffffffffffff

GR[26] == arg0 = 0000000040180000

GR[25] == arg1 = 0000000000000006

GR[24] == arg2 = 0000000040180000

GR[23] == arg3 = 0000000000088000

GR[27] == dp = 00000000105389c0

Func: __gp, Off: 0x0, Addr: 0x105389c0


GR[28] == ret0 = 0000000000000000

GR[29] == ret1 or sl = 000000008d7ccef0

GR[30] == sp = 000000008d7ccf40

GR[31] == ble rp = 0000000000008000
	Not parsable address!

CR of CPU[1]
00-03  0000000000000000  0000000000000000  0000000000000000  0000000000000000
04-07  0000000000000000  0000000000000000  0000000000000000  0000000000000000
08-11  00000000000002b2  0000000000000000  00000000000000c0  0000000000000003
12-15  0000000000000000  0000000000000000  0000000000107000  ffe0000000000000
16-19  000003182e3e3f89  0000000000000000  000000001014deac  0000000036b52000
20-23  00000000103401f5  00000000f33ccdd8  000000ff080ef70f  8000000000000000
24-27  0000000000461000  000000007d147000  0000000000041020  000000ffff95c810
28-31  5555555555555555  5555555555555555  000000008d7cc000  00000000105a0000

CR[00] == rctr = 0000000000000000

CR[08] == (Protection ID) pidr1 = 00000000000002b2

CR[10] == ccr = 00000000000000c0

CR[11] == sar = 0000000000000003

CR[14] == iva = 0000000000107000

CR[15] == eiem = ffe0000000000000

CR[16] == itmr = 000003182e3e3f89

CR[17] == pcsq = 0000000000000000

CR[18] == pcoq = 000000001014deac

CR[19] == iir = 0000000036b52000

CR[20] == isr = 00000000103401f5

CR[21] == ior = 00000000f33ccdd8

CR[22] == ipsw = 000000ff080ef70f

CR[23] == eirw = 8000000000000000

CR[24] == tr0 (ptov) = 0000000000461000

CR[25] == tr1 (vtop) = 000000007d147000

CR[26] == tr2 = 0000000000041020

CR[27] == tr3 = 000000ffff95c810

CR[28] == tr4 = 5555555555555555

CR[29] == tr5 = 5555555555555555

CR[30] == tr6 = 000000008d7cc000

CR[31] == tr7 = 00000000105a0000

SR of CPU[1]
00-03  0000ac80          0000ac80          00000000          0000ac80
04-07  00000000          00000000          00000000          00000000
Need much more work !!!

SR[00] == ts0 = 0000ac80

SR[01] == ts1 = 0000ac80

SR[03] == cpp = 0000ac80
	Not parsable address!
...
IIA Offset (back entry)      = 0x000000001014dea0
...

e.g. IAOQ = 0x000000001014dea0

FPR of CPU[1]
00-03  0000000000000000  0000000000000000  0000000000000000  0000000000000000
04-07  000000008f760ec0  0000000000000002  000000001359d740  0000000000000420
08-11  0000000000000000  0000000000000802  00000000105389c0  000000001059a000
12-15  0000000013590000  0000000000000000  0000000010180574  00000000103dc6b8
16-19  00000000000009ee  000000008fa7e000  00000000105389c0  0000000013590000
20-23  00000000103b7b0c  fffffffffffffff4  000000000000021e  0000002f66666667
24-27  000007b100000000  0000999903590b70  0000000003590b78  000000001041b980
28-31  000000001041b980  00000000ff915e20  0000000010187b38  0000000000000004

Parse IAOQ = 0x000000001014dea0 for CPU[1]

Func: zap_page_range, Off: 0x390, Addr: 0x1014dea0

    1014dea0:	06 a0 52 00 	pdtlb r0(sr1,r21)
    1014dea4:	37 39 3f ff 	ldo -1(r25),r25
    1014dea8:	bf 33 3f e5 	cmpb,*<> r19,r25,1014dea0 <zap_page_range+0x390>
    1014deac:	36 b5 20 00 	ldo 1000(r21),r21
-------  Processor 3 HPMC Information - PDC Version: 41.28  ------ 

GR of CPU[3]
00-03  0000000000000000  0000000010429028  000000001010cdd0  0000000000000021
04-07  000000008d0c05b8  00000000105389c0  000000000000000f  0000000000000000
08-11  0000000000000000  0000000040026ee2  0000000040039141  0000000040026fb4
12-15  0000000040028380  00000000faf00950  00000000400342f4  0000000000000000
16-19  000000008d0c05b8  00000000faf00910  00000000faf00910  0000000000058706
20-23  000003182e080065  0000000000000000  0000000000000000  0000000000000000
24-27  0000000000000000  0000000000000000  00000000000003e8  00000000105389c0
28-31  0000000000086470  0000000000086470  000000008d0c0b40  0000000000000226

GR[02] == rp = 000000001010cdd0

Func: handle_interruption, Off: 0xb0, Addr: 0x1010cdd0

    1010cdd0:	08 05 02 5b 	copy r5,dp
    1010cdd4:	02 00 08 b4 	mfctl itmr,r20
    1010cdd8:	02 00 08 b3 	mfctl itmr,r19
    1010cddc:	0a 93 04 33 	sub r19,r20,r19
	...
    1010cde0:	be 7c bf e5 	cmpb,*>> ret0,r19,1010cdd8 <handle_interruption+0xb8>

	...
	...
    1010cdec:	ec 7f bf c5 	cmpib,*<> -1,r3,1010cdd4 <handle_interruption+0xb4>

	...

GR[22] == t1(32bits) == arg4(64bits) = 0000000000000000

GR[21] == t2(32bits) == arg5(64bits) = 0000000000000000

GR[20] == t3(32bits) == arg6(64bits) = 000003182e080065

GR[19] == t4(32bits) == arg7(64bits) = 0000000000058706

GR[26] == arg0 = 00000000000003e8

GR[25] == arg1 = 0000000000000000

GR[24] == arg2 = 0000000000000000

GR[23] == arg3 = 0000000000000000

GR[27] == dp = 00000000105389c0

Func: __gp, Off: 0x0, Addr: 0x105389c0


GR[28] == ret0 = 0000000000086470

GR[29] == ret1 or sl = 0000000000086470

GR[30] == sp = 000000008d0c0b40

GR[31] == ble rp = 0000000000000226
	Not parsable address!

CR of CPU[3]
00-03  0000000000000000  0000000000000000  0000000000000000  0000000000000000
04-07  0000000000000000  0000000000000000  0000000000000000  0000000000000000
08-11  00000000000002b8  0000000000000000  00000000000000c0  000000000000003f
12-15  0000000000000000  0000000000000000  0000000000107000  ffe0000000000000
16-19  000003182e158ca8  0000000000000000  000000001010cde0  00000000be7cbfe5
20-23  00000000103401f4  00000000300c0b50  000000ff0804ff0e  8000000000000000
24-27  0000000000461000  000000007d0c4000  0000000000041020  000000ffff95c810
28-31  000000ffff95c810  5555555555555555  000000008d0c0000  0000000000008020

CR[00] == rctr = 0000000000000000

CR[08] == (Protection ID) pidr1 = 00000000000002b8

CR[10] == ccr = 00000000000000c0

CR[11] == sar = 000000000000003f

CR[14] == iva = 0000000000107000

CR[15] == eiem = ffe0000000000000

CR[16] == itmr = 000003182e158ca8

CR[17] == pcsq = 0000000000000000

CR[18] == pcoq = 000000001010cde0

CR[19] == iir = 00000000be7cbfe5

CR[20] == isr = 00000000103401f4

CR[21] == ior = 00000000300c0b50

CR[22] == ipsw = 000000ff0804ff0e

CR[23] == eirw = 8000000000000000

CR[24] == tr0 (ptov) = 0000000000461000

CR[25] == tr1 (vtop) = 000000007d0c4000

CR[26] == tr2 = 0000000000041020

CR[27] == tr3 = 000000ffff95c810

CR[28] == tr4 = 000000ffff95c810

CR[29] == tr5 = 5555555555555555

CR[30] == tr6 = 000000008d0c0000

CR[31] == tr7 = 0000000000008020

SR of CPU[3]
00-03  0000ae00          00006e00          00000000          0000ae00
04-07  00000000          00000000          00000000          00000000
Need much more work !!!

SR[00] == ts0 = 0000ae00

SR[01] == ts1 = 00006e00

SR[03] == cpp = 0000ae00
	Not parsable address!
...
IIA Offset (back entry)      = 0x000000001010cde4
...

e.g. IAOQ = 0x000000001010cde4

FPR of CPU[3]
00-03  0000000000000000  0000000000000000  0000000000000000  0000000000000000
04-07  000000008f760ec0  0000000000000002  000000001359d740  0000000000000420
08-11  0000000000000000  0000000000000802  00000000105389c0  000000001059a000
12-15  0000000013590000  0000000000000000  0000000010180574  00000000103dc6b8
16-19  00000000000009ee  000000008fa7e000  00000000105389c0  0000000013590000
20-23  00000000103b7b0c  fffffffffffffff4  0000000000000000  0000000000000000
24-27  0000999900000000  0000999903590b70  0000000003590b78  000000001041b980
28-31  000000001041b980  00000000ff915e20  0000000010187b38  0000000000000000

Parse IAOQ = 0x000000001010cde4 for CPU[3]

Func: handle_interruption, Off: 0xc4, Addr: 0x1010cde4

    1010cde0:	be 7c bf e5 	cmpb,*>> ret0,r19,1010cdd8 <handle_interruption+0xb8>
    1010cde4:	08 00 02 40 	nop
    1010cde8:	34 63 3f ff 	ldo -1(r3),r3
    1010cdec:	ec 7f bf c5 	cmpib,*<> -1,r3,1010cdd4 <handle_interruption+0xb4>
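
One possible reading of the CPU[3] listing, offered as a hedged guess: the
two mfctl itmr reads, the sub and the two backward branches look like a
busy-wait on the interval timer nested in a countdown loop, which would be
consistent with the mdelay(100) added after the printk. The register values
fit that reading (GR[28] ret0 = 0x86470 = 550000, GR[26] arg0 = 0x3e8 = 1000,
GR[03] = 0x21 = 33). The sketch below only models the loop shape in user
space; read_itmr(), delay_cycles() and delay_ms() are made-up names and the
timer is a fake counter.

```c
#include <stdint.h>

/* Fake free-running counter standing in for "mfctl itmr". */
static uint64_t fake_itmr;

static uint64_t read_itmr(void)
{
    fake_itmr += 7;     /* pretend each read costs 7 timer ticks */
    return fake_itmr;
}

/* Inner loop: spin until at least `cycles` timer ticks have elapsed. */
static void delay_cycles(uint64_t cycles)
{
    uint64_t start = read_itmr();

    while (read_itmr() - start < cycles)
        ;
}

/* Outer loop: n milliseconds as n repetitions of a one-millisecond spin. */
static unsigned delay_ms(unsigned ms, uint64_t cycles_per_ms)
{
    unsigned loops = 0;

    while (ms--) {
        delay_cycles(cycles_per_ms);
        loops++;
    }
    return loops;
}
```

If this reading is right, CPU[3] was simply spinning inside the debugging
delay when the HPMC was taken, rather than executing anything suspicious.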

Any idea?

>Otherwise setup kernel crash dump and use tools from bruno/phi to view
>contents of the kernel message buffer.

Well, that seems to be the ultimate solution (I don't remember whether it
also works on an SMP kernel?), but I will need to discuss it a bit with them
to see how a dump could be analysed, if I manage to get one.

Thanks again for your attention,
    Joel




* [parisc-linux] N Class SMP pb ?  (follow up)
@ 2003-09-25 14:56 Joel Soete
  2003-09-25 15:41 ` Derek Engelhaupt
  2003-09-25 23:35 ` Grant Grundler
  0 siblings, 2 replies; 11+ messages in thread
From: Joel Soete @ 2003-09-25 14:56 UTC
  To: parisc-linux

Hi all,

Trying to continue the investigation, I put a printk at the beginning of
handle_interruption() to get just the 'code' of each interruption handled.

As already mentioned in my previous mail, I could read many 6 and 15 codes
(but that seems to be normal: even in a UP kernel those interruptions occur),
but (most interesting) this is the very first time that I got the message
that makes the kernel fail:
[...]
handle_interruption(26, ...).
SMP CALL FUNCTION TIMED OUT (CPU=1)
handle_interruption(26, ...).



    Stack dump:

[...]

(unfortunately I couldn't grab this dump :( )

Could this be a pb with sync between cpu time ref? (because timeout = jiffies
+ HZ)

I also had a look at where this function is called, but I never see its
return code tested to launch a 'stack dump' and stop the system?

Thanks in advance for help,
    Joel

PS: I don't know if it is important, but the two cpus on this server are located
in slot 1 and 3 (not in slot 1 and 2 as we would logically expect)




end of thread, other threads:[~2003-10-01 17:21 UTC | newest]

Thread overview: 11+ messages
2003-09-26 15:46 [parisc-linux] N Class SMP pb ? (follow up) Joel Soete
2003-09-26 16:08 ` Joel Soete
2003-09-26 16:50 ` Grant Grundler
2003-09-27 18:16   ` Joel Soete
  -- strict thread matches above, loose matches on Subject: below --
2003-10-01  6:48 Joel Soete
2003-10-01 17:20 ` Joel Soete
2003-09-30 16:31 Joel Soete
2003-09-30 18:50 ` Grant Grundler
2003-09-25 14:56 Joel Soete
2003-09-25 15:41 ` Derek Engelhaupt
2003-09-25 23:35 ` Grant Grundler
