Linux PARISC architecture development
* [parisc-linux] N Class SMP pb ?  (follow up)
@ 2003-09-25 14:56 Joel Soete
  2003-09-25 15:41 ` Derek Engelhaupt
  2003-09-25 23:35 ` Grant Grundler
  0 siblings, 2 replies; 11+ messages in thread
From: Joel Soete @ 2003-09-25 14:56 UTC (permalink / raw)
  To: parisc-linux

Hi all,

Continuing the investigation, I put a printk at the beginning of handle_interruption()
to log just the interruption 'code' being handled.

As already mentioned in my previous mail, I see many 6 and 15 codes (which
seems to be normal: even on a UP kernel those interruptions occur), but
(most interesting) this is the very first time that I got the message that
makes the kernel fail:
[...]
handle_interruption(26, ...).
SMP CALL FUNCTION TIMED OUT (CPU=1)
handle_interruption(26, ...).



    Stack dump:

[...]

(unfortunately I couldn't grab this dump :( )

Could this be a problem with synchronizing the time reference between the
CPUs? (because timeout = jiffies + HZ)

I also looked at where this function is called, but never saw its return
code being tested to trigger a 'stack dump' and halt the system?

Thanks in advance for help,
    Joel

PS: I don't know if it is important, but the two CPUs on this server are located
in slots 1 and 3 (not in slots 1 and 2 as we would logically expect)




^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [parisc-linux] N Class SMP pb ?  (follow up)
  2003-09-25 14:56 Joel Soete
@ 2003-09-25 15:41 ` Derek Engelhaupt
  2003-09-25 23:35 ` Grant Grundler
  1 sibling, 0 replies; 11+ messages in thread
From: Derek Engelhaupt @ 2003-09-25 15:41 UTC (permalink / raw)
  To: parisc-linux


They are in the right slots. N Class CPUs load in this order: 1,3,5,7,0,2,4,6.
If you are looking at the back of the machine with the rear cover open, the
two CPUs should be in the two leftmost slots. The first memory carrier should
be in the rightmost slot and loaded toward the left. I should know, since I
just had to tear apart an entire N to upgrade it from six 550 MHz CPUs to
eight 750 MHz CPUs. It takes about 3 hours and requires a system board change.
The N has 3 system board revisions: A, B, and C. "A" is for 360-440. "B" is
for 360-550. "C" is for 650-750, but I'm sure it would also accept all the
processors slower than 650 with the right speed setting on the DIP switches.
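
The loading order above can be turned into a quick lookup. This is only a sketch: the order itself comes from the message, while the names LOAD_ORDER and populated_slots are made up for illustration.

```python
# Slot loading order as stated in the message above.
LOAD_ORDER = [1, 3, 5, 7, 0, 2, 4, 6]


def populated_slots(cpu_count):
    """Return the slots expected to hold CPUs for a given CPU count.

    Hypothetical helper: it only slices the loading order quoted above.
    """
    if not 0 <= cpu_count <= len(LOAD_ORDER):
        raise ValueError("an N Class has between 0 and 8 CPU slots in use")
    return LOAD_ORDER[:cpu_count]


if __name__ == "__main__":
    # A two-CPU machine, as in the original report.
    print(populated_slots(2))
```

With two CPUs this gives slots 1 and 3, which matches what Joel observed.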
 
derek






^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [parisc-linux] N Class SMP pb ?  (follow up)
  2003-09-25 14:56 Joel Soete
  2003-09-25 15:41 ` Derek Engelhaupt
@ 2003-09-25 23:35 ` Grant Grundler
  1 sibling, 0 replies; 11+ messages in thread
From: Grant Grundler @ 2003-09-25 23:35 UTC (permalink / raw)
  To: Joel Soete; +Cc: parisc-linux

On Thu, Sep 25, 2003 at 04:56:26PM +0200, Joel Soete wrote:
...
> As already mentioned in my previous mail, I see many 6 and 15 codes (which
> seems to be normal: on a UP kernel those interruptions occur too)


Yes - 6 is ITLB miss and 15 is Data TLB miss.

> but (most interesting) this is the very first time that I got
> the message that makes the kernel fail:
> [...]
> handle_interruption(26, ...).

26 is "Data Memory Access rights Trap".
This sounds normal for Copy-On-Write.
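
To see which codes dominate a captured console log, the printk lines can be tallied. This is a minimal sketch: the three names are the ones given in this thread, the line format is the one Joel's printk produced, and the helper names (describe, summarize) are hypothetical.

```python
import re
from collections import Counter

# Only the codes named in this thread; anything else is reported as unknown.
INTERRUPTION_NAMES = {
    6: "ITLB miss",
    15: "Data TLB miss",
    26: "Data memory access rights trap",
}

# Matches lines such as "handle_interruption(26, ...)."
LINE_RE = re.compile(r"handle_interruption\((\d+),")


def describe(code):
    """Return the name used in this thread for an interruption code."""
    return INTERRUPTION_NAMES.get(code, "unknown (not discussed in this thread)")


def summarize(log_lines):
    """Count how often each interruption code appears in the log lines."""
    counts = Counter()
    for line in log_lines:
        match = LINE_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "handle_interruption(26, ...).",
        "SMP CALL FUNCTION TIMED OUT (CPU=1)",
        "handle_interruption(26, ...).",
    ]
    for code, count in summarize(sample).items():
        print(code, describe(code), count)
```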

> SMP CALL FUNCTION TIMED OUT (CPU=1)

The IPI handler will time out if the other CPU doesn't ack
the function call within a second. This is bad.
It means either the other CPU never got the interrupt (locked up
with I-bit off) or the "unstarted_count" isn't coherent
between the CPUs.
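
The timeout described here can be pictured with a small user-space model. This is only a sketch and not the kernel's code: wait_for_ack, the read_unstarted_count callback and the HZ value of 100 are assumptions made for illustration; only unstarted_count and the jiffies + HZ deadline come from this thread.

```python
HZ = 100  # assumed tick rate; the thread only says timeout = jiffies + HZ


def wait_for_ack(read_unstarted_count, hz=HZ):
    """Model the sender side of a cross-CPU call.

    read_unstarted_count(jiffies) stands in for reading the shared counter
    that the other CPU decrements when it starts the requested function.
    Returns True if the counter reached zero before the deadline, and
    False if one second's worth of ticks passed first.
    """
    jiffies = 0
    timeout = jiffies + hz
    while read_unstarted_count(jiffies) > 0:
        if jiffies >= timeout:
            return False  # this is where "TIMED OUT" would be reported
        jiffies += 1      # one timer tick passes on the calling CPU
    return True


if __name__ == "__main__":
    # Other CPU acknowledges after 10 ticks.
    print(wait_for_ack(lambda j: 0 if j >= 10 else 1))
    # Other CPU never acknowledges, e.g. because it is stuck with the I-bit off.
    print(wait_for_ack(lambda j: 1))
```

Both the deadline and the tick counter are read on the calling CPU in this model, which is the same point Grant makes below about jiffies.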

> handle_interruption(26, ...).
>
> Could this be a pb with sync between cpu time ref?
> (because timeout = jiffies + HZ)

I don't think so, since jiffies is a global.
And it's always measured on the same CPU.

> I also looked at where this function is called, but never saw its return
> code being tested to trigger a 'stack dump' and halt the system?

You need to find out who is using smp_call_function() and which function
they are trying to invoke. I suspect it's coming from mm/slab.c but
wouldn't know which of the three it might be.

grant

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [parisc-linux] N Class SMP pb ?  (follow up)
@ 2003-09-26 15:46 Joel Soete
  2003-09-26 16:08 ` Joel Soete
  2003-09-26 16:50 ` Grant Grundler
  0 siblings, 2 replies; 11+ messages in thread
From: Joel Soete @ 2003-09-26 15:46 UTC (permalink / raw)
  To: Grant Grundler; +Cc: parisc-linux

>Yes - 6 is ITLB miss and 15 is Data TLB miss.
...
>
>> handle_interruption(26, ...).
>
>26 is "Data Memory Access rights Trap".
>This sounds normal for Copy-On-Write.

Yes. To be sure, I just logged on to a b2k running the same kernel (except
for pdc support, but I already verified that it makes no difference to the
SMP crash on the N), and indeed it is normal to see many 6, 15 and 26
interruptions.

>> SMP CALL FUNCTION TIMED OUT (CPU=1)
>
>The IPI handler will time out if the other CPU doesn't ack
>the function call within a second. This is bad.

On the contrary, this is the best message I have ever got for starting an
analysis of this crash :))

>It means either other CPU never got the interrupt (locked up
>with I-bit off) or the "unstarted_count" isn't coherent between the CPUs.

hmm how could I verify this hypothesis?

>>
>> Could this be a pb with sync between cpu time ref?
>> (because timeout = jiffies + HZ)
>
>I don't think so since jiffies is a global.
>And it's always be measured on the same CPU.
Ok
>
>> I have also a look for where this function is called but never see its return
>> code tested to launch a 'stack dump' and a stop of system?
>
>You need to find out who is using smp_call_function() and which function
>they are trying to invoke. I suspect it's coming from mm/slab.c but
>wouldn't know which of the three it might be.

Indeed, I can't find any other place where it is called, so I finally added
a printk in each function that calls smp_call_function_all_cpus().

That let me notice several calls to kmem_tune_cpucache() (exactly 7, and
no others), but I no longer get 'SMP CALL FUNCTION TIMED OUT (CPU=1)' :(
(I presume that, as before, the system crashes before it has the opportunity
to flush its buffer?)

What do you think?

Thanks a lot for help,
    Joel





^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [parisc-linux] N Class SMP pb ?  (follow up)
  2003-09-26 15:46 [parisc-linux] N Class SMP pb ? (follow up) Joel Soete
@ 2003-09-26 16:08 ` Joel Soete
  2003-09-26 16:50 ` Grant Grundler
  1 sibling, 0 replies; 11+ messages in thread
From: Joel Soete @ 2003-09-26 16:08 UTC (permalink / raw)
  To: Grant Grundler; +Cc: parisc-linux

>
>That let me notice several calls to kmem_tune_cpucache() (exactly 7, and
>no others), but I no longer get 'SMP CALL FUNCTION TIMED OUT (CPU=1)' :(
>(I presume that, as before, the system crashes before it has the opportunity
>to flush its buffer?)

btw: are there any tips for flushing the buffer before anything else (or for
not buffering console output)?



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [parisc-linux] N Class SMP pb ?  (follow up)
  2003-09-26 15:46 [parisc-linux] N Class SMP pb ? (follow up) Joel Soete
  2003-09-26 16:08 ` Joel Soete
@ 2003-09-26 16:50 ` Grant Grundler
  2003-09-27 18:16   ` Joel Soete
  1 sibling, 1 reply; 11+ messages in thread
From: Grant Grundler @ 2003-09-26 16:50 UTC (permalink / raw)
  To: Joel Soete; +Cc: parisc-linux

On Fri, Sep 26, 2003 at 05:46:35PM +0200, Joel Soete wrote:
> >It means either other CPU never got the interrupt (locked up
> >with I-bit off) or the "unstarted_count" isn't coherent between the CPUs.
> 
> hmm how could I verify this hypothesis?

TOC the machine, "ser pim" and look at PSW in TOC Info for each CPU.
bit 0 is the I-Bit IIRC.

On second thought, I'm skeptical unstarted_count isn't coherent
since it's a kernel global as well (like jiffies).

> >You need to find out who is using smp_call_function() and which function
> >they are trying to invoke. I suspect it's coming from mm/slab.c but
> >wouldn't know which of the three it might be.
> 
> Indeed, I can't find any other place where it is called, so I finally added
> a printk in each function that calls smp_call_function_all_cpus().
> 
> That let me notice several calls to kmem_tune_cpucache() (exactly 7, and
> no others), but I no longer get 'SMP CALL FUNCTION TIMED OUT (CPU=1)' :(
> (I presume that, as before, the system crashes before it has the opportunity
> to flush its buffer?)
> 
> What do you think?

Could be.
Add mdelay(100) (or higher) after the lines of output you've added.
That works if it's a functional problem that's not timing dependent.

Otherwise set up a kernel crash dump and use the tools from bruno/phi to view
the contents of the kernel message buffer.

grant

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [parisc-linux] N Class SMP pb ?  (follow up)
  2003-09-26 16:50 ` Grant Grundler
@ 2003-09-27 18:16   ` Joel Soete
  0 siblings, 0 replies; 11+ messages in thread
From: Joel Soete @ 2003-09-27 18:16 UTC (permalink / raw)
  To: Grant Grundler; +Cc: parisc-linux

Hello Grant,

Grant Grundler wrote:

>On Fri, Sep 26, 2003 at 05:46:35PM +0200, Joel Soete wrote:
>>>It means either other CPU never got the interrupt (locked up
>>>with I-bit off) or the "unstarted_count" isn't coherent between the CPUs.
>>
>>hmm how could I verify this hypothesis?
>
>TOC the machine, "ser pim" and look at PSW in TOC Info for each CPU.
>bit 0 is the I-Bit IIRC.
>
Here is such a TOC:
PROCESSOR PIM INFORMATION

Original Product Number:  A3639C
Current Product Number:   A3639C


-------  Processor 1 HPMC Information - PDC Version: 41.28  ------

Timestamp =    Tue Mar  11 18:07:11 GMT 2003    (20:03:03:11:18:07:11)

HPMC Chassis Codes

       Chassis Code        Extension
       ------------        ---------
       0x0000082000ff6242  0x0000000000000000
       0x1800082011016312  0xcb81000000000000
       0x0000087000ff6292  0x000000ffff800000
       0x6000082013016062  0x2002000000080000
       0x6000082013016072  0x0000000000080000
       0x7000082013016082  0x0000000000192200
       0x6000082013036062  0x2001000000082004
       0x6000082013036072  0x0000000000082000
       0x7000082013036082  0x0000000000992600
       0x6000082070006062  0x0000000000080000
       0x6000082070006072  0x0000000000080000
       0x7000082070006082  0x0000000000192200
       0x6000082070016062  0x0000000000000800
       0x6000082070016072  0x0000000000000800
       0x7000082070016082  0x00000000001a4400
       0x0000080080006310  0x0000000000000001
       0x7000082082006333  0x0000000000b92200
       0x7000082082016333  0x0000000000b92200
       0x000008008000631f  0x0000000000000000
       0x0000082000ff6452  0x0000000000000000
       0x0000082000ff6402  0x0000000000000000
       0x0000080080006300  0x0000000000000001
       0x7000082082006333  0x0000000000b92200
       0x7000082382006343  0x0000000000070200
       0x7000082382016343  0x0000000000070200
       0x7000082382026343  0x0000000000070200
       0x7000082382046343  0x0000000000070200
       0x7000082382056343  0x0000000000070200
       0x7000082382086343  0x0000000000070200
       0x70000823820a6343  0x0000000000070200
       0x70000823820c6343  0x0000000000070200
       0x7000082082016333  0x0000000000b92200
       0x7000082382106343  0x0000000000070200
       0x7000082382126343  0x0000000000070200
       0x7000082382146343  0x0000000000070200
       0x7000082382186343  0x0000000000070200
       0x70000823821a6343  0x0000000000070200
       0x70000823821c6343  0x0000000000070200
       0x0000080089006200  0x0000000000000000
       0x0000082389006200  0x0000000000000000
       0x0000080086006200  0x0000000000000000
       0x000008008000630f  0x0000000000000000


General Registers 0 - 31
00-03  0000000000000000  00000000104f6380  000000001014acb4  00000000104f3b80
04-07  000000008f029000  0000000010423688  000000008f0b8000  0000000010000000
08-11  0000000013484f70  0000000013481e48  000000007f0b8b25  000000001054ebc0
12-15  00000000000e1984  000000001054ec20  000000008f0a40c0  000000008f0bf708
16-19  0000000013481e48  0000000000000000  00000000faf005e0  0000000000000580
20-23  000000001054ebc0  00000000002f7465  00000000003f45a2  000fe051ffc07eb8
24-27  000000007f029b27  00000000000e1984  000000008f0a40c0  00000000104f3b80
28-31  000000000007f029  003f81480007f029  000000008f0e4f40  0000000000008ba3


Control Registers 0 - 31
00-03  0000000000000000  0000000000000000  0000000000000000  0000000000000000
04-07  0000000000000000  0000000000000000  0000000000000000  0000000000000000
08-11  0000000000000016  0000000000000000  00000000000000c0  000000000000002b
12-15  0000000000000000  0000000000000000  0000000000107000  ffe0000000000000
16-19  00000024643cebe8  0000000000000000  000000001014acec  0000000037dd3f61
20-23  0000000000000600  00000000000e1984  000000ff0804c70f  c000000000000000
24-27  0000000000427000  000000007f04b000  0000000000041020  000000ffff95c810
28-31  5555555555555555  5555555555555555  000000008f0e4000  0000000010560000

Space Registers 0 - 7
00-03  00000580          00000580          00000000          00000580
04-07  00000000          00000000          00000000          00000000


IIA Space (back entry)       = 0x0000000000000000
IIA Offset (back entry)      = 0x000000001014acf0
Check Type                   = 0x20000000
CPU State                    = 0x9e000004
Cache Check                  = 0x00000000
TLB Check                    = 0x00000000
Bus Check                    = 0x0010c03b
Assists Check                = 0x00000000
Assist State                 = 0x00000000
Path Info                    = 0x00000000
System Responder Address     = 0x0000000000000000
System Requestor Address     = 0xfffffffffed25000


Floating Point Registers 0 - 31
00-03  0000000000000000  0000000000000000  0000000000000000  0000000000000000
04-07  000000001050eec0  00000000104f3b80  0000000000000002  000000001049d248
08-11  00000000104f3b80  0000000000000802  00000000104be588  000000008fac8000
12-15  0000000000000000  0000000000000000  000000001016ace8  00000000103ad6e0
16-19  00000000000009ca  000000008f7cb000  000000000800000f  000000001049d250
20-23  000000001050eec0  00000000104f3b80  00000000003f45a2  000000000000ba2e
24-27  0000999900000000  000099997fac8b70  000000007fac8b78  000000000bebc200
28-31  0000000000000001  00000000ff915e20  0000000010165bf4  00000000104f3b80


Check Summary                = 0xcb81000000000000
Available Memory             = 0x0000000100000000
CPU Diagnose Register 2      = 0x0301010800802004
CPU Status Register 0        = 0x2640c24000000000
CPU Status Register 1        = 0x8000200000000000
SADD LOG                     = 0xf8efdb00003fd800
Read Short LOG               = 0xc18200ff80000002



-----------------  DEW 1 HPMC Information -  ------

Timestamp =    Tue Mar  11 18:07:11 GMT 2003    (20:03:03:11:18:07:11)

Runway Control Log Reg            = 0x00927b0000000000
Runway Address Data Log Reg Odd   = 0xc0aa1010c4a61010
Runway Address Data Log Reg Even  = 0xc8a61010cca61010
Runway Address Log Reg            = 0x00000000000000f4
Runway Broad Error Log Reg        = 0x000000000000005c

  OV  RQ  RS      ESTAT      A  C  D  corr  unc  fe  cw  pf
  --  --  --      -----      -  -  -  ----  ---  --  --  --
                ERR_ERROR       X                X

Merced Bus Requestor Address      = 0x0000000000000000
Merced Bus Target Address         = 0x0000000000000000
Merced Bus Responder Address      = 0x0000000000000000
Merced Error Status Reg           = 0x2002000000080000
Merced Error Overflow Reg         = 0x0000000000080000
Merced AERR Addr1 Log Reg         = 0x00006000ff86fdc0
Merced AERR Addr2 Log Reg         = 0x00008000078fff08
Merced DERR  Log Reg              = 0x0001000000000000
Merced Error Syndrome Reg         = 0x00000000000000c0


-------  Processor 1 LPMC Information ------------------

Check Type                   = 0x00000000
IC Parity Info               = 0x00000000
Cache Check                  = 0x00000000
TLB Check                    = 0x00000000
Bus Check                    = 0x00000000
Assists Check                = 0x00000000
Assist State                 = 0x00000000
Path Info                    = 0x00000000
System Responder Address     = 0x0000000000000000
System Requestor Address     = 0x0000000000000000



-------  Processor 1 TOC Information -------------------

General Registers 0 - 31
00-03  0000000000000000  0000000000000000  0000000000000000  0000000000000000
04-07  0000000000000000  0000000000000000  0000000000000000  0000000000000000
08-11  0000000000000000  0000000000000000  0000000000000000  0000000000000000
12-15  0000000000000000  0000000000000000  0000000000000000  0000000000000000
16-19  0000000000000000  0000000000000000  0000000000000000  0000000000000000
20-23  0000000000000000  0000000000000000  0000000000000000  0000000000000000
24-27  0000000000000000  0000000000000000  0000000000000000  0000000000000000
28-31  0000000000000000  0000000000000000  0000000000000000  0000000000000000


Control Registers 0 - 31
00-03  0000000000000000  0000000000000000  0000000000000000  0000000000000000
04-07  0000000000000000  0000000000000000  0000000000000000  0000000000000000
08-11  0000000000000000  0000000000000000  0000000000000000  0000000000000000
12-15  0000000000000000  0000000000000000  0000000000000000  0000000000000000
16-19  0000000000000000  0000000000000000  0000000000000000  0000000000000000
20-23  0000000000000000  0000000000000000  0000000000000000  0000000000000000
24-27  0000000000000000  0000000000000000  0000000000000000  0000000000000000
28-31  0000000000000000  0000000000000000  0000000000000000  0000000000000000

Space Registers 0 - 7
00-03  00000000          00000000          00000000          00000000
04-07  00000000          00000000          00000000          00000000

IIA Space (back entry)       = 0x0000000000000000
IIA Offset (back entry)      = 0x0000000000000000
CPU State                    = 0x00000000



-------  Processor 3 HPMC Information - PDC Version: 41.28  ------

Timestamp =    Tue Mar  11 18:07:11 GMT 2003    (20:03:03:11:18:07:11)

HPMC Chassis Codes

       Chassis Code        Extension
       ------------        ---------
       0x0000082000ff6242  0x0000000000000000
       0x1800082011036322  0xcb81800000000000
       0x0000082000ff6452  0x0000000000000000
       0x0000082000ff6402  0x0000000000000000


General Registers 0 - 31
00-03  0000000000000000  0000000010502b80  00000000101161cc  00000000103ef0f8
04-07  000000000800000f  0000000000000002  0000000000000000  00000000104f3b80
08-11  00000000103ef0f8  00000000103ef0f8  000000001038c43c  000000001038af08
12-15  0000000000000001  0000000000000001  0000000000000000  000000001038e004
16-19  000000001038e018  000000008f7cc180  0000000000000002  0000000000000001
20-23  000000000000702c  0000000010423078  00000000104f4380  0000000000000001
24-27  0000000000000116  000000001038c43c  00000000103ef130  00000000104f3b80
28-31  0000000000000000  000000008f0353b0  000000008f0353c0  0000000000008ba3


Control Registers 0 - 31
00-03  0000000000000000  0000000000000000  0000000000000000  0000000000000000
04-07  0000000000000000  0000000000000000  0000000000000000  0000000000000000
08-11  0000000000000018  0000000000000000  00000000000000c0  000000000000003d
12-15  0000000000000000  0000000000000000  0000000000107000  ffe0000000000000
16-19  000000246412e91b  0000000000000000  00000000101162d0  000000008e605e8d
20-23  0000000000000600  0000000000000000  000000000806060f  0000000000000000
24-27  0000000000427000  000000007f03e000  0000000000041020  000000ffff95c810
28-31  000000ffff95c810  5555555555555555  000000008f034000  0000000000008020

Space Registers 0 - 7
00-03  00000600          00000000          00000000          00000600
04-07  00000000          00000000          00000000          00000000


IIA Space (back entry)       = 0x0000000000000000
IIA Offset (back entry)      = 0x00000000101162d4
Check Type                   = 0x20000000
CPU State                    = 0x9e000004
Cache Check                  = 0x00000000
TLB Check                    = 0x00000000
Bus Check                    = 0x0030000d
Assists Check                = 0x00000000
Assist State                 = 0x00000000
Path Info                    = 0x00000000
System Responder Address     = 0xfffffffffed2d000
System Requestor Address     = 0x000000fffed2c000


Floating Point Registers 0 - 31
00-03  0000000000000000  0000000000000000  0000000000000000  0000000000000000
04-07  000000001050eec0  00000000104f3b80  0000000000000002  000000001049d248
08-11  00000000104f3b80  0000000000000802  00000000104be588  000000008fac8000
12-15  0000000000000000  0000000000000000  000000001016ace8  00000000103ad6e0
16-19  00000000000009ca  000000008f7cb000  000000000800000f  000000001049d250
20-23  000000001050eec0  00000000104f3b80  0000000000000000  000000000000ba2e
24-27  0000999900000000  000099997fac8b70  000000007fac8b78  000000000bebc200
28-31  0000000000000001  00000000ff915e20  0000000010165bf4  00000000104f3b80


Check Summary                = 0xcb81800000000000
Available Memory             = 0x0000000100000000
CPU Diagnose Register 2      = 0x0301030800802004
CPU Status Register 0        = 0x3640c24000000000
CPU Status Register 1        = 0x8000000000000000
SADD LOG                     = 0x48e0000000000002
Read Short LOG               = 0xc18080ff80080014



-----------------  DEW 3 HPMC Information -  ------

Timestamp =    Tue Mar  11 18:07:11 GMT 2003    (20:03:03:11:18:07:11)

Runway Control Log Reg            = 0x0006720000000000
Runway Address Data Log Reg Odd   = 0xfffffffffffc3f00
Runway Address Data Log Reg Even  = 0xfffffffffffc3f00
Runway Address Log Reg            = 0x0000000000000048
Runway Broad Error Log Reg        = 0x00000000000000dc

  OV  RQ  RS      ESTAT      A  C  D  corr  unc  fe  cw  pf
  --  --  --      -----      -  -  -  ----  ---  --  --  --
  X             ERR_ERROR       X            X   X

Merced Bus Requestor Address      = 0x0000000000000000
Merced Bus Target Address         = 0x0000000000000000
Merced Bus Responder Address      = 0x0000000000000000
Merced Error Status Reg           = 0x2001000000082004
Merced Error Overflow Reg         = 0x0000000000082000
Merced AERR Addr1 Log Reg         = 0x00c0000000300000
Merced AERR Addr2 Log Reg         = 0x0000000000f00000
Merced DERR  Log Reg              = 0x00c1100000000000
Merced Error Syndrome Reg         = 0x0000000052000000


-------  Processor 3 LPMC Information ------------------

Check Type                   = 0x00000000
IC Parity Info               = 0x00000000
Cache Check                  = 0x00000000
TLB Check                    = 0x00000000
Bus Check                    = 0x00000000
Assists Check                = 0x00000000
Assist State                 = 0x00000000
Path Info                    = 0x00000000
System Responder Address     = 0x0000000000000000
System Requestor Address     = 0x0000000000000000



-------  Processor 3 TOC Information -------------------

General Registers 0 - 31
00-03  0000000000000000  0000000000000000  0000000000000000  0000000000000000
04-07  0000000000000000  0000000000000000  0000000000000000  0000000000000000
08-11  0000000000000000  0000000000000000  0000000000000000  0000000000000000
12-15  0000000000000000  0000000000000000  0000000000000000  0000000000000000
16-19  0000000000000000  0000000000000000  0000000000000000  0000000000000000
20-23  0000000000000000  0000000000000000  0000000000000000  0000000000000000
24-27  0000000000000000  0000000000000000  0000000000000000  0000000000000000
28-31  0000000000000000  0000000000000000  0000000000000000  0000000000000000


Control Registers 0 - 31
00-03  0000000000000000  0000000000000000  0000000000000000  0000000000000000
04-07  0000000000000000  0000000000000000  0000000000000000  0000000000000000
08-11  0000000000000000  0000000000000000  0000000000000000  0000000000000000
12-15  0000000000000000  0000000000000000  0000000000000000  0000000000000000
16-19  0000000000000000  0000000000000000  0000000000000000  0000000000000000
20-23  0000000000000000  0000000000000000  0000000000000000  0000000000000000
24-27  0000000000000000  0000000000000000  0000000000000000  0000000000000000
28-31  0000000000000000  0000000000000000  0000000000000000  0000000000000000

Space Registers 0 - 7
00-03  00000000          00000000          00000000          00000000
04-07  00000000          00000000          00000000          00000000

IIA Space (back entry)       = 0x0000000000000000
IIA Offset (back entry)      = 0x0000000000000000
CPU State                    = 0x00000000


--------------  Memory Error Log Information  --------------

Bus 0 Log Information

Timestamp =    Tue Mar  11 18:07:11 GMT 2003    (20:03:03:11:18:07:11)

  OV  RQ  RS      ESTAT      A  C  D  corr  unc  fe  cw  pf
  --  --  --      -----      -  -  -  ----  ---  --  --  --
                ERR_ERROR       X                X

Bus Requestor Address      = 0x0000000000000000
Bus Target Address         = 0x0000000000000000
Bus Responder Address      = 0x0000000000000000

Error Status Reg           = 0x0000000000080000
Error Overflow Reg         = 0x0000000000080000
AERR Address 1 Log Reg     = 0x0000000000000000
AERR Address 2 Log Reg     = 0xf800000000000000
FERR  Log Reg              = 0x0000000000000000
DERR  Log Reg              = 0x000133000051cdc0
Error Syndrome Reg         = 0x0000000000000000



 Address/Control Parity Error Registers

   Address/Control Parity Error Bit (AE) Not Set



Bus 1 Log Information

Timestamp =    Tue Mar  11 18:07:11 GMT 2003    (20:03:03:11:18:07:11)

  OV  RQ  RS      ESTAT      A  C  D  corr  unc  fe  cw  pf
  --  --  --      -----      -  -  -  ----  ---  --  --  --
               ERR_TIMEOUT   X               X

Bus Requestor Address      = 0xfffffffffed2c000
Bus Target Address         = 0x00000000f000a000
Bus Responder Address      = 0x0000000000000000

Error Status Reg           = 0x0000000000000800
Error Overflow Reg         = 0x0000000000000800 
AERR Address 1 Log Reg     = 0x08006000f000a000 
AERR Address 2 Log Reg     = 0x6000b0003f700a10
FERR  Log Reg              = 0x0000000000000000
DERR  Log Reg              = 0x0000000000000000
Error Syndrome Reg         = 0x0000000000000000



 Address/Control Parity Error Registers 

   Address/Control Parity Error Bit (AE) Not Set



------------  I/O Module Error Log Information  ------------

Summary of IO subsystem log entries
-----------------------------------
                        Phys Loc             Vendor  Device   Severity
Description             (hex)                 Id      Id      CORR UNC FE  CW
-----------             -----                ------  ------   ----------------
System Bus Adapter SB  0x000000ffffffff82   0x103c  0x1050              X
System Bus Adapter RP  0x000000ffff0dff83   0x103c  0x1051              X
System Bus Adapter RP  0x000000ffff0eff83   0x103c  0x1051              X
System Bus Adapter RP  0x000101ffff06ff83   0x103c  0x1051              X
System Bus Adapter RP  0x000101ffff02ff83   0x103c  0x1051              X
System Bus Adapter RP  0x000101ffff01ff83   0x103c  0x1051              X
System Bus Adapter RP  0x000101ffff04ff83   0x103c  0x1051              X
System Bus Adapter RP  0x000101ffff05ff83   0x103c  0x1051              X
System Bus Adapter RP  0x000101ffff03ff83   0x103c  0x1051              X
System Bus Adapter SB  0x000000ffffffff82   0x103c  0x1050              X
System Bus Adapter RP  0x000202ffff0cff83   0x103c  0x1051              X
System Bus Adapter RP  0x000202ffff0aff83   0x103c  0x1051              X
System Bus Adapter RP  0x000202ffff09ff83   0x103c  0x1051              X
System Bus Adapter RP  0x000202ffff0bff83   0x103c  0x1051              X
System Bus Adapter RP  0x000202ffff08ff83   0x103c  0x1051              X
System Bus Adapter RP  0x000202ffff07ff83   0x103c  0x1051              X


Detail display of IO subsystem log entries
------------------------------------------

System Bus Adapter -- System Bus Interface
------------------------------------------

Timestamp =    Tue Mar  11 18:09:10 GMT 2003    (20:03:03:11:18:09:10)

  OV  RQ  RS      ESTAT      A  C  D  corr  unc  fe  cw  pf
  --  --  --      -----      -  -  -  ----  ---  --  --  --
  X       X     ERR_ERROR       X                X

IO Requestor Address    = 0x0000000000000000
IO Target Address       = 0x0000000000000000
IO Responder Address    = 0xfffffffffed00000
IO Physical Location    = 0x000000ffffffff82
IO Hardware Path        = 0x00ffffffffffff00

Module Error Register   = 0x0000000007ff0034

System Bus Adapter --       Rope Interface
------------------------------------------

Timestamp =    Tue Mar  11 18:09:12 GMT 2003    (20:03:03:11:18:09:12)

  OV  RQ  RS      ESTAT      A  C  D  corr  unc  fe  cw  pf
  --  --  --      -----      -  -  -  ----  ---  --  --  --
               ERR_FUNCTION                      X      

IO Requestor Address    = 0x0000000000000000
IO Target Address       = 0x0000000000000000
IO Responder Address    = 0x0000000000000000
IO Physical Location    = 0x000000ffffffff82
IO Hardware Path        = 0x00ffffffffffff00

Module Error Register   = 0x0000000000000000
Rope Physical Location  = 0x000000ffff0dff83

System Bus Adapter --       Rope Interface
------------------------------------------

Timestamp =    Tue Mar  11 18:09:12 GMT 2003    (20:03:03:11:18:09:12)

  OV  RQ  RS      ESTAT      A  C  D  corr  unc  fe  cw  pf
  --  --  --      -----      -  -  -  ----  ---  --  --  --
               ERR_FUNCTION                      X      
 
IO Requestor Address    = 0x0000000000000000
IO Target Address       = 0x0000000000000000
IO Responder Address    = 0x0000000000000000
IO Physical Location    = 0x000000ffffffff82
IO Hardware Path        = 0x00ffffffffffff00

Module Error Register   = 0x0000000000000000
Rope Physical Location  = 0x000000ffff0eff83

System Bus Adapter --       Rope Interface
------------------------------------------
Timestamp =    Tue Mar  11 18:09:12 GMT 2003    (20:03:03:11:18:09:12)

  OV  RQ  RS      ESTAT      A  C  D  corr  unc  fe  cw  pf
  --  --  --      -----      -  -  -  ----  ---  --  --  --
               ERR_FUNCTION                      X

IO Requestor Address    = 0x0000000000000000
IO Target Address       = 0x0000000000000000
IO Responder Address    = 0x0000000000000000
IO Physical Location    = 0x000000ffffffff82
IO Hardware Path        = 0x00ffffffffffff00

Module Error Register   = 0x0000000000000000
Rope Physical Location  = 0x000101ffff06ff83

System Bus Adapter --       Rope Interface
------------------------------------------

Timestamp =    Tue Mar  11 18:09:12 GMT 2003    (20:03:03:11:18:09:12)

  OV  RQ  RS      ESTAT      A  C  D  corr  unc  fe  cw  pf
  --  --  --      -----      -  -  -  ----  ---  --  --  --
               ERR_FUNCTION                      X

IO Requestor Address    = 0x0000000000000000
IO Target Address       = 0x0000000000000000
IO Responder Address    = 0x0000000000000000
IO Physical Location    = 0x000000ffffffff82
IO Hardware Path        = 0x00ffffffffffff00

Module Error Register   = 0x0000000000000000
Rope Physical Location  = 0x000101ffff02ff83

System Bus Adapter --       Rope Interface
------------------------------------------

Timestamp =    Tue Mar  11 18:09:12 GMT 2003    (20:03:03:11:18:09:12)
[...]
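
The parenthesised field after each "Timestamp =" line appears to repeat the date as colon-separated century:year:month:day:hour:minute:second. A minimal Python sketch that parses it, assuming that field order (it is read off the log entries above, not taken from firmware documentation); parse_pim_timestamp is a hypothetical helper name:

```python
from datetime import datetime


def parse_pim_timestamp(field: str) -> datetime:
    """Parse a '(CC:YY:MM:DD:hh:mm:ss)' field as printed in the IO error log.

    The field order is an assumption inferred from the log entries themselves.
    """
    parts = [int(p) for p in field.strip("()").split(":")]
    if len(parts) != 7:
        raise ValueError(f"expected 7 colon-separated numbers, got {field!r}")
    century, year, month, day, hour, minute, second = parts
    return datetime(century * 100 + year, month, day, hour, minute, second)


if __name__ == "__main__":
    # Value copied from the first System Bus Interface entry above.
    print(parse_pim_timestamp("(20:03:03:11:18:09:10)").isoformat())
```

Comparing the parsed value with the human-readable date on the same line is a cheap way to check that the assumed field order holds for a given firmware revision.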

Well, that is from an older test, but I don't know yet what the PSW could be 
(sorry, I haven't found more documentation about the TOC output).

>On second thought, I'm skeptical unstarted_count isn't coherent
>since it's a kernel global as well (like jiffies).
>
>  
>
>>>You need to find out who is using smp_call_function() and which function
>>>they are trying to invoke. I suspect it's coming from mm/slab.c but
>>>would know which of the three it might be.
>>>      
>>>
>>Indeed, I can't find any other place where it is called, so I finally added a
>>printk to each function that calls smp_call_function_all_cpus().
>>
>>That let me see several calls to kmem_tune_cpucache() (7 exactly, and no
>>others), but I no longer get any 'SMP CALL FUNCTION TIMED OUT (CPU=1)'
>>:(
>>(I presume that, as before, the system crashes before having the opportunity
>>to flush its buffer?)
>>
>>What do you think?
>>    
>>
>
>Could be.
>Add mdelay(100) (or higher) after the lines of output you've added.
>That works if it's a functional problem that's not timing dependent.
>  
>
Because during another test I managed to boot this N in SMP (well, only for 
half an hour), I am quite sure there is such a problem somewhere 
(the difficulty is finding where).

>Otherwise setup kernel crash dump and use tools from bruno/phi to view
>contents of the kernel message buffer.
>
I already thought of this (because I have tested several of bruno's patches), but I 
have two problems implementing it:
a) my system has 2 GB of RAM (4 * 512 MB iirc) and I don't see how to 
reconfigure the disk with at least 2 GB of swap (== dump area iirc).
The disk is sliced as follows:
    Name        Flags      Part Type  FS Type          [Label]        Size (MB)
 ------------------------------------------------------------------------------
    sda1        Boot        Primary   Linux/PA-RISC boot                  67.56
    sda2                    Primary   Linux swap                         135.11
    sda3                    Primary   Linux ext3                         130.89
    sda5                    Logical   Linux ext3                        1760.56
    sda6                    Logical   Linux ext3                         261.77
    sda7                    Logical   Linux ext3                         130.89
    sda8                    Logical   Linux ext3                         130.89
    sda9                    Logical   Linux ext3                        1574.79

sda5, being the root fs, must lie within the 2 GB limit iirc, but I am not 
quite sure that swap also has to be within that limit (in fact it is 
laid out like this only because of the very first, now obsolete, puffin :) 
install instructions).
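
To put numbers on point a): a small Python sketch that compares the swap partition with the installed RAM. It assumes, as recalled above ("iirc"), that the crash dump area is the swap partition and must be at least as large as RAM; that requirement is not verified here. dump_area_shortfall_mb is a hypothetical helper name.

```python
# Sizes in MB, copied from the partition table above.
PARTITIONS_MB = {
    "sda1": 67.56,
    "sda2": 135.11,   # Linux swap
    "sda3": 130.89,
    "sda5": 1760.56,  # root fs
    "sda6": 261.77,
    "sda7": 130.89,
    "sda8": 130.89,
    "sda9": 1574.79,
}

RAM_MB = 4 * 512  # "4 * 512 MB iirc"


def dump_area_shortfall_mb(swap_mb: float, ram_mb: float) -> float:
    """Return how many MB the swap partition falls short of RAM (0 if none)."""
    return max(0.0, ram_mb - swap_mb)


if __name__ == "__main__":
    shortfall = dump_area_shortfall_mb(PARTITIONS_MB["sda2"], RAM_MB)
    print(f"total disk space in use: {sum(PARTITIONS_MB.values()):.2f} MB")
    print(f"swap is short of RAM by: {shortfall:.2f} MB")
```

Under that assumption the current swap slice is far too small, so some other partition would have to be shrunk or repurposed to host a full dump.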

b) afaik p4 is not yet publicly released?

Thanks in advance for your additional help,
    Joel


* Re: [parisc-linux] N Class SMP pb ?  (follow up)
@ 2003-09-30 16:31 Joel Soete
  2003-09-30 18:50 ` Grant Grundler
  0 siblings, 1 reply; 11+ messages in thread
From: Joel Soete @ 2003-09-30 16:31 UTC (permalink / raw)
  To: Grant Grundler; +Cc: parisc-linux

Hi Grant,

Here is the very last test I did yesterday with the additional mdelay(100):

>TOC the machine, "ser pim" and look at PSW in TOC Info for each CPU.
>bit 0 is the I-Bit IIRC.

In summary:
-------  Processor 1 HPMC Information - PDC Version: 41.28  ------
[...]
CPU State                    = 0x9e000004
[...]
CPU Diagnose Register 2      = 0x0301010800802004
CPU Status Register 0        = 0x2640c24000000000
CPU Status Register 1        = 0x8000200000000000
[...]
-------  Processor 3 HPMC Information - PDC Version: 41.28  ------
[...]
CPU State                    = 0x9e000004
[...]
CPU Diagnose Register 2      = 0x0301030800802004
CPU Status Register 0        = 0x3640c24000000000
CPU Status Register 1        = 0x80000000
0000000
[...]

All I bits (well, the least-significant PSW bit :) ) are indeed 0.


>Could be.
>Add mdelay(100) (or higher) after the lines of output you've added.
>That works if it's a functional problem that's not timing dependent.

Well, after a very long boot time the system finally crashes without any
panic reason??? (Shouldn't all interruptions be handled by handle_interruption()?)

Just in case, here is a short PIM analysis:
-------  Processor 1 HPMC Information - PDC Version: 41.28  ------ 

GR of CPU[1]
00-03  0000000000000000  000000001041b018  000000001014dbf0  0000000000000000
04-07  0000000000008000  000000008d113c00  0000000040200000  0000000000008000
08-11  0000000000000000  000000008d2cd008  0000000080000000  00000000103fa2c8
12-15  0000000040180000  000000008d9a6280  00000000105389c0  0000000000000000
16-19  000000001045cf88  00000000103b6338  000000008d147010  ffffffffffffffff
20-23  00000000000001ff  0000000040178000  000000008d9a6280  0000000000088000
24-27  0000000040180000  0000000000000006  0000000040180000  00000000105389c0
28-31  0000000000000000  000000008d7ccef0  000000008d7ccf40  0000000000008000

GR[02] == rp = 000000001014dbf0

Func: zap_page_range, Off: 0xe0, Addr: 0x1014dbf0

    1014dbf0:	08 0e 02 5b 	copy r14,dp
    1014dbf4:	03 c0 08 b4 	mfctl tr6,r20
    1014dbf8:	4a 93 00 b0 	ldw 58(r20),r19
    1014dbfc:	29 c5 20 00 	addil b000,r14,%r1

GR[22] == t1(32bits) == arg4(64bits) = 000000008d9a6280

GR[21] == t2(32bits) == arg5(64bits) = 0000000040178000

GR[20] == t3(32bits) == arg6(64bits) = 00000000000001ff

GR[19] == t4(32bits) == arg7(64bits) = ffffffffffffffff

GR[26] == arg0 = 0000000040180000

GR[25] == arg1 = 0000000000000006

GR[24] == arg2 = 0000000040180000

GR[23] == arg3 = 0000000000088000

GR[27] == dp = 00000000105389c0

Func: __gp, Off: 0x0, Addr: 0x105389c0


GR[28] == ret0 = 0000000000000000

GR[29] == ret1 or sl = 000000008d7ccef0

GR[30] == sp = 000000008d7ccf40

GR[31] == ble rp = 0000000000008000
	Not parsable address!

CR of CPU[1]
00-03  0000000000000000  0000000000000000  0000000000000000  0000000000000000
04-07  0000000000000000  0000000000000000  0000000000000000  0000000000000000
08-11  00000000000002b2  0000000000000000  00000000000000c0  0000000000000003
12-15  0000000000000000  0000000000000000  0000000000107000  ffe0000000000000
16-19  000003182e3e3f89  0000000000000000  000000001014deac  0000000036b52000
20-23  00000000103401f5  00000000f33ccdd8  000000ff080ef70f  8000000000000000
24-27  0000000000461000  000000007d147000  0000000000041020  000000ffff95c810
28-31  5555555555555555  5555555555555555  000000008d7cc000  00000000105a0000

CR[00] == rctr = 0000000000000000

CR[08] == (Protection ID) pidr1 = 00000000000002b2

CR[10] == ccr = 00000000000000c0

CR[11] == sar = 0000000000000003

CR[14] == iva = 0000000000107000

CR[15] == eiem = ffe0000000000000

CR[16] == itmr = 000003182e3e3f89

CR[17] == pcsq = 0000000000000000

CR[18] == pcoq = 000000001014deac

CR[19] == iir = 0000000036b52000

CR[20] == isr = 00000000103401f5

CR[21] == ior = 00000000f33ccdd8

CR[22] == ipsw = 000000ff080ef70f

CR[23] == eirw = 8000000000000000

CR[24] == tr0 (ptov) = 0000000000461000

CR[25] == tr1 (vtop) = 000000007d147000

CR[26] == tr2 = 0000000000041020

CR[27] == tr3 = 000000ffff95c810

CR[28] == tr4 = 5555555555555555

CR[29] == tr5 = 5555555555555555

CR[30] == tr6 = 000000008d7cc000

CR[31] == tr7 = 00000000105a0000

SR of CPU[1]
00-03  0000ac80          0000ac80          00000000          0000ac80
04-07  00000000          00000000          00000000          00000000
Need much more work !!!

SR[00] == ts0 = 0000ac80

SR[01] == ts1 = 0000ac80

SR[03] == cpp = 0000ac80
	Not parsable address!
...
IIA Offset (back entry)      = 0x000000001014dea0
...

e.g. IAOQ = 0x000000001014dea0

FPR of CPU[1]
00-03  0000000000000000  0000000000000000  0000000000000000  0000000000000000
04-07  000000008f760ec0  0000000000000002  000000001359d740  0000000000000420
08-11  0000000000000000  0000000000000802  00000000105389c0  000000001059a000
12-15  0000000013590000  0000000000000000  0000000010180574  00000000103dc6b8
16-19  00000000000009ee  000000008fa7e000  00000000105389c0  0000000013590000
20-23  00000000103b7b0c  fffffffffffffff4  000000000000021e  0000002f66666667
24-27  000007b100000000  0000999903590b70  0000000003590b78  000000001041b980
28-31  000000001041b980  00000000ff915e20  0000000010187b38  0000000000000004

Parse IAOQ = 0x000000001014dea0 for CPU[1]

Func: zap_page_range, Off: 0x390, Addr: 0x1014dea0

    1014dea0:	06 a0 52 00 	pdtlb r0(sr1,r21)
    1014dea4:	37 39 3f ff 	ldo -1(r25),r25
    1014dea8:	bf 33 3f e5 	cmpb,*<> r19,r25,1014dea0 <zap_page_range+0x390>
    1014deac:	36 b5 20 00 	ldo 1000(r21),r21
-------  Processor 3 HPMC Information - PDC Version: 41.28  ------ 

GR of CPU[3]
00-03  0000000000000000  0000000010429028  000000001010cdd0  0000000000000021
04-07  000000008d0c05b8  00000000105389c0  000000000000000f  0000000000000000
08-11  0000000000000000  0000000040026ee2  0000000040039141  0000000040026fb4
12-15  0000000040028380  00000000faf00950  00000000400342f4  0000000000000000
16-19  000000008d0c05b8  00000000faf00910  00000000faf00910  0000000000058706
20-23  000003182e080065  0000000000000000  0000000000000000  0000000000000000
24-27  0000000000000000  0000000000000000  00000000000003e8  00000000105389c0
28-31  0000000000086470  0000000000086470  000000008d0c0b40  0000000000000226

GR[02] == rp = 000000001010cdd0

Func: handle_interruption, Off: 0xb0, Addr: 0x1010cdd0

    1010cdd0:	08 05 02 5b 	copy r5,dp
    1010cdd4:	02 00 08 b4 	mfctl itmr,r20
    1010cdd8:	02 00 08 b3 	mfctl itmr,r19
    1010cddc:	0a 93 04 33 	sub r19,r20,r19
	...
    1010cde0:	be 7c bf e5 	cmpb,*>> ret0,r19,1010cdd8 <handle_interruption+0xb8>

	...
	...
    1010cdec:	ec 7f bf c5 	cmpib,*<> -1,r3,1010cdd4 <handle_interruption+0xb4>

	...

GR[22] == t1(32bits) == arg4(64bits) = 0000000000000000

GR[21] == t2(32bits) == arg5(64bits) = 0000000000000000

GR[20] == t3(32bits) == arg6(64bits) = 000003182e080065

GR[19] == t4(32bits) == arg7(64bits) = 0000000000058706

GR[26] == arg0 = 00000000000003e8

GR[25] == arg1 = 0000000000000000

GR[24] == arg2 = 0000000000000000

GR[23] == arg3 = 0000000000000000

GR[27] == dp = 00000000105389c0

Func: __gp, Off: 0x0, Addr: 0x105389c0


GR[28] == ret0 = 0000000000086470

GR[29] == ret1 or sl = 0000000000086470

GR[30] == sp = 000000008d0c0b40

GR[31] == ble rp = 0000000000000226
	Not parsable address!

CR of CPU[3]
00-03  0000000000000000  0000000000000000  0000000000000000  0000000000000000
04-07  0000000000000000  0000000000000000  0000000000000000  0000000000000000
08-11  00000000000002b8  0000000000000000  00000000000000c0  000000000000003f
12-15  0000000000000000  0000000000000000  0000000000107000  ffe0000000000000
16-19  000003182e158ca8  0000000000000000  000000001010cde0  00000000be7cbfe5
20-23  00000000103401f4  00000000300c0b50  000000ff0804ff0e  8000000000000000
24-27  0000000000461000  000000007d0c4000  0000000000041020  000000ffff95c810
28-31  000000ffff95c810  5555555555555555  000000008d0c0000  0000000000008020

CR[00] == rctr = 0000000000000000

CR[08] == (Protection ID) pidr1 = 00000000000002b8

CR[10] == ccr = 00000000000000c0

CR[11] == sar = 000000000000003f

CR[14] == iva = 0000000000107000

CR[15] == eiem = ffe0000000000000

CR[16] == itmr = 000003182e158ca8

CR[17] == pcsq = 0000000000000000

CR[18] == pcoq = 000000001010cde0

CR[19] == iir = 00000000be7cbfe5

CR[20] == isr = 00000000103401f4

CR[21] == ior = 00000000300c0b50

CR[22] == ipsw = 000000ff0804ff0e

CR[23] == eirw = 8000000000000000

CR[24] == tr0 (ptov) = 0000000000461000

CR[25] == tr1 (vtop) = 000000007d0c4000

CR[26] == tr2 = 0000000000041020

CR[27] == tr3 = 000000ffff95c810

CR[28] == tr4 = 000000ffff95c810

CR[29] == tr5 = 5555555555555555

CR[30] == tr6 = 000000008d0c0000

CR[31] == tr7 = 0000000000008020

SR of CPU[3]
00-03  0000ae00          00006e00          00000000          0000ae00
04-07  00000000          00000000          00000000          00000000
Need much more work !!!

SR[00] == ts0 = 0000ae00

SR[01] == ts1 = 00006e00

SR[03] == cpp = 0000ae00
	Not parsable address!
...
IIA Offset (back entry)      = 0x000000001010cde4
...

e.g. IAOQ = 0x000000001010cde4

FPR of CPU[3]
00-03  0000000000000000  0000000000000000  0000000000000000  0000000000000000
04-07  000000008f760ec0  0000000000000002  000000001359d740  0000000000000420
08-11  0000000000000000  0000000000000802  00000000105389c0  000000001059a000
12-15  0000000013590000  0000000000000000  0000000010180574  00000000103dc6b8
16-19  00000000000009ee  000000008fa7e000  00000000105389c0  0000000013590000
20-23  00000000103b7b0c  fffffffffffffff4  0000000000000000  0000000000000000
24-27  0000999900000000  0000999903590b70  0000000003590b78  000000001041b980
28-31  000000001041b980  00000000ff915e20  0000000010187b38  0000000000000000

Parse IAOQ = 0x000000001010cde4 for CPU[3]

Func: handle_interruption, Off: 0xc4, Addr: 0x1010cde4

    1010cde0:	be 7c bf e5 	cmpb,*>> ret0,r19,1010cdd8 <handle_interruption+0xb8>
    1010cde4:	08 00 02 40 	nop
    1010cde8:	34 63 3f ff 	ldo -1(r3),r3
    1010cdec:	ec 7f bf c5 	cmpib,*<> -1,r3,1010cdd4 <handle_interruption+0xb4>

Any idea?
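
For reference, the I-bit check suggested earlier can also be applied to the CR[22] (ipsw) values in this dump. A minimal Python sketch, assuming the low-order PSW masks used by the Linux parisc header asm/psw.h (I = 0x1, D = 0x2, P = 0x4, Q = 0x8); psw_flags is a hypothetical helper name:

```python
# Low-order PSW bit masks, assumed to match the Linux parisc asm/psw.h values.
PSW_BITS = {"Q": 0x8, "P": 0x4, "D": 0x2, "I": 0x1}

# CR[22] (ipsw) values copied from the two HPMC dumps above, keyed by CPU.
IPSW_AT_HPMC = {
    1: 0x000000FF080EF70F,
    3: 0x000000FF0804FF0E,
}


def psw_flags(ipsw: int) -> dict:
    """Return which of the Q, P, D and I bits are set in a saved PSW value."""
    return {name: bool(ipsw & mask) for name, mask in PSW_BITS.items()}


if __name__ == "__main__":
    for cpu, ipsw in sorted(IPSW_AT_HPMC.items()):
        flags = psw_flags(ipsw)
        set_bits = "".join(name for name, on in flags.items() if on) or "-"
        print(f"CPU[{cpu}] ipsw={ipsw:#018x} bits set: {set_bits}")
```

With these masks the I bit is set in CPU[1]'s ipsw and clear in CPU[3]'s. Note that this is the PSW saved at the HPMC, which is not necessarily the value the TOC info would report.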

>Otherwise setup kernel crash dump and use tools from bruno/phi to view
>contents of the kernel message buffer.

Well, that seems to be the last resort (I don't remember whether it also
works on an SMP kernel), but I will need to discuss with them how a dump
could be analysed, if I manage to get one.

Thanks again for your attention,
    Joel





* Re: [parisc-linux] N Class SMP pb ?  (follow up)
  2003-09-30 16:31 Joel Soete
@ 2003-09-30 18:50 ` Grant Grundler
  0 siblings, 0 replies; 11+ messages in thread
From: Grant Grundler @ 2003-09-30 18:50 UTC (permalink / raw)
  To: Joel Soete; +Cc: parisc-linux

On Tue, Sep 30, 2003 at 06:31:17PM +0200, Joel Soete wrote:
> Hi Grant,
> 
> Here is the very last test I did yesterday with the additional mdelay(100):
> 
> >TOC the machine, "ser pim" and look at PSW in TOC Info for each CPU.
> >bit 0 is the I-Bit IIRC.
> 
> In summary:
> -------  Processor 1 HPMC Information - PDC Version: 41.28  ------

Did you TOC the machine or did it HPMC?
I was under the impression the SW had hung and one needed to TOC
to regain control. TOC info is separate from HPMC info.

If it's in fact HPMC, then look at IAOQ/GR02 for both CPUs
and see which functions they were executing in when HPMC occurred.
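
Mapping IAOQ or GR02 to a function means finding the nearest symbol at or below the address. A minimal Python sketch of that lookup follows. The three start addresses are not from a System.map; they are derived from the "Addr minus Off" pairs in the PIM analysis quoted in this thread (for example 0x1014dea0 - 0x390), so the table is purely illustrative, and resolve is a hypothetical helper name.

```python
import bisect

# (start address, symbol) pairs, derived from the Addr/Off values in the
# PIM analysis; a real lookup would load the full System.map instead.
SYMBOLS = sorted([
    (0x1010CD20, "handle_interruption"),
    (0x1014DB10, "zap_page_range"),
    (0x105389C0, "__gp"),
])
_STARTS = [start for start, _ in SYMBOLS]


def resolve(addr: int):
    """Return (symbol, offset) for the nearest symbol at or below addr.

    Returns None when addr lies below the first known symbol.
    """
    i = bisect.bisect_right(_STARTS, addr) - 1
    if i < 0:
        return None
    start, name = SYMBOLS[i]
    return name, addr - start


if __name__ == "__main__":
    for label, addr in [("CPU[1] IAOQ", 0x1014DEA0), ("CPU[3] IAOQ", 0x1010CDE4)]:
        name, off = resolve(addr)
        print(f"{label}: {name}+{off:#x}")
```

With only three symbols, any address between two entries is attributed to the lower one, so the result is only trustworthy when the table holds every kernel symbol.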

grant


* Re: [parisc-linux] N Class SMP pb ?  (follow up)
@ 2003-10-01  6:48 Joel Soete
  2003-10-01 17:20 ` Joel Soete
  0 siblings, 1 reply; 11+ messages in thread
From: Joel Soete @ 2003-10-01  6:48 UTC (permalink / raw)
  To: Grant Grundler; +Cc: parisc-linux

>>
>> In summary:
>> -------  Processor 1 HPMC Information - PDC Version: 41.28  ------
>
>Did you TOC the machine or did it HPMC?
>I was under the impression the SW had hung and one needed to TOC
>to regain control. TOC info is separate from HPMC info.

Exactly, but the TOC info contains only zeros, so I supposed that the system actually
took an HPMC. It does not seem to go through handle_interruption(), though: at its
beginning I put a printk() which was supposed to print the 'code' value.

To be more precise:

[...]
    struct siginfo si;

    printk(KERN_ERR "%s(%d, ...).\n", __FUNCTION__, code);
    mdelay(100);
[...]

which lets me see a lot of 6, 15 and 26 codes, but never 1 (HPMC).

>
>If it's in fact HPMC, then look at IAOQ/GR02 for both CPUs
>and see which functions they were executing in when HPMC occurred.

which were for cpu[1]:
GR[02] == rp = 000000001014dbf0

Func: zap_page_range, Off: 0xe0, Addr: 0x1014dbf0

    1014dbf0:  08 0e 02 5b   copy r14,dp
    1014dbf4:  03 c0 08 b4   mfctl tr6,r20
    1014dbf8:  4a 93 00 b0   ldw 58(r20),r19
    1014dbfc:  29 c5 20 00   addil b000,r14,%r1

[...]
Parse IAOQ = 0x000000001014dea0 for CPU[1]

Func: zap_page_range, Off: 0x390, Addr: 0x1014dea0

    1014dea0:  06 a0 52 00   pdtlb r0(sr1,r21)
    1014dea4:  37 39 3f ff   ldo -1(r25),r25
    1014dea8:  bf 33 3f e5   cmpb,*<> r19,r25,1014dea0 <zap_page_range+0x390>
    1014deac:  36 b5 20 00   ldo 1000(r21),r21

And for cpu[3]:
GR[02] == rp = 000000001010cdd0

Func: handle_interruption, Off: 0xb0, Addr: 0x1010cdd0

    1010cdd0:  08 05 02 5b   copy r5,dp
    1010cdd4:  02 00 08 b4   mfctl itmr,r20
    1010cdd8:  02 00 08 b3   mfctl itmr,r19
    1010cddc:  0a 93 04 33   sub r19,r20,r19
  ...

Parse IAOQ = 0x000000001010cde4 for CPU[3]

Func: handle_interruption, Off: 0xc4, Addr: 0x1010cde4

    1010cde0:  be 7c bf e5   cmpb,*>> ret0,r19,1010cdd8 <handle_interruption+0xb8>
    1010cde4:  08 00 02 40   nop
    1010cde8:  34 63 3f ff   ldo -1(r3),r3
    1010cdec:  ec 7f bf c5   cmpib,*<> -1,r3,1010cdd4 <handle_interruption+0xb4>


Am I wrong to presume that the nop insn on cpu[3] is harmless, unlike
'pdtlb r0(sr1,r21)' on cpu[1]? But I do not see any code 10 printed by printk(),
although that is the only exception it could raise: Privileged operation trap.

Thanks again,
    Joel





* Re: [parisc-linux] N Class SMP pb ?  (follow up)
  2003-10-01  6:48 Joel Soete
@ 2003-10-01 17:20 ` Joel Soete
  0 siblings, 0 replies; 11+ messages in thread
From: Joel Soete @ 2003-10-01 17:20 UTC (permalink / raw)
  To: Grant Grundler; +Cc: parisc-linux

Hi Grant,

I also noticed some additional info:

a) In SL (GSP console) I grabbed several messages such as:

Log Entry #   0 : 
SYSTEM NAME: ap8002
DATE: 10/01/2003 TIME: 16:32:35
ALERT LEVEL: 2 = Non-Urgent operator attention required

SOURCE: 8 = I/O 
SOURCE DETAIL: 2 = system bus adapter   SOURCE ID: 1
PROBLEM DETAIL: 0 = no problem detail

CALLER ACTIVITY: 6 = machine check   STATUS: 3
CALLER SUBACTIVITY: 33 = implementation dependent
REPORTING ENTITY TYPE: 0 = system firmware   REPORTING ENTITY ID: 01

0x7000102082016333 00000000 00B92200 type 14 = Problem Detail
0x5800182082016333 00006709 01102023 type 11 = Timestamp 10/01/2003 16:32:35
Type CR for next entry, Q CR to quit.

This seems to indicate an I/O problem (but I don't know how relevant these are,
because of the 'implementation dependent' subactivity).
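
As an aside, the two data words of the "type 11 = Timestamp" line seem to hold the date one byte per field: 0x67 = 103 (years since 1900), 0x09 (month, apparently zero-based), then 0x01, 0x10, 0x20, 0x23 for day, hour, minute and second. A minimal Python sketch of that decoding; the layout is inferred from the single entry shown above, not from GSP documentation, and decode_type11 is a hypothetical helper name:

```python
from datetime import datetime


def decode_type11(word1: int, word2: int) -> datetime:
    """Decode the two 32-bit data words of a 'type 11 = Timestamp' log line.

    Assumed layout (inferred from one sample only):
      word1: 0x0000YYMM  with YY = years since 1900, MM = zero-based month
      word2: 0xDDhhmmss  with one byte each for day, hour, minute, second
    """
    year = 1900 + ((word1 >> 8) & 0xFF)
    month = (word1 & 0xFF) + 1
    day = (word2 >> 24) & 0xFF
    hour = (word2 >> 16) & 0xFF
    minute = (word2 >> 8) & 0xFF
    second = word2 & 0xFF
    return datetime(year, month, day, hour, minute, second)


if __name__ == "__main__":
    # Data words copied from the log entry above: "00006709 01102023".
    print(decode_type11(0x00006709, 0x01102023))
```

The decoded value agrees with the "10/01/2003 16:32:35" printed on the same line, which supports the assumed layout for this entry at least.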

b) At the end of the PIM info I also noticed:
[...]
------------  I/O Module Error Log Information  ------------

Summary of IO subsystem log entries
-----------------------------------
                        Phys Loc             Vendor  Device   Severity
Description             (hex)                 Id      Id      CORR UNC FE CW
-----------             -----                ------  ------   ----------------
System Bus Adapter SB  0x000000ffffffff82   0x103c  0x1050              X
System Bus Adapter SB  0x000000ffffffff82   0x103c  0x1050              X


Detail display of IO subsystem log entries
------------------------------------------

System Bus Adapter -- System Bus Interface
------------------------------------------

Timestamp =    Wed Oct  1 16:32:31 GMT 2003    (20:03:10:01:16:32:31)

  OV  RQ  RS      ESTAT      A  C  D  corr  unc  fe  cw  pf
  --  --  --      -----      -  -  -  ----  ---  --  --  --
  X       X     ERR_ERROR       X                X       

IO Requestor Address    = 0x0000000000000000
IO Target Address       = 0x0000000000000000
IO Responder Address    = 0xfffffffffed00000
IO Physical Location    = 0x000000ffffffff82
IO Hardware Path        = 0x00ffffffffffff00

Module Error Register   = 0x0000000007ff0034

System Bus Adapter -- System Bus Interface
------------------------------------------

Timestamp =    Wed Oct  1 16:32:31 GMT 2003    (20:03:10:01:16:32:31)

  OV  RQ  RS      ESTAT      A  C  D  corr  unc  fe  cw  pf
  --  --  --      -----      -  -  -  ----  ---  --  --  --
  X       X     ERR_ERROR       X                X       

IO Requestor Address    = 0x0000000000000000
IO Target Address       = 0x0000000000000000
IO Responder Address    = 0xfffffffffed40000
IO Physical Location    = 0x000000ffffffff82
IO Hardware Path        = 0x00ffffffffffff01

Module Error Register   = 0x0000000007ff0034
[...]

"IO Responder Address    = 0xfffffffffed40000" matches this boot log entry:
Found devices:
[...]
11. IKE I/O Bus Converter Merced Port (7) at 0xfffffffffed40000 [1], versions 0x803, 0x0, 0xc

and "IO Responder Address    = 0xfffffffffed00000" matches this one:
2. IKE I/O Bus Converter Merced Port (7) at 0xfffffffffed00000 [0], versions 0x803, 0x0, 0xc

Could this be the source of the crash?
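
The matching done by hand above (IO Responder Address against the "Found devices" list) can be sketched in Python as follows. The two boot log lines are the ones quoted in this message; the regular expression is an assumption about their format, and devices_by_address is a hypothetical helper name.

```python
import re

# The two "Found devices" lines quoted above, unwrapped onto one line each.
BOOT_LOG = """\
2. IKE I/O Bus Converter Merced Port (7) at 0xfffffffffed00000 [0], versions 0x803, 0x0, 0xc
11. IKE I/O Bus Converter Merced Port (7) at 0xfffffffffed40000 [1], versions 0x803, 0x0, 0xc
"""

# index. name at 0xADDRESS [path], ...
DEVICE_RE = re.compile(r"^(\d+)\.\s+(.+?)\s+at\s+(0x[0-9a-fA-F]+)\s+\[([^\]]+)\]")


def devices_by_address(boot_log: str) -> dict:
    """Map each device address to (index, name, hardware path)."""
    table = {}
    for line in boot_log.splitlines():
        m = DEVICE_RE.match(line)
        if m:
            index, name, addr, path = m.groups()
            table[int(addr, 16)] = (int(index), name, path)
    return table


if __name__ == "__main__":
    devices = devices_by_address(BOOT_LOG)
    # IO Responder Addresses from the two System Bus Interface log entries.
    for responder in (0xFFFFFFFFFED00000, 0xFFFFFFFFFED40000):
        print(f"{responder:#x} -> {devices.get(responder, 'no matching device')}")
```

Feeding it the complete boot log instead of these two lines would show whether any other device shares an address with a logged responder.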

Thanks in advance,
    Joel



