[PROBLEM] Soft lockup on Linux 2.6.27, 2 patches, Cell/PPC64

linuxppc-dev.lists.ozlabs.org archive mirror

* [PROBLEM] Soft lockup on Linux 2.6.27, 2 patches, Cell/PPC64
@ 2008-10-12  4:32 Aaron Tokhy
  2008-10-13  7:58 ` Geert Uytterhoeven
  0 siblings, 1 reply; 12+ messages in thread
From: Aaron Tokhy @ 2008-10-12  4:32 UTC (permalink / raw)
  To: linuxppc-dev

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hello,

I recently built 2.6.27 with these patches on my PS3.

http://www.kernel.org/pub/linux/kernel/people/geoff/cell/ps3-linux-patches/ps3-wip/ps3vram-driver.patch
http://www.kernel.org/pub/linux/kernel/people/geoff/cell/ps3-linux-patches/ps3-wip/ps3vram-proc-fs.patch

These patches enable the 'ps3vram' module, which creates an MTD node,
/dev/mtdblock0.  In addition to the 256 MB of XDR RAM used by the
system, I can use 245 MB of the video RAM as fast swap (getting a
somewhat valuable 60 MB/s read/write speed on a random-access device).
I was using mtdblock0 as swap space when the soft lockup occurred,
with the `top` program left open.

Now I am not sure whether the patches are the issue.  None of the
functions in the trace below come from the patches, but this is my first
time debugging a kernel bug.  Some of the functions have the word 'page'
in their names, so the lockup might be due to problems while paging to
the mtdblock0 device; but in that case, surely calls to functions from
the patches would appear in the trace.  How would I start debugging this?

The trace is also available in pastebin: http://pastebin.com/m2ea72e52

BUG: soft lockup - CPU#0 stuck for 61s! [top:22788]
Modules linked in: evdev hci_usb usbhid bluetooth usb_storage snd_ps3
ehci_hcd snd_pcm ohci_hcd snd_page_alloc snd_timer usbcore snd sg
ps3_lpm soundcore
irq event stamp: 5018780
hardirqs last  enabled at (5018779): [<c000000000007c1c>] restore+0x1c/0xe4
hardirqs last disabled at (5018780): [<c000000000003600>] decrementer_common+0x100/0x180
softirqs last  enabled at (5018778): [<c000000000020928>] .call_do_softirq+0x14/0x24
softirqs last disabled at (5018773): [<c000000000020928>] .call_do_softirq+0x14/0x24
NIP: c000000000084110 LR: c000000000084468 CTR: c0000000003181d0
REGS: c000000006f37280 TRAP: 0901   Not tainted  (2.6.27)
MSR: 8000000000008032 <EE,IR,DR>  CR: 42004424  XER: 00000000
TASK = c000000007980000[22788] 'top' THREAD: c000000006f34000 CPU: 0
GPR00: 0000000000000001 c000000006f37500 c0000000005543d0 c000000006f37570
GPR04: 0000000000000000 c00000000008427c 0000000000000001 0000000000000000
GPR08: 0000000000000830 0000000000000001 0000000000000000 c000000000b96874
GPR12: 8000000000008032 c000000000586300
NIP [c000000000084110] .csd_flag_wait+0x14/0x1c
LR [c000000000084468] .smp_call_function_single+0x13c/0x164
Call Trace:
[c000000006f37500] [c000000000084468] .smp_call_function_single+0x13c/0x164 (unreliable)
[c000000006f375c0] [c000000000084578] .smp_call_function_mask+0xe8/0x244
[c000000006f37720] [c00000000005809c] .on_each_cpu+0x24/0x9c
[c000000006f377c0] [c00000000009bde4] .drain_all_pages+0x24/0x3c
[c000000006f37840] [c00000000009c0c8] .__alloc_pages_internal+0x2cc/0x464
[c000000006f37950] [c0000000000c3d54] .__slab_alloc+0x1f8/0x6cc
[c000000006f37a10] [c0000000000c466c] .kmem_cache_alloc+0x74/0x108
[c000000006f37ab0] [c0000000000cd200] .get_empty_filp+0x98/0x1a0
[c000000006f37b40] [c0000000000d9fa0] .__path_lookup_intent_open+0x40/0xd0
[c000000006f37bf0] [c0000000000da294] .do_filp_open+0xc0/0x7f0
[c000000006f37d80] [c0000000000c9818] .do_sys_open+0x88/0x154
[c000000006f37e30] [c0000000000076dc] syscall_exit+0x0/0x40
Instruction dump:
2f880000 3860fff0 409e000c f88b0008 38600000 ebc1fff0 4e800020 7c0004ac
80030020 780907e1 4d820020 7c210b78 <7c421378> 4bffffe8 4e800020 7c0802a6
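
As a first step in reading a trace like the one above, it can help to reduce it to a list of function names and offsets. The following is only a rough sketch in Python, not an existing tool: `FRAME_RE` and `parse_frames` are hypothetical names, and the pattern assumes the ppc64 frame layout seen in this report (stack pointer, return address, then `.symbol+0xOFFSET/0xSIZE`). It tolerates frames that the mail client wrapped onto two lines.

```python
import re

# Matches one frame of a ppc64 "Call Trace:" section:
#   [stack pointer] [return address] .symbol+0xOFFSET/0xSIZE (optional note)
# \s+ also matches a newline, so mail-wrapped frames are still recognised.
FRAME_RE = re.compile(
    r"\[(?P<sp>[0-9a-f]{16})\]\s+\[(?P<addr>[0-9a-f]{16})\]\s+"
    r"\.?(?P<symbol>[A-Za-z_]\w*)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
)


def parse_frames(trace_text):
    """Return (symbol, offset, size) for every frame found in trace_text."""
    frames = []
    for match in FRAME_RE.finditer(trace_text):
        frames.append(
            (match["symbol"], int(match["offset"], 16), int(match["size"], 16))
        )
    return frames


# First four frames of the trace in this report.
SAMPLE = """\
[c000000006f37500] [c000000000084468] .smp_call_function_single+0x13c/0x164 (unreliable)
[c000000006f375c0] [c000000000084578] .smp_call_function_mask+0xe8/0x244
[c000000006f37720] [c00000000005809c] .on_each_cpu+0x24/0x9c
[c000000006f377c0] [c00000000009bde4] .drain_all_pages+0x24/0x3c
"""

frames = parse_frames(SAMPLE)
for symbol, offset, size in frames:
    print(f"{symbol:32s} +{offset:#x} of {size:#x}")
```

Reading the resulting list from the bottom up gives the call order; the topmost entry is where the CPU was spinning when the lockup was reported.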

- --
- -Thanks
Aaron Tokhy
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkjxfeYACgkQO3nEAs/Ru1mjtwCfW25E51GIAY5KOcpJOp2TeUrz
hhQAni7m4UM7ojCPnjEsmiAEVxpLoljh
=AVql
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PROBLEM] Soft lockup on Linux 2.6.27, 2 patches, Cell/PPC64
  2008-10-12  4:32 [PROBLEM] Soft lockup on Linux 2.6.27, 2 patches, Cell/PPC64 Aaron Tokhy
@ 2008-10-13  7:58 ` Geert Uytterhoeven
  2008-10-14  9:32   ` Geert Uytterhoeven
  0 siblings, 1 reply; 12+ messages in thread
From: Geert Uytterhoeven @ 2008-10-13  7:58 UTC (permalink / raw)
  To: Aaron Tokhy; +Cc: Linux/PPC Development

On Sun, 12 Oct 2008, Aaron Tokhy wrote:
> I recently built 2.6.27 with these patches on my PS3.
> 
> http://www.kernel.org/pub/linux/kernel/people/geoff/cell/ps3-linux-patches/ps3-wip/ps3vram-driver.patch
> http://www.kernel.org/pub/linux/kernel/people/geoff/cell/ps3-linux-patches/ps3-wip/ps3vram-proc-fs.patch
> 
> These patches enable the 'ps3vram' module, which creates an MTD node

> Now I am not sure if the patch is the issue.  None of the functions in

No, we've seen similar things happen without ps3vram, too.

> BUG: soft lockup - CPU#0 stuck for 61s! [top:22788]
> Modules linked in: evdev hci_usb usbhid bluetooth usb_storage snd_ps3
> ehci_hcd snd_pcm ohci_hcd snd_page_alloc snd_timer usbcore snd sg
> ps3_lpm soundcore
> irq event stamp: 5018780
> hardirqs last  enabled at (5018779): [<c000000000007c1c>] restore+0x1c/0xe4
> hardirqs last disabled at (5018780): [<c000000000003600>] decrementer_common+0x100/0x180
> softirqs last  enabled at (5018778): [<c000000000020928>] .call_do_softirq+0x14/0x24
> softirqs last disabled at (5018773): [<c000000000020928>] .call_do_softirq+0x14/0x24
> NIP: c000000000084110 LR: c000000000084468 CTR: c0000000003181d0
> REGS: c000000006f37280 TRAP: 0901   Not tainted  (2.6.27)
> MSR: 8000000000008032 <EE,IR,DR>  CR: 42004424  XER: 00000000
> TASK = c000000007980000[22788] 'top' THREAD: c000000006f34000 CPU: 0
> GPR00: 0000000000000001 c000000006f37500 c0000000005543d0 c000000006f37570
> GPR04: 0000000000000000 c00000000008427c 0000000000000001 0000000000000000
> GPR08: 0000000000000830 0000000000000001 0000000000000000 c000000000b96874
> GPR12: 8000000000008032 c000000000586300
> NIP [c000000000084110] .csd_flag_wait+0x14/0x1c
> LR [c000000000084468] .smp_call_function_single+0x13c/0x164
> Call Trace:
> [c000000006f37500] [c000000000084468] .smp_call_function_single+0x13c/0x164 (unreliable)

smp_call_function_single() causes an IPI to be sent to the other CPU thread.
However, the IPI never seems to arrive at the other CPU thread, causing the
soft lockup message to be printed on the console.

If this happens when the BKL is held before sending the IPI, the system will
deadlock when the other CPU thread tries to acquire the BKL. In that
unfortunate case, you won't see any message on the console of a retail PS3,
though.
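
The waiting pattern described above can be illustrated with a toy model. This is a sketch in Python under stated assumptions, not the kernel's implementation: `ToyCpu`, `call_function_single`, the `deliver` switch and the spin bound are all hypothetical, and only the `CSD_FLAG_WAIT` name is borrowed from the symbol in the trace. The sender queues a request, then spins until the target clears the wait flag; if the request is never delivered, the flag is never cleared.

```python
from collections import deque

CSD_FLAG_WAIT = 0x01  # name taken from the trace; the value is illustrative


class ToyCpu:
    """Hypothetical stand-in for one CPU thread with an IPI mailbox."""

    def __init__(self):
        self.mailbox = deque()

    def handle_pending_ipis(self):
        while self.mailbox:
            csd = self.mailbox.popleft()
            csd["func"]()
            csd["flags"] &= ~CSD_FLAG_WAIT  # completion releases the sender


def call_function_single(target, func, deliver=True, max_spins=1000):
    """Queue func for target and spin until it has run.

    Returns the number of spins used, or None if the bound was hit, which
    is the toy equivalent of the sender being reported as stuck.
    """
    csd = {"func": func, "flags": CSD_FLAG_WAIT}
    if deliver:  # deliver=False models the IPI that never arrives
        target.mailbox.append(csd)
    for spins in range(max_spins):
        if not csd["flags"] & CSD_FLAG_WAIT:
            return spins
        target.handle_pending_ipis()  # the other thread gets a chance to run
    return None


other = ToyCpu()
ran = []
print(call_function_single(other, lambda: ran.append("drained")))
print(call_function_single(other, lambda: ran.append("never"), deliver=False))
```

In the delivered case the sender leaves the loop after one spin; in the lost case it exhausts the bound. The real wait loop has no such bound, which is why the watchdog ends up reporting the sender as stuck.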

So far we do not know exactly why the IPI does not arrive, so
suggestions are welcome.

With kind regards,

Geert Uytterhoeven
Software Architect

Sony Techsoft Centre Europe
The Corporate Village · Da Vincilaan 7-D1 · B-1935 Zaventem · Belgium

Phone:    +32 (0)2 700 8453
Fax:      +32 (0)2 700 8622
E-mail:   Geert.Uytterhoeven@sonycom.com
Internet: http://www.sony-europe.com/

A division of Sony Europe (Belgium) N.V.
VAT BE 0413.825.160 · RPR Brussels
Fortis · BIC GEBABEBB · IBAN BE41293037680010

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PROBLEM] Soft lockup on Linux 2.6.27, 2 patches, Cell/PPC64
  2008-10-13  7:58 ` Geert Uytterhoeven
@ 2008-10-14  9:32   ` Geert Uytterhoeven
  2008-10-15  4:49     ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 12+ messages in thread
From: Geert Uytterhoeven @ 2008-10-14  9:32 UTC (permalink / raw)
  To: Aaron Tokhy; +Cc: Linux/PPC Development

On Mon, 13 Oct 2008, Geert Uytterhoeven wrote:
> On Sun, 12 Oct 2008, Aaron Tokhy wrote:
> > I recently built 2.6.27 with these patches on my PS3.
> > 
> > http://www.kernel.org/pub/linux/kernel/people/geoff/cell/ps3-linux-patches/ps3-wip/ps3vram-driver.patch
> > http://www.kernel.org/pub/linux/kernel/people/geoff/cell/ps3-linux-patches/ps3-wip/ps3vram-proc-fs.patch
> > 
> > These patches enable the 'ps3vram' module, which creates an MTD node
> 
> > Now I am not sure if the patch is the issue.  None of the functions in
> 
> No, we've seen similar things happen without ps3vram, too.
> 
> > BUG: soft lockup - CPU#0 stuck for 61s! [top:22788]
> > Modules linked in: evdev hci_usb usbhid bluetooth usb_storage snd_ps3
> > ehci_hcd snd_pcm ohci_hcd snd_page_alloc snd_timer usbcore snd sg
> > ps3_lpm soundcore
> > irq event stamp: 5018780
> > hardirqs last  enabled at (5018779): [<c000000000007c1c>] restore+0x1c/0xe4
> > hardirqs last disabled at (5018780): [<c000000000003600>] decrementer_common+0x100/0x180
> > softirqs last  enabled at (5018778): [<c000000000020928>] .call_do_softirq+0x14/0x24
> > softirqs last disabled at (5018773): [<c000000000020928>] .call_do_softirq+0x14/0x24
> > NIP: c000000000084110 LR: c000000000084468 CTR: c0000000003181d0
> > REGS: c000000006f37280 TRAP: 0901   Not tainted  (2.6.27)
> > MSR: 8000000000008032 <EE,IR,DR>  CR: 42004424  XER: 00000000
> > TASK = c000000007980000[22788] 'top' THREAD: c000000006f34000 CPU: 0
> > GPR00: 0000000000000001 c000000006f37500 c0000000005543d0 c000000006f37570
> > GPR04: 0000000000000000 c00000000008427c 0000000000000001 0000000000000000
> > GPR08: 0000000000000830 0000000000000001 0000000000000000 c000000000b96874
> > GPR12: 8000000000008032 c000000000586300
> > NIP [c000000000084110] .csd_flag_wait+0x14/0x1c
> > LR [c000000000084468] .smp_call_function_single+0x13c/0x164
> > Call Trace:
> > [c000000006f37500] [c000000000084468] .smp_call_function_single+0x13c/0x164 (unreliable)
> 
> smp_call_function_single() causes an IPI to be sent to the other CPU thread.
> However, the IPI never seems to arrive at the other CPU thread, causing the
> soft lockup message to be printed on the console.
> 
> If this happens when the BKL is held before sending the IPI, the system will
> deadlock when the other CPU thread tries to acquire the BKL. In that
> unfortunate case, you won't see any message on the console of a retail PS3,
> though.
> 
> So far we do not know exactly why the IPI does not arrive, so
> suggestions are welcome.

I've enabled the recently introduced CONFIG_RCU_CPU_STALL_DETECTOR option and
got:

| <3>RCU detected CPU 1 stall (t=4295279718/750 jiffies)
| Call Trace:
| [c000000013e5a940] [c00000000000f314] .show_stack+0x70/0x184 (unreliable)
| [c000000013e5a9f0] [c00000000009029c] .__rcu_pending+0x9c/0x2b4
| [c000000013e5aa90] [c0000000000904ec] .rcu_pending+0x38/0x84
| [c000000013e5ab10] [c00000000005d9f0] .update_process_times+0x40/0x8c
| [c000000013e5aba0] [c000000000076d4c] .tick_sched_timer+0x154/0x1bc
| [c000000013e5ac60] [c00000000006e630] .__run_hrtimer+0x8c/0x128
| [c000000013e5ad00] [c00000000006f60c] .hrtimer_interrupt+0x10c/0x1c8
| [c000000013e5add0] [c00000000001d2d0] .timer_interrupt+0xcc/0x124
| [c000000013e5ae80] [c000000000003614] decrementer_common+0x114/0x180
| --- Exception: 901 at .csd_flag_wait+0x4/0x1c
|     LR = .smp_call_function_single+0x13c/0x164
| [c000000013e5b230] [c000000000082774] .smp_call_function_mask+0xe4/0x240
| [c000000013e5b390] [c0000000000566dc] .on_each_cpu+0x24/0x94
| [c000000013e5b430] [c0000000000998bc] .drain_all_pages+0x24/0x3c
| [c000000013e5b4b0] [c000000000099ba4] .__alloc_pages_internal+0x2d0/0x464
| [c000000013e5b5b0] [c0000000000bb158] .cache_alloc_refill+0x340/0x678
| [c000000013e5b680] [c0000000000bb574] .__kmalloc+0xe4/0x170
| [c000000013e5b720] [c000000000297e18] .__alloc_skb+0x7c/0x154
| [c000000013e5b7c0] [c0000000002923a8] .sock_alloc_send_skb+0xc4/0x2a4
| [c000000013e5b8a0] [c00000000030a464] .unix_stream_sendmsg+0x178/0x384
| [c000000013e5b990] [c00000000028e234] .sock_aio_write+0xec/0x114
| [c000000013e5baa0] [c0000000000bf2dc] .do_sync_readv_writev+0xc8/0x130
| [c000000013e5bc30] [c0000000000fefa0] .compat_do_readv_writev+0x1e0/0x33c
| [c000000013e5bd90] [c0000000000ff184] .compat_sys_writev+0x88/0xbc
| [c000000013e5be30] [c0000000000074dc] syscall_exit+0x0/0x40

which points again to smp_call_function_single...
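
To make the overlap between the two reports explicit, the function names of both traces can be compared. The snippet below is a small illustrative sketch in Python; `SYMBOL_RE` and `symbols` are hypothetical helper names, and the two strings are excerpts of the traces in this thread with the address columns dropped.

```python
import re

# Captures the function name from ".symbol+0xOFFSET/0xSIZE".
SYMBOL_RE = re.compile(r"\.([A-Za-z_]\w*)\+0x[0-9a-f]+/0x[0-9a-f]+")


def symbols(trace_text):
    """Return the function names of a call trace, in order of appearance."""
    return SYMBOL_RE.findall(trace_text)


# Excerpt of the soft lockup trace from the original report.
SOFT_LOCKUP = """
.smp_call_function_single+0x13c/0x164 (unreliable)
.smp_call_function_mask+0xe8/0x244
.on_each_cpu+0x24/0x9c
.drain_all_pages+0x24/0x3c
.__alloc_pages_internal+0x2cc/0x464
.__slab_alloc+0x1f8/0x6cc
"""

# Excerpt of the RCU stall trace, starting at the interrupted context.
RCU_STALL = """
LR = .smp_call_function_single+0x13c/0x164
.smp_call_function_mask+0xe4/0x240
.on_each_cpu+0x24/0x94
.drain_all_pages+0x24/0x3c
.__alloc_pages_internal+0x2d0/0x464
.cache_alloc_refill+0x340/0x678
"""

first = symbols(SOFT_LOCKUP)
second = set(symbols(RCU_STALL))
shared = [name for name in first if name in second]
print(shared)
```

Both traces share the path from `__alloc_pages_internal` through `drain_all_pages` and `on_each_cpu` down to `smp_call_function_single`, while the callers above the page allocator differ (slab allocation on file open in one case, socket buffer allocation in the other).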

With kind regards,

Geert Uytterhoeven

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PROBLEM] Soft lockup on Linux 2.6.27, 2 patches, Cell/PPC64
  2008-10-14  9:32   ` Geert Uytterhoeven
@ 2008-10-15  4:49     ` Benjamin Herrenschmidt
  2008-10-15  9:25       ` Geert Uytterhoeven
  0 siblings, 1 reply; 12+ messages in thread
From: Benjamin Herrenschmidt @ 2008-10-15  4:49 UTC (permalink / raw)
  To: Geert Uytterhoeven; +Cc: Linux/PPC Development, Aaron Tokhy

On Tue, 2008-10-14 at 11:32 +0200, Geert Uytterhoeven wrote:

> 
> which points again to smp_call_function_single...

Yup, it doesn't bring more information. At this stage, your 'other' CPU
is stuck with interrupts disabled. Hard to tell what's happening without
some HW assist. Do you have ways to trigger a non-maskable interrupt
such as a 0x100? That would allow us to catch the other guy in xmon and
see what it was doing...

It could be something in the ps3vram driver causing the kernel to
lock up... Now the question is whether the kernel is stuffed with
something like a deadlock with interrupts off, or whether a HW problem
is causing a CPU to lock up on an access to the VRAM.

I'm afraid only you guys have tools that would allow debugging that
sort of problem.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PROBLEM] Soft lockup on Linux 2.6.27, 2 patches, Cell/PPC64
  2008-10-15  4:49     ` Benjamin Herrenschmidt
@ 2008-10-15  9:25       ` Geert Uytterhoeven
  2008-10-15  9:28         ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 12+ messages in thread
From: Geert Uytterhoeven @ 2008-10-15  9:25 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: Linux/PPC Development, Aaron Tokhy

On Wed, 15 Oct 2008, Benjamin Herrenschmidt wrote:
> On Tue, 2008-10-14 at 11:32 +0200, Geert Uytterhoeven wrote:
> > which points again to smp_call_function_single...
> 
> Yup, it doesn't bring more information. At this stage, your 'other' CPU
> is stuck with interrupts disabled. Hard to tell what's happening without
> some HW assist. Do you have ways to trigger a non-maskable interrupt
> such as a 0x100? That would allow us to catch the other guy in xmon and
> see what it was doing...

Interrupts are not disabled on the other CPU thread, at least not according to
the irqs_disabled() check I added to the printing of the `spinlock lockup'
message in __spin_lock_debug().

As the log also said

| hardirqs last  enabled at (5018779): [<c000000000007c1c>] restore+0x1c/0xe4
| hardirqs last disabled at (5018780): [<c000000000003600>] decrementer_common+0x100/0x180

I started blinking the LEDs on decrementer interrupts, which do arrive on both
CPU threads.

However, I'm a bit puzzled by these `hardirqs last enabled/disabled' messages,
as they do indicate interrupts are off...

> It could be something in the ps3vram driver causing the kernel to
> lock up... Now the question is whether the kernel is stuffed with
> something like a deadlock with interrupts off, or whether a HW problem
> is causing a CPU to lock up on an access to the VRAM.

It's not related to the ps3vram driver, as it happens without, too.

With kind regards,

Geert Uytterhoeven

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PROBLEM] Soft lockup on Linux 2.6.27, 2 patches, Cell/PPC64
  2008-10-15  9:25       ` Geert Uytterhoeven
@ 2008-10-15  9:28         ` Benjamin Herrenschmidt
  2008-10-15  9:46           ` Geert Uytterhoeven
  0 siblings, 1 reply; 12+ messages in thread
From: Benjamin Herrenschmidt @ 2008-10-15  9:28 UTC (permalink / raw)
  To: Geert Uytterhoeven; +Cc: Linux/PPC Development, Aaron Tokhy

On Wed, 2008-10-15 at 11:25 +0200, Geert Uytterhoeven wrote:
> On Wed, 15 Oct 2008, Benjamin Herrenschmidt wrote:
> > On Tue, 2008-10-14 at 11:32 +0200, Geert Uytterhoeven wrote:
> > > which points again to smp_call_function_single...
> > 
> > Yup, it doesn't bring more information. At this stage, your 'other' CPU
> > is stuck with interrupts disabled. Hard to tell what's happening without
> > some HW assist. Do you have ways to trigger a non-maskable interrupt
> > such as a 0x100? That would allow us to catch the other guy in xmon and
> > see what it was doing...
> 
> Interrupts are not disabled on the other CPU thread, at least not according to
> the irqs_disabled() check I added to the printing of the `spinlock lockup'
> message in __spin_lock_debug().
> 
> As the log also said
> 
> | hardirqs last  enabled at (5018779): [<c000000000007c1c>] restore+0x1c/0xe4
> | hardirqs last disabled at (5018780): [<c000000000003600>] decrementer_common+0x100/0x180
> 
> I started blinking the LEDs on decrementer interrupts, which do arrive on both
> CPU threads.

Hrm, ok I thought the log shows the decrementer interrupt of the thread
that's still working. If you are confident they are both taking
interrupts, then there's indeed something to track down.

> However, I'm a bit puzzled by these `hardirqs last enabled/disabled' messages,
> as they do indicate interrupts are off...

Well, at the time of the sample, the other CPU indeed -seems- to be in
an IRQ disabled section yes. 

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PROBLEM] Soft lockup on Linux 2.6.27, 2 patches, Cell/PPC64
  2008-10-15  9:28         ` Benjamin Herrenschmidt
@ 2008-10-15  9:46           ` Geert Uytterhoeven
  2008-10-15 11:37             ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 12+ messages in thread
From: Geert Uytterhoeven @ 2008-10-15  9:46 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: Linux/PPC Development, Aaron Tokhy

On Wed, 15 Oct 2008, Benjamin Herrenschmidt wrote:
> On Wed, 2008-10-15 at 11:25 +0200, Geert Uytterhoeven wrote:
> > On Wed, 15 Oct 2008, Benjamin Herrenschmidt wrote:
> > > On Tue, 2008-10-14 at 11:32 +0200, Geert Uytterhoeven wrote:
> > > > which points again to smp_call_function_single...
> > > 
> > > Yup, it doesn't bring more information. At this stage, your 'other' CPU
> > > is stuck with interrupts disabled. Hard to tell what's happening without
> > > some HW assist. Do you have ways to trigger a non-maskable interrupt
> > > such as a 0x100? That would allow us to catch the other guy in xmon and
> > > see what it was doing...
> > 
> > Interrupts are not disabled on the other CPU thread, at least not according to
> > the irqs_disabled() check I added to the printing of the `spinlock lockup'
> > message in __spin_lock_debug().
> > 
> > As the log also said
> > 
> > | hardirqs last  enabled at (5018779): [<c000000000007c1c>] restore+0x1c/0xe4
> > | hardirqs last disabled at (5018780): [<c000000000003600>] decrementer_common+0x100/0x180
> > 
> > I started blinking the LEDs on decrementer interrupts, which do arrive on both
> > CPU threads.
> 
> Hrm, ok I thought the log shows the decrementer interrupt of the thread
> that's still working. If you are confident they are both taking
> interrupts, then there's indeed something to track down.
> 
> > However, I'm a bit puzzled by these `hardirqs last enabled/disabled' messages,
> > as they do indicate interrupts are off...
> 
> Well, at the time of the sample, the other CPU indeed -seems- to be in
> an IRQ disabled section yes. 

This is not really a sample. The hardirqs enable/disable is actually tracked
using the TRACE_{EN,DIS}ABLE_INTS macros.

For the decrementer, the interrupt code is generated by the
STD_EXCEPTION_COMMON_LITE() macro.

Aha, none of the PPC interrupt handlers actually use TRACE_ENABLE_INTS (they do
use TRACE_DISABLE_INTS). So that's why it thinks decrementer_common disabled
interrupts, without enabling them again...

With kind regards,

Geert Uytterhoeven

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PROBLEM] Soft lockup on Linux 2.6.27, 2 patches, Cell/PPC64
  2008-10-15  9:46           ` Geert Uytterhoeven
@ 2008-10-15 11:37             ` Benjamin Herrenschmidt
  2008-10-15 11:46               ` Geert Uytterhoeven
  0 siblings, 1 reply; 12+ messages in thread
From: Benjamin Herrenschmidt @ 2008-10-15 11:37 UTC (permalink / raw)
  To: Geert Uytterhoeven; +Cc: Linux/PPC Development, Aaron Tokhy


> > Well, at the time of the sample, the other CPU indeed -seems- to be in
> > an IRQ disabled section yes. 
> 
> This is not really a sample. The hardirqs enable/disable is actually tracked
> using the TRACE_{EN,DIS}ABLE_INTS macros.

That's what I meant, i.e. the hardirq state was updated by the stuck CPU
but sampled by the non-stuck one. That is, the non-stuck one could have
sampled a transient value where it happened to have hard IRQs
disabled...

> For the decrementer, the interrupt code is generated by the
> STD_EXCEPTION_COMMON_LITE() macro.

Yeah, I know that :-)

> Aha, none of the PPC interrupt handlers actually use TRACE_ENABLE_INTS (they do
> use TRACE_DISABLE_INTS). So that's why it thinks decrementer_common disabled
> interrupts, without enabling them again...

Well, they aren't supposed to enable IRQs if they were disabled...

Ben.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PROBLEM] Soft lockup on Linux 2.6.27, 2 patches, Cell/PPC64
  2008-10-15 11:37             ` Benjamin Herrenschmidt
@ 2008-10-15 11:46               ` Geert Uytterhoeven
  2008-10-15 11:49                 ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 12+ messages in thread
From: Geert Uytterhoeven @ 2008-10-15 11:46 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: Linux/PPC Development, Aaron Tokhy

On Wed, 15 Oct 2008, Benjamin Herrenschmidt wrote:
> > > Well, at the time of the sample, the other CPU indeed -seems- to be in
> > > an IRQ disabled section yes. 
> > 
> > This is not really a sample. The hardirqs enable/disable is actually tracked
> > using the TRACE_{EN,DIS}ABLE_INTS macros.
> 
> That's what I meant, i.e. the hardirq state was updated by the stuck CPU
> but sampled by the non-stuck one. That is, the non-stuck one could have
> sampled a transient value where it happened to have hard IRQs
> disabled...

These states are per_cpu.

> > Aha, none of the PPC interrupt handlers actually use TRACE_ENABLE_INTS (they do
> > use TRACE_DISABLE_INTS). So that's why it thinks decrementer_common disabled
> > interrupts, without enabling them again...
> 
> Well, they aren't supposed to enable IRQs if they were disabled...

They do call TRACE_DISABLE_INTS, which records the interrupt being disabled.
So this makes the actual state recording useless...

With kind regards,

Geert Uytterhoeven

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PROBLEM] Soft lockup on Linux 2.6.27, 2 patches, Cell/PPC64
  2008-10-15 11:46               ` Geert Uytterhoeven
@ 2008-10-15 11:49                 ` Benjamin Herrenschmidt
  2008-10-15 12:05                   ` Geert Uytterhoeven
  0 siblings, 1 reply; 12+ messages in thread
From: Benjamin Herrenschmidt @ 2008-10-15 11:49 UTC (permalink / raw)
  To: Geert Uytterhoeven; +Cc: Linux/PPC Development, Aaron Tokhy

On Wed, 2008-10-15 at 13:46 +0200, Geert Uytterhoeven wrote:
> On Wed, 15 Oct 2008, Benjamin Herrenschmidt wrote:
> > > > Well, at the time of the sample, the other CPU indeed -seems- to be in
> > > > an IRQ disabled section yes. 
> > > 
> > > This is not really a sample. The hardirqs enable/disable is actually tracked
> > > using the TRACE_{EN,DIS}ABLE_INTS macros.
> > 
> > That's what I meant, i.e. the hardirq state was updated by the stuck CPU
> > but sampled by the non-stuck one. That is, the non-stuck one could have
> > sampled a transient value where it happened to have hard IRQs
> > disabled...
> 
> These states are per_cpu.

I know, but that doesn't prevent another CPU from peeking at them :-)
The question is, was the message printed by the CPU that locked up or by
the other one that detected the lockup ?

> They do call TRACE_DISABLE_INTS, which records the interrupt being disabled.
> So this makes the actual state recording useless...

Well, they record that when they disable it. They don't enable it. Can
you find a spot where the IRQ is enabled and it's not recorded or a case
where it's not disabled and recorded as disabled ?

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PROBLEM] Soft lockup on Linux 2.6.27, 2 patches, Cell/PPC64
  2008-10-15 11:49                 ` Benjamin Herrenschmidt
@ 2008-10-15 12:05                   ` Geert Uytterhoeven
  2008-10-15 20:53                     ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 12+ messages in thread
From: Geert Uytterhoeven @ 2008-10-15 12:05 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: Linux/PPC Development, Aaron Tokhy

On Wed, 15 Oct 2008, Benjamin Herrenschmidt wrote:
> On Wed, 2008-10-15 at 13:46 +0200, Geert Uytterhoeven wrote:
> > On Wed, 15 Oct 2008, Benjamin Herrenschmidt wrote:
> > > > > Well, at the time of the sample, the other CPU indeed -seems- to be in
> > > > > an IRQ disabled section yes. 
> > > > 
> > > > This is not really a sample. The hardirqs enable/disable is actually tracked
> > > > using the TRACE_{EN,DIS}ABLE_INTS macros.
> > > 
> > > That's what I meant, i.e. the hardirq state was updated by the stuck CPU
> > > but sampled by the non-stuck one. That is, the non-stuck one could have
> > > sampled a transient value where it happened to have hard IRQs
> > > disabled...
> > 
> > These states are per_cpu.
> 
> I know, but that doesn't prevent another CPU from peeking at them :-)
> The question is, was the message printed by the CPU that locked up or by
> the other one that detected the lockup ?

It's printed by the spinlock debug code, i.e. by the CPU that wants to take the
spinlock (in this case the spinlock for the BKL).

> > They do call TRACE_DISABLE_INTS, which records the interrupt being disabled.
> > So this makes the actual state recording useless...
> 
> Well, they record that when they disable it. They don't enable it. Can
> you find a spot where the IRQ is enabled and it's not recorded or a case
> where it's not disabled and recorded as disabled ?

I guess it's auto-enabled when the decrementer interrupt handler exits?
So shouldn't there be a `bl trace_hardirqs_on' somewhere?

With kind regards,

Geert Uytterhoeven

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PROBLEM] Soft lockup on Linux 2.6.27, 2 patches, Cell/PPC64
  2008-10-15 12:05                   ` Geert Uytterhoeven
@ 2008-10-15 20:53                     ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 12+ messages in thread
From: Benjamin Herrenschmidt @ 2008-10-15 20:53 UTC (permalink / raw)
  To: Geert Uytterhoeven; +Cc: Linux/PPC Development, Aaron Tokhy

On Wed, 2008-10-15 at 14:05 +0200, Geert Uytterhoeven wrote:
> I guess it's auto-enabled when the decrementer interrupt handler exits?
> So shouldn't there be a `bl trace_hardirqs_on' somewhere?

The interrupts are restored to their previous state on exit of
interrupts via the TRACE_AND_RESTORE_IRQ() macro which is called
from entry_64.S in the main restore path and in head_64.S in the
fast path and hashing faults.
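
A toy model may make the bookkeeping easier to follow. The sketch below is Python and only an illustration, not the kernel's code: `IrqTrace` and `timer_interrupt` are hypothetical names, and it assumes that the lockup report is printed from inside the timer interrupt, after the entry path has recorded "disabled" and before the exit path has restored the previous state.

```python
class IrqTrace:
    """Toy per-CPU record of hardirq on/off events (not the kernel's code)."""

    def __init__(self):
        self.events = 0
        self.enabled = True
        self.last_enabled = (0, None)
        self.last_disabled = (0, None)

    def trace_off(self, where):
        if self.enabled:  # only a real state change gets a new stamp
            self.enabled = False
            self.events += 1
            self.last_disabled = (self.events, where)

    def trace_on(self, where):
        if not self.enabled:
            self.enabled = True
            self.events += 1
            self.last_enabled = (self.events, where)

    def report(self):
        return {
            "stamp": self.events,
            "last_enabled": self.last_enabled,
            "last_disabled": self.last_disabled,
        }


def timer_interrupt(cpu, detector_fires):
    """Entry records 'off'; exit restores the previous (enabled) state."""
    cpu.trace_off("decrementer_common")
    snapshot = cpu.report() if detector_fires else None
    cpu.trace_on("restore")
    return snapshot


cpu = IrqTrace()
timer_interrupt(cpu, detector_fires=False)  # an ordinary tick
snap = timer_interrupt(cpu, detector_fires=True)
print(snap)
print(cpu.report())
```

Under that assumption, the snapshot taken inside the handler shows "last disabled" in `decrementer_common` as the newest event and "last enabled" in `restore` one event earlier, the same shape as the stamps in the original report (5018780 and 5018779). Once the handler has exited, the newest event is an enable again.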

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 12+ messages in thread
