* [PROBLEM] Soft lockup on Linux 2.6.27, 2 patches, Cell/PPC64 @ 2008-10-12 4:32 Aaron Tokhy 2008-10-13 7:58 ` Geert Uytterhoeven 0 siblings, 1 reply; 12+ messages in thread From: Aaron Tokhy @ 2008-10-12 4:32 UTC (permalink / raw) To: linuxppc-dev -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Hello, I recently built 2.6.27 with these patches on my PS3. http://www.kernel.org/pub/linux/kernel/people/geoff/cell/ps3-linux-patches/ps3-wip/ps3vram-driver.patch http://www.kernel.org/pub/linux/kernel/people/geoff/cell/ps3-linux-patches/ps3-wip/ps3vram-proc-fs.patch These patches enable the 'ps3vram' module, which creates a MTD node /dev/mtdblock0. In addition to the 256 MB of XDR ram used by the system, I can use 245 MB of the video ram as a fast swap (getting a somewhat valuable 60 MB/s read/write speed on a random access device). I was using the mtdblock0 as a swap space when the soft lockup occurred while leaving the `top` program open. Now I am not sure if the patch is the issue. None of the functions in that list are functions in the patch... but this is my first time at debugging a kernel bug, some of the functions have the word 'page' so it might be due to problems occurring while paging to that mtdblock0 device, but surely calls to the functions in that patch would appear. How would I start debugging this? The trace is also available in pastebin: http://pastebin.com/m2ea72e52 BUG: soft lockup - CPU#0 stuck for 61s! [top:22788] Modules linked in: evdev hci_usb usbhid bluetooth usb_storage snd_ps3 ehci_hcd snd_pcm ohci_hcd snd_page_alloc snd_timer usbcore snd sg ps3_lpm soundcore irq event stamp: 5018780 hardirqs last enabled at (5018779): [<c000000000007c1c>] restore+0x1c/0xe4 hardirqs last disabled at (5018780): [<c000000000003600>] decrementer_common+0x100/0x180 softirqs last enabled at (5018778): [<c000000000020928>] .call_do_softirq+0x14/0x24 softirqs last disabled at (5018773): [<c000000000020928>] .call_do_softirq+0x14/0x24 NIP: c000000000084110 LR: c000000000084468 CTR: c0000000003181d0 REGS: c000000006f37280 TRAP: 0901 Not tainted (2.6.27) MSR: 8000000000008032 <EE,IR,DR> CR: 42004424 XER: 00000000 TASK = c000000007980000[22788] 'top' THREAD: c000000006f34000 CPU: 0 GPR00: 0000000000000001 c000000006f37500 c0000000005543d0 c000000006f37570 GPR04: 0000000000000000 c00000000008427c 0000000000000001 0000000000000000 GPR08: 0000000000000830 0000000000000001 0000000000000000 c000000000b96874 GPR12: 8000000000008032 c000000000586300 NIP [c000000000084110] .csd_flag_wait+0x14/0x1c LR [c000000000084468] .smp_call_function_single+0x13c/0x164 Call Trace: [c000000006f37500] [c000000000084468] .smp_call_function_single+0x13c/0x164 (unreliable) [c000000006f375c0] [c000000000084578] .smp_call_function_mask+0xe8/0x244 [c000000006f37720] [c00000000005809c] .on_each_cpu+0x24/0x9c [c000000006f377c0] [c00000000009bde4] .drain_all_pages+0x24/0x3c [c000000006f37840] [c00000000009c0c8] .__alloc_pages_internal+0x2cc/0x464 [c000000006f37950] [c0000000000c3d54] .__slab_alloc+0x1f8/0x6cc [c000000006f37a10] [c0000000000c466c] .kmem_cache_alloc+0x74/0x108 [c000000006f37ab0] [c0000000000cd200] .get_empty_filp+0x98/0x1a0 [c000000006f37b40] [c0000000000d9fa0] .__path_lookup_intent_open+0x40/0xd0 [c000000006f37bf0] [c0000000000da294] .do_filp_open+0xc0/0x7f0 [c000000006f37d80] [c0000000000c9818] .do_sys_open+0x88/0x154 [c000000006f37e30] [c0000000000076dc] syscall_exit+0x0/0x40 Instruction dump: 2f880000 3860fff0 409e000c f88b0008 38600000 ebc1fff0 4e800020 7c0004ac 80030020 780907e1 4d820020 7c210b78 <7c421378> 4bffffe8 4e800020 7c0802a6 - -- - -Thanks Aaron Tokhy -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.9 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iEYEARECAAYFAkjxfeYACgkQO3nEAs/Ru1mjtwCfW25E51GIAY5KOcpJOp2TeUrz hhQAni7m4UM7ojCPnjEsmiAEVxpLoljh =AVql -----END PGP SIGNATURE----- ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PROBLEM] Soft lockup on Linux 2.6.27, 2 patches, Cell/PPC64 2008-10-12 4:32 [PROBLEM] Soft lockup on Linux 2.6.27, 2 patches, Cell/PPC64 Aaron Tokhy @ 2008-10-13 7:58 ` Geert Uytterhoeven 2008-10-14 9:32 ` Geert Uytterhoeven 0 siblings, 1 reply; 12+ messages in thread From: Geert Uytterhoeven @ 2008-10-13 7:58 UTC (permalink / raw) To: Aaron Tokhy; +Cc: Linux/PPC Development [-- Attachment #1: Type: TEXT/PLAIN, Size: 2816 bytes --] On Sun, 12 Oct 2008, Aaron Tokhy wrote: > I recently built 2.6.27 with these patches on my PS3. > > http://www.kernel.org/pub/linux/kernel/people/geoff/cell/ps3-linux-patches/ps3-wip/ps3vram-driver.patch > http://www.kernel.org/pub/linux/kernel/people/geoff/cell/ps3-linux-patches/ps3-wip/ps3vram-proc-fs.patch > > These patches enable the 'ps3vram' module, which creates a MTD node > Now I am not sure if the patch is the issue. None of the functions in No, we've seen similar things happen without ps3vram, too. > BUG: soft lockup - CPU#0 stuck for 61s! [top:22788] > Modules linked in: evdev hci_usb usbhid bluetooth usb_storage snd_ps3 > ehci_hcd snd_pcm ohci_hcd snd_page_alloc snd_timer usbcore snd sg > ps3_lpm soundcore > irq event stamp: 5018780 > hardirqs last enabled at (5018779): [<c000000000007c1c>] restore+0x1c/0xe4 > hardirqs last disabled at (5018780): [<c000000000003600>] decrementer_common+0x100/0x180 > softirqs last enabled at (5018778): [<c000000000020928>] .call_do_softirq+0x14/0x24 > softirqs last disabled at (5018773): [<c000000000020928>] .call_do_softirq+0x14/0x24 > NIP: c000000000084110 LR: c000000000084468 CTR: c0000000003181d0 > REGS: c000000006f37280 TRAP: 0901 Not tainted (2.6.27) > MSR: 8000000000008032 <EE,IR,DR> CR: 42004424 XER: 00000000 > TASK = c000000007980000[22788] 'top' THREAD: c000000006f34000 CPU: 0 > GPR00: 0000000000000001 c000000006f37500 c0000000005543d0 c000000006f37570 > GPR04: 0000000000000000 c00000000008427c 0000000000000001 0000000000000000 > GPR08: 0000000000000830 0000000000000001 0000000000000000 c000000000b96874 > GPR12: 8000000000008032 c000000000586300 > NIP [c000000000084110] .csd_flag_wait+0x14/0x1c > LR [c000000000084468] .smp_call_function_single+0x13c/0x164 > Call Trace: > [c000000006f37500] [c000000000084468] .smp_call_function_single+0x13c/0x164 (unreliable) smp_call_function_single() causes an IPI to be sent to the other CPU thread. However, the IPI never seems to arrive at the other CPU thread, causing the soft lockup message to be printed on the console. If this happens when the BKL is held before sending the IPI, the system will deadlock when the other CPU thread tries to acquire the BKL. In that unfortunate case, you won't see any message on the console of a retail PS3, though. So far we do not know what's the exact cause of the IPI not arriving, hence suggestions are welcome. With kind regards, Geert Uytterhoeven Software Architect Sony Techsoft Centre Europe The Corporate Village · Da Vincilaan 7-D1 · B-1935 Zaventem · Belgium Phone: +32 (0)2 700 8453 Fax: +32 (0)2 700 8622 E-mail: Geert.Uytterhoeven@sonycom.com Internet: http://www.sony-europe.com/ A division of Sony Europe (Belgium) N.V. VAT BE 0413.825.160 · RPR Brussels Fortis · BIC GEBABEBB · IBAN BE41293037680010 ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PROBLEM] Soft lockup on Linux 2.6.27, 2 patches, Cell/PPC64 2008-10-13 7:58 ` Geert Uytterhoeven @ 2008-10-14 9:32 ` Geert Uytterhoeven 2008-10-15 4:49 ` Benjamin Herrenschmidt 0 siblings, 1 reply; 12+ messages in thread From: Geert Uytterhoeven @ 2008-10-14 9:32 UTC (permalink / raw) To: Aaron Tokhy; +Cc: Linux/PPC Development [-- Attachment #1: Type: TEXT/PLAIN, Size: 4851 bytes --] On Mon, 13 Oct 2008, Geert Uytterhoeven wrote: > On Sun, 12 Oct 2008, Aaron Tokhy wrote: > > I recently built 2.6.27 with these patches on my PS3. > > > > http://www.kernel.org/pub/linux/kernel/people/geoff/cell/ps3-linux-patches/ps3-wip/ps3vram-driver.patch > > http://www.kernel.org/pub/linux/kernel/people/geoff/cell/ps3-linux-patches/ps3-wip/ps3vram-proc-fs.patch > > > > These patches enable the 'ps3vram' module, which creates a MTD node > > > Now I am not sure if the patch is the issue. None of the functions in > > No, we've seen similar things happen without ps3vram, too. > > > BUG: soft lockup - CPU#0 stuck for 61s! [top:22788] > > Modules linked in: evdev hci_usb usbhid bluetooth usb_storage snd_ps3 > > ehci_hcd snd_pcm ohci_hcd snd_page_alloc snd_timer usbcore snd sg > > ps3_lpm soundcore > > irq event stamp: 5018780 > > hardirqs last enabled at (5018779): [<c000000000007c1c>] restore+0x1c/0xe4 > > hardirqs last disabled at (5018780): [<c000000000003600>] decrementer_common+0x100/0x180 > > softirqs last enabled at (5018778): [<c000000000020928>] .call_do_softirq+0x14/0x24 > > softirqs last disabled at (5018773): [<c000000000020928>] .call_do_softirq+0x14/0x24 > > NIP: c000000000084110 LR: c000000000084468 CTR: c0000000003181d0 > > REGS: c000000006f37280 TRAP: 0901 Not tainted (2.6.27) > > MSR: 8000000000008032 <EE,IR,DR> CR: 42004424 XER: 00000000 > > TASK = c000000007980000[22788] 'top' THREAD: c000000006f34000 CPU: 0 > > GPR00: 0000000000000001 c000000006f37500 c0000000005543d0 c000000006f37570 > > GPR04: 0000000000000000 c00000000008427c 0000000000000001 0000000000000000 > > GPR08: 0000000000000830 0000000000000001 0000000000000000 c000000000b96874 > > GPR12: 8000000000008032 c000000000586300 > > NIP [c000000000084110] .csd_flag_wait+0x14/0x1c > > LR [c000000000084468] .smp_call_function_single+0x13c/0x164 > > Call Trace: > > [c000000006f37500] [c000000000084468] .smp_call_function_single+0x13c/0x164 (unreliable) > > smp_call_function_single() causes an IPI to be sent to the other CPU thread. > However, the IPI never seems to arrive at the other CPU thread, causing the > soft lockup message to be printed on the console. > > If this happens when the BKL is held before sending the IPI, the system will > deadlock when the other CPU thread tries to acquire the BKL. In that > unfortunate case, you won't see any message on the console of a retail PS3, > though. > > So far we do not know what's the exact cause of the IPI not arriving, hence > suggestions are welcome. I've enabled the recently introduced CONFIG_RCU_CPU_STALL_DETECTOR option and got: | <3>RCU detected CPU 1 stall (t=4295279718/750 jiffies) | Call Trace: | [c000000013e5a940] [c00000000000f314] .show_stack+0x70/0x184 (unreliable) | [c000000013e5a9f0] [c00000000009029c] .__rcu_pending+0x9c/0x2b4 | [c000000013e5aa90] [c0000000000904ec] .rcu_pending+0x38/0x84 | [c000000013e5ab10] [c00000000005d9f0] .update_process_times+0x40/0x8c | [c000000013e5aba0] [c000000000076d4c] .tick_sched_timer+0x154/0x1bc | [c000000013e5ac60] [c00000000006e630] .__run_hrtimer+0x8c/0x128 | [c000000013e5ad00] [c00000000006f60c] .hrtimer_interrupt+0x10c/0x1c8 | [c000000013e5add0] [c00000000001d2d0] .timer_interrupt+0xcc/0x124 | [c000000013e5ae80] [c000000000003614] decrementer_common+0x114/0x180 | --- Exception: 901 at .csd_flag_wait+0x4/0x1c | LR = .smp_call_function_single+0x13c/0x164 | [c000000013e5b230] [c000000000082774] .smp_call_function_mask+0xe4/0x240 | [c000000013e5b390] [c0000000000566dc] .on_each_cpu+0x24/0x94 | [c000000013e5b430] [c0000000000998bc] .drain_all_pages+0x24/0x3c | [c000000013e5b4b0] [c000000000099ba4] .__alloc_pages_internal+0x2d0/0x464 | [c000000013e5b5b0] [c0000000000bb158] .cache_alloc_refill+0x340/0x678 | [c000000013e5b680] [c0000000000bb574] .__kmalloc+0xe4/0x170 | [c000000013e5b720] [c000000000297e18] .__alloc_skb+0x7c/0x154 | [c000000013e5b7c0] [c0000000002923a8] .sock_alloc_send_skb+0xc4/0x2a4 | [c000000013e5b8a0] [c00000000030a464] .unix_stream_sendmsg+0x178/0x384 | [c000000013e5b990] [c00000000028e234] .sock_aio_write+0xec/0x114 | [c000000013e5baa0] [c0000000000bf2dc] .do_sync_readv_writev+0xc8/0x130 | [c000000013e5bc30] [c0000000000fefa0] .compat_do_readv_writev+0x1e0/0x33c | [c000000013e5bd90] [c0000000000ff184] .compat_sys_writev+0x88/0xbc | [c000000013e5be30] [c0000000000074dc] syscall_exit+0x0/0x40 which points again to smp_call_function_single... With kind regards, Geert Uytterhoeven Software Architect Sony Techsoft Centre Europe The Corporate Village · Da Vincilaan 7-D1 · B-1935 Zaventem · Belgium Phone: +32 (0)2 700 8453 Fax: +32 (0)2 700 8622 E-mail: Geert.Uytterhoeven@sonycom.com Internet: http://www.sony-europe.com/ A division of Sony Europe (Belgium) N.V. VAT BE 0413.825.160 · RPR Brussels Fortis · BIC GEBABEBB · IBAN BE41293037680010 ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PROBLEM] Soft lockup on Linux 2.6.27, 2 patches, Cell/PPC64 2008-10-14 9:32 ` Geert Uytterhoeven @ 2008-10-15 4:49 ` Benjamin Herrenschmidt 2008-10-15 9:25 ` Geert Uytterhoeven 0 siblings, 1 reply; 12+ messages in thread From: Benjamin Herrenschmidt @ 2008-10-15 4:49 UTC (permalink / raw) To: Geert Uytterhoeven; +Cc: Linux/PPC Development, Aaron Tokhy On Tue, 2008-10-14 at 11:32 +0200, Geert Uytterhoeven wrote: > > which points again to smp_call_function_single... Yup, it doesn't bring more information. At this stage, your 'other' CPU is stuck with interrupts disabled. Hard to tell what's happening without some HW assist. Do you have ways to trigger a non-maskable interrupt such as a 0x100 ? That would allow to catch the other guy in xmon and see what it was doing... It could be something in the ps3vram driver causing the kernel to lockup.... Now the question is whether the kernel is stuffed with something like a deadlock with interrupts off, or is it a HW problem causing a CPU to lockup on an access to the vram ? I'm afraid only you guys have tools that would allow to debug that sort of problem. Cheers, Ben. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PROBLEM] Soft lockup on Linux 2.6.27, 2 patches, Cell/PPC64 2008-10-15 4:49 ` Benjamin Herrenschmidt @ 2008-10-15 9:25 ` Geert Uytterhoeven 2008-10-15 9:28 ` Benjamin Herrenschmidt 0 siblings, 1 reply; 12+ messages in thread From: Geert Uytterhoeven @ 2008-10-15 9:25 UTC (permalink / raw) To: Benjamin Herrenschmidt; +Cc: Linux/PPC Development, Aaron Tokhy [-- Attachment #1: Type: TEXT/PLAIN, Size: 1835 bytes --] On Wed, 15 Oct 2008, Benjamin Herrenschmidt wrote: > On Tue, 2008-10-14 at 11:32 +0200, Geert Uytterhoeven wrote: > > which points again to smp_call_function_single... > > Yup, it doesn't bring more information. At this stage, your 'other' CPU > is stuck with interrupts disabled. Hard to tell what's happening without > some HW assist. Do you have ways to trigger a non-maskable interrupt > such as a 0x100 ? That would allow to catch the other guy in xmon and > see what it was doing... Interrupts are not disabled on the other CPU thread, at least not according to the irqs_disabled() check I added to the printing of the `spinlock lockup' message in __spin_lock_debug(). As the log also said | hardirqs last enabled at (5018779): [<c000000000007c1c>] restore+0x1c/0xe4 | hardirqs last disabled at (5018780): [<c000000000003600>] decrementer_common+0x100/0x180 I started blinking the LEDs on decrementer interupts, which do arrive on both CPU threads. However, I'm a bit puzzled by these `hardirqs last enabled/disabled' messages, as they do indicate interrupts are off... > It could be something in the ps3vram driver causing the kernel to > lockup.... Now the question is whether the kernel is stuffed with > something like a deadlock with interrupts off, or is it a HW problem > causing a CPU to lockup on an access to the vram ? It's not related to the ps3vram driver, as it happens without, too. With kind regards, Geert Uytterhoeven Software Architect Sony Techsoft Centre Europe The Corporate Village · Da Vincilaan 7-D1 · B-1935 Zaventem · Belgium Phone: +32 (0)2 700 8453 Fax: +32 (0)2 700 8622 E-mail: Geert.Uytterhoeven@sonycom.com Internet: http://www.sony-europe.com/ A division of Sony Europe (Belgium) N.V. VAT BE 0413.825.160 · RPR Brussels Fortis · BIC GEBABEBB · IBAN BE41293037680010 ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PROBLEM] Soft lockup on Linux 2.6.27, 2 patches, Cell/PPC64 2008-10-15 9:25 ` Geert Uytterhoeven @ 2008-10-15 9:28 ` Benjamin Herrenschmidt 2008-10-15 9:46 ` Geert Uytterhoeven 0 siblings, 1 reply; 12+ messages in thread From: Benjamin Herrenschmidt @ 2008-10-15 9:28 UTC (permalink / raw) To: Geert Uytterhoeven; +Cc: Linux/PPC Development, Aaron Tokhy On Wed, 2008-10-15 at 11:25 +0200, Geert Uytterhoeven wrote: > On Wed, 15 Oct 2008, Benjamin Herrenschmidt wrote: > > On Tue, 2008-10-14 at 11:32 +0200, Geert Uytterhoeven wrote: > > > which points again to smp_call_function_single... > > > > Yup, it doesn't bring more information. At this stage, your 'other' CPU > > is stuck with interrupts disabled. Hard to tell what's happening without > > some HW assist. Do you have ways to trigger a non-maskable interrupt > > such as a 0x100 ? That would allow to catch the other guy in xmon and > > see what it was doing... > > Interrupts are not disabled on the other CPU thread, at least not according to > the irqs_disabled() check I added to the printing of the `spinlock lockup' > message in __spin_lock_debug(). > > As the log also said > > | hardirqs last enabled at (5018779): [<c000000000007c1c>] restore+0x1c/0xe4 > | hardirqs last disabled at (5018780): [<c000000000003600>] decrementer_common+0x100/0x180 > > I started blinking the LEDs on decrementer interupts, which do arrive on both > CPU threads. Hrm, ok I though the log shows the decrementer interrupt of the thread that's still working. If you are confident they are both taking interrupts, then there's indeed something to track down. > However, I'm a bit puzzled by these `hardirqs last enabled/disabled' messages, > as they do indicate interrupts are off... Well, at the time of the sample, the other CPU indeed -seems- to be in an IRQ disabled section yes. Cheers, Ben. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PROBLEM] Soft lockup on Linux 2.6.27, 2 patches, Cell/PPC64 2008-10-15 9:28 ` Benjamin Herrenschmidt @ 2008-10-15 9:46 ` Geert Uytterhoeven 2008-10-15 11:37 ` Benjamin Herrenschmidt 0 siblings, 1 reply; 12+ messages in thread From: Geert Uytterhoeven @ 2008-10-15 9:46 UTC (permalink / raw) To: Benjamin Herrenschmidt; +Cc: Linux/PPC Development, Aaron Tokhy [-- Attachment #1: Type: TEXT/PLAIN, Size: 2438 bytes --] On Wed, 15 Oct 2008, Benjamin Herrenschmidt wrote: > On Wed, 2008-10-15 at 11:25 +0200, Geert Uytterhoeven wrote: > > On Wed, 15 Oct 2008, Benjamin Herrenschmidt wrote: > > > On Tue, 2008-10-14 at 11:32 +0200, Geert Uytterhoeven wrote: > > > > which points again to smp_call_function_single... > > > > > > Yup, it doesn't bring more information. At this stage, your 'other' CPU > > > is stuck with interrupts disabled. Hard to tell what's happening without > > > some HW assist. Do you have ways to trigger a non-maskable interrupt > > > such as a 0x100 ? That would allow to catch the other guy in xmon and > > > see what it was doing... > > > > Interrupts are not disabled on the other CPU thread, at least not according to > > the irqs_disabled() check I added to the printing of the `spinlock lockup' > > message in __spin_lock_debug(). > > > > As the log also said > > > > | hardirqs last enabled at (5018779): [<c000000000007c1c>] restore+0x1c/0xe4 > > | hardirqs last disabled at (5018780): [<c000000000003600>] decrementer_common+0x100/0x180 > > > > I started blinking the LEDs on decrementer interupts, which do arrive on both > > CPU threads. > > Hrm, ok I though the log shows the decrementer interrupt of the thread > that's still working. If you are confident they are both taking > interrupts, then there's indeed something to track down. > > > However, I'm a bit puzzled by these `hardirqs last enabled/disabled' messages, > > as they do indicate interrupts are off... > > Well, at the time of the sample, the other CPU indeed -seems- to be in > an IRQ disabled section yes. This is not really a sample. The hardirqs enable/disable is actually tracked using the TRACE_{EN,DIS}ABLE_INTS macros. For the decrementer, the interrupt code is generated by the STD_EXCEPTION_COMMON_LITE() macro. Aha, none of the PPC interrupt handlers actually us TRACE_ENABLE_INTS (they do use TRACE_DISABLE_INTS). So that's why it thinks decrementer_common disabled interrupts, without enabling them again... With kind regards, Geert Uytterhoeven Software Architect Sony Techsoft Centre Europe The Corporate Village · Da Vincilaan 7-D1 · B-1935 Zaventem · Belgium Phone: +32 (0)2 700 8453 Fax: +32 (0)2 700 8622 E-mail: Geert.Uytterhoeven@sonycom.com Internet: http://www.sony-europe.com/ A division of Sony Europe (Belgium) N.V. VAT BE 0413.825.160 · RPR Brussels Fortis · BIC GEBABEBB · IBAN BE41293037680010 ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PROBLEM] Soft lockup on Linux 2.6.27, 2 patches, Cell/PPC64 2008-10-15 9:46 ` Geert Uytterhoeven @ 2008-10-15 11:37 ` Benjamin Herrenschmidt 2008-10-15 11:46 ` Geert Uytterhoeven 0 siblings, 1 reply; 12+ messages in thread From: Benjamin Herrenschmidt @ 2008-10-15 11:37 UTC (permalink / raw) To: Geert Uytterhoeven; +Cc: Linux/PPC Development, Aaron Tokhy > > Well, at the time of the sample, the other CPU indeed -seems- to be in > > an IRQ disabled section yes. > > This is not really a sample. The hardirqs enable/disable is actually tracked > using the TRACE_{EN,DIS}ABLE_INTS macros. That's what I meant. IE. the hardirq state was updated by the stuck CPU but sampled by the non-stuck one. ie. the non-stuck one could have sampled a transcient value where it happened to have hard irq disabled... > For the decrementer, the interrupt code is generated by the > STD_EXCEPTION_COMMON_LITE() macro. Yeah, I know that :-) > Aha, none of the PPC interrupt handlers actually us TRACE_ENABLE_INTS (they do > use TRACE_DISABLE_INTS). So that's why it thinks decrementer_common disabled > interrupts, without enabling them again... Well, they aren't supposed to enable IRQs if they were disabled... Ben. > With kind regards, > > Geert Uytterhoeven > Software Architect > > Sony Techsoft Centre Europe > The Corporate Village · Da Vincilaan 7-D1 · B-1935 Zaventem · Belgium > > Phone: +32 (0)2 700 8453 > Fax: +32 (0)2 700 8622 > E-mail: Geert.Uytterhoeven@sonycom.com > Internet: http://www.sony-europe.com/ > > A division of Sony Europe (Belgium) N.V. > VAT BE 0413.825.160 · RPR Brussels > Fortis · BIC GEBABEBB · IBAN BE41293037680010 ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PROBLEM] Soft lockup on Linux 2.6.27, 2 patches, Cell/PPC64 2008-10-15 11:37 ` Benjamin Herrenschmidt @ 2008-10-15 11:46 ` Geert Uytterhoeven 2008-10-15 11:49 ` Benjamin Herrenschmidt 0 siblings, 1 reply; 12+ messages in thread From: Geert Uytterhoeven @ 2008-10-15 11:46 UTC (permalink / raw) To: Benjamin Herrenschmidt; +Cc: Linux/PPC Development, Aaron Tokhy [-- Attachment #1: Type: TEXT/PLAIN, Size: 1383 bytes --] On Wed, 15 Oct 2008, Benjamin Herrenschmidt wrote: > > > Well, at the time of the sample, the other CPU indeed -seems- to be in > > > an IRQ disabled section yes. > > > > This is not really a sample. The hardirqs enable/disable is actually tracked > > using the TRACE_{EN,DIS}ABLE_INTS macros. > > That's what I meant. IE. the hardirq state was updated by the stuck CPU > but sampled by the non-stuck one. ie. the non-stuck one could have > sampled a transcient value where it happened to have hard irq > disabled... These states are per_cpu. > > Aha, none of the PPC interrupt handlers actually us TRACE_ENABLE_INTS (they do > > use TRACE_DISABLE_INTS). So that's why it thinks decrementer_common disabled > > interrupts, without enabling them again... > > Well, they aren't supposed to enable IRQs if they were disabled... They do call TRACE_DISABLE_INTS, which records the interrupt being disabled. So this makes the actual state recording useless... With kind regards, Geert Uytterhoeven Software Architect Sony Techsoft Centre Europe The Corporate Village · Da Vincilaan 7-D1 · B-1935 Zaventem · Belgium Phone: +32 (0)2 700 8453 Fax: +32 (0)2 700 8622 E-mail: Geert.Uytterhoeven@sonycom.com Internet: http://www.sony-europe.com/ A division of Sony Europe (Belgium) N.V. VAT BE 0413.825.160 · RPR Brussels Fortis · BIC GEBABEBB · IBAN BE41293037680010 ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PROBLEM] Soft lockup on Linux 2.6.27, 2 patches, Cell/PPC64 2008-10-15 11:46 ` Geert Uytterhoeven @ 2008-10-15 11:49 ` Benjamin Herrenschmidt 2008-10-15 12:05 ` Geert Uytterhoeven 0 siblings, 1 reply; 12+ messages in thread From: Benjamin Herrenschmidt @ 2008-10-15 11:49 UTC (permalink / raw) To: Geert Uytterhoeven; +Cc: Linux/PPC Development, Aaron Tokhy On Wed, 2008-10-15 at 13:46 +0200, Geert Uytterhoeven wrote: > On Wed, 15 Oct 2008, Benjamin Herrenschmidt wrote: > > > > Well, at the time of the sample, the other CPU indeed -seems- to be in > > > > an IRQ disabled section yes. > > > > > > This is not really a sample. The hardirqs enable/disable is actually tracked > > > using the TRACE_{EN,DIS}ABLE_INTS macros. > > > > That's what I meant. IE. the hardirq state was updated by the stuck CPU > > but sampled by the non-stuck one. ie. the non-stuck one could have > > sampled a transcient value where it happened to have hard irq > > disabled... > > These states are per_cpu. I know, but that doesn't prevent another CPU from peeking at them :-) The question is, was the message printed by the CPU that locked up or by the other one that detected the lockup ? > They do call TRACE_DISABLE_INTS, which records the interrupt being disabled. > So this makes the actual state recording useless... Well, they record that when they disable it. They don't enable it. Can you find a spot where the IRQ is enabled and it's not recorded or a case where it's not disabled and recorded as disabled ? Cheers, Ben. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PROBLEM] Soft lockup on Linux 2.6.27, 2 patches, Cell/PPC64 2008-10-15 11:49 ` Benjamin Herrenschmidt @ 2008-10-15 12:05 ` Geert Uytterhoeven 2008-10-15 20:53 ` Benjamin Herrenschmidt 0 siblings, 1 reply; 12+ messages in thread From: Geert Uytterhoeven @ 2008-10-15 12:05 UTC (permalink / raw) To: Benjamin Herrenschmidt; +Cc: Linux/PPC Development, Aaron Tokhy [-- Attachment #1: Type: TEXT/PLAIN, Size: 1932 bytes --] On Wed, 15 Oct 2008, Benjamin Herrenschmidt wrote: > On Wed, 2008-10-15 at 13:46 +0200, Geert Uytterhoeven wrote: > > On Wed, 15 Oct 2008, Benjamin Herrenschmidt wrote: > > > > > Well, at the time of the sample, the other CPU indeed -seems- to be in > > > > > an IRQ disabled section yes. > > > > > > > > This is not really a sample. The hardirqs enable/disable is actually tracked > > > > using the TRACE_{EN,DIS}ABLE_INTS macros. > > > > > > That's what I meant. IE. the hardirq state was updated by the stuck CPU > > > but sampled by the non-stuck one. ie. the non-stuck one could have > > > sampled a transcient value where it happened to have hard irq > > > disabled... > > > > These states are per_cpu. > > I know, but that doesn't prevent another CPU from peeking at them :-) > The question is, was the message printed by the CPU that locked up or by > the other one that detected the lockup ? It's printed by the spinlock debug code, i.e. by the CPU that wants to take the spinlock (in this case the spinlock for the BKL). > > They do call TRACE_DISABLE_INTS, which records the interrupt being disabled. > > So this makes the actual state recording useless... > > Well, they record that when they disable it. They don't enable it. Can > you find a spot where the IRQ is enabled and it's not recorded or a case > where it's not disabled and recorded as disabled ? I guess it's auto-enabled when the decrementer interrupt handler exits? So shouldn't there be a `bl trace_hardirqs_on' somewhere? With kind regards, Geert Uytterhoeven Software Architect Sony Techsoft Centre Europe The Corporate Village · Da Vincilaan 7-D1 · B-1935 Zaventem · Belgium Phone: +32 (0)2 700 8453 Fax: +32 (0)2 700 8622 E-mail: Geert.Uytterhoeven@sonycom.com Internet: http://www.sony-europe.com/ A division of Sony Europe (Belgium) N.V. VAT BE 0413.825.160 · RPR Brussels Fortis · BIC GEBABEBB · IBAN BE41293037680010 ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PROBLEM] Soft lockup on Linux 2.6.27, 2 patches, Cell/PPC64 2008-10-15 12:05 ` Geert Uytterhoeven @ 2008-10-15 20:53 ` Benjamin Herrenschmidt 0 siblings, 0 replies; 12+ messages in thread From: Benjamin Herrenschmidt @ 2008-10-15 20:53 UTC (permalink / raw) To: Geert Uytterhoeven; +Cc: Linux/PPC Development, Aaron Tokhy On Wed, 2008-10-15 at 14:05 +0200, Geert Uytterhoeven wrote: > I guess it's auto-enabled when the decrementer interrupt handler exits? > So shouldn't there be a `bl trace_hardirqs_on' somewhere? The interrupts are restored to their previous state on exit of interrupts via the TRACE_AND_RESTORE_IRQ() macro which is called from entry_64.S in the main restore path and in head_64.S in the fast path and hashing faults. Cheers, Ben. ^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2008-10-15 20:53 UTC | newest] Thread overview: 12+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2008-10-12 4:32 [PROBLEM] Soft lockup on Linux 2.6.27, 2 patches, Cell/PPC64 Aaron Tokhy 2008-10-13 7:58 ` Geert Uytterhoeven 2008-10-14 9:32 ` Geert Uytterhoeven 2008-10-15 4:49 ` Benjamin Herrenschmidt 2008-10-15 9:25 ` Geert Uytterhoeven 2008-10-15 9:28 ` Benjamin Herrenschmidt 2008-10-15 9:46 ` Geert Uytterhoeven 2008-10-15 11:37 ` Benjamin Herrenschmidt 2008-10-15 11:46 ` Geert Uytterhoeven 2008-10-15 11:49 ` Benjamin Herrenschmidt 2008-10-15 12:05 ` Geert Uytterhoeven 2008-10-15 20:53 ` Benjamin Herrenschmidt
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).