* [BUG] lockup with the latest kernel
@ 2009-08-19 15:49 Steven Rostedt
2009-08-19 15:50 ` Steven Rostedt
` (2 more replies)
0 siblings, 3 replies; 13+ messages in thread
From: Steven Rostedt @ 2009-08-19 15:49 UTC (permalink / raw)
To: LKML; +Cc: Thomas Gleixner, Peter Zijlstra, mingo, Andrew Morton
I'm hitting this funny lockup and I'm not sure why. This box is an older
i386 SMP box that has two sockets (no dual core).
$ cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 6
model : 10
model name : AMD Athlon(tm) MP 2800+
stepping : 0
cpu MHz : 2133.418
cache size : 512 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 mmx fxsr sse syscall mp mmxext 3dnowext 3dnow
bogomips : 4266.83
clflush size : 32
power management: ts
processor : 1
vendor_id : AuthenticAMD
[ same as above ]
$ lspci -vvv
[snip]
00:07.3 Bridge: Advanced Micro Devices [AMD] AMD-768 [Opus] ACPI (rev 03)
Subsystem: Advanced Micro Devices [AMD] AMD-768 [Opus] ACPI
Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR- INTx-
Kernel driver in use: amd756_smbus
[snip]
Here's a few backtraces:
[13288.222084] BUG: NMI Watchdog detected LOCKUP on CPU0, ip c0110821,
registers:
[13288.222084] Modules linked in: iptable_filter ip_tables x_tables
binfmt_misc ppdev lp autofs4 nfsd exportfs nfs lockd nfs_acl auth_rpcgss
sunrpc ipv6 sd_mod dm_snapshot dm_mirror dm_region_hash dm_log dm_mod
snd_usb_audio snd_usb_lib usbhid usb_storage hid usblp snd_emu10k1_synth
snd_emux_synth snd_seq_virmidi snd_seq_midi_emul snd_emu10k1
snd_ac97_codec ac97_bus snd_pcm_oss snd_mixer_oss snd_pcm snd_page_alloc
snd_util_mem snd_hwdep snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi
snd_seq_midi_event snd_seq snd_timer 3c59x emu10k1_gp snd_seq_device mii
gameport ohci_hcd snd ehci_hcd soundcore usbcore shpchp pci_hotplug
i2c_amd756 amd_k7_agp i2c_core amd_rng agpgart rng_core container thermal
processor thermal_sys button ide_cd_mod evdev cdrom psmouse serio_raw
pcspkr parport_pc floppy parport
[13288.222084]
[13288.222084] Pid: 3, comm: migration/0 Not tainted (2.6.30-rc6 #3)
Unknown
[13288.222084] EIP: 0060:[<c0110821>] EFLAGS: 00000002 CPU: 0
[13288.222084] EIP is at default_send_IPI_mask_logical+0x53/0x92
[13288.222084] EAX: fffff000 EBX: 000000fd ECX: c042cccc EDX: 00000800
[13288.222084] ESI: 00000002 EDI: 00000086 EBP: f7065f10 ESP: f7065f00
[13288.222084] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
[13288.222084] Process migration/0 (pid: 3, ti=f7064000 task=f705f0c0
task.ti=f7064000)
[13288.222084] Stack:
[13288.222084] c0350550 0000007e efbfea20 c1819c60 f7065f18 c010f1fa
f7065f24 c011b69c
[13288.222084] c0350490 f7065f2c c011b70b f7065f5c c01213d3 00000000
00000000 00000001
[13288.222084] 00000000 00000001 00000000 00000092 c5685f08 c5685f44
00000003 f7065f64
[13288.222084] Call Trace:
[13288.222084] [<c010f1fa>] ? native_smp_send_reschedule+0x44/0x46
[13288.222084] [<c011b69c>] ? resched_task+0x5f/0x62
[13288.222084] [<c011b70b>] ? check_preempt_curr_idle+0x13/0x15
[13288.222084] [<c01213d3>] ? try_to_wake_up+0x1f4/0x228
[13288.222084] [<c0121417>] ? default_wake_function+0x10/0x12
[13288.222084] [<c011a22f>] ? __wake_up_common+0x39/0x61
[13288.222084] [<c011b7da>] ? complete+0x30/0x43
[13288.222084] [<c0123eaa>] ? migration_thread+0x19d/0x1db
[13288.222084] [<c0123d0d>] ? migration_thread+0x0/0x1db
[13288.222084] [<c0137411>] ? kthread+0x4b/0x6f
[13288.222084] [<c01373c6>] ? kthread+0x0/0x6f
[13288.222084] [<c01033cf>] ? kernel_thread_helper+0x7/0x10
[13288.222084] Code: eb 27 3b c0 e8 56 62 01 00 a1 88 cd 42 c0 83 fb 02 8b
50 20 75 10 89 55 f0 ff 90 b8 00 00 00 8b 55 f0 eb 12 f3 90 a1 f4 f8 42 c0
<8b> 80 00 c3 ff ff f6 c4 10 75 ee a1 f4 f8 42 c0 c1 e6 18 2d f0
[13288.222084] ---[ end trace 94cd80322ca2a918 ]---
[13288.222093] BUG: NMI Watchdog detected LOCKUP on CPU1, ip c03492d5,
registers:
[13288.222093] Modules linked in: iptable_filter ip_tables x_tables
binfmt_misc ppdev lp autofs4 nfsd exportfs nfs lockd nfs_acl auth_rpcgss
sunrpc ipv6 sd_mod dm_snapshot dm_mirror dm_region_hash dm_log dm_mod
snd_usb_audio snd_usb_lib usbhid usb_storage hid usblp snd_emu10k1_synth
snd_emux_synth snd_seq_virmidi snd_seq_midi_emul snd_emu10k1
snd_ac97_codec ac97_bus snd_pcm_oss snd_mixer_oss snd_pcm snd_page_alloc
snd_util_mem snd_hwdep snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi
snd_seq_midi_event snd_seq snd_timer 3c59x emu10k1_gp snd_seq_device mii
gameport ohci_hcd snd ehci_hcd soundcore usbcore shpchp pci_hotplug
i2c_amd756 amd_k7_agp i2c_core amd_rng agpgart rng_core container thermal
processor thermal_sys button ide_cd_mod evdev cdrom psmouse serio_raw
pcspkr parport_pc floppy parport
[13288.222093]
[13288.222093] Pid: 0, comm: swapper Tainted: G D (2.6.30-rc6 #3)
Unknown
[13288.222093] EIP: 0060:[<c03492d5>] EFLAGS: 00000097 CPU: 1
[13288.222093] EIP is at _spin_lock_irq+0x19/0x1f
[13288.222093] EAX: c1819c60 EBX: c1819c60 ECX: f708f6b0 EDX: 00007e7d
[13288.222093] ESI: c04bfc60 EDI: c04bcd18 EBP: f7093f18 ESP: f7093f18
[13288.222093] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
[13288.222093] Process swapper (pid: 0, ti=f7092000 task=f708f500
task.ti=f7092000)
[13288.222093] Stack:
[13288.222093] f7093f8c c0347415 f7093f28 c012b204 f708f6b0 c01100ba
c04bcd18 c04bfc60
[13288.222093] f708f6b4 c1816488 00000000 00000001 00000000 f7093f94
f7092000 0000007b
[13288.222093] f708f500 000000d8 00000000 c0468e60 00000001 c1816f8c
f7093f94 c01424e4
[13288.222093] Call Trace:
[13288.222093] [<c0347415>] ? __schedule+0xc8/0x855
[13288.222093] [<c012b204>] ? irq_exit+0x39/0x5c
[13288.222093] [<c01100ba>] ? smp_error_interrupt+0x61/0x63
[13288.222093] [<c01424e4>] ? tick_nohz_restart_sched_tick+0x2c/0x139
[13288.222093] [<c0347baf>] ? schedule+0xd/0x1c
[13288.222093] [<c0101c5d>] ? cpu_idle+0x60/0x63
[13288.222093] [<c0344d11>] ? start_secondary+0x195/0x19a
[13288.222093] Code: 66 0f c1 10 38 f2 74 06 f3 90 8a 10 eb f6 89 c8 5d c3
55 89 e5 0f 1f 44 00 00 fa ba 00 01 00 00 f0 66 0f c1 10 38 f2 74 06 f3 90
<8a> 10 eb f6 5d c3 55 89 e5 0f 1f 44 00 00 9c 5a fa f0 83 28 01
[13288.222093] ---[ end trace 94cd80322ca2a919 ]---
And another instance:
:
[45012.674548] Modules linked in: vfat fat iptable_filter ip_tables
x_tables binfmt_misc ppdev lp autofs4 nfsd exportfs nfs lockd nfs_acl
auth_rpcgss sunrpc ipv6 sd_mod dm_snapshot dm_mirror dm_region_hash dm_log
dm_mod snd_usb_audio snd_usb_lib usbhid usb_storage hid usblp
snd_emu10k1_synth snd_emux_synth snd_seq_virmidi snd_seq_midi_emul
snd_emu10k1 snd_ac97_codec ac97_bus snd_pcm_oss snd_mixer_oss snd_pcm
snd_page_alloc snd_util_mem snd_hwdep ohci_hcd snd_seq_dummy ehci_hcd
snd_seq_oss snd_seq_midi snd_rawmidi usbcore snd_seq_midi_event snd_seq
3c59x snd_timer snd_seq_device snd nls_base soundcore emu10k1_gp gameport
amd_k7_agp i2c_amd756 mii i2c_core agpgart ide_cd_mod thermal psmouse
amd_rng shpchp button pci_hotplug container pcspkr cdrom rng_core evdev
serio_raw processor thermal_sys parport_pc parport floppy
[45012.674548]
[45012.674548] Pid: 0, comm: swapper Not tainted (2.6.31-rc6 #9) Unknown
[45012.674548] EIP: 0060:[<c1267059>] EFLAGS: 00000097 CPU: 1
[45012.674548] EIP is at _spin_lock_irq+0x15/0x1f
[45012.674548] EAX: c1c9a200 EBX: c1c9a200 ECX: f708f550 EDX: 0000a7a6
[45012.674548] ESI: 00000001 EDI: f708f708 EBP: f7093f1c ESP: f7093f1c
[45012.674548] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
[45012.674548] Process swapper (pid: 0, ti=f7092000 task=f708f550
task.ti=f7092000)
[45012.674548] Stack:
[45012.674548] f7093f94 c1265404 c1032f33 f7093f30 c13f0200 f708f704
c10032e6 c13ec1f4
[45012.674548] <0> f708f708 c13f0200 00000001 00000000 f7093f94 f7092000
00000001 0000007b
[45012.674548] <0> 000000d8 f708f550 c1395288 00000001 c1c9648c f7093f94
c104babe f7092000
[45012.674548] Call Trace:
[45012.674548] [<c1265404>] ? schedule+0xca/0x86c
[45012.674548] [<c1032f33>] ? irq_exit+0x39/0x5c
[45012.674548] [<c10032e6>] ? error_interrupt+0x2a/0x30
[45012.674548] [<c104babe>] ? tick_nohz_restart_sched_tick+0x2c/0x139
[45012.674548] [<c1001c15>] ? cpu_idle+0x60/0x63
[45012.674548] [<c1262bc7>] ? start_secondary+0x195/0x19a
[45012.674548] Code: 01 00 00 f0 66 0f c1 10 38 f2 74 06 f3 90 8a 10 eb f6
89 c8 c9 c3 55 89 e5 0f 1f 44 00 00 fa ba 00 01 00 00 f0 66 0f c1 10 38 f2
<74> 06 f3 90 8a 10 eb f6 c9 c3 55 89 e5 0f 1f 44 00 00 9c 5a fa
[45012.674548] ---[ end trace 7323d2c20b35b0b0 ]---
[45012.678548] BUG: NMI Watchdog detected LOCKUP on CPU0, ip c1013ff8,
registers:
[45012.678548] Modules linked in: vfat fat iptable_filter ip_tables
x_tables binfmt_misc ppdev lp autofs4 nfsd exportfs nfs lockd nfs_acl
auth_rpcgss sunrpc ipv6 sd_mod dm_snapshot dm_mirror dm_region_hash dm_log
dm_mod snd_usb_audio snd_usb_lib usbhid usb_storage hid usblp
snd_emu10k1_synth snd_emux_synth snd_seq_virmidi snd_seq_midi_emul
snd_emu10k1 snd_ac97_codec ac97_bus snd_pcm_oss snd_mixer_oss snd_pcm
snd_page_alloc snd_util_mem snd_hwdep ohci_hcd snd_seq_dummy ehci_hcd
snd_seq_oss snd_seq_midi snd_rawmidi usbcore snd_seq_midi_event snd_seq
3c59x snd_timer snd_seq_device snd nls_base soundcore emu10k1_gp gameport
amd_k7_agp i2c_amd756 mii i2c_core agpgart ide_cd_mod thermal psmouse
amd_rng shpchp button pci_hotplug container pcspkr cdrom rng_core evdev
serio_raw processor thermal_sys parport_pc parport floppy
[45012.678548]
[45012.678548] Pid: 9294, comm: backup2l Tainted: G D (2.6.31-rc6
#9) Unknown
[45012.678548] EIP: 0060:[<c1013ff8>] EFLAGS: 00000002 CPU: 0
[45012.678548] EIP is at default_send_IPI_mask_logical+0x5c/0x90
[45012.678548] EAX: 000018f0 EBX: 00000082 ECX: c1356b5c EDX: 00000800
[45012.678548] ESI: 000000fd EDI: 00000002 EBP: ceaf7e18 ESP: ceaf7e08
[45012.678548] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
[45012.678548] Process backup2l (pid: 9294, ti=ceaf6000 task=db276eb0
task.ti=ceaf6000)
[45012.678548] Stack:
[45012.678548] c126fcc8 000000a7 db13e9c0 c1c9a200 ceaf7e20 c10127ce
ceaf7e2c c1020ebf
[45012.678548] <0> c126fc08 ceaf7e34 c1020f2e ceaf7e64 c1029d20 00000000
00000000 00000001
[45012.678548] <0> 00000001 00000001 00000000 00000086 ce995e40 e1cee7f8
00000001 ceaf7e6c
[45012.678548] Call Trace:
[45012.678548] [<c10127ce>] ? native_smp_send_reschedule+0x44/0x46
[45012.678548] [<c1020ebf>] ? resched_task+0x5f/0x62
[45012.678548] [<c1020f2e>] ? check_preempt_curr_idle+0x13/0x15
[45012.678548] [<c1029d20>] ? try_to_wake_up+0x1f7/0x22b
[45012.678548] [<c1029d64>] ? default_wake_function+0x10/0x12
[45012.678548] [<c1040645>] ? autoremove_wake_function+0x14/0x34
[45012.678548] [<c101e110>] ? __wake_up_common+0x39/0x61
[45012.678548] [<c1021048>] ? __wake_up_sync_key+0x38/0x4a
[45012.678548] [<c1021069>] ? __wake_up_sync+0xf/0x12
[45012.678548] [<c10b0fd8>] ? pipe_release+0x5e/0x92
[45012.678548] [<c10b103c>] ? pipe_write_release+0x14/0x16
[45012.678548] [<c10ac42e>] ? __fput+0xcf/0x177
[45012.678548] [<c10ac4f0>] ? fput+0x1a/0x1c
[45012.678548] [<c10a982d>] ? filp_close+0x56/0x60
[45012.678548] [<c102f56b>] ? put_files_struct+0x5d/0xa1
[45012.678548] [<c102f5ea>] ? exit_files+0x3b/0x40
[45012.678548] [<c1030ca5>] ? do_exit+0x1ae/0x5b3
[45012.678548] [<c103110c>] ? do_group_exit+0x62/0x89
[45012.678548] [<c103114b>] ? sys_exit_group+0x18/0x1c
[45012.678548] [<c1002944>] ? sysenter_do_call+0x12/0x22
[45012.678548] Code: a1 18 6c 35 c1 83 fe 02 8b 50 20 75 10 89 55 f0 ff 90
b8 00 00 00 8b 55 f0 eb 12 f3 90 a1 d4 97 35 c1 8b 80 00 c3 ff ff f6 c4 10
<75> ee a1 d4 97 35 c1 c1 e7 18 2d f0 3c 00 00 89 38 89 f0 09 d0
[45012.678548] ---[ end trace 7323d2c20b35b0b1 ]---
[45014.223303] Clocksource tsc unstable (delta = 9374024843 ns)
[45014.229055] Fixing recursive fault but reboot is needed!
Always happens where one CPU is sending an IPI and the other has the rq
spinlock. Seems to be that the IPI expects the other CPU to not have
interrupts disabled or something?
Note, I've seen this on 2.6.30-rc6 as well (yes that's 2.6.30). But this
does not happen on 2.6.29. Unfortunately, 2.6.29 makes my NIC go kaputt
for some reason.
I've enabled LOCKDEP and it just makes the bug trigger easier.
Anyway, anyone have any ideas?
Since this can take up to a week to trigger, I'll start doing a git bisect
on it from 2.6.29 to 2.6.30-rc6. And see if I can find the change that
broke. Although this box is my DHCP/web/mail server, so it will be a PITA
:-(
Thanks,
-- Steve
^ permalink raw reply [flat|nested] 13+ messages in thread* Re: [BUG] lockup with the latest kernel 2009-08-19 15:49 [BUG] lockup with the latest kernel Steven Rostedt @ 2009-08-19 15:50 ` Steven Rostedt 2009-08-19 16:18 ` Andrew Morton 2009-08-27 22:41 ` Steven Rostedt 2 siblings, 0 replies; 13+ messages in thread From: Steven Rostedt @ 2009-08-19 15:50 UTC (permalink / raw) To: LKML; +Cc: Thomas Gleixner, Peter Zijlstra, Andrew Morton, Ingo Molnar [ Sorry for the repost, but I sent to Ingo's "special" account last time. /me needs to add an alias for mingo as well as ingo ] On Wed, 19 Aug 2009, Steven Rostedt wrote: > > I'm hitting this funny lockup and I'm not sure why. This box is an older > i386 SMP box that has two sockets (no dual core). > > $ cat /proc/cpuinfo > processor : 0 > vendor_id : AuthenticAMD > cpu family : 6 > model : 10 > model name : AMD Athlon(tm) MP 2800+ > stepping : 0 > cpu MHz : 2133.418 > cache size : 512 KB > fdiv_bug : no > hlt_bug : no > f00f_bug : no > coma_bug : no > fpu : yes > fpu_exception : yes > cpuid level : 1 > wp : yes > flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca > cmov pat pse36 mmx fxsr sse syscall mp mmxext 3dnowext 3dnow > bogomips : 4266.83 > clflush size : 32 > power management: ts > > processor : 1 > vendor_id : AuthenticAMD > [ same as above ] > > > $ lspci -vvv > [snip] > 00:07.3 Bridge: Advanced Micro Devices [AMD] AMD-768 [Opus] ACPI (rev 03) > Subsystem: Advanced Micro Devices [AMD] AMD-768 [Opus] ACPI > Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- > ParErr- Stepping- SERR- FastB2B- DisINTx- > Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- > <TAbort- <MAbort- >SERR- <PERR- INTx- > Kernel driver in use: amd756_smbus > [snip] > > Here's a few backtraces: > > [13288.222084] BUG: NMI Watchdog detected LOCKUP on CPU0, ip c0110821, > registers: > [13288.222084] Modules linked in: iptable_filter ip_tables x_tables > binfmt_misc ppdev lp autofs4 nfsd exportfs nfs lockd nfs_acl auth_rpcgss > sunrpc ipv6 sd_mod dm_snapshot dm_mirror dm_region_hash dm_log dm_mod > snd_usb_audio snd_usb_lib usbhid usb_storage hid usblp snd_emu10k1_synth > snd_emux_synth snd_seq_virmidi snd_seq_midi_emul snd_emu10k1 > snd_ac97_codec ac97_bus snd_pcm_oss snd_mixer_oss snd_pcm snd_page_alloc > snd_util_mem snd_hwdep snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi > snd_seq_midi_event snd_seq snd_timer 3c59x emu10k1_gp snd_seq_device mii > gameport ohci_hcd snd ehci_hcd soundcore usbcore shpchp pci_hotplug > i2c_amd756 amd_k7_agp i2c_core amd_rng agpgart rng_core container thermal > processor thermal_sys button ide_cd_mod evdev cdrom psmouse serio_raw > pcspkr parport_pc floppy parport > [13288.222084] > [13288.222084] Pid: 3, comm: migration/0 Not tainted (2.6.30-rc6 #3) > Unknown > [13288.222084] EIP: 0060:[<c0110821>] EFLAGS: 00000002 CPU: 0 > [13288.222084] EIP is at default_send_IPI_mask_logical+0x53/0x92 > [13288.222084] EAX: fffff000 EBX: 000000fd ECX: c042cccc EDX: 00000800 > [13288.222084] ESI: 00000002 EDI: 00000086 EBP: f7065f10 ESP: f7065f00 > [13288.222084] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 > [13288.222084] Process migration/0 (pid: 3, ti=f7064000 task=f705f0c0 > task.ti=f7064000) > [13288.222084] Stack: > [13288.222084] c0350550 0000007e efbfea20 c1819c60 f7065f18 c010f1fa > f7065f24 c011b69c > [13288.222084] c0350490 f7065f2c c011b70b f7065f5c c01213d3 00000000 > 00000000 00000001 > [13288.222084] 00000000 00000001 00000000 00000092 c5685f08 c5685f44 > 00000003 f7065f64 > [13288.222084] Call Trace: > [13288.222084] [<c010f1fa>] ? native_smp_send_reschedule+0x44/0x46 > [13288.222084] [<c011b69c>] ? resched_task+0x5f/0x62 > [13288.222084] [<c011b70b>] ? check_preempt_curr_idle+0x13/0x15 > [13288.222084] [<c01213d3>] ? try_to_wake_up+0x1f4/0x228 > [13288.222084] [<c0121417>] ? default_wake_function+0x10/0x12 > [13288.222084] [<c011a22f>] ? __wake_up_common+0x39/0x61 > [13288.222084] [<c011b7da>] ? complete+0x30/0x43 > [13288.222084] [<c0123eaa>] ? migration_thread+0x19d/0x1db > [13288.222084] [<c0123d0d>] ? migration_thread+0x0/0x1db > [13288.222084] [<c0137411>] ? kthread+0x4b/0x6f > [13288.222084] [<c01373c6>] ? kthread+0x0/0x6f > [13288.222084] [<c01033cf>] ? kernel_thread_helper+0x7/0x10 > [13288.222084] Code: eb 27 3b c0 e8 56 62 01 00 a1 88 cd 42 c0 83 fb 02 8b > 50 20 75 10 89 55 f0 ff 90 b8 00 00 00 8b 55 f0 eb 12 f3 90 a1 f4 f8 42 c0 > <8b> 80 00 c3 ff ff f6 c4 10 75 ee a1 f4 f8 42 c0 c1 e6 18 2d f0 > [13288.222084] ---[ end trace 94cd80322ca2a918 ]--- > [13288.222093] BUG: NMI Watchdog detected LOCKUP on CPU1, ip c03492d5, > registers: > [13288.222093] Modules linked in: iptable_filter ip_tables x_tables > binfmt_misc ppdev lp autofs4 nfsd exportfs nfs lockd nfs_acl auth_rpcgss > sunrpc ipv6 sd_mod dm_snapshot dm_mirror dm_region_hash dm_log dm_mod > snd_usb_audio snd_usb_lib usbhid usb_storage hid usblp snd_emu10k1_synth > snd_emux_synth snd_seq_virmidi snd_seq_midi_emul snd_emu10k1 > snd_ac97_codec ac97_bus snd_pcm_oss snd_mixer_oss snd_pcm snd_page_alloc > snd_util_mem snd_hwdep snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi > snd_seq_midi_event snd_seq snd_timer 3c59x emu10k1_gp snd_seq_device mii > gameport ohci_hcd snd ehci_hcd soundcore usbcore shpchp pci_hotplug > i2c_amd756 amd_k7_agp i2c_core amd_rng agpgart rng_core container thermal > processor thermal_sys button ide_cd_mod evdev cdrom psmouse serio_raw > pcspkr parport_pc floppy parport > [13288.222093] > [13288.222093] Pid: 0, comm: swapper Tainted: G D (2.6.30-rc6 #3) > Unknown > [13288.222093] EIP: 0060:[<c03492d5>] EFLAGS: 00000097 CPU: 1 > [13288.222093] EIP is at _spin_lock_irq+0x19/0x1f > [13288.222093] EAX: c1819c60 EBX: c1819c60 ECX: f708f6b0 EDX: 00007e7d > [13288.222093] ESI: c04bfc60 EDI: c04bcd18 EBP: f7093f18 ESP: f7093f18 > [13288.222093] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 > [13288.222093] Process swapper (pid: 0, ti=f7092000 task=f708f500 > task.ti=f7092000) > [13288.222093] Stack: > [13288.222093] f7093f8c c0347415 f7093f28 c012b204 f708f6b0 c01100ba > c04bcd18 c04bfc60 > [13288.222093] f708f6b4 c1816488 00000000 00000001 00000000 f7093f94 > f7092000 0000007b > [13288.222093] f708f500 000000d8 00000000 c0468e60 00000001 c1816f8c > f7093f94 c01424e4 > [13288.222093] Call Trace: > [13288.222093] [<c0347415>] ? __schedule+0xc8/0x855 > [13288.222093] [<c012b204>] ? irq_exit+0x39/0x5c > [13288.222093] [<c01100ba>] ? smp_error_interrupt+0x61/0x63 > [13288.222093] [<c01424e4>] ? tick_nohz_restart_sched_tick+0x2c/0x139 > [13288.222093] [<c0347baf>] ? schedule+0xd/0x1c > [13288.222093] [<c0101c5d>] ? cpu_idle+0x60/0x63 > [13288.222093] [<c0344d11>] ? start_secondary+0x195/0x19a > [13288.222093] Code: 66 0f c1 10 38 f2 74 06 f3 90 8a 10 eb f6 89 c8 5d c3 > 55 89 e5 0f 1f 44 00 00 fa ba 00 01 00 00 f0 66 0f c1 10 38 f2 74 06 f3 90 > <8a> 10 eb f6 5d c3 55 89 e5 0f 1f 44 00 00 9c 5a fa f0 83 28 01 > [13288.222093] ---[ end trace 94cd80322ca2a919 ]--- > > > And another instance: > > : > [45012.674548] Modules linked in: vfat fat iptable_filter ip_tables > x_tables binfmt_misc ppdev lp autofs4 nfsd exportfs nfs lockd nfs_acl > auth_rpcgss sunrpc ipv6 sd_mod dm_snapshot dm_mirror dm_region_hash dm_log > dm_mod snd_usb_audio snd_usb_lib usbhid usb_storage hid usblp > snd_emu10k1_synth snd_emux_synth snd_seq_virmidi snd_seq_midi_emul > snd_emu10k1 snd_ac97_codec ac97_bus snd_pcm_oss snd_mixer_oss snd_pcm > snd_page_alloc snd_util_mem snd_hwdep ohci_hcd snd_seq_dummy ehci_hcd > snd_seq_oss snd_seq_midi snd_rawmidi usbcore snd_seq_midi_event snd_seq > 3c59x snd_timer snd_seq_device snd nls_base soundcore emu10k1_gp gameport > amd_k7_agp i2c_amd756 mii i2c_core agpgart ide_cd_mod thermal psmouse > amd_rng shpchp button pci_hotplug container pcspkr cdrom rng_core evdev > serio_raw processor thermal_sys parport_pc parport floppy > [45012.674548] > [45012.674548] Pid: 0, comm: swapper Not tainted (2.6.31-rc6 #9) Unknown > [45012.674548] EIP: 0060:[<c1267059>] EFLAGS: 00000097 CPU: 1 > [45012.674548] EIP is at _spin_lock_irq+0x15/0x1f > [45012.674548] EAX: c1c9a200 EBX: c1c9a200 ECX: f708f550 EDX: 0000a7a6 > [45012.674548] ESI: 00000001 EDI: f708f708 EBP: f7093f1c ESP: f7093f1c > [45012.674548] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 > [45012.674548] Process swapper (pid: 0, ti=f7092000 task=f708f550 > task.ti=f7092000) > [45012.674548] Stack: > [45012.674548] f7093f94 c1265404 c1032f33 f7093f30 c13f0200 f708f704 > c10032e6 c13ec1f4 > [45012.674548] <0> f708f708 c13f0200 00000001 00000000 f7093f94 f7092000 > 00000001 0000007b > [45012.674548] <0> 000000d8 f708f550 c1395288 00000001 c1c9648c f7093f94 > c104babe f7092000 > [45012.674548] Call Trace: > [45012.674548] [<c1265404>] ? schedule+0xca/0x86c > [45012.674548] [<c1032f33>] ? irq_exit+0x39/0x5c > [45012.674548] [<c10032e6>] ? error_interrupt+0x2a/0x30 > [45012.674548] [<c104babe>] ? tick_nohz_restart_sched_tick+0x2c/0x139 > [45012.674548] [<c1001c15>] ? cpu_idle+0x60/0x63 > [45012.674548] [<c1262bc7>] ? start_secondary+0x195/0x19a > [45012.674548] Code: 01 00 00 f0 66 0f c1 10 38 f2 74 06 f3 90 8a 10 eb f6 > 89 c8 c9 c3 55 89 e5 0f 1f 44 00 00 fa ba 00 01 00 00 f0 66 0f c1 10 38 f2 > <74> 06 f3 90 8a 10 eb f6 c9 c3 55 89 e5 0f 1f 44 00 00 9c 5a fa > [45012.674548] ---[ end trace 7323d2c20b35b0b0 ]--- > [45012.678548] BUG: NMI Watchdog detected LOCKUP on CPU0, ip c1013ff8, > registers: > [45012.678548] Modules linked in: vfat fat iptable_filter ip_tables > x_tables binfmt_misc ppdev lp autofs4 nfsd exportfs nfs lockd nfs_acl > auth_rpcgss sunrpc ipv6 sd_mod dm_snapshot dm_mirror dm_region_hash dm_log > dm_mod snd_usb_audio snd_usb_lib usbhid usb_storage hid usblp > snd_emu10k1_synth snd_emux_synth snd_seq_virmidi snd_seq_midi_emul > snd_emu10k1 snd_ac97_codec ac97_bus snd_pcm_oss snd_mixer_oss snd_pcm > snd_page_alloc snd_util_mem snd_hwdep ohci_hcd snd_seq_dummy ehci_hcd > snd_seq_oss snd_seq_midi snd_rawmidi usbcore snd_seq_midi_event snd_seq > 3c59x snd_timer snd_seq_device snd nls_base soundcore emu10k1_gp gameport > amd_k7_agp i2c_amd756 mii i2c_core agpgart ide_cd_mod thermal psmouse > amd_rng shpchp button pci_hotplug container pcspkr cdrom rng_core evdev > serio_raw processor thermal_sys parport_pc parport floppy > [45012.678548] > [45012.678548] Pid: 9294, comm: backup2l Tainted: G D (2.6.31-rc6 > #9) Unknown > [45012.678548] EIP: 0060:[<c1013ff8>] EFLAGS: 00000002 CPU: 0 > [45012.678548] EIP is at default_send_IPI_mask_logical+0x5c/0x90 > [45012.678548] EAX: 000018f0 EBX: 00000082 ECX: c1356b5c EDX: 00000800 > [45012.678548] ESI: 000000fd EDI: 00000002 EBP: ceaf7e18 ESP: ceaf7e08 > [45012.678548] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 > [45012.678548] Process backup2l (pid: 9294, ti=ceaf6000 task=db276eb0 > task.ti=ceaf6000) > [45012.678548] Stack: > [45012.678548] c126fcc8 000000a7 db13e9c0 c1c9a200 ceaf7e20 c10127ce > ceaf7e2c c1020ebf > [45012.678548] <0> c126fc08 ceaf7e34 c1020f2e ceaf7e64 c1029d20 00000000 > 00000000 00000001 > [45012.678548] <0> 00000001 00000001 00000000 00000086 ce995e40 e1cee7f8 > 00000001 ceaf7e6c > [45012.678548] Call Trace: > [45012.678548] [<c10127ce>] ? native_smp_send_reschedule+0x44/0x46 > [45012.678548] [<c1020ebf>] ? resched_task+0x5f/0x62 > [45012.678548] [<c1020f2e>] ? check_preempt_curr_idle+0x13/0x15 > [45012.678548] [<c1029d20>] ? try_to_wake_up+0x1f7/0x22b > [45012.678548] [<c1029d64>] ? default_wake_function+0x10/0x12 > [45012.678548] [<c1040645>] ? autoremove_wake_function+0x14/0x34 > [45012.678548] [<c101e110>] ? __wake_up_common+0x39/0x61 > [45012.678548] [<c1021048>] ? __wake_up_sync_key+0x38/0x4a > [45012.678548] [<c1021069>] ? __wake_up_sync+0xf/0x12 > [45012.678548] [<c10b0fd8>] ? pipe_release+0x5e/0x92 > [45012.678548] [<c10b103c>] ? pipe_write_release+0x14/0x16 > [45012.678548] [<c10ac42e>] ? __fput+0xcf/0x177 > [45012.678548] [<c10ac4f0>] ? fput+0x1a/0x1c > [45012.678548] [<c10a982d>] ? filp_close+0x56/0x60 > [45012.678548] [<c102f56b>] ? put_files_struct+0x5d/0xa1 > [45012.678548] [<c102f5ea>] ? exit_files+0x3b/0x40 > [45012.678548] [<c1030ca5>] ? do_exit+0x1ae/0x5b3 > [45012.678548] [<c103110c>] ? do_group_exit+0x62/0x89 > [45012.678548] [<c103114b>] ? sys_exit_group+0x18/0x1c > [45012.678548] [<c1002944>] ? sysenter_do_call+0x12/0x22 > [45012.678548] Code: a1 18 6c 35 c1 83 fe 02 8b 50 20 75 10 89 55 f0 ff 90 > b8 00 00 00 8b 55 f0 eb 12 f3 90 a1 d4 97 35 c1 8b 80 00 c3 ff ff f6 c4 10 > <75> ee a1 d4 97 35 c1 c1 e7 18 2d f0 3c 00 00 89 38 89 f0 09 d0 > [45012.678548] ---[ end trace 7323d2c20b35b0b1 ]--- > [45014.223303] Clocksource tsc unstable (delta = 9374024843 ns) > [45014.229055] Fixing recursive fault but reboot is needed! > > > Always happens where one CPU is sending an IPI and the other has the rq > spinlock. Seems to be that the IPI expects the other CPU to not have > interrupts disabled or something? > > Note, I've seen this on 2.6.30-rc6 as well (yes that's 2.6.30). But this > does not happen on 2.6.29. Unfortunately, 2.6.29 makes my NIC go kaputt > for some reason. > > I've enabled LOCKDEP and it just makes the bug trigger easier. > > Anyway, anyone have any ideas? > > Since this can take up to a week to trigger, I'll start doing a git bisect > on it from 2.6.29 to 2.6.30-rc6. And see if I can find the change that > broke. Although this box is my DHCP/web/mail server, so it will be a PITA > :-( > > Thanks, > > -- Steve > > ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [BUG] lockup with the latest kernel 2009-08-19 15:49 [BUG] lockup with the latest kernel Steven Rostedt 2009-08-19 15:50 ` Steven Rostedt @ 2009-08-19 16:18 ` Andrew Morton 2009-08-27 22:41 ` Steven Rostedt 2 siblings, 0 replies; 13+ messages in thread From: Andrew Morton @ 2009-08-19 16:18 UTC (permalink / raw) To: Steven Rostedt; +Cc: LKML, Thomas Gleixner, Peter Zijlstra, Ingo Molnar On Wed, 19 Aug 2009 11:49:25 -0400 (EDT) Steven Rostedt <rostedt@goodmis.org> wrote: > Always happens where one CPU is sending an IPI and the other has the rq > spinlock. Seems to be that the IPI expects the other CPU to not have > interrupts disabled or something? > > Note, I've seen this on 2.6.30-rc6 as well (yes that's 2.6.30). But this > does not happen on 2.6.29. Unfortunately, 2.6.29 makes my NIC go kaputt > for some reason. > > I've enabled LOCKDEP and it just makes the bug trigger easier. > > Anyway, anyone have any ideas? We'd need to see the backtrace on the target CPU. It shouldn't be too hard - set that CPU's bit in arch/x86/kernel/apic/nmi.c:backtrace_mask and then clear it again when that CPU has responded. Or even: diff -puN arch/x86/kernel/apic/nmi.c~a arch/x86/kernel/apic/nmi.c --- a/arch/x86/kernel/apic/nmi.c~a +++ a/arch/x86/kernel/apic/nmi.c @@ -387,6 +387,8 @@ void touch_nmi_watchdog(void) } EXPORT_SYMBOL(touch_nmi_watchdog); +extern int wizzle; + notrace __kprobes int nmi_watchdog_tick(struct pt_regs *regs, unsigned reason) { @@ -415,7 +417,8 @@ nmi_watchdog_tick(struct pt_regs *regs, } /* We can be called before check_nmi_watchdog, hence NULL check. */ - if (backtrace_mask != NULL && cpumask_test_cpu(cpu, backtrace_mask)) { + if (cpu == wizzle || + (backtrace_mask != NULL && cpumask_test_cpu(cpu, backtrace_mask))) { static DEFINE_SPINLOCK(lock); /* Serialise the printks */ spin_lock(&lock); diff -puN arch/x86/kernel/smp.c~a arch/x86/kernel/smp.c --- a/arch/x86/kernel/smp.c~a +++ a/arch/x86/kernel/smp.c @@ -111,13 +111,17 @@ * it goes straight through and wastes no time serializing * anything. Worst case is that we lose a reschedule ... */ +int wizzle = -1; + static void native_smp_send_reschedule(int cpu) { if (unlikely(cpu_is_offline(cpu))) { WARN_ON(1); return; } + wizzle = cpu; apic->send_IPI_mask(cpumask_of(cpu), RESCHEDULE_VECTOR); + wizzle = -1; } void native_send_call_func_single_ipi(int cpu) _ ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [BUG] lockup with the latest kernel 2009-08-19 15:49 [BUG] lockup with the latest kernel Steven Rostedt 2009-08-19 15:50 ` Steven Rostedt 2009-08-19 16:18 ` Andrew Morton @ 2009-08-27 22:41 ` Steven Rostedt 2009-08-27 22:45 ` Steven Rostedt 2009-08-28 2:46 ` Tejun Heo 2 siblings, 2 replies; 13+ messages in thread From: Steven Rostedt @ 2009-08-27 22:41 UTC (permalink / raw) To: LKML; +Cc: Thomas Gleixner, Peter Zijlstra, Ingo Molnar, Tejun Heo, Andrew Morton After a long git bisect, it seems that the bug was introduced by this commit: commit 02cf94c370e0dc9bf408fe45eb86fe9ad58eaf7f Author: Tejun Heo <tj@kernel.org> Date: Wed Jan 21 17:26:06 2009 +0900 x86: make x86_32 use tlb_64.c Impact: less contention when issuing invalidate IPI, cleanup Make x86_32 use the same tlb code as 64bit. The 64bit code uses multiple IPI vectors for tlb shootdown to reduce contention. This patch makes x86_32 allocate the same 8 IPIs as x86_64 and share the code paths. Note that the usage of asmlinkage is inconsistent for x86_32 and 64 and calls for further cleanup. This has been noted with a FIXME comment in tlb_64.c. Signed-off-by: Tejun Heo <tj@kernel.org> I can easily hit this bug at this commit, but I ran for a week on the commit before it. Thus I'm assuming this is the bug (but I'm not 100% sure). It looks like a probable candidate. Could the TLB be overlapping with the APIC somehow where the access to the register is being cached? Thoughts? -- Steve Keeping original post below for Tejun's reference. On Wed, 19 Aug 2009, Steven Rostedt wrote: > > I'm hitting this funny lockup and I'm not sure why. This box is an older > i386 SMP box that has two sockets (no dual core). > > $ cat /proc/cpuinfo > processor : 0 > vendor_id : AuthenticAMD > cpu family : 6 > model : 10 > model name : AMD Athlon(tm) MP 2800+ > stepping : 0 > cpu MHz : 2133.418 > cache size : 512 KB > fdiv_bug : no > hlt_bug : no > f00f_bug : no > coma_bug : no > fpu : yes > fpu_exception : yes > cpuid level : 1 > wp : yes > flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca > cmov pat pse36 mmx fxsr sse syscall mp mmxext 3dnowext 3dnow > bogomips : 4266.83 > clflush size : 32 > power management: ts > > processor : 1 > vendor_id : AuthenticAMD > [ same as above ] > > > $ lspci -vvv > [snip] > 00:07.3 Bridge: Advanced Micro Devices [AMD] AMD-768 [Opus] ACPI (rev 03) > Subsystem: Advanced Micro Devices [AMD] AMD-768 [Opus] ACPI > Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- > ParErr- Stepping- SERR- FastB2B- DisINTx- > Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- > <TAbort- <MAbort- >SERR- <PERR- INTx- > Kernel driver in use: amd756_smbus > [snip] > > Here's a few backtraces: > > [13288.222084] BUG: NMI Watchdog detected LOCKUP on CPU0, ip c0110821, > registers: > [13288.222084] Modules linked in: iptable_filter ip_tables x_tables > binfmt_misc ppdev lp autofs4 nfsd exportfs nfs lockd nfs_acl auth_rpcgss > sunrpc ipv6 sd_mod dm_snapshot dm_mirror dm_region_hash dm_log dm_mod > snd_usb_audio snd_usb_lib usbhid usb_storage hid usblp snd_emu10k1_synth > snd_emux_synth snd_seq_virmidi snd_seq_midi_emul snd_emu10k1 > snd_ac97_codec ac97_bus snd_pcm_oss snd_mixer_oss snd_pcm snd_page_alloc > snd_util_mem snd_hwdep snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi > snd_seq_midi_event snd_seq snd_timer 3c59x emu10k1_gp snd_seq_device mii > gameport ohci_hcd snd ehci_hcd soundcore usbcore shpchp pci_hotplug > i2c_amd756 amd_k7_agp i2c_core amd_rng agpgart rng_core container thermal > processor thermal_sys button ide_cd_mod evdev cdrom psmouse serio_raw > pcspkr parport_pc floppy parport > [13288.222084] > [13288.222084] Pid: 3, comm: migration/0 Not tainted (2.6.30-rc6 #3) > Unknown > [13288.222084] EIP: 0060:[<c0110821>] EFLAGS: 00000002 CPU: 0 > [13288.222084] EIP is at default_send_IPI_mask_logical+0x53/0x92 > [13288.222084] EAX: fffff000 EBX: 000000fd ECX: c042cccc EDX: 00000800 > [13288.222084] ESI: 00000002 EDI: 00000086 EBP: f7065f10 ESP: f7065f00 > [13288.222084] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 > [13288.222084] Process migration/0 (pid: 3, ti=f7064000 task=f705f0c0 > task.ti=f7064000) > [13288.222084] Stack: > [13288.222084] c0350550 0000007e efbfea20 c1819c60 f7065f18 c010f1fa > f7065f24 c011b69c > [13288.222084] c0350490 f7065f2c c011b70b f7065f5c c01213d3 00000000 > 00000000 00000001 > [13288.222084] 00000000 00000001 00000000 00000092 c5685f08 c5685f44 > 00000003 f7065f64 > [13288.222084] Call Trace: > [13288.222084] [<c010f1fa>] ? native_smp_send_reschedule+0x44/0x46 > [13288.222084] [<c011b69c>] ? resched_task+0x5f/0x62 > [13288.222084] [<c011b70b>] ? check_preempt_curr_idle+0x13/0x15 > [13288.222084] [<c01213d3>] ? try_to_wake_up+0x1f4/0x228 > [13288.222084] [<c0121417>] ? default_wake_function+0x10/0x12 > [13288.222084] [<c011a22f>] ? __wake_up_common+0x39/0x61 > [13288.222084] [<c011b7da>] ? complete+0x30/0x43 > [13288.222084] [<c0123eaa>] ? migration_thread+0x19d/0x1db > [13288.222084] [<c0123d0d>] ? migration_thread+0x0/0x1db > [13288.222084] [<c0137411>] ? kthread+0x4b/0x6f > [13288.222084] [<c01373c6>] ? kthread+0x0/0x6f > [13288.222084] [<c01033cf>] ? kernel_thread_helper+0x7/0x10 > [13288.222084] Code: eb 27 3b c0 e8 56 62 01 00 a1 88 cd 42 c0 83 fb 02 8b > 50 20 75 10 89 55 f0 ff 90 b8 00 00 00 8b 55 f0 eb 12 f3 90 a1 f4 f8 42 c0 > <8b> 80 00 c3 ff ff f6 c4 10 75 ee a1 f4 f8 42 c0 c1 e6 18 2d f0 > [13288.222084] ---[ end trace 94cd80322ca2a918 ]--- > [13288.222093] BUG: NMI Watchdog detected LOCKUP on CPU1, ip c03492d5, > registers: > [13288.222093] Modules linked in: iptable_filter ip_tables x_tables > binfmt_misc ppdev lp autofs4 nfsd exportfs nfs lockd nfs_acl auth_rpcgss > sunrpc ipv6 sd_mod dm_snapshot dm_mirror dm_region_hash dm_log dm_mod > snd_usb_audio snd_usb_lib usbhid usb_storage hid usblp snd_emu10k1_synth > snd_emux_synth snd_seq_virmidi snd_seq_midi_emul snd_emu10k1 > snd_ac97_codec ac97_bus snd_pcm_oss snd_mixer_oss snd_pcm snd_page_alloc > snd_util_mem snd_hwdep snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi > snd_seq_midi_event snd_seq snd_timer 3c59x emu10k1_gp snd_seq_device mii > gameport ohci_hcd snd ehci_hcd soundcore usbcore shpchp pci_hotplug > i2c_amd756 amd_k7_agp i2c_core amd_rng agpgart rng_core container thermal > processor thermal_sys button ide_cd_mod evdev cdrom psmouse serio_raw > pcspkr parport_pc floppy parport > [13288.222093] > [13288.222093] Pid: 0, comm: swapper Tainted: G D (2.6.30-rc6 #3) > Unknown > [13288.222093] EIP: 0060:[<c03492d5>] EFLAGS: 00000097 CPU: 1 > [13288.222093] EIP is at _spin_lock_irq+0x19/0x1f > [13288.222093] EAX: c1819c60 EBX: c1819c60 ECX: f708f6b0 EDX: 00007e7d > [13288.222093] ESI: c04bfc60 EDI: c04bcd18 EBP: f7093f18 ESP: f7093f18 > [13288.222093] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 > [13288.222093] Process swapper (pid: 0, ti=f7092000 task=f708f500 > task.ti=f7092000) > [13288.222093] Stack: > [13288.222093] f7093f8c c0347415 f7093f28 c012b204 f708f6b0 c01100ba > c04bcd18 c04bfc60 > [13288.222093] f708f6b4 c1816488 00000000 00000001 00000000 f7093f94 > f7092000 0000007b > [13288.222093] f708f500 000000d8 00000000 c0468e60 00000001 c1816f8c > f7093f94 c01424e4 > [13288.222093] Call Trace: > [13288.222093] [<c0347415>] ? __schedule+0xc8/0x855 > [13288.222093] [<c012b204>] ? irq_exit+0x39/0x5c > [13288.222093] [<c01100ba>] ? smp_error_interrupt+0x61/0x63 > [13288.222093] [<c01424e4>] ? tick_nohz_restart_sched_tick+0x2c/0x139 > [13288.222093] [<c0347baf>] ? schedule+0xd/0x1c > [13288.222093] [<c0101c5d>] ? cpu_idle+0x60/0x63 > [13288.222093] [<c0344d11>] ? start_secondary+0x195/0x19a > [13288.222093] Code: 66 0f c1 10 38 f2 74 06 f3 90 8a 10 eb f6 89 c8 5d c3 > 55 89 e5 0f 1f 44 00 00 fa ba 00 01 00 00 f0 66 0f c1 10 38 f2 74 06 f3 90 > <8a> 10 eb f6 5d c3 55 89 e5 0f 1f 44 00 00 9c 5a fa f0 83 28 01 > [13288.222093] ---[ end trace 94cd80322ca2a919 ]--- > > > And another instance: > > : > [45012.674548] Modules linked in: vfat fat iptable_filter ip_tables > x_tables binfmt_misc ppdev lp autofs4 nfsd exportfs nfs lockd nfs_acl > auth_rpcgss sunrpc ipv6 sd_mod dm_snapshot dm_mirror dm_region_hash dm_log > dm_mod snd_usb_audio snd_usb_lib usbhid usb_storage hid usblp > snd_emu10k1_synth snd_emux_synth snd_seq_virmidi snd_seq_midi_emul > snd_emu10k1 snd_ac97_codec ac97_bus snd_pcm_oss snd_mixer_oss snd_pcm > snd_page_alloc snd_util_mem snd_hwdep ohci_hcd snd_seq_dummy ehci_hcd > snd_seq_oss snd_seq_midi snd_rawmidi usbcore snd_seq_midi_event snd_seq > 3c59x snd_timer snd_seq_device snd nls_base soundcore emu10k1_gp gameport > amd_k7_agp i2c_amd756 mii i2c_core agpgart ide_cd_mod thermal psmouse > amd_rng shpchp button pci_hotplug container pcspkr cdrom rng_core evdev > serio_raw processor thermal_sys parport_pc parport floppy > [45012.674548] > [45012.674548] Pid: 0, comm: swapper Not tainted (2.6.31-rc6 #9) Unknown > [45012.674548] EIP: 0060:[<c1267059>] EFLAGS: 00000097 CPU: 1 > [45012.674548] EIP is at _spin_lock_irq+0x15/0x1f > [45012.674548] EAX: c1c9a200 EBX: c1c9a200 ECX: f708f550 EDX: 0000a7a6 > [45012.674548] ESI: 00000001 EDI: f708f708 EBP: f7093f1c ESP: f7093f1c > [45012.674548] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 > [45012.674548] Process swapper (pid: 0, ti=f7092000 task=f708f550 > task.ti=f7092000) > [45012.674548] Stack: > [45012.674548] f7093f94 c1265404 c1032f33 f7093f30 c13f0200 f708f704 > c10032e6 c13ec1f4 > [45012.674548] <0> f708f708 c13f0200 00000001 00000000 f7093f94 f7092000 > 00000001 0000007b > [45012.674548] <0> 000000d8 f708f550 c1395288 00000001 c1c9648c f7093f94 > c104babe f7092000 > [45012.674548] Call Trace: > [45012.674548] [<c1265404>] ? schedule+0xca/0x86c > [45012.674548] [<c1032f33>] ? irq_exit+0x39/0x5c > [45012.674548] [<c10032e6>] ? error_interrupt+0x2a/0x30 > [45012.674548] [<c104babe>] ? tick_nohz_restart_sched_tick+0x2c/0x139 > [45012.674548] [<c1001c15>] ? cpu_idle+0x60/0x63 > [45012.674548] [<c1262bc7>] ? start_secondary+0x195/0x19a > [45012.674548] Code: 01 00 00 f0 66 0f c1 10 38 f2 74 06 f3 90 8a 10 eb f6 > 89 c8 c9 c3 55 89 e5 0f 1f 44 00 00 fa ba 00 01 00 00 f0 66 0f c1 10 38 f2 > <74> 06 f3 90 8a 10 eb f6 c9 c3 55 89 e5 0f 1f 44 00 00 9c 5a fa > [45012.674548] ---[ end trace 7323d2c20b35b0b0 ]--- > [45012.678548] BUG: NMI Watchdog detected LOCKUP on CPU0, ip c1013ff8, > registers: > [45012.678548] Modules linked in: vfat fat iptable_filter ip_tables > x_tables binfmt_misc ppdev lp autofs4 nfsd exportfs nfs lockd nfs_acl > auth_rpcgss sunrpc ipv6 sd_mod dm_snapshot dm_mirror dm_region_hash dm_log > dm_mod snd_usb_audio snd_usb_lib usbhid usb_storage hid usblp > snd_emu10k1_synth snd_emux_synth snd_seq_virmidi snd_seq_midi_emul > snd_emu10k1 snd_ac97_codec ac97_bus snd_pcm_oss snd_mixer_oss snd_pcm > snd_page_alloc snd_util_mem snd_hwdep ohci_hcd snd_seq_dummy ehci_hcd > snd_seq_oss snd_seq_midi snd_rawmidi usbcore snd_seq_midi_event snd_seq > 3c59x snd_timer snd_seq_device snd nls_base soundcore emu10k1_gp gameport > amd_k7_agp i2c_amd756 mii i2c_core agpgart ide_cd_mod thermal psmouse > amd_rng shpchp button pci_hotplug container pcspkr cdrom rng_core evdev > serio_raw processor thermal_sys parport_pc parport floppy > [45012.678548] > [45012.678548] Pid: 9294, comm: backup2l Tainted: G D (2.6.31-rc6 > #9) Unknown > [45012.678548] EIP: 0060:[<c1013ff8>] EFLAGS: 00000002 CPU: 0 > [45012.678548] EIP is at default_send_IPI_mask_logical+0x5c/0x90 > [45012.678548] EAX: 000018f0 EBX: 00000082 ECX: c1356b5c EDX: 00000800 > [45012.678548] ESI: 000000fd EDI: 00000002 EBP: ceaf7e18 ESP: ceaf7e08 > [45012.678548] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 > [45012.678548] Process backup2l (pid: 9294, ti=ceaf6000 task=db276eb0 > task.ti=ceaf6000) > [45012.678548] Stack: > [45012.678548] c126fcc8 000000a7 db13e9c0 c1c9a200 ceaf7e20 c10127ce > ceaf7e2c c1020ebf > [45012.678548] <0> c126fc08 ceaf7e34 c1020f2e ceaf7e64 c1029d20 00000000 > 00000000 00000001 > [45012.678548] <0> 00000001 00000001 00000000 00000086 ce995e40 e1cee7f8 > 00000001 ceaf7e6c > [45012.678548] Call Trace: > [45012.678548] [<c10127ce>] ? native_smp_send_reschedule+0x44/0x46 > [45012.678548] [<c1020ebf>] ? resched_task+0x5f/0x62 > [45012.678548] [<c1020f2e>] ? check_preempt_curr_idle+0x13/0x15 > [45012.678548] [<c1029d20>] ? try_to_wake_up+0x1f7/0x22b > [45012.678548] [<c1029d64>] ? default_wake_function+0x10/0x12 > [45012.678548] [<c1040645>] ? autoremove_wake_function+0x14/0x34 > [45012.678548] [<c101e110>] ? __wake_up_common+0x39/0x61 > [45012.678548] [<c1021048>] ? __wake_up_sync_key+0x38/0x4a > [45012.678548] [<c1021069>] ? __wake_up_sync+0xf/0x12 > [45012.678548] [<c10b0fd8>] ? pipe_release+0x5e/0x92 > [45012.678548] [<c10b103c>] ? pipe_write_release+0x14/0x16 > [45012.678548] [<c10ac42e>] ? __fput+0xcf/0x177 > [45012.678548] [<c10ac4f0>] ? fput+0x1a/0x1c > [45012.678548] [<c10a982d>] ? filp_close+0x56/0x60 > [45012.678548] [<c102f56b>] ? put_files_struct+0x5d/0xa1 > [45012.678548] [<c102f5ea>] ? exit_files+0x3b/0x40 > [45012.678548] [<c1030ca5>] ? do_exit+0x1ae/0x5b3 > [45012.678548] [<c103110c>] ? do_group_exit+0x62/0x89 > [45012.678548] [<c103114b>] ? sys_exit_group+0x18/0x1c > [45012.678548] [<c1002944>] ? sysenter_do_call+0x12/0x22 > [45012.678548] Code: a1 18 6c 35 c1 83 fe 02 8b 50 20 75 10 89 55 f0 ff 90 > b8 00 00 00 8b 55 f0 eb 12 f3 90 a1 d4 97 35 c1 8b 80 00 c3 ff ff f6 c4 10 > <75> ee a1 d4 97 35 c1 c1 e7 18 2d f0 3c 00 00 89 38 89 f0 09 d0 > [45012.678548] ---[ end trace 7323d2c20b35b0b1 ]--- > [45014.223303] Clocksource tsc unstable (delta = 9374024843 ns) > [45014.229055] Fixing recursive fault but reboot is needed! > > > Always happens where one CPU is sending an IPI and the other has the rq > spinlock. Seems to be that the IPI expects the other CPU to not have > interrupts disabled or something? > > Note, I've seen this on 2.6.30-rc6 as well (yes that's 2.6.30). But this > does not happen on 2.6.29. Unfortunately, 2.6.29 makes my NIC go kaputt > for some reason. > > I've enabled LOCKDEP and it just makes the bug trigger easier. > > Anyway, anyone have any ideas? > > Since this can take up to a week to trigger, I'll start doing a git bisect > on it from 2.6.29 to 2.6.30-rc6. And see if I can find the change that > broke. Although this box is my DHCP/web/mail server, so it will be a PITA > :-( > > Thanks, > > -- Steve > > ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [BUG] lockup with the latest kernel 2009-08-27 22:41 ` Steven Rostedt @ 2009-08-27 22:45 ` Steven Rostedt 2009-08-28 2:46 ` Tejun Heo 1 sibling, 0 replies; 13+ messages in thread From: Steven Rostedt @ 2009-08-27 22:45 UTC (permalink / raw) To: LKML; +Cc: Thomas Gleixner, Peter Zijlstra, Ingo Molnar, Tejun Heo, Andrew Morton On Thu, 27 Aug 2009, Steven Rostedt wrote: > > > > Always happens where one CPU is sending an IPI and the other has the rq > > spinlock. Seems to be that the IPI expects the other CPU to not have > > interrupts disabled or something? > > > > Note, I've seen this on 2.6.30-rc6 as well (yes that's 2.6.30). But this > > does not happen on 2.6.29. Unfortunately, 2.6.29 makes my NIC go kaputt > > for some reason. Clearification, it did happen on 2.6.29, I thought I was running 29, but it turned out, my previous kernel was 27. -- Steve ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [BUG] lockup with the latest kernel 2009-08-27 22:41 ` Steven Rostedt 2009-08-27 22:45 ` Steven Rostedt @ 2009-08-28 2:46 ` Tejun Heo 2009-08-28 2:52 ` Tejun Heo 2009-08-28 4:05 ` Linus Torvalds 1 sibling, 2 replies; 13+ messages in thread From: Tejun Heo @ 2009-08-28 2:46 UTC (permalink / raw) To: Steven Rostedt Cc: LKML, Thomas Gleixner, Peter Zijlstra, Ingo Molnar, Andrew Morton Hello, Steven Rostedt. Steven Rostedt wrote: > commit 02cf94c370e0dc9bf408fe45eb86fe9ad58eaf7f > Author: Tejun Heo <tj@kernel.org> > Date: Wed Jan 21 17:26:06 2009 +0900 > > x86: make x86_32 use tlb_64.c > > Impact: less contention when issuing invalidate IPI, cleanup > > Make x86_32 use the same tlb code as 64bit. The 64bit code uses > multiple IPI vectors for tlb shootdown to reduce contention. This > patch makes x86_32 allocate the same 8 IPIs as x86_64 and share the > code paths. > > Note that the usage of asmlinkage is inconsistent for x86_32 and 64 > and calls for further cleanup. This has been noted with a FIXME > comment in tlb_64.c. > > Signed-off-by: Tejun Heo <tj@kernel.org> > > I can easily hit this bug at this commit, but I ran for a week on the > commit before it. Thus I'm assuming this is the bug (but I'm not 100% > sure). Drat, why does it have to be mine? ;-) Joke aside, thank you very much for bisecting it. ... >> [13288.222084] EIP: 0060:[<c0110821>] EFLAGS: 00000002 CPU: 0 >> [13288.222084] EIP is at default_send_IPI_mask_logical+0x53/0x92 >> [13288.222084] EAX: fffff000 EBX: 000000fd ECX: c042cccc EDX: 00000800 >> [13288.222084] ESI: 00000002 EDI: 00000086 EBP: f7065f10 ESP: f7065f00 >> [13288.222084] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 >> [13288.222084] Process migration/0 (pid: 3, ti=f7064000 task=f705f0c0 >> task.ti=f7064000) >> [13288.222084] Stack: >> [13288.222084] c0350550 0000007e efbfea20 c1819c60 f7065f18 c010f1fa >> f7065f24 c011b69c >> [13288.222084] c0350490 f7065f2c c011b70b f7065f5c c01213d3 00000000 >> 00000000 00000001 >> [13288.222084] 00000000 00000001 00000000 00000092 c5685f08 c5685f44 >> 00000003 f7065f64 >> [13288.222084] Call Trace: >> [13288.222084] [<c010f1fa>] ? native_smp_send_reschedule+0x44/0x46 >> [13288.222084] [<c011b69c>] ? resched_task+0x5f/0x62 >> [13288.222084] [<c011b70b>] ? check_preempt_curr_idle+0x13/0x15 >> [13288.222084] [<c01213d3>] ? try_to_wake_up+0x1f4/0x228 >> [13288.222084] [<c0121417>] ? default_wake_function+0x10/0x12 >> [13288.222084] [<c011a22f>] ? __wake_up_common+0x39/0x61 >> [13288.222084] [<c011b7da>] ? complete+0x30/0x43 >> [13288.222084] [<c0123eaa>] ? migration_thread+0x19d/0x1db >> [13288.222084] [<c0123d0d>] ? migration_thread+0x0/0x1db >> [13288.222084] [<c0137411>] ? kthread+0x4b/0x6f >> [13288.222084] [<c01373c6>] ? kthread+0x0/0x6f >> [13288.222084] [<c01033cf>] ? kernel_thread_helper+0x7/0x10 >> [13288.222084] Code: eb 27 3b c0 e8 56 62 01 00 a1 88 cd 42 c0 83 fb 02 8b >> 50 20 75 10 89 55 f0 ff 90 b8 00 00 00 8b 55 f0 eb 12 f3 90 a1 f4 f8 42 c0 >> <8b> 80 00 c3 ff ff f6 c4 10 75 ee a1 f4 f8 42 c0 c1 e6 18 2d f0 0: eb 27 jmp 29 <a+0x29> 2: 3b c0 cmp %eax,%eax 4: e8 56 62 01 00 call 1625f <a+0x1625f> 9: a1 88 cd 42 c0 mov 0xc042cd88,%eax e: 83 fb 02 cmp $0x2,%ebx 11: 8b 50 20 mov 0x20(%eax),%edx 14: 75 10 jne 26 <a+0x26> 16: 89 55 f0 mov %edx,-0x10(%ebp) 19: ff 90 b8 00 00 00 call *0xb8(%eax) 1f: 8b 55 f0 mov -0x10(%ebp),%edx 22: eb 12 jmp 36 <a+0x36> 24: f3 90 pause 26: a1 f4 f8 42 c0 mov 0xc042f8f4,%eax 2b: 8b 80 00 c3 ff ff mov -0x3d00(%eax),%eax 31: f6 c4 10 test $0x10,%ah 34: 75 ee jne 24 <a+0x24> 36: a1 f4 f8 42 c0 mov 0xc042f8f4,%eax 3b: c1 e6 18 shl $0x18,%esi 3e: 2d .byte 0x2d 3f: f0 lock This is from __default_send_IPI_dest_field() which calls __xapic_wait_icr_idle() which busy waits for APIC_ICR_BUSY bit to go off. The f3 90 is REP; NOP; which is cpu_relax(). The 8b is loading the apic register to test for the bit. >> [13288.222093] Pid: 0, comm: swapper Tainted: G D (2.6.30-rc6 #3) >> Unknown >> [13288.222093] EIP: 0060:[<c03492d5>] EFLAGS: 00000097 CPU: 1 >> [13288.222093] EIP is at _spin_lock_irq+0x19/0x1f >> [13288.222093] EAX: c1819c60 EBX: c1819c60 ECX: f708f6b0 EDX: 00007e7d >> [13288.222093] ESI: c04bfc60 EDI: c04bcd18 EBP: f7093f18 ESP: f7093f18 >> [13288.222093] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 >> [13288.222093] Process swapper (pid: 0, ti=f7092000 task=f708f500 >> task.ti=f7092000) >> [13288.222093] Stack: >> [13288.222093] f7093f8c c0347415 f7093f28 c012b204 f708f6b0 c01100ba >> c04bcd18 c04bfc60 >> [13288.222093] f708f6b4 c1816488 00000000 00000001 00000000 f7093f94 >> f7092000 0000007b >> [13288.222093] f708f500 000000d8 00000000 c0468e60 00000001 c1816f8c >> f7093f94 c01424e4 >> [13288.222093] Call Trace: >> [13288.222093] [<c0347415>] ? __schedule+0xc8/0x855 >> [13288.222093] [<c012b204>] ? irq_exit+0x39/0x5c >> [13288.222093] [<c01100ba>] ? smp_error_interrupt+0x61/0x63 >> [13288.222093] [<c01424e4>] ? tick_nohz_restart_sched_tick+0x2c/0x139 >> [13288.222093] [<c0347baf>] ? schedule+0xd/0x1c >> [13288.222093] [<c0101c5d>] ? cpu_idle+0x60/0x63 >> [13288.222093] [<c0344d11>] ? start_secondary+0x195/0x19a >> [13288.222093] Code: 66 0f c1 10 38 f2 74 06 f3 90 8a 10 eb f6 89 c8 5d c3 >> 55 89 e5 0f 1f 44 00 00 fa ba 00 01 00 00 f0 66 0f c1 10 38 f2 74 06 f3 90 >> <8a> 10 eb f6 5d c3 55 89 e5 0f 1f 44 00 00 9c 5a fa f0 83 28 01 >> [13288.222093] ---[ end trace 94cd80322ca2a919 ]--- This one is in the same wait loop but it's calculating the address of the apic register to load. >> [45012.674548] Pid: 0, comm: swapper Not tainted (2.6.31-rc6 #9) Unknown >> [45012.674548] EIP: 0060:[<c1267059>] EFLAGS: 00000097 CPU: 1 >> [45012.674548] EIP is at _spin_lock_irq+0x15/0x1f >> [45012.674548] EAX: c1c9a200 EBX: c1c9a200 ECX: f708f550 EDX: 0000a7a6 >> [45012.674548] ESI: 00000001 EDI: f708f708 EBP: f7093f1c ESP: f7093f1c >> [45012.674548] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 >> [45012.674548] Process swapper (pid: 0, ti=f7092000 task=f708f550 >> task.ti=f7092000) >> [45012.674548] Stack: >> [45012.674548] f7093f94 c1265404 c1032f33 f7093f30 c13f0200 f708f704 >> c10032e6 c13ec1f4 >> [45012.674548] <0> f708f708 c13f0200 00000001 00000000 f7093f94 f7092000 >> 00000001 0000007b >> [45012.674548] <0> 000000d8 f708f550 c1395288 00000001 c1c9648c f7093f94 >> c104babe f7092000 >> [45012.674548] Call Trace: >> [45012.674548] [<c1265404>] ? schedule+0xca/0x86c >> [45012.674548] [<c1032f33>] ? irq_exit+0x39/0x5c >> [45012.674548] [<c10032e6>] ? error_interrupt+0x2a/0x30 >> [45012.674548] [<c104babe>] ? tick_nohz_restart_sched_tick+0x2c/0x139 >> [45012.674548] [<c1001c15>] ? cpu_idle+0x60/0x63 >> [45012.674548] [<c1262bc7>] ? start_secondary+0x195/0x19a >> [45012.674548] Code: 01 00 00 f0 66 0f c1 10 38 f2 74 06 f3 90 8a 10 eb f6 >> 89 c8 c9 c3 55 89 e5 0f 1f 44 00 00 fa ba 00 01 00 00 f0 66 0f c1 10 38 f2 >> <74> 06 f3 90 8a 10 eb f6 c9 c3 55 89 e5 0f 1f 44 00 00 9c 5a fa This is cpu1 waiting for cpu0 to release spin_lock_irq() which is shown below. >> [45012.674548] ---[ end trace 7323d2c20b35b0b0 ]--- >> [45012.678548] BUG: NMI Watchdog detected LOCKUP on CPU0, ip c1013ff8, >> [45012.678548] EIP: 0060:[<c1013ff8>] EFLAGS: 00000002 CPU: 0 >> [45012.678548] EIP is at default_send_IPI_mask_logical+0x5c/0x90 >> [45012.678548] EAX: 000018f0 EBX: 00000082 ECX: c1356b5c EDX: 00000800 >> [45012.678548] ESI: 000000fd EDI: 00000002 EBP: ceaf7e18 ESP: ceaf7e08 >> [45012.678548] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 >> [45012.678548] Process backup2l (pid: 9294, ti=ceaf6000 task=db276eb0 >> task.ti=ceaf6000) >> [45012.678548] Stack: >> [45012.678548] c126fcc8 000000a7 db13e9c0 c1c9a200 ceaf7e20 c10127ce >> ceaf7e2c c1020ebf >> [45012.678548] <0> c126fc08 ceaf7e34 c1020f2e ceaf7e64 c1029d20 00000000 >> 00000000 00000001 >> [45012.678548] <0> 00000001 00000001 00000000 00000086 ce995e40 e1cee7f8 >> 00000001 ceaf7e6c >> [45012.678548] Call Trace: >> [45012.678548] [<c10127ce>] ? native_smp_send_reschedule+0x44/0x46 >> [45012.678548] [<c1020ebf>] ? resched_task+0x5f/0x62 >> [45012.678548] [<c1020f2e>] ? check_preempt_curr_idle+0x13/0x15 >> [45012.678548] [<c1029d20>] ? try_to_wake_up+0x1f7/0x22b >> [45012.678548] [<c1029d64>] ? default_wake_function+0x10/0x12 >> [45012.678548] [<c1040645>] ? autoremove_wake_function+0x14/0x34 >> [45012.678548] [<c101e110>] ? __wake_up_common+0x39/0x61 >> [45012.678548] [<c1021048>] ? __wake_up_sync_key+0x38/0x4a >> [45012.678548] [<c1021069>] ? __wake_up_sync+0xf/0x12 >> [45012.678548] [<c10b0fd8>] ? pipe_release+0x5e/0x92 >> [45012.678548] [<c10b103c>] ? pipe_write_release+0x14/0x16 >> [45012.678548] [<c10ac42e>] ? __fput+0xcf/0x177 >> [45012.678548] [<c10ac4f0>] ? fput+0x1a/0x1c >> [45012.678548] [<c10a982d>] ? filp_close+0x56/0x60 >> [45012.678548] [<c102f56b>] ? put_files_struct+0x5d/0xa1 >> [45012.678548] [<c102f5ea>] ? exit_files+0x3b/0x40 >> [45012.678548] [<c1030ca5>] ? do_exit+0x1ae/0x5b3 >> [45012.678548] [<c103110c>] ? do_group_exit+0x62/0x89 >> [45012.678548] [<c103114b>] ? sys_exit_group+0x18/0x1c >> [45012.678548] [<c1002944>] ? sysenter_do_call+0x12/0x22 >> [45012.678548] Code: a1 18 6c 35 c1 83 fe 02 8b 50 20 75 10 89 55 f0 ff 90 >> b8 00 00 00 8b 55 f0 eb 12 f3 90 a1 d4 97 35 c1 8b 80 00 c3 ff ff f6 c4 10 >> <75> ee a1 d4 97 35 c1 c1 e7 18 2d f0 3c 00 00 89 38 89 f0 09 d0 The same loop but yet different instruction. It's about to jump back to cpu_relax(). >> Always happens where one CPU is sending an IPI and the other has the rq >> spinlock. Seems to be that the IPI expects the other CPU to not have >> interrupts disabled or something? I'm not too familiar with apics but AFAIK sending IPI isn't an interlocked operation (not at least at the software level) so I doubt it has much to do with the other cpu doing or not doing anything. It looks like the local apic is stuck hardware-wise. The only thing the commit changes is that cpu1 would be using vector 0xf1 instead of 0xf0 together with cpu0. (reading the doc...) Okay, here's something interesting. It's from section 9.8.4 of intel doc 253668.pdf - Intel 64 and IA-32 Architectures Software Developer's Manual Volume 3A: System Programming Guide, Part 1. For the P6 family and Pentium processors, the IRR and ISR registers can queue no more than two interrupts per priority level, and will reject other interrupts that are received within the same priority level. And from AMD's 24593 - AMD64 Architecture Programmer's Manual Volume 2: System Programming, section 16.6.3. No more than two interrupts can be pending for the same interrupt vector number. Subsequent interrupt requests to the same interrupt vector number will be rejected. See Figure 16-23 on page 445. We're using the highest single interrupt priority level which starts at 0xf0 for all IPIs. This means that upto the ppro on intel and all AMD processors can deadlock via the apic bus in the following scenario. cpu0 cpu1 local_irq_disable() spin_lock_irqsave(A) and succeeds spin_lock_irqsave(A) and waits receives 0xf0-ff intr A, accepts receives 0xf0-ff intr B, accepts sends IPI C receives IPI C, ISR/IRR full, rejects processor continues, apic keeps retrying IPI C keeps rejecting IPI C processor tries to send IPI C again ... but APIC_ICR_BUSY never clears. ... I don't know why the specific commit triggered this behavior because it doesn't use more ISR/IRR slots. It's just using different vectors in different directions. The processor either is somehow ending up sending the invalidate ipi to itself too thus using an extra ISR/IRR slot or it's the changes in timing which somehow allows the cpu to accumulate two pending irqs in the highest priority group. It would be interesting to print out ISR/IRR while the machine hung from the NMI handler. All in all, this resource deadlock involving apic while is cool definitely is worrying as it looks like it can affect recent amd processors. If the analysis is correct, one solution would be distributing IPI vectors across multiple priority levels. More ideas? Thanks. -- tejun ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [BUG] lockup with the latest kernel 2009-08-28 2:46 ` Tejun Heo @ 2009-08-28 2:52 ` Tejun Heo 2009-08-28 6:36 ` Ingo Molnar 2009-08-28 4:05 ` Linus Torvalds 1 sibling, 1 reply; 13+ messages in thread From: Tejun Heo @ 2009-08-28 2:52 UTC (permalink / raw) To: Steven Rostedt Cc: LKML, Thomas Gleixner, Peter Zijlstra, Ingo Molnar, Andrew Morton Tejun Heo wrote: >>> Always happens where one CPU is sending an IPI and the other has the rq >>> spinlock. Seems to be that the IPI expects the other CPU to not have >>> interrupts disabled or something? > > I'm not too familiar with apics but AFAIK sending IPI isn't an > interlocked operation (not at least at the software level) so I doubt > it has much to do with the other cpu doing or not doing anything. It > looks like the local apic is stuck hardware-wise. The only thing the > commit changes is that cpu1 would be using vector 0xf1 instead of 0xf0 > together with cpu0. > > (reading the doc...) Okay, here's something interesting. It's from > section 9.8.4 of intel doc 253668.pdf - Intel 64 and IA-32 > Architectures Software Developer's Manual Volume 3A: System > Programming Guide, Part 1. > > For the P6 family and Pentium processors, the IRR and ISR registers > can queue no more than two interrupts per priority level, and will > reject other interrupts that are received within the same priority > level. > > And from AMD's 24593 - AMD64 Architecture Programmer's Manual Volume > 2: System Programming, section 16.6.3. > > No more than two interrupts can be pending for the same interrupt > vector number. Subsequent interrupt requests to the same interrupt > vector number will be rejected. See Figure 16-23 on page 445. > Oh... there are differences that I missed. All intels: If more than one interrupt is generated with the same vector number, the local APIC can set the bit for the vector both in the IRR and the ISR. This means that for the Pentium 4 and Intel Xeon processors, the IRR and ISR can queue two interrupts for each interrupt vector: one in the IRR and one in the ISR. Any additional interrupts issued FOR THE SAME INTERRUPT VECTOR are COLLAPSED INTO THE SINGLE BIT in the IRR. Ppro: no more than two interrupts PER PRIORITY LEVEL, and will REJECT OTHER interrupts AMD64: Subsequent interrupt requests to THE SAME INTERRUPT VECTOR NUMBER will be REJECTED. Eh... don't have earlier AMD doc and gotta go now. Can somebody please check? But it looks like we can deadlock by simply sending RESCHEDULE_VECTOR more than two times while holding rq lock on AMD? Thanks. -- tejun ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [BUG] lockup with the latest kernel 2009-08-28 2:52 ` Tejun Heo @ 2009-08-28 6:36 ` Ingo Molnar 2009-08-28 6:59 ` Tejun Heo 0 siblings, 1 reply; 13+ messages in thread From: Ingo Molnar @ 2009-08-28 6:36 UTC (permalink / raw) To: Tejun Heo Cc: Steven Rostedt, LKML, Thomas Gleixner, Peter Zijlstra, Andrew Morton, Linus Torvalds * Tejun Heo <tj@kernel.org> wrote: > Tejun Heo wrote: > >>> Always happens where one CPU is sending an IPI and the other has the rq > >>> spinlock. Seems to be that the IPI expects the other CPU to not have > >>> interrupts disabled or something? > > > > I'm not too familiar with apics but AFAIK sending IPI isn't an > > interlocked operation (not at least at the software level) so I doubt > > it has much to do with the other cpu doing or not doing anything. It > > looks like the local apic is stuck hardware-wise. The only thing the > > commit changes is that cpu1 would be using vector 0xf1 instead of 0xf0 > > together with cpu0. > > > > (reading the doc...) Okay, here's something interesting. It's from > > section 9.8.4 of intel doc 253668.pdf - Intel 64 and IA-32 > > Architectures Software Developer's Manual Volume 3A: System > > Programming Guide, Part 1. > > > > For the P6 family and Pentium processors, the IRR and ISR registers > > can queue no more than two interrupts per priority level, and will > > reject other interrupts that are received within the same priority > > level. > > > > And from AMD's 24593 - AMD64 Architecture Programmer's Manual Volume > > 2: System Programming, section 16.6.3. > > > > No more than two interrupts can be pending for the same interrupt > > vector number. Subsequent interrupt requests to the same interrupt > > vector number will be rejected. See Figure 16-23 on page 445. > > > > Oh... there are differences that I missed. > > All intels: If more than one interrupt is generated with the same > vector number, the local APIC can set the bit for the > vector both in the IRR and the ISR. This means that for > the Pentium 4 and Intel Xeon processors, the IRR and ISR > can queue two interrupts for each interrupt vector: one > in the IRR and one in the ISR. Any additional interrupts > issued FOR THE SAME INTERRUPT VECTOR are COLLAPSED INTO > THE SINGLE BIT in the IRR. > > Ppro: no more than two interrupts PER PRIORITY LEVEL, and will REJECT > OTHER interrupts > > AMD64: Subsequent interrupt requests to THE SAME INTERRUPT VECTOR > NUMBER will be REJECTED. > > Eh... don't have earlier AMD doc and gotta go now. Can somebody > please check? But it looks like we can deadlock by simply sending > RESCHEDULE_VECTOR more than two times while holding rq lock on > AMD? We poll ICR in the send-IPI logic before sending it out - so this shouldnt happen. The restrictions above should at most cause extra polling latency (i.e. it's a performance detail, not a lockup source). See all the *wait_icr_idle() methods in the IPI sending logic in arch/x86. Neither TLB flushes nor reschedule IPIs are idempotent, so if this was broken and if we lost requested events on remote CPUs we'd notice it rather quickly via TLB flush related hangs or scheduling latencies or lost wakeups, on a rather large category of CPUs. I think Linus's suggestion that it's the zero mask quirk on certain older CPUs that is causing problems on that system should be examined ... does .31-rc8 work fine? Ingo ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [BUG] lockup with the latest kernel 2009-08-28 6:36 ` Ingo Molnar @ 2009-08-28 6:59 ` Tejun Heo 0 siblings, 0 replies; 13+ messages in thread From: Tejun Heo @ 2009-08-28 6:59 UTC (permalink / raw) To: Ingo Molnar Cc: Steven Rostedt, LKML, Thomas Gleixner, Peter Zijlstra, Andrew Morton, Linus Torvalds Hello, Ingo Molnar wrote: >> Eh... don't have earlier AMD doc and gotta go now. Can somebody >> please check? But it looks like we can deadlock by simply sending >> RESCHEDULE_VECTOR more than two times while holding rq lock on >> AMD? > > We poll ICR in the send-IPI logic before sending it out - so this > shouldnt happen. The restrictions above should at most cause extra > polling latency (i.e. it's a performance detail, not a lockup > source). See all the *wait_icr_idle() methods in the IPI sending > logic in arch/x86. Ah... good. I'm not all that familiar with the area so I was kind of shooting in the dark. > Neither TLB flushes nor reschedule IPIs are idempotent, so if this > was broken and if we lost requested events on remote CPUs we'd > notice it rather quickly via TLB flush related hangs or scheduling > latencies or lost wakeups, on a rather large category of CPUs. But it still looks like we can quite easily fall into deadlock when there are multiple cpus. cpu0 holding rq_lock and sends RESCHEUDLE, cpu1 waiting on rq_lock with irq disabled and some other cpus already sent three other IPIs to cpu1 then cpu0 will lock up on the BUSY bit when it tries to send RESCHEDULE, no? > I think Linus's suggestion that it's the zero mask quirk on certain > older CPUs that is causing problems on that system should be > examined ... does .31-rc8 work fine? Yeap, it would be great if that's the case. Thanks. -- tejun ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [BUG] lockup with the latest kernel 2009-08-28 2:46 ` Tejun Heo 2009-08-28 2:52 ` Tejun Heo @ 2009-08-28 4:05 ` Linus Torvalds 2009-08-28 16:15 ` Steven Rostedt 1 sibling, 1 reply; 13+ messages in thread From: Linus Torvalds @ 2009-08-28 4:05 UTC (permalink / raw) To: Tejun Heo Cc: Steven Rostedt, LKML, Thomas Gleixner, Peter Zijlstra, Ingo Molnar, Andrew Morton On Fri, 28 Aug 2009, Tejun Heo wrote: > > > > x86: make x86_32 use tlb_64.c > > > > Impact: less contention when issuing invalidate IPI, cleanup > > > > Make x86_32 use the same tlb code as 64bit. The 64bit code uses > > multiple IPI vectors for tlb shootdown to reduce contention. This > > patch makes x86_32 allocate the same 8 IPIs as x86_64 and share the > > code paths. > > > > Note that the usage of asmlinkage is inconsistent for x86_32 and 64 > > and calls for further cleanup. This has been noted with a FIXME > > comment in tlb_64.c. > > > > Signed-off-by: Tejun Heo <tj@kernel.org> > > > > I can easily hit this bug at this commit, but I ran for a week on the > > commit before it. Thus I'm assuming this is the bug (but I'm not 100% > > sure). > > Drat, why does it have to be mine? ;-) > > Joke aside, thank you very much for bisecting it. > > ... > >> [13288.222084] EIP: 0060:[<c0110821>] EFLAGS: 00000002 CPU: 0 > >> [13288.222084] EIP is at default_send_IPI_mask_logical+0x53/0x92 Is this one perhaps fixed by b04e6373d694 ("x86: don't call '->send_IPI_mask()' with an empty mask") It sounds a _lot_ like that bug. Older dual-cpu x86 box, and APIC getting confused by the occasional empty CPU mask, and then subsequent IPI's will hang. Linus ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [BUG] lockup with the latest kernel 2009-08-28 4:05 ` Linus Torvalds @ 2009-08-28 16:15 ` Steven Rostedt 2009-08-28 18:33 ` Steven Rostedt 0 siblings, 1 reply; 13+ messages in thread From: Steven Rostedt @ 2009-08-28 16:15 UTC (permalink / raw) To: Linus Torvalds Cc: Tejun Heo, LKML, Thomas Gleixner, Peter Zijlstra, Ingo Molnar, Andrew Morton On Thu, 27 Aug 2009, Linus Torvalds wrote: > > > On Fri, 28 Aug 2009, Tejun Heo wrote: > > > > > > x86: make x86_32 use tlb_64.c > > > > > > Impact: less contention when issuing invalidate IPI, cleanup > > > > > > Make x86_32 use the same tlb code as 64bit. The 64bit code uses > > > multiple IPI vectors for tlb shootdown to reduce contention. This > > > patch makes x86_32 allocate the same 8 IPIs as x86_64 and share the > > > code paths. > > > > > > Note that the usage of asmlinkage is inconsistent for x86_32 and 64 > > > and calls for further cleanup. This has been noted with a FIXME > > > comment in tlb_64.c. > > > > > > Signed-off-by: Tejun Heo <tj@kernel.org> > > > > > > I can easily hit this bug at this commit, but I ran for a week on the > > > commit before it. Thus I'm assuming this is the bug (but I'm not 100% > > > sure). > > > > Drat, why does it have to be mine? ;-) > > > > Joke aside, thank you very much for bisecting it. > > > > ... > > >> [13288.222084] EIP: 0060:[<c0110821>] EFLAGS: 00000002 CPU: 0 > > >> [13288.222084] EIP is at default_send_IPI_mask_logical+0x53/0x92 > > Is this one perhaps fixed by b04e6373d694 ("x86: don't call > '->send_IPI_mask()' with an empty mask") > > It sounds a _lot_ like that bug. Older dual-cpu x86 box, and APIC getting > confused by the occasional empty CPU mask, and then subsequent IPI's will > hang. I'll update my box to that kernel and see what happens and let you know. Thanks, -- Steve ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [BUG] lockup with the latest kernel 2009-08-28 16:15 ` Steven Rostedt @ 2009-08-28 18:33 ` Steven Rostedt 2009-09-09 2:29 ` Steven Rostedt 0 siblings, 1 reply; 13+ messages in thread From: Steven Rostedt @ 2009-08-28 18:33 UTC (permalink / raw) To: Linus Torvalds Cc: Tejun Heo, LKML, Thomas Gleixner, Peter Zijlstra, Ingo Molnar, Andrew Morton On Fri, 28 Aug 2009, Steven Rostedt wrote: > On Thu, 27 Aug 2009, Linus Torvalds wrote: > > > ... > > > >> [13288.222084] EIP: 0060:[<c0110821>] EFLAGS: 00000002 CPU: 0 > > > >> [13288.222084] EIP is at default_send_IPI_mask_logical+0x53/0x92 > > > > Is this one perhaps fixed by b04e6373d694 ("x86: don't call > > '->send_IPI_mask()' with an empty mask") > > > > It sounds a _lot_ like that bug. Older dual-cpu x86 box, and APIC getting > > confused by the occasional empty CPU mask, and then subsequent IPI's will > > hang. > > I'll update my box to that kernel and see what happens and let you know. I checked out the commit at b04e6373d694, compiled and booted that kernel. So far, so good. It survived logging into a gnome session (sometimes locks up there), and a full kernel compile (usually locks up there too). But there's been rare times that it would take days before hitting the bug. I'll keep it in this kernel for a while and if it lasts the week, I'll consider it fixed. This box acts as my email, dhcp and web server, as well as one of the boxes in my distcc farm, thus it gets a bit of use. It will not be idle during this time period. Thanks! -- Steve ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [BUG] lockup with the latest kernel 2009-08-28 18:33 ` Steven Rostedt @ 2009-09-09 2:29 ` Steven Rostedt 0 siblings, 0 replies; 13+ messages in thread From: Steven Rostedt @ 2009-09-09 2:29 UTC (permalink / raw) To: Linus Torvalds Cc: Tejun Heo, LKML, Thomas Gleixner, Peter Zijlstra, Ingo Molnar, Andrew Morton On Fri, 2009-08-28 at 14:33 -0400, Steven Rostedt wrote: > On Fri, 28 Aug 2009, Steven Rostedt wrote: > > On Thu, 27 Aug 2009, Linus Torvalds wrote: > > > > ... > > > > >> [13288.222084] EIP: 0060:[<c0110821>] EFLAGS: 00000002 CPU: 0 > > > > >> [13288.222084] EIP is at default_send_IPI_mask_logical+0x53/0x92 > > > > > > Is this one perhaps fixed by b04e6373d694 ("x86: don't call > > > '->send_IPI_mask()' with an empty mask") > > > > > > It sounds a _lot_ like that bug. Older dual-cpu x86 box, and APIC getting > > > confused by the occasional empty CPU mask, and then subsequent IPI's will > > > hang. > > > > I'll update my box to that kernel and see what happens and let you know. > > I checked out the commit at b04e6373d694, compiled and booted that kernel. > So far, so good. It survived logging into a gnome session (sometimes locks > up there), and a full kernel compile (usually locks up there too). But > there's been rare times that it would take days before hitting the bug. > > I'll keep it in this kernel for a while and if it lasts the week, I'll > consider it fixed. This box acts as my email, dhcp and web server, as well > as one of the boxes in my distcc farm, thus it gets a bit of use. It will > not be idle during this time period. Just an update. The box currently has an uptime of 11 days. I have not hit the bug at all, and it has survived numerous compiles (both locally and as a distcc server). It has been serving up my email and web pages without a hitch. I'd say that commit fixed the bug. (knock on wood). -- Steve (or it could have been the penguins: http://rostedt.homelinux.com/gallery2/main.php?g2_itemId=7860 ) ^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2009-09-09 2:29 UTC | newest] Thread overview: 13+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2009-08-19 15:49 [BUG] lockup with the latest kernel Steven Rostedt 2009-08-19 15:50 ` Steven Rostedt 2009-08-19 16:18 ` Andrew Morton 2009-08-27 22:41 ` Steven Rostedt 2009-08-27 22:45 ` Steven Rostedt 2009-08-28 2:46 ` Tejun Heo 2009-08-28 2:52 ` Tejun Heo 2009-08-28 6:36 ` Ingo Molnar 2009-08-28 6:59 ` Tejun Heo 2009-08-28 4:05 ` Linus Torvalds 2009-08-28 16:15 ` Steven Rostedt 2009-08-28 18:33 ` Steven Rostedt 2009-09-09 2:29 ` Steven Rostedt
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox