public inbox for kvm@vger.kernel.org
 help / color / mirror / Atom feed
* RFH: kvm-0.14.1 on 2.6.32.42 stuck in flash_tlb_others_ipi()
@ 2012-01-09 11:41 Philipp Hahn
  2012-03-30 17:44 ` RFH: kvm-1.0+git on 3.3.0 " Philipp Hahn
  0 siblings, 1 reply; 3+ messages in thread
From: Philipp Hahn @ 2012-01-09 11:41 UTC (permalink / raw)
  To: kvm


[-- Attachment #1.1: Type: text/plain, Size: 1366 bytes --]

Hello,

one of our VMs regularly get stuck: the VM is completely unresponsive (no ssh, 
no serial console, no VNC). Using "gdbserver" and a remote system to debug 
the running VM, I see 3 CPUs (1,3,4) stuck in
 pgd_alloc() → spin_lock_irqsave(pgd_lock)
while the 4th CPU (2) is waiting in
 pgd_alloc() → pgd_prepopulate_pmb() →... →  flush_tlb_others_ipi()

195                     while (!cpumask_empty(to_cpumask(f->flush_cpumask)))
196                             cpu_relax();
(gdb) print f->flush_cpumask
$5 = {1}

CPU 1 is duing a do_exec() syscall, will CPU 2-4 are doing a do_fork() syscall 
according to "thread apply all backtrace".

After a "set variable f->flush_cpumask 0" from gdb the kernel continued 
dumping the trace-informations, which I attached.


Host: Debian linux-2.6.32-38-amd64 (=2.6.32.42), 8 Cores
Kvm: 0.14.1+dfsg
Guest: Debian linux-2.6.32-38-i686-bigmem, 4 CPUs

Is this a known bug and/or is a fix available?
I can gather more information from the VM if needd.

Sincerely
Philipp
-- 
Philipp Hahn           Open Source Software Engineer      hahn@univention.de
Univention GmbH        Linux for Your Business        fon: +49 421 22 232- 0
Mary-Somerville-Str.1  D-28359 Bremen                 fax: +49 421 22 232-99
                                                   http://www.univention.de/

[-- Attachment #1.2: dmesg.txt --]
[-- Type: text/plain, Size: 7840 bytes --]

[261850.404057] BUG: soft lockup - CPU#1 stuck for 219010s! [sshd:2707]
[261850.404057] Modules linked in: binfmt_misc nfsd exportfs xt_tcpudp nf_conntrack_ipv6 ip6table_mangle ip6table_filter ip6_tables xt_state iptable_mangle iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 iptable_filter ip_tables x_tables nfs lockd fscache nfs_acl auth_rpcgss sunrpc quota_v2 quota_tree sd_mod crc_t10dif ide_generic ide_gd_mod ide_core cirrusfb 8139cp mii snd_pcm snd_timer snd soundcore snd_page_alloc i2c_piix4 psmouse i2c_core i6300esb virtio_balloon joydev serio_raw pcspkr processor evdev usbhid hid ext3 jbd dm_snapshot dm_mirror dm_region_hash dm_log dm_mod uhci_hcd ata_generic virtio_blk virtio_net ehci_hcd ata_piix thermal usbcore nls_base virtio_pci floppy thermal_sys libata button [last unloaded: scsi_wait_scan]
[261850.404057]
[261850.404057] Pid: 2707, comm: sshd Not tainted (2.6.32-ucs52-686-bigmem #1) Bochs
[261850.404057] EIP: 0060:[<c1293e1d>] EFLAGS: 00000292 CPU: 1
[261850.404057] EIP is at _spin_unlock_irqrestore+0x9/0xf
[261850.404057] EAX: 00000292 EBX: 00000003 ECX: 00000800 EDX: 00000292
[261850.404057] ESI: 00000000 EDI: 00000001 EBP: f55f3eb4 ESP: f55f3e64
[261850.404057]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[261850.404057] CR0: 8005003b CR2: b72a1000 CR3: 35607000 CR4: 000006f0
[261850.404057] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[261850.404057] DR6: ffff0ff0 DR7: 00000400
[261850.404057] Call Trace:
[261850.404057]  [<c102422e>] ? pgd_alloc+0x1c7/0x21a
[261850.404057]  [<c1034908>] mm_init+0xa8/0xd4
[261850.404057]  [<c1034dd1>] ? dup_mm+0x6a/0x389
[261850.404057]  [<c10356bf>] ? copy_process+0x57a/0xf28
[261850.404057]  [<c1035a60>] ? copy_process+0x91b/0xf28
[261850.404057]  [<c10361a7>] ? do_fork+0x13a/0x2bc
[261850.404057]  [<c1006e6e>] ? sys_clone+0x21/0x27
[261850.404057]  [<c100829c>] ? syscall_call+0x7/0xb
[261850.404058] BUG: soft lockup - CPU#2 stuck for 218991s! [runsv:1236]
[261850.404058] Modules linked in: binfmt_misc nfsd exportfs xt_tcpudp nf_conntrack_ipv6 ip6table_mangle ip6table_filter ip6_tables xt_state iptable_mangle iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 iptable_filter ip_tables x_tables nfs lockd fscache nfs_acl auth_rpcgss sunrpc quota_v2 quota_tree sd_mod crc_t10dif ide_generic ide_gd_mod ide_core cirrusfb 8139cp mii snd_pcm snd_timer snd soundcore snd_page_alloc i2c_piix4 psmouse i2c_core i6300esb virtio_balloon joydev serio_raw pcspkr processor evdev usbhid hid ext3 jbd dm_snapshot dm_mirror dm_region_hash dm_log dm_mod uhci_hcd ata_generic virtio_blk virtio_net ehci_hcd ata_piix thermal usbcore nls_base virtio_pci floppy thermal_sys libata button [last unloaded: scsi_wait_scan]
[261850.404058]
[261850.404058] Pid: 1236, comm: runsv Not tainted (2.6.32-ucs52-686-bigmem #1) Bochs
[261850.404058] EIP: 0060:[<c1293e1d>] EFLAGS: 00000296 CPU: 2
[261850.404058] EIP is at _spin_unlock_irqrestore+0x9/0xf
[261850.404058] EAX: 00000296 EBX: 00000003 ECX: 00000003 EDX: 00000296
[261850.404058] ESI: 00000000 EDI: 00000001 EBP: f602febc ESP: f602fe6c
[261850.404058]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[261850.404058] CR0: 8005003b CR2: 0810acf4 CR3: 36a91000 CR4: 000006f0
[261850.404058] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[261850.404058] DR6: ffff0ff0 DR7: 00000400
[261850.404058] Call Trace:
[261850.404058]  [<c102422e>] ? pgd_alloc+0x1c7/0x21a
[261850.404058]  [<c1034908>] mm_init+0xa8/0xd4
[261850.404058]  [<c1034dd1>] ? dup_mm+0x6a/0x389
[261850.404058]  [<c10356bf>] ? copy_process+0x57a/0xf28
[261850.404058]  [<c1035a60>] ? copy_process+0x91b/0xf28
[261850.404058]  [<c10361a7>] ? do_fork+0x13a/0x2bc
[261850.404058]  [<c10bae59>] ? do_sync_read+0x0/0x107
[261850.404058]  [<c10bb8ae>] ? vfs_read+0x7b/0xd3
[261850.404058]  [<c100dc12>] ? sys_fork+0x15/0x19
[261850.404058]  [<c100829c>] ? syscall_call+0x7/0xb
[261850.438247] BUG: soft lockup - CPU#0 stuck for 218999s! [pbuilder-update:2713]
[261850.438247] Modules linked in: binfmt_misc nfsd exportfs xt_tcpudp nf_conntrack_ipv6 ip6table_mangle ip6table_filter ip6_tables xt_state iptable_mangle iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 iptable_filter ip_tables x_tables nfs lockd fscache nfs_acl auth_rpcgss sunrpc quota_v2 quota_tree sd_mod crc_t10dif ide_generic ide_gd_mod ide_core cirrusfb 8139cp mii snd_pcm snd_timer snd soundcore snd_page_alloc i2c_piix4 psmouse i2c_core i6300esb virtio_balloon joydev serio_raw pcspkr processor evdev usbhid hid ext3 jbd dm_snapshot dm_mirror dm_region_hash dm_log dm_mod uhci_hcd ata_generic virtio_blk virtio_net ehci_hcd ata_piix thermal usbcore nls_base virtio_pci floppy thermal_sys libata button [last unloaded: scsi_wait_scan]
[261850.438247]
[261850.438247] Pid: 2713, comm: pbuilder-update Not tainted (2.6.32-ucs52-686-bigmem #1) Bochs
[261850.438247] EIP: 0060:[<c1293e1d>] EFLAGS: 00200292 CPU: 0
[261850.438247] EIP is at _spin_unlock_irqrestore+0x9/0xf
[261850.438247] EAX: 00200292 EBX: 00000003 ECX: 00000000 EDX: 00200292
[261850.438247] ESI: 00000000 EDI: 00000001 EBP: f54ddf4c ESP: f54ddefc
[261850.438247]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[261850.438247] CR0: 8005003b CR2: b73611a0 CR3: 35683000 CR4: 000006f0
[261850.438247] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[261850.438247] DR6: ffff0ff0 DR7: 00000400
[261850.438247] Call Trace:
[261850.438247]  [<c102422e>] ? pgd_alloc+0x1c7/0x21a
[261850.438247]  [<c1292e1f>] ? _cond_resched+0x25/0x3c
[261850.438247]  [<c1292e56>] ? wait_for_common+0x20/0x100
[261850.438247]  [<c1034908>] mm_init+0xa8/0xd4
[261850.438247]  [<c10bfb14>] ? bprm_mm_init+0x14/0x163
[261850.438247]  [<c10bfffc>] ? do_execve+0xe3/0x25b
[261850.438247]  [<c1006e29>] ? sys_execve+0x23/0x47
[261850.438247]  [<c100829c>] ? syscall_call+0x7/0xb
[261850.483139] BUG: soft lockup - CPU#3 stuck for 218991s! [nrpe:1254]
[261850.483139] Modules linked in: binfmt_misc nfsd exportfs xt_tcpudp nf_conntrack_ipv6 ip6table_mangle ip6table_filter ip6_tables xt_state iptable_mangle iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 iptable_filter ip_tables x_tables nfs lockd fscache nfs_acl auth_rpcgss sunrpc quota_v2 quota_tree sd_mod crc_t10dif ide_generic ide_gd_mod ide_core cirrusfb 8139cp mii snd_pcm snd_timer snd soundcore snd_page_alloc i2c_piix4 psmouse i2c_core i6300esb virtio_balloon joydev serio_raw pcspkr processor evdev usbhid hid ext3 jbd dm_snapshot dm_mirror dm_region_hash dm_log dm_mod uhci_hcd ata_generic virtio_blk virtio_net ehci_hcd ata_piix thermal usbcore nls_base virtio_pci floppy thermal_sys libata button [last unloaded: scsi_wait_scan]
[261850.483139]
[261850.483139] Pid: 1254, comm: nrpe Not tainted (2.6.32-ucs52-686-bigmem #1) Bochs
[261850.483139] EIP: 0060:[<c1293e1d>] EFLAGS: 00200292 CPU: 3
[261850.483139] EIP is at _spin_unlock_irqrestore+0x9/0xf
[261850.483139] EAX: 00200292 EBX: 00000003 ECX: 00000004 EDX: 00200292
[261850.483139] ESI: 00000000 EDI: 00000001 EBP: f647feb4 ESP: f647fe64
[261850.483139]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[261850.483139] CR0: 8005003b CR2: b74ee580 CR3: 36010000 CR4: 000006f0
[261850.483139] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[261850.483139] DR6: ffff0ff0 DR7: 00000400
[261850.483139] Call Trace:
[261850.483139]  [<c102422e>] ? pgd_alloc+0x1c7/0x21a
[261850.483139]  [<c1034908>] mm_init+0xa8/0xd4
[261850.483139]  [<c1034dd1>] ? dup_mm+0x6a/0x389
[261850.483139]  [<c10356bf>] ? copy_process+0x57a/0xf28
[261850.483139]  [<c1035a60>] ? copy_process+0x91b/0xf28
[261850.483139]  [<c10361a7>] ? do_fork+0x13a/0x2bc
[261850.483139]  [<c114cae3>] ? copy_to_user+0x29/0xf8
[261850.483139]  [<c1006e6e>] ? sys_clone+0x21/0x27
[261850.483139]  [<c100829c>] ? syscall_call+0x7/0xb
[261852.541339] nfs: RPC call returned error 88

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: RFH: kvm-1.0+git on 3.3.0 stuck in flash_tlb_others_ipi()
  2012-01-09 11:41 RFH: kvm-0.14.1 on 2.6.32.42 stuck in flash_tlb_others_ipi() Philipp Hahn
@ 2012-03-30 17:44 ` Philipp Hahn
  2012-04-19 13:42   ` [SOLVED] 2.6.32 stuck in flush_tlb_others_ipi() Philipp Hahn
  0 siblings, 1 reply; 3+ messages in thread
From: Philipp Hahn @ 2012-03-30 17:44 UTC (permalink / raw)
  To: kvm

[-- Attachment #1: Type: text/plain, Size: 1766 bytes --]

Hello,

On Monday 09 January 2012 12:41:41 Philipp Hahn wrote:
> one of our VMs regularly get stuck: the VM is completely unresponsive (no
> ssh, no serial console, no VNC). Using "gdbserver" and a remote system to
> debug the running VM, I see 3 CPUs (1,3,4) stuck in
>  pgd_alloc() → spin_lock_irqsave(pgd_lock)
> while the 4th CPU (2) is waiting in
>  pgd_alloc() → pgd_prepopulate_pmb() →... →  flush_tlb_others_ipi()
>
> 195                     while
> (!cpumask_empty(to_cpumask(f->flush_cpumask))) 196                         
>    cpu_relax();
> (gdb) print f->flush_cpumask
> $5 = {1}
>
> CPU 1 is duing a do_exec() syscall, will CPU 2-4 are doing a do_fork()
> syscall according to "thread apply all backtrace".
>
> After a "set variable f->flush_cpumask 0" from gdb the kernel continued
> dumping the trace-informations, which I attached.

I repeatet the test today with current version:

> Host: Debian linux-2.6.32-38-amd64 (=2.6.32.42), 8 Cores

3.3.0 doesn't hep either.

> Kvm: 0.14.1+dfsg

qemu-kvm-1.0-1587-ga0bc8c3 also not.

> Guest: Debian linux-2.6.32-38-i686-bigmem, 4 CPUs

I can reproduce the bug very reliable when building OpenOffice.org and/or 
samba4 with -j4.

> Is this a known bug and/or is a fix available?
> I can gather more information from the VM if needd.

Any ideas where to look next?
How can I check that the IPI-handling in KVM works?

Sincerely
Philipp
-- 
Philipp Hahn           Open Source Software Engineer      hahn@univention.de
Univention GmbH        be open.                       fon: +49 421 22 232- 0
Mary-Somerville-Str.1  D-28359 Bremen                 fax: +49 421 22 232-99
                                                   http://www.univention.de/

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [SOLVED] 2.6.32 stuck in flush_tlb_others_ipi()
  2012-03-30 17:44 ` RFH: kvm-1.0+git on 3.3.0 " Philipp Hahn
@ 2012-04-19 13:42   ` Philipp Hahn
  0 siblings, 0 replies; 3+ messages in thread
From: Philipp Hahn @ 2012-04-19 13:42 UTC (permalink / raw)
  To: kvm

[-- Attachment #1: Type: text/plain, Size: 1780 bytes --]

Hello,

good news:

On Friday 30 March 2012 19:44:50 you wrote:
> On Monday 09 January 2012 12:41:41 Philipp Hahn wrote:
> > one of our VMs regularly get stuck: the VM is completely unresponsive (no
> > ssh, no serial console, no VNC). Using "gdbserver" and a remote system to
> > debug the running VM, I see 3 CPUs (1,3,4) stuck in
> >  pgd_alloc() → spin_lock_irqsave(pgd_lock)
> > while the 4th CPU (2) is waiting in
> >  pgd_alloc() → pgd_prepopulate_pmb() →... →  flush_tlb_others_ipi()
> >
> > 195                     while
> > (!cpumask_empty(to_cpumask(f->flush_cpumask))) 196
> >    cpu_relax();
> > (gdb) print f->flush_cpumask
> > $5 = {1}
> >
> > CPU 1 is duing a do_exec() syscall, will CPU 2-4 are doing a do_fork()
> > syscall according to "thread apply all backtrace".

It'a guest kernel bug already fixed in v2.6.38 [1], but not (yet) back-ported 
to 2.6.32-longterm. [2] fixed a bug with TLB flushing when using PAE, which 
made the hidden bug trigger a lot more often. It only happens when using a 
PAE enabled guest kernel with >=2 CPUs.
Full details are in our German Bugzilla [3].

[1] 
<http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;h=a79e53d85683c6dd9f99c90511028adc2043031f>
[2] 
<http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;h=4981d01eada5354d81c8929d5b2836829ba3df7b>
[3] <https://forge.univention.org/bugzilla/show_bug.cgi?id=26661>

Sincerely
Philipp
-- 
Philipp Hahn           Open Source Software Engineer      hahn@univention.de
Univention GmbH        be open.                       fon: +49 421 22 232- 0
Mary-Somerville-Str.1  D-28359 Bremen                 fax: +49 421 22 232-99
                                                   http://www.univention.de/

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2012-04-19 13:42 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-01-09 11:41 RFH: kvm-0.14.1 on 2.6.32.42 stuck in flash_tlb_others_ipi() Philipp Hahn
2012-03-30 17:44 ` RFH: kvm-1.0+git on 3.3.0 " Philipp Hahn
2012-04-19 13:42   ` [SOLVED] 2.6.32 stuck in flush_tlb_others_ipi() Philipp Hahn

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox