From mboxrd@z Thu Jan  1 00:00:00 1970
From: Marcelo Tosatti 
Subject: __purge_vmap_area_lazy crash with CONFIG_PREEMPT_RCU=y
Date: Mon, 29 Dec 2008 12:58:21 -0200
Message-ID: <20081229145821.GA3823@amt.cnet>
References: <20081210202326.GA7565@dmt.cnet> <49523031.1000305@redhat.com> <20081224152844.GE29319@random.random>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: Avi Kivity , kvm-devel 
To: Andrea Arcangeli , Nick Piggin 
Return-path: 
Received: from mx2.redhat.com ([66.187.237.31]:52652 "EHLO mx2.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750835AbYL2O6n (ORCPT ); Mon, 29 Dec 2008 09:58:43 -0500
Content-Disposition: inline
In-Reply-To: <20081224152844.GE29319@random.random>
Sender: kvm-owner@vger.kernel.org
List-ID: 

On Wed, Dec 24, 2008 at 04:28:44PM +0100, Andrea Arcangeli wrote:
> On Wed, Dec 24, 2008 at 02:50:57PM +0200, Avi Kivity wrote:
> > Marcelo Tosatti wrote:
> >> The destructor for huge pages uses the backing inode for adjusting
> >> hugetlbfs accounting.
> >>
> >> Hugepage mappings are destroyed by exit_mmap, after
> >> mmu_notifier_release, so there are no notifications through
> >> unmap_hugepage_range at this point.
> >>
> >> The hugetlbfs inode can be freed with pages backed by it referenced
> >> by the shadow. When the shadow releases its reference, the huge page
> >> destructor will access a now freed inode.
> >>
> >> Implement the release operation for kvm mmu notifiers to release page
> >> refs before the hugetlbfs inode is gone.
> >>
> >
> > I see this isn't it. Andrea, comments?
>
> Yeah, the patch looks good, I talked a bit with Marcelo about this by
> PM.
> The issue is that it's not as straightforward as it seems:
> basically, when I implemented the ->release handlers and had spte
> teardown running before the files were closed (instead of waiting for
> the kvm anon inode release handler to fire), I was getting bugchecks
> from debug options including preempt=y (certain debug checks only
> become functional with preempt enabled, unfortunately). So eventually
> I removed ->release: for kvm it wasn't useful anyway, because no
> guest mode can run any more by the time mmu notifier ->release is
> invoked, and removing it avoided the issues with the bugchecks.
>
> We'll be using the mmu notifiers' ->release because it's always called
> just before the filehandles are destroyed; it's not really about
> guest mode or the secondary mmu, just an ordering issue with hugetlbfs
> internals.
>
> So in short, if no bugcheck triggers, this is fine (at least until
> hugetlbfs provides a way to register some callback to invoke at the
> start of the hugetlbfs->release handler).
The only bugcheck I see, which triggers on vanilla kvm upstream with
CONFIG_PREEMPT_DEBUG=y and CONFIG_PREEMPT_RCU=y, is:

general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
<4>ttyS1: 1 input overrun(s)
last sysfs file: /sys/class/net/tap0/address
CPU 0
Modules linked in: tun ipt_MASQUERADE iptable_nat nf_nat bridge stp llc nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_tcpudp ipt_REJECT iptable_filter ip_tables x_tables dm_multipath kvm_intel kvm scsi_wait_scan ata_piix libata dm_snapshot dm_zero dm_mirror dm_region_hash dm_log dm_mod shpchp pci_hotplug mptsas mptscsih mptbase scsi_transport_sas uhci_hcd ohci_hcd ehci_hcd
Pid: 4768, comm: qemu-system-x86 Not tainted 2.6.28-00165-g4f27e3e-dirty #164
RIP: 0010:[] [] __purge_vmap_area_lazy+0x12c/0x163
RSP: 0018:ffff88021e1f9a38 EFLAGS: 00010202
RAX: 6b6b6b6b6b6b6b6b RBX: 6b6b6b6b6b6b6b2b RCX: 0000000000000003
RDX: ffffffff80a1dae0 RSI: ffff880028083980 RDI: 0000000000000001
RBP: ffff88021e1f9a78 R08: 0000000000000286 R09: ffffffff80a1bf50
R10: ffff880119c270f8 R11: ffff88021e1f99b8 R12: ffff88021e1f9a38
R13: ffff88021e1f9a90 R14: ffff88021e1f9a98 R15: 000000000000813a
FS:  0000000000000000(0000) GS:ffffffff8080d900(0000) knlGS:0000000000000000
CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
CR2: 00000000008d9828 CR3: 0000000000201000 CR4: 00000000000026e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process qemu-system-x86 (pid: 4768, threadinfo ffff88021e1f8000, task ffff880119c270f8)
Stack:
 ffff88022bdfd840 ffff880119da11b8 ffffc20011c30000 000000000000813a
 0000000000000000 0000000000000001 ffff88022ec11c18 ffff88022f061838
 ffff88021e1f9aa8 ffffffff8028ab1d ffff88021e1f9aa8 ffffc20021976000
Call Trace:
 [] free_unmap_vmap_area_noflush+0x69/0x70
 [] remove_vm_area+0x25/0x71
 [] __vunmap+0x3a/0xca
 [] vfree+0x29/0x2b
 [] kvm_free_physmem_slot+0x25/0x7c [kvm]
 [] kvm_free_physmem+0x27/0x36 [kvm]
 [] kvm_arch_destroy_vm+0xa6/0xda [kvm]
 [] kvm_put_kvm+0x8d/0xa7 [kvm]
 [] kvm_vcpu_release+0x13/0x17 [kvm]
 [] __fput+0xeb/0x1a3
 [] fput+0x15/0x17
 [] filp_close+0x67/0x72
 [] put_files_struct+0x74/0xc8
 [] exit_files+0x47/0x4f
 [] do_exit+0x1eb/0x7a7
 [] ? _spin_unlock_irq+0x2b/0x51
 [] do_group_exit+0x73/0xa0
 [] get_signal_to_deliver+0x30c/0x32c
 [] ? sysret_signal+0x19/0x29
 [] do_notify_resume+0x8c/0x851
 [] ? do_futex+0x90/0x92a
 [] ? trace_hardirqs_on_caller+0xf0/0x114
 [] ? _spin_unlock_irqrestore+0x4c/0x68
 [] ? __rcu_read_unlock+0x92/0x9e
 [] ? trace_hardirqs_on_caller+0xf0/0x114
 [] ? trace_hardirqs_on+0xd/0xf
 [] ? getnstimeofday+0x3a/0x96
 [] ? ktime_get_ts+0x49/0x4e
 [] ? sysret_signal+0x5/0x29
 [] ? trace_hardirqs_on_caller+0xf0/0x114
 [] ? sysret_signal+0x19/0x29
 [] ptregscall_common+0x67/0xb0
Code: 46 48 c7 c7 c0 d1 74 80 4c 8d 65 c0 e8 0c db 2f 00 48 8b 45 c0 48 8d 58 c0 eb 10 48 89 df e8 74 fe ff ff 48 8b 43 40 48 8d 58 c0 <48> 8b 43 40 0f 18 08 48 8d 43 40 4c 39 e0 75 e0 48 c7 c7 c0 d1
RIP [] __purge_vmap_area_lazy+0x12c/0x163
 RSP 
---[ end trace fde3e64ebe4bbca2 ]---
Fixing recursive fault but reboot is needed!
BUG: scheduling while atomic: qemu-system-x86/4768/0x00000003
Modules linked in: tun ipt_MASQUERADE iptable_nat nf_nat bridge stp llc nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_tcpudp ipt_REJECT iptable_filter ip_tables x_tables dm_multipath kvm_intel kvm scsi_wait_scan ata_piix libata dm_snapshot dm_zero dm_mirror dm_region_hash dm_log dm_mod shpchp pci_hotplug mptsas mptscsih mptbase scsi_transport_sas uhci_hcd ohci_hcd ehci_hcd
Pid: 4768, comm: qemu-system-x86 Tainted: G D 2.6.28-00165-g4f27e3e-dirty #164
Call Trace:
 [] ? __debug_show_held_locks+0x1b/0x24
 [] __schedule_bug+0x8c/0x95
 [] schedule+0xd3/0x902
 [] ? trace_hardirqs_on+0xd/0xf
 [] ? put_io_context+0x67/0x72
 [] do_exit+0xda/0x7a7
 [] oops_begin+0x0/0x90
 [] die+0x5d/0x66
 [] do_general_protection+0x128/0x130
 [] ? do_general_protection+0x0/0x130
 [] error_exit+0x0/0xa9
 [] ? __purge_vmap_area_lazy+0x12c/0x163
 [] ? __purge_vmap_area_lazy+0x124/0x163
 [] free_unmap_vmap_area_noflush+0x69/0x70
 [] remove_vm_area+0x25/0x71
 [] __vunmap+0x3a/0xca
 [] vfree+0x29/0x2b
 [] kvm_free_physmem_slot+0x25/0x7c [kvm]
 [] kvm_free_physmem+0x27/0x36 [kvm]
 [] kvm_arch_destroy_vm+0xa6/0xda [kvm]
 [] kvm_put_kvm+0x8d/0xa7 [kvm]
 [] kvm_vcpu_release+0x13/0x17 [kvm]
 [] __fput+0xeb/0x1a3
 [] fput+0x15/0x17
 [] filp_close+0x67/0x72
 [] put_files_struct+0x74/0xc8
 [] exit_files+0x47/0x4f
 [] do_exit+0x1eb/0x7a7
 [] ? _spin_unlock_irq+0x2b/0x51
 [] do_group_exit+0x73/0xa0
 [] get_signal_to_deliver+0x30c/0x32c
 [] ? sysret_signal+0x19/0x29
 [] do_notify_resume+0x8c/0x851
 [] ? do_futex+0x90/0x92a
 [] ? trace_hardirqs_on_caller+0xf0/0x114
 [] ? _spin_unlock_irqrestore+0x4c/0x68
 [] ? __rcu_read_unlock+0x92/0x9e
 [] ? trace_hardirqs_on_caller+0xf0/0x114
 [] ? trace_hardirqs_on+0xd/0xf
 [] ? getnstimeofday+0x3a/0x96
 [] ? ktime_get_ts+0x49/0x4e
 [] ? sysret_signal+0x5/0x29
 [] ? trace_hardirqs_on_caller+0xf0/0x114
 [] ? sysret_signal+0x19/0x29
 [] ptregscall_common+0x67/0xb0
ttyS1: 26 input overrun(s)

And it's not specific to the vm shutdown path.
Another instance:

general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
last sysfs file: /sys/class/net/tap0/address
CPU 5
Modules linked in: ipt_REJECT xt_state xt_tcpudp iptable_filter ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables x_tables tun kvm_intel kvm bridge stp llc dm_multipath scsi_wait_scan ata_piix libata dm_snapshot dm_zero dm_mirror dm_region_hash dm_log dm_mod shpchp pci_hotplug mptsas mptscsih mptbase scsi_transport_sas uhci_hcd ohci_hcd ehci_hcd [last unloaded: x_tables]
Pid: 4440, comm: qemu-system-x86 Not tainted 2.6.28-00165-g4f27e3e-dirty #163
RIP: 0010:[] [] __purge_vmap_area_lazy+0x12c/0x163
RSP: 0018:ffff88011f4c7be8 EFLAGS: 00010246
RAX: 6b6b6b6b6b6b6b6b RBX: 6b6b6b6b6b6b6b2b RCX: 0000000000000003
RDX: ffffffff80a1dae0 RSI: ffff880028083980 RDI: 0000000000000001
RBP: ffff88011f4c7c28 R08: 0000000000000282 R09: ffffffff80a1bf50
R10: ffff88022e9dc0f8 R11: ffff88011f4c7b68 R12: ffff88011f4c7be8
R13: ffff88011f4c7c40 R14: ffff88011f4c7c48 R15: 0000000000008001
FS:  0000000040abf950(0063) GS:ffff88022f25ed18(0000) knlGS:0000000000000000
CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000000229d34000 CR4: 00000000000026e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process qemu-system-x86 (pid: 4440, threadinfo ffff88011f4c6000, task ffff88022e9dc0f8)
Stack:
 ffff8802291a14b0 ffff880229003d58 ffffc20021000000 0000000000008001
 ffff880229526000 0000000000000000 ffff88022d073000 ffff88011f58c0c0
 ffff88011f4c7c58 ffffffff8028ab1d ffff88011f4c7c58 ffffffffa015c000
Call Trace:
 [] free_unmap_vmap_area_noflush+0x69/0x70
 [] remove_vm_area+0x25/0x71
 [] __vunmap+0x3a/0xca
 [] vunmap+0x26/0x28
 [] pio_copy_data+0xcf/0x113 [kvm]
 [] ? trace_hardirqs_on+0xd/0xf
 [] complete_pio+0x99/0x1ef [kvm]
 [] ? sigprocmask+0xc6/0xd0
 [] kvm_arch_vcpu_ioctl_run+0x9a/0x889 [kvm]
 [] kvm_vcpu_ioctl+0xfc/0x48b [kvm]
 [] vfs_ioctl+0x2a/0x78
 [] ? __rcu_read_unlock+0x92/0x9e
 [] do_vfs_ioctl+0x398/0x3c6
 [] ? trace_hardirqs_on+0xd/0xf
 [] sys_ioctl+0x42/0x65
 [] system_call_fastpath+0x16/0x1b
Code: 46 48 c7 c7 c0 d1 74 80 4c 8d 65 c0 e8 8c da 2f 00 48 8b 45 c0 48 8d 58 c0 eb 10 48 89 df e8 74 fe ff ff 48 8b 43 40 48 8d 58 c0 <48> 8b 43 40 0f 18 08 48 8d 43 40 4c 39 e0 75 e0 48 c7 c7 c0 d1
RIP [] __purge_vmap_area_lazy+0x12c/0x163
 RSP 
---[ end trace 31811279a2e983e8 ]---
note: qemu-system-x86[4440] exited with preempt_count 2

(gdb) l *(__purge_vmap_area_lazy + 0x12c)
0xffffffff80289ca2 is in __purge_vmap_area_lazy (mm/vmalloc.c:516).
511                     if (nr || force_flush)
512                             flush_tlb_kernel_range(*start, *end);
513
514                     if (nr) {
515                             spin_lock(&vmap_area_lock);
516                             list_for_each_entry(va, &valist, purge_list)
517                                     __free_vmap_area(va);
518                             spin_unlock(&vmap_area_lock);
519                     }
520                     spin_unlock(&purge_lock);

0xffffffff80289c9a <__purge_vmap_area_lazy+292>:  mov    0x40(%rbx),%rax
0xffffffff80289c9e <__purge_vmap_area_lazy+296>:  lea    -0x40(%rax),%rbx
0xffffffff80289ca2 <__purge_vmap_area_lazy+300>:  mov    0x40(%rbx),%rax
                                                  ^^^^^^^^^^^^^^^^^^^
0xffffffff80289ca6 <__purge_vmap_area_lazy+304>:  prefetcht0 (%rax)

Which vanishes once PREEMPT_RCU is disabled. Nick? KVM does not make
direct use of RCU.

Same issue happens if the entire __purge_vmap_area_lazy runs with
vmap_area_lock held.