From mboxrd@z Thu Jan  1 00:00:00 1970
From: Marcelo Tosatti 
Subject: __purge_vmap_area_lazy crash with CONFIG_PREEMPT_RCU=y
Date: Mon, 29 Dec 2008 12:58:21 -0200
Message-ID: <20081229145821.GA3823@amt.cnet>
References: <20081210202326.GA7565@dmt.cnet> <49523031.1000305@redhat.com> <20081224152844.GE29319@random.random>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: Avi Kivity , kvm-devel 
To: Andrea Arcangeli , Nick Piggin 
Return-path: 
Received: from mx2.redhat.com ([66.187.237.31]:52652 "EHLO mx2.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750835AbYL2O6n (ORCPT ); Mon, 29 Dec 2008 09:58:43 -0500
Content-Disposition: inline
In-Reply-To: <20081224152844.GE29319@random.random>
Sender: kvm-owner@vger.kernel.org
List-ID: 

On Wed, Dec 24, 2008 at 04:28:44PM +0100, Andrea Arcangeli wrote:
> On Wed, Dec 24, 2008 at 02:50:57PM +0200, Avi Kivity wrote:
> > Marcelo Tosatti wrote:
> >> The destructor for huge pages uses the backing inode for adjusting
> >> hugetlbfs accounting.
> >>
> >> Hugepage mappings are destroyed by exit_mmap, after
> >> mmu_notifier_release, so there are no notifications through
> >> unmap_hugepage_range at this point.
> >>
> >> The hugetlbfs inode can be freed with pages backed by it referenced
> >> by the shadow. When the shadow releases its reference, the huge page
> >> destructor will access a now freed inode.
> >>
> >> Implement the release operation for kvm mmu notifiers to release page
> >> refs before the hugetlbfs inode is gone.
> >>
> >
> > I see this isn't it. Andrea, comments?
>
> Yeah, the patch looks good, I talked a bit with Marcelo about this by
> PM.
> The issue is that it's not as straightforward as it seems:
> basically, when I implemented the ->release handlers and had spte
> teardown running before the files were closed (instead of waiting for
> the kvm anon inode release handler to fire), I was getting bugchecks
> from debug options including preempt=y (certain debug checks only
> become functional with preempt enabled, unfortunately). So eventually
> I removed ->release: for kvm it wasn't useful anyway, because no
> guest mode can run any more by the time mmu notifier ->release is
> invoked, and removing it avoided the issues with the bugchecks.
>
> We'll be using the mmu notifiers' ->release because it's always called
> just before the filehandles are destroyed; it's not really about
> guest mode or the secondary mmu, just an ordering issue with hugetlbfs
> internals.
>
> So in short, if no bugcheck triggers, this is fine (at least until
> hugetlbfs provides a way to register some callback to invoke at the
> start of the hugetlbfs->release handler).
The only bugcheck I see, which triggers on vanilla kvm upstream with
CONFIG_PREEMPT_DEBUG=y and CONFIG_PREEMPT_RCU=y, is:

general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
<4>ttyS1: 1 input overrun(s)
last sysfs file: /sys/class/net/tap0/address
CPU 0
Modules linked in: tun ipt_MASQUERADE iptable_nat nf_nat bridge stp llc nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_tcpudp ipt_REJECT iptable_filter ip_tables x_tables dm_multipath kvm_intel kvm scsi_wait_scan ata_piix libata dm_snapshot dm_zero dm_mirror dm_region_hash dm_log dm_mod shpchp pci_hotplug mptsas mptscsih mptbase scsi_transport_sas uhci_hcd ohci_hcd ehci_hcd
Pid: 4768, comm: qemu-system-x86 Not tainted 2.6.28-00165-g4f27e3e-dirty #164
RIP: 0010:[] [] __purge_vmap_area_lazy+0x12c/0x163
RSP: 0018:ffff88021e1f9a38 EFLAGS: 00010202
RAX: 6b6b6b6b6b6b6b6b RBX: 6b6b6b6b6b6b6b2b RCX: 0000000000000003
RDX: ffffffff80a1dae0 RSI: ffff880028083980 RDI: 0000000000000001
RBP: ffff88021e1f9a78 R08: 0000000000000286 R09: ffffffff80a1bf50
R10: ffff880119c270f8 R11: ffff88021e1f99b8 R12: ffff88021e1f9a38
R13: ffff88021e1f9a90 R14: ffff88021e1f9a98 R15: 000000000000813a
FS:  0000000000000000(0000) GS:ffffffff8080d900(0000) knlGS:0000000000000000
CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
CR2: 00000000008d9828 CR3: 0000000000201000 CR4: 00000000000026e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process qemu-system-x86 (pid: 4768, threadinfo ffff88021e1f8000, task ffff880119c270f8)
Stack:
 ffff88022bdfd840 ffff880119da11b8 ffffc20011c30000 000000000000813a
 0000000000000000 0000000000000001 ffff88022ec11c18 ffff88022f061838
 ffff88021e1f9aa8 ffffffff8028ab1d ffff88021e1f9aa8 ffffc20021976000
Call Trace:
 [] free_unmap_vmap_area_noflush+0x69/0x70
 [] remove_vm_area+0x25/0x71
 [] __vunmap+0x3a/0xca
 [] vfree+0x29/0x2b
 [] kvm_free_physmem_slot+0x25/0x7c [kvm]
 [] kvm_free_physmem+0x27/0x36 [kvm]
 [] kvm_arch_destroy_vm+0xa6/0xda [kvm]
 [] kvm_put_kvm+0x8d/0xa7 [kvm]
 [] kvm_vcpu_release+0x13/0x17 [kvm]
 [] __fput+0xeb/0x1a3
 [] fput+0x15/0x17
 [] filp_close+0x67/0x72
 [] put_files_struct+0x74/0xc8
 [] exit_files+0x47/0x4f
 [] do_exit+0x1eb/0x7a7
 [] ? _spin_unlock_irq+0x2b/0x51
 [] do_group_exit+0x73/0xa0
 [] get_signal_to_deliver+0x30c/0x32c
 [] ? sysret_signal+0x19/0x29
 [] do_notify_resume+0x8c/0x851
 [] ? do_futex+0x90/0x92a
 [] ? trace_hardirqs_on_caller+0xf0/0x114
 [] ? _spin_unlock_irqrestore+0x4c/0x68
 [] ? __rcu_read_unlock+0x92/0x9e
 [] ? trace_hardirqs_on_caller+0xf0/0x114
 [] ? trace_hardirqs_on+0xd/0xf
 [] ? getnstimeofday+0x3a/0x96
 [] ? ktime_get_ts+0x49/0x4e
 [] ? sysret_signal+0x5/0x29
 [] ? trace_hardirqs_on_caller+0xf0/0x114
 [] ? sysret_signal+0x19/0x29
 [] ptregscall_common+0x67/0xb0
Code: 46 48 c7 c7 c0 d1 74 80 4c 8d 65 c0 e8 0c db 2f 00 48 8b 45 c0 48 8d 58 c0 eb 10 48 89 df e8 74 fe ff ff 48 8b 43 40 48 8d 58 c0 <48> 8b 43 40 0f 18 08 48 8d 43 40 4c 39 e0 75 e0 48 c7 c7 c0 d1
RIP [] __purge_vmap_area_lazy+0x12c/0x163
 RSP 
---[ end trace fde3e64ebe4bbca2 ]---
Fixing recursive fault but reboot is needed!
BUG: scheduling while atomic: qemu-system-x86/4768/0x00000003
Modules linked in: tun ipt_MASQUERADE iptable_nat nf_nat bridge stp llc nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_tcpudp ipt_REJECT iptable_filter ip_tables x_tables dm_multipath kvm_intel kvm scsi_wait_scan ata_piix libata dm_snapshot dm_zero dm_mirror dm_region_hash dm_log dm_mod shpchp pci_hotplug mptsas mptscsih mptbase scsi_transport_sas uhci_hcd ohci_hcd ehci_hcd
Pid: 4768, comm: qemu-system-x86 Tainted: G D 2.6.28-00165-g4f27e3e-dirty #164
Call Trace:
 [] ? __debug_show_held_locks+0x1b/0x24
 [] __schedule_bug+0x8c/0x95
 [] schedule+0xd3/0x902
 [] ? trace_hardirqs_on+0xd/0xf
 [] ? put_io_context+0x67/0x72
 [] do_exit+0xda/0x7a7
 [] oops_begin+0x0/0x90
 [] die+0x5d/0x66
 [] do_general_protection+0x128/0x130
 [] ? do_general_protection+0x0/0x130
 [] error_exit+0x0/0xa9
 [] ? __purge_vmap_area_lazy+0x12c/0x163
 [] ? __purge_vmap_area_lazy+0x124/0x163
 [] free_unmap_vmap_area_noflush+0x69/0x70
 [] remove_vm_area+0x25/0x71
 [] __vunmap+0x3a/0xca
 [] vfree+0x29/0x2b
 [] kvm_free_physmem_slot+0x25/0x7c [kvm]
 [] kvm_free_physmem+0x27/0x36 [kvm]
 [] kvm_arch_destroy_vm+0xa6/0xda [kvm]
 [] kvm_put_kvm+0x8d/0xa7 [kvm]
 [] kvm_vcpu_release+0x13/0x17 [kvm]
 [] __fput+0xeb/0x1a3
 [] fput+0x15/0x17
 [] filp_close+0x67/0x72
 [] put_files_struct+0x74/0xc8
 [] exit_files+0x47/0x4f
 [] do_exit+0x1eb/0x7a7
 [] ? _spin_unlock_irq+0x2b/0x51
 [] do_group_exit+0x73/0xa0
 [] get_signal_to_deliver+0x30c/0x32c
 [] ? sysret_signal+0x19/0x29
 [] do_notify_resume+0x8c/0x851
 [] ? do_futex+0x90/0x92a
 [] ? trace_hardirqs_on_caller+0xf0/0x114
 [] ? _spin_unlock_irqrestore+0x4c/0x68
 [] ? __rcu_read_unlock+0x92/0x9e
 [] ? trace_hardirqs_on_caller+0xf0/0x114
 [] ? trace_hardirqs_on+0xd/0xf
 [] ? getnstimeofday+0x3a/0x96
 [] ? ktime_get_ts+0x49/0x4e
 [] ? sysret_signal+0x5/0x29
 [] ? trace_hardirqs_on_caller+0xf0/0x114
 [] ? sysret_signal+0x19/0x29
 [] ptregscall_common+0x67/0xb0
ttyS1: 26 input overrun(s)

And it's not specific to the vm shutdown path.
Another instance:

general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
last sysfs file: /sys/class/net/tap0/address
CPU 5
Modules linked in: ipt_REJECT xt_state xt_tcpudp iptable_filter ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables x_tables tun kvm_intel kvm bridge stp llc dm_multipath scsi_wait_scan ata_piix libata dm_snapshot dm_zero dm_mirror dm_region_hash dm_log dm_mod shpchp pci_hotplug mptsas mptscsih mptbase scsi_transport_sas uhci_hcd ohci_hcd ehci_hcd [last unloaded: x_tables]
Pid: 4440, comm: qemu-system-x86 Not tainted 2.6.28-00165-g4f27e3e-dirty #163
RIP: 0010:[] [] __purge_vmap_area_lazy+0x12c/0x163
RSP: 0018:ffff88011f4c7be8 EFLAGS: 00010246
RAX: 6b6b6b6b6b6b6b6b RBX: 6b6b6b6b6b6b6b2b RCX: 0000000000000003
RDX: ffffffff80a1dae0 RSI: ffff880028083980 RDI: 0000000000000001
RBP: ffff88011f4c7c28 R08: 0000000000000282 R09: ffffffff80a1bf50
R10: ffff88022e9dc0f8 R11: ffff88011f4c7b68 R12: ffff88011f4c7be8
R13: ffff88011f4c7c40 R14: ffff88011f4c7c48 R15: 0000000000008001
FS:  0000000040abf950(0063) GS:ffff88022f25ed18(0000) knlGS:0000000000000000
CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000000229d34000 CR4: 00000000000026e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process qemu-system-x86 (pid: 4440, threadinfo ffff88011f4c6000, task ffff88022e9dc0f8)
Stack:
 ffff8802291a14b0 ffff880229003d58 ffffc20021000000 0000000000008001
 ffff880229526000 0000000000000000 ffff88022d073000 ffff88011f58c0c0
 ffff88011f4c7c58 ffffffff8028ab1d ffff88011f4c7c58 ffffffffa015c000
Call Trace:
 [] free_unmap_vmap_area_noflush+0x69/0x70
 [] remove_vm_area+0x25/0x71
 [] __vunmap+0x3a/0xca
 [] vunmap+0x26/0x28
 [] pio_copy_data+0xcf/0x113 [kvm]
 [] ? trace_hardirqs_on+0xd/0xf
 [] complete_pio+0x99/0x1ef [kvm]
 [] ? sigprocmask+0xc6/0xd0
 [] kvm_arch_vcpu_ioctl_run+0x9a/0x889 [kvm]
 [] kvm_vcpu_ioctl+0xfc/0x48b [kvm]
 [] vfs_ioctl+0x2a/0x78
 [] ? __rcu_read_unlock+0x92/0x9e
 [] do_vfs_ioctl+0x398/0x3c6
 [] ? trace_hardirqs_on+0xd/0xf
 [] sys_ioctl+0x42/0x65
 [] system_call_fastpath+0x16/0x1b
Code: 46 48 c7 c7 c0 d1 74 80 4c 8d 65 c0 e8 8c da 2f 00 48 8b 45 c0 48 8d 58 c0 eb 10 48 89 df e8 74 fe ff ff 48 8b 43 40 48 8d 58 c0 <48> 8b 43 40 0f 18 08 48 8d 43 40 4c 39 e0 75 e0 48 c7 c7 c0 d1
RIP [] __purge_vmap_area_lazy+0x12c/0x163
 RSP 
---[ end trace 31811279a2e983e8 ]---
note: qemu-system-x86[4440] exited with preempt_count 2

(gdb) l *(__purge_vmap_area_lazy + 0x12c)
0xffffffff80289ca2 is in __purge_vmap_area_lazy (mm/vmalloc.c:516).
511                     if (nr || force_flush)
512                             flush_tlb_kernel_range(*start, *end);
513
514                     if (nr) {
515                             spin_lock(&vmap_area_lock);
516                             list_for_each_entry(va, &valist, purge_list)
517                                     __free_vmap_area(va);
518                             spin_unlock(&vmap_area_lock);
519                     }
520                     spin_unlock(&purge_lock);

0xffffffff80289c9a <__purge_vmap_area_lazy+292>:  mov    0x40(%rbx),%rax
0xffffffff80289c9e <__purge_vmap_area_lazy+296>:  lea    -0x40(%rax),%rbx
0xffffffff80289ca2 <__purge_vmap_area_lazy+300>:  mov    0x40(%rbx),%rax
                                                  ^^^^^^^^^^^^^^^^^^^
0xffffffff80289ca6 <__purge_vmap_area_lazy+304>:  prefetcht0 (%rax)

Which vanishes once PREEMPT_RCU is disabled. Nick? KVM does not make
direct use of RCU.

Same issue happens if the entire __purge_vmap_area_lazy runs with
vmap_area_lock held.