* KVM Guest mmap.c bug
From: BRUNO CESAR RIBAS @ 2010-03-02 20:25 UTC (permalink / raw)
To: kvm
Hi,
I run a bunch of virtual servers using KVM, and I hit an mmap.c bug on the
guest machines. The virtual machines are "desktop servers" for thin clients.
My host is running a 2.6.33 kernel and has 32GB of RAM, an Opteron with
AMD-V.
The guests are running 2.6.27.45 (I also tried 2.6.31.12, 2.6.32.9, and
2.6.33); individual guests use 10GB, 4GB, or 20GB of RAM.
My qemu-kvm version is 0.12.3
All guests are using NFSROOT as the ROOT FS and virtio as the network
driver.
I run the guest with:
kvm -cpu kvm64 -smp 4 -vnc :101 -daemonize -name ${NOME} -localtime -m $RAM
-net nic,macaddr=$VLAN0,model=virtio,vlan=0 -net tap,vlan=0,ifname=${NOME}0\
-net nic,macaddr=$VLAN121,model=virtio,vlan=121 -net tap,vlan=121,ifname=${NOME}121\
-net nic,macaddr=$VLAN112,model=virtio,vlan=112 -net tap,vlan=112,ifname=${NOME}112\
-kernel /root/vmlinuz-2.6.27.45-amd64-aufs-guest \
-append "root=/dev/nfs rw ip=dhcp nfsroot=$5 init=/sbin/boot.sh"
I have a dedicated machine running an identical kernel (without the virtio
bits, as it does not have AMD-V), and it stays up for days and even months.
But when running a guest machine with qemu-kvm I get BUG messages and lots
of processes in D state, and I can't run 'ps aux' or look inside /proc and
/sys without losing my shell (it hangs).
On the console I get the following message, repeated for different
processors, different PIDs, and different mmap.c lines (line 486 appears too).
------------[ cut here ]------------
kernel BUG at mm/mmap.c:869!
invalid opcode: 0000 [1] SMP
CPU 2
Pid: 31334, comm: nautilus Not tainted 2.6.27.45-amd64-aufs-guest-00267
#2
RIP: 0010:[<ffffffff8027b2e1>] [<ffffffff8027b2e1>] find_mergeable_anon_vma+0x1f1/0x200
RSP: 0000:ffff8804d933fb38 EFLAGS: 00010283
RAX: ffff8804cb44b9a8 RBX: ffff8804cb44b978 RCX: ffff8804fe6d3088
RDX: 00000000f4803000 RSI: ffff8804fe6d3088 RDI: ffff88049fa56138
RBP: ffff88049fa56138 R08: ffff8804d933e000 R09: 0000000000000000
R10: 00000000ffffffff R11: 00000000ffffffff R12: 0000000000100073
R13: 0000000000100073 R14: 00000000f4803000 R15: ffffffff806ce6c0
FS: 0000000000000000(0000) GS:ffff88051cc7d440(0063) knlGS:00000000f41
CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b
CR2: 00000000f4803000 CR3: 00000004a7d39000 CR4: 00000000000006a0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process nautilus (pid: 31334, threadinfo ffff8804d933e000, task ffff880
)
Stack: ffffffff8052e62d 0000000000000000 0000000000000000 ffff88049fa5
ffff88051a5aac40 ffffffff80280382 ffff8804cb41b790 ffff880498919018
0000000000000000 ffff88049f8dad20 00003ffffffff000 ffffffff802770aa
Call Trace:
[<ffffffff8052e62d>] ? _spin_lock_irq+0xd/0x10
[<ffffffff80280382>] ? anon_vma_prepare+0x52/0x100
[<ffffffff802770aa>] ? handle_mm_fault+0x65a/0x900
[<ffffffff802de6d8>] ? proc_alloc_inode+0x58/0x90
[<ffffffff8052e545>] ? __down_read+0x85/0xbc
[<ffffffff80223331>] ? do_page_fault+0x2a1/0xab0
[<ffffffff803d6899>] ? vsnprintf+0x4d9/0x750
[<ffffffff8029d7a1>] ? do_lookup+0x81/0x240
[<ffffffff8027265d>] ? zone_statistics+0x7d/0x80
[<ffffffff8052ea3a>] ? error_exit+0x0/0x70
[<ffffffff803d706d>] ? copy_user_generic_string+0x2d/0x40
[<ffffffff802e35ec>] ? proc_file_read+0x12c/0x2e0
[<ffffffff802e34c0>] ? proc_file_read+0x0/0x2e0
[<ffffffff802dec1a>] ? proc_reg_read+0x8a/0xe0
[<ffffffff80295995>] ? vfs_read+0xb5/0x160
[<ffffffff80295b2e>] ? sys_read+0x4e/0x90
[<ffffffff80227004>] ? ia32_sysret+0x0/0x5
Code: 29 d0 48 c1 e8 0c 48 01 f8 48 3b 83 88 00 00 00 0f 85 5b fe ff ff
78 e9 c5 fe ff ff 0f 1f 00 31 f6 31 db e9 a9 fe ff ff <0f> 0b eb fe 66
1f 84 00 00 00 00 00 48 83 ec 08 48 8b
RIP [<ffffffff8027b2e1>] find_mergeable_anon_vma+0x1f1/0x200
RSP <ffff8804d933fb38>
---[ end trace e5ca25224cd7d1d4 ]---
Does anyone have a suggestion? Where should I look? What else should I trace?
Thanks in advance,
--
Bruno Ribas - ribas@c3sl.ufpr.br
http://www.inf.ufpr.br/ribas
C3SL: http://www.c3sl.ufpr.br
* Re: KVM Guest mmap.c bug
From: Avi Kivity @ 2010-03-08 13:32 UTC (permalink / raw)
To: BRUNO CESAR RIBAS; +Cc: kvm, Andrea Arcangeli
On 03/02/2010 10:25 PM, BRUNO CESAR RIBAS wrote:
> Hi,
>
> I run a bunch of virtual servers using KVM, and I hit an mmap.c bug on the
> guest machines. The virtual machines are "desktop servers" for thin clients.
>
> My host is running a 2.6.33 kernel and has 32GB of RAM, an Opteron with
> AMD-V.
>
> The guests are running 2.6.27.45 (I also tried 2.6.31.12, 2.6.32.9, and
> 2.6.33); individual guests use 10GB, 4GB, or 20GB of RAM.
>
> My qemu-kvm version is 0.12.3
>
> All guests are using NFSROOT as the ROOT FS and virtio as the network
> driver.
>
> I run the guest with:
> kvm -cpu kvm64 -smp 4 -vnc :101 -daemonize -name ${NOME} -localtime -m $RAM
> -net nic,macaddr=$VLAN0,model=virtio,vlan=0 -net tap,vlan=0,ifname=${NOME}0\
> -net nic,macaddr=$VLAN121,model=virtio,vlan=121 -net tap,vlan=121,ifname=${NOME}121\
> -net nic,macaddr=$VLAN112,model=virtio,vlan=112 -net tap,vlan=112,ifname=${NOME}112\
> -kernel /root/vmlinuz-2.6.27.45-amd64-aufs-guest \
> -append "root=/dev/nfs rw ip=dhcp nfsroot=$5 init=/sbin/boot.sh"
>
>
> I have a dedicated machine running an identical kernel (without the virtio
> bits, as it does not have AMD-V), and it stays up for days and even months.
> But when running a guest machine with qemu-kvm I get BUG messages and lots
> of processes in D state, and I can't run 'ps aux' or look inside /proc and
> /sys without losing my shell (it hangs).
>
>
> On the console I get the following message, repeated for different
> processors, different PIDs, and different mmap.c lines (line 486 appears too).
>
> ------------[ cut here ]------------
> kernel BUG at mm/mmap.c:869!
> invalid opcode: 0000 [1] SMP
> CPU 2
> Pid: 31334, comm: nautilus Not tainted 2.6.27.45-amd64-aufs-guest-00267
> #2
> RIP: 0010:[<ffffffff8027b2e1>] [<ffffffff8027b2e1>] find_mergeable_anon_vma+0x1f1/0x200
> RSP: 0000:ffff8804d933fb38 EFLAGS: 00010283
> RAX: ffff8804cb44b9a8 RBX: ffff8804cb44b978 RCX: ffff8804fe6d3088
> RDX: 00000000f4803000 RSI: ffff8804fe6d3088 RDI: ffff88049fa56138
> RBP: ffff88049fa56138 R08: ffff8804d933e000 R09: 0000000000000000
> R10: 00000000ffffffff R11: 00000000ffffffff R12: 0000000000100073
> R13: 0000000000100073 R14: 00000000f4803000 R15: ffffffff806ce6c0
> FS: 0000000000000000(0000) GS:ffff88051cc7d440(0063) knlGS:00000000f41
> CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b
> CR2: 00000000f4803000 CR3: 00000004a7d39000 CR4: 00000000000006a0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process nautilus (pid: 31334, threadinfo ffff8804d933e000, task ffff880
> )
> Stack: ffffffff8052e62d 0000000000000000 0000000000000000 ffff88049fa5
> ffff88051a5aac40 ffffffff80280382 ffff8804cb41b790 ffff880498919018
> 0000000000000000 ffff88049f8dad20 00003ffffffff000 ffffffff802770aa
> Call Trace:
> [<ffffffff8052e62d>] ? _spin_lock_irq+0xd/0x10
> [<ffffffff80280382>] ? anon_vma_prepare+0x52/0x100
> [<ffffffff802770aa>] ? handle_mm_fault+0x65a/0x900
> [<ffffffff802de6d8>] ? proc_alloc_inode+0x58/0x90
> [<ffffffff8052e545>] ? __down_read+0x85/0xbc
> [<ffffffff80223331>] ? do_page_fault+0x2a1/0xab0
> [<ffffffff803d6899>] ? vsnprintf+0x4d9/0x750
> [<ffffffff8029d7a1>] ? do_lookup+0x81/0x240
> [<ffffffff8027265d>] ? zone_statistics+0x7d/0x80
> [<ffffffff8052ea3a>] ? error_exit+0x0/0x70
> [<ffffffff803d706d>] ? copy_user_generic_string+0x2d/0x40
> [<ffffffff802e35ec>] ? proc_file_read+0x12c/0x2e0
> [<ffffffff802e34c0>] ? proc_file_read+0x0/0x2e0
> [<ffffffff802dec1a>] ? proc_reg_read+0x8a/0xe0
> [<ffffffff80295995>] ? vfs_read+0xb5/0x160
> [<ffffffff80295b2e>] ? sys_read+0x4e/0x90
> [<ffffffff80227004>] ? ia32_sysret+0x0/0x5
>
>
> Code: 29 d0 48 c1 e8 0c 48 01 f8 48 3b 83 88 00 00 00 0f 85 5b fe ff ff
> 78 e9 c5 fe ff ff 0f 1f 00 31 f6 31 db e9 a9 fe ff ff<0f> 0b eb fe 66
> 1f 84 00 00 00 00 00 48 83 ec 08 48 8b
> RIP [<ffffffff8027b2e1>] find_mergeable_anon_vma+0x1f1/0x200
> RSP<ffff8804d933fb38>
> ---[ end trace e5ca25224cd7d1d4 ]---
>
>
> Does anyone have a suggestion? Where should I look? What else should I trace?
>
>
It looks unrelated to kvm, though of course random memory corruption
cannot be ruled out.
Is npt enabled on the host (cat /sys/module/kvm_amd/parameters/npt)?
Andrea, any idea?
--
error compiling committee.c: too many arguments to function
* Re: KVM Guest mmap.c bug
From: Andrea Arcangeli @ 2010-03-08 14:49 UTC (permalink / raw)
To: Avi Kivity; +Cc: BRUNO CESAR RIBAS, kvm
On Mon, Mar 08, 2010 at 03:32:19PM +0200, Avi Kivity wrote:
> It looks unrelated to kvm, though of course random memory corruption
> cannot be ruled out.
>
> Is npt enabled on the host (cat /sys/module/kvm_amd/parameters/npt)?
>
> Andrea, any idea?
Basically, find_vma(vma->vm_mm, vma->vm_start) doesn't return "vma",
despite "vma" being the one with the smallest vm_end for which the
comparison "vma->vm_start < vm_end" is true (the next vma is NULL, and
the prev will have vma->vm_start == prev->vm_end, not <).
The BUG check looks right; it doesn't seem to be a false positive, and
this bugcheck indicates that the vma rbtree is memory-corrupted somehow.
So yes, fiddling with npt on and off sounds like a good start; if it's a
bug in shadow paging, it's unlikely the exact same bug materializes both
with npt and without. If the crash happens with npt both on and off, then
maybe it's not hypervisor related. It could also be bad RAM, if it only
happens on a single host and all other hosts are fine with the same binary
guest/host kernels (the rbtree walk might stress the memory bus more than
other operations). That said, vm_next being NULL (and if it's NULL, the
vm_next pointer likely has no RAM bitflip) is a weird and uncommon
scenario, and this page fault seems to be triggered by a procfs copy_user
call, which is non-standard, so maybe this is a guest bug. It would be
interesting to know the vm_start address; at the end of the address space
there are the stack, vdso and vsyscall areas.
* Re: KVM Guest mmap.c bug
From: Bruno Cesar Ribas @ 2010-03-09 18:46 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Avi Kivity, kvm
On Mon, Mar 08, 2010 at 03:49:01PM +0100, Andrea Arcangeli wrote:
> On Mon, Mar 08, 2010 at 03:32:19PM +0200, Avi Kivity wrote:
> > It looks unrelated to kvm, though of course random memory corruption
> > cannot be ruled out.
> >
> > Is npt enabled on the host (cat /sys/module/kvm_amd/parameters/npt)?
> >
> > Andrea, any idea?
>
> Basically find_vma(vma->vm_mm, vma->vm_start) doesn't return "vma"
> despite "vma" is the one with the smaller vm_end where the comparison
> "vma->vm_start < vma->vm_end" is true (the next vma is null and the
> prev will have vma->vm_start == prev->vm_end, not <).
>
> The bug check looks right, it doesn't seem false positive and this
> bugcheck indicates that the vma rbtree is memory-corrupted somehow.
>
> so yes fiddling with npt on and off sounds a good start, if it's a bug
I can confirm it happens with npt on and off.
And it also happens on a Nehalem XEON (it just happened).
> in shadow paging it's unlikely the exact same bug materializes with
> both npt and without. If the crash happens with npt on and off, then
> maybe it's not hypervisor related. Could also be bad RAM if it only
I doubt it is bad RAM! This machine has been working (without KVM) for
almost 2 years, and MCE does not report any problems on the host machine.
And it happens on two identical machines (Opteron) and now on the new
(5-day-old) Intel Nehalem XEON.
All guests are running the same kernel. It happens with a kernel compiled
by me and with one from Debian sid, both 2.6.32.9, and with the previous
kernels I tried (2.6.31.12 and 2.6.27.45).
> happens on a single host and all other hosts are fine with same binary
> guest/host kernels (rbtree walk might stress the memory bus more than
> other operations). Said that vm_next being null (and if it's null,
> likely vm_next pointer has no ram bitflip) is a bit weird and not
> common scenario and this page fault seems triggered with procfs
> copy_user call which is non standard, so maybe this is a guest bug. It
> would be interesting to know what is the vm_start address, at the end
> there are stack, vdso and vsyscall areas.
I'll make it print vm_start for next reboot.
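A change along these lines might look as follows (a hypothetical, untested
fragment against 2.6.27's find_mergeable_anon_vma(); "near" follows the
naming in that function, "found" is mine):

```c
/* mm/mmap.c, find_mergeable_anon_vma(): instead of the bare BUG_ON,
 * dump the addresses involved before dying, so the console shows
 * which region the corrupted lookup hit. */
struct vm_area_struct *found;

found = find_vma_prev(vma->vm_mm, vma->vm_start, &near);
if (found != vma) {
	printk(KERN_ERR "vma lookup mismatch: vm_start=%lx vm_end=%lx "
			"found=%p expected=%p near=%p\n",
	       vma->vm_start, vma->vm_end, found, vma, near);
	BUG();
}
```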
--
Bruno Ribas - ribas@c3sl.ufpr.br
http://www.inf.ufpr.br/ribas
C3SL: http://www.c3sl.ufpr.br