KVM guest crashes

public inbox for kvm@vger.kernel.org

* KVM guest crashes
@ 2009-01-20 15:49 Alexander Graf
  2009-01-20 20:07 ` Avi Kivity
  0 siblings, 1 reply; 18+ messages in thread
From: Alexander Graf @ 2009-01-20 15:49 UTC (permalink / raw)
  To: kvm@vger.kernel.org; +Cc: Avi Kivity, Marcelo Tosatti, Joerg Roedel, Sheng Yang

Hi list,

recently I've been hitting some KVM bugs others seem to have reported as
well, including

- CIFS timeouts
- Stuck ?? errors
- Random segmentation faults in the guest

so I figured I'd put together a stress test that can be used to
reproduce these issues. It works by mounting a CIFS share from the host
and unpacking data from that mount back onto the same mount. I have been
able to bring KVM to its knees a lot just by doing this.
Simply run the test in an endless loop. FWIW, enabling NPT helps
triggering the issue.
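
A minimal sketch of such an endless-loop harness, for illustration only. The actual scripts in kvm-tests.tar.bz2 are not shown in this thread, so the function name `run_until_failure` and its parameters are assumptions. It reruns a command until it exits non-zero or exceeds a timeout, which is one way to catch a guest that hangs rather than crashes.

```python
import subprocess
import sys


def run_until_failure(cmd, max_iterations=None, timeout=None):
    """Run cmd repeatedly.

    Returns the 1-based iteration at which the command first failed
    (non-zero exit status or timeout), or None if max_iterations runs
    all succeeded. With max_iterations=None the loop is endless.
    """
    iteration = 0
    while max_iterations is None or iteration < max_iterations:
        iteration += 1
        try:
            result = subprocess.run(cmd, timeout=timeout)
            failed = result.returncode != 0
        except subprocess.TimeoutExpired:
            # A hung run counts as a failure too.
            failed = True
        if failed:
            return iteration
    return None
```

Usage would look like `run_until_failure(["./run-test.sh"], timeout=3600)`, where `./run-test.sh` is a hypothetical wrapper around one test pass.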

The guest kernels included here are openSUSE 11.0 (2.6.25) and 11.1
(2.6.27) kernels.

Find the tests here: http://alex.csgraf.de/kvm-tests.tar.bz2
And some logs here (NPT enabled): http://alex.csgraf.de/kvm-logs.tar.bz2

I'm somewhat lost on the reason for these failures, so if you do have
some time on your hands, please give me a hand debugging this! If I
had to guess, I'd say it's an APIC issue and/or guest memory
corruption.

Alex


* Re: KVM guest crashes
  2009-01-20 15:49 KVM guest crashes Alexander Graf
@ 2009-01-20 20:07 ` Avi Kivity
  2009-01-20 20:20   ` Alexander Graf
  2009-01-21  8:14   ` Alexander Graf
  0 siblings, 2 replies; 18+ messages in thread
From: Avi Kivity @ 2009-01-20 20:07 UTC (permalink / raw)
  To: Alexander Graf
  Cc: kvm@vger.kernel.org, Marcelo Tosatti, Joerg Roedel, Sheng Yang

Alexander Graf wrote:
> Hi list,
>
> recently I've been hitting some KVM bugs others seem to have reported as
> well, including
>
> - CIFS timeouts
> - Stuck ?? errors
> - Random segmentation faults in the guest
>
> so I figured, I'll put together a stress test that can be used to
> reproduce these issues. This is done by using a CIFS mount on the host
> and unpacking data from that mount to the mount. I have been able to
> bring kvm down to its knees a lot just by doing this.
> Simply run the test in an endless-loop. FWIW enabling NPT helps
> triggering the issue.
>
>   

Are the problems specific to AMD?  What does "helps triggering" mean - 
does it happen with NPT disabled?

> The guest kernels included here are openSUSE 11.0 (2.6.25) and 11.1
> (2.6.27) kernels.
>
> Find the tests here: http://alex.csgraf.de/kvm-tests.tar.bz2
> And some logs here (NPT enabled): http://alex.csgraf.de/kvm-logs.tar.bz2
>
> I'm somewhat lost on the reason for these failures, so if you do have
> some time on your hands, please give me a hand debugging this! If I'd
> had to guess, I'd say it's either an APIC issue and/or guest memory
> corruption.
>   

I'd guess memory corruption.

Does running a uniprocessor guest help?  What about a uniprocessor guest 
pinned to one host core?
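
A sketch of how the two suggested configurations could be launched, assuming the `qemu-kvm` binary, its `-smp`, `-m` and `-nographic` options, and util-linux `taskset` for pinning. The helper name `build_guest_cmd` and its defaults are illustrative, and kernel, initrd and network arguments are left out.

```python
def build_guest_cmd(qemu="qemu-kvm", smp=1, host_core=None, mem_mb=2000):
    """Build an argv list for a guest with `smp` vCPUs.

    If host_core is given, the whole QEMU process is pinned to that
    host core with `taskset -c`.
    """
    cmd = []
    if host_core is not None:
        cmd += ["taskset", "-c", str(host_core)]
    cmd += [qemu, "-m", str(mem_mb), "-nographic"]
    if smp > 1:
        # Without -smp, QEMU starts a uniprocessor guest.
        cmd += ["-smp", str(smp)]
    return cmd


# Uniprocessor guest, unpinned, then pinned to host core 6.
print(" ".join(build_guest_cmd()))
print(" ".join(build_guest_cmd(host_core=6)))
```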

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.



* Re: KVM guest crashes
  2009-01-20 20:07 ` Avi Kivity
@ 2009-01-20 20:20   ` Alexander Graf
  2009-01-21  8:14   ` Alexander Graf
  1 sibling, 0 replies; 18+ messages in thread
From: Alexander Graf @ 2009-01-20 20:20 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm@vger.kernel.org, Marcelo Tosatti, Joerg Roedel, Sheng Yang

On 20.01.2009, at 21:07, Avi Kivity <avi@redhat.com> wrote:

> Alexander Graf wrote:
>> Hi list,
>>
>> recently I've been hitting some KVM bugs others seem to have  
>> reported as
>> well, including
>>
>> - CIFS timeouts
>> - Stuck ?? errors
>> - Random segmentation faults in the guest
>>
>> so I figured, I'll put together a stress test that can be used to
>> reproduce these issues. This is done by using a CIFS mount on the  
>> host
>> and unpacking data from that mount to the mount. I have been able to
>> bring kvm down to its knees a lot just by doing this.
>> Simply run the test in an endless-loop. FWIW enabling NPT helps
>> triggering the issue.
>>
>>
>
> Are the problems specific to AMD?

I don't know, as all machines I tried it on were AMD so far. But  
judging from user reports on the ml, it happens on Intel too.

> What does "helps triggering" mean - does it happen with NPT disabled?

It seems like the chances for breakage are higher with NPT enabled. I  
do see them without as well though.
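
When comparing runs with and without NPT, it helps to record which mode the host was actually in. The sketch below assumes the host uses the `kvm_amd` module and that it exposes its `npt` parameter under `/sys/module/kvm_amd/parameters/npt`; treat the path as an assumption to verify on the host in question.

```python
from pathlib import Path

# Assumed sysfs location of the kvm_amd "npt" module parameter.
NPT_PARAM = "/sys/module/kvm_amd/parameters/npt"


def npt_enabled(param_path=NPT_PARAM):
    """Return True/False for the NPT setting, or None if it cannot be read."""
    try:
        value = Path(param_path).read_text().strip()
    except OSError:
        # Module not loaded, not an AMD host, or parameter not exported.
        return None
    return value in ("1", "Y", "y")
```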

>
>
>> The guest kernels included here are openSUSE 11.0 (2.6.25) and 11.1
>> (2.6.27) kernels.
>>
>> Find the tests here: http://alex.csgraf.de/kvm-tests.tar.bz2
>> And some logs here (NPT enabled): http://alex.csgraf.de/kvm-logs.tar.bz2
>>
>> I'm somewhat lost on the reason for these failures, so if you do have
>> some time on your hands, please give me a hand debugging this! If I'd
>> had to guess, I'd say it's either an APIC issue and/or guest memory
>> corruption.
>>
>
> I'd guess memory corruption.
>
> Does running a uniprocessor guest help?  What about a uniprocessor  
> guest pinned to one host core?

I'll try to start tests tomorrow.

Alex

>
>
> -- 
> Do not meddle in the internals of kernels, for they are subtle and  
> quick to panic.
>


* Re: KVM guest crashes
  2009-01-20 20:07 ` Avi Kivity
  2009-01-20 20:20   ` Alexander Graf
@ 2009-01-21  8:14   ` Alexander Graf
  2009-01-21  9:05     ` Avi Kivity
  1 sibling, 1 reply; 18+ messages in thread
From: Alexander Graf @ 2009-01-21  8:14 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm@vger.kernel.org, Marcelo Tosatti, Joerg Roedel, Sheng Yang

Avi Kivity wrote:
> Alexander Graf wrote:
>> The guest kernels included here are openSUSE 11.0 (2.6.25) and 11.1
>> (2.6.27) kernels.
>>
>> Find the tests here: http://alex.csgraf.de/kvm-tests.tar.bz2
>> And some logs here (NPT enabled): http://alex.csgraf.de/kvm-logs.tar.bz2
>>
>> I'm somewhat lost on the reason for these failures, so if you do have
>> some time on your hands, please give me a hand debugging this! If I'd
>> had to guess, I'd say it's either an APIC issue and/or guest memory
>> corruption.
>>   
>
> I'd guess memory corruption.
>
> Does running a uniprocessor guest help?  What about a uniprocessor
> guest pinned to one host core?

So last night I started several guests with -smp 8 but without network
to see if IO load is causing the problems. All VMs are down, but one
panic log is rather new:

Stuck ??
Stuck ??
Stuck ??
Stuck ??
Stuck ??
Stuck ??
BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
IP: [<ffffffff80237454>] cpu_attach_domain+0x84/0x207
PGD 0
Oops: 0000 [1] SMP
last sysfs file:
CPU 1
Modules linked in:
Supported: Yes
Pid: 1, comm: swapper Tainted: G S        2.6.27.11-1-default #1
RIP: 0010:[<ffffffff80237454>]  [<ffffffff80237454>]
cpu_attach_domain+0x84/0x207
RSP: 0018:ffff88007a419c50  EFLAGS: 00010202
RAX: 0000000000000000 RBX: ffff880001077a60 RCX: ffff88007a419c40
RDX: 000000000000044d RSI: 0000000000000200 RDI: 0000000000000000
RBP: ffff88007a419c90 R08: 0000000000000000 R09: 0000000000000200
R10: 0000000000000008 R11: 0000000000018600 R12: ffff8800010778d0
R13: ffff880001077a78 R14: ffff8800010775b0 R15: ffff88000107f700
FS:  0000000000000000(0000) GS:ffff88007afeb540(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000000201000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 1, threadinfo ffff88007a418000, task ffff88007a406040)
Stack:  000000047a4616c0 ffff88007a548000 0000002f0000044d 0000000000000004
 ffffffff80a275b0 0000000000000000 ffff88007a460e00 ffff88007a45c140
 ffff88007a419ec0 ffffffff80238190 ffff88007a419dc0 ffff88007a419e00
Call Trace:
 [<ffffffff80238190>] __build_sched_domains+0xbb9/0xbf5
 [<ffffffff80981ae4>] sched_init_smp+0xa9/0x1d8
 [<ffffffff8096b850>] kernel_init+0x74/0xea
 [<ffffffff8020cf79>] child_rip+0xa/0x11


Code: 00 4c 89 ef 89 45 d4 8b 83 88 00 00 00 89 45 d0 e8 d1 05 13 00 ff
c8 74 5d 8b 93 88 00 00 00 f7 c2 8f 02 00 00 74 0d 48 8b 43 10 <48> 3b
00 0f 85 24 01 00 00 80 e2 70 0f 85 1b 01 00 00 eb 37 48
RIP  [<ffffffff80237454>] cpu_attach_domain+0x84/0x207
 RSP <ffff88007a419c50>
CR2: 0000000000000000
---[ end trace 4eaa2a86a8e2da22 ]---
Kernel panic - not syncing: Attempted to kill init!
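
With many crashed guests, pulling the faulting symbol out of each serial log makes it easier to see whether the oopses cluster. A small sketch, assuming the `IP: [<address>] symbol+0xoffset/0xsize` line format shown above; the function name `parse_oops_ip` is illustrative.

```python
import re

# Matches e.g. "IP: [<ffffffff80237454>] cpu_attach_domain+0x84/0x207"
IP_LINE = re.compile(
    r"IP: \[<([0-9a-f]+)>\] (\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)"
)


def parse_oops_ip(log_text):
    """Return details of the first 'IP:' line in an oops, or None."""
    match = IP_LINE.search(log_text)
    if match is None:
        return None
    address, symbol, offset, size = match.groups()
    address, offset, size = int(address, 16), int(offset, 16), int(size, 16)
    return {
        "symbol": symbol,
        "address": address,
        "offset": offset,
        "size": size,
        # Start of the function = faulting address minus offset.
        "function_start": address - offset,
    }


info = parse_oops_ip("IP: [<ffffffff80237454>] cpu_attach_domain+0x84/0x207")
print(info["symbol"], hex(info["function_start"]))
```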


From what I've seen it's always related to IPIs, but that's just a
guess. I'll start UP testing now.

Alex


* Re: KVM guest crashes
  2009-01-21  8:14   ` Alexander Graf
@ 2009-01-21  9:05     ` Avi Kivity
  2009-01-21  9:36       ` Avi Kivity
  0 siblings, 1 reply; 18+ messages in thread
From: Avi Kivity @ 2009-01-21  9:05 UTC (permalink / raw)
  To: Alexander Graf
  Cc: kvm@vger.kernel.org, Marcelo Tosatti, Joerg Roedel, Sheng Yang

Alexander Graf wrote:
> Avi Kivity wrote:
>   
>> Alexander Graf wrote:
>>     
>>> The guest kernels included here are openSUSE 11.0 (2.6.25) and 11.1
>>> (2.6.27) kernels.
>>>
>>> Find the tests here: http://alex.csgraf.de/kvm-tests.tar.bz2
>>> And some logs here (NPT enabled): http://alex.csgraf.de/kvm-logs.tar.bz2
>>>
>>> I'm somewhat lost on the reason for these failures, so if you do have
>>> some time on your hands, please give me a hand debugging this! If I'd
>>> had to guess, I'd say it's either an APIC issue and/or guest memory
>>> corruption.
>>>   
>>>       
>> I'd guess memory corruption.
>>
>> Does running a uniprocessor guest help?  What about a uniprocessor
>> guest pinned to one host core?
>>     
>
> So last night I started several guests with -smp 8 but without network
> to see if IO load is causing the problems. All VMs are down, but one
> panic log is rather new:
>
> Stuck ??
> Stuck ??
> Stuck ??
> Stuck ??
> Stuck ??
> Stuck ??
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
> IP: [<ffffffff80237454>] cpu_attach_domain+0x84/0x207
>   

This is right on startup, if I read things right.

I suggest checking if you have the latest BIOS update applied.  I've had 
bad experiences with un-updated processors.

-- 
error compiling committee.c: too many arguments to function



* Re: KVM guest crashes
  2009-01-21  9:05     ` Avi Kivity
@ 2009-01-21  9:36       ` Avi Kivity
  2009-01-21 10:44         ` Alexander Graf
  2009-01-22 20:29         ` Alexander Graf
  0 siblings, 2 replies; 18+ messages in thread
From: Avi Kivity @ 2009-01-21  9:36 UTC (permalink / raw)
  To: Alexander Graf
  Cc: kvm@vger.kernel.org, Marcelo Tosatti, Joerg Roedel, Sheng Yang

Avi Kivity wrote:
>
> I suggest checking if you have the latest BIOS update applied.  I've 
> had bad experiences with un-updated processors.
>

FWIW, I have an 8-way F9 guest (2.6.27.5-blah) running on a 2x4 
Barcelona host, happily make -j16ing an allmodconfig kernel.

-- 
error compiling committee.c: too many arguments to function



* Re: KVM guest crashes
  2009-01-21  9:36       ` Avi Kivity
@ 2009-01-21 10:44         ` Alexander Graf
  2009-01-22 20:29         ` Alexander Graf
  1 sibling, 0 replies; 18+ messages in thread
From: Alexander Graf @ 2009-01-21 10:44 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm@vger.kernel.org, Marcelo Tosatti, Joerg Roedel, Sheng Yang

Avi Kivity wrote:
> Avi Kivity wrote:
>>
>> I suggest checking if you have the latest BIOS update applied.  I've
>> had bad experiences with un-updated processors.
>>
>
> FWIW, I have an 8-way F9 guest (2.6.27.5-blah) running on an 2x4
> Barcelona host, happily make -j16ing an allmodconfig kernel.

Strange. I have now restarted the tests with an updated BIOS and am
installing an Intel machine to test on in parallel.

old:

# ./rdmsr /dev/cpu/0/msr $(( 0x0000008b ))
0x1000065

new:

# ./rdmsr /dev/cpu/0/msr $(( 0x0000008b ))
0x1000083
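
MSR 0x8b holds the CPU's microcode patch level, so the two values above show that the BIOS update changed it. The `./rdmsr` helper used here is not shown in the thread; the sketch below is one way such a read could be done through the Linux `msr` driver, which returns the 8-byte register value when the device file is read at an offset equal to the MSR number. It needs the `msr` module loaded and root privileges.

```python
import os
import struct

PATCH_LEVEL_MSR = 0x8B  # microcode patch level register


def read_msr(msr, path="/dev/cpu/0/msr"):
    """Read one 64-bit MSR via the msr character device."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # The file offset selects the MSR; each read returns 8 bytes.
        raw = os.pread(fd, 8, msr)
    finally:
        os.close(fd)
    if len(raw) != 8:
        raise OSError("short read from %s" % path)
    return struct.unpack("<Q", raw)[0]
```

Calling `hex(read_msr(PATCH_LEVEL_MSR))` before and after a BIOS update would give the same kind of comparison as the two `rdmsr` invocations.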


But I already got one guest crashing:

int3: 0000 [1] SMP
last sysfs file: /sys/kernel/uevent_seqnum
CPU 2
Modules linked in: nls_utf8 cifs(X) af_packet virtio_net virtio_pci
virtio_ring virtio edd ext3 mbcache jbd fan ide_pci_generic ide_core
ata_generic sata_nv libata scsi_mod dock thermal processor thermal_sys
 hwmon
Supported: Yes, External
Pid: 0, comm: swapper Tainted: G S        2.6.27.7-9-default #1
RIP: 0010:[<ffffffff80a500f1>]  [<ffffffff80a500f1>]
per_cpu__cpu_state+0x1/0x4
RSP: 0018:ffff88007a493fa8  EFLAGS: 00000083
RAX: ffffffff806f5fa0 RBX: ffffffff80a500f0 RCX: 0000000000000000
RDX: ffff880001033200 RSI: 0000000000000000 RDI: ffffffffff5fc0b0
RBP: ffff88007a48beb0 R08: 0000000000000000 R09: ffff880001039638
R10: 00000000ffffffff R11: ffffffff8021c5d9 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
FS:  00007fe3252e4950(0000) GS:ffff88007a461f40(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 000000000062d000 CR3: 000000007c10a000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 0, threadinfo ffff88007a48a000, task ffff88007a488280)
Stack:  ffff88007a48beb0 ffffffff8020ca2e ffff88007a48beb0 <EOI> 
0000007dd83ce327
 0000000000000086 ffff8800010396d0 0000000002625a00 0000000000000002
 000000010000eadc 0000007dd83ce327 0000000000000292 0000000000000292
Call Trace:
Inexact backtrace:

 <IRQ>  [<ffffffff8020ca2e>] ? ret_from_intr+0x0/0x29
 <EOI>  [<ffffffff804a6992>] ? notifier_call_chain+0x29/0x4c
 [<ffffffff80213465>] ? default_idle+0x38/0x54
 [<ffffffff8020b34a>] ? cpu_idle+0xa9/0xf1


Code: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc <cc> cc
cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
RIP  [<ffffffff80a500f1>] per_cpu__cpu_state+0x1/0x4
 RSP <ffff88007a493fa8>
---[ end trace 17313f34f216af07 ]---
Kernel panic - not syncing: Attempted to kill the idle task!
------------[ cut here ]------------
WARNING: at kernel/smp.c:331 smp_call_function_mask+0x38/0x1f2()
Modules linked in: nls_utf8 cifs(X) af_packet virtio_net virtio_pci
virtio_ring virtio edd ext3 mbcache jbd fan ide_pci_generic ide_core
ata_generic sata_nv libata scsi_mod dock thermal processor thermal_sys
 hwmon
Supported: Yes, External
Pid: 0, comm: swapper Tainted: G S    D   2.6.27.7-9-default #1

Call Trace:
 [<ffffffff8020e42e>] show_trace_log_lvl+0x41/0x58
 [<ffffffff804a1e97>] dump_stack+0x69/0x6f
 [<ffffffff80240eb2>] warn_on_slowpath+0x51/0x77
 [<ffffffff80261fef>] smp_call_function_mask+0x38/0x1f2
 [<ffffffff802621d2>] smp_call_function+0x29/0x2e
 [<ffffffff8021ba16>] native_smp_send_stop+0x1a/0x3f
 [<ffffffff804a1f59>] panic+0xbc/0x170
 [<ffffffff802449e2>] do_exit+0x6b/0x334
 [<ffffffff804a4b9b>] oops_begin+0x0/0x9e
 [<ffffffff804a524a>] do_int3+0x7d/0xa1
 [<ffffffff804a46e6>] int3+0xb6/0xf0
 [<ffffffff80a500f1>] per_cpu__cpu_state+0x1/0x4
DWARF2 unwinder stuck at per_cpu__cpu_state+0x1/0x4

Leftover inexact backtrace:

 <IRQ>  [<ffffffff8020ca2e>] ret_from_intr+0x0/0x29
 <EOI>  [<ffffffff804a6992>] notifier_call_chain+0x29/0x4c
 [<ffffffff80213465>] default_idle+0x38/0x54
 [<ffffffff8020b34a>] cpu_idle+0xa9/0xf1

---[ end trace 17313f34f216af07 ]---
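
In the `Code:` dump above, the byte in angle brackets is the one at the faulting RIP, and 0xcc is the one-byte `int3` opcode, which matches the `int3:` oops header. A small sketch for pulling that out of a log automatically; it expects only the `Code:` line and its continuation lines as input, and the function names are illustrative.

```python
def decode_code_dump(code_text):
    """Parse an oops 'Code:' dump.

    Returns (all_bytes, index_of_marked_byte); the index is None if no
    byte is wrapped in <..>.
    """
    body = code_text.split("Code:", 1)[1]
    all_bytes = []
    marked = None
    for i, token in enumerate(body.split()):
        if token.startswith("<") and token.endswith(">"):
            marked = i
            token = token[1:-1]
        all_bytes.append(int(token, 16))
    return all_bytes, marked


def is_int3_fill(code_text):
    """True if every dumped byte is 0xcc (int3)."""
    all_bytes, _ = decode_code_dump(code_text)
    return bool(all_bytes) and all(b == 0xCC for b in all_bytes)
```

A dump consisting solely of 0xcc suggests the CPU was executing filler or data rather than real code, which fits the RIP pointing into `per_cpu__cpu_state`, a data symbol.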


The UP guests seemed to work fine - will start them again now.

Alex


* Re: KVM guest crashes
  2009-01-21  9:36       ` Avi Kivity
  2009-01-21 10:44         ` Alexander Graf
@ 2009-01-22 20:29         ` Alexander Graf
  2009-01-22 20:36           ` Alexander Graf
  2009-01-23 22:36           ` Marcelo Tosatti
  1 sibling, 2 replies; 18+ messages in thread
From: Alexander Graf @ 2009-01-22 20:29 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm@vger.kernel.org, Marcelo Tosatti, Joerg Roedel, Sheng Yang

Avi Kivity wrote:
> Avi Kivity wrote:
>>
>> I suggest checking if you have the latest BIOS update applied.  I've
>> had bad experiences with un-updated processors.
>>
>
> FWIW, I have an 8-way F9 guest (2.6.27.5-blah) running on an 2x4
> Barcelona host, happily make -j16ing an allmodconfig kernel.
>

Following the discussion on IRC, I tried -no-kvm-irqchip and found some
virtual machines broken after >1 day of stress testing again:

+ sudo -u contain2 env -i qemu-kvm -localtime -kernel virtio-kernel
-initrd virtio-initrd -nographic -append 'quiet clocksource=acpi_pm
cifsuser=contain2 cifspass=contain2
root=cifs://contain2:contain2@172.16.2.1/contain2
realroot=//172.16.2.1/users/contain2
ip=172.16.2.2:172.16.2.1::255.255.255.0::eth0:none console=ttyS0
dhcp=off builder=1' -net nic,model=virtio,macaddr=52:54:00:12:34:2 -net
tap,ifname=tap2,script=/bin/true -m 2000 -nographic -smp 4 -no-kvm-irqchip /dev/null
qemu: loading initrd (0x1daf359 bytes) at 0x000000007b240000
Stuck ??
Stuck ??
BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
IP: [<ffffffff802b539a>] kfree+0x18b/0x26e
PGD 0
Oops: 0000 [1] SMP
last sysfs file:
CPU 2
Modules linked in:
Supported: Yes
Pid: 0, comm: swapper Tainted: G S        2.6.27.7-9-default #1
RIP: 0010:[<ffffffff802b539a>]  [<ffffffff802b539a>] kfree+0x18b/0x26e
RSP: 0018:ffff88007a493e90  EFLAGS: 00010046
RAX: 0000000000000002 RBX: ffff8800010397f0 RCX: ffff88007a480778
RDX: ffffe20000000000 RSI: ffff8800010397f0 RDI: ffff88007a5ae140
RBP: 0000000000000000 R08: ffff8800010395d0 R09: ffff88007a493eb8
R10: ffffffff80a59980 R11: ffffffff8021c5d9 R12: 0000000000000001
R13: ffff88007ac04080 R14: 0000000010200042 R15: ffff88007a5ae140
FS:  0000000000000000(0000) GS:ffff88007a461f40(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000000201000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 0, threadinfo ffff88007a48a000, task ffff88007a488280)
Stack:  ffffffff8023df9c ffffffff8073a108 0000000000000286 ffffffff8024a1eb
 ffffffff80259d80 ffff8800010397f0 0000000000000000 0000000000000001
 000000000000000a 0000000010200042 0000000000000010 ffffffff802831d0
Call Trace:
 [<ffffffff802831d0>] __rcu_process_callbacks+0x189/0x203
 [<ffffffff80283271>] rcu_process_callbacks+0x27/0x47
 [<ffffffff802464ed>] __do_softirq+0x84/0x115
 [<ffffffff8020dc9c>] call_softirq+0x1c/0x28
 [<ffffffff8020f067>] do_softirq+0x3c/0x81
 [<ffffffff80246204>] irq_exit+0x3f/0x83
 [<ffffffff8021ce5f>] smp_apic_timer_interrupt+0x95/0xae
 [<ffffffff8020d4a3>] apic_timer_interrupt+0x83/0x90
 [<ffffffff80221f1d>] native_safe_halt+0x2/0x3
 [<ffffffff80213465>] default_idle+0x38/0x54
 [<ffffffff8020b34a>] cpu_idle+0xa9/0xf1


Code: 01 00 00 00 e8 4c fa ff ff 48 83 3d a0 19 44 00 00 49 8b 44 dd 08
48 8d 78 40 75 04 0f 0b eb fe e8 e5 cc f6 ff 90 e9 c7 00 00 00 <8b> 55
00 3b 55 04 73 0f 89 d0 4c 89 7c c5 18 8d 42 01 e9 ad 00
RIP  [<ffffffff802b539a>] kfree+0x18b/0x26e
 RSP <ffff88007a493e90>
CR2: 0000000000000000
---[ end trace 4eaa2a86a8e2da22 ]---


After two days of continuous stress testing I also got the Intel
machine w/ current git down:

+ sudo -u contain1 env -i /usr/local/bin/qemu-system-x86_64 -localtime
-kernel virtio-kernel -initrd virtio-initrd -nographic -append 'quiet
clocksource=acpi_pm cifsuser=contain1 cifspass=contain1
root=cifs://contain1:contain1@172.16.1.1/contain1
realroot=//172.16.1.1/users/contain1
ip=172.16.1.2:172.16.1.1::255.255.255.0::eth0:none console=ttyS0
dhcp=off builder=1' -net nic,model=virtio,macaddr=52:54:00:12:34:1 -net
tap,ifname=tap1,script=/bin/true -m 2000 -nographic -smp 8 /dev/null
qemu: loading initrd (0x1daf359 bytes) at 0x000000007b240000
Stuck ??

No backtrace here though. That's all I got from the serial console.

The only issue I have had with the UP guests so far is this:

+ taskset -c 6 sudo -u contain6 env -i qemu-kvm -localtime -kernel
virtio-kernel -initrd virtio-initrd -nographic -append 'quiet
clocksource=acpi_pm cifsuser=contain6 cifspass=contain6
root=cifs://contain6:contain6@172.16.6.1/contain6
realroot=//172.16.6.1/users/contain6
ip=172.16.6.2:172.16.6.1::255.255.255.0::eth0:none console=ttyS0
dhcp=off builder=1' -net nic,model=virtio,macaddr=52:54:00:12:34:6 -net
tap,ifname=tap6,script=/bin/true -m 2000 -nographic /dev/null
qemu: loading initrd (0x1daf359 bytes) at 0x000000007b240000
..MP-BIOS bug: 8254 timer not connected to IO-APIC
Kernel panic - not syncing: IO-APIC + timer doesn't work!  Boot with
apic=debug and send a report.  Then try booting with the 'noapic' option.

which can be annoying at times too. Can't we just detect that the guest
is running its timer detection and give it its interrupts? Or should the
PIT reinjection thing help here?


Alex


* Re: KVM guest crashes
  2009-01-22 20:29         ` Alexander Graf
@ 2009-01-22 20:36           ` Alexander Graf
  2009-01-22 20:55             ` Alexander Graf
  2009-01-23 22:36           ` Marcelo Tosatti
  1 sibling, 1 reply; 18+ messages in thread
From: Alexander Graf @ 2009-01-22 20:36 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm@vger.kernel.org, Marcelo Tosatti, Joerg Roedel, Sheng Yang

Alexander Graf wrote:

[...]
> Also after two days of permanent stress testing I also got the Intel
> machine w/ current git down:
>
> + sudo -u contain1 env -i /usr/local/bin/qemu-system-x86_64 -localtime
> -kernel virtio-kernel -initrd virtio-initrd -nographic -append 'quiet
> clocksource=acpi_pm cifsuser=contain1 cifspass=contain1
> root=cifs://contain1:contain1@172.16.1.1/contain1
> realroot=//172.16.1.1/users/contain1
> ip=172.16.1.2:172.16.1.1::255.255.255.0::eth0:none console=ttyS0
> dhcp=off builder=1' -net nic,model=virtio,macaddr=52:54:00:12:34:1 -net
> tap,ifname=tap1,script=/bin/true -m 2000 -nographic -smp 8 /dev/null
> qemu: loading initrd (0x1daf359 bytes) at 0x000000007b240000
> Stuck ??
>
> No backtrace here though. That's all I got from the serial console.
>   

+ sudo -u contain1 env -i /usr/local/bin/qemu-system-x86_64 -localtime
-kernel virtio-kernel -initrd virtio-initrd -nographic -append 'quiet
clocksource=acpi_pm cifsuser=contain1 cifspass=contain1
root=cifs://contain1:contain1@172.16.1.1/contain1
realroot=//172.16.1.1/users/contain1
ip=172.16.1.2:172.16.1.1::255.255.255.0::eth0:none console=ttyS0
dhcp=off builder=1' -net nic,model=virtio,macaddr=52:54:00:12:34:1 -net
tap,ifname=tap1,script=/bin/true -m 2000 -nographic -smp 8 /dev/null
qemu: loading initrd (0x1daf359 bytes) at 0x000000007b240000
Stuck ??

(qemu) info cpus
* CPU #0: pc=0xffffffff80221f1d thread_id=15211
  CPU #1: pc=0xffffffff80221f1d thread_id=15212
  CPU #2: pc=0xffffffff80221f1d thread_id=15213
  CPU #3: pc=0xffffffff80221f1d thread_id=15214
  CPU #4: pc=0xffffffff8049f7d0 thread_id=15215
  CPU #5: pc=0xffffffff80221f1d thread_id=15216
  CPU #6: pc=0xffffffff80221f1d thread_id=15217
  CPU #7: pc=0x000000000009f02c thread_id=15218

(qemu) cpu 7
(qemu) info registers
EAX=00000c06 EBX=000005b8 ECX=00000000 EDX=00000000
ESI=00000000 EDI=00000000 EBP=00000000 ESP=00000000
EIP=0000002c EFL=00033002 [-------] CPL=3 II=0 A20=1 SMM=0 HLT=0
ES =0000 00000000 0000ffff 0000f300
CS =9f00 0009f000 0000ffff 0000f300
SS =0000 00000000 0000ffff 0000f300
DS =0000 00000000 0000ffff 0000f300
FS =0000 00000000 0000ffff 0000f300
GS =0000 00000000 0000ffff 0000f300
LDT=0000 00000000 0000ffff 00008200
TR =0000 fffbd000 00002088 00008b00
GDT=     00000000 0000ffff
IDT=     00000000 0000ffff
CR0=60000010 CR2=00000000 CR3=00000000 CR4=00000000
DR0=00000000 DR1=00000000 DR2=00000000 DR3=00000000
DR6=ffff0ff0 DR7=00000400
FCW=037f FSW=0000 [ST=0] FTW=00 MXCSR=00000000
FPR0=0000000000000000 0000 FPR1=0000000000000000 0000
FPR2=0000000000000000 0000 FPR3=0000000000000000 0000
FPR4=0000000000000000 0000 FPR5=0000000000000000 0000
FPR6=0000000000000000 0000 FPR7=0000000000000000 0000
XMM00=00000000000000000000000000000000
XMM01=00000000000000000000000000000000
XMM02=00000000000000000000000000000000
XMM03=00000000000000000000000000000000
XMM04=00000000000000000000000000000000
XMM05=00000000000000000000000000000000
XMM06=00000000000000000000000000000000
XMM07=00000000000000000000000000000000

Is that guest really seriously in BIOS code? After booting Linux?
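
The register dump can be checked by hand: in real mode (and virtual-8086 mode) the linear address is the segment selector shifted left by four plus the offset, and bit 17 of EFLAGS is the VM flag. The sketch below redoes that arithmetic for CPU 7; reading the set VM flag as "real-mode code being run through virtual-8086 mode" is an interpretation, not something the dump states.

```python
def real_mode_linear(segment, offset):
    """Linear address of segment:offset in real or virtual-8086 mode."""
    return (segment << 4) + offset


def decode_eflags(eflags):
    """Extract the VM flag (bit 17) and IOPL (bits 12-13)."""
    return {
        "vm86": bool(eflags & (1 << 17)),
        "iopl": (eflags >> 12) & 0x3,
    }


# Values from the dump above: CS=9f00, EIP=0000002c, EFL=00033002
print(hex(real_mode_linear(0x9F00, 0x2C)))
print(decode_eflags(0x00033002))
```

The computed address 0x9f02c matches the `pc` that `info cpus` reports for CPU #7, so that vCPU is executing 16-bit code just below 640 KiB rather than kernel code.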

(qemu) x /2i $pc-1
0x000000000009f02b:  hlt   
0x000000000009f02c:  jmp    0x9f02b

Where is this? Looks like panic code to me.
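
To spot this condition without reading the monitor output by eye, the `info cpus` text can be scanned for vCPUs whose program counter lies below 1 MiB, where a booted 64-bit Linux guest would not normally be executing. A sketch, assuming the `CPU #n: pc=0x...` line format shown above; the function name is illustrative.

```python
import re

CPU_LINE = re.compile(r"CPU #(\d+): pc=0x([0-9a-f]+)")

ONE_MIB = 0x100000


def cpus_in_low_memory(info_cpus_text):
    """Return the vCPU numbers whose pc is below 1 MiB."""
    return [
        int(cpu)
        for cpu, pc in CPU_LINE.findall(info_cpus_text)
        if int(pc, 16) < ONE_MIB
    ]


sample = """\
* CPU #0: pc=0xffffffff80221f1d thread_id=15211
  CPU #4: pc=0xffffffff8049f7d0 thread_id=15215
  CPU #7: pc=0x000000000009f02c thread_id=15218
"""
print(cpus_in_low_memory(sample))
```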

Alex


* Re: KVM guest crashes
  2009-01-22 20:36           ` Alexander Graf
@ 2009-01-22 20:55             ` Alexander Graf
  2009-01-23 16:36               ` Alexander Graf
  0 siblings, 1 reply; 18+ messages in thread
From: Alexander Graf @ 2009-01-22 20:55 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm@vger.kernel.org, Marcelo Tosatti, Joerg Roedel, Sheng Yang

Alexander Graf wrote:
> Alexander Graf wrote:
>
> [...]
>   
>> Also after two days of permanent stress testing I also got the Intel
>> machine w/ current git down:
>>
>> + sudo -u contain1 env -i /usr/local/bin/qemu-system-x86_64 -localtime
>> -kernel virtio-kernel -initrd virtio-initrd -nographic -append 'quiet
>> clocksource=acpi_pm cifsuser=contain1 cifspass=contain1
>> root=cifs://contain1:contain1@172.16.1.1/contain1
>> realroot=//172.16.1.1/users/contain1
>> ip=172.16.1.2:172.16.1.1::255.255.255.0::eth0:none console=ttyS0
>> dhcp=off builder=1' -net nic,model=virtio,macaddr=52:54:00:12:34:1 -net
>> tap,ifname=tap1,script=/bin/true -m 2000 -nographic -smp 8 /dev/null
>> qemu: loading initrd (0x1daf359 bytes) at 0x000000007b240000
>> Stuck ??
>>
>> No backtrace here though. That's all I got from the serial console.
>>   
>>     
>
> + sudo -u contain1 env -i /usr/local/bin/qemu-system-x86_64 -localtime
> -kernel virtio-kernel -initrd virtio-initrd -nographic -append 'quiet
> clocksource=acpi_pm cifsuser=contain1 cifspass=contain1
> root=cifs://contain1:contain1@172.16.1.1/contain1
> realroot=//172.16.1.1/users/contain1
> ip=172.16.1.2:172.16.1.1::255.255.255.0::eth0:none console=ttyS0
> dhcp=off builder=1' -net nic,model=virtio,macaddr=52:54:00:12:34:1 -net
> tap,ifname=tap1,script=/bin/true -m 2000 -nographic -smp 8 /dev/null
> qemu: loading initrd (0x1daf359 bytes) at 0x000000007b240000
> Stuck ??
>
> (qemu) info cpus
> * CPU #0: pc=0xffffffff80221f1d thread_id=15211
>   CPU #1: pc=0xffffffff80221f1d thread_id=15212
>   CPU #2: pc=0xffffffff80221f1d thread_id=15213
>   CPU #3: pc=0xffffffff80221f1d thread_id=15214
>   CPU #4: pc=0xffffffff8049f7d0 thread_id=15215
>   CPU #5: pc=0xffffffff80221f1d thread_id=15216
>   CPU #6: pc=0xffffffff80221f1d thread_id=15217
>   CPU #7: pc=0x000000000009f02c thread_id=15218
>
> (qemu) cpu 7
> (qemu) info registers
> EAX=00000c06 EBX=000005b8 ECX=00000000 EDX=00000000
> ESI=00000000 EDI=00000000 EBP=00000000 ESP=00000000
> EIP=0000002c EFL=00033002 [-------] CPL=3 II=0 A20=1 SMM=0 HLT=0
> ES =0000 00000000 0000ffff 0000f300
> CS =9f00 0009f000 0000ffff 0000f300
> SS =0000 00000000 0000ffff 0000f300
> DS =0000 00000000 0000ffff 0000f300
> FS =0000 00000000 0000ffff 0000f300
> GS =0000 00000000 0000ffff 0000f300
> LDT=0000 00000000 0000ffff 00008200
> TR =0000 fffbd000 00002088 00008b00
> GDT=     00000000 0000ffff
> IDT=     00000000 0000ffff
> CR0=60000010 CR2=00000000 CR3=00000000 CR4=00000000
> DR0=00000000 DR1=00000000 DR2=00000000 DR3=00000000
> DR6=ffff0ff0 DR7=00000400
> FCW=037f FSW=0000 [ST=0] FTW=00 MXCSR=00000000
> FPR0=0000000000000000 0000 FPR1=0000000000000000 0000
> FPR2=0000000000000000 0000 FPR3=0000000000000000 0000
> FPR4=0000000000000000 0000 FPR5=0000000000000000 0000
> FPR6=0000000000000000 0000 FPR7=0000000000000000 0000
> XMM00=00000000000000000000000000000000
> XMM01=00000000000000000000000000000000
> XMM02=00000000000000000000000000000000
> XMM03=00000000000000000000000000000000
> XMM04=00000000000000000000000000000000
> XMM05=00000000000000000000000000000000
> XMM06=00000000000000000000000000000000
> XMM07=00000000000000000000000000000000
>
> Is that guest really seriously in BIOS code? After booting Linux?
>
> (qemu) x /2i $pc-1
> 0x000000000009f02b:  hlt   
> 0x000000000009f02c:  jmp    0x9f02b
>
> Where is this? Looks like panic code to me.
>   
0x000000000009f000:  cli   
0x000000000009f001:  xor    %ax,%ax
0x000000000009f003:  mov    %ax,%ds
0x000000000009f005:  mov    $0x510,%ebx
0x000000000009f00b:  addr32 mov (%ebx),%ecx
0x000000000009f00f:  test   %ecx,%ecx
0x000000000009f012:  je     0x9f026
0x000000000009f014:  addr32 mov 0x4(%ebx),%eax
0x000000000009f019:  addr32 mov 0x8(%ebx),%edx
0x000000000009f01e:  wrmsr 
0x000000000009f020:  add    $0xc,%ebx
0x000000000009f024:  jmp    0x9f00b
0x000000000009f026:  lock incw 1856
0x000000000009f02b:  hlt   
0x000000000009f02c:  jmp    0x9f02b

Looks a lot like this:

smp_ap_boot_code_start:
  cli
  xor %ax, %ax
  mov %ax, %ds

  mov $SMP_MSR_ADDR, %ebx
11:
  mov 0(%ebx), %ecx
  test %ecx, %ecx
  jz 12f
  mov 4(%ebx), %eax
  mov 8(%ebx), %edx
  wrmsr
  add $12, %ebx
  jmp 11b
12:

  lock incw smp_cpus
1:
  hlt
  jmp 1b
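
For readers less used to AT&T assembly, here is a Python model of what the loop in that listing does, written only from the instructions shown: starting at the table address (0x510 in the disassembly), it reads 12-byte entries of (MSR index, low 32 bits, high 32 bits), issues a `wrmsr` for each, and stops at the first entry whose index is zero. The function name and the list it returns are illustrative; a real `wrmsr` has no such return value.

```python
import struct

SMP_MSR_ADDR = 0x510  # table address seen in the disassembly


def walk_msr_table(memory, start=SMP_MSR_ADDR):
    """Return the (msr_index, 64-bit value) writes the loop would perform."""
    writes = []
    addr = start
    while True:
        # mov (%ebx),%ecx ; mov 4(%ebx),%eax ; mov 8(%ebx),%edx
        index, low, high = struct.unpack_from("<III", memory, addr)
        if index == 0:  # test %ecx,%ecx ; jz 12f
            break
        writes.append((index, (high << 32) | low))  # wrmsr: EDX:EAX -> MSR[ECX]
        addr += 12  # add $12,%ebx
    return writes
```

After the table is exhausted the code increments the CPU counter and parks in the `hlt` / `jmp` loop, which is exactly where CPU #7 was found.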


But that code shouldn't run after Linux booted, right? And without at
least a "Power Off" message I'd expect Linux to still be up.
The only thing the host's dmesg was saying is this:

Ignoring delivery mode 3 (repeated often)
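
Assuming that message refers to the delivery-mode field of an APIC interrupt command (bits 8 to 10 of the low ICR word), mode 3 is the value Linux names `APIC_DM_REMRD` (remote read), while INIT is 5 and Startup is 6. The sketch below decodes that field; the mode names follow the Linux `apicdef.h` constants, and the link to the host message is an assumption.

```python
# Delivery-mode encodings (value of ICR bits 10:8).
DELIVERY_MODES = {
    0: "fixed",
    1: "lowest priority",
    2: "SMI",
    3: "remote read",
    4: "NMI",
    5: "INIT",
    6: "startup (SIPI)",
    7: "ExtINT",
}


def delivery_mode(icr_low):
    """Extract the delivery-mode field from the low ICR word."""
    return (icr_low >> 8) & 0x7


def describe_delivery_mode(icr_low):
    return DELIVERY_MODES[delivery_mode(icr_low)]


print(describe_delivery_mode(0x000C4500))  # an INIT IPI
print(describe_delivery_mode(0x00000300))  # the mode from the host message
```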

Alex


* Re: KVM guest crashes
  2009-01-22 20:55             ` Alexander Graf
@ 2009-01-23 16:36               ` Alexander Graf
  0 siblings, 0 replies; 18+ messages in thread
From: Alexander Graf @ 2009-01-23 16:36 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm@vger.kernel.org, Marcelo Tosatti, Joerg Roedel, Sheng Yang

Alexander Graf wrote:
> Alexander Graf wrote:
>   
>> Alexander Graf wrote:
>>
>> [...]
>>   
>>     
>>> Also after two days of permanent stress testing I also got the Intel
>>> machine w/ current git down:
>>>
>>> + sudo -u contain1 env -i /usr/local/bin/qemu-system-x86_64 -localtime
>>> -kernel virtio-kernel -initrd virtio-initrd -nographic -append 'quiet
>>> clocksource=acpi_pm cifsuser=contain1 cifspass=contain1
>>> root=cifs://contain1:contain1@172.16.1.1/contain1
>>> realroot=//172.16.1.1/users/contain1
>>> ip=172.16.1.2:172.16.1.1::255.255.255.0::eth0:none console=ttyS0
>>> dhcp=off builder=1' -net nic,model=virtio,macaddr=52:54:00:12:34:1 -net
>>> tap,ifname=tap1,script=/bin/true -m 2000 -nographic -smp 8 /dev/null
>>> qemu: loading initrd (0x1daf359 bytes) at 0x000000007b240000
>>> Stuck ??
>>>
>>> No backtrace here though. That's all I got from the serial console.
>>>   
>>>     
>>>       
>> + sudo -u contain1 env -i /usr/local/bin/qemu-system-x86_64 -localtime
>> -kernel virtio-kernel -initrd virtio-initrd -nographic -append 'quiet
>> clocksource=acpi_pm cifsuser=contain1 cifspass=contain1
>> root=cifs://contain1:contain1@172.16.1.1/contain1
>> realroot=//172.16.1.1/users/contain1
>> ip=172.16.1.2:172.16.1.1::255.255.255.0::eth0:none console=ttyS0
>> dhcp=off builder=1' -net nic,model=virtio,macaddr=52:54:00:12:34:1 -net
>> tap,ifname=tap1,script=/bin/true -m 2000 -nographic -smp 8 /dev/null
>> qemu: loading initrd (0x1daf359 bytes) at 0x000000007b240000
>> Stuck ??
>>     
[...]

To give you more dumps that might point in some direction (I'm still
lost as to where to look), here's another AMD NPT guest crash with
current git. It looks as if the guest page table is corrupted.

+ sudo -u contain3 env -i /usr/local/bin/qemu-system-x86_64 -localtime
-kernel virtio-kernel -initrd virtio-initrd -nographic -append 'quiet
clocksource=acpi_pm cifsuser=contain3 cifspass=contain3
root=cifs://contain3:contain3@172.16.3.1/contain3
realroot=//172.16.3.1/users/contain3
ip=172.16.3.2:172.16.3.1::255.255.255.0::eth0:none console=ttyS0
dhcp=off builder=1' -net nic,model=virtio,macaddr=52:54:00:12:34:3
-net tap,ifname=tap3,script=/bin/true -m 2000 -nographic -smp 8
-no-kvm-irqchip /dev/null
qemu: loading initrd (0x1daf359 bytes) at 0x000000007b240000
pci 0000:00:01.0: PIIX3: Enabling Passive Release
IP-Config: Device `eth0' not found.
doing fast boot
Creating device nodes with udev
Boot logging started on /dev/ttyS0(/dev/console) at Thu Jan 22 23:05:55 2009
[NETWORK] using static config based on ip=172.16.3.2:172.16.3.1::255.255.255.0::eth0:none
Trying manual resume from /dev/disk/by-id/ata-ST380815AS_5RW3M74V-part1
resume device /dev/disk/by-id/ata-ST380815AS_5RW3M74V-part1 not found (ignoring)
Trying manual resume from /dev/disk/by-id/ata-ST380815AS_5RW3M74V-part1
resume device /dev/disk/by-id/ata-ST380815AS_5RW3M74V-part1 not found (ignoring)
node name not found
Mounting root //172.16.3.1/contain3
RTNETLINK answers: File exists
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN qlen 1000
    link/ether 52:54:00:12:34:03 brd ff:ff:ff:ff:ff:ff
    inet 172.16.3.2 peer 172.16.3.1/24 scope global eth0
BUG: unable to handle kernel paging request at 0000000000100100
IP: [<ffffffff8036a603>] strnlen+0x10/0x19
PGD 7c596067 PUD 7c9ed067 PMD 0
Oops: 0000 [1] SMP
last sysfs file: /sys/kernel/uevent_seqnum
CPU 7
Modules linked in: nls_utf8 cifs(X) af_packet virtio_net virtio_pci
virtio_ring virtio edd ext3 mbcache jbd fan ide_pci_generic ide_core
ata_generic sata_nv libata scsi_mod
dock thermal processor thermal_sys hwmon
Supported: Yes, External
Pid: 782, comm: halt Tainted: G S        2.6.27.7-9-default #1
RIP: 0010:[<ffffffff8036a603>]  [<ffffffff8036a603>] strnlen+0x10/0x19
RSP: 0018:ffff88007c46da70  EFLAGS: 00010082
RAX: 0000000000100100 RBX: 0000000000000000 RCX: 00000000ffffffff
RDX: 0000000000100100 RSI: fffffffffffffffe RDI: 0000000000100100
RBP: ffffffff80ae0fad R08: 00000000ffffffff R09: 0000000000000000
R10: 000000000000000a R11: 0000000000000000 R12: 0000000000100100
R13: 00000000ffffffff R14: ffffffff80ae13a0 R15: 00000000ffffffff
FS:  00007f0b2aee06f0(0000) GS:ffff88007a57bf40(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000100100 CR3: 000000007c4e5000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process halt (pid: 782, threadinfo ffff88007c46c000, task ffff88007c17e0c0)
Stack:  ffffffff8036b39d ffff88007c46ddb8 ffffffff80ae0fad ffffffff805d7e29
 0000000000000000 00000000ffffffff ffffffff8036b6f6 00007f0b2ace27e0
 ffff88007c595ab0 ffff88007c0624a8 0000000000000400 ffffffff80ae0fa0
Call Trace:
 [<ffffffff8036b39d>] string+0x34/0x91
 [<ffffffff8036b6f6>] vsnprintf+0x2fc/0x574
 [<ffffffff8036ba56>] vscnprintf+0x9/0x17
 [<ffffffff80241a12>] vprintk+0x12b/0x2df
 [<ffffffff80240e2f>] warn_slowpath+0x9f/0xd1
 [<ffffffff80366da2>] kobject_put+0x2f/0x42
 [<ffffffff8024fe90>] kernel_power_off+0xe/0x3b
 [<ffffffff80250108>] sys_reboot+0xf8/0x179
 [<ffffffff8020c37a>] system_call_fastpath+0x16/0x1b
 [<00007f0b2aa3aa26>] 0x7f0b2aa3aa26


Code: d5 70 80 20 75 eb 48 89 f8 c3 48 89 f8 eb 03 48 ff c0 80 38 00 75
f8 48 29 f8 c3 48 89 f8 eb 03 48 ff c0 48 85 f6 74 08 48 ff ce <80> 38
00 75 f0 48 29 f8 c3 31 c0 eb 12 41 38 c8 74 0a 48 ff c2
RIP  [<ffffffff8036a603>] strnlen+0x10/0x19
 RSP <ffff88007c46da70>
CR2: 0000000000100100
---[ end trace 1c45144e9c9b5946 ]---
boot/84-builder.sh: line 30:   782 Killed                  halt -fp
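
As an aid for reading the "PGD ... PUD ... PMD ..." line of the oops
above, here is a minimal sketch, assuming standard x86-64 4-level paging
with 4 KiB pages. It splits a faulting address into its table indices and
reads the low flag bits of an entry. The function names are made up for
illustration; they are not kernel or QEMU interfaces.

```python
# Sketch only: assumes x86-64 4-level paging with 4 KiB pages.
# split_vaddr() and decode_entry() are hypothetical helper names.

def split_vaddr(vaddr):
    """Split a virtual address into page-table indices and page offset."""
    return {
        "pgd": (vaddr >> 39) & 0x1ff,   # bits 47..39
        "pud": (vaddr >> 30) & 0x1ff,   # bits 38..30
        "pmd": (vaddr >> 21) & 0x1ff,   # bits 29..21
        "pte": (vaddr >> 12) & 0x1ff,   # bits 20..12
        "offset": vaddr & 0xfff,        # bits 11..0
    }


def decode_entry(entry):
    """Decode the low flag bits and the frame address of a table entry."""
    return {
        "present": bool(entry & 0x1),
        "writable": bool(entry & 0x2),
        "user": bool(entry & 0x4),
        "frame": entry & 0x000ffffffffff000,
    }


if __name__ == "__main__":
    # Values copied from the oops: CR2 and the PGD/PUD/PMD entries.
    print(split_vaddr(0x100100))
    for name, value in (("PGD", 0x7c596067), ("PUD", 0x7c9ed067), ("PMD", 0x0)):
        print(name, decode_entry(value))
```

For CR2 = 0x100100 the PGD, PUD and PMD indices are all 0, and the
"PMD 0" entry in the dump has its present bit clear, so the walk stops at
that level.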


Alex


* Re: KVM guest crashes
  2009-01-22 20:29         ` Alexander Graf
  2009-01-22 20:36           ` Alexander Graf
@ 2009-01-23 22:36           ` Marcelo Tosatti
  2009-01-24  7:42             ` Alexander Graf
  2009-01-26 15:53             ` Alexander Graf
  1 sibling, 2 replies; 18+ messages in thread
From: Marcelo Tosatti @ 2009-01-23 22:36 UTC (permalink / raw)
  To: Alexander Graf; +Cc: Avi Kivity, kvm@vger.kernel.org, Joerg Roedel, Sheng Yang

Hi Alexander,

On Thu, Jan 22, 2009 at 09:29:46PM +0100, Alexander Graf wrote:

> Following the discussion on IRC, I tried -no-kvm-irqchip and found some
> virtual machines broken after >1 day of stress testing again:
> 
> + sudo -u contain2 env -i qemu-kvm -localtime -kernel virtio-kernel
> -initrd virtio-initrd -nographic -append 'quiet clocksource=acpi_pm
> cifsuser=contain2 cifspass=contain2
> root=cifs://contain2:contain2@172.16.2.1/contain2
> realroot=//172.16.2.1/users/contain2
> ip=172.16.2.2:172.16.2.1::255.255.255.0::eth0:none console=ttyS0
> dhcp=off builder=1' -net nic,model=virtio,macaddr=52:54:00:12:34:2 -net
> tap,ifname=tap2,script=/bin/true -m 2000 -nographic -smp 4
> -no-kvm-irqchip /dev/null
> qemu: loading initrd (0x1daf359 bytes) at 0x000000007b240000
> Stuck ??
> Stuck ??
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
> IP: [<ffffffff802b539a>] kfree+0x18b/0x26e
> PGD 0
> Oops: 0000 [1] SMP
> last sysfs file:
> CPU 2
> Modules linked in:
> Supported: Yes
> Pid: 0, comm: swapper Tainted: G S        2.6.27.7-9-default #1
> RIP: 0010:[<ffffffff802b539a>]  [<ffffffff802b539a>] kfree+0x18b/0x26e
> RSP: 0018:ffff88007a493e90  EFLAGS: 00010046
> RAX: 0000000000000002 RBX: ffff8800010397f0 RCX: ffff88007a480778
> RDX: ffffe20000000000 RSI: ffff8800010397f0 RDI: ffff88007a5ae140
> RBP: 0000000000000000 R08: ffff8800010395d0 R09: ffff88007a493eb8
> R10: ffffffff80a59980 R11: ffffffff8021c5d9 R12: 0000000000000001
> R13: ffff88007ac04080 R14: 0000000010200042 R15: ffff88007a5ae140
> FS:  0000000000000000(0000) GS:ffff88007a461f40(0000) knlGS:0000000000000000
> CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> CR2: 0000000000000000 CR3: 0000000000201000 CR4: 00000000000006e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process swapper (pid: 0, threadinfo ffff88007a48a000, task ffff88007a488280)
> Stack:  ffffffff8023df9c ffffffff8073a108 0000000000000286 ffffffff8024a1eb
>  ffffffff80259d80 ffff8800010397f0 0000000000000000 0000000000000001
>  000000000000000a 0000000010200042 0000000000000010 ffffffff802831d0
> Call Trace:
>  [<ffffffff802831d0>] __rcu_process_callbacks+0x189/0x203
>  [<ffffffff80283271>] rcu_process_callbacks+0x27/0x47
>  [<ffffffff802464ed>] __do_softirq+0x84/0x115
>  [<ffffffff8020dc9c>] call_softirq+0x1c/0x28
>  [<ffffffff8020f067>] do_softirq+0x3c/0x81
>  [<ffffffff80246204>] irq_exit+0x3f/0x83
>  [<ffffffff8021ce5f>] smp_apic_timer_interrupt+0x95/0xae
>  [<ffffffff8020d4a3>] apic_timer_interrupt+0x83/0x90
>  [<ffffffff80221f1d>] native_safe_halt+0x2/0x3
>  [<ffffffff80213465>] default_idle+0x38/0x54
>  [<ffffffff8020b34a>] cpu_idle+0xa9/0xf1
> 
> 
> Code: 01 00 00 00 e8 4c fa ff ff 48 83 3d a0 19 44 00 00 49 8b 44 dd 08
> 48 8d 78 40 75 04 0f 0b eb fe e8 e5 cc f6 ff 90 e9 c7 00 00 00 <8b> 55
> 00 3b 55 04 73 0f 89 d0 4c 89 7c c5 18 8d 42 01 e9 ad 00
> RIP  [<ffffffff802b539a>] kfree+0x18b/0x26e
>  RSP <ffff88007a493e90>
> CR2: 0000000000000000
> ---[ end trace 4eaa2a86a8e2da22 ]---
> 
> 
> Also after two days of permanent stress testing I also got the Intel
> machine w/ current git down:
> 
> + sudo -u contain1 env -i /usr/local/bin/qemu-system-x86_64 -localtime
> -kernel virtio-kernel -initrd virtio-initrd -nographic -append 'quiet
> clocksource=acpi_pm cifsuser=contain1 cifspass=contain1
> root=cifs://contain1:contain1@172.16.1.1/contain1
> realroot=//172.16.1.1/users/contain1
> ip=172.16.1.2:172.16.1.1::255.255.255.0::eth0:none console=ttyS0
> dhcp=off builder=1' -net nic,model=virtio,macaddr=52:54:00:12:34:1 -net
> tap,ifname=tap1,script=/bin/true -m 2000 -nographic -smp 8 /dev/null
> qemu: loading initrd (0x1daf359 bytes) at 0x000000007b240000
> Stuck ??
> 
> No backtrace here though. That's all I got from the serial console.
> 
> The only issues I had with the UP guests so far was this:
> 
> + taskset -c 6 sudo -u contain6 env -i qemu-kvm -localtime -kernel
> virtio-kernel -initrd virtio-initrd -nographic -append 'quiet
> clocksource=acpi_pm cifsuser=contain6 cifspass=contain6
> root=cifs://contain6:contain6@172.16.6.1/contain6
> realroot=//172.16.6.1/users/contain6
> ip=172.16.6.2:172.16.6.1::255.255.255.0::eth0:none console=ttyS0
> dhcp=off builder=1' -net nic,model=virtio,macaddr=52:54:00:12:34:6 -net
> tap,ifname=tap6,script=/bin/true -m 2000 -nographic /dev/null
> qemu: loading initrd (0x1daf359 bytes) at 0x000000007b240000
> ..MP-BIOS bug: 8254 timer not connected to IO-APIC
> Kernel panic - not syncing: IO-APIC + timer doesn't work!  Boot with
> apic=debug and send a report.  Then try booting with the 'noapic' option.
> 
> which can be annoying at times too. Can't we just detect that it's the
> detection and give the guest its interrupts? Or should the PIT
> reinjection thing help here?

There are a number of problems that can result in this error, and the
problems are possibly different between the in-kernel PIT and userspace
PIT emulation (note it also happens with in-kernel PIT, just much more
rarely now). You can use the no_timer_check kernel option to bypass it.

Regarding the corruption problem, I have a few questions:

- Is it SMP specific (i.e. do both kernel and userspace irqchip fail)?
	- That would mean UP guests are stable with both kernel and
	  userspace irqchip.

The "Stuck ??" messages seem to be coming from smpboot.c, so for some
reason vcpus are being reset. It doesn't seem to be a triple fault,
because in that case all vcpus would be reset (so yes, the vcpu was
really in BIOS code).

Suggest the following:
- Confirm the problem happens with root on ext3 filesystem (can't you
  mount the CIFS and copy the data over to a local guest disk to
  simulate similar load?).

- Check that the kernel text is not corrupted. Save the "good" kernel 
  text with QEMU's "pmemsave" or "memsave" (you can see start/end in 
  the symbols _text/_etext, /proc/kallsyms) after booting. After you 
  see the crash, save the "bad" kernel text, compare. This can give 
  additional clues (or not).
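
A minimal sketch of the comparison step, assuming the "good" and "bad"
kernel text were saved to two files of equal length. The function name
and the optional base address are assumptions made for illustration, not
part of QEMU:

```python
# Sketch only: report the byte ranges where two memory dumps differ.
# diff_ranges() is a hypothetical helper name.

def diff_ranges(good, bad, base=0):
    """Return [(start, end), ...] with end exclusive, offset by base."""
    if len(good) != len(bad):
        raise ValueError("dumps must have the same length")
    ranges = []
    start = None
    for i, (a, b) in enumerate(zip(good, bad)):
        if a != b:
            if start is None:
                start = i          # a differing run begins here
        elif start is not None:
            ranges.append((base + start, base + i))
            start = None
    if start is not None:          # run extends to the end of the dump
        ranges.append((base + start, base + len(good)))
    return ranges


# Example use (the file names are made up):
#   with open("good.bin", "rb") as g, open("bad.bin", "rb") as b:
#       for start, end in diff_ranges(g.read(), b.read()):
#           print("differs: 0x%x-0x%x" % (start, end))
```

Passing the address of _text as base makes the reported ranges line up
with the addresses in /proc/kallsyms.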

Also, you mentioned "other reports" previously, can you point to them,
please?

Thanks



* Re: KVM guest crashes
  2009-01-23 22:36           ` Marcelo Tosatti
@ 2009-01-24  7:42             ` Alexander Graf
  2009-01-24 13:06               ` Marcelo Tosatti
  2009-01-26 15:53             ` Alexander Graf
  1 sibling, 1 reply; 18+ messages in thread
From: Alexander Graf @ 2009-01-24  7:42 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Avi Kivity, kvm@vger.kernel.org, Joerg Roedel, Sheng Yang

Hi Marcelo,

On 23.01.2009, at 23:36, Marcelo Tosatti wrote:

> Hi Alexander,
>
> On Thu, Jan 22, 2009 at 09:29:46PM +0100, Alexander Graf wrote:
>
>> Following the discussion on IRC, I tried -no-kvm-irqchip and found  
>> some
>> virtual machines broken after >1 day of stress testing again:
>>
>> + sudo -u contain2 env -i qemu-kvm -localtime -kernel virtio-kernel
>> -initrd virtio-initrd -nographic -append 'quiet clocksource=acpi_pm
>> cifsuser=contain2 cifspass=contain2
>> root=cifs://contain2:contain2@172.16.2.1/contain2
>> realroot=//172.16.2.1/users/contain2
>> ip=172.16.2.2:172.16.2.1::255.255.255.0::eth0:none console=ttyS0
>> dhcp=off builder=1' -net nic,model=virtio,macaddr=52:54:00:12:34:2 -net
>> tap,ifname=tap2,script=/bin/true -m 2000 -nographic -smp 4
>> -no-kvm-irqchip /dev/null
>> qemu: loading initrd (0x1daf359 bytes) at 0x000000007b240000
>> Stuck ??
>> Stuck ??
>> BUG: unable to handle kernel NULL pointer dereference at  
>> 0000000000000000
>> IP: [<ffffffff802b539a>] kfree+0x18b/0x26e
>> PGD 0
>> Oops: 0000 [1] SMP
>> last sysfs file:
>> CPU 2
>> Modules linked in:
>> Supported: Yes
>> Pid: 0, comm: swapper Tainted: G S        2.6.27.7-9-default #1
>> RIP: 0010:[<ffffffff802b539a>]  [<ffffffff802b539a>] kfree+0x18b/0x26e
>> RSP: 0018:ffff88007a493e90  EFLAGS: 00010046
>> RAX: 0000000000000002 RBX: ffff8800010397f0 RCX: ffff88007a480778
>> RDX: ffffe20000000000 RSI: ffff8800010397f0 RDI: ffff88007a5ae140
>> RBP: 0000000000000000 R08: ffff8800010395d0 R09: ffff88007a493eb8
>> R10: ffffffff80a59980 R11: ffffffff8021c5d9 R12: 0000000000000001
>> R13: ffff88007ac04080 R14: 0000000010200042 R15: ffff88007a5ae140
>> FS:  0000000000000000(0000) GS:ffff88007a461f40(0000) knlGS:0000000000000000
>> CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
>> CR2: 0000000000000000 CR3: 0000000000201000 CR4: 00000000000006e0
>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>> Process swapper (pid: 0, threadinfo ffff88007a48a000, task  
>> ffff88007a488280)
>> Stack:  ffffffff8023df9c ffffffff8073a108 0000000000000286  
>> ffffffff8024a1eb
>> ffffffff80259d80 ffff8800010397f0 0000000000000000 0000000000000001
>> 000000000000000a 0000000010200042 0000000000000010 ffffffff802831d0
>> Call Trace:
>> [<ffffffff802831d0>] __rcu_process_callbacks+0x189/0x203
>> [<ffffffff80283271>] rcu_process_callbacks+0x27/0x47
>> [<ffffffff802464ed>] __do_softirq+0x84/0x115
>> [<ffffffff8020dc9c>] call_softirq+0x1c/0x28
>> [<ffffffff8020f067>] do_softirq+0x3c/0x81
>> [<ffffffff80246204>] irq_exit+0x3f/0x83
>> [<ffffffff8021ce5f>] smp_apic_timer_interrupt+0x95/0xae
>> [<ffffffff8020d4a3>] apic_timer_interrupt+0x83/0x90
>> [<ffffffff80221f1d>] native_safe_halt+0x2/0x3
>> [<ffffffff80213465>] default_idle+0x38/0x54
>> [<ffffffff8020b34a>] cpu_idle+0xa9/0xf1
>>
>>
>> Code: 01 00 00 00 e8 4c fa ff ff 48 83 3d a0 19 44 00 00 49 8b 44  
>> dd 08
>> 48 8d 78 40 75 04 0f 0b eb fe e8 e5 cc f6 ff 90 e9 c7 00 00 00 <8b>  
>> 55
>> 00 3b 55 04 73 0f 89 d0 4c 89 7c c5 18 8d 42 01 e9 ad 00
>> RIP  [<ffffffff802b539a>] kfree+0x18b/0x26e
>> RSP <ffff88007a493e90>
>> CR2: 0000000000000000
>> ---[ end trace 4eaa2a86a8e2da22 ]---
>>
>>
>> Also after two days of permanent stress testing I also got the Intel
>> machine w/ current git down:
>>
>> + sudo -u contain1 env -i /usr/local/bin/qemu-system-x86_64 -localtime
>> -kernel virtio-kernel -initrd virtio-initrd -nographic -append 'quiet
>> clocksource=acpi_pm cifsuser=contain1 cifspass=contain1
>> root=cifs://contain1:contain1@172.16.1.1/contain1
>> realroot=//172.16.1.1/users/contain1
>> ip=172.16.1.2:172.16.1.1::255.255.255.0::eth0:none console=ttyS0
>> dhcp=off builder=1' -net nic,model=virtio,macaddr=52:54:00:12:34:1 -net
>> tap,ifname=tap1,script=/bin/true -m 2000 -nographic -smp 8 /dev/null
>> qemu: loading initrd (0x1daf359 bytes) at 0x000000007b240000
>> Stuck ??
>>
>> No backtrace here though. That's all I got from the serial console.
>>
>> The only issues I had with the UP guests so far was this:
>>
>> + taskset -c 6 sudo -u contain6 env -i qemu-kvm -localtime -kernel
>> virtio-kernel -initrd virtio-initrd -nographic -append 'quiet
>> clocksource=acpi_pm cifsuser=contain6 cifspass=contain6
>> root=cifs://contain6:contain6@172.16.6.1/contain6
>> realroot=//172.16.6.1/users/contain6
>> ip=172.16.6.2:172.16.6.1::255.255.255.0::eth0:none console=ttyS0
>> dhcp=off builder=1' -net nic,model=virtio,macaddr=52:54:00:12:34:6 -net
>> tap,ifname=tap6,script=/bin/true -m 2000 -nographic /dev/null
>> qemu: loading initrd (0x1daf359 bytes) at 0x000000007b240000
>> ..MP-BIOS bug: 8254 timer not connected to IO-APIC
>> Kernel panic - not syncing: IO-APIC + timer doesn't work!  Boot with
>> apic=debug and send a report.  Then try booting with the 'noapic'  
>> option.
>>
>> which can be annoying at times too. Can't we just detect that it's  
>> the
>> detection and give the guest its interrupts? Or should the PIT
>> reinjection thing help here?
>
> There are a number of problems that can result in this error, and the
> problems are possibly different between the in-kernel PIT and  
> userspace
> PIT emulation (note it also happens with in-kernel PIT, just much more
> rarely now). You can use the no_timer_check kernel option to bypass  
> it.

Ok :-). Thanks. The logic in the kernel for this is really stupid  
(basing timing on clock speed). What about disabling the check if we  
detect KVM?

> Regarding the corruption problem, I have a few questions:
>
> - It is SMP specific (ie both kernel/userspace irqchip fail).
> 	- which means UP guests are stable with both kernel/user
> 	  irqchip.

I have not been able to reproduce any of my issues with UP. I have to  
admit that I only tried UP with in-kernel irqchip.

> The "Stuck ??" messages seem to be coming from smpboot.c. So for some
> reason vcpu's are being reset. Don't seem to be a triple fault because
> in that case all vcpu's would be reset (so yes, the vcpu was really on
> BIOS code).

Hm. I know that OSX turns off CPUs it doesn't need as an alternative  
to deep-sleep. Does Linux do that too?

> Suggest the following:
> - Confirm the problem happens with root on ext3 filesystem (can't you
>  mount the CIFS and copy the data over to a local guest disk to
>  simulate similar load?).

I had Stuck ?? messages without networking, but if it helps I can try  
that too. In the project we're using this for we do things over cifs,  
so that's why I built the test case around it.

> - Check that the kernel text is not corrupted. Save the "good" kernel
>  text with QEMU's "pmemsave" or "memsave" (you can see start/end in
>  the symbols _text/_etext, /proc/kallsyms) after booting. After you
>  see the crash, save the "bad" kernel text, compare. This can give
>  additional clues (or not).

Good idea - I'll try.
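
The start address and size for the dump can be derived from the _text
and _etext symbols mentioned above. This is a minimal sketch, assuming
the usual "address type name" line format of /proc/kallsyms and the QEMU
monitor syntax "memsave addr size file" (memsave takes a guest virtual
address, pmemsave a physical one). The helper names and the sample
addresses in the comments are made up for illustration.

```python
# Sketch only: compute the kernel text range from kallsyms output.
# text_range() and memsave_command() are hypothetical helper names.

def text_range(kallsyms_text):
    """Return (start, size) of the kernel text from kallsyms content."""
    symbols = {}
    for line in kallsyms_text.splitlines():
        parts = line.split()
        # Expected fields: hex address, symbol type, symbol name.
        if len(parts) >= 3 and parts[2] in ("_text", "_etext"):
            symbols[parts[2]] = int(parts[0], 16)
    start = symbols["_text"]
    end = symbols["_etext"]
    return start, end - start


def memsave_command(kallsyms_text, filename):
    """Build a monitor command line that saves the kernel text to a file."""
    start, size = text_range(kallsyms_text)
    return "memsave 0x%x %d %s" % (start, size, filename)


# Example use inside the guest (output file name is made up):
#   with open("/proc/kallsyms") as f:
#       print(memsave_command(f.read(), "good.bin"))
```

The printed line would then be typed into the QEMU monitor once after
boot and once more after a crash, with a different file name each time.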

> Also, you mentioned "other reports" previously, can you point to them,
> please?

Yes, will do later. I gotta run now! Thanks for the reply - it's good  
to know this isn't getting ignored :-).

Alex



* Re: KVM guest crashes
  2009-01-24  7:42             ` Alexander Graf
@ 2009-01-24 13:06               ` Marcelo Tosatti
  2009-01-24 14:30                 ` Alexander Graf
  0 siblings, 1 reply; 18+ messages in thread
From: Marcelo Tosatti @ 2009-01-24 13:06 UTC (permalink / raw)
  To: Alexander Graf; +Cc: Avi Kivity, kvm@vger.kernel.org, Joerg Roedel, Sheng Yang

On Sat, Jan 24, 2009 at 08:42:06AM +0100, Alexander Graf wrote:
>> rarely now). You can use the no_timer_check kernel option to bypass  
>> it.
>
> Ok :-). Thanks. The logic in the kernel for this is really stupid  
> (basing timing on clock speed). What about disabling the check if we  
> detect KVM?

Yes, this is an option. We've talked about it before, but no patch was
merged. The RHEL5.3 kernel skips those checks when it detects VMware
or KVM hypervisors.

We should understand what is happening to fix the fullvirt/old guest
case. For the in-kernel PIT, I believe there is a bug somewhere, either
in the PIT itself or in the interaction with the IOAPIC (failure to
inject interrupts for some reason). I started debugging it by constantly
rebooting an SMP guest, but my testbox died. I hope to get back to it
soon.

>> Regarding the corruption problem, I have a few questions:
>>
>> - It is SMP specific (ie both kernel/userspace irqchip fail).
>> 	- which means UP guests are stable with both kernel/user
>> 	  irqchip.
>
> I have not been able to reproduce any of my issues with UP. I have to  
> admit that I only tried UP with in-kernel irqchip.

OK.

>> The "Stuck ??" messages seem to be coming from smpboot.c. So for some
>> reason vcpu's are being reset. Don't seem to be a triple fault because
>> in that case all vcpu's would be reset (so yes, the vcpu was really on
>> BIOS code).
>
> Hm. I know that OSX turns off CPUs it doesn't need as an alternative to 
> deep-sleep. Does Linux do that too?

Not that I know of, unless you offline CPUs manually, which does not
seem to be the case.

>> Suggest the following:
>> - Confirm the problem happens with root on ext3 filesystem (can't you
>>  mount the CIFS and copy the data over to a local guest disk to
>>  simulate similar load?).
>
> I had Stuck ?? messages without networking, but if it helps I can try  
> that too. In the project we're using this for we do things over cifs, so 
> that's why I built the test case around it.

OK. Just trying to reduce the number of variables involved. I'll set up
a machine to run a similar load next week.

>> - Check that the kernel text is not corrupted. Save the "good" kernel
>>  text with QEMU's "pmemsave" or "memsave" (you can see start/end in
>>  the symbols _text/_etext, /proc/kallsyms) after booting. After you
>>  see the crash, save the "bad" kernel text, compare. This can give
>>  additional clues (or not).
>
> Good idea - I'll try.
>
>> Also, you mentioned "other reports" previously, can you point to them,
>> please?
>
> Yes, will do later. I gotta run now! Thanks for the reply - it's good to 
> know this isn't getting ignored :-).

Have a good weekend.



* Re: KVM guest crashes
  2009-01-24 13:06               ` Marcelo Tosatti
@ 2009-01-24 14:30                 ` Alexander Graf
  0 siblings, 0 replies; 18+ messages in thread
From: Alexander Graf @ 2009-01-24 14:30 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Avi Kivity, kvm@vger.kernel.org, Joerg Roedel, Sheng Yang


On 24.01.2009, at 14:06, Marcelo Tosatti wrote:

> On Sat, Jan 24, 2009 at 08:42:06AM +0100, Alexander Graf wrote:
>>> rarely now). You can use the no_timer_check kernel option to bypass
>>> it.
>>
>> Ok :-). Thanks. The logic in the kernel for this is really stupid
>> (basing timing on clock speed). What about disabling the check if we
>> detect KVM?
>
> Yes, this is an option. We've talked about it before, but no patch was
> merged. The RHEL5.3 kernel skips those checks when it detects VMWare
> or KVM hypervisors.

That sounds clever. But I doubt I'll get anything as intrusive into  
the SLES11 kernel at this point in time :-(.

> We should understand what is happening to fix the fullvirt/old guest
> case. For the in-kernel PIT, I believe there is a bug somewhere,  
> either
> in PIT itself or in the interaction with IOAPIC (failure to inject
> interrupts for some reason). I started debugging it by constantly
> reboot'ing an SMP guest but my testbox died. Hope to get back to it
> soon.

Hm. If I ever get tracing working again, I can try to create one  
too :-).

>>> The "Stuck ??" messages seem to be coming from smpboot.c. So for  
>>> some
>>> reason vcpu's are being reset. Don't seem to be a triple fault  
>>> because
>>> in that case all vcpu's would be reset (so yes, the vcpu was  
>>> really on
>>> BIOS code).
>>
>> Hm. I know that OSX turns off CPUs it doesn't need as an  
>> alternative to
>> deep-sleep. Does Linux do that too?
>
> Not that I know of, unless you offline CPU's manually, which does not
> seem to be the case.

Nope, I don't hotplug anything (though the acpihp module is loaded).

>>> Suggest the following:
>>> - Confirm the problem happens with root on ext3 filesystem (can't  
>>> you
>>> mount the CIFS and copy the data over to a local guest disk to
>>> simulate similar load?).
>>
>> I had Stuck ?? messages without networking, but if it helps I can try
>> that too. In the project we're using this for we do things over  
>> cifs, so
>> that's why I built the test case around it.
>
> OK. Just trying to decrease the variables involved. I'll setup a  
> machine
> to run a similar load next week.

Sounds good :-). I put all the files I tested with online with a link  
in the first mail of this thread. So feel free to take that as an  
inspiration. For non-network testing I simply put -net none there, but  
still had the initrd boot and kill the machine.


>>> Also, you mentioned "other reports" previously, can you point to  
>>> them,
>>> please?
>>
>> Yes, will do later. I gotta run now! Thanks for the reply - it's  
>> good to
>> know this isn't getting ignored :-).
>
> Have a good weekend.

Same to you. I was running for a first-aid course though, not the  
weekend :-).

I was mainly talking here about the thread "Guest Hang Bugs". Though  
with 2.6.25 guests I did get "BUG: soft lockup - CPU#x stuck for ns!"  
messages instead of the "Stuck ??" FWIW.
Originally I created the whole test case to debug this exact bug we  
encountered as well: http://article.gmane.org/gmane.comp.emulators.kvm.devel/21828/

Alex


* Re: KVM guest crashes
  2009-01-23 22:36           ` Marcelo Tosatti
  2009-01-24  7:42             ` Alexander Graf
@ 2009-01-26 15:53             ` Alexander Graf
  2009-01-26 16:21               ` Marcelo Tosatti
  1 sibling, 1 reply; 18+ messages in thread
From: Alexander Graf @ 2009-01-26 15:53 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Avi Kivity, kvm@vger.kernel.org, Joerg Roedel, Sheng Yang

Marcelo Tosatti wrote:
> Hi Alexander,
>
> On Thu, Jan 22, 2009 at 09:29:46PM +0100, Alexander Graf wrote:
>
>   
>> Following the discussion on IRC, I tried -no-kvm-irqchip and found some
>> virtual machines broken after >1 day of stress testing again:
>>
>> + sudo -u contain2 env -i qemu-kvm -localtime -kernel virtio-kernel
>> -initrd virtio-initrd -nographic -append 'quiet clocksource=acpi_pm
>> cifsuser=contain2 cifspass=contain2
>> root=cifs://contain2:contain2@172.16.2.1/contain2
>> realroot=//172.16.2.1/users/contain2
>> ip=172.16.2.2:172.16.2.1::255.255.255.0::eth0:none console=ttyS0
>> dhcp=off builder=1' -net nic,model=virtio,macaddr=52:54:00:12:34:2 -net
>> tap,ifname=tap2,script=/bin/true -m 2000 -nographic -smp 4
>> -no-kvm-irqchip /dev/null
>> qemu: loading initrd (0x1daf359 bytes) at 0x000000007b240000
>> Stuck ??
>> Stuck ??
>> BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
>> IP: [<ffffffff802b539a>] kfree+0x18b/0x26e
>> PGD 0
>> Oops: 0000 [1] SMP
>> last sysfs file:
>> CPU 2
>> Modules linked in:
>> Supported: Yes
>> Pid: 0, comm: swapper Tainted: G S        2.6.27.7-9-default #1
>> RIP: 0010:[<ffffffff802b539a>]  [<ffffffff802b539a>] kfree+0x18b/0x26e
>> RSP: 0018:ffff88007a493e90  EFLAGS: 00010046
>> RAX: 0000000000000002 RBX: ffff8800010397f0 RCX: ffff88007a480778
>> RDX: ffffe20000000000 RSI: ffff8800010397f0 RDI: ffff88007a5ae140
>> RBP: 0000000000000000 R08: ffff8800010395d0 R09: ffff88007a493eb8
>> R10: ffffffff80a59980 R11: ffffffff8021c5d9 R12: 0000000000000001
>> R13: ffff88007ac04080 R14: 0000000010200042 R15: ffff88007a5ae140
>> FS:  0000000000000000(0000) GS:ffff88007a461f40(0000) knlGS:0000000000000000
>> CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
>> CR2: 0000000000000000 CR3: 0000000000201000 CR4: 00000000000006e0
>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>> Process swapper (pid: 0, threadinfo ffff88007a48a000, task ffff88007a488280)
>> Stack:  ffffffff8023df9c ffffffff8073a108 0000000000000286 ffffffff8024a1eb
>>  ffffffff80259d80 ffff8800010397f0 0000000000000000 0000000000000001
>>  000000000000000a 0000000010200042 0000000000000010 ffffffff802831d0
>> Call Trace:
>>  [<ffffffff802831d0>] __rcu_process_callbacks+0x189/0x203
>>  [<ffffffff80283271>] rcu_process_callbacks+0x27/0x47
>>  [<ffffffff802464ed>] __do_softirq+0x84/0x115
>>  [<ffffffff8020dc9c>] call_softirq+0x1c/0x28
>>  [<ffffffff8020f067>] do_softirq+0x3c/0x81
>>  [<ffffffff80246204>] irq_exit+0x3f/0x83
>>  [<ffffffff8021ce5f>] smp_apic_timer_interrupt+0x95/0xae
>>  [<ffffffff8020d4a3>] apic_timer_interrupt+0x83/0x90
>>  [<ffffffff80221f1d>] native_safe_halt+0x2/0x3
>>  [<ffffffff80213465>] default_idle+0x38/0x54
>>  [<ffffffff8020b34a>] cpu_idle+0xa9/0xf1
>>
>>
>> Code: 01 00 00 00 e8 4c fa ff ff 48 83 3d a0 19 44 00 00 49 8b 44 dd 08
>> 48 8d 78 40 75 04 0f 0b eb fe e8 e5 cc f6 ff 90 e9 c7 00 00 00 <8b> 55
>> 00 3b 55 04 73 0f 89 d0 4c 89 7c c5 18 8d 42 01 e9 ad 00
>> RIP  [<ffffffff802b539a>] kfree+0x18b/0x26e
>>  RSP <ffff88007a493e90>
>> CR2: 0000000000000000
>> ---[ end trace 4eaa2a86a8e2da22 ]---
>>
>>
>> Also after two days of permanent stress testing I also got the Intel
>> machine w/ current git down:
>>
>> + sudo -u contain1 env -i /usr/local/bin/qemu-system-x86_64 -localtime
>> -kernel virtio-kernel -initrd virtio-initrd -nographic -append 'quiet
>> clocksource=acpi_pm cifsuser=contain1 cifspass=contain1
>> root=cifs://contain1:contain1@172.16.1.1/contain1
>> realroot=//172.16.1.1/users/contain1
>> ip=172.16.1.2:172.16.1.1::255.255.255.0::eth0:none console=ttyS0
>> dhcp=off builder=1' -net nic,model=virtio,macaddr=52:54:00:12:34:1 -net
>> tap,ifname=tap1,script=/bin/true -m 2000 -nographic -smp 8 /dev/null
>> qemu: loading initrd (0x1daf359 bytes) at 0x000000007b240000
>> Stuck ??
>>
>> No backtrace here though. That's all I got from the serial console.
>>
>> The only issues I had with the UP guests so far was this:
>>
>> + taskset -c 6 sudo -u contain6 env -i qemu-kvm -localtime -kernel
>> virtio-kernel -initrd virtio-initrd -nographic -append 'quiet
>> clocksource=acpi_pm cifsuser=contain6 cifspass=contain6
>> root=cifs://contain6:contain6@172.16.6.1/contain6
>> realroot=//172.16.6.1/users/contain6
>> ip=172.16.6.2:172.16.6.1::255.255.255.0::eth0:none console=ttyS0
>> dhcp=off builder=1' -net nic,model=virtio,macaddr=52:54:00:12:34:6 -net
>> tap,ifname=tap6,script=/bin/true -m 2000 -nographic /dev/null
>> qemu: loading initrd (0x1daf359 bytes) at 0x000000007b240000
>> ..MP-BIOS bug: 8254 timer not connected to IO-APIC
>> Kernel panic - not syncing: IO-APIC + timer doesn't work!  Boot with
>> apic=debug and send a report.  Then try booting with the 'noapic' option.
>>
>> which can be annoying at times too. Can't we just detect that it's the
>> detection and give the guest its interrupts? Or should the PIT
>> reinjection thing help here?
>>     
>
> There are a number of problems that can result in this error, and the
> problems are possibly different between the in-kernel PIT and userspace
> PIT emulation (note it also happens with in-kernel PIT, just much more
> rarely now). You can use the no_timer_check kernel option to bypass it.
>   

Hm - that option disables the whole check, making it always fail. I
haven't seen any way to actually disable the check, telling Linux things
are OK :-(.

Alex



* Re: KVM guest crashes
  2009-01-26 15:53             ` Alexander Graf
@ 2009-01-26 16:21               ` Marcelo Tosatti
  2009-01-26 16:33                 ` Alexander Graf
  0 siblings, 1 reply; 18+ messages in thread
From: Marcelo Tosatti @ 2009-01-26 16:21 UTC (permalink / raw)
  To: Alexander Graf; +Cc: Avi Kivity, kvm@vger.kernel.org, Joerg Roedel, Sheng Yang

On Mon, Jan 26, 2009 at 04:53:21PM +0100, Alexander Graf wrote:
> > There are a number of problems that can result in this error, and the
> > problems are possibly different between the in-kernel PIT and userspace
> > PIT emulation (note it also happens with in-kernel PIT, just much more
> > rarely now). You can use the no_timer_check kernel option to bypass it.
> >   
> 
> Hm - that option disables the whole check, making it always fail. I
> haven't seen any way to actually disable the check, telling Linux things
> are OK :-(.

Hum, the option makes timer_irq_works always return true. Works for me
with in-kernel PIT.

What do you see with "apic=debug no_timer_check"?



* Re: KVM guest crashes
  2009-01-26 16:21               ` Marcelo Tosatti
@ 2009-01-26 16:33                 ` Alexander Graf
  0 siblings, 0 replies; 18+ messages in thread
From: Alexander Graf @ 2009-01-26 16:33 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Avi Kivity, kvm@vger.kernel.org, Joerg Roedel, Sheng Yang

Marcelo Tosatti wrote:
> On Mon, Jan 26, 2009 at 04:53:21PM +0100, Alexander Graf wrote:
>   
>>> There are a number of problems that can result in this error, and the
>>> problems are possibly different between the in-kernel PIT and userspace
>>> PIT emulation (note it also happens with in-kernel PIT, just much more
>>> rarely now). You can use the no_timer_check kernel option to bypass it.
>>>   
>>>       
>> Hm - that option disables the whole check, making it always fail. I
>> haven't seen any way to actually disable the check, telling Linux things
>> are OK :-(.
>>     
>
> Hum, the option makes timer_irq_works always return true. Works for me
> with in-kernel PIT.
>
> What do you see with "apic=debug no_timer_check"?
>   


It does work with "noapic" for me, but that means I'm using the old PIC
(which isn't necessarily bad, right?). So I can at least work around the
issue for us now. It still needs to be fixed nevertheless.

with "apic=debug no_apic_timer" 2.6.27 does:

Setting APIC routing to flat
..TIMER: vector=0x30 apic1=0 pin1=0 apic2=-1 pin2=-1
..MP-BIOS bug: 8254 timer not connected to IO-APIC
...trying to set up timer (IRQ0) through the 8259A ...
..... (found apic 0 pin 0) ...
....... works.


while 2.6.25 does:

..MP-BIOS bug: 8254 timer not connected to IO-APIC
Kernel panic - not syncing: IO-APIC + timer doesn't work! Try using the
'noapic' kernel parameter




Thread overview: 18+ messages
2009-01-20 15:49 KVM guest crashes Alexander Graf
2009-01-20 20:07 ` Avi Kivity
2009-01-20 20:20   ` Alexander Graf
2009-01-21  8:14   ` Alexander Graf
2009-01-21  9:05     ` Avi Kivity
2009-01-21  9:36       ` Avi Kivity
2009-01-21 10:44         ` Alexander Graf
2009-01-22 20:29         ` Alexander Graf
2009-01-22 20:36           ` Alexander Graf
2009-01-22 20:55             ` Alexander Graf
2009-01-23 16:36               ` Alexander Graf
2009-01-23 22:36           ` Marcelo Tosatti
2009-01-24  7:42             ` Alexander Graf
2009-01-24 13:06               ` Marcelo Tosatti
2009-01-24 14:30                 ` Alexander Graf
2009-01-26 15:53             ` Alexander Graf
2009-01-26 16:21               ` Marcelo Tosatti
2009-01-26 16:33                 ` Alexander Graf
