Detect guest panic

public inbox for kvm@vger.kernel.org

* Detect guest panic
@ 2008-11-18 15:36 Emmanuel Lacour
  2008-11-18 16:39 ` David Mair
       [not found] ` <9b51ffb30811181134q62ea322eo1b7addbffa4aeecd@mail.gmail.com>
  0 siblings, 2 replies; 5+ messages in thread
From: Emmanuel Lacour @ 2008-11-18 15:36 UTC (permalink / raw)
  To: kvm

Dear users/developers,

I have a guest which freezes 2 or 3 times per week (nothing in the logs,
blank VNC screen). I'm going to try to fix this by testing an upgrade to a
more recent kernel/kvm, but in the meantime I would like to write a script
which restarts the guest domain when it freezes.

Is there a way to detect that the VM is in this kind of panic?


* Re: Detect guest panic
  2008-11-18 15:36 Detect guest panic Emmanuel Lacour
@ 2008-11-18 16:39 ` David Mair
  2008-11-18 16:49   ` Emmanuel Lacour
       [not found] ` <9b51ffb30811181134q62ea322eo1b7addbffa4aeecd@mail.gmail.com>
  1 sibling, 1 reply; 5+ messages in thread
From: David Mair @ 2008-11-18 16:39 UTC (permalink / raw)
  To: Emmanuel Lacour; +Cc: kvm

Emmanuel Lacour wrote:
> Dear users/developers,
>
> I have a guest which freezes 2 or 3 times per week (nothing in the logs,
> blank VNC screen). I'm going to try to fix this by testing an upgrade to a
> more recent kernel/kvm, but in the meantime I would like to write a script
> which restarts the guest domain when it freezes.
>
> Is there a way to detect that the VM is in this kind of panic?
>   
If the guest has a reachable IP address, the simplest way might be to
ping the guest from the host every so often and, if it stops responding
for long enough to make you believe it has frozen, kill the qemu process
and run it again. I suppose you could also expose the qemu console via a
socket or other host file descriptor; then the pinging program on the
host could try to reset the guest without killing the qemu process.
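
A minimal sketch of the approach described above, not a tested production
tool. It assumes a Linux host with the iputils `ping` command and a guest
started with its human monitor on a Unix socket (for example
`-monitor unix:/path/to/socket,server,nowait`). The names `ping_once`,
`FreezeDetector` and `send_monitor_command`, the threshold and the paths
are illustrative and do not come from this thread.

```python
import socket
import subprocess


def ping_once(host, timeout_s=2):
    """Return True if a single ICMP echo to `host` is answered."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


class FreezeDetector:
    """Report a freeze only after `threshold` consecutive failed probes."""

    def __init__(self, threshold=5):
        self.threshold = threshold
        self.failures = 0

    def record(self, alive):
        """Feed one probe result; return True once the guest looks frozen."""
        self.failures = 0 if alive else self.failures + 1
        return self.failures >= self.threshold


def send_monitor_command(socket_path, command="system_reset"):
    """Write one command line to a QEMU human monitor on a Unix socket."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        sock.sendall(command.encode("ascii") + b"\n")


# Possible polling loop (address and socket path are placeholders):
#
#   detector = FreezeDetector(threshold=5)
#   while True:
#       if detector.record(ping_once("192.0.2.10")):
#           send_monitor_command("/var/run/guest1.monitor")
#           detector.failures = 0
#       time.sleep(30)
```

Requiring several consecutive failures avoids resetting the guest on a
single lost packet. A failed ping still cannot tell a frozen guest from a
broken network path.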

-- 
David.


* Re: Detect guest panic
  2008-11-18 16:39 ` David Mair
@ 2008-11-18 16:49   ` Emmanuel Lacour
  2008-11-18 17:13     ` Jan Kiszka
  0 siblings, 1 reply; 5+ messages in thread
From: Emmanuel Lacour @ 2008-11-18 16:49 UTC (permalink / raw)
  To: kvm

On Tue, Nov 18, 2008 at 09:39:35AM -0700, David Mair wrote:
>>   
> If the guest has a reachable IP address the simplest way might be to  
> ping the guest from the host every so often and, if it stops responding  
> for long enough to make you believe it has frozen, kill the qemu process  
> and run it again. I suppose you could also expose the qemu console via a  
> socket or other host file descriptor then you can have the pinging  
> program on the host try to reset the guest without killing the qemu 
> process.
>

Thanks for your help, but ping is not enough: if the guest doesn't answer,
it doesn't mean that the VM has crashed; it can mean that only the network
has crashed (I have this kind of problem too, see the other recent thread
about virtio_net ;)), and I have other fixes for that kind of problem.

Well I'm looking for some sort of "watchdog" kvm device ;)



* Re: Detect guest panic
  2008-11-18 16:49   ` Emmanuel Lacour
@ 2008-11-18 17:13     ` Jan Kiszka
  0 siblings, 0 replies; 5+ messages in thread
From: Jan Kiszka @ 2008-11-18 17:13 UTC (permalink / raw)
  To: Emmanuel Lacour; +Cc: kvm

Emmanuel Lacour wrote:
> On Tue, Nov 18, 2008 at 09:39:35AM -0700, David Mair wrote:
>>>   
>> If the guest has a reachable IP address the simplest way might be to  
>> ping the guest from the host every so often and, if it stops responding  
>> for long enough to make you believe it has frozen, kill the qemu process  
>> and run it again. I suppose you could also expose the qemu console via a  
>> socket or other host file descriptor then you can have the pinging  
>> program on the host try to reset the guest without killing the qemu 
>> process.
>>
> 
> Thanks for your help, but ping is not enough: if the guest doesn't answer,
> it doesn't mean that the VM has crashed; it can mean that only the network
> has crashed (I have this kind of problem too, see the other recent thread
> about virtio_net ;)), and I have other fixes for that kind of problem.
> 
> Well I'm looking for some sort of "watchdog" kvm device ;)

nmi_watchdog=1 (NMI watchdog via IO-APIC) works for Linux guests if the
host uses kvm-intel (support in kvm-amd is not yet implemented). Other
OSes that can exploit this trick should also be able to benefit from it.
There is just one open issue regarding NMIs, for which a patch is
pending; expect the next kvm release to include the fix.
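
As a small illustration, the guest can check whether it was actually
booted with this parameter by reading its kernel command line. This is
only a sketch; the helper names `nmi_watchdog_mode` and `read_cmdline` are
made up for this example.

```python
def nmi_watchdog_mode(cmdline):
    """Return the value of nmi_watchdog= from a kernel command line, or None."""
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == "nmi_watchdog" and sep:
            return value
    return None


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running Linux kernel was booted with."""
    with open(path) as f:
        return f.read()


# Inside the guest:
#   mode = nmi_watchdog_mode(read_cmdline())
#   print("nmi_watchdog =", mode)
```

This only confirms that the parameter was passed at boot; it does not
prove that NMIs are being delivered to the guest.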

Otherwise, you are free to define and implement some virt-watchdog (what
would be a hardware watchdog with a link to some reset pin in real
life), letting the emulation code trigger a system_reset when the timer
fires. You could also choose to emulate an existing watchdog interface
for which there are already drivers for your guest OS (we've done that
for virtualizing a custom board).
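
The timer logic of such a virt-watchdog can be modelled in a few lines.
The sketch below is a Python model for illustration only: a real device
would live in the emulator's own code, and the class name, the `kick` and
`poll` methods and the `on_expire` callback are hypothetical. The guest
driver re-arms the timer periodically; if it stops doing so, the callback
(standing in for system_reset) fires once.

```python
import time


class VirtualWatchdog:
    """Model of a watchdog timer that fires a callback when not re-armed."""

    def __init__(self, timeout_s, on_expire, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.on_expire = on_expire  # stands in for triggering system_reset
        self.clock = clock
        self.deadline = None  # not armed until the first kick

    def kick(self):
        """Guest wrote to the (hypothetical) watchdog register: re-arm."""
        self.deadline = self.clock() + self.timeout_s

    def poll(self):
        """Called periodically by the emulator; fire once if expired."""
        if self.deadline is not None and self.clock() >= self.deadline:
            self.deadline = None  # disarm so the reset fires only once
            self.on_expire()
            return True
        return False
```

The watchdog stays disarmed until the first kick, so a guest without a
driver for it is never reset by accident.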

Jan

-- 
Siemens AG, Corporate Technology, CT SE 2 ES-OS
Corporate Competence Center Embedded Linux


* Fwd: Detect guest panic
       [not found] ` <9b51ffb30811181134q62ea322eo1b7addbffa4aeecd@mail.gmail.com>
@ 2008-11-18 19:41   ` Roland Lammel
  0 siblings, 0 replies; 5+ messages in thread
From: Roland Lammel @ 2008-11-18 19:41 UTC (permalink / raw)
  To: kvm

On Tue, Nov 18, 2008 at 4:36 PM, Emmanuel Lacour
<elacour@easter-eggs.com> wrote:
>
> Dear users/developers,
>
> I have a guest which freezes 2 or 3 times per week (nothing in the logs,
> blank VNC screen). I'm going to try to fix this by testing an upgrade to a
> more recent kernel/kvm, but in the meantime I would like to write a script
> which restarts the guest domain when it freezes.

I saw similar issues on a Debian lenny 2.6.26-1-amd64 64-bit kvm host
(currently running kvm-72) with Debian lenny 2.6.26-1-486 32-bit guests.
I have configured ntpd on the host and in the guests, but of course ntpd
crashes after the severe clock jump described below.

The problem shows exactly the same symptoms, but the system is able to
recover from time to time, which allowed me to see the actual cause of
the problem: it seems to be a severe backward time jump which causes the
VM to hang. The clock mostly ends up somewhere in Nov 1912, so the jump
seems to be a backward shift from the current time (e.g. an int
overflow).

When it is able to recover, I see a very big clock jump (for the kernel
timer it is a forward jump, but it seems to set the system clock to
Nov 1912):
Nov 12 20:56:03 bit kernel: [   38.061596] warning: `ntpd' uses 32-bit capabilities (legacy support in use)
Nov 13 06:25:03 bit kernel: imklog 3.18.2, log source = /proc/kmsg started.
Nov 30 06:25:48 bit kernel: imklog 3.18.2, log source = /proc/kmsg started.
Nov 30 06:25:48 bit kernel: imklog 3.18.2, log source = /proc/kmsg started.
Nov 30 06:25:51 bit kernel: [1266940721.901855] INFO: task postdrop:19268 blocked for more than 120 seconds.
Nov 30 06:25:51 bit kernel: [1266940721.902793] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 30 06:25:51 bit kernel: [1266940721.905843] postdrop      D c014f55e     0 19268  19267
Nov 30 06:25:51 bit kernel: [1266940721.906697]        dd8f9c00 00000086 00000000 c014f55e 54541a81 1194f8cd dd8f9d8c 00015f63
Nov 30 06:25:51 bit kernel: [1266940721.907799]        00000000 be709d78 be709d78 c657c3c4 dec3b400 c02a5b89 dd92ded4 dda43ed4
Nov 30 06:25:51 bit kernel: [1266940721.908245]        be709d78 c0121fc7 dd8f9c00 c03ec700 c02a5b84 74736f70 706f7264 642d7000
Nov 30 06:25:51 bit kernel: [1266940721.909838] Call Trace:
Nov 30 06:25:51 bit kernel: [1266940721.910957]  [<c014f55e>] write_cache_pages+0x227/0x26d
Nov 30 06:25:51 bit kernel: [1266940721.911801]  [<c02a5b89>] schedule_timeout+0x69/0x86
Nov 30 06:25:51 bit kernel: [1266940721.912646]  [<c0121fc7>] process_timeout+0x0/0x5
Nov 30 06:25:51 bit kernel: [1266940721.913463]  [<c02a5b84>] schedule_timeout+0x64/0x86
Nov 30 06:25:51 bit kernel: [1266940721.914288]  [<e00852e4>] journal_stop+0x7d/0x12b [jbd]
Nov 30 06:25:51 bit kernel: [1266940721.915134]  [<c017bfcd>] __writeback_single_inode+0x13f/0x231
Nov 30 06:25:51 bit kernel: [1266940721.916017]  [<c014f5ee>] do_writepages+0x29/0x30
Nov 30 06:25:51 bit kernel: [1266940721.916834]  [<c014ace8>] __filemap_fdatawrite_range+0x65/0x70
Nov 30 06:25:51 bit kernel: [1266940721.917722]  [<e00fbeab>] ext3_sync_file+0x87/0x9c [ext3]
Nov 30 06:25:51 bit kernel: [1266940721.918580]  [<c017e6f0>] do_fsync+0x3d/0x7e
Nov 30 06:25:51 bit kernel: [1266940721.919356]  [<c017e74e>] __do_fsync+0x1d/0x2b
Nov 30 06:25:51 bit kernel: [1266940721.920142]  [<c010372f>] sysenter_past_esp+0x78/0xb9
Nov 30 06:25:51 bit kernel: [1266940721.920993]  =======================
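
The jump is visible in the bracketed kernel timestamps above: they go from
about 38 seconds to about 1266940721 seconds (roughly 40 years of apparent
uptime) between two consecutive stamped lines. A hypothetical helper to
flag such jumps in a log could look like the sketch below; the function
name and the one-hour threshold are arbitrary choices for illustration.

```python
import re

# printk timestamp in square brackets, e.g. "[   38.061596]"
STAMP = re.compile(r"\[\s*(\d+\.\d+)\]")


def find_clock_jumps(lines, max_step_s=3600.0):
    """Return (previous, current) pairs where the stamp moved too far."""
    jumps = []
    previous = None
    for line in lines:
        match = STAMP.search(line)
        if not match:
            continue  # line carries no kernel timestamp
        current = float(match.group(1))
        if previous is not None and abs(current - previous) > max_step_s:
            jumps.append((previous, current))
        previous = current
    return jumps


# Example use on a syslog file (path is a placeholder):
#   with open("/var/log/kern.log") as f:
#       for before, after in find_clock_jumps(f):
#           print("kernel clock jumped from", before, "to", after)
```

A large step is not always an error, since the kernel timestamp restarts
near zero after a reboot, so the output needs to be read together with
the syslog dates.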

The guest is not really usable anymore, as all disk I/O (mostly writes,
but also reads) tends to hang the system completely.

I have now manually compiled kvm-79 (including the kernel modules) and am
running 3 instances on it; none of them has crashed so far, but it has
only been 20 hours.

For me the ping check is actually enough to detect if the host is ok,
and I'll probably use mon or something similar to just shut down and
restart the instance.

Cheers

+rl

Roland Lammel
QuikIT - IT Lösungen - flexibel und schnell
Web: http://www.quikit.at
Email: info@quikit.at

"Enjoy your job, make lots of money, work within the law. Choose any two."
