* Detect guest panic
From: Emmanuel Lacour @ 2008-11-18 15:36 UTC
To: kvm

Dear users/developers,

I have a guest which freezes 2 or 3 times per week (nothing in the logs,
blank VNC screen). I'm going to try to fix this by upgrading to a more
recent kernel/kvm, but in the meantime I would like to write a script
that restarts the guest domain when it freezes.

Is there a way to detect that the VM is in this kind of panic?
* Re: Detect guest panic
From: David Mair @ 2008-11-18 16:39 UTC
To: Emmanuel Lacour; +Cc: kvm

Emmanuel Lacour wrote:
> Dear users/developers,
>
> I have a guest which freezes 2 or 3 times per week (nothing in the logs,
> blank VNC screen). I'm going to try to fix this by upgrading to a more
> recent kernel/kvm, but in the meantime I would like to write a script
> that restarts the guest domain when it freezes.
>
> Is there a way to detect that the VM is in this kind of panic?
>

If the guest has a reachable IP address, the simplest way might be to
ping the guest from the host every so often and, if it stops responding
for long enough to make you believe it has frozen, kill the qemu process
and run it again. I suppose you could also expose the qemu console via a
socket or other host file descriptor; then the pinging program on the
host could reset the guest without killing the qemu process.

--
David.
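David's suggestion can be sketched as a small host-side loop. This is a minimal sketch, not a tested tool: it assumes a Linux host with iputils `ping` (for the `-c` and `-W` options) and a qemu started with its monitor on a Unix socket (for example `-monitor unix:/var/run/guest1.monitor,server,nowait`), so the loop can send the monitor command `system_reset` instead of killing the process. The IP address, socket path, intervals, threshold, and the names `FreezeDetector` and `reset_via_monitor` are hypothetical placeholders; the monitor exchange has not been checked against a real qemu build.

```python
import socket
import subprocess
import time

# Hypothetical placeholders: adjust for your own guest and qemu setup.
GUEST_IP = "192.0.2.10"                      # example address (TEST-NET-1)
MONITOR_SOCKET = "/var/run/guest1.monitor"   # assumes -monitor unix:<path>,server,nowait
PING_INTERVAL_S = 30                         # seconds between probes
FAILURE_THRESHOLD = 10                       # consecutive failures => "frozen"
REBOOT_GRACE_S = 300                         # let the guest boot before probing again


def guest_answers_ping(ip: str, timeout_s: int = 2) -> bool:
    """Send a single ICMP echo with iputils ping; True if a reply came back."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), ip],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


class FreezeDetector:
    """Counts consecutive failed probes; any successful probe resets the count."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        self.failures = 0

    def record(self, ok: bool) -> bool:
        """Record one probe result; return True once the guest looks frozen."""
        self.failures = 0 if ok else self.failures + 1
        return self.failures >= self.threshold

    def reset(self) -> None:
        """Forget earlier failures, e.g. right after the guest was reset."""
        self.failures = 0


def reset_via_monitor(path: str) -> None:
    """Send the monitor command system_reset over qemu's Unix monitor socket."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(path)
        sock.sendall(b"system_reset\n")


def main() -> None:
    """Probe forever; reset the guest after FAILURE_THRESHOLD misses in a row."""
    detector = FreezeDetector(FAILURE_THRESHOLD)
    while True:
        if detector.record(guest_answers_ping(GUEST_IP)):
            reset_via_monitor(MONITOR_SOCKET)
            detector.reset()
            time.sleep(REBOOT_GRACE_S)
        time.sleep(PING_INTERVAL_S)
```

`main()` is meant to be started from an init script or a process supervisor. Counting consecutive failures rather than reacting to a single lost reply keeps one dropped packet from rebooting the guest; as the next message points out, though, a silent guest may only have lost its network, so the threshold should be generous.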
* Re: Detect guest panic
From: Emmanuel Lacour @ 2008-11-18 16:49 UTC
To: kvm

On Tue, Nov 18, 2008 at 09:39:35AM -0700, David Mair wrote:
> If the guest has a reachable IP address, the simplest way might be to
> ping the guest from the host every so often and, if it stops responding
> for long enough to make you believe it has frozen, kill the qemu process
> and run it again. I suppose you could also expose the qemu console via a
> socket or other host file descriptor; then the pinging program on the
> host could reset the guest without killing the qemu process.
>

Thanks for your help, but ping is not enough: if the guest doesn't
answer, that doesn't mean the VM has crashed; it may mean that only the
network has crashed (I have this kind of problem too, see the other
recent thread about virtio_net ;)), and I have other fixes for that kind
of problem.

Well, I'm looking for some sort of "watchdog" kvm device ;)
* Re: Detect guest panic
From: Jan Kiszka @ 2008-11-18 17:13 UTC
To: Emmanuel Lacour; +Cc: kvm

Emmanuel Lacour wrote:
> On Tue, Nov 18, 2008 at 09:39:35AM -0700, David Mair wrote:
>> If the guest has a reachable IP address, the simplest way might be to
>> ping the guest from the host every so often and, if it stops responding
>> for long enough to make you believe it has frozen, kill the qemu process
>> and run it again. I suppose you could also expose the qemu console via a
>> socket or other host file descriptor; then the pinging program on the
>> host could reset the guest without killing the qemu process.
>>
>
> Thanks for your help, but ping is not enough: if the guest doesn't
> answer, that doesn't mean the VM has crashed; it may mean that only the
> network has crashed (I have this kind of problem too, see the other
> recent thread about virtio_net ;)), and I have other fixes for that kind
> of problem.
>
> Well, I'm looking for some sort of "watchdog" kvm device ;)

Booting Linux guests with nmi_watchdog=1 (NMI watchdog via the IO-APIC)
works if the host uses kvm-intel (kvm-amd support is not yet
implemented). Other guest OSes that can exploit the same trick should
benefit from it as well. There is just one open issue regarding NMIs, for
which a patch is pending; expect the next kvm release to include a fix.

Otherwise, you are free to define and implement some virt-watchdog (in
real life, a hardware watchdog wired to a reset pin), letting the
emulation code trigger a system_reset when the timer fires. You could
also emulate an existing watchdog interface for which your guest OS
already has drivers (we've done that when virtualizing a custom board).
Jan

--
Siemens AG, Corporate Technology, CT SE 2 ES-OS
Corporate Competence Center Embedded Linux
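The virt-watchdog Jan describes boils down to a deadline that the guest keeps pushing forward. The model below only illustrates that logic; it is not QEMU code. The class and method names are hypothetical, time is passed in explicitly so the behaviour is deterministic, and `on_expire` stands in for whatever the emulator would call, such as its `system_reset` handling.

```python
from typing import Callable, Optional


class VirtWatchdog:
    """Toy model of a virtual watchdog device (hypothetical names, not QEMU code).

    The guest driver arms the device, then must kick() it more often than
    every `timeout` seconds. The emulator calls tick(now) periodically; once
    the deadline has passed, it fires `on_expire` (e.g. a system reset).
    """

    def __init__(self, timeout: float, on_expire: Callable[[], None]) -> None:
        self.timeout = timeout
        self.on_expire = on_expire
        self.deadline: Optional[float] = None  # None means "not armed"

    def arm(self, now: float) -> None:
        """Guest driver enables the watchdog (e.g. when it opens the device)."""
        self.deadline = now + self.timeout

    def kick(self, now: float) -> None:
        """Guest proves it is alive; push the deadline forward if armed."""
        if self.deadline is not None:
            self.deadline = now + self.timeout

    def tick(self, now: float) -> None:
        """Emulator-side check; fire once and disarm, like a power-on reset."""
        if self.deadline is not None and now >= self.deadline:
            self.deadline = None  # reset returns the device to its disarmed state
            self.on_expire()
```

A frozen guest stops calling `kick()`, so the first `tick()` after the deadline triggers the reset. Disarming on expiry models the device returning to its power-on state, so the rebooted guest's driver has to arm it again before it is protected.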
* Fwd: Detect guest panic
From: Roland Lammel @ 2008-11-18 19:41 UTC
To: kvm

On Tue, Nov 18, 2008 at 4:36 PM, Emmanuel Lacour <elacour@easter-eggs.com> wrote:
>
> Dear users/developers,
>
> I have a guest which freezes 2 or 3 times per week (nothing in the logs,
> blank VNC screen). I'm going to try to fix this by upgrading to a more
> recent kernel/kvm, but in the meantime I would like to write a script
> that restarts the guest domain when it freezes.

I saw similar issues on a Debian lenny 2.6.26-1-amd64 (64-bit) kvm host,
currently running kvm-72, with Debian lenny 2.6.26-1-486 (32-bit)
guests. ntpd is configured on the host and in the guests, but of course
ntpd crashes after such a severe clock jump.

The problem shows exactly the same symptoms, but the system is sometimes
able to recover, which allowed me to see what seems to be the actual
cause: a severe backward time jump. The clock mostly lands somewhere in
Nov 1912, so it looks like a backward shift from the current time (e.g.
an integer overflow), and this causes the VM to hang. When the system
does recover, I see a very big clock jump: for the kernel timer it is a
forward jump, but it seems to leave the system clock in Nov 1912.

Nov 12 20:56:03 bit kernel: [ 38.061596] warning: `ntpd' uses 32-bit capabilities (legacy support in use)
Nov 13 06:25:03 bit kernel: imklog 3.18.2, log source = /proc/kmsg started.
Nov 30 06:25:48 bit kernel: imklog 3.18.2, log source = /proc/kmsg started.
Nov 30 06:25:48 bit kernel: imklog 3.18.2, log source = /proc/kmsg started.
Nov 30 06:25:51 bit kernel: [1266940721.901855] INFO: task postdrop:19268 blocked for more than 120 seconds.
Nov 30 06:25:51 bit kernel: [1266940721.902793] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 30 06:25:51 bit kernel: [1266940721.905843] postdrop D c014f55e 0 19268 19267
Nov 30 06:25:51 bit kernel: [1266940721.906697] dd8f9c00 00000086 00000000 c014f55e 54541a81 1194f8cd dd8f9d8c 00015f63
Nov 30 06:25:51 bit kernel: [1266940721.907799] 00000000 be709d78 be709d78 c657c3c4 dec3b400 c02a5b89 dd92ded4 dda43ed4
Nov 30 06:25:51 bit kernel: [1266940721.908245] be709d78 c0121fc7 dd8f9c00 c03ec700 c02a5b84 74736f70 706f7264 642d7000
Nov 30 06:25:51 bit kernel: [1266940721.909838] Call Trace:
Nov 30 06:25:51 bit kernel: [1266940721.910957] [<c014f55e>] write_cache_pages+0x227/0x26d
Nov 30 06:25:51 bit kernel: [1266940721.911801] [<c02a5b89>] schedule_timeout+0x69/0x86
Nov 30 06:25:51 bit kernel: [1266940721.912646] [<c0121fc7>] process_timeout+0x0/0x5
Nov 30 06:25:51 bit kernel: [1266940721.913463] [<c02a5b84>] schedule_timeout+0x64/0x86
Nov 30 06:25:51 bit kernel: [1266940721.914288] [<e00852e4>] journal_stop+0x7d/0x12b [jbd]
Nov 30 06:25:51 bit kernel: [1266940721.915134] [<c017bfcd>] __writeback_single_inode+0x13f/0x231
Nov 30 06:25:51 bit kernel: [1266940721.916017] [<c014f5ee>] do_writepages+0x29/0x30
Nov 30 06:25:51 bit kernel: [1266940721.916834] [<c014ace8>] __filemap_fdatawrite_range+0x65/0x70
Nov 30 06:25:51 bit kernel: [1266940721.917722] [<e00fbeab>] ext3_sync_file+0x87/0x9c [ext3]
Nov 30 06:25:51 bit kernel: [1266940721.918580] [<c017e6f0>] do_fsync+0x3d/0x7e
Nov 30 06:25:51 bit kernel: [1266940721.919356] [<c017e74e>] __do_fsync+0x1d/0x2b
Nov 30 06:25:51 bit kernel: [1266940721.920142] [<c010372f>] sysenter_past_esp+0x78/0xb9
Nov 30 06:25:51 bit kernel: [1266940721.920993] =======================

The guest is not really usable anymore, as all disk I/O (mostly writes,
but reads too) tends to hang the system completely.
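Roland's integer-overflow guess can be checked with a little arithmetic. This is a minimal sketch under assumptions the thread does not confirm: (1) the guest booted about 38 s before the `Nov 12 20:56:03` line, (2) the year is 2008 and the syslog times can be treated as UTC, and (3) the guest's wall clock advanced by the same bogus 1266940721 s seen in the kernel time stamps and was then stored in a signed 32-bit `time_t`. The function names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def wrap_int32(value: int) -> int:
    """Reinterpret an integer as a signed 32-bit value (two's complement wrap)."""
    return (value + 2**31) % 2**32 - 2**31


def wrapped_wall_clock(boot: datetime, uptime_s: int) -> datetime:
    """Wall-clock time boot + uptime, as seen through a signed 32-bit time_t."""
    seconds = int((boot - EPOCH).total_seconds()) + uptime_s
    return EPOCH + timedelta(seconds=wrap_int32(seconds))


# Assumed boot time: 38 s before the "Nov 12 20:56:03 ... [ 38.061596]" line.
boot = datetime(2008, 11, 12, 20, 56, 3, tzinfo=timezone.utc) - timedelta(seconds=38)
print(wrapped_wall_clock(boot, 1266940721))
```

Under these assumptions it prints `1912-11-30 06:25:50+00:00`: 30 Nov 1912, within seconds of the `Nov 30 06:25` time stamps in the log. That is consistent with, though not proof of, a 32-bit wraparound of the guest clock.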
I have now manually compiled kvm-79 (including the kernel modules) and am
running three instances on it. None of them has crashed so far, but it
has only been 20 hours.

For me the ping check is actually enough to detect whether the host is
OK, and I'll probably use mon or something similar to simply shut down
and restart the instance.

Cheers
+rl

Roland Lammel
QuikIT - IT solutions - flexible and fast
Web: http://www.quikit.at
Email: info@quikit.at

"Enjoy your job, make lots of money, work within the law. Choose any two."