Date: Thu, 26 Feb 2009 17:50:25 +0000
From: Jamie Lokier
To: qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] Hardware watchdogs (patch for discussion only)
Message-ID: <20090226175025.GA10284@shareable.org>
In-Reply-To: <1235658682.5894.152.camel@ecrins.fosdick.home.net>
References: <20090225233718.GA15750@amd.home.annexia.org> <20090226105106.GD22494@redhat.com> <1235658682.5894.152.camel@ecrins.fosdick.home.net>

Steve Fosdick wrote:
> Perhaps we could have a second timer such that if, on asking the guest
> to shut down via ACPI, the guest does not respond within a certain time
> limit with an ACPI request to turn the power off we go for one of the
> other options below?

Good idea.  ACPI is notoriously flaky, especially on a guest which has
already crashed its kernel...

> 1. Ensure continuity of service.
> When a guest OS gets stuck for some reason make sure it is re-started.
> This is probably the only use case on a real physical machine.

For real continuity of service you'd also want QEMU itself to have a
watchdog: either a software watchdog internally (SIGALRM => kill/exec
self, or a child process expecting regular pings over a pipe), or QEMU
itself becoming a client of the host watchdog.

I say this because I've experienced KVM lock up several times.

> 2. Limit the resource consumption of a crashed guest when the host
> serves other guests.  This is probably only of concern for virtual
> machines, i.e. it is specific to the emulated watchdog and its
> interaction with qemu rather than being part of how a physical watchdog
> works.

Related to this is "omg the database guest has crashed - and frankly we
don't trust the automatic recovery process - stop it for now and we'll
inspect for damage manually before starting it again".

> Do we want to offer the guest the option of a clean shutdown if it can
> still manage that and then reboot, i.e. the shutdown option but for use
> case 1?
>
> If so we need to be able to turn the ACPI power off request into a reset
> instead.  We already have the -no-reboot option to turn a reboot into a
> power off - this is the opposite.

Interesting idea.

> In fact, some people may find that option useful anyway even without the
> watchdog.  In an environment where someone has privileged access to a
> guest but no direct access to the host OS he could shut down a guest
> accidentally when intending to reboot (or logoff).  It may be useful to
> trap that and turn the shutdown into a reboot.

I've done that a few times.  It's only mildly annoying in that you lose
the VNC connection and have to log in and restart the VM.

Side notes:

It would be nice to be able to change the "shutdown-when-asked-to-reboot"
(et al) option from the monitor.
It would also be nice to "pause-when-asked-to-shutdown/reboot", which is
useful during automatic OS installs - the host script changes the media
and/or hardware at each reboot.

-- Jamie
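
Illustration of the "child process expecting regular pings over a pipe"
idea from earlier in this message.  This is only a rough sketch and not
QEMU code: it is written in Python for brevity, it assumes a POSIX host
(select() on a pipe), and the names wait_for_ping, supervise and
on_failure are made up for the example.  The watchdog side reads one
byte per ping; a missed deadline or a closed pipe is treated as "the
monitored process is hung or gone".

```python
import os
import select


def wait_for_ping(read_fd, timeout):
    """Return True if a ping byte arrives on read_fd within `timeout` seconds.

    Returns False on timeout, or if the writing side closed the pipe (EOF).
    """
    ready, _, _ = select.select([read_fd], [], [], timeout)
    if not ready:
        return False  # no ping in time: the pinger looks hung
    # os.read() returning b"" means EOF, i.e. the pinger has exited.
    return os.read(read_fd, 1) != b""


def supervise(read_fd, timeout, on_failure):
    """Consume pings until one is missed, then call on_failure() once.

    on_failure is a hypothetical hook; in a real setup it might kill and
    re-exec the monitored process.
    """
    while wait_for_ping(read_fd, timeout):
        pass
    on_failure()
```

The monitored process would hold the write end of the pipe and write a
single byte at an interval comfortably shorter than the timeout.  One
reason to prefer the pipe over SIGALRM in the same process is that EOF
also catches the case where the process died outright.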