Date: Tue, 21 Feb 2012 09:11:28 -0500
From: Jeff Cody <jcody@redhat.com>
To: Eric Blake
Cc: Kevin Wolf, Luiz Capitulino, qemu-devel@nongnu.org, Markus Armbruster
Subject: Re: [Qemu-devel] [PATCH 3/3] qapi: Introduce blockdev-query-group-snapshot-failure
Message-ID: <4F43A610.5070203@redhat.com>
References: <58b44a8409ab790a76212958928fa6f3ccbf9096.1329758006.git.jcody@redhat.com>
 <4F428778.8040409@redhat.com>
In-Reply-To: <4F428778.8040409@redhat.com>

On 02/20/2012 12:48 PM, Eric Blake wrote:
> On 02/20/2012 10:31 AM, Jeff Cody wrote:
>> In the case of a failure in a group snapshot, it is possible for
>> multiple file image failures to occur - for instance, failure of
>> an original snapshot, and then failure of one or more of the
>> attempted reopens of the original.
>>
>> Knowing all of the file images which failed could be useful or
>> critical information, so this command returns a list of strings
>> containing the filenames of all failures from the last
>> invocation of blockdev-group-snapshot-sync.
>
> Meta-question:
>
> Suppose that the guest is running when we issue
> blockdev-group-snapshot-sync - in that case, qemu is responsible for
> pausing and then resuming the guest. On success, this makes sense.
> But what happens on failure?

The guest is not paused by blockdev-group-snapshot-sync; I don't think
qemu should enforce pause/resume in the live snapshot commands.

> If we only fail at creating one snapshot, but successfully roll back
> the rest of the set, should the guest be resumed (as if the command
> had never been attempted), or should the guest be left paused?
>
> On the other hand, if we fail at creating one snapshot, as well as
> fail at rolling back, then that argues that we _cannot_ resume the
> guest, because we no longer have a block device open.

Is that really true, though? Depending on which drive failed, the
guest may still be runnable. To the guest, it would look roughly like
a drive failure: a bad event, but not always a fatal one.

That said, v2 of the patch may make this moot. I was talking with
Kevin, and he had some good ideas on how to do this without requiring
a close & reopen when a snapshot fails, which means we should not have
to worry about the second scenario. I am going to incorporate those
changes into v2.
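To make the failure flow concrete, here is a rough sketch of the QMP
exchange a management client would see (the argument layout, device
names, and error details below are illustrative only, and may not
match the final schema):

  -> { "execute": "blockdev-group-snapshot-sync",
       "arguments": { "dev": [
         { "device": "virtio0", "snapshot-file": "/images/snap0.qcow2" },
         { "device": "virtio1", "snapshot-file": "/images/snap1.qcow2" }
       ] } }
  <- { "error": { "class": "GenericError",
                  "desc": "group snapshot failed" } }

  (the client then asks which image files were involved in the failure)

  -> { "execute": "blockdev-query-group-snapshot-failure" }
  <- { "return": [ "/images/snap1.qcow2" ] }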
> This policy needs to be documented in one (or both) of the two new
> monitor commands, and we probably ought to make sure that if the
> guest is left paused where it had originally started as running,
> then an appropriate event is also emitted.

I agree, the documentation should make clear what is going on - I will
add that to v2.

> For blockdev-snapshot-sync, libvirt was always pausing qemu before
> issuing the snapshot, then resuming afterwards; but now that we have
> the ability to make the set atomic, I'm debating whether libvirt
> still needs to pause qemu, or whether it can now rely on qemu doing
> the right thing about pausing and resuming as part of the snapshot
> command.

Again, qemu does not pause automatically, so that is up to libvirt.
The guest agent is also available to freeze the filesystems, if
libvirt wants to trust it (and it is running); if not, libvirt can
still issue a pause/resume around the snapshot command (and libvirt
may be in a better position to decide what to do in case of failure,
if it has some knowledge of the drives that failed and how they are
used).
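For example, the sequence on the libvirt side could look roughly like
this (a sketch only - it assumes qemu-ga is running and trusted for
the fsfreeze path, which goes over the guest agent channel rather
than QMP, and the snapshot arguments are elided):

  -> { "execute": "guest-fsfreeze-freeze" }     # via qemu-ga, or
  <- { "return": 2 }                            # "stop" via QMP

  -> { "execute": "blockdev-group-snapshot-sync",
       "arguments": { ... } }
  <- { "return": {} }

  -> { "execute": "guest-fsfreeze-thaw" }       # or "cont" via QMP
  <- { "return": 2 }

(guest-fsfreeze-freeze/thaw return the number of filesystems frozen
or thawed.)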