From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <4D6D72E5.9070106@redhat.com>
Date: Wed, 02 Mar 2011 00:27:49 +0200
From: Dor Laor
MIME-Version: 1.0
Subject: Re: [Qemu-devel] Re: [patch 2/3] Add support for live block copy
References: <20110222170004.808373778@redhat.com>
 <20110222210735.GA9372@amt.cnet> <4D64266A.3060106@codemonkey.ws>
 <20110222230935.GA11082@amt.cnet> <4D644343.4050800@codemonkey.ws>
 <4D65051A.6070707@redhat.com> <4D651B20.70405@codemonkey.ws>
 <4D652852.60505@redhat.com> <4D652F73.3000305@codemonkey.ws>
 <4D65324A.5080408@redhat.com> <4D65359E.3040008@codemonkey.ws>
 <4D65416D.8040803@redhat.com> <4D656B97.5030301@codemonkey.ws>
 <4D661CB8.6010305@redhat.com> <4D667287.9010005@codemonkey.ws>
 <4D6677BE.2030009@redhat.com> <4D669C46.40909@codemonkey.ws>
 <4D6A150B.8030205@redhat.com> <4D6A58E0.9020607@codemonkey.ws>
 <4D6A6E38.4030700@redhat.com> <4D6A8CC9.4090304@codemonkey.ws>
 <4D6B5EFA.8060106@redhat.com> <4D6B98FD.7020103@codemonkey.ws>
 <4D6BA16A.2020204@redhat.com> <4D6BDFA1.3000100@redhat.com>
 <4D6CBECF.8090805@redhat.com> <4D6D160E.4060208@codemonkey.ws>
In-Reply-To: <4D6D160E.4060208@codemonkey.ws>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Reply-To: dlaor@redhat.com
List-Id: qemu-devel.nongnu.org
To: Anthony Liguori
Cc: Jes.Sorensen@redhat.com, Marcelo Tosatti, Avi Kivity,
 qemu-devel@nongnu.org

On 03/01/2011 05:51 PM, Anthony Liguori wrote:
> On 03/01/2011 04:39 AM, Avi Kivity wrote:
>> On 02/28/2011 08:12 PM, Anthony Liguori wrote:
>>>
>>> On Feb 28, 2011 11:47 AM, "Avi Kivity" wrote:
>>> >
>>> > On 02/28/2011 07:33 PM, Anthony Liguori wrote:
>>> >>
>>> >> >
>>> >> > You're just ignoring what I've written.
>>> >>
>>> >> No, you're just impervious to my subtle attempt to refocus the
>>> >> discussion on solving a practical problem.
>>> >>
>>> >> There's a lot of good, reasonably straight forward changes we can
>>> >> make that have a high return on investment.
>>> >>
>>> >
>>> > Is making qemu the authoritative source of configuration
>>> > information a straightforward change? Is the return on it high? Is
>>> > the investment low?
>>>
>>> I think this is where we fundamentally disagree. My position is that
>>> QEMU is already the authoritative source. Having a state file doesn't
>>> change anything.
>>>
>>> Do a hot unplug of a network device with upstream libvirt with
>>> acpiphp unloaded, consult libvirt and then consult the monitor to see
>>> who has the right view of the guests config.
>>>
>>
>> libvirt is right and the monitor is wrong.
>>
>> On real hardware, calling _EJ0 doesn't affect the configuration one
>> little bit (if I understand it correctly). It just turns off power to
>> the slot. If you power-cycle, the card will be there.
>
> It's up to the hardware vendor. Since it's ACPI, it can result in any
> number of operations. Usually, there's some logic to flip on an LED or
> something.
>
> There's nothing that prevents a vendor from ejecting the card.
> My point is that there aren't cleanly separated lines in the real world.
>
>>> To me, that's the definition of authoritative.
>>>
>>> > "No" to all three (ignoring for the moment whether it is good or
>>> > not, which we were debating).
>>> >
>>> >> The only suggestion I'm making beyond Marcelo's original patch is
>>> >> that we use a structured format and that we make it possible to use
>>> >> the same file to solve this problem in multiple places.
>>> >>
>>> >
>>> > No, you're suggesting a lot more than that.
>>>
>>> That's exactly what I'm suggesting from a technical perspective.
>>>
>>
>> Unless I'm hallucinating, you're suggesting quite a bit more. A
>> revolution in how qemu is to be managed.
>
> Let me take another route to see if I can't persuade you.
>
> First, let's clarify your proposal. You want to introduce a new block
> format

No. That was Avi's initial proposal. After we talked, we realized that it
is not needed and that we can use plain files w/o any new configuration.
It is pretty much what you're proposing below, just w/o the
configuration files.

> that references two block devices. It may also store a dirty bitmap to
> keep track of which blocks are out of sync. Hopefully, it goes without
> saying that the dirty bitmap is strictly optional (it's a performance
> optimization) so let's ignore it.
>
> Your format, as a text file, looks like:
>
> [raid1]
> primary=diska.img
> secondary=diskb.img
> active=primary
>
> To use it, here's the sequence:
>
> 0) qemu uses disk A for a block device
>
> 1) create a raid1 block device pointing to disk A and disk B.
>
> 2) management tool asks qemu to use the new raid1 block device.
>
> 3) qemu acks (2)
>
> 4) at some point, the mirror completes, writes are going to both disks
>
> 5) qemu sends out an event indicating that the disks are in sync
>
> 6) management tool then sends a command to fail over to disk B
>
> 7) qemu acks (6)

7) is not a must when there is no raid.
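
The command/ack steps in the quoted sequence can be sketched as a toy model. This is only an illustration under stated assumptions, not anything from the patch series: the function `run_sequence`, the command names `use-raid1` and `failover`, and the parameter `crash_before_ack` are all hypothetical, and the model assumes the management tool updates its record only when it sees an ack. Steps (1), (4) and (5) are left out because they do not change which side believes what.

```python
def run_sequence(crash_before_ack=None):
    """Walk steps (2)-(7) of the quoted sequence.

    Returns (what QEMU is writing to, what the tool would relaunch with).
    All names here are hypothetical; this is a toy model, not QEMU code.
    """
    qemu_target = "diskA"   # step (0): QEMU uses disk A
    tool_record = "diskA"   # the tool's idea of the relaunch command line

    # (command sent by the tool, backend QEMU uses once it applies it)
    commands = [("use-raid1", "raid1"), ("failover", "diskB")]

    for command, new_target in commands:
        qemu_target = new_target           # steps (2)/(6): QEMU applies it
        if crash_before_ack == command:    # QEMU dies before the ack is seen
            break
        tool_record = new_target           # steps (3)/(7): ack received

    return qemu_target, tool_record


print(run_sequence())                             # ('diskB', 'diskB')
print(run_sequence(crash_before_ack="failover"))  # ('diskB', 'raid1')
```

In the second call the two views differ, which is the window between a command and its ack.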
>
> We're making the management tool the "authoritative" source of how to
> launch QEMU. That means that the management tool ultimately determines
> which command line to relaunch QEMU with.

This is what we have today regardless of live copy. How else would you
track many hot plug/unplug operations and live migration afterwards?
For enterprise usage, that's the best case. It's also true for a single
host w/ libvirt and virt-manager.

>
> Here are the races:
>
> A) If QEMU crashes between (2) and (3), it may have issued a write to
> the new raid1 block device before the management tool sees (3). If this
> happens, when the management tool restarts QEMU with disk A, we're left
> with a dangling raid1 block device. Not a critical failure, but not ideal.

Once there is no raid there is no race.

>
> B) If QEMU crashes between (6) and (7), QEMU may have started writing to
> disk B before the management tool sees (7). This means that the
> management tool will create the guest with the raid1 block device which
> no longer is the correct disk. This could fail in subtly bad ways.
> Depending on how read is implemented (if you try to do striping for
> instance), bad data could be returned. You could try to implement a
> policy of always reading from B if the block has been copied but this
> gets hairy really quickly. It's definitely not RAID1 anymore.

Exactly! Drop the raid and always read from B post #6. This is what I
was suggesting before.

>
> You may observe that the problem is not the RAID1 mechanism, but
> changing between using a normal device and the RAID1 mechanism. It would
> then be wise to say, let's always use this image format. Since that
> eliminates the race, we don't really need the copy bitmap anymore.
>
> Now we're left with a simple format that just refers to two filenames.

Ok, looks good. A management app won't need the files below since it
manages everything on its own.

> However,
> block devices are more than just a filename.
> It needs a format, cache settings, etc. So let's put this all in the
> RAID1 block format. We also need a way to indicate which block device
> is selected.
>
> Let's make it a text file for purposes of discussion. It will look
> something like:
>
> [primary]
> filename=diska.img
> cache=none
> format=raw
>
> [secondary]
> filename=diskb.img
> cache=writethrough
> format=qcow2
>
> [global]
> active=primary
>
> Since we might want to mirror multiple drives at once, we should probably
> support having multiple drives configured, which means we need to not
> just have a single active entry, but an entry associated with a
> particular device.
>
> [drive "diskA"]
> filename=diska.img
> cache=none
> format=raw
>
> [drive "diskB"]
> filename=diskb.img
> cache=writethrough
> format=qcow2
>
> [device "vda"]
> drive=diskB
>
> And this is exactly what I'm proposing. It's really the natural
> generalization of what you're proposing.
>
> So basically, the only differences are:
>
> 1) always use the new RAID1 format
> 2) drop the progress bitmap
> 3) support multiple devices per file
> 4) let drive properties be specified beyond filename
>
> All reasonable things to do.
>
> Regards,
>
> Anthony Liguori
>
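
To make the last quoted format concrete, here is a minimal sketch of how a tool might read it and resolve which drive backs a device. The config text is copied from the message; everything else is an assumption for illustration. The thread does not say how QEMU would parse such a file, so Python's `configparser` is used only because the syntax happens to be INI-like, and the helper names `parse_config` and `active_drive` are hypothetical.

```python
import configparser
import shlex

# Example text taken from the quoted message above.
CONFIG_TEXT = """\
[drive "diskA"]
filename=diska.img
cache=none
format=raw

[drive "diskB"]
filename=diskb.img
cache=writethrough
format=qcow2

[device "vda"]
drive=diskB
"""


def parse_config(text):
    """Split the file into drive definitions and device-to-drive bindings.

    Hypothetical helper: the section layout follows the message, the
    parsing approach is an assumption.
    """
    parser = configparser.ConfigParser()
    parser.read_string(text)

    drives, devices = {}, {}
    for section in parser.sections():
        # A section header such as: drive "diskA"  ->  ("drive", "diskA")
        kind, name = shlex.split(section)
        if kind == "drive":
            drives[name] = dict(parser[section])
        elif kind == "device":
            devices[name] = parser[section]["drive"]
    return drives, devices


def active_drive(text, device):
    """Return the properties of the drive currently bound to `device`."""
    drives, devices = parse_config(text)
    return drives[devices[device]]


print(active_drive(CONFIG_TEXT, "vda"))
```

Under this layout, failing over a device amounts to changing the single `drive=` line in its `[device ...]` section, while the per-drive properties (format, cache) stay with each drive.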