From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from [140.186.70.92] (port=40208 helo=eggs.gnu.org)
    by lists.gnu.org with esmtp (Exim 4.43) id 1PuM3M-0004Fb-Fo
    for qemu-devel@nongnu.org; Tue, 01 Mar 2011 04:39:38 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
    (envelope-from ) id 1PuM3L-0000pD-4X
    for qemu-devel@nongnu.org; Tue, 01 Mar 2011 04:39:36 -0500
Received: from mx1.redhat.com ([209.132.183.28]:50275)
    by eggs.gnu.org with esmtp (Exim 4.71)
    (envelope-from ) id 1PuM3K-0000p4-Oo
    for qemu-devel@nongnu.org; Tue, 01 Mar 2011 04:39:35 -0500
Message-ID: <4D6CBECF.8090805@redhat.com>
Date: Tue, 01 Mar 2011 11:39:27 +0200
From: Avi Kivity
MIME-Version: 1.0
Subject: Re: [Qemu-devel] Re: [patch 2/3] Add support for live block copy
References: <20110222170004.808373778@redhat.com>
    <4D642181.4080509@codemonkey.ws> <20110222210735.GA9372@amt.cnet>
    <4D64266A.3060106@codemonkey.ws> <20110222230935.GA11082@amt.cnet>
    <4D644343.4050800@codemonkey.ws> <4D65051A.6070707@redhat.com>
    <4D651B20.70405@codemonkey.ws> <4D652852.60505@redhat.com>
    <4D652F73.3000305@codemonkey.ws> <4D65324A.5080408@redhat.com>
    <4D65359E.3040008@codemonkey.ws> <4D65416D.8040803@redhat.com>
    <4D656B97.5030301@codemonkey.ws> <4D661CB8.6010305@redhat.com>
    <4D667287.9010005@codemonkey.ws> <4D6677BE.2030009@redhat.com>
    <4D669C46.40909@codemonkey.ws> <4D6A150B.8030205@redhat.com>
    <4D6A58E0.9020607@codemonkey.ws> <4D6A6E38.4030700@redhat.com>
    <4D6A8CC9.4090304@codemonkey.ws> <4D6B5EFA.8060106@redhat.com>
    <4D6B98FD.7020103@codemonkey.ws> <4D6BA16A.2020204@redhat.com>
    <4D6BDFA1.3000100@redhat.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
List-Id: qemu-devel.nongnu.org
To: Anthony Liguori
Cc: Jes.Sorensen@redhat.com, Marcelo Tosatti, qemu-devel@nongnu.org

On 02/28/2011 08:12 PM, Anthony Liguori wrote:
>
>
> On Feb 28, 2011 11:47 AM, "Avi Kivity"
> wrote:
> >
> > On 02/28/2011 07:33 PM, Anthony Liguori wrote:
> >>
> >>
> >> >
> >> > You're just ignoring what I've written.
> >>
> >> No, you're just impervious to my subtle attempt to refocus the
> >> discussion on solving a practical problem.
> >>
> >> There's a lot of good, reasonably straight forward changes we can
> >> make that have a high return on investment.
> >>
> >
> > Is making qemu the authoritative source of configuration information
> > a straightforward change? Is the return on it high? Is the
> > investment low?
>
> I think this is where we fundamentally disagree. My position is that
> QEMU is already the authoritative source. Having a state file doesn't
> change anything.
>
> Do a hot unplug of a network device with upstream libvirt with acpiphp
> unloaded, consult libvirt and then consult the monitor to see who has
> the right view of the guests config.
>

libvirt is right and the monitor is wrong.

On real hardware, calling _EJ0 doesn't affect the configuration one
little bit (if I understand it correctly). It just turns off power to
the slot. If you power-cycle, the card will be there.

In the real world, the authoritative source of configuration is a human
with a screwdriver. The virtualized equivalent is the management tool.

> To me, that's the definition of authoritative.
>
> > "No" to all three (ignoring for the moment whether it is good or
> > not, which we were debating).
> >
> >
> >> The only suggestion I'm making beyond Marcelo's original patch is
> >> that we use a structured format and that we make it possible to use
> >> the same file to solve this problem in multiple places.
> >>
> >
> > No, you're suggesting a lot more than that.
>
> That's exactly what I'm suggesting from a technical perspective.
>

Unless I'm hallucinating, you're suggesting quite a bit more. A
revolution in how qemu is to be managed.

> >> I don't think this creates a fundamental break in how management
> >> tools interact with QEMU.
> >> I don't think introducing RAID support in the block layer is a
> >> reasonable alternative.
> >>
> >>
> >
> > Why not?
>
> Because its a lot of complexity and code that can go wrong while only
> solving the race for one specific case. Not to mention that we double
> the iop rate.
>

IMO it's of similar complexity. The number of I/Os doesn't change (reads
stay the same, and any write that has already been mirrored needs to be
re-mirrored in both cases).

We do gain lower latency switchover and we package the code as a block
format driver instead of core block code. We decouple the dependencies
from live migration.

> > Something that avoids the whole state thing altogether:
> >
> > - instead of atomically switching when live copy is done, keep on
> >   issuing writes to both the origin and the live copy
> > - issue a notification to management
> > - management receives the notification, and issues an atomic
> >   blockdev switch command
> >
> > this is really the RAID-1 solution but without the state file
> > (credit Dor). An advantage is that there is no additional latency
> > when trying to catch up to the dirty bitmap.
>
> It still suffers from the two generals problem. You cannot solve this
> without making one node reliable and that takes us back to it being
> either QEMU (posted event and state file) or the management tool (sync
> event).
>

It works without either. If qemu fails, you simply re-mirror
everything. If the management tool fails, it re-subscribes to the
mirror-complete event, queries whether it already happened in its
absence, and if it did, requests the switchover.

-- 
error compiling committee.c: too many arguments to function
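
For illustration only, the recovery rules described above (keep writing
to both sides once the mirror is complete, notify management, and let a
restarted management tool re-subscribe, query, and then request the
switch) can be modelled as a toy state machine. This is a sketch under
assumptions, not code from this thread or from QEMU: MirrorJob, Manager,
finish_bulk_copy, switch and demo are all hypothetical names, and the
bulk copy is treated as a single atomic step.

```python
class MirrorJob:
    """Toy stand-in for the qemu side of a live block copy (hypothetical)."""

    def __init__(self, origin):
        self.origin = origin          # block number -> data
        self.copy = {}                # a (re)started job begins with an empty copy
        self.mirror_complete = False  # state that management can query
        self.active = "origin"
        self.listeners = []

    def subscribe(self, callback):
        self.listeners.append(callback)

    def finish_bulk_copy(self):
        # Assumption: the bulk copy is modelled as one atomic step.
        self.copy = dict(self.origin)
        self.mirror_complete = True
        for notify in list(self.listeners):
            notify()

    def write(self, block, data):
        if self.active == "copy":
            self.copy[block] = data
            return
        self.origin[block] = data
        if self.mirror_complete:
            # No automatic switch: keep mirroring until management asks.
            self.copy[block] = data

    def switch(self):
        # Safe to call more than once.
        if not self.mirror_complete:
            raise RuntimeError("mirror not complete")
        self.active = "copy"


class Manager:
    """Toy management tool: subscribes, then checks for a missed event."""

    def __init__(self, job):
        self.job = job
        job.subscribe(self.on_mirror_complete)
        if job.mirror_complete:       # event fired while we were away
            self.on_mirror_complete()

    def on_mirror_complete(self):
        self.job.switch()


def demo():
    job = MirrorJob({0: "a", 1: "b"})
    Manager(job)
    job.listeners.clear()    # management crashes; its subscription is lost
    job.finish_bulk_copy()   # event fires with nobody listening
    job.write(1, "c")        # still written to both origin and copy
    Manager(job)             # restarted management queries and switches
    return job
```

In this model a qemu failure corresponds to constructing a fresh
MirrorJob over the same origin: the copy starts empty and everything is
mirrored again, so neither side needs a state file.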