From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from [140.186.70.92] (port=60387 helo=eggs.gnu.org) by lists.gnu.org with esmtp (Exim 4.43) id 1Oumoq-0004QL-9l for qemu-devel@nongnu.org; Sun, 12 Sep 2010 09:42:09 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.69) (envelope-from ) id 1OumnN-0000hR-5H for qemu-devel@nongnu.org; Sun, 12 Sep 2010 09:40:38 -0400 Received: from mx1.redhat.com ([209.132.183.28]:15584) by eggs.gnu.org with esmtp (Exim 4.69) (envelope-from ) id 1OumnM-0000hI-Tu for qemu-devel@nongnu.org; Sun, 12 Sep 2010 09:40:37 -0400 Message-ID: <4C8CD847.8030804@redhat.com> Date: Sun, 12 Sep 2010 15:40:23 +0200 From: Avi Kivity MIME-Version: 1.0 Subject: Re: [Qemu-devel] QEMU interfaces for image streaming and post-copy block migration References: <4C864118.7070206@linux.vnet.ibm.com> <4C864D65.6090004@redhat.com> <4C8652CB.9060801@linux.vnet.ibm.com> <4C8CCA91.4060001@redhat.com> <4C8CD4DB.9020905@codemonkey.ws> In-Reply-To: <4C8CD4DB.9020905@codemonkey.ws> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit List-Id: qemu-devel.nongnu.org List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Anthony Liguori Cc: Kevin Wolf , Stefan Hajnoczi , qemu-devel , "libvir-list@redhat.com" , Stefan Hajnoczi On 09/12/2010 03:25 PM, Anthony Liguori wrote: > On 09/12/2010 07:41 AM, Avi Kivity wrote: >> On 09/07/2010 05:57 PM, Anthony Liguori wrote: >>>> I agree that streaming should be generic, like block migration. The >>>> trivial generic implementation is: >>>> >>>> void bdrv_stream(BlockDriverState* bs) >>>> { >>>> for (sector = 0; sector< bdrv_getlength(bs); sector += n) { >>>> if (!bdrv_is_allocated(bs, sector,&n)) { >>> >>> Three problems here. First problem is that bdrv_is_allocated is >>> synchronous. >> >> Put the whole thing in a thread. > > It doesn't fix anything. You don't want stream to serialize all I/O > operations. Why would it serialize all I/O operations? It's just like another vcpu issuing reads. > >>> The second problem is that streaming makes the most sense when it's >>> the smallest useful piece of work whereas bdrv_is_allocated() may >>> return a very large range. >>> >>> You could cap it here but you then need to make sure that cap is at >>> least cluster_size to avoid a lot of unnecessary I/O. >> >> That seems like a nice solution. You probably want a multiple of the >> cluster size to retain efficiency. > > What you basically do is: > > stream_step_three(): > complete() > > stream_step_two(offset, length): > bdrv_aio_readv(offset, length, buffer, stream_step_three) > > bdrv_aio_stream(): > bdrv_aio_find_free_cluster(stream_step_two) Isn't there a write() missing somewhere? > > And that's exactly what the current code looks like. The only change > to the patch that this does is make some of qed's internals be block > layer interfaces. Why do you need find_free_cluster()? That's a physical offset thing. Just write to the same logical offset. IOW: bdrv_aio_stream(): bdrv_aio_read(offset, stream_2) stream_2(): if all zeros: increment offset if more: bdrv_aio_stream() bdrv_aio_write(offset, stream_3) stream_3(): bdrv_aio_write(offset, stream_4) stream_4(): increment offset if more: bdrv_aio_stream() Of course, need to serialize wrt guest writes, which adds a bit more complexity. I'll leave it to you to code the state machine for that. > > One of the things Stefan has mentioned is that a lot of the QED code > could be reused by other formats. All formats implement things like > CoW on their own today but if you exposed interfaces like > bdrv_aio_find_free_cluster(), you could actually implement a lot more > in the generic block layer. > > So, I agree with you in principle that this all should be common > code. I think it's a larger effort though. Not that large I think; and it will make commit async as a side effect. >>> >>> The QED streaming implementation is 140 LOCs too so you quickly end >>> up adding more code to the block formats to support these new >>> interfaces than it takes to just implement it in the block format. >> >> bdrv_is_allocated() already exists (and is needed for commit), what >> else is needed? cluster size? > > Synchronous implementations are not reusable to implement asynchronous > anything. Surely this is easy to fix, at least for qed. What we need is thread infrastructure that allows us to convert between the two methods. > But you need the code to be cluster aware too. Yes, another variable in BlockDriverState. > >>> Third problem is that streaming really requires being able to do >>> zero write detection in a meaningful way. You don't want to always >>> do zero write detection so you need another interface to mark a >>> specific write as a write that should be checked for zeros. >> >> You can do that in bdrv_stream(), above, before the actual write, and >> call bdrv_unmap() if you detect zeros. > > My QED branch now does that FWIW. At the moment, it only detects zero > reads to unallocated clusters and writes a special zero cluster > marker. However, the detection code is in the generic path so once > the fsck() logic is working, we can implement a free list in QED. > > In QED, the detection code needs to have a lot of knowledge about > cluster boundaries and the format of the device. In principle, this > should be common code but it's not for the same reason copy-on-write > is not common code today. Parts of it are: commit. Of course, that's horribly synchronous. -- error compiling committee.c: too many arguments to function