From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from [140.186.70.92] (port=60387 helo=eggs.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1Oumoq-0004QL-9l
	for qemu-devel@nongnu.org; Sun, 12 Sep 2010 09:42:09 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.69)
	(envelope-from <avi@redhat.com>) id 1OumnN-0000hR-5H
	for qemu-devel@nongnu.org; Sun, 12 Sep 2010 09:40:38 -0400
Received: from mx1.redhat.com ([209.132.183.28]:15584)
	by eggs.gnu.org with esmtp (Exim 4.69)
	(envelope-from <avi@redhat.com>) id 1OumnM-0000hI-Tu
	for qemu-devel@nongnu.org; Sun, 12 Sep 2010 09:40:37 -0400
Message-ID: <4C8CD847.8030804@redhat.com>
Date: Sun, 12 Sep 2010 15:40:23 +0200
From: Avi Kivity <avi@redhat.com>
MIME-Version: 1.0
Subject: Re: [Qemu-devel] QEMU interfaces for image streaming and post-copy
	block migration
References: <4C864118.7070206@linux.vnet.ibm.com>	<4C864D65.6090004@redhat.com>	<AANLkTim7UHH3r__3C_Ad3oB1rnXyRsH7bcuZw+rBQP6=@mail.gmail.com>	<4C8652CB.9060801@linux.vnet.ibm.com>
	<4C8CCA91.4060001@redhat.com> <4C8CD4DB.9020905@codemonkey.ws>
In-Reply-To: <4C8CD4DB.9020905@codemonkey.ws>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
List-Id: qemu-devel.nongnu.org
List-Unsubscribe: <http://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <http://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Anthony Liguori <anthony@codemonkey.ws>
Cc: Kevin Wolf <kwolf@redhat.com>, Stefan Hajnoczi <stefanha@gmail.com>, qemu-devel <qemu-devel@nongnu.org>, "libvir-list@redhat.com" <libvir-list@redhat.com>, Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>

  On 09/12/2010 03:25 PM, Anthony Liguori wrote:
> On 09/12/2010 07:41 AM, Avi Kivity wrote:
>>  On 09/07/2010 05:57 PM, Anthony Liguori wrote:
>>>> I agree that streaming should be generic, like block migration.  The
>>>> trivial generic implementation is:
>>>>
>>>> void bdrv_stream(BlockDriverState* bs)
>>>> {
>>>>      for (sector = 0; sector<  bdrv_getlength(bs); sector += n) {
>>>>          if (!bdrv_is_allocated(bs, sector,&n)) {
>>>
>>> Three problems here.  First problem is that bdrv_is_allocated is 
>>> synchronous. 
>>
>> Put the whole thing in a thread.
>
> It doesn't fix anything.  You don't want stream to serialize all I/O 
> operations.

Why would it serialize all I/O operations?  It's just like another vcpu 
issuing reads.

>
>>> The second problem is that streaming makes the most sense when it's 
>>> the smallest useful piece of work whereas bdrv_is_allocated() may 
>>> return a very large range.
>>>
>>> You could cap it here but you then need to make sure that cap is at 
>>> least cluster_size to avoid a lot of unnecessary I/O.
>>
>> That seems like a nice solution.  You probably want a multiple of the 
>> cluster size to retain efficiency.
>
> What you basically do is:
>
> stream_step_three():
>    complete()
>
> stream_step_two(offset, length):
>    bdrv_aio_readv(offset, length, buffer, stream_step_three)
>
> bdrv_aio_stream():
>     bdrv_aio_find_free_cluster(stream_step_two)

Isn't there a write() missing somewhere?

>
> And that's exactly what the current code looks like.  The only change 
> to the patch that this does is make some of qed's internals be block 
> layer interfaces.

Why do you need find_free_cluster()?  That's a physical offset thing.  
Just write to the same logical offset.

IOW:

     bdrv_aio_stream():
         bdrv_aio_read(offset, stream_2)

     stream_2():
         if all zeros:
             increment offset
             if more:
                 bdrv_aio_stream()
        bdrv_aio_write(offset, stream_3)

     stream_3():
         bdrv_aio_write(offset, stream_4)

     stream_4():
         increment offset
         if more:
              bdrv_aio_stream()


Of course, need to serialize wrt guest writes, which adds a bit more 
complexity.  I'll leave it to you to code the state machine for that.

>
> One of the things Stefan has mentioned is that a lot of the QED code 
> could be reused by other formats.  All formats implement things like 
> CoW on their own today but if you exposed interfaces like 
> bdrv_aio_find_free_cluster(), you could actually implement a lot more 
> in the generic block layer.
>
> So, I agree with you in principle that this all should be common 
> code.  I think it's a larger effort though.

Not that large I think; and it will make commit async as a side effect.

>>>
>>> The QED streaming implementation is 140 LOCs too so you quickly end 
>>> up adding more code to the block formats to support these new 
>>> interfaces than it takes to just implement it in the block format.
>>
>> bdrv_is_allocated() already exists (and is needed for commit), what 
>> else is needed?  cluster size?
>
> Synchronous implementations are not reusable to implement asynchronous 
> anything. 

Surely this is easy to fix, at least for qed.

What we need is thread infrastructure that allows us to convert between 
the two methods.

> But you need the code to be cluster aware too.

Yes, another variable in BlockDriverState.

>
>>> Third problem is that  streaming really requires being able to do 
>>> zero write detection in a meaningful way.  You don't want to always 
>>> do zero write detection so you need another interface to mark a 
>>> specific write as a write that should be checked for zeros.
>>
>> You can do that in bdrv_stream(), above, before the actual write, and 
>> call bdrv_unmap() if you detect zeros.
>
> My QED branch now does that FWIW.  At the moment, it only detects zero 
> reads to unallocated clusters and writes a special zero cluster 
> marker.  However, the detection code is in the generic path so once 
> the fsck() logic is working, we can implement a free list in QED.
>
> In QED, the detection code needs to have a lot of knowledge about 
> cluster boundaries and the format of the device.  In principle, this 
> should be common code but it's not for the same reason copy-on-write 
> is not common code today.

Parts of it are: commit.  Of course, that's horribly synchronous.

-- 
error compiling committee.c: too many arguments to function