From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from [140.186.70.92] (port=50454 helo=eggs.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43)
	id 1Pi7uu-0004ro-Py for qemu-devel@nongnu.org;
	Wed, 26 Jan 2011 11:08:21 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from ) id 1Pi7ut-00053V-K3 for qemu-devel@nongnu.org;
	Wed, 26 Jan 2011 11:08:20 -0500
Received: from mail-qy0-f180.google.com ([209.85.216.180]:57442)
	by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from )
	id 1Pi7ut-00053O-GR for qemu-devel@nongnu.org;
	Wed, 26 Jan 2011 11:08:19 -0500
Received: by qyk29 with SMTP id 29so1183689qyk.4 for ;
	Wed, 26 Jan 2011 08:08:18 -0800 (PST)
Message-ID: <4D4046EF.3050108@codemonkey.ws>
Date: Wed, 26 Jan 2011 10:08:15 -0600
From: Anthony Liguori
MIME-Version: 1.0
Subject: Re: [Qemu-devel] [RFC][PATCH 11/12] qcow2: Convert qcow2 to use
 coroutines for async I/O
References: <1295688567-25496-1-git-send-email-stefanha@linux.vnet.ibm.com>
 <1295688567-25496-12-git-send-email-stefanha@linux.vnet.ibm.com>
 <4D40406B.2070302@redhat.com> <4D4042A8.2040903@redhat.com>
In-Reply-To: <4D4042A8.2040903@redhat.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
List-Id: qemu-devel.nongnu.org
To: Kevin Wolf
Cc: qemu-devel@nongnu.org, Anthony Liguori, Avi Kivity, Stefan Hajnoczi

On 01/26/2011 09:50 AM, Kevin Wolf wrote:
> On 26.01.2011 16:40, Avi Kivity wrote:
>>
>> On 01/22/2011 11:29 AM, Stefan Hajnoczi wrote:
>>
>>> Converting qcow2 to use coroutines is fairly simple since most of qcow2
>>> is synchronous. The synchronous I/O functions like bdrv_pread() now
>>> transparently work when called from a coroutine, so all the synchronous
>>> code just works.
>>>
>>> The explicitly asynchronous code is adjusted to repeatedly call
>>> qcow2_aio_read_cb() or qcow2_aio_write_cb() until the request completes.
>>> At that point the coroutine will return from its entry function and its
>>> resources are freed.
>>>
>>> The bdrv_aio_readv() and bdrv_aio_writev() user callback is now invoked
>>> from a BH. This is necessary since the user callback code does not
>>> expect to be executed from a coroutine.
>>>
>>> This conversion is not completely correct because the safety of the
>>> synchronous code does not carry over to the coroutine version.
>>> Previously, a synchronous code path could assume that it would never be
>>> interleaved with another executing request. This is no longer true
>>> because bdrv_pread() and bdrv_pwrite() cause the coroutine to yield and
>>> other requests can be processed during that time.
>>>
>>> The solution is to carefully introduce checks so that pending requests
>>> do not step on each other's toes. That is left for a future patch...
>>>
>> The way I thought of doing this is:
>>
>> qcow_aio_write(...)
>> {
>>     execute_in_coroutine {
>>         co_mutex_lock(&bs->mutex);
>>         do_qcow_aio_write(...); // original qcow code
>>         co_mutex_release(&bs->mutex);

The release has to be executed in the callback.

I think it's a bit nicer not to use a mutex, but rather to have a notion
of freezing/unfreezing the block queue, and instead do:

completion()
{
    bdrv_unfreeze(bs);
}

coroutine {
    bdrv_freeze(bs);
    do_qcow_aio_write(completion);
}

Freeze/unfreeze is useful in a number of other places too (like
snapshotting).

Regards,

Anthony Liguori

>>     }
>> }
>>
>> (similar changes for the other callbacks)
>>
>> If the code happens to be asynchronous (no metadata changes), we'll take
>> the mutex and release it immediately after submitting the I/O, so no
>> extra serialization happens. If the code does issue a synchronous
>> metadata write, we'll lock out all other operations on the same block
>> device, but still allow the vcpu to execute, since all the locking
>> happens in a coroutine.
>>
>> Essentially, a mutex becomes the dependency tracking mechanism.
>> A global mutex means all synchronous operations are dependent. Later,
>> we can convert the metadata cache entry dependency lists to local
>> mutexes inside the cache entry structures.
>>
> I thought a bit about it since you mentioned it in the call yesterday,
> and I think this approach makes sense. Even immediately after the
> conversion we should be in a better state than with Stefan's approach,
> because I/O without metadata disk access won't be serialized.
>
> In the other thread you mentioned that you have written some code
> independently. Do you have it in some public git repository?
>
> Kevin