From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from [140.186.70.92] (port=50454 helo=eggs.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43)
	id 1Pi7uu-0004ro-Py for qemu-devel@nongnu.org;
	Wed, 26 Jan 2011 11:08:21 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from ) id 1Pi7ut-00053V-K3 for qemu-devel@nongnu.org;
	Wed, 26 Jan 2011 11:08:20 -0500
Received: from mail-qy0-f180.google.com ([209.85.216.180]:57442)
	by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from )
	id 1Pi7ut-00053O-GR for qemu-devel@nongnu.org;
	Wed, 26 Jan 2011 11:08:19 -0500
Received: by qyk29 with SMTP id 29so1183689qyk.4 for ;
	Wed, 26 Jan 2011 08:08:18 -0800 (PST)
Message-ID: <4D4046EF.3050108@codemonkey.ws>
Date: Wed, 26 Jan 2011 10:08:15 -0600
From: Anthony Liguori
MIME-Version: 1.0
Subject: Re: [Qemu-devel] [RFC][PATCH 11/12] qcow2: Convert qcow2 to use
 coroutines for async I/O
References: <1295688567-25496-1-git-send-email-stefanha@linux.vnet.ibm.com>
 <1295688567-25496-12-git-send-email-stefanha@linux.vnet.ibm.com>
 <4D40406B.2070302@redhat.com> <4D4042A8.2040903@redhat.com>
In-Reply-To: <4D4042A8.2040903@redhat.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
List-Id: qemu-devel.nongnu.org
To: Kevin Wolf
Cc: qemu-devel@nongnu.org, Anthony Liguori, Avi Kivity, Stefan Hajnoczi

On 01/26/2011 09:50 AM, Kevin Wolf wrote:
> On 26.01.2011 16:40, Avi Kivity wrote:
>>
>> On 01/22/2011 11:29 AM, Stefan Hajnoczi wrote:
>>
>>> Converting qcow2 to use coroutines is fairly simple since most of qcow2
>>> is synchronous. The synchronous I/O functions like bdrv_pread() now
>>> transparently work when called from a coroutine, so all the synchronous
>>> code just works.
>>>
>>> The explicitly asynchronous code is adjusted to repeatedly call
>>> qcow2_aio_read_cb() or qcow2_aio_write_cb() until the request completes.
>>> At that point the coroutine will return from its entry function and its
>>> resources are freed.
>>>
>>> The bdrv_aio_readv() and bdrv_aio_writev() user callback is now invoked
>>> from a BH. This is necessary since the user callback code does not
>>> expect to be executed from a coroutine.
>>>
>>> This conversion is not completely correct because the safety of the
>>> synchronous code does not carry over to the coroutine version.
>>> Previously, a synchronous code path could assume that it would never be
>>> interleaved with another executing request. This is no longer true
>>> because bdrv_pread() and bdrv_pwrite() cause the coroutine to yield and
>>> other requests can be processed during that time.
>>>
>>> The solution is to carefully introduce checks so that pending requests
>>> do not step on each other's toes. That is left for a future patch...
>>>
>> The way I thought of doing this is:
>>
>> qcow_aio_write(...)
>> {
>>     execute_in_coroutine {
>>         co_mutex_lock(&bs->mutex);
>>         do_qcow_aio_write(...); // original qcow code
>>         co_mutex_release(&bs->mutex);

The release has to be executed in the callback.

I think it's a bit nicer not to use a mutex, but rather to have a notion
of freezing/unfreezing the block queue, and instead do:

completion()
{
    bdrv_unfreeze(bs);
}

coroutine {
    bdrv_freeze(bs);
    do_qcow_aio_write(completion);
}

Freeze/unfreeze is useful in a number of other places too (like
snapshotting).

Regards,

Anthony Liguori

>>     }
>> }
>>
>> (similar changes for the other callbacks)
>>
>> If the code happens to be asynchronous (no metadata changes), we'll take
>> the mutex and release it immediately after submitting the I/O, so no
>> extra serialization happens. If the code does issue a synchronous
>> metadata write, we'll lock out all other operations on the same block
>> device, but still allow the vcpu to execute, since all the locking
>> happens in a coroutine.
>>
>> Essentially, a mutex becomes the dependency tracking mechanism.
>> A global mutex means all synchronous operations are dependent. Later,
>> we can convert the metadata cache entry dependency lists to local
>> mutexes inside the cache entry structures.
>>
> I thought a bit about it since you mentioned it in the call yesterday,
> and I think this approach makes sense. Even immediately after the
> conversion we should be in a better state than with Stefan's approach,
> because I/O without metadata disk access won't be serialized.
>
> In the other thread you mentioned that you have written some code
> independently. Do you have it in some public git repository?
>
> Kevin