From: Kevin Wolf
Date: Thu, 27 Jan 2011 10:27:36 +0100
Subject: Re: [Qemu-devel] [RFC][PATCH 11/12] qcow2: Convert qcow2 to use coroutines for async I/O
Message-ID: <4D413A88.2060204@redhat.com>
In-Reply-To: <4D4055FB.6090104@codemonkey.ws>
To: Anthony Liguori
Cc: qemu-devel@nongnu.org, Avi Kivity, Stefan Hajnoczi
List-Id: qemu-devel.nongnu.org

On 26.01.2011 18:12, Anthony Liguori wrote:
> On 01/26/2011 10:38 AM, Avi Kivity wrote:
>> On 01/26/2011 06:28 PM, Anthony Liguori wrote:
>>> On 01/26/2011 10:13 AM, Avi Kivity wrote:
>>>> Serializing against a global mutex has the advantage that it can be
>>>> treated as a global lock that is decomposed into fine-grained locks.
>>>>
>>>> For example, we can start the code conversion from an explicit async
>>>> model to a threaded sync model by converting the mutex into a
>>>> shared/exclusive lock. Operations like read and write take the lock
>>>> for shared access (and take a fine-grained mutex on the metadata
>>>> cache entry), while operations like creating a snapshot take the
>>>> lock for exclusive access. That doesn't work with freeze/thaw.
>>>
>>> The trouble with this is that you increase the amount of re-entrance
>>> whereas freeze/thaw doesn't.
>>>
>>> The code from the beginning of the request to where the mutex is
>>> acquired will be executed for every single request even while
>>> requests are blocked at the mutex acquisition.
>>
>> It's just a few instructions.
>>
>>> With freeze/thaw, you freeze the queue and prevent any request from
>>> starting until you thaw. You only thaw and return control to allow
>>> another request to execute when you begin executing an asynchronous
>>> I/O callback.
>>
>> What do you actually save? The longjmp() to the coroutine code,
>> linking in to the mutex wait queue, and another switch back to the
>> main coroutine? Given that we don't expect to block often, it seems
>> hardly a cost worth optimizing. And if you cared, you could probably
>> optimize it in the mutex implementation.
>
> It's a matter of correctness, not optimization.
>
> Consider the following example:
>
> coroutine {
>     l2 = find_l2(offset);
>     // allocates but does not increment max cluster offset
>     l2[l2_offset(offset)] = alloc_cluster();
>
>     co_mutex_lock(&lock);
>     write_l2(l2);
>     co_mutex_unlock(&lock);
>
>     l1[l1_offset(offset)] = l2;
>
>     co_mutex_lock(&lock);
>     write_l1(l1);
>     co_mutex_unlock(&lock);
>
>     commit_cluster(l2[l2_offset(offset)]);
> }
>
> This code is incorrect. The code before the first co_mutex_lock() may
> be called a dozen times before anyone finishes this routine. That
> means the same cluster is reused multiple times.

But this is not a new problem. In qcow2 we have this today:

cluster_offset = alloc_cluster();
bdrv_aio_write()
...
update_l2()

There is already code to handle this case, even though it could
probably be simplified for coroutines. I don't think we want to make
the code worse during the conversion by basically serializing
everything.

> This code was correct before we introduced coroutines because we
> effectively had one big lock around the whole thing.

So what's the lesson to learn? That you need to take care when you
convert a big lock into smaller ones?

>>> I think my previous example was wrong, you really want to do:
>>>
>>> qcow2_aio_writev() {
>>>     coroutine {
>>>         freeze();
>>>         sync_io(); // existing qcow2 code
>>>         thaw();
>>>         // existing non I/O code
>>>         bdrv_aio_writev(callback); // no explicit freeze/thaw needed
>>>     }
>>> }
>>>
>>> This is equivalent to our existing code because no new re-entrance
>>> is introduced. The only re-entrancy points are in the
>>> bdrv_aio_{readv,writev} calls.
>>
>> This requires you to know which code is sync, and which code is
>> async. My conversion allows you to wrap the code blindly with a
>> mutex, and have it do the right thing automatically. This is most
>> useful where the code can be sync or async depending on data (which
>> is the case for qcow2).

Well, but in the case of qcow2, you don't want to have a big mutex
around everything. We know perfectly well which parts are asynchronous
and which are synchronous, so we'd want to do it in a finer-grained
way from the beginning.

Kevin
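
To make the two points above a bit more concrete (the dependency
handling that already exists for overlapping cluster allocations, and
taking a lock only around metadata updates instead of serializing whole
requests), here is a minimal sketch in the same pseudocode style as the
examples quoted above. This is not the actual qcow2 code: inflight_alloc,
alloc_list, lookup_l2, co_queue_wait, co_queue_restart_all and
ranges_overlap are illustrative names, not existing QEMU interfaces.

    /* Sketch only, assuming a cooperative coroutine model where yields
     * happen at the co_* calls and at bdrv_co_writev(). */
    struct inflight_alloc {
        uint64_t offset, len;        /* guest range being allocated */
        co_queue dependent_requests; /* requests waiting for the L2 commit */
    };

    coroutine write_request(offset, len) {
        struct inflight_alloc *a, *m = NULL;

        co_mutex_lock(&lock);
    retry:
        for each a in alloc_list {
            if (ranges_overlap(offset, len, a->offset, a->len)) {
                /* Someone else is already allocating this range; wait
                 * for their L2 update instead of allocating a second
                 * cluster for the same guest offset. */
                co_mutex_unlock(&lock);
                co_queue_wait(&a->dependent_requests);
                co_mutex_lock(&lock);
                goto retry;
            }
        }

        cluster_offset = lookup_l2(offset);
        if (!cluster_offset) {
            cluster_offset = alloc_cluster();   /* metadata, under the lock */
            m = add_to_alloc_list(offset, len); /* announce in-flight alloc */
        }
        co_mutex_unlock(&lock);

        bdrv_co_writev(cluster_offset, ...);    /* data I/O, lock dropped */

        if (m) {
            co_mutex_lock(&lock);
            update_l2(offset, cluster_offset);  /* commit the new mapping */
            remove_from_alloc_list(m);
            co_mutex_unlock(&lock);
            co_queue_restart_all(&m->dependent_requests); /* wake waiters */
        }
    }

With this structure, writes to unrelated clusters only contend for the
short metadata critical sections, and only a request touching a cluster
whose allocation is still in flight has to wait -- which is roughly what
the existing qcow2 dependency handling achieves without serializing
everything behind one big lock.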