From: Stefan Hajnoczi
Date: Tue, 14 Sep 2010 17:03:43 +0100
Subject: Re: [Qemu-devel] qcow2 performance plan
To: Kevin Wolf
Cc: Avi Kivity, qemu-devel
In-Reply-To: <4C8F9920.7070908@redhat.com>
References: <4C8F7394.8060802@redhat.com> <4C8F7BE4.5010102@codemonkey.ws> <4C8F9087.2050005@redhat.com> <4C8F92D9.2000908@codemonkey.ws> <4C8F9920.7070908@redhat.com>

On Tue, Sep 14, 2010 at 4:47 PM, Kevin Wolf wrote:
> On 14.09.2010 17:20, Anthony Liguori wrote:
>> On 09/14/2010 10:11 AM, Kevin Wolf wrote:
>>> On 14.09.2010 15:43, Anthony Liguori wrote:
>>>
>>>> Hi Avi,
>>>>
>>>> On 09/14/2010 08:07 AM, Avi Kivity wrote:
>>>>
>>>>>   Here's a draft of a plan that should improve qcow2 performance.  It's
>>>>> written in wiki syntax for eventual upload to wiki.qemu.org; lines
>>>>> starting with # are numbered lists, not comments.
>>>>>
>>>> Thanks for putting this together.  I think it's really useful to think
>>>> through the problem before anyone jumps in and starts coding.
>>>>
>>>>
>>>>> = Basics =
>>>>>
>>>>> At the minimum level, no operation should block the main thread.  This
>>>>> could be done in two ways: extending the state machine so that each
>>>>> blocking operation can be performed asynchronously (bdrv_aio_*)
>>>>> or by threading: each new operation is handed off to a worker thread.
>>>>> Since a full state machine is prohibitively complex, this document
>>>>> will discuss threading.
>>>>>
>>>> There are two distinct requirements that must be satisfied by a fast block
>>>> device.  The device must have fast implementations of aio functions and
>>>> it must support concurrent request processing.
>>>>
>>>> If an aio function blocks in the process of submitting the request, it's
>>>> by definition slow.  But even if you make the aio functions fast, you
>>>> still need to be able to support concurrent request processing in order
>>>> to achieve high throughput.
>>>>
>>>> I'm not going to comment in depth on your threading proposal.  When it
>>>> comes to adding concurrency, I think any approach will require a rewrite
>>>> of the qcow2 code and if the author of that rewrite is more comfortable
>>>> implementing concurrency with threads than with a state machine, I'm
>>>> happy with a threaded implementation.
>>>>
>>>> I'd suggest avoiding hyperbole like "a full state machine is
>>>> prohibitively complex".  QED is a full state machine.  qcow2 adds a
>>>> number of additional states because of the additional metadata and sync
>>>> operations but it's not an exponential increase in complexity.
>>>>
>>> It will be quite some additional states that qcow2 brings in, but I
>>> suspect the really hard thing is getting the dependencies between
>>> requests right.
>>>
>>> I just had a look at how QED is doing this, and it seems to take the
>>> easy solution, namely allowing only one allocation at the same time.
>>
>> One L2 allocation, not cluster allocations.
>> You can allocate multiple clusters concurrently and you can read/write
>> L2s concurrently.
>>
>> Since L2 allocation only happens every 2GB, it's a rare event.
>
> Then your state machine is too complicated for me to understand. :-)
>
> Let me try to chase function pointers for a simple cluster allocation:
>
> bdrv_qed_aio_writev
> qed_aio_setup
> qed_aio_next_io
> qed_find_cluster
> qed_read_l2_table
> ...
> qed_find_cluster_cb
>
> This function contains the code to check if the cluster is already
> allocated, right?
>
>    n = qed_count_contiguous_clusters(s, request->l2_table->table,
>                                      index, n, &offset);
>    ret = offset ? QED_CLUSTER_FOUND : QED_CLUSTER_L2;
>
> The callback called from there is qed_aio_write_data(..., ret =
> QED_CLUSTER_L2, ...) which means
>
>    bool need_alloc = ret != QED_CLUSTER_FOUND;
>    /* Freeze this request if another allocating write is in progress */
>    if (need_alloc) {
>    ...
>
> So where did I start to follow the path of an L2 table allocation instead
> of a simple cluster allocation?

qed_aio_write_main() writes the main body of data into the cluster.
If this is an allocating write, it then decides whether to update or
allocate L2 tables.

qed_aio_write_l2_update() is the function that gets called to touch
the L2 table (it also handles the allocation case).

Stefan
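
For illustration, the ordering described above can be sketched in C. This is
a minimal synchronous sketch, not the QED code: every name prefixed sketch_,
SKETCH_ or STEP_ is hypothetical, the real driver runs these steps as
asynchronous callbacks, and the split between "cluster missing" and "L2 table
missing" is an assumption about how the lookup result is reported.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical lookup results; only the FOUND / not-FOUND split is taken
 * from the code quoted in this thread. */
enum sketch_lookup {
    SKETCH_FOUND,          /* cluster already allocated */
    SKETCH_NEED_CLUSTER,   /* L2 table exists, cluster does not */
    SKETCH_NEED_L2_TABLE   /* assumption: L2 table itself is missing */
};

/* Steps a write request goes through, in the order they are issued. */
enum sketch_step { STEP_WRITE_DATA, STEP_ALLOC_L2_TABLE, STEP_UPDATE_L2 };

#define SKETCH_MAX_STEPS 3

struct sketch_trace {
    enum sketch_step steps[SKETCH_MAX_STEPS];
    size_t count;
};

static void sketch_record(struct sketch_trace *t, enum sketch_step s)
{
    assert(t->count < SKETCH_MAX_STEPS);
    t->steps[t->count++] = s;
}

/* Record the steps for one write: the data goes out first, and metadata
 * is touched only when the write allocates something. */
static void sketch_write(enum sketch_lookup lookup, struct sketch_trace *t)
{
    bool need_alloc = lookup != SKETCH_FOUND;

    t->count = 0;
    sketch_record(t, STEP_WRITE_DATA);

    if (!need_alloc) {
        return; /* plain overwrite: no metadata update */
    }
    if (lookup == SKETCH_NEED_L2_TABLE) {
        /* The rare case: a fresh L2 table is needed before the update. */
        sketch_record(t, STEP_ALLOC_L2_TABLE);
    }
    sketch_record(t, STEP_UPDATE_L2);
}
```

With SKETCH_FOUND only the data write is recorded; with SKETCH_NEED_L2_TABLE
the trace holds the data write, the table allocation and the L2 update, in
that order. The L1 update that would follow a new L2 table is left out.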