From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from [140.186.70.92] (port=35585 helo=eggs.gnu.org) by lists.gnu.org with esmtp (Exim 4.43) id 1OvYBa-0001Lq-As for qemu-devel@nongnu.org; Tue, 14 Sep 2010 12:16:49 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.69) (envelope-from ) id 1OvYBO-00010T-85 for qemu-devel@nongnu.org; Tue, 14 Sep 2010 12:16:36 -0400 Received: from mail-gx0-f173.google.com ([209.85.161.173]:38194) by eggs.gnu.org with esmtp (Exim 4.69) (envelope-from ) id 1OvYBO-00010I-0z for qemu-devel@nongnu.org; Tue, 14 Sep 2010 12:16:34 -0400 Received: by gxk22 with SMTP id 22so2332818gxk.4 for ; Tue, 14 Sep 2010 09:16:33 -0700 (PDT) Message-ID: <4C8F9FDE.8050004@codemonkey.ws> Date: Tue, 14 Sep 2010 11:16:30 -0500 From: Anthony Liguori MIME-Version: 1.0 Subject: Re: [Qemu-devel] qcow2 performance plan References: <4C8F7394.8060802@redhat.com> <4C8F7BE4.5010102@codemonkey.ws> <4C8F9087.2050005@redhat.com> <4C8F92D9.2000908@codemonkey.ws> <4C8F9920.7070908@redhat.com> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit List-Id: qemu-devel.nongnu.org List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Stefan Hajnoczi Cc: Kevin Wolf , Avi Kivity , qemu-devel On 09/14/2010 11:03 AM, Stefan Hajnoczi wrote: > On Tue, Sep 14, 2010 at 4:47 PM, Kevin Wolf wrote: > >> Am 14.09.2010 17:20, schrieb Anthony Liguori: >> >>> On 09/14/2010 10:11 AM, Kevin Wolf wrote: >>> >>>> Am 14.09.2010 15:43, schrieb Anthony Liguori: >>>> >>>> >>>>> Hi Avi, >>>>> >>>>> On 09/14/2010 08:07 AM, Avi Kivity wrote: >>>>> >>>>> >>>>>> Here's a draft of a plan that should improve qcow2 performance. It's >>>>>> written in wiki syntax for eventual upload to wiki.qemu.org; lines >>>>>> starting with # are numbered lists, not comments. >>>>>> >>>>>> >>>>> Thanks for putting this together. I think it's really useful to think >>>>> through the problem before anyone jumps in and starts coding. >>>>> >>>>> >>>>> >>>>>> = Basics = >>>>>> >>>>>> At the minimum level, no operation should block the main thread. This >>>>>> could be done in two ways: extending the state machine so that each >>>>>> blocking operation can be performed asynchronously >>>>>> (bdrv_aio_*) >>>>>> or by threading: each new operation is handed off to a worker thread. >>>>>> Since a full state machine is prohibitively complex, this document >>>>>> will discuss threading. >>>>>> >>>>>> >>>>> There's two distinct requirements that must be satisfied by a fast block >>>>> device. The device must have fast implementations of aio functions and >>>>> it must support concurrent request processing. >>>>> >>>>> If an aio function blocks in the process of submitting the request, it's >>>>> by definition slow. But even if you may the aio functions fast, you >>>>> still need to be able to support concurrent request processing in order >>>>> to achieve high throughput. >>>>> >>>>> I'm not going to comment in depth on your threading proposal. When it >>>>> comes to adding concurrency, I think any approach will require a rewrite >>>>> of the qcow2 code and if the author of that rewrite is more comfortable >>>>> implementing concurrency with threads than with a state machine, I'm >>>>> happy with a threaded implementation. >>>>> >>>>> I'd suggest avoiding hyperbole like "a full state machine is >>>>> prohibitively complex". QED is a full state machine. qcow2 adds a >>>>> number of additional states because of the additional metadata and sync >>>>> operations but it's not an exponential increase in complexity. >>>>> >>>>> >>>> It will be quite some additional states that qcow2 brings in, but I >>>> suspect the really hard thing is getting the dependencies between >>>> requests right. >>>> >>>> I just had a look at how QED is doing this, and it seems to take the >>>> easy solution, namely allowing only one allocation at the same time. >>>> >>> One L2 allocation, not cluster allocations. You can allocate multiple >>> clusters concurrently and you can read/write L2s concurrently. >>> >>> Since L2 allocation only happens every 2GB, it's a rare event. >>> >> Then your state machine is too complicated for me to understand. :-) >> >> Let me try to chase function pointers for a simple cluster allocation: >> >> bdrv_qed_aio_writev >> qed_aio_setup >> qed_aio_next_io >> qed_find_cluster >> qed_read_l2_table >> ... >> qed_find_cluster_cb >> >> This function contains the code to check if the cluster is already >> allocated, right? >> >> n = qed_count_contiguous_clusters(s, request->l2_table->table, >> index, n,&offset); >> ret = offset ? QED_CLUSTER_FOUND : QED_CLUSTER_L2; >> >> The callback called from there is qed_aio_write_data(..., ret = >> QED_CLUSTER_L2, ...) which means >> >> bool need_alloc = ret != QED_CLUSTER_FOUND; >> /* Freeze this request if another allocating write is in progress */ >> if (need_alloc) { >> ... >> >> So where did I start to follow the path of a L2 table allocation instead >> of a simple cluster allocation? >> > qed_aio_write_main() writes the main body of data into the cluster. > Then it decides whether to update/allocate L2 tables if this is an > allocating write. > Right, it should only freeze if the L2 table needs to be allocated, not if it only needs to be updated. IOW, diff --git a/block/qed.c b/block/qed.c index 4c4e7a2..0357c03 100644 --- a/block/qed.c +++ b/block/qed.c @@ -948,7 +948,7 @@ static void qed_aio_write_data(void *opaque, int ret, } /* Freeze this request if another allocating write is in progress */ - if (need_alloc) { + if (ret == QED_CLUSTER_L1) { if (acb != QSIMPLEQ_FIRST(&s->allocating_write_reqs)) { QSIMPLEQ_INSERT_TAIL(&s->allocating_write_reqs, acb, next); } It's being a bit more conservative than it needs to be. Regards, Anthony Liguori > qed_aio_write_l2_update() is the function that gets called to touch > the L2 table (it also handles the allocation case). > > Stefan > >