From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from [140.186.70.92] (port=35585 helo=eggs.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1OvYBa-0001Lq-As
	for qemu-devel@nongnu.org; Tue, 14 Sep 2010 12:16:49 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.69)
	(envelope-from <anthony@codemonkey.ws>) id 1OvYBO-00010T-85
	for qemu-devel@nongnu.org; Tue, 14 Sep 2010 12:16:36 -0400
Received: from mail-gx0-f173.google.com ([209.85.161.173]:38194)
	by eggs.gnu.org with esmtp (Exim 4.69)
	(envelope-from <anthony@codemonkey.ws>) id 1OvYBO-00010I-0z
	for qemu-devel@nongnu.org; Tue, 14 Sep 2010 12:16:34 -0400
Received: by gxk22 with SMTP id 22so2332818gxk.4
	for <qemu-devel@nongnu.org>; Tue, 14 Sep 2010 09:16:33 -0700 (PDT)
Message-ID: <4C8F9FDE.8050004@codemonkey.ws>
Date: Tue, 14 Sep 2010 11:16:30 -0500
From: Anthony Liguori <anthony@codemonkey.ws>
MIME-Version: 1.0
Subject: Re: [Qemu-devel] qcow2 performance plan
References: <4C8F7394.8060802@redhat.com>
	<4C8F7BE4.5010102@codemonkey.ws>	<4C8F9087.2050005@redhat.com>
	<4C8F92D9.2000908@codemonkey.ws>	<4C8F9920.7070908@redhat.com>
	<AANLkTimtONED4Spt-i1A4bHdAfivxo3MhuECk-NAHtk3@mail.gmail.com>
In-Reply-To: <AANLkTimtONED4Spt-i1A4bHdAfivxo3MhuECk-NAHtk3@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
List-Id: qemu-devel.nongnu.org
List-Unsubscribe: <http://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <http://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Stefan Hajnoczi <stefanha@gmail.com>
Cc: Kevin Wolf <kwolf@redhat.com>, Avi Kivity <avi@redhat.com>, qemu-devel <qemu-devel@nongnu.org>

On 09/14/2010 11:03 AM, Stefan Hajnoczi wrote:
> On Tue, Sep 14, 2010 at 4:47 PM, Kevin Wolf<kwolf@redhat.com>  wrote:
>    
>> Am 14.09.2010 17:20, schrieb Anthony Liguori:
>>      
>>> On 09/14/2010 10:11 AM, Kevin Wolf wrote:
>>>        
>>>> Am 14.09.2010 15:43, schrieb Anthony Liguori:
>>>>
>>>>          
>>>>> Hi Avi,
>>>>>
>>>>> On 09/14/2010 08:07 AM, Avi Kivity wrote:
>>>>>
>>>>>            
>>>>>>    Here's a draft of a plan that should improve qcow2 performance.  It's
>>>>>> written in wiki syntax for eventual upload to wiki.qemu.org; lines
>>>>>> starting with # are numbered lists, not comments.
>>>>>>
>>>>>>              
>>>>> Thanks for putting this together.  I think it's really useful to think
>>>>> through the problem before anyone jumps in and starts coding.
>>>>>
>>>>>
>>>>>            
>>>>>> = Basics =
>>>>>>
>>>>>> At the minimum level, no operation should block the main thread.  This
>>>>>> could be done in two ways: extending the state machine so that each
>>>>>> blocking operation can be performed asynchronously
>>>>>> (<code>bdrv_aio_*</code>)
>>>>>> or by threading: each new operation is handed off to a worker thread.
>>>>>> Since a full state machine is prohibitively complex, this document
>>>>>> will discuss threading.
>>>>>>
>>>>>>              
>>>>> There's two distinct requirements that must be satisfied by a fast block
>>>>> device.  The device must have fast implementations of aio functions and
>>>>> it must support concurrent request processing.
>>>>>
>>>>> If an aio function blocks in the process of submitting the request, it's
>>>>> by definition slow.  But even if you may the aio functions fast, you
>>>>> still need to be able to support concurrent request processing in order
>>>>> to achieve high throughput.
>>>>>
>>>>> I'm not going to comment in depth on your threading proposal.  When it
>>>>> comes to adding concurrency, I think any approach will require a rewrite
>>>>> of the qcow2 code and if the author of that rewrite is more comfortable
>>>>> implementing concurrency with threads than with a state machine, I'm
>>>>> happy with a threaded implementation.
>>>>>
>>>>> I'd suggest avoiding hyperbole like "a full state machine is
>>>>> prohibitively complex".  QED is a full state machine.  qcow2 adds a
>>>>> number of additional states because of the additional metadata and sync
>>>>> operations but it's not an exponential increase in complexity.
>>>>>
>>>>>            
>>>> It will be quite some additional states that qcow2 brings in, but I
>>>> suspect the really hard thing is getting the dependencies between
>>>> requests right.
>>>>
>>>> I just had a look at how QED is doing this, and it seems to take the
>>>> easy solution, namely allowing only one allocation at the same time.
>>>>          
>>> One L2 allocation, not cluster allocations.  You can allocate multiple
>>> clusters concurrently and you can read/write L2s concurrently.
>>>
>>> Since L2 allocation only happens every 2GB, it's a rare event.
>>>        
>> Then your state machine is too complicated for me to understand. :-)
>>
>> Let me try to chase function pointers for a simple cluster allocation:
>>
>> bdrv_qed_aio_writev
>> qed_aio_setup
>> qed_aio_next_io
>> qed_find_cluster
>> qed_read_l2_table
>> ...
>> qed_find_cluster_cb
>>
>> This function contains the code to check if the cluster is already
>> allocated, right?
>>
>>     n = qed_count_contiguous_clusters(s, request->l2_table->table,
>>                                       index, n,&offset);
>>     ret = offset ? QED_CLUSTER_FOUND : QED_CLUSTER_L2;
>>
>> The callback called from there is qed_aio_write_data(..., ret =
>> QED_CLUSTER_L2, ...) which means
>>
>>     bool need_alloc = ret != QED_CLUSTER_FOUND;
>>     /* Freeze this request if another allocating write is in progress */
>>     if (need_alloc) {
>>     ...
>>
>> So where did I start to follow the path of a L2 table allocation instead
>> of a simple cluster allocation?
>>      
> qed_aio_write_main() writes the main body of data into the cluster.
> Then it decides whether to update/allocate L2 tables if this is an
> allocating write.
>    

Right, it should only freeze if the L2 table needs to be allocated, not 
if it only needs to be updated.  IOW,
diff --git a/block/qed.c b/block/qed.c
index 4c4e7a2..0357c03 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -948,7 +948,7 @@ static void qed_aio_write_data(void *opaque, int ret,
      }

      /* Freeze this request if another allocating write is in progress */
-    if (need_alloc) {
+    if (ret == QED_CLUSTER_L1) {
          if (acb != QSIMPLEQ_FIRST(&s->allocating_write_reqs)) {
              QSIMPLEQ_INSERT_TAIL(&s->allocating_write_reqs, acb, next);
          }

It's being a bit more conservative than it needs to be.

Regards,

Anthony Liguori

> qed_aio_write_l2_update() is the function that gets called to touch
> the L2 table (it also handles the allocation case).
>
> Stefan
>
>