From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <4C8F7BE4.5010102@codemonkey.ws>
Date: Tue, 14 Sep 2010 08:43:00 -0500
From: Anthony Liguori
MIME-Version: 1.0
Subject: Re: [Qemu-devel] qcow2 performance plan
References: <4C8F7394.8060802@redhat.com>
In-Reply-To: <4C8F7394.8060802@redhat.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
List-Id: qemu-devel.nongnu.org
To: Avi Kivity
Cc: Kevin Wolf, qemu-devel

Hi Avi,

On 09/14/2010 08:07 AM, Avi Kivity wrote:
> Here's a draft of a plan that should improve qcow2 performance. It's
> written in wiki syntax for eventual upload to wiki.qemu.org; lines
> starting with # are numbered lists, not comments.

Thanks for putting this together. I think it's really useful to think
through the problem before anyone jumps in and starts coding.

> = Basics =
>
> At the minimum level, no operation should block the main thread. This
> could be done in two ways: extending the state machine so that each
> blocking operation can be performed asynchronously (bdrv_aio_*) or by
> threading: each new operation is handed off to a worker thread. Since
> a full state machine is prohibitively complex, this document will
> discuss threading.

There are two distinct requirements that must be satisfied by a fast
block device. The device must have fast implementations of the aio
functions and it must support concurrent request processing.

If an aio function blocks in the process of submitting the request,
it's by definition slow. But even if you make the aio functions fast,
you still need to be able to support concurrent request processing in
order to achieve high throughput.

I'm not going to comment in depth on your threading proposal. When it
comes to adding concurrency, I think any approach will require a
rewrite of the qcow2 code, and if the author of that rewrite is more
comfortable implementing concurrency with threads than with a state
machine, I'm happy with a threaded implementation.

I'd suggest avoiding hyperbole like "a full state machine is
prohibitively complex". QED is a full state machine. qcow2 adds a
number of additional states because of the additional metadata and
sync operations, but it's not an exponential increase in complexity.

There are alternative models of concurrency that are worth exploring.
For instance, coroutines are extremely appealing to me for this type
of concurrency. Let me give an example of why.

To handle a read request, QED loosely looks like:

aio_read():
   read_table(l1_table, aio_read_l2_table, opaque)

aio_read_l2_table():
   read_table(l2_table, aio_read_data, opaque)

aio_read_data():
   read_data(offset, aio_read_complete, opaque)

aio_read_complete():
   complete_request()

There's no parallelism so there's no need for locking.
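To make that shape concrete, here's a rough, self-contained C sketch of
the same callback chain. It's purely illustrative: the structures and
values are made up, and the "async" helpers just invoke their completion
callbacks inline rather than going through an event loop.

#include <stdio.h>
#include <stdint.h>

typedef void CompletionFn(void *opaque);

typedef struct ReadRequest {
    uint64_t l1_entry;   /* filled in by the L1 lookup */
    uint64_t l2_entry;   /* filled in by the L2 lookup */
    char data[32];       /* filled in by the data read */
} ReadRequest;

/* Stand-in for an asynchronous table read: a real implementation would
 * submit I/O and invoke 'cb' from the event loop on completion. */
static void read_table(uint64_t *entry, uint64_t fake_value,
                       CompletionFn *cb, void *opaque)
{
    *entry = fake_value;
    cb(opaque);
}

static void read_data(ReadRequest *req, CompletionFn *cb, void *opaque)
{
    snprintf(req->data, sizeof(req->data), "data at %#llx",
             (unsigned long long)req->l2_entry);
    cb(opaque);
}

static void aio_read_complete(void *opaque)
{
    ReadRequest *req = opaque;
    printf("request complete: %s\n", req->data);
}

static void aio_read_data(void *opaque)
{
    ReadRequest *req = opaque;
    read_data(req, aio_read_complete, req);
}

static void aio_read_l2_table(void *opaque)
{
    ReadRequest *req = opaque;
    read_table(&req->l2_entry, 0x20000, aio_read_data, req);
}

static void aio_read(ReadRequest *req)
{
    read_table(&req->l1_entry, 0x10000, aio_read_l2_table, req);
}

int main(void)
{
    ReadRequest req = { 0 };
    aio_read(&req);
    return 0;
}

Every step has to be split at the point where I/O happens, which is
exactly what the coroutine version below avoids.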
With coroutines, there is no added parallelism, but you can encode the
state machine using control flow, yielding to the next state. QED
aio_read() ends up looking like:

aio_read():
   launch_coroutine(aio_read_impl)

aio_read_impl():
   read_table(l1_table)
   read_table(l2_table)
   read_data(offset)
   complete_request()

There's a subtle difference with threads, though. Instead of being
re-entrant by default, you are only re-entrant at well-defined points
in time.

What's nice is that coroutines can be implemented using threads or
something more sophisticated (like ucontext). Code written as a state
machine can be converted to coroutines without really thinking.

I'd like to get QED merged first, but my plan is to attempt to
introduce coroutines to QEMU with something like QED. I think
coroutines are a much better fit for concurrency in QEMU than using
pre-emptive threads.

> = Speeding up allocation =
>
> When a write grows a qcow2 image, the following operations take place:
>
> # clusters are allocated, and the refcount table is updated to reflect
> this
> # sync to ensure the allocation is committed
> # the data is written to the clusters
> # the L2 table is located; if it doesn't exist, it is allocated and
> linked
> # the L2 table is updated
> # sync to ensure the L2->data pointer is committed
>
> We can avoid the first sync by maintaining a volatile list of allocated
> but not yet linked clusters. This requires a tradeoff between the risk
> of losing those clusters on an abort, and the performance gain. To
> minimize the risk, the list is flushed if there is no demand for it.
>
> # we maintain low and high thresholds for the volatile free list
> # if we're under the low threshold, we start a task to allocate
> clusters up to the midpoint
> # if we're above the high threshold, we start a task to return
> clusters down to the midpoint
> # if we ever need a cluster (extent) and find that the volatile list
> is empty, we double the low and high thresholds (up to a limit)
> # once a second, we decrease the thresholds by 25%
>
> This ensures that sustained writes will not block on allocation.
>
> Note that a lost cluster is simply leaked; no data loss is involved.
> The free list can be rebuilt if an unclean shutdown is detected.
> Older implementations can ignore those leaks. To transport an image,
> it is recommended to run qemu-img to reclaim any clusters in case it
> was shut down uncleanly.

I'm not sure that letting older implementations ignore leaks is a
reasonable approach.

> == Alternative implementation ==
>
> We can avoid a volatile list by relying on guest concurrency. We
> replace bdrv_aio_write by bdrv_aio_submit, which issues many
> operations in parallel (but completes each one separately). This
> mimics SCSI and virtio devices, which can trigger multiple ops with a
> single call to the hardware. We make a first pass over all write
> operations, seeing how many clusters need to be allocated, allocate
> that in a single operation, then submit all of the allocating writes.

The difficulty here is that making a first pass means you have to read
the metadata for each operation. Each write() may have many writes to
different clusters, which means that you've got to potentially read a
lot of metadata.

This means that you may have one write that you know needs a new
cluster, but you wait to do the actual write until you finish other
metadata operations. It's not obvious without benchmarking whether
this is the best strategy.
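To make the first-pass idea a bit more concrete, here's a rough sketch
of what a batched submission might look like. Again purely
illustrative: bdrv_aio_submit is the API being proposed, not something
that exists today, and the request structure, cluster math and
allocator below are all made up.

#include <stdio.h>
#include <stdint.h>

#define CLUSTER_SIZE 65536

typedef struct WriteOp {
    uint64_t offset;            /* guest offset of the write     */
    uint64_t len;               /* length in bytes               */
    int      needs_allocation;  /* decided during the first pass */
} WriteOp;

/* Fake metadata lookup: pretend everything below 1 MB is allocated. */
static int cluster_is_allocated(uint64_t offset)
{
    return offset < (1 << 20);
}

/* Fake allocator: returns the first of 'n' newly allocated clusters. */
static uint64_t allocate_clusters(int n)
{
    static uint64_t next = 1 << 20;
    uint64_t first = next;
    next += (uint64_t)n * CLUSTER_SIZE;
    return first;
}

static void bdrv_aio_submit(WriteOp *ops, int nb_ops)
{
    int clusters_needed = 0;

    /* First pass: read (here: fake) metadata for every request and
     * count how many new clusters the whole batch needs. */
    for (int i = 0; i < nb_ops; i++) {
        for (uint64_t off = ops[i].offset;
             off < ops[i].offset + ops[i].len; off += CLUSTER_SIZE) {
            if (!cluster_is_allocated(off)) {
                ops[i].needs_allocation = 1;
                clusters_needed++;
            }
        }
    }

    /* One allocation (one refcount update + sync) for the batch... */
    uint64_t first = clusters_needed ? allocate_clusters(clusters_needed) : 0;
    printf("batch of %d ops needs %d clusters, allocated at %#llx\n",
           nb_ops, clusters_needed, (unsigned long long)first);

    /* ...then all of the allocating writes would be submitted in
     * parallel, each completing separately. */
}

int main(void)
{
    WriteOp ops[] = {
        { .offset = 0,       .len = CLUSTER_SIZE },      /* allocated */
        { .offset = 2 << 20, .len = 2 * CLUSTER_SIZE },  /* needs 2   */
    };
    bdrv_aio_submit(ops, 2);
    return 0;
}

Even in this toy version, the first pass has to touch metadata for
every request before anything is submitted, which is where the extra
latency would come from.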
Regards,

Anthony Liguori

> = Avoiding L2 syncs =
>
> Currently after updating an L2 table with a cluster pointer, we sync
> to avoid loss of a cluster. We can avoid this since the guest is
> required to sync if it wants to ensure the data is on disk. We need
> only to sync if we UNMAP the cluster, before we free it in the
> refcount table.
>
> = Copying L1 tables =
>
> qcow2 requires copying of L1 tables in two cases: taking a snapshot,
> and growing the physical image size beyond a certain boundary. Since
> L1s are relatively small, even for very large images, and growing L1
> is very rare, we can exclude all write operations by having a global
> shared/exclusive lock taken for shared access by write operations, and
> for exclusive access by grow/snapshot operations.
>
> If concurrent growing and writing is desired, it can be achieved by
> having a thread copy L1, and requiring each L1 update to update both
> copies (for the region already copied) or just the source (for the
> region that was not yet copied).
>
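For reference, the shared/exclusive locking discipline for L1 described
above could look roughly like the sketch below. It only illustrates the
idea with a pthread rwlock; the lock name and the two operations are
invented for the example, not actual qcow2 code.

/* build: cc -pthread l1_lock_sketch.c */
#include <pthread.h>
#include <stdio.h>

static pthread_rwlock_t l1_lock = PTHREAD_RWLOCK_INITIALIZER;

/* Ordinary write path: many of these may run concurrently, so each one
 * takes the lock for shared access. */
static void *guest_write(void *arg)
{
    (void)arg;
    pthread_rwlock_rdlock(&l1_lock);
    printf("write: updating L2/data under shared L1 lock\n");
    pthread_rwlock_unlock(&l1_lock);
    return NULL;
}

/* Rare path: growing L1 or taking a snapshot copies the L1 table, so it
 * excludes all writers for the duration of the copy. */
static void grow_or_snapshot(void)
{
    pthread_rwlock_wrlock(&l1_lock);
    printf("grow/snapshot: copying L1 under exclusive lock\n");
    pthread_rwlock_unlock(&l1_lock);
}

int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, guest_write, NULL);
    grow_or_snapshot();
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    return 0;
}

Since grow/snapshot is rare and writers only take the lock shared, the
lock should be essentially uncontended in the common case.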