From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from [140.186.70.92] (port=36384 helo=eggs.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1Ou3pA-0002wy-Fr
	for qemu-devel@nongnu.org; Fri, 10 Sep 2010 09:39:29 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.69)
	(envelope-from <anthony@codemonkey.ws>) id 1Ou3p8-0005YW-Vz
	for qemu-devel@nongnu.org; Fri, 10 Sep 2010 09:39:28 -0400
Received: from mail-iw0-f173.google.com ([209.85.214.173]:36844)
	by eggs.gnu.org with esmtp (Exim 4.69)
	(envelope-from <anthony@codemonkey.ws>) id 1Ou3p8-0005YL-Ri
	for qemu-devel@nongnu.org; Fri, 10 Sep 2010 09:39:26 -0400
Received: by iwn38 with SMTP id 38so2291191iwn.4
	for <qemu-devel@nongnu.org>; Fri, 10 Sep 2010 06:39:25 -0700 (PDT)
Message-ID: <4C8A3509.6060101@codemonkey.ws>
Date: Fri, 10 Sep 2010 08:39:21 -0500
From: Anthony Liguori <anthony@codemonkey.ws>
MIME-Version: 1.0
Subject: Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format
References: <1283767478-16740-1-git-send-email-stefanha@linux.vnet.ibm.com>	<4C84E738.3020802@codemonkey.ws>	<4C865187.6090508@redhat.com>	<4C865CFE.7010508@codemonkey.ws>	<4C8663C4.1090508@redhat.com>	<4C866773.2030103@codemonkey.ws>	<4C86BC6B.5010809@codemonkey.ws>	<4C874812.9090807@redhat.com>	<4C87860A.3060904@codemonkey.ws>	<4C888287.8020209@redhat.com>	<4C88D7CC.5000806@codemonkey.ws>	<4C8A1311.8070903@redhat.com>	<AANLkTikMOjZEZ6et1Tg26Qvfa3k0-s_-WdCD=G7CrteG@mail.gmail.com>	<4C8A20B8.1040800@redhat.com>
	<AANLkTimFVBpaGKZAxJbOp+ymnqapKbjcvMrA5DkAtA3g@mail.gmail.com>
	<4C8A28C5.9010704@redhat.com>
In-Reply-To: <4C8A28C5.9010704@redhat.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
List-Id: qemu-devel.nongnu.org
List-Unsubscribe: <http://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <http://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Avi Kivity <avi@redhat.com>
Cc: Kevin Wolf <kwolf@redhat.com>, Stefan Hajnoczi <stefanha@gmail.com>, Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>, qemu-devel@nongnu.org

On 09/10/2010 07:47 AM, Avi Kivity wrote:
>> Then, with a clean base that takes on board the lessons of existing
>> formats it is much easier to innovate.  Look at the image streaming,
>> defragmentation, and trim ideas that are playing out right now.  I
>> think the reason we haven't seen them before is because the effort and
>> the baggage of doing them is too great.  Sure, we maintain existing
>> formats but I don't see active development pushing virtualized storage
>> happening.
>
>
> The same could be said about much of qemu.  It is an old code base 
> that wasn't designed for virtualization.  Yet we maintain it and 
> develop it because compatibility is king.
>
> (as an aside, qcow2 is better positioned for TRIM support than qed is)

You're hand waving to a dangerous degree here :-)

TRIM in qcow2 would require the following sequence:

1) remove cluster from L2 table
2) sync()
3) reduce cluster reference count
4) sync()

TRIM needs to be fast so this is not going to be acceptable.  How do you 
solve it?

For QED, TRIM requires:

1) remove cluster from L2 table
2) sync()
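
To make the difference concrete, here is a rough sketch of the two sequences above as a toy model.  It only records the order of metadata updates and sync() barriers.  The names (OpLog, qcow2_trim, qed_trim) are made up for illustration and are not QEMU block-layer functions.

```python
# Illustrative model only: OpLog, qcow2_trim and qed_trim are hypothetical
# names, not QEMU APIs.  Nothing here touches a real image file.

class OpLog:
    """Records metadata updates and sync() barriers in issue order."""

    def __init__(self):
        self.ops = []

    def update(self, what):
        self.ops.append(what)

    def sync(self):
        self.ops.append("sync")

    def sync_count(self):
        return self.ops.count("sync")


def qcow2_trim(log):
    # Two pieces of metadata (L2 table and refcounts), ordered as in the
    # sequence listed in the mail.
    log.update("remove cluster from L2 table")
    log.sync()
    log.update("reduce cluster reference count")
    log.sync()
    return log


def qed_trim(log):
    # Only the L2 table is involved, so there is a single barrier.
    log.update("remove cluster from L2 table")
    log.sync()
    return log


qcow2_log = qcow2_trim(OpLog())
qed_log = qed_trim(OpLog())
print("qcow2:", qcow2_log.ops)
print("qed:  ", qed_log.ops)
```

The model says nothing about how expensive each sync() is, only that the qcow2 sequence has an ordering point between its two metadata updates that the QED sequence does not have.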

In both cases, I'm assuming we lazily write the free list and have a way 
to detect unclean mounts.  Unclean mounts require an fsck() and both 
qcow2 and qed require it.

You can drop the last sync() in both QED and qcow2 by delaying the sync() 
until you reallocate the cluster.  If you sync() for some other reason 
before then, you can avoid the extra sync() completely.
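
A minimal sketch of that deferral, assuming a toy in-memory model: trimmed clusters sit in an "unsynced" set, and they only become reusable once some sync() has happened.  LazyTrimImage and its fields are hypothetical names, and returning None when nothing is free is a simplification (a real image would grow the file instead).

```python
# Toy model with hypothetical names; it counts sync() calls and nothing more.

class LazyTrimImage:
    def __init__(self):
        self.syncs = 0
        self.unsynced_frees = set()  # trimmed, metadata not yet stable
        self.free = set()            # trimmed and stable, safe to reuse

    def sync(self):
        # Any sync, whatever triggered it, makes pending frees stable.
        self.syncs += 1
        self.free |= self.unsynced_frees
        self.unsynced_frees.clear()

    def trim(self, cluster):
        # Update the metadata in memory but do not sync here.
        self.unsynced_frees.add(cluster)

    def allocate(self):
        # Pay for the deferred sync only if we must reuse a cluster
        # whose trim has not been made stable yet.
        if not self.free and self.unsynced_frees:
            self.sync()
        return self.free.pop() if self.free else None


# Case 1: an unrelated sync happens before reallocation, so the
# allocation itself adds no sync.
a = LazyTrimImage()
a.trim(7)
a.sync()                 # e.g. a flush issued for some other reason
reused_a = a.allocate()

# Case 2: no sync in between, so allocation triggers the deferred one.
b = LazyTrimImage()
b.trim(3)
reused_b = b.allocate()
```

In the first case the trim never pays for a sync of its own.  In the second the cost is moved to allocation time.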

I don't think you can remove (2) from qcow2 TRIM.

This is the key feature of qed.  Because there's only one piece of 
metadata, you never have to worry about metadata ordering.  You can 
amortize the cost of metadata ordering in qcow2 by batching certain 
operations but not all operations are easily batched.

Maybe you could batch trim operations and attempt to do them all at 
once.  But then you need to track future write requests in order to make 
sure you don't trim over a new write.
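
Roughly, the tracking would have to look something like the sketch below.  TrimBatcher is a made-up name and this is only one way to do it, under the assumption that a write arriving after a queued trim simply cancels that trim.

```python
# Hypothetical sketch of batching trims while tracking later writes.

class TrimBatcher:
    def __init__(self):
        self.pending = set()   # clusters queued for trim
        self.trimmed = []      # clusters actually trimmed, in commit order
        self.syncs = 0

    def trim(self, cluster):
        self.pending.add(cluster)

    def write(self, cluster):
        # A write issued after a queued trim must win, otherwise the
        # batch would trim over new data.
        self.pending.discard(cluster)

    def commit(self):
        # Apply every still-pending trim and pay for one sync in total.
        if not self.pending:
            return
        self.trimmed.extend(sorted(self.pending))
        self.pending.clear()
        self.syncs += 1


batch = TrimBatcher()
batch.trim(1)
batch.trim(2)
batch.write(2)   # new data lands on cluster 2 before the batch commits
batch.trim(3)
batch.commit()
print(batch.trimmed, batch.syncs)
```

Even in this small form, every write path has to know about the pending set, which is the kind of extra bookkeeping being argued against here.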

When it comes to data integrity, increased complexity == increased 
chance of screwing up.

Regards,

Anthony Liguori