From: Avi Kivity
Date: Sun, 12 Sep 2010 15:24:14 +0200
Subject: Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format
To: Anthony Liguori
Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel@nongnu.org

On 09/10/2010 08:07 PM, Anthony Liguori wrote:
> On 09/10/2010 10:49 AM, Avi Kivity wrote:
>>> If I do a qemu-img create -f qcow2 foo.img 10GB, and then do a
>>> naive copy of the image file and end up with a 2GB image when
>>> there's nothing in it, that's badness.
>>
>> Only if you crash in the middle. If not, you free the preallocation
>> during shutdown (or when running a guest, when it isn't actively
>> writing at 100 MB/s).
>
> Which is potentially guest exploitable.

If this worries you, run a scrubber in the background after an
uncontrolled crash. Like qed's fsck, it recovers the free list from the
L2 tables. Unlike qed's fsck, it does not delay starting large guests.

>>> And what do you do when you shut down and start up? You're setting
>>> a reference count on blocks and keeping metadata in memory that says
>>> those blocks are really free. Do you need an atexit hook to
>>> decrement the reference counts?
>>
>> Not atexit, just when we close the image.
>
> Just a detail, but we need an atexit() handler to make sure block
> devices get closed, because we have too many exit()s in the code
> today.

Right.

>>> Do you need to create a free list structure that gets written out
>>> on close?
>>
>> Yes, the same freelist that we allocate from. It's an "allocated but
>> not yet referenced" list.
>
> Does it get written to disk?

On exit, or when there is no allocation activity.
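Roughly, the scheme looks like the sketch below. This is illustration
only; the Freelist class and the image accessors (bump_refcounts,
drop_refcounts, sync) are invented names, not qcow2 code:

  import threading

  class Freelist:
      # Clusters whose refcounts are already on disk but which no L2
      # entry references yet -- the "allocated but not yet referenced"
      # list discussed above.
      def __init__(self, image, batch=1024):
          self.image = image
          self.batch = batch            # clusters preallocated per sync
          self.clusters = []            # in-memory only, advisory
          self.lock = threading.Lock()

      def alloc(self):
          with self.lock:
              if not self.clusters:
                  # One refcount update plus one sync covers `batch`
                  # future allocations, amortizing the sync cost.
                  self.clusters = self.image.bump_refcounts(self.batch)
                  self.image.sync()
              return self.clusters.pop()

      def quiesce(self):
          # On clean shutdown or guest quiesce, return the unused
          # preallocation so the image does not bloat.
          with self.lock:
              self.image.drop_refcounts(self.clusters)
              self.clusters = []
              self.image.sync()

After a crash, the clusters still on the in-memory list are leaked
until a background scrubber reclaims them; that is the trade-off
discussed above.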
>>> Just saying "we can do batching" is not solving the problem. If
>>> you want to claim that the formats are equal, then at the very
>>> least you have to give a very exact description of how this would
>>> work, because it's not entirely straightforward.
>>
>> I thought I did, but I realize it is spread over multiple email
>> messages. If you like, I can try to summarize it. It will be
>> equally useful for qed once you add a freelist for UNMAP support.
>
> Yes, please consolidate so we can debate specifics. If there's a
> reasonable way to fix qcow2, I'm happy to walk away from qed. But
> we've studied the problem and do not believe there's a reasonable
> approach to fixing qcow2, where "reasonable" considers the amount of
> development effort, the timeline required to get things right, and
> the confidence we would have in the final product, compared against
> the one-time cost of introducing a new format.

I've started something and will post it soon.

When considering development time, also consider the time it will take
users to actually use qed (6 months for qemu release users, ~9 months
on average for semiannual community distro releases, 12-18 months for
enterprise distros). Consider also that we still have to support
qcow2, since people do use the extra features, and since I don't see
us forcing them to migrate.

>>> I don't think you have any grounds to make such a statement.
>>
>> No, it's a forward-looking statement. But you're already looking at
>> adding a freelist for UNMAP support and three levels for larger
>> images. So it's safe to say that qed will not remain as nice and
>> simple as it is now.
>
> I have a lot of faith in starting from a strong base and avoiding
> making it weaker vs. starting from a weak base and trying to make it
> stronger. This has led to many rewrites in the past.
>
> I realize it's somewhat subjective though.

While qed looks like a good start, it has at least three flaws already
(relying on physical image size, relying on fsck, and limited logical
image size). Just fixing those will introduce complication. What
about new features or newly discovered flaws?

>>>>> 4) We have looked at trying to fix qcow2. It appears to be a
>>>>> monumental amount of work that starts with a rewrite where it's
>>>>> unclear if we can even keep supporting all of the special
>>>>> features. IOW, there is likely to be a need for users to
>>>>> experience some type of image conversion or optimization process.
>>>>
>>>> I don't see why.
>>>
>>> Because you're oversimplifying what it takes to make qcow2 perform
>>> well.
>>
>> Maybe. With all its complexity, it's nowhere near as complex as
>> even the simplest filesystem. The biggest burden is the state
>> machine design.
>
> Maybe I'm broken with respect to how I think, but I find state
> machines very easy to rationalize.

Your father's state machine. Not as clumsy or random as a thread; an
elegant weapon for a more civilized age.

> To me, the biggest burden in qcow2 is thinking through how you deal
> with shared resources. Because you can block for a long period of
> time during write operations, it's not enough to just carry a mutex
> during all metadata operations. You have to stage operations and
> commit them at very specific points in time.

The standard way of dealing with this is to have a hash table of
metadata entries, each protected by its own local mutex:

  import threading
  from collections import defaultdict

  class L2:
      def __init__(self):
          self.mutex = threading.Lock()
          self.valid = False   # contents loaded from disk?
          self.dirty = False   # contents modified in memory?
          self.pos = None
      def read(self): ...      # load the table at self.pos
      def write(self): ...     # write the table back to self.pos

  l2cache = defaultdict(L2)

  def get_l2(pos):
      l2 = l2cache[pos]
      l2.mutex.acquire()       # local mutex: only users of this one
      if not l2.valid:         # L2 table wait on each other
          l2.pos = pos
          l2.read()            # may block for a long time
          l2.valid = True
      return l2

  def put_l2(l2):
      if l2.dirty:
          l2.write()           # write back before releasing the table
          l2.dirty = False
      l2.mutex.release()

Further tricks allow you to batch unrelated updates of a single L2
into one write (see the sketch below).

You can do all this with a state machine, except now you have to
maintain dependency lists and manually call waiters.
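For instance, deferring the write in put_l2 turns several updates of
the same table into a single I/O. A sketch only; the writeback_queue
and the flush policy are invented for illustration:

  def put_l2_batched(l2, writeback_queue):
      # Batched variant of put_l2: leave the table dirty and queue it;
      # one later write covers every update made in the meantime.
      if l2.dirty and l2 not in writeback_queue:
          writeback_queue.append(l2)
      l2.mutex.release()

  def flush_l2s(writeback_queue):
      # Called on a timer, on guest flush, or before a dependent
      # metadata update must reach the disk.
      for l2 in writeback_queue:
          with l2.mutex:
              if l2.dirty:
                  l2.write()
                  l2.dirty = False
      writeback_queue.clear()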
>>> A "naive" correct version of qcow2 does. Look at the above
>>> example. If you introduce a free list, you change the format,
>>> which means that you couldn't support moving an image to an older
>>> version.
>>
>> qcow2 already has a free list, it's the refcount table.
>
> Okay, qed already has a free list, it's the L1/L2 tables.
>
> Really, the refcount table in qcow2 is redundant. You can rebuild it
> if you needed to, which means you could relax the integrity
> associated with it if you were willing to add an fsck process.
>
> But with internal snapshots, you can have a lot more metadata than
> without them, so fsck can be very, very expensive. It's difficult to
> determine how to solve this problem.
>
> It's far easier to just avoid internal snapshots altogether, and this
> is exactly the thought process that led to QED. Once you drop
> support for internal snapshots, you can dramatically simplify.

The amount of metadata is O(nb_L2 * nb_snapshots). For qed,
nb_snapshots = 1, but nb_L2 can still be quite large. If fsck is too
long for one, it is too long for the other.

I don't see the huge simplification. You simply iterate over all
snapshots to build your free list (sketched below).

>>> So just for your batching example, the only compatible approach is
>>> to reduce the reference count on shutdown. But there's definitely
>>> a trade-off, because a few unclean shutdowns could result in a huge
>>> image.
>>
>> Not just on shutdown, also on guest quiesce. And yes, many unclean
>> shutdowns will bloat the image size. Definitely a downside.
>>
>> The qed solution is to not support UNMAP or qed-on-lvm, and to
>> require fsck instead.
>
> We can support UNMAP. Not sure why you're suggesting we can't.

I meant, without doing an fsck to recover the space.

It's hard for me to consider a scan of all metadata on startup as
something normal; with large enough disks it's simply way too slow
from a cold cache.

Can you run an experiment? Populate a 1TB disk with fio running a
random write workload over the whole range for a while. Reboot the
host. How long does fsck take?

> Not doing qed-on-lvm is definitely a limitation. The one use case
> I've heard is qcow2 on top of clustered LVM, as clustered LVM is
> simpler than a clustered filesystem. I don't know the space well
> enough, so I need to think more about it.

I don't either. If this use case survives, and if qed isn't changed
to accommodate it, it means that's another place where qed can't
supplant qcow2.

>>> I don't see the advantage at all.
>>
>> I can't parse this. You don't see the advantage of TRIM (now
>> UNMAP)? You don't see the advantage of refcount tables? There
>> isn't any, except when compared to a format with no freelist, which
>> therefore can't support UNMAP.
>
> Refcount table. See above discussion for my thoughts on refcount
> table.

OK. It boils down to "is fsck on startup acceptable". Without a
freelist, you need fsck for both unclean shutdown and for UNMAP.

>>> 3) Another format adds choice, choice adds complexity. From my
>>> perspective, QED can reduce choice long term because we can tell
>>> users that unless they have a strong reason otherwise, use QED. We
>>> cannot do that with qcow2 today. That may be an implementation
>>> detail of qcow2, but it doesn't change the fact that there's
>>> complexity in choosing an image format today.
>>
>> True.
>>
>> 4) Requires fsck on unclean shutdown
>
> I know it's uncool to do this in 2010, but I honestly believe it's a
> reasonable approach considering the relative simplicity of our FS
> compared to a normal FS.
>
> We're close to having fsck support, so we can publish some
> performance data from doing it on a reasonably large disk (like
> 1TB). Let's see what that looks like before we draw too many
> conclusions.

Great, you already have a test request queued above.
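For reference, the rebuild pass we are both describing is conceptually
simple; roughly the sketch below, where the image accessors are
made-up names. The cost is that it must read every L2 table of every
snapshot, which is exactly the cold-cache scan I'm worried about:

  def rebuild_free_list(img):
      # Mark every cluster reachable from any snapshot's L1/L2 chain;
      # whatever is left is free. O(nb_L2 * nb_snapshots) reads.
      used = set(img.metadata_clusters())    # header, L1s, refcounts
      for snap in img.all_snapshots():       # nb_snapshots = 1 in qed
          for l1_entry in img.read_l1(snap):
              if l1_entry == 0:              # unallocated L2
                  continue
              used.add(l1_entry)             # the L2 table itself
              for l2_entry in img.read_l2(l1_entry):
                  if l2_entry != 0:
                      used.add(l2_entry)     # a data cluster
      return set(range(img.nb_clusters())) - used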
>> 5) No support for qed-on-lvm
>>
>> 6) limited image resize
>
> Not any more than qcow2, FWIW.
>
> Again, with the default create parameters, we can resize up to 64TB
> without rewriting metadata. I wouldn't call that limited image
> resize.

I guess 64TB should last a bit. And if you relax the L1 size to be
any number of clusters (or have three levels), you're unlimited.

btw, having 256KB L2s is too large IMO. Reading them will slow down
your random read throughput. Even 64K is a bit large, but there's no
point making them smaller than a cluster.

>> 7) No support for UNMAP
>>
>> All are fixable, the latter with considerable changes to the format
>> (allocating from an on-disk freelist requires an intermediate sync
>> step; if the freelist is not on-disk, you can lose unbounded on-disk
>> storage on clean shutdown).
>
> If you treat the on-disk free list as advisory, then you can be very
> loose with writing the free list to disk. You only have to rebuild
> the free list on unclean shutdown, when you have to do an fsck
> anyway. If you're doing an fsck, you can rebuild the free list for
> free.
>
> So really, support for UNMAP is free if you're okay with fsck. And
> let's debate fsck some more when we have some proper performance
> data.

You can decide to treat qcow2's on-disk free list as advisory if you
like. No need for a format change. Of course, starting an image that
was shut down uncleanly under new qemu on old qemu would cause
corruption, so this has to be managed carefully.

That gives you sync-free qcow2, with no need for conversions.

>>> But we can't realistically support users that are using those extra
>>> features today anyway.
>>
>> Why not?
>
> When I say "support users", I mean make sure that they get very good
> performance and data integrity. So far, we've only talked about how
> to get good performance when there have never been snapshots, but I
> think we also need to consider how to deal with making sure that no
> matter what feature a user is using, they get consistent results.

I don't think those features impact data integrity. Snapshots and
encryption are just uses of read-modify-write, which we already have.
Not sure about compression; maybe that needs copy-on-write too.

>> I don't think it's so useless. It's really only slow when
>> allocating, yes? Once you've allocated, it is fully async IIRC.
>
> It bounces all buffers still, and I still think it's synchronous
> (although Kevin would know better).

(An aside: with cache!=none we're bouncing in the kernel as well; we
really need to make it work for cache=none, perhaps using O_DIRECT for
data and writeback for metadata and shared backing images.)

>>>>> If you're willing to leak blocks on a scale that is still
>>>>> unknown.
>>>>
>>>> Who cares, those aren't real storage blocks.
>>>
>>> They are once you move the image from one place to another. If
>>> that doesn't concern you, it really should.
>>
>> I don't see it as a huge problem, certainly less than fsck. If you
>> think fsck is a smaller hit, you can use it to recover the space.
>>
>> Hm, you could have an 'unclean shutdown' bit in qcow2 and run a
>> scrubber in the background if you see it set, and recover the space.
>
> Yes, you'll want to have that regardless. But adding new things to
> qcow2 has all the problems of introducing a new image format.

Just some of them. On mount, rewrite the image format as qcow3. On
clean shutdown, write it back to qcow2. So now there's no risk of
data corruption (but there is reduced usability).
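Concretely, the trick is just a version-field toggle. Sketched below
with invented helper names; the point is only that an old qemu refuses
to open the image while it may contain new-style state:

  QCOW2 = 2   # understood by old and new qemu
  QCOW3 = 3   # refused by old qemu

  def open_image(img):
      hdr = img.read_header()
      if hdr.version == QCOW3:
          scrub_in_background(img)  # previous run didn't close cleanly
      hdr.version = QCOW3           # lock out old qemu while we're live
      img.write_header(hdr)
      img.flush()

  def close_image(img):
      img.quiesce()                 # free list written back, refcounts exact
      hdr = img.read_header()
      hdr.version = QCOW2           # image is clean; old qemu safe again
      img.write_header(hdr)
      img.flush()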
>>> They are once you copy the image. And power loss is the same thing
>>> as unexpected exit, because you're not simply talking about
>>> delaying a sync, you're talking about staging future I/O operations
>>> purely within QEMU.
>>
>> qed is susceptible to the same problem. If you have a 100MB write
>> and qemu exits before it updates the L2s, then those 100MB are
>> leaked. You could alleviate the problem by writing L2 at
>> intermediate points, but even then, a power loss can leak those
>> 100MB.
>>
>> qed trades off the freelist for the file size (anything beyond the
>> file size is free); it doesn't eliminate it completely. So you
>> still have some of its problems, but you don't get its benefits.
>
> I think you've just established that qcow2 and qed both require an
> fsck.

I don't disagree :-)

There's a difference between a background scrubber and a foreground
fsck.

-- 
error compiling committee.c: too many arguments to function