Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format

From: Anthony Liguori <anthony@codemonkey.ws>
To: Avi Kivity <avi@redhat.com>
Cc: Kevin Wolf <kwolf@redhat.com>,
	Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>,
	qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format
Date: Fri, 10 Sep 2010 12:07:07 -0500
Message-ID: <4C8A65BB.9010602@codemonkey.ws>
In-Reply-To: <4C8A5391.2030601@redhat.com>

On 09/10/2010 10:49 AM, Avi Kivity wrote:
>>   If I do a qemu-img create -f qcow2 foo.img 10GB, and then do a 
>> naive copy of the image file and end up with a 2GB image when there's 
>> nothing in it, that's badness.
>
> Only if you crash in the middle.  If not, you free the preallocation 
> during shutdown (or when running a guest, when it isn't actively 
> writing at 100 MB/s).

Which is potentially guest exploitable.

>> And what do you do when you shut down and start up?  You're setting a 
>> reference count on blocks and keeping metadata in memory that those 
>> blocks are really free.  Do you need an atexit hook to decrement the 
>> reference counts? 
>
> Not atexit, just when we close the image.

Just a detail, but we need an atexit() handler to make sure block 
devices get closed because we have too many exit()s in the code today.

>> Do you need to create a free list structure that gets written out on 
>> close?
>
> Yes, the same freelist that we allocate from.  It's an "allocated but 
> not yet referenced" list.

Does it get written to disk?

>> Just saying "we can do batching" is not solving the problem.  If you 
>> want to claim that the formats are equal, then at the very least, 
>> you have to give a very exact description of how this would work 
>> because it's not entirely straightforward.
>
> I thought I did, but I realize it is spread over multiple email 
> messages.  If you like, I can try to summarize it.  It will be equally 
> useful for qed once you add a freelist for UNMAP support.

Yes, please consolidate so we can debate specifics.  If there's a 
reasonable way to fix qcow2, I'm happy to walk away from qed.  But we've 
studied the problem and do not believe there's a reasonable approach to 
fixing qcow2, where "reasonable" takes into account the amount of development 
effort, the timeline required to get things right, and the confidence 
we would have in the final product, compared against the one-time cost of 
introducing a new format.

>>
>>>> 2) qcow2 has historically had data integrity issues.  It's unclear 
>>>> anyone is willing to say that they're 100% confident that there are 
>>>> no remaining data integrity issues in the format.
>>>
>>> Fast forward a few years, no one will be 100% confident there are no 
>>> data integrity issues in qed.
>>
>> I don't think you have any grounds to make such a statement.
>
> No, it's a forward-looking statement.  But you're already looking at 
> adding a freelist for UNMAP support and three levels for larger 
> images.  So it's safe to say that qed will not remain as nice and 
> simple as it is now.

I have a lot of faith in starting from a strong base and avoiding making 
it weaker vs. starting from a weak base and trying to make it stronger.

I realize it's somewhat subjective though.

>>>
>>>> 4) We have looked at trying to fix qcow2.  It appears to be a 
>>>> monumental amount of work that starts with a rewrite where it's 
>>>> unclear if we can even keep supporting all of the special 
>>>> features.  IOW, there is likely to be a need for users to 
>>>> experience some type of image conversion or optimization process.
>>>
>>> I don't see why.
>>
>> Because you're oversimplifying what it takes to make qcow2 perform well.
>
> Maybe.  With all its complexity, it's nowhere near as complex as the 
> simplest filesystem.  The biggest burden is the state machine design.

Maybe I'm broken with respect to how I think, but I find state machines 
very easy to reason about.

To me, the biggest burden in qcow2 is thinking through how you deal with 
shared resources.  Because you can block for a long period of time 
during write operations, it's not enough to just carry a mutex during 
all metadata operations.  You have to stage operations and commit them 
at very specific points in time.
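
A minimal sketch of the staging idea, assuming a hypothetical in-memory allocator and a `write_data` callback that stands in for the blocking write. None of these names come from QEMU. The lock is taken only to reserve a cluster and again to commit the mapping, never across the I/O itself:

```python
import threading


class StagedAllocator:
    """Hypothetical allocator: reserve under the lock, write unlocked, commit under the lock."""

    def __init__(self):
        self.lock = threading.Lock()
        self.next_free = 0   # next unallocated physical cluster
        self.mapping = {}    # committed: logical cluster -> physical cluster
        self.staged = {}     # reserved, not yet committed: physical -> logical

    def write(self, logical, data, write_data):
        # Stage: reserve a physical cluster while holding the lock.
        with self.lock:
            physical = self.next_free
            self.next_free += 1
            self.staged[physical] = logical

        # The slow, blocking I/O runs with no lock held.
        write_data(physical, data)

        # Commit: publish the mapping only after the data has been written.
        with self.lock:
            self.mapping[self.staged.pop(physical)] = physical
        return physical
```

Other requests can allocate and commit while one write is still in flight. The price is that every reader of the metadata has to cope with staged entries that are not yet visible.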

>>
>>>>
>>>> 5) A correct version of qcow2 has terrible performance. 
>>>
>>> Not inherently.
>>
>> A "naive" correct version of qcow2 does.  Look at the above example.  
>> If you introduce a free list, you change the format which means that 
>> you couldn't support moving an image to an older version.
>
> qcow2 already has a free list, it's the refcount table.

Okay, qed already has a free list, it's the L1/L2 tables.

Really, the ref count table in qcow2 is redundant.  You can rebuild it 
if you needed to which means you could relax the integrity associated 
with it if you were willing to add an fsck process.

But with internal snapshots, you can have a lot more metadata than 
without them so fsck can be very, very expensive.  It's difficult to 
determine how to solve this problem.
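
The rebuild described above can be sketched roughly as follows. This is a simplified model, not the qcow2 on-disk layout: each L1 table is assumed to be a dict from an L2 table's cluster to its list of data clusters, with 0 meaning unallocated, and the header, L1 clusters, and refcount blocks themselves are ignored. The function name is hypothetical.

```python
from collections import Counter


def rebuild_refcounts(l1_tables):
    """Recount references by walking every L1 -> L2 -> data mapping.

    l1_tables: one L1 table for the active image plus one per internal
    snapshot, each modelled as {l2_cluster: [data_cluster, ...]}.
    """
    refcounts = Counter()
    for l1 in l1_tables:
        for l2_cluster, l2_entries in l1.items():
            refcounts[l2_cluster] += 1           # the L2 table itself
            for data_cluster in l2_entries:
                if data_cluster != 0:            # 0 means unallocated
                    refcounts[data_cluster] += 1
    return dict(refcounts)
```

The outer loop is where the cost comes from: every internal snapshot adds another full set of tables to walk.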

It's far easier to just avoid internal snapshots altogether and this is 
exactly the thought process that led to QED.  Once you drop support for 
internal snapshots, you can dramatically simplify.

>>
>> So just for your batching example, the only compatible approach is to 
>> reduce the reference count on shutdown.  But there's definitely a 
>> trade-off because a few unclean shutdowns could result in a huge image.
>
> Not just on shutdown, also on guest quiesce.  And yes, many unclean 
> shutdowns will bloat the image size.  Definitely a downside.
>
> The qed solution is to not support UNMAP or qed-on-lvm, and to require 
> fsck instead.

We can support UNMAP.  Not sure why you're suggesting we can't.

Not doing qed-on-lvm is definitely a limitation.  The one use case I've 
heard is qcow2 on top of clustered LVM as clustered LVM is simpler than 
a clustered filesystem.  I don't know the space well enough so I need to 
think more about it.

>> I don't see the advantage at all.
>
> I can't parse this.  You don't see the advantage of TRIM (now UNMAP)?  
> You don't see the advantage of refcount tables?  There isn't any, 
> except when compared to a format with no freelist which therefore 
> can't support UNMAP.

Refcount table.  See the above discussion for my thoughts on the refcount table.

>>
>> 2) If a user has an existing qcow2 image and wants to get the 
>> performance/correctness advantages of QED, they have to convert their 
>> images.  That said, in-place conversion can tremendously simplify this.
>
> Live conversion would be even better.  It's still a user-visible hassle.

Yeah, but you need a user to initiate it.  Otherwise, it's doable.

>> 3) Another format adds choice, choice adds complexity.  From my 
>> perspective, QED can reduce choice long term because we can tell 
>> users that unless they have a strong reason otherwise, use QED.  We 
>> cannot do that with qcow2 today.  That may be an implementation 
>> detail of qcow2, but it doesn't change the fact that there's 
>> complexity in choosing an image format today.
>
> True.
>
> 4) Requires fsck on unclean shutdown

I know it's uncool to do this in 2010, but I honestly believe it's a 
reasonable approach considering the relative simplicity of our FS 
compared to a normal FS.

We're close to having fsck support so we can publish some performance 
data from doing it on a reasonably large disk (like 1TB).  Let's see 
what that looks like before we draw too many conclusions.

> 5) No support for qed-on-lvm
>
> 6) limited image resize

Not any more than qcow2, FWIW.

Again, with the default create parameters, we can resize up to 64TB 
without rewriting metadata.  I wouldn't call that limited image resize.
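
For reference, the 64TB figure can be reproduced with a short calculation. The parameter values below are assumptions taken from the QED proposal's defaults (64 KiB clusters, tables of 4 clusters, 8-byte table entries) and are not stated in this message:

```python
KIB = 1024
TIB = 1024 ** 4

cluster_size = 64 * KIB   # assumed default cluster size
table_clusters = 4        # assumed default table size, in clusters
entry_size = 8            # assumed 64-bit table entries

# Each L1 or L2 table holds this many entries.
entries_per_table = cluster_size * table_clusters // entry_size

# Two levels: L1 entries x L2 entries x bytes per cluster.
max_image_size = entries_per_table * entries_per_table * cluster_size

print(entries_per_table)        # 32768
print(max_image_size // TIB)    # 64
```

Under these assumptions the L1 table is sized for the full 64TB from the start, so growing the image within that limit does not require moving or rewriting the tables.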

> 7) No support for UNMAP
>
> All are fixable, the latter with considerable changes to the format 
> (allocating from an on-disk freelist requires an intermediate sync 
> step; if the freelist is not on-disk, you can lose unbounded on-disk 
> storage on clean shutdown).

If you treat the on-disk free list as advisory, then you can be very 
loose with writing the free list to disk.  You only have to rebuild the 
free list on unclean shutdown when you have to do an fsck anyway.  If 
you're doing an fsck, you can rebuild the free list for free.
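
A minimal sketch of the advisory free list, assuming a hypothetical image modelled as a dict with a `dirty` flag, a single L1 table of the form `{l2_cluster: [data_cluster, ...]}` (0 meaning unallocated), and a list of reserved clusters such as the header and the L1 table. All names here are illustrative and not part of any format:

```python
def fsck_rebuild_free_list(l1, file_clusters, reserved):
    """Every cluster inside the file that no table references is free."""
    used = set(reserved)
    for l2_cluster, l2_entries in l1.items():
        used.add(l2_cluster)
        used.update(c for c in l2_entries if c != 0)
    return [c for c in range(file_clusters) if c not in used]


def open_image(image):
    """Trust the stored free list only if the last shutdown was clean."""
    if image["dirty"]:
        # fsck walks the tables anyway, so the free list is a by-product.
        image["free_list"] = fsck_rebuild_free_list(
            image["l1"], image["file_clusters"], image["reserved"]
        )
    image["dirty"] = True   # set on open; a clean close would clear it
    return image["free_list"]
```

Because the stored list is only a hint, it can be written out lazily, for example only on a clean close.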

So really, support for UNMAP is free if you're okay with fsck.  And 
let's debate fsck some more when we have some proper performance data.

> It's true for desktop users.  It's not true for large installations.
>
>>>>
>>>> The impact to users is minimal.  Upgrading images to a new format 
>>>> is not a big deal.  This isn't guest visible and we're not talking 
>>>> about deleting qcow2 and removing support for it.
>>>
>>> It's a big deal to them.  Users are not experts in qemu image 
>>> formats.  They will have to learn how to do it, whether they can do 
>>> it (need to upgrade all your qemus before you can do it, need to 
>>> make sure you're not using qcow2 features, need to be sure you're 
>>> not planning to use qcow2 features).
>>
>> But we can't realistically support users that are using those extra 
>> features today anyway. 
>
> Why not?

When I say, "support users", I mean make sure that they get very good 
performance and data integrity.  So far, we've only talked about how to 
get good performance when there have never been snapshots but I think we 
also need to consider how to deal with making sure that no matter what 
feature a user is using, they get consistent results.

>>> Sure, we'll support qcow2, but will we give it the same attention?
>>
>> We have a lot of block formats in QEMU today but only one block 
>> format that actually performs well and has good data integrity.
>>
>> We're not giving qcow2 the attention it would need today to promote 
>> it to a Useful Format so I'm not sure that it really matters.
>
> I don't think it's so useless.  It's really only slow when allocating, 
> yes?  Once you've allocated it is fully async IIRC.

It bounces all buffers still and I still think it's synchronous 
(although Kevin would know better).

>>>> If you're willing to leak blocks on a scale that is still unknown. 
>>>
>>> Who cares, those aren't real storage blocks.
>>
>> They are once you move the image from one place to another.  If that 
>> doesn't concern you, it really should.
>
> I don't see it as a huge problem, certainly less than fsck. If you 
> think fsck is a smaller hit, you can use it to recover the space.
>
> Hm, you could have an 'unclean shutdown' bit in qcow2 and run a 
> scrubber in the background if you see it set and recover the space.

Yes, you'll want to have that regardless.  But adding new things to 
qcow2 has all the problems of introducing a new image format.

>>
>>>> You can *potentially* batch metadata updates by preallocating 
>>>> clusters, but what's the right amount to preallocate
>>>
>>> You look at your write rate and adjust it dynamically so you never 
>>> wait.
>>
>> It's never that simple.  How long do you look at the write rate?  Do 
>> you lower the amount dynamically, if so, after how long?  Predicting 
>> the future is never easy.
>
> No, it's not easy.  But you have to do it in qed as well, if you want 
> to avoid fsck.

I don't want to avoid fsck, but we need to provide data about the cost of 
fsck in order to really make that case.

>
>>>> and is it really okay to leak blocks at that scale? 
>>>
>>> Again, those aren't real blocks.  And we're talking power loss 
>>> anyway.  It's certainly better than requiring fsck for correctness.
>>
>> They are once you copy the image.  And power loss is the same thing 
>> as unexpected exit because you're not simply talking about delaying a 
>> sync, you're talking about staging future I/O operations purely within QEMU.
>
> qed is susceptible to the same problem.  If you have a 100MB write and 
> qemu exits before it updates L2s, then those 100MB are leaked.  You 
> could alleviate the problem by writing L2 at intermediate points, but 
> even then, a power loss can leak those 100MB.
>
> qed trades off the freelist for the file size (anything beyond the 
> file size is free), it doesn't eliminate it completely.  So you still 
> have some of its problems, but you don't get its benefits.

I think you've just established that qcow2 and qed both require an 
fsck.  I don't disagree :-)

>> It's not an easy thing to do, I'll be the first to admit it.  But we 
>> have to do difficult things in the name of progress.
>>
>> This discussion is an important one to have because we should not do 
>> things of this significance lightly.
>>
>> But that doesn't mean we should be afraid to make significant 
>> changes.  The lack of a useful image format in QEMU today is 
>> unacceptable.  We cannot remain satisfied with the status quo.
>>
>> If you think we can fix qcow2, then fix qcow2.  But it's not obvious 
>> to me that it's fixable so if you think it is, you'll need to guide 
>> the way.
>
> I'm willing to list the things I think should be done.  But someone 
> else will have to actually do them and someone else will have to 
> allocate the time for this work, which is not going to be insignificant.

Understood.

>>> IMO, the real problem is the state machine implementation.  
>>> Threading it would make it much simpler.  I wish I had the time to 
>>> go back to do that.
>>
>> The hard parts of supporting multiple requests in qed had nothing to do 
>> with threading vs. state machine.  It was ensuring that all requests 
>> had independent state that didn't depend on a global context.  Since 
>> the metadata cache has to be shared content, you have to be very 
>> careful about thinking through the semantics of evicting entries from 
>> the cache and bringing entries into the cache.
>>
>> The concurrency model really doesn't matter.
>
> I disagree.  When you want to order dependent operations with threads, 
> you stick a mutex in the data structure that needs serialization.  The 
> same problem with a state machine means collecting all the state in 
> the call stack, sticking it in a dependency chain, and scheduling a 
> restart when the first operation completes.  It's a lot more code.

Yeah, but I'm saying that you can't just carry a lock, you have to make 
sure you don't carry locks over write()s or read()s which means you end 
up having to stage certain operations with an explicit commit.

If you think async is harder, that's fine.  To me, that's a simple part.

>>> What is specifically so bad about qcow2?  The refcount table?  It 
>>> happens to be necessary for TRIM.  Copy-on-write?  It's needed for 
>>> external snapshots.
>>
>> The refcount table is not necessary for trim.  For trim, all you need 
>> is one bit of information, whether a block is allocated or not.
>>
>> With one bit of information, the refcount table is redundant because 
>> you have that same information in the L2 tables.  It's harder to 
>> obtain but the fact that it's obtainable means you can have weak 
>> semantics with maintaining a refcount table (IOW, a free list) 
>> because it's only an optimization.
>
> Well, the refcount table is also redundant wrt qcow2's L2 tables.  You 
> can always reconstruct it with an fsck.
>
> You store 64 bits vs 1 bit (or less if you use an extent based format, 
> or only store allocated blocks) but essentially it has the same 
> requirements.

Precisely.

>>> We can have them side by side and choose later based on 
>>> performance.  Though I fear if qed is merged qcow2 will see no 
>>> further work.
>>
>> I think that's a weak argument not to merge qed and it's a bad way to 
>> grow a community. 
>
> Certainly, it's open source and we should encourage new ideas.  But 
> I'm worried that when qed grows for a while it will become gnarly, and 
> we'll lose some of the benefit, while we'll create user confusion.

But that would be regressions and we need to be good about rejecting 
things that cause regressions.

>> We shouldn't prevent useful code from being merged because there was 
>> a previous half-baked implementation.  Evolution is sometimes 
>> destructive and that's not a bad thing.  Otherwise, I'd still be 
>> working on Xen :-)
>>
>> We certainly should do our best to ease transition for users.  For 
>> guest facing things, we absolutely need to provide full compatibility 
>> and avoid changing guests at all costs.
>>
>> But upgrading on the host is a part of life.  It's the same reason 
>> that every few years, we go from ext2 -> ext3, ext3 -> ext4, ext4 -> 
>> btrfs.  It's never pretty but the earth still continues to orbit the 
>> sun and we all seem to get by.
>
> ext[234] is more like qcow2 evolution.  qcow2->qed is more similar to 
> ext4->btrfs, but compare the huge feature set difference between ext4 
> and btrfs, and qcow2 and qed.

To me, performance and correctness are huge features.

Regards,

Anthony Liguori
