From: Avi Kivity <avi@redhat.com>
To: Anthony Liguori <anthony@codemonkey.ws>
Cc: Kevin Wolf <kwolf@redhat.com>,
Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>,
qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format
Date: Fri, 10 Sep 2010 18:49:37 +0300
Message-ID: <4C8A5391.2030601@redhat.com>
In-Reply-To: <4C8A4707.7080705@codemonkey.ws>
On 09/10/2010 05:56 PM, Anthony Liguori wrote:
> On 09/10/2010 08:47 AM, Avi Kivity wrote:
>> The current qcow2 implementation, yes. The qcow2 format, no.
>
> The qcow2 format has more writes because it maintains more meta data.
> More writes == worse performance.
>
> You claim that you can effectively batch those writes such that the
> worse performance will be in the noise. That claim needs to be proven
> though because it's purely conjecture right now.
It's based on experience. Why do you think batching allocations will
not improve performance?
In the common case (growing the physical file) allocating involves
writing a '(int64_t)1' to a refcount table. Allocating multiple
contiguous clusters means writing multiple such entries. That's trivial
to batch.
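
To make the batching claim concrete, here is a minimal Python sketch (not QEMU code). It assumes a simplified model in which each cluster has one refcount entry and entries are grouped into refcount blocks; the class name, the `ENTRIES_PER_BLOCK` value and the write counter are all hypothetical and exist only for illustration.

```python
# Hypothetical model: one refcount entry per cluster, grouped into blocks.
ENTRIES_PER_BLOCK = 4  # tiny value chosen only to keep the example readable


class RefcountTable:
    def __init__(self):
        self.refcounts = []       # refcounts[i] is the refcount of cluster i
        self.metadata_writes = 0  # refcount-block writes issued so far

    def allocate(self, count):
        """Grow the file by `count` contiguous clusters in one batch."""
        if count <= 0:
            raise ValueError("count must be positive")
        first = len(self.refcounts)
        last = first + count - 1
        self.refcounts.extend([1] * count)
        # One write per refcount block touched, not one per cluster.
        self.metadata_writes += last // ENTRIES_PER_BLOCK - first // ENTRIES_PER_BLOCK + 1
        return range(first, first + count)


batched = RefcountTable()
batched.allocate(8)            # one batch of eight clusters

unbatched = RefcountTable()
for _ in range(8):
    unbatched.allocate(1)      # eight separate allocations
```

In this model the batched table issues two refcount-block writes for eight clusters, while the unbatched one issues eight.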
>
> There is a trade off to batching too as you leak address space. If
> you have to preallocate 2GB worth of address space to get good
> performance, then I'm very sceptical that qcow2 achieves the goals of
> a sparse file format.
2GB is 20 seconds worth of writes at 100 MB/s. It's way beyond what's
needed. At a guess I'd say 100ms worth, and of course, only if actively
writing.
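
The arithmetic behind those figures, as a small Python sketch. It assumes decimal units (1 MB = 10^6 bytes, 2 GB = 2000 MB); the function names are made up for this example.

```python
MB = 10**6  # decimal megabytes, matching the "100 MB/s" figure above


def seconds_to_write(total_bytes, rate_bytes_per_s):
    """Time needed to fill `total_bytes` at a sustained write rate."""
    return total_bytes / rate_bytes_per_s


def preallocation_bytes(rate_bytes_per_s, window_ms):
    """Address space to reserve to cover `window_ms` of writes at that rate."""
    return rate_bytes_per_s * window_ms // 1000


print(seconds_to_write(2000 * MB, 100 * MB))      # time to consume 2 GB
print(preallocation_bytes(100 * MB, 100) // MB)   # 100 ms worth, in MB
```

So 2 GB covers 20 seconds of writes at 100 MB/s, while a 100 ms window needs only about 10 MB of preallocated address space.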
> If I do a qemu-img create -f qcow2 foo.img 10GB, and then do a naive
> copy of the image file and end up with a 2GB image when there's
> nothing in it, that's badness.
Only if you crash in the middle. If not, you free the preallocation
during shutdown (or when running a guest, when it isn't actively writing
at 100 MB/s).
>
> And what do you do when you shutdown and start up? You're setting a
> reference count on blocks and keeping metadata in memory that those
> blocks are really free. Do you need an atexit hook to decrement the
> reference counts?
Not atexit, just when we close the image.
> Do you need to create a free list structure that gets written out on
> close?
Yes, the same freelist that we allocate from. It's an "allocated but
not yet referenced" list.
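
A rough Python sketch of that "allocated but not yet referenced" list, under the assumption that the on-disk state is just a map from cluster index to refcount. The `Image` class and its method names are hypothetical and are not taken from the qcow2 code.

```python
class Image:
    def __init__(self):
        self.refcounts = {}    # cluster index -> refcount (models on-disk state)
        self.reserved = []     # allocated but not yet referenced (in memory only)
        self.next_cluster = 0  # anything at or beyond this index is past end of file

    def preallocate(self, count):
        """Mark `count` new clusters as allocated before any L2 entry points at them."""
        for _ in range(count):
            self.refcounts[self.next_cluster] = 1
            self.reserved.append(self.next_cluster)
            self.next_cluster += 1

    def take_cluster(self):
        """Hand out a reserved cluster; no refcount write is needed at this point."""
        if not self.reserved:
            self.preallocate(1)
        return self.reserved.pop(0)

    def close(self):
        """On image close (or guest quiesce), return unused reservations."""
        for cluster in self.reserved:
            self.refcounts[cluster] = 0
        self.reserved.clear()


img = Image()
img.preallocate(4)
used = img.take_cluster()
img.close()
```

If the process dies before `close()` runs, clusters 1 to 3 in this example keep a refcount of 1 with nothing referencing them, which is the leak discussed in this thread.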
> Just saying "we can do batching" is not solving the problem. If you
> want to claim that the formats are equal, then at the very least,
> you have to give a very exact description of how this would work
> because it's not entirely straightforward.
I thought I did, but I realize it is spread over multiple email
messages. If you like, I can try to summarize it. It will be equally
useful for qed once you add a freelist for UNMAP support.
At least one filesystem I'm aware of does preallocation in this manner.
>
>>> 2) qcow2 has historically had data integrity issues. It's unclear
>>> anyone is willing to say that they're 100% confident that there are
>>> no remaining data integrity issues in the format.
>>
>> Fast forward a few years, no one will be 100% confident there are no
>> data integrity issues in qed.
>
> I don't think you have any grounds to make such a statement.
No, it's a forward-looking statement. But you're already looking at
adding a freelist for UNMAP support and three levels for larger images.
So it's safe to say that qed will not remain as nice and simple as it is
now.
>>
>>> 4) We have looked at trying to fix qcow2. It appears to be a
>>> monumental amount of work that starts with a rewrite where it's
>>> unclear if we can even keep supporting all of the special features.
>>> IOW, there is likely to be a need for users to experience some type
>>> of image conversion or optimization process.
>>
>> I don't see why.
>
> Because you're oversimplifying what it takes to make qcow2 perform well.
Maybe. With all its complexity, it's nowhere near as complex as the
simplest filesystem. The biggest burden is the state machine design.
>
>>>
>>> 5) A correct version of qcow2 has terrible performance.
>>
>> Not inherently.
>
> A "naive" correct version of qcow2 does. Look at the above example.
> If you introduce a free list, you change the format which means that
> you couldn't support moving an image to an older version.
qcow2 already has a free list, it's the refcount table.
>
> So just for your batching example, the only compatible approach is to
> reduce the reference count on shutdown. But there's definitely a
> trade off because a few unclean shut downs could result in a huge image.
Not just on shutdown, also on guest quiesce. And yes, many unclean
shutdowns will bloat the image size. Definitely a downside.
The qed solution is to not support UNMAP or qed-on-lvm, and to require
fsck instead. Or to introduce an on-disk freelist, at which point you
get the qcow2 problems back.
>
>>> You need to do a bunch of fancy tricks to recover that performance.
>>> Every fancy trick needs to be carefully evaluated with respect to
>>> correctness. There's a large surface area for potential data
>>> corruptors.
>>
>> s/large/larger/. The only real difference is the refcount table,
>> which I agree sucks, but happens to be nice for TRIM support.
>
> I don't see the advantage at all.
I can't parse this. You don't see the advantage of TRIM (now UNMAP)?
You don't see the advantage of refcount tables? There isn't any, except
when compared to a format with no freelist which therefore can't support
UNMAP.
>> Those are properties of the implementation, not the format. The
>> format makes it harder to get it right but doesn't give us a free
>> pass not to do it.
>
>
> If the complexity doesn't buy us anything, then why pay the cost of it?
Because of compatibility. Starting from scratch, I'd pick qed, with
three levels and some way to support UNMAP.
>
> Let's review the purported downsides of QED.
>
> 1) It's a new image format. If users create QED images, they can't
> use them with older QEMU's. However, if we add a new feature to
> qcow2, we have the same problem.
Depends. Some features don't need format changes (UNMAP). On the other
hand, qcow2 doesn't have a feature bitmap, which complicates things.
>
> 2) If a user has an existing qcow2 image and wants to get the
> performance/correctness advantages of QED, they have to convert their
> images. That said, in place conversion can tremendously simplify this.
Live conversion would be even better. It's still a user-visible hassle.
>
> 3) Another format adds choice, choice adds complexity. From my
> perspective, QED can reduce choice long term because we can tell users
> that unless they have a strong reason otherwise, use QED. We cannot
> do that with qcow2 today. That may be an implementation detail of
> qcow2, but it doesn't change the fact that there's complexity in
> choosing an image format today.
True.
4) Requires fsck on unclean shutdown
5) No support for qed-on-lvm
6) limited image resize
7) No support for UNMAP
All are fixable, the latter with considerable changes to the format
(allocating from an on-disk freelist requires an intermediate sync step;
if the freelist is not on-disk, you can lose unbounded on-disk storage
on clean shutdown).
>> Sure, because you don't care about users. All of the complexity of
>> changing image formats (and deciding whether to do that or not) is
>> hidden away.
>
> Let's not turn this into a "I care more about users than you do"
> argument. Changing image formats consists of running a single
> command. The command is pretty slow today but we can make it pretty
> darn fast. It seems like a relatively small price to pay for a
> relatively large gain.
It's true for desktop users. It's not true for large installations.
>>>
>>> The impact to users is minimal. Upgrading images to a new format is
>>> not a big deal. This isn't guest visible and we're not talking
>>> about deleting qcow2 and removing support for it.
>>
>> It's a big deal to them. Users are not experts in qemu image
>> formats. They will have to learn how to do it, whether they can do
>> it (need to upgrade all your qemus before you can do it, need to make
>> sure you're not using qcow2 features, need to be sure you're not
>> planning to use qcow2 features).
>
> But we can't realistically support users that are using those extra
> features today anyway.
Why not?
> It's those "features" that are the fundamental problem.
I agree some of them (compression, in-image snapshots) are misfeatures.
>> Sure, we'll support qcow2, but will we give it the same attention?
>
> We have a lot of block formats in QEMU today but only one block format
> that actually performs well and has good data integrity.
>
> We're not giving qcow2 the attention it would need today to promote it
> to a Useful Format so I'm not sure that it really matters.
I don't think it's so useless. It's really only slow when allocating,
yes? Once you've allocated it is fully async IIRC.
So even today qcow2 is only slow at the start of the lifetime of the image.
>>> If you're willing to leak blocks on a scale that is still unknown.
>>
>> Who cares, those aren't real storage blocks.
>
> They are once you move the image from one place to another. If that
> doesn't concern you, it really should.
I don't see it as a huge problem, certainly less than fsck. If you think
fsck is a smaller hit, you can use it to recover the space.
Hm, you could have an 'unclean shutdown' bit in qcow2 and run a scrubber
in the background if you see it set and recover the space.
>
>>> It's not at all clear that making qcow2 have the same
>>> characteristics as qed is an easy problem. qed is specifically
>>> designed to avoid synchronous metadata updates. qcow2 cannot
>>> achieve that.
>>
>> qcow2 and qed are equivalent if you disregard the refcount table
>> (which we address by preallocation). Exactly the same technique you
>> use for sync-free metadata updates in qed can be used for qcow2.
>
> You cannot ignore the refcount table, that's the point of the discussion.
#include "I'm using preallocation to reduce its cost".
>
>>> You can *potentially* batch metadata updates by preallocating
>>> clusters, but what's the right amount to preallocate
>>
>> You look at your write rate and adjust it dynamically so you never wait.
>
> It's never that simple. How long do you look at the write rate? Do
> you lower the amount dynamically, if so, after how long? Predicting
> the future is never easy.
No, it's not easy. But you have to do it in qed as well, if you want to
avoid fsck.
>>> and is it really okay to leak blocks at that scale?
>>
>> Again, those aren't real blocks. And we're talking power loss
>> anyway. It's certainly better than requiring fsck for correctness.
>
> They are once you copy the image. And power loss is the same thing as
> unexpected exit because you're not simply talking about delaying a
> sync, you're talking staging future I/O operations purely within QEMU.
qed is susceptible to the same problem. If you have a 100MB write and
qemu exits before it updates L2s, then those 100MB are leaked. You
could alleviate the problem by writing L2 at intermediate points, but
even then, a power loss can leak those 100MB.
qed trades off the freelist for the file size (anything beyond the file
size is free), it doesn't eliminate it completely. So you still have
some of its problems, but you don't get its benefits.
>>> It's a weak story either way. There's a burden of proof still
>>> required to establish that this would, indeed, address the
>>> performance concerns.
>>
>> I don't see why you doubt it so much. Amortization is a well-known
>> technique for reducing the cost of expensive operations.
>
> Because there are always limits, otherwise, all expensive operations
> would be cheap, and that's not reality.
Well, I guess we won't get anywhere with a theoretical discussion here.
>
>> You misunderstand me. I'm not advocating dropping qed and stopping
>> qcow2 development. I'm advocating dropping qed and working on qcow2
>> to provide the benefits that qed brings.
>
> If you think qcow2 is fixable, then either 1) fix qcow2 and prove me
> wrong 2) detail in great length how you would fix qcow2, and prove me
> wrong. Either way, the burden of proof is on establishing that qcow2
> is fixable.
I agree the burden of proof is on me (I'm just going to bounce it off to
Kevin). Mere words shouldn't be used to block off new work.
>
> So far, the proposed fixes are not specific and/or have unacceptable
> trade offs.
I thought they were quite specific. I'll try to summarize them in one
place so at least they're not lost.
> Having a leaking image is not acceptable IMHO because it potentially
> becomes something that is guest exploitable.
>
> If a guest finds a SEGV that is not exploitable in any meaningful way
> except crashing QEMU, by leaking data in each crash, a guest can now
> grow an image's virtual size indefinitely.
>
> This does have real costs in disk space as the underlying file system
> does need to deal with metadata, but it's not unrealistic for
> management tools to copy images around for various reasons (maybe
> offline backup). A reasonable management tool might do planning based
> on maximum image size, but now the tools have to cope with (virtually)
> infinitely large images.
The qed solution is fsck, which is a lot worse IMO.
>> This simple formula doesn't work if some of your hosts don't support
>> qed yet. And it's still complicated for users because they have to
>> understand all of that. "trust me, use qed" is not going to work.
>
> Versus what? "Trust me, this time, we've finally fixed qcow2's data
> integrity issues" is going to work? That's an uphill battle no matter
> what.
We have to fix qcow2 anyway, since we can't ensure users do upgrade to qed.
>>>
>>> qcow2 has been a failure. Let's live up to it and move on. Making
>>> statements at each release that qcow2 has issues but we'll fix it
>>> soon just makes us look like we don't know what we're doing.
>>>
>>
>> Switching file formats is a similar statement.
>
> It's not an easy thing to do, I'll be the first to admit it. But we
> have to do difficult things in the name of progress.
>
> This discussion is an important one to have because we should not do
> things of this significance lightly.
>
> But that doesn't mean we should be afraid to make significant
> changes. The lack of a useful image format in QEMU today is
> unacceptable. We cannot remain satisfied with the status quo.
>
> If you think we can fix qcow2, then fix qcow2. But it's not obvious
> to me that it's fixable so if you think it is, you'll need to guide
> the way.
I'm willing to list the things I think should be done. But someone else
will have to actually do them and someone else will have to allocate the
time for this work, which is not going to be insignificant.
> It's not enough to just wave your hands and say "amortize the
> expensive operations". It's not that easy to solve or else we would
> have solved it ages ago.
We were rightly focusing on data integrity first.
>> IMO, the real problem is the state machine implementation. Threading
>> it would make it much simpler. I wish I had the time to go back to
>> do that.
>
> The hard parts of supporting multiple requests in qed had nothing to do
> with threading vs. state machine. It was ensuring that all requests
> had independent state that didn't depend on a global context. Since
> the meta data cache has to be shared content, you have to be very
> careful about thinking through the semantics of evicting entries from
> the cache and bringing entries into the cache.
>
> The concurrency model really doesn't matter.
I disagree. When you want to order dependent operations with threads,
you stick a mutex in the data structure that needs serialization. The
same problem with a state machine means collecting all the state in the
call stack, sticking it in a dependency chain, and scheduling a restart
when the first operation completes. It's a lot more code.
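
As a rough illustration of the difference (a Python sketch, not QEMU code), here are two versions of the same serialized table update. Both classes, and the convention that the caller invokes `complete()` when the I/O finishes, are assumptions made up for this example.

```python
import threading
from collections import deque


class ThreadedTable:
    """Threaded style: the mutex lives in the structure that needs serialization."""

    def __init__(self):
        self.lock = threading.Lock()
        self.entries = {}

    def update(self, index, cluster):
        with self.lock:  # dependent operations simply block here
            self.entries[index] = cluster


class StateMachineTable:
    """Callback style: saved state is queued and restarted on completion."""

    def __init__(self):
        self.entries = {}
        self.busy = False
        self.waiting = deque()  # dependency chain of deferred requests
        self._done = None

    def update(self, index, cluster, done):
        if self.busy:
            # Save everything the call stack would have held for us.
            self.waiting.append((index, cluster, done))
            return
        self.busy = True
        self.entries[index] = cluster
        self._done = done

    def complete(self):
        """Called when the in-flight operation finishes."""
        done, self._done = self._done, None
        self.busy = False
        done()
        if self.waiting:
            self.update(*self.waiting.popleft())  # schedule the restart
```

The second class has to carry the request state and the restart logic explicitly, which is the extra code referred to above.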
>> What is specifically so bad about qcow2? The refcount table? It
>> happens to be necessary for TRIM. Copy-on-write? It's needed for
>> external snapshots.
>
> The refcount table is not necessary for trim. For trim, all you need
> is one bit of information, whether a block is allocated or not.
>
> With one bit of information, the refcount table is redundant because
> you have that same information in the L2 tables. It's harder to
> obtain but the fact that it's obtainable means you can have weak
> semantics with maintaining a refcount table (IOW, a free list) because
> it's only an optimization.
Well, the refcount table is also redundant wrt qcow2's L2 tables. You
can always reconstruct it with an fsck.
You store 64 bits vs 1 bit (or less if you use an extent based format,
or only store allocated blocks) but essentially it has the same
requirements.
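
A minimal Python sketch of why the refcount table is redundant with respect to the L2 tables: it can be rebuilt by walking them. The data layout here (an L1 list of L2 cluster indexes and a dict of L2 tables) is a simplified assumption, not the on-disk qcow2 layout, and the function name is hypothetical.

```python
from collections import Counter


def rebuild_refcounts(l1_table, l2_tables):
    """Recount references by walking L1 -> L2 -> data clusters, fsck style."""
    refcounts = Counter()
    for l2_cluster in l1_table:
        refcounts[l2_cluster] += 1  # the L2 table itself occupies a cluster
        for data_cluster in l2_tables[l2_cluster]:
            if data_cluster is not None:  # None models an unallocated entry
                refcounts[data_cluster] += 1
    return refcounts


# Cluster 5 is referenced from two L2 tables, so its rebuilt refcount is 2.
l1 = [2, 3]
l2 = {2: [4, None, 5], 3: [5, 6]}
rebuilt = rebuild_refcounts(l1, l2)
```

A one-bit allocation map could be derived the same way by recording only whether each count is non-zero.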
>> We can have them side by side and choose later based on performance.
>> Though I fear if qed is merged qcow2 will see no further work.
>
> I think that's a weak argument not to merge qed and it's a bad way to
> grow a community.
Certainly, it's open source and we should encourage new ideas. But I'm
worried that when qed grows for a while it will become gnarly, and we'll
lose some of the benefit, while we'll create user confusion.
> We shouldn't prevent useful code from being merged because there was a
> previous half-baked implementation. Evolution is sometimes
> destructive and that's not a bad thing. Otherwise, I'd still be
> working on Xen :-)
>
> We certainly should do our best to ease transition for users. For
> guest facing things, we absolutely need to provide full compatibility
> and avoid changing guests at all costs.
>
> But upgrading on the host is a part of life. It's the same reason
> that every few years, we go from ext2 -> ext3, ext3 -> ext4, ext4 ->
> btrfs. It's never pretty but the earth still continues to orbit the
> sun and we all seem to get by.
ext[234] is more like qcow2 evolution. qcow2->qed is more similar to
ext4->btrfs, but compare the huge feature set difference between ext4
and btrfs, and qcow2 and qed.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.