Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format

From: Anthony Liguori <anthony@codemonkey.ws>
To: Avi Kivity <avi@redhat.com>
Cc: Kevin Wolf <kwolf@redhat.com>,
	Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>,
	qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format
Date: Fri, 10 Sep 2010 09:56:07 -0500
Message-ID: <4C8A4707.7080705@codemonkey.ws>
In-Reply-To: <4C8A36D4.5050001@redhat.com>

On 09/10/2010 08:47 AM, Avi Kivity wrote:
> The current qcow2 implementation, yes.  The qcow2 format, no.

The qcow2 format has more writes because it maintains more metadata.  
More writes == worse performance.

You claim that you can effectively batch those writes such that the 
worse performance will be in the noise.  That claim needs to be proven 
though because it's purely conjecture right now.

There is a trade off to batching too as you leak address space.  If you 
have to preallocate 2GB worth of address space to get good performance, 
then I'm very sceptical that qcow2 achieves the goals of a sparse file 
format.  If I do a qemu-img create -f qcow2 foo.img 10GB, and then do a 
naive copy of the image file and end up with a 2GB image when there's 
nothing in it, that's badness.

And what do you do when you shut down and start up?  You're setting a 
reference count on blocks and keeping metadata in memory that those 
blocks are really free.  Do you need an atexit hook to decrement the 
reference counts?  Do you need to create a free list structure that gets 
written out on close?

Just saying "we can do batching" is not solving the problem.  If you 
want to claim that the formats are equal, then at the very least, you 
have to give a very exact description of how this would work because 
it's not entirely straightforward.

>> 2) qcow2 has historically had data integrity issues.  It's unclear 
>> anyone is willing to say that they're 100% confident that there are 
>> no remaining data integrity issues in the format.
>
> Fast forward a few years, no one will be 100% confident there are no 
> data integrity issues in qed.

I don't think you have any grounds to make such a statement.

>> 3) The users I care most about are absolutely uncompromising about 
>> data integrity.  There is no room for uncertainty or trade offs when 
>> you're building an enterprise product.
>
> 100% in agreement here.
>
>> 4) We have looked at trying to fix qcow2.  It appears to be a 
>> monumental amount of work that starts with a rewrite where it's 
>> unclear if we can even keep supporting all of the special features.  
>> IOW, there is likely to be a need for users to experience some type 
>> of image conversion or optimization process.
>
> I don't see why.

Because you're oversimplifying what it takes to make qcow2 perform well.

>>
>> 5) A correct version of qcow2 has terrible performance. 
>
> Not inherently.

A "naive" correct version of qcow2 does.  Look at the above example.  If 
you introduce a free list, you change the format which means that you 
couldn't support moving an image to an older version.

So just for your batching example, the only compatible approach is to 
reduce the reference count on shutdown.  But there's definitely a 
trade-off because a few unclean shutdowns could result in a huge image.
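
As a rough illustration of the trade-off described above, here is a toy
model in Python.  It is only a sketch: the ToyImage class, the batch size
of 256 clusters and the "one allocation per session" workload are all
assumptions made for the example, not anything taken from qcow2 or from
this thread.  Clusters are reserved in batches by setting their refcount
"on disk", while the knowledge of which reserved clusters are still
unused lives only in memory.

```python
class ToyImage:
    """Toy image file whose allocator reserves clusters in batches."""

    def __init__(self, batch):
        self.batch = batch    # clusters reserved per batch (hypothetical knob)
        self.refcount = []    # "on disk": one refcount per cluster
        self.used = set()     # "on disk": clusters referenced by mapping tables
        self.reserved = []    # in memory only: reserved but not yet handed out

    def allocate(self):
        if not self.reserved:
            # A single metadata update marks a whole batch as in use.
            start = len(self.refcount)
            self.refcount.extend([1] * self.batch)
            self.reserved = list(range(start, start + self.batch))
        cluster = self.reserved.pop(0)
        self.used.add(cluster)
        return cluster

    def clean_shutdown(self):
        # Drop the refcount of every cluster that was reserved but never used,
        # then truncate the now-free tail of the file.
        for cluster in self.reserved:
            self.refcount[cluster] = 0
        self.reserved = []
        while self.refcount and self.refcount[-1] == 0:
            self.refcount.pop()

    def crash(self):
        # The in-memory list is lost; the on-disk refcounts stay at 1.
        self.reserved = []

    def leaked(self):
        # Clusters that look allocated on disk but hold no data.
        return sum(1 for c, rc in enumerate(self.refcount)
                   if rc and c not in self.used)


def run(sessions, crash, batch=256):
    """Allocate one cluster per session, then shut down cleanly or crash."""
    img = ToyImage(batch)
    for _ in range(sessions):
        img.allocate()
        if crash:
            img.crash()
        else:
            img.clean_shutdown()
    return img.leaked()


if __name__ == "__main__":
    print("leaked after 3 clean shutdowns:", run(3, crash=False))
    print("leaked after 3 crashes:", run(3, crash=True))
```

In this model three clean shutdowns leak nothing, while three crashes
leave 765 clusters marked as allocated with no data in them.  The leak
per crash is bounded by the batch size, so the batch size is exactly the
knob that trades metadata writes against image growth.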

>> You need to do a bunch of fancy tricks to recover that performance.  
>> Every fancy trick needs to be carefully evaluated with respect to 
>> correctness.  There's a large surface area for potential data 
>> corruptors.
>
> s/large/larger/.  The only real difference is the refcount table, 
> which I agree sucks, but happens to be nice for TRIM support.

I don't see the advantage at all.

>>
>> We're still collecting performance data, but here's an example of 
>> what we're talking about.
>>
>> FFSB Random Writes MB/s (Block Size=8KB)
>>
>>               Native     Raw   QCow2     QED
>> 1 Thread        30.2    24.4    22.7    23.4
>> 8 Threads      145.1   119.9    10.6   112.9
>> 16 Threads     177.1   139.0    10.1   120.9
>>
>> The performance difference is an order of magnitude.  qcow2 bounces 
>> all requests, needs to issue synchronous metadata updates, and only 
>> supports a single outstanding request at a time.
>
> Those are properties of the implementation, not the format.  The 
> format makes it harder to get it right but doesn't give us a free pass 
> not to do it.

If the complexity doesn't buy us anything, then why pay the cost of it?
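
For reference, the "order of magnitude" figure can be checked directly
against the FFSB numbers quoted above.  The short Python sketch below
only copies the table values and divides them; the dictionary layout and
function names are made up for the example.

```python
# MB/s figures copied from the FFSB random-write table quoted above.
results = {
    "1 thread":   {"native": 30.2,  "raw": 24.4,  "qcow2": 22.7, "qed": 23.4},
    "8 threads":  {"native": 145.1, "raw": 119.9, "qcow2": 10.6, "qed": 112.9},
    "16 threads": {"native": 177.1, "raw": 139.0, "qcow2": 10.1, "qed": 120.9},
}


def ratio_to_raw(fmt):
    """Throughput of `fmt` as a fraction of raw, per thread count."""
    return {name: round(row[fmt] / row["raw"], 2)
            for name, row in results.items()}


def qed_over_qcow2():
    """How many times faster QED is than qcow2, per thread count."""
    return {name: round(row["qed"] / row["qcow2"], 1)
            for name, row in results.items()}


if __name__ == "__main__":
    print("qcow2 / raw:", ratio_to_raw("qcow2"))
    print("qed   / raw:", ratio_to_raw("qed"))
    print("qed / qcow2:", qed_over_qcow2())
```

With a single thread both formats are within about 7% of raw.  At 8 and
16 threads qcow2 drops to roughly 9% and 7% of raw, while QED stays at
94% and 87%, which puts QED between 10x and 12x ahead of qcow2.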

Let's review the purported downsides of QED.

1) It's a new image format.  If users create QED images, they can't use 
them with older QEMUs.  However, if we add a new feature to qcow2, we 
have the same problem.

2) If a user has an existing qcow2 image and wants to get the 
performance/correctness advantages of QED, they have to convert their 
images.  That said, in-place conversion can tremendously simplify this.

3) Another format adds choice, choice adds complexity.  From my 
perspective, QED can reduce choice long term because we can tell users 
that unless they have a strong reason otherwise, use QED.  We cannot do 
that with qcow2 today.  That may be an implementation detail of qcow2, 
but it doesn't change the fact that there's complexity in choosing an 
image format today.

>>
>> With good performance and high confidence in integrity, it's a no 
>> brainer as far as I'm concerned.  We have a format that is easy to 
>> rationalize as correct and performs damn close to raw.  On the other 
>> hand, we have a format that no one is confident is correct, that 
>> is even harder to rationalize as correct, and is an order of 
>> magnitude off raw in performance.
>>
>> It's really a no brainer.
>
> Sure, because you don't care about users.  All of the complexity of 
> changing image formats (and deciding whether to do that or not) is 
> hidden away.

Let's not turn this into an "I care more about users than you do" 
argument.  Changing image formats consists of running a single command.  
The command is pretty slow today but we can make it pretty darn fast.  
It seems like a relatively small price to pay for a relatively large gain.

>>
>> The impact to users is minimal.  Upgrading images to a new format is 
>> not a big deal.  This isn't guest visible and we're not talking about 
>> deleting qcow2 and removing support for it.
>
> It's a big deal to them.  Users are not experts in qemu image 
> formats.  They will have to learn how to do it, whether they can do it 
> (need to upgrade all your qemus before you can do it, need to make 
> sure you're not using qcow2 features, need to be sure you're not 
> planning to use qcow2 features).

But we can't realistically support users that are using those extra 
features today anyway.  It's those "features" that are the fundamental 
problem.

> Sure, we'll support qcow2, but will we give it the same attention?

We have a lot of block formats in QEMU today but only one block format 
that actually performs well and has good data integrity.

We're not giving qcow2 the attention it would need today to promote it 
to a Useful Format so I'm not sure that it really matters.

>> If you're willing to leak blocks on a scale that is still unknown. 
>
> Who cares, those aren't real storage blocks.

They are once you move the image from one place to another.  If that 
doesn't concern you, it really should.

>> It's not at all clear that making qcow2 have the same characteristics 
>> as qed is an easy problem.  qed is specifically designed to avoid 
>> synchronous metadata updates.  qcow2 cannot achieve that.
>
> qcow2 and qed are equivalent if you disregard the refcount table 
> (which we address by preallocation).  Exactly the same technique you 
> use for sync-free metadata updates in qed can be used for qcow2.

You cannot ignore the refcount table, that's the point of the discussion.

>> You can *potentially* batch metadata updates by preallocating 
>> clusters, but what's the right amount to preallocate
>
> You look at your write rate and adjust it dynamically so you never wait.

It's never that simple.  How long do you look at the write rate?  Do you 
lower the amount dynamically, if so, after how long?  Predicting the 
future is never easy.
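
To make those questions concrete, here is a sketch in Python of one
possible heuristic: an exponentially weighted moving average of the
allocation rate that sets the preallocation target.  Nothing in this
thread proposes this particular scheme.  The PreallocSizer class and
every parameter in it (alpha, horizon_s, min_clusters, max_clusters) are
hypothetical, and each one corresponds to a decision the questions above
say has to be made.

```python
class PreallocSizer:
    """Pick a preallocation target from a smoothed allocation rate."""

    def __init__(self, alpha=0.2, horizon_s=1.0,
                 min_clusters=16, max_clusters=4096):
        self.alpha = alpha                # how fast old samples are forgotten
        self.horizon_s = horizon_s        # seconds of demand to keep reserved
        self.min_clusters = min_clusters  # floor, so a first write never waits
        self.max_clusters = max_clusters  # ceiling on what a crash can leak
        self.rate = 0.0                   # smoothed clusters per second

    def observe(self, clusters_allocated, interval_s):
        """Feed one measurement window into the moving average."""
        sample = clusters_allocated / interval_s
        self.rate = self.alpha * sample + (1.0 - self.alpha) * self.rate

    def target(self):
        """Number of clusters to keep preallocated right now."""
        want = int(self.rate * self.horizon_s)
        return max(self.min_clusters, min(self.max_clusters, want))


def targets_after_burst(sizer, burst_clusters, idle_windows):
    """One busy window followed by idle windows; return the target after each."""
    out = []
    sizer.observe(burst_clusters, 1.0)
    out.append(sizer.target())
    for _ in range(idle_windows):
        sizer.observe(0, 1.0)
        out.append(sizer.target())
    return out


if __name__ == "__main__":
    print(targets_after_burst(PreallocSizer(alpha=0.5), 1000, 5))
```

Even in this small sketch the target lags the workload in both
directions: it undershoots during the first window of a burst and keeps
clusters reserved for several windows after the guest has gone idle.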

>> and is it really okay to leak blocks at that scale? 
>
> Again, those aren't real blocks.  And we're talking power loss 
> anyway.  It's certainly better than requiring fsck for correctness.

They are once you copy the image.  And power loss is the same thing as 
unexpected exit because you're not simply talking about delaying a sync, 
you're talking about staging future I/O operations purely within QEMU.

>> It's a weak story either way.  There's a burden of proof still 
>> required to establish that this would, indeed, address the 
>> performance concerns.
>
> I don't see why you doubt it so much.  Amortization is a well-known 
> technique for reducing the cost of expensive operations.

Because there are always limits, otherwise, all expensive operations 
would be cheap, and that's not reality.

> You misunderstand me.  I'm not advocating dropping qed and stopping 
> qcow2 development.  I'm advocating dropping qed and working on qcow2 
> to provide the benefits that qed brings.

If you think qcow2 is fixable, then either 1) fix qcow2 and prove me 
wrong or 2) detail at great length how you would fix qcow2, and prove me 
wrong.  Either way, the burden of proof is on establishing that qcow2 is 
fixable.

So far, the proposed fixes are not specific and/or have unacceptable 
trade offs.  Having a leaking image is not acceptable IMHO because it 
potentially becomes something that is guest exploitable.

If a guest finds a SEGV that is not exploitable in any meaningful way 
except crashing QEMU, then by leaking data in each crash, a guest can 
now grow an image's virtual size indefinitely.

This does have real costs in disk space as the underlying file system 
does need to deal with metadata, but it's not unrealistic for management 
tools to copy images around for various reasons (maybe offline backup).  
A reasonable management tool might do planning based on maximum image 
size, but now the tools have to cope with (virtually) infinitely large 
images.

>>>> A new format doesn't introduce much additional complexity.  We 
>>>> provide image conversion tool and we can almost certainly provide 
>>>> an in-place conversion tool that makes the process very fast.
>>>
>>> It introduces a lot of complexity for the users who aren't qed 
>>> experts.  They need to make a decision.  What's the impact of the 
>>> change?  Are the features that we lose important to us?  Do we know 
>>> what they are?  Is there any risk?  Can we make the change online or 
>>> do we have to schedule downtime?  Do all our hosts support qed?
>>
>> It's very simple.  Use qed, convert all existing images.  Image 
>> conversion is a part of virtualization.  We have tools to do it.  If 
>> they want to stick with qcow2 and are happy with it, fine, no one is 
>> advocating removing it.
>
> This simple formula doesn't work if some of your hosts don't support 
> qed yet.  And it's still complicated for users because they have to 
> understand all of that.  "trust me, use qed" is not going to work.

Versus what?  "Trust me, this time, we've finally fixed qcow2's data 
integrity issues" is going to work?  That's an uphill battle no matter what.

>>
>>> Improving qcow2 will be very complicated for Kevin who already looks 
>>> older beyond his years [1] but very simple for users.
>>
>> I think we're all better off if we move past sunk costs and focus on 
>> solving other problems.  I'd rather we all focus on improving 
>> performance and correctness even further than trying to make qcow2 be 
>> as good as what every other hypervisor had 5 years ago.
>>
>> qcow2 has been a failure.  Let's live up to it and move on.  Making 
>> statements at each release that qcow2 has issues but we'll fix it 
>> soon just makes us look like we don't know what we're doing.
>>
>
> Switching file formats is a similar statement.

It's not an easy thing to do, I'll be the first to admit it.  But we 
have to do difficult things in the name of progress.

This discussion is an important one to have because we should not do 
things of this significance lightly.

But that doesn't mean we should be afraid to make significant changes.  
The lack of a useful image format in QEMU today is unacceptable.  We 
cannot remain satisfied with the status quo.

If you think we can fix qcow2, then fix qcow2.  But it's not obvious to 
me that it's fixable so if you think it is, you'll need to guide the way.

It's not enough to just wave your hands and say "amortize the expensive 
operations".  It's not that easy to solve or else we would have solved 
it ages ago.

> IMO, the real problem is the state machine implementation.  Threading 
> it would make it much simpler.  I wish I had the time to go back to do 
> that.

The hard parts of supporting multiple requests in qed had nothing to do 
with threading vs. state machine.  It was ensuring that all requests had 
independent state that didn't depend on a global context.  Since the 
metadata cache has to be shared content, you have to be very careful 
about thinking through the semantics of evicting entries from the cache 
and bringing entries into the cache.
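
One common way to give a shared cache well-defined eviction semantics is
to let each in-flight request pin the entries it is using.  The Python
sketch below shows that idea only.  It is not the qed code: the
TableCache class, its get/put methods and the load callback are all
hypothetical names chosen for the example.

```python
from collections import OrderedDict


class TableCache:
    """Shared LRU cache of metadata tables; pinned entries are never evicted."""

    def __init__(self, capacity, load):
        self.capacity = capacity
        self.load = load              # callable: table offset -> table contents
        self.entries = OrderedDict()  # offset -> [table, pin_count], LRU first

    def get(self, offset):
        """Return the table at `offset` and pin it for the calling request."""
        entry = self.entries.get(offset)
        if entry is None:
            self._make_room()
            entry = [self.load(offset), 0]
            self.entries[offset] = entry
        self.entries.move_to_end(offset)  # mark as most recently used
        entry[1] += 1
        return entry[0]

    def put(self, offset):
        """Unpin a table once the request no longer needs it."""
        self.entries[offset][1] -= 1

    def _make_room(self):
        if len(self.entries) < self.capacity:
            return
        # Evict the least recently used entry that no request has pinned.
        victim = next((off for off, entry in self.entries.items()
                       if entry[1] == 0), None)
        if victim is None:
            raise RuntimeError("every cached table is pinned by a request")
        del self.entries[victim]


if __name__ == "__main__":
    cache = TableCache(capacity=2, load=lambda off: {"offset": off})
    cache.get(0)          # request A pins table 0
    cache.get(1)          # request B pins table 1
    cache.put(0)          # request A completes
    cache.get(2)          # table 0 is evicted, table 1 stays cached
    print(sorted(cache.entries))
```

Each request keeps track of the offsets it has pinned as part of its own
state, so nothing about an individual request lives in the cache or in
a global context.  The cache only knows how many users each table has.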

The concurrency model really doesn't matter.

> What is specifically so bad about qcow2?  The refcount table?  It 
> happens to be necessary for TRIM.  Copy-on-write?  It's needed for 
> external snapshots.

The refcount table is not necessary for trim.  For trim, all you need is 
one bit of information, whether a block is allocated or not.

With one bit of information, the refcount table is redundant because you 
have that same information in the L2 tables.  It's harder to obtain but 
the fact that it's obtainable means you can have weak semantics with 
maintaining a refcount table (IOW, a free list) because it's only an 
optimization.
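
A minimal sketch of that redundancy, in Python: the set of allocated
clusters is recovered by walking the L1 and L2 tables, and the free list
is whatever is left over.  The in-memory table layout, the function
names and the reserved clusters for the header and L1 table are
assumptions made for the example; the only format detail relied on is
that an offset of 0 means "unallocated".

```python
def allocated_clusters(l1_table, l2_tables, cluster_size, reserved=(0, 1)):
    """Derive the set of allocated cluster indexes from the mapping tables.

    l1_table:  list of L2 table file offsets, 0 meaning "no L2 table"
    l2_tables: dict mapping an L2 table offset to its list of data offsets
    reserved:  clusters assumed to hold the header and the L1 table
    """
    allocated = set(reserved)
    for l2_offset in l1_table:
        if l2_offset == 0:
            continue
        # The L2 table itself occupies a cluster.
        allocated.add(l2_offset // cluster_size)
        for data_offset in l2_tables[l2_offset]:
            if data_offset != 0:
                allocated.add(data_offset // cluster_size)
    return allocated


def rebuild_free_list(l1_table, l2_tables, cluster_size, file_clusters):
    """Every cluster inside the file that no table refers to is free."""
    used = allocated_clusters(l1_table, l2_tables, cluster_size)
    return [c for c in range(file_clusters) if c not in used]


if __name__ == "__main__":
    cs = 65536
    l1 = [2 * cs, 0]                    # one L2 table, stored in cluster 2
    l2 = {2 * cs: [3 * cs, 0, 5 * cs]}  # data in clusters 3 and 5
    print(rebuild_free_list(l1, l2, cs, file_clusters=7))
```

For the small image in the example this prints [4, 6].  Because the list
can always be rebuilt this way, a stale or missing free list costs a
scan, not correctness.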

>> The choices we have are 1) provide our users a format that has high 
>> performance and good data integrity or 2) continue to only offer a 
>> format that has poor performance and bad data integrity and promise 
>> that we'll eventually fix it.
>>
>> We've been doing (2) for too long now.  We need to offer a solution 
>> to users today.  It's not fair to our users to not offer them a good 
>> solution just because we don't want to admit to previous mistakes.
>>
>> If someone can fix qcow2 and make it competitive, by all means, 
>> please do.
>
> We can have them side by side and choose later based on performance.  
> Though I fear if qed is merged qcow2 will see no further work.

I think that's a weak argument not to merge qed and it's a bad way to 
grow a community.  We shouldn't prevent useful code from being merged 
because there was a previous half-baked implementation.  Evolution is 
sometimes destructive and that's not a bad thing.  Otherwise, I'd still 
be working on Xen :-)

We certainly should do our best to ease transition for users.  For guest 
facing things, we absolutely need to provide full compatibility and 
avoid changing guests at all costs.

But upgrading on the host is a part of life.  It's the same reason that 
every few years, we go from ext2 -> ext3, ext3 -> ext4, ext4 -> btrfs.  
It's never pretty but the earth still continues to orbit the sun and we 
all seem to get by.

Regards,

Anthony Liguori
