Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format

From: Anthony Liguori <anthony@codemonkey.ws>
To: Avi Kivity <avi@redhat.com>
Cc: Kevin Wolf <kwolf@redhat.com>,
	Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>,
	qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format
Date: Sun, 12 Sep 2010 10:13:24 -0500
Message-ID: <4C8CEE14.4020501@codemonkey.ws>
In-Reply-To: <4C8CD47E.4060309@redhat.com>

On 09/12/2010 08:24 AM, Avi Kivity wrote:
>>> Not atexit, just when we close the image.
>>
>> Just a detail, but we need an atexit() handler to make sure block 
>> devices get closed because we have too many exit()s in the code today.
>
>
> Right.

So when you click the 'X' on the qemu window, we get to wait a few 
seconds for it to actually disappear because it's flushing metadata to 
disk.

> I've started something and will post it soon.

Excellent, thank you.

>   When considering development time, also consider the time it will 
> take users to actually use qed (6 months for qemu release users, ~9 
> months on average for semiannual community distro releases, 12-18 
> months for enterprise distros).  Consider also that we still have to 
> support qcow2 since people do use the extra features, and since I 
> don't see us forcing them to migrate.

I'm of the opinion that qcow2 is unfit for production use for the type 
of production environments I care about.  The number of changes needed 
to make qcow2 fit for production use puts it on at least the same 
timeline as you cite above.

Yes, there are people today for whom qcow2 is appropriate, but by the 
same token, it will continue to be appropriate for them in the future.

In my view, we don't have an image format fit for production use.  
You're arguing we should make qcow2 fit for production use whereas I am 
arguing we should start from scratch.  My reasoning for starting from 
scratch is that it simplifies the problem.  Your reasoning for improving 
qcow2 is simplifying the transition for non-production users of qcow2.

We have an existence proof that we can achieve good data integrity and 
good performance by simplifying the problem.  The burden is still on 
establishing that qcow2 can be improved with a reasonable amount of 
effort.

NB, you could use qcow2 today if you had all of the data integrity fixes 
or didn't care about data integrity in the event of power failure or 
didn't care about performance.  I don't have any customers that fit that 
bill so from my perspective, qcow2 isn't production fit.  That doesn't 
mean that it's not fit for someone else's production use.

>>
>> I realize it's somewhat subjective though.
>
> While qed looks like a good start, it has at least three flaws already 
> (relying on physical image size, relying on fsck, and limited logical 
> image size).  Just fixing those will introduce complication.  What 
> about new features or newly discovered flaws?

Let's quantify fsck.  My suspicion is that if you've got the storage for 
1TB disk images, it's fast enough that fsck cannot be that bad.

Keep in mind, we don't have to completely pause the guest while 
fsck'ing.  We simply have to prevent cluster allocations.  We can allow 
reads and we can allow writes to allocated clusters.

Consequently, if you had a 1TB disk image, it's extremely likely that 
the vast majority of I/O is just to allocated clusters which means that 
fsck() is entirely a background task.  The worst case scenario is 
actually a half-allocated disk.

But since you have to boot before you can run any serious test, if it 
takes 5 seconds to do an fsck(), it's highly likely that it's not even 
noticeable.
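
As a rough sketch of what "only prevent cluster allocations" could look 
like (the class and method names below are made up for illustration and 
are not QED code), reads and writes to already-allocated clusters go 
straight through while fsck runs, and only writes that would need a new 
cluster are held back:

```python
from collections import deque


class AllocationGate:
    """Defers only allocating writes while a consistency check runs."""

    def __init__(self, allocated):
        self.allocated = set(allocated)  # cluster indexes already mapped
        self.checking = True             # fsck in progress after unclean shutdown
        self.deferred = deque()          # allocating writes waiting for fsck
        self.completed = []              # requests that ran, kept for illustration

    def read(self, cluster):
        # Reads never allocate, so they always proceed.
        self.completed.append(("read", cluster))

    def write(self, cluster):
        if cluster in self.allocated or not self.checking:
            # Either the cluster already exists or fsck has finished.
            self.allocated.add(cluster)
            self.completed.append(("write", cluster))
        else:
            # Needs a new cluster: hold the request until fsck completes.
            self.deferred.append(cluster)

    def fsck_done(self):
        # Release every allocating write that was held back.
        self.checking = False
        while self.deferred:
            self.write(self.deferred.popleft())
```

With a mostly-allocated image, almost every request takes the first 
branch of write(), which is why the check behaves like a background task.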

>> Maybe I'm broken with respect to how I think, but I find state 
>> machines very easy to rationalize.
>
> Your father's state machine. Not as clumsy or random as a thread; an 
> elegant weapon for a more civilized age

I find your lack of faith in QED disturbing.

>> To me, the biggest burden in qcow2 is thinking through how you deal 
>> with shared resources.  Because you can block for a long period of 
>> time during write operations, it's not enough to just carry a mutex 
>> during all metadata operations.  You have to stage operations and 
>> commit them at very specific points in time.
>
> The standard way of dealing with this is to have a hash table for 
> metadata that contains a local mutex:
>
>     l2cache = defaultdict(L2)
>
>     def get_l2(pos):
>         l2 = l2cache[pos]
>         l2.mutex.lock()
>         if not l2.valid:
>              l2.pos = pos
>              l2.read()
>              l2.valid = True
>         return l2
>
>     def put_l2(l2):
>         if l2.dirty:
>             l2.write()
>             l2.dirty = False
>         l2.mutex.unlock()

You're missing how you create entries.  That means you've got to do:

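def put_l2(l2):
    if l2.committed:
        # Existing entry: flush if dirty, then release the lock from get_l2()
        if l2.dirty:
            l2.write()
            l2.dirty = False
        l2.mutex.unlock()
    else:
        # New entry: publish it in the cache under its own lock
        l2.mutex.lock()
        l2cache[l2.pos] = l2
        l2.mutex.unlock()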

And this really illustrates my point.  It's a harder problem than it 
seems.  You also are keeping l2 reads from occurring when flushing a 
dirty l2 entry which is less parallel than what qed achieves today.
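
For reference, here is a runnable rendering of the quoted lookup path 
using Python's threading module.  It is only a sketch: the L2 class with 
its read()/write() stubs and the separate cache_lock guarding the 
dictionary are assumptions added to make it self-contained, and it does 
not attempt the "committed" staging for newly created tables.  It does 
show the serialization mentioned above: the per-table mutex is held 
across write(), so nobody can read that table while it is being flushed.

```python
import threading
from collections import defaultdict


class L2:
    def __init__(self):
        self.mutex = threading.Lock()
        self.pos = None
        self.valid = False
        self.dirty = False
        self.reads = 0    # counters stand in for real disk I/O
        self.writes = 0

    def read(self):
        self.reads += 1   # placeholder for reading the table from disk

    def write(self):
        self.writes += 1  # placeholder for writing the table to disk


cache_lock = threading.Lock()   # protects the dictionary itself
l2cache = defaultdict(L2)


def get_l2(pos):
    with cache_lock:            # lookup/insert must not race with other threads
        l2 = l2cache[pos]
    l2.mutex.acquire()          # held until put_l2()
    if not l2.valid:
        l2.pos = pos
        l2.read()
        l2.valid = True
    return l2


def put_l2(l2):
    if l2.dirty:
        l2.write()              # other users of this table wait here
        l2.dirty = False
    l2.mutex.release()
```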

This is part of why I prefer state machines.  Acquiring a mutex is too 
easy and it makes it easy to not think through what all could be 
running.  When you are more explicit about when you are allowing 
concurrency, I think it's easier to be more aggressive.

It's a personal preference really.  You can find just as many folks on 
the intertubes that claim Threads are Evil as claim State Machines are Evil.

The only reason we're discussing this is you've claimed QEMU's state 
machine model is the biggest inhibitor and I think that's 
oversimplifying things.  It's like saying QEMU's biggest problem is that 
too many of its developers use vi versus emacs.  You may personally 
believe that vi is entirely superior to emacs but by the same token, you 
should be able to recognize that some people are able to be productive 
with emacs.

If someone wants to rewrite qcow2 to be threaded, I'm all for it.  I 
don't think it's really any simpler than making it a state machine.  I 
find it hard to believe you think there's an order of magnitude 
difference in development work too.

>> It's far easier to just avoid internal snapshots altogether and this 
>> is exactly the thought process that led to QED.  Once you drop 
>> support for internal snapshots, you can dramatically simplify.
>
> The amount of metadata is O(nb_L2 * nb_snapshots).  For qed, 
> nb_snapshots = 1 but nb_L2 can be still quite large.  If fsck is too 
> long for one, it is too long for the other.

nb_L2 is very small.  It's exactly n / 2GB + 1 where n is image size.  
Since image size is typically < 100GB, practically speaking it's less 
than 50.
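
To make the arithmetic concrete, here is a small sketch.  The 64KB 
cluster size and 8-byte table entries are assumptions; together with the 
256KB tables mentioned elsewhere in this thread they give the 2GB per L2 
table used in the formula above.

```python
CLUSTER_SIZE = 64 * 1024        # assumed default cluster size
TABLE_SIZE = 4 * CLUSTER_SIZE   # 256KB L2 table
ENTRY_SIZE = 8                  # assumed 64-bit table entries

ENTRIES_PER_TABLE = TABLE_SIZE // ENTRY_SIZE      # clusters mapped by one L2
BYTES_PER_L2 = ENTRIES_PER_TABLE * CLUSTER_SIZE   # data covered by one L2

GB = 1024 ** 3


def nb_l2(image_size):
    """Number of L2 tables for an image, following n / 2GB + 1."""
    return image_size // BYTES_PER_L2 + 1
```

With these values BYTES_PER_L2 works out to 2GB, and any image smaller 
than 100GB needs at most 50 L2 tables.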

OTOH, nb_snapshots in qcow2 can be very large.  In fact, it's not 
unrealistic for nb_snapshots to be >> 50.  What that means is that 
instead of metadata being O(n) as it is today, it's at least O(n^2).

Doing internal snapshots right is far more complicated than qcow2 does 
things.

> How long does fsck take?

We'll find out soon.  But remember, fsck() only blocks pending metadata 
writes so it's not entirely all up-front.

>> Not doing qed-on-lvm is definitely a limitation.  The one use case 
>> I've heard is qcow2 on top of clustered LVM as clustered LVM is 
>> simpler than a clustered filesystem.  I don't know the space well 
>> enough so I need to think more about it.
>
> I don't either.  If this use case survives, and if qed isn't changed 
> to accommodate it, it means that that's another place where qed can't 
> supplant qcow2.

I'm okay with that.  An image file should require a file system.  If I 
were going to design an image file to be used on top of raw storage, I 
would take an entirely different approach.

>> Refcount table.  See above discussion  for my thoughts on refcount 
>> table.
>
> Ok.  It boils down to "is fsck on startup acceptable".  Without a 
> freelist, you need fsck for both unclean shutdown and for UNMAP.

To rebuild the free list on unclean shutdown.

>>> 5) No support for qed-on-lvm
>>>
>>> 6) limited image resize
>>
>> Not anymore than qcow2 FWIW.
>>
>> Again, with the default create parameters, we can resize up to 64TB 
>> without rewriting metadata.  I wouldn't call that limited image resize.
>
> I guess 64TB should last a bit.  And if you relax the L1 size to be 
> any number of clusters (or have three levels) you're unlimited.
>
> btw, having 256KB L2s is too large IMO.  Reading them will slow down 
> your random read throughput.  Even 64K is a bit large, but there's no 
> point making them smaller than a cluster.

This is just defaults and honestly, adding another level would be pretty 
trivial.
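
A quick sketch of where the 64TB figure comes from and what a third 
level would buy.  The 64KB clusters, 256KB tables and 8-byte entries are 
assumed defaults, and max_image_size() is just a helper for this example:

```python
CLUSTER_SIZE = 64 * 1024                   # assumed default cluster size
TABLE_ENTRIES = (4 * CLUSTER_SIZE) // 8    # 256KB table of assumed 8-byte entries

TB = 1024 ** 4


def max_image_size(levels):
    """Largest addressable image with the given number of table levels."""
    return TABLE_ENTRIES ** levels * CLUSTER_SIZE
```

Under those assumptions max_image_size(2) is 64TB, and max_image_size(3) 
is 32768 times larger than that.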

> (an aside: with cache!=none we're bouncing in the kernel as well; we 
> really need to make it work for cache=none, perhaps use O_DIRECT for 
> data and writeback for metadata and shared backing images).

QED achieves zero-copy with cache=none today.  In fact, our performance 
testing that we'll publish RSN is exclusively with cache=none.

>> Yes, you'll want to have that regardless.  But adding new things to 
>> qcow2 has all the problems of introducing a new image format.
>
> Just some of them.  On mount, rewrite the image format as qcow3.  On 
> clean shutdown, write it back to qcow2.  So now there's no risk of 
> data corruption (but there is reduced usability).

It means on unclean shutdown, you can't move images to older versions.  
That means a management tool can't rely on the mobility of images which 
means it's a new format for all practical purposes.

QED started its life as qcow3.  You start with qcow3, remove the 
features that are poorly thought out and make correctness hard, add some 
future proofing, and you're left with QED.

We're fully backwards compatible with qcow2 (by virtue of qcow2 still 
being in tree) but new images require new versions of QEMU.  That said, 
we have a conversion tool to convert new images to the old format if 
mobility is truly required.

So it's the same story that you're telling above from an end-user 
perspective.

>>>> They are once you copy the image.  And power loss is the same thing 
>>>> as unexpected exit because you're not simply talking about delaying 
>>>> a sync, you're talking staging future I/O operations purely within 
>>>> QEMU.
>>>
>>> qed is susceptible to the same problem.  If you have a 100MB write 
>>> and qemu exits before it updates L2s, then those 100MB are leaked.  
>>> You could alleviate the problem by writing L2 at intermediate 
>>> points, but even then, a power loss can leak those 100MB.
>>>
>>> qed trades off the freelist for the file size (anything beyond the 
>>> file size is free), it doesn't eliminate it completely.  So you 
>>> still have some of its problems, but you don't get its benefits.
>>
>> I think you've just established that qcow2 and qed both require an 
>> fsck.  I don't disagree :-)
>
> There's a difference between a background scrubber and a foreground fsck.

The difference between qcow2 and qed is that qed relies on the file size 
and qcow2 uses a bitmap.

The bitmap grows synchronously whereas in qed, we're not relying on 
synchronous file growth.  If we did, there would be no need for an fsck.
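
A minimal sketch of the file-size approach, assuming 64KB clusters and a 
single header cluster (both assumptions, as are the function names): new 
clusters are allocated at the end of the file, and a check after an 
unclean shutdown looks for clusters inside the file that no table 
references.  A real scan would collect referenced_offsets by walking the 
L1 and L2 tables; here they are passed in directly.

```python
CLUSTER_SIZE = 64 * 1024  # assumed cluster size


def next_free_offset(file_size):
    """Anything beyond the end of the file is free, so allocate there."""
    # Round the file size up to the next cluster boundary.
    return -(-file_size // CLUSTER_SIZE) * CLUSTER_SIZE


def leaked_clusters(file_size, referenced_offsets, header_clusters=1):
    """Cluster indexes inside the file that no table entry points at."""
    total = file_size // CLUSTER_SIZE
    referenced = {off // CLUSTER_SIZE for off in referenced_offsets}
    return [c for c in range(header_clusters, total) if c not in referenced]
```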

If you attempt to grow the refcount table in qcow2 without doing a 
sync(), then you're going to have to have an fsync to avoid corruption.

qcow2 doesn't have an advantage, it's just not trying to be as 
sophisticated as qed is.

Regards,

Anthony Liguori
