qemu-devel.nongnu.org archive mirror
From: Avi Kivity <avi@redhat.com>
To: Anthony Liguori <anthony@codemonkey.ws>
Cc: Kevin Wolf <kwolf@redhat.com>,
	Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>,
	qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format
Date: Sun, 12 Sep 2010 19:51:15 +0200	[thread overview]
Message-ID: <4C8D1313.1050802@redhat.com> (raw)
In-Reply-To: <4C8D094E.4060507@codemonkey.ws>

  On 09/12/2010 07:09 PM, Anthony Liguori wrote:
> On 09/12/2010 10:56 AM, Avi Kivity wrote:
>> No, the worst case is 0.003% allocated disk, with the allocated 
>> clusters distributed uniformly.  That means all your L2s are 
>> allocated, but almost none of your clusters are.
>
> But in this case, you're so sparse that your metadata is pretty much 
> co-located which means seek performance won't matter much.

You still get the rotational delay.  But yes, the hit is reduced.

>
>>>
>>> But since you have to boot before you can run any serious test, if 
>>> it takes 5 seconds to do an fsck(), it's highly likely that it's not 
>>> even noticeable.
>>
>> What if it takes 300 seconds?
>
> That means for a 1TB disk you're taking 500ms per L2 entry, you're 
> fully allocated and yet still doing an fsck.  That seems awfully 
> unlikely.

I meant for a fully populated L1.  That's 10ms per L2.

But since that's 64TB, that's unlikely too.  It can still take 10s for a 
2TB disk.
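To make the arithmetic concrete, here's a back-of-envelope sketch; the seek cost, cluster size, and L2 fan-out below are assumptions for illustration, not QED's actual header values:

```python
# Rough fsck cost model: reading each allocated L2 table costs about
# one rotational seek.  All sizes here are assumed, not QED's real ones.
SEEK_MS = 10          # assumed cost per L2 table read (seek + read)
CLUSTER = 64 * 1024   # assumed cluster size
L2_ENTRIES = 8192     # assumed entries per L2 table

def fsck_seconds(disk_bytes):
    covered_per_l2 = CLUSTER * L2_ENTRIES   # bytes mapped by one L2 table
    n_l2 = -(-disk_bytes // covered_per_l2) # ceiling division
    return n_l2 * SEEK_MS / 1000.0

for tb in (1, 2, 64):
    print(f"{tb} TB -> {fsck_seconds(tb * 2**40):.1f} s")
```

The point survives any particular choice of constants: the scan grows linearly with logical disk size, so small images stay tolerable while the fully populated case does not.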

>
>>>    if l2.committed:
>>>        if l2.dirty:
>>>            l2.write()
>>>            l2.dirty = False
>>>        l2.mutex.unlock()
>>>    else:
>>>        l2.mutex.lock()
>>>        l2cache[l2.pos] = l2
>>>        l2.mutex.unlock()
>>
>> The in-memory L2 is created by defaultdict().  I did omit linking L2 
>> into L1, but that's a function call.  With a state machine, it's a new 
>> string of states and calls.
>
> But you have to write the L2 to disk first before you link it so it's 
> not purely in memory.

That's fine.  Threading allows you to have blocking calls.  It's slower, 
but very rare anyway.
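As a rough illustration of the defaultdict() idea (the names and structure here are hypothetical, not the actual QED code): a missing L2 table is materialized in memory on first touch, and linking it into L1 is deferred until after the table has been written out.

```python
import threading
from collections import defaultdict

class L2Table:
    def __init__(self):
        self.entries = defaultdict(int)  # cluster index -> offset; 0 = unallocated
        self.mutex = threading.Lock()
        self.committed = False           # True once linked into L1 on disk
        self.dirty = False

class L1Table:
    def __init__(self):
        # first access of a missing L2 creates a fresh, uncommitted table
        self.l2 = defaultdict(L2Table)

    def map_cluster(self, l1_index, l2_index, offset):
        table = self.l2[l1_index]        # allocates the L2 in memory if absent
        with table.mutex:
            table.entries[l2_index] = offset
            table.dirty = True
```

The blocking write-then-link step would slot in where `committed` flips to True; with threads that's just a blocking call on the rare allocation path.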

>> Why is in n^2?  It's still n*m.  If your image is 4TB instead of 
>> 100GB, the time increases by a factor of 40 for both.
>
> It's n*m but either n ~= m in which case it's n^2 or m << n, in which 
> case, it's just n, or m >> n in which case, it's just O(m).
>
> This is where asymptotic complexity ends up not being terribly helpful 
> :-)
>
> Let me put this another way though, if you support internal snapshots, 
> what's a reasonable number of snapshots to expect reasonable 
> performance with?  10?  100?  1000? 10000?

I'd say 10.  Not that I really want to support internal snapshots; they 
don't work well with multiple disks.

>>> I'm okay with that.  An image file should require a file system.  If 
>>> I was going to design an image file to be used on top of raw 
>>> storage, I would take an entirely different approach.
>>
>> That spreads our efforts further.
>
> No.  I don't think we should be in the business of designing on top of 
> raw storage.  Either assume fixed partitions, LVM, or a file system.  
> We shouldn't reinvent the wheel at every opportunity (just the 
> carefully chosen opportunities).

I agree, but in this case there was no choice.

>
>>>>> Refcount table.  See above discussion  for my thoughts on refcount 
>>>>> table.
>>>>
>>>> Ok.  It boils down to "is fsck on startup acceptable".  Without a 
>>>> freelist, you need fsck for both unclean shutdown and for UNMAP.
>>>
>>> To rebuild the free list on unclean shutdown.
>>
>> If you have an on-disk compact freelist, you don't need that fsck.
>
> "If you have an on-disk compact [consistent] freelist, you don't need 
> that fsck."
>
> Consistency is the key point.  We go out of our way to avoid a 
> consistent freelist in QED because it's the path to best performance.  
> The key goal for a file format should be to have exactly as much 
> consistency as required and not one bit more as consistency always 
> means worse performance.

Preallocation lets you have a consistent (or at least conservative) free 
list, with just a bit of extra consistency.  If you piggyback 
preallocation on guest syncs, you don't even pay for that.

On the other hand, a linear L2 (which now becomes L1) means your fsck is 
just a linear scan of the table, which is probably faster than qcow2 
allocation...
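A toy sketch of why that scan is cheap: with one flat logical->physical table, rebuilding the free list is a single sequential pass with no dependent seeks between tables (the table layout here is hypothetical):

```python
# Rebuild the free list from a flat mapping table
# (logical cluster -> physical cluster, 0 = unallocated).
def rebuild_free_map(table, n_physical):
    used = [False] * n_physical
    for phys in table:
        if phys:                      # 0 means unallocated
            used[phys] = True
    return [i for i, u in enumerate(used) if not u]

table = [0, 3, 0, 5, 4]               # toy flat table
print(rebuild_free_map(table, 8))     # -> [0, 1, 2, 6, 7]
```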

>>
>>>> (an aside: with cache!=none we're bouncing in the kernel as well; 
>>>> we really need to make it work for cache=none, perhaps use O_DIRECT 
>>>> for data and writeback for metadata and shared backing images).
>>>
>>> QED achieves zero-copy with cache=none today.  In fact, our 
>>> performance testing that we'll publish RSN is exclusively with 
>>> cache=none.
>>
>> In this case, preallocation should really be cheap, since there isn't 
>> a ton of dirty data that needs to be flushed.  You issue an extra 
>> flush once in a while so your truncate (or physical image size in the 
>> header) gets to disk, but that doesn't block new writes.
>>
>> It makes qed/lvm work, and it replaces the need to fsck for the next 
>> allocation with the need for a background scrubber to reclaim storage 
>> (you need that anyway for UNMAP).  It makes the whole thing a lot 
>> more attractive IMO.
>
>
> For a 1PB disk image with qcow2, the reference count table is 128GB.  
> For a 1TB image, the reference count table is 128MB.   For a 128GB 
> image, the reference table is 16MB which is why we get away with it 
> today.
>
> Anytime you grow the freelist with qcow2, you have to write a brand 
> new freelist table and update the metadata synchronously to point to a 
> new version of it.  That means for a 1TB image, you're potentially 
> writing out 128MB of data just to allocate a new cluster.
>
> s/freelist/refcount table/ to translate to current qcow2 
> nomenclature.  This is certainly not fast.  You can add a bunch of 
> free blocks each time to mitigate the growth, but I can't think of many 
> circumstances where a 128MB write isn't going to be noticeable.  And 
> it only gets worse as time moves on because 1TB disk images are 
> already in use today.
>

That's a strong point.  qcow2 doubles the table on each growth, so it 
amortizes, but the delay is certainly going to be noticeable.

You can do it ahead of time (so guest writes don't need to wait) but 
it's still expensive.
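For reference, the sizes quoted above follow directly from 64KB clusters and 8 bytes (64-bit) per refcount entry; a quick sanity check:

```python
# Refcount table size = one 8-byte entry per 64 KB cluster.
CLUSTER = 64 * 1024
ENTRY = 8

def refcount_table_bytes(image_bytes):
    return image_bytes // CLUSTER * ENTRY

assert refcount_table_bytes(2**50) == 128 * 2**30       # 1 PB  -> 128 GB
assert refcount_table_bytes(2**40) == 128 * 2**20       # 1 TB  -> 128 MB
assert refcount_table_bytes(128 * 2**30) == 16 * 2**20  # 128 GB -> 16 MB
```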

> NB, with a 64-bit refcount table, the size of the refcount table is 
> almost exactly the same size as the L1/L2 table in QED.  IOW, the cost 
> of traversing the refcount table to allocate a cluster is exactly 
> the cost of traversing all of the L1/L2 metadata to build a 
> freelist.  IOW, you're doing the equivalent of an fsck every time you 
> open a qcow2 file today.

No, L2 is O(logical size), refcount is O(physical size).

> It's very easy to neglect the details in something like qcow2.  We've 
> been talking like the refcount table is basically free to read and 
> write but it's absolutely not.  With large disk images, you're caching 
> an awful lot of metadata to read the refcount table in fully.
>
> If you reduce the reference count table to exactly two bits, you can 
> store that within the L1/L2 metadata since we have an extra 12 bits 
> worth of storage space.  Since you need the L1/L2 metadata anyway, we 
> might as well just use that space as the authoritative source of the 
> free list information.
>
> The only difference between qcow2 and qed is that since we use an 
> on-demand table for L1/L2, our free list may be non-contiguous.  Since 
> we store virtual -> physical instead of physical->virtual, you have to 
> do a full traversal with QED whereas with qcow2 you may get lucky.  
> However, the fact that the reference count table is contiguous in 
> qcow2 is a design flaw IMHO because it makes growth extremely painful 
> with large images to the point where I'll claim that qcow2 is probably 
> unusable by design with > 1TB disk images.

If you grow it in the background, it should be usable; since it happens 
once every 1TB worth of writes, it's not such a huge load.  I'll agree 
this increases complexity.

> We can optimize qed by having a contiguous freelist mapping 
> physical->virtual (that's just a bitmap, and therefore considerably 
> smaller) but making the freelist not authoritative.  That makes it 
> much faster because we don't add another sync, and lets us fall back to 
> the L1/L2 table for authoritative information if we had an unclean 
> shutdown.
>
> It's a good compromise for performance and it validates the qed 
> philosophy.  By starting with a correct and performant approach that 
> scales to large disk images, we can add features (like unmap) without 
> sacrificing either.

How would you implement the bitmap as a compatible feature?
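For illustration, a non-authoritative free-cluster bitmap along those lines might look roughly like this (a hypothetical sketch, not QED code; the on-disk compatibility question stands): the bitmap is only a hint, so it can be written lazily with no extra sync, and after an unclean shutdown it is rebuilt from the L1/L2 tables, which remain authoritative.

```python
class FreeBitmap:
    """Advisory physical->used bitmap; one bit per physical cluster."""

    def __init__(self, n_clusters):
        self.bits = bytearray(-(-n_clusters // 8))  # ceil(n/8) bytes

    def set_used(self, i):
        self.bits[i // 8] |= 1 << (i % 8)

    def is_used(self, i):
        return bool(self.bits[i // 8] & (1 << (i % 8)))

    def rebuild(self, l2_tables):
        # Authoritative recovery path: rescan every L2 mapping
        # (entry value 0 means unallocated) and mark referenced clusters.
        self.bits = bytearray(len(self.bits))
        for table in l2_tables:
            for phys in table:
                if phys:
                    self.set_used(phys)
```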

-- 
error compiling committee.c: too many arguments to function

