Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Anthony Liguori <anthony@codemonkey.ws>
To: Avi Kivity <avi@redhat.com>
Cc: Kevin Wolf <kwolf@redhat.com>,
	Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>,
	qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format
Date: Sun, 12 Sep 2010 15:18:11 -0500	[thread overview]
Message-ID: <4C8D3583.3020907@codemonkey.ws> (raw)
In-Reply-To: <4C8D1313.1050802@redhat.com>

On 09/12/2010 12:51 PM, Avi Kivity wrote:
>  On 09/12/2010 07:09 PM, Anthony Liguori wrote:
>> On 09/12/2010 10:56 AM, Avi Kivity wrote:
>>> No, the worst case is 0.003% allocated disk, with the allocated 
>>> clusters distributed uniformly.  That means all your L2s are 
>>> allocated, but almost none of your clusters are.
>>
>> But in this case, you're so sparse that your metadata is pretty much 
>> co-located which means seek performance won't matter much.
>
> You still get the rotational delay.  But yes, the hit is reduced.
>
>>
>>>>
>>>> But since you have to boot before you can run any serious test, if 
>>>> it takes 5 seconds to do an fsck(), it's highly likely that it's 
>>>> not even noticeable.
>>>
>>> What if it takes 300 seconds?
>>
>> That means for a 1TB disk you're taking 500ms per L2 entry, you're 
>> fully allocated and yet still doing an fsck.  That seems awfully 
>> unlikely.
>
> I meant for a fully populated L1.  That's 10ms per L2.

Your math is off.  A single L2 entry covers 2GB worth of logical space.  
That means a 1TB image consists of 512 L2s.  300 / 512 == .585 which is 
585ms.

That's a fully populated L1 on a 1TB image.

> But since that's 64TB, that's unlikely too.  It can still take 10s for 
> a 2TB disk.

Ah, you're talking about a 64TB image.  Recall that we can read L2s in 
parallel.  I have trouble imaging that we'd get serialized performance 
with a 64TB backing store.  It's much more likely you've got more than 
one spindle in this scenario.

>>> Why is in n^2?  It's still n*m.  If your image is 4TB instead of 
>>> 100GB, the time increases by a factor of 40 for both.
>>
>> It's n*m but either n ~= m in which case it's n^2 or m << n, in which 
>> case, it's just n, or m >> n in which case, it's just O(m).
>>
>> This is where asymptotic complexity ends up not being terribly 
>> helpful :-)
>>
>> Let me put this another way though, if you support internal 
>> snapshots, what's a reasonable number of snapshots to expect 
>> reasonable performance with?  10?  100?  1000? 10000?
>
> I'd say 10.  Not that I really want to support internal snapshots, it 
> doesn't work well with multiple disks.

I don't think that's reasonable.  The folks that I've talked to about 
snapshots seem to want to do crazy things like use it for 
checkpointing.  TBH, I think they're looking for the ability to do 
thousands of checkpoints with an efficient way to release old checkpoints.

I imagine that's the design point things like btrfs are trying to achieve.

> On the other hand, linear L2 (which now become L1) means your fsck is 
> just a linear scan of the table, which is probably faster than qcow2 
> allocation...

And this is just a data layout optimization which is the sort of thing 
that we should let performance data drive.

>> For a 1PB disk image with qcow2, the reference count table is 128GB.  
>> For a 1TB image, the reference count table is 128MB.   For a 128GB 
>> image, the reference table is 16MB which is why we get away with it 
>> today.
>>
>> Anytime you grow the freelist with qcow2, you have to write a brand 
>> new freelist table and update the metadata synchronously to point to 
>> a new version of it.  That means for a 1TB image, you're potentially 
>> writing out 128MB of data just to allocate a new cluster.
>>
>> s/freelist/refcount table/ to translate to current qcow2 
>> nomenclature.  This is certainly not fast.  You can add a bunch of 
>> free blocks each time you mitigate the growth but I can't of many 
>> circumstances where a 128MB write isn't going to be noticeable.  And 
>> it only gets worse as time moves on because 1TB disk images are 
>> already in use today.
>>
>
> That's a strong point.  qcow2 doubles on each allocation, it 
> amortizes, but the delay is certainly going to be noticable.
>
> You can do it ahead of time (so guest writes don't need to wait) but 
> it's still expensive.

The trouble is, safe growth of the reference count table is hard because 
it's contiguous.  That means you need to copy the table to another 
location all at once instead of just creating a new L1 table and reusing 
most of the existing L2 entries.

It's a damning flaw in the format for large images.  You can preallocate 
the whole thing up front to try to avoid the cost at run time but even 
then, that's a huge cost to pay in disk space up front.

>> It's very easy to neglect the details in something like qcow2.  We've 
>> been talking like the refcount table is basically free to read and 
>> write but it's absolutely not.  With large disk images, you're 
>> caching an awful lot of metadata to read the refcount table in fully.
>>
>> If you reduce the reference count table to exactly two bits, you can 
>> store that within the L1/L2 metadata since we have an extra 12 bits 
>> worth of storage space.  Since you need the L1/L2 metadata anyway, we 
>> might as well just use that space as the authoritative source of the 
>> free list information.
>>
>> The only difference between qcow2 and qed is that since we use an 
>> on-demand table for L1/L2, our free list may be non-contiguous.  
>> Since we store virtual -> physical instead of physical->virtual, you 
>> have to do a full transversal with QED whereas with qcow2 you may get 
>> lucky.  However, the fact that the reference count table is 
>> contiguous in qcow2 is a design flaw IMHO because it makes growth 
>> extremely painful with large images to the point where I'll claim 
>> that qcow2 is probably unusable by design with > 1TB disk images.
>
> If you grow it in the background, it should be usable; since it 
> happens once every 1TB worth of writes, it's not such a huge load.  
> I'll agree this is increasing complexity.

Trouble is, the reference count table is your authoritative source of 
whether something is free.  You can grow it in the background but if you 
need to allocate clusters before your done growing, you have to stall 
the request.  Those stalls can get very long on large disk images.

>> We can optimize qed by having a contiguous freelist mapping 
>> physical->virtual (that's just a bitmap, and therefore considerably 
>> smaller) but making the freelist not authoritative.  That makes it 
>> much faster because we don't add another sync and let's us fallback 
>> to the L1/L2 table for authoritative information if we had an unclean 
>> shutdown.
>>
>> It's a good compromise for performance and it validates the qed 
>> philosophy.  By starting with a correct and performant approach that 
>> scales to large disk images, we can add features (like unmap) without 
>> sacrificing either.
>
> How would you implement the bitmap as a compatible feature?

First, a refresher on the consistency model.

Metadata is not sync'd to disk when initially written because we try to 
avoid having stronger metadata consistency than data consistency.  We 
are forced to sync to enforce ordering when updating L1 with a new L2 
but that's rare.  If a guest attempts to sync data, we sync metadata too.

Because we may have unsync'd metadata, all L1 and L2 entries need to be 
scanned to try to find unreachable entries based on the start up file 
size *before* allocating any new clusters upon start-up (noting that we 
can read and rewrite existing clusters).

This is a correct design and performant this is where we start from.

The start up cost is undesirable, so to reduce the need to scan all 
entries up front, we add a feature that introduces a meta-data clean 
flag in the header.  The meta-data clean flag tells an implementation 
that there were no cluster allocations since the last sync() which means 
cluster allocation can now happen without searching for and correcting 
unreachable entries.

An implementation can now set the meta-data clean flag right before any 
sync() operation and unset the flag before the first cluster allocation 
(being careful to sync the unsetting of the flag).  This eliminates a 
lot of unnecessary scans in the case of safe shut down.

We can also set a timer after unsetting the meta-data clean flag to 
sync() after, say, 5 minutes.  This further decreases the number of 
times we have to scan so that the fsck() window only happens when power 
failure occurs within 5 minutes of the last cluster allocation.

N.B., this flag is purely an optional feature that is backwards 
compatible.  If an implementation ignores the meta-data clean flag, it 
just always scans for unreachable entries at startup.

This is still a correct design and we've eliminated the vast majority of 
start up scans.

The freelist would also be an optional feature.   If the freelist 
pointer is non-zero in the header, it means that an implementation can 
find free blocks by reading the location of the freelist pointer in the 
header.  We maintain the freelist in memory and we write it out right 
before any sync() operation.  Adding to the freelist in memory (i.e. 
UNMAP) would clear the freelist pointer on disk.  We could use this as 
an opportunity to schedule a future sync() just like as above.

We would actually write free list changes to disk while the freelist 
pointer was set to 0 such that when we did a sync(), we were just 
marking the freelist valid again.

An implementation that doesn't know about free list has to do a full 
scan to determine free blocks.  That said, an implementation can also 
just ignore free blocks entirely and always allocate from the end.

The freelist would be a cluster that also happens to be an free'd 
entry.  It would then contain a list of free clusters and would look 
something like:

struct freelist
{
      uint64_t next_cluster; /* zero to represent EOL */
      uint64_t num_entries;
      uint64_t entry[num_entries]; /* cluster offsets of free blocks */
};

It's important to not represent the full list in a contiguous region for 
the reasons discussed in other notes re: refcount table.

This remains a completely correct implementation.  We add no additional 
syncs above what we already have and we write out the freelist changes 
on demand.  The only window that would require a rebuild of the freelist 
would be a hard power failure 5 minutes since the last unmap.

One gotcha is that the freelist is double metadata which means that 
freeing a block requires an ordered write with respect to the L2 table 
update.  Otherwise, you could allocate something off of the freelist 
that an L2 entry still pointed to.

So there's a sync() in the UNMAP path, but that's unavoidable without a 
lot of cleverness.  You could potentially delay the sync until you 
reallocate the block but it's not clear that UNMAP needs to be fast (yet).

Regards,

Anthony Liguori

next prev parent reply	other threads:[~2010-09-12 20:18 UTC|newest]

Thread overview: 132+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-09-06 10:04 [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format Stefan Hajnoczi
2010-09-06 10:25 ` Alexander Graf
2010-09-06 10:31   ` Stefan Hajnoczi
2010-09-06 14:21   ` Luca Tettamanti
2010-09-06 14:24     ` Alexander Graf
2010-09-06 16:27       ` Anthony Liguori
2010-09-06 10:27 ` [Qemu-devel] " Kevin Wolf
2010-09-06 12:40   ` Stefan Hajnoczi
2010-09-06 12:57     ` Anthony Liguori
2010-09-06 13:02       ` Stefan Hajnoczi
2010-09-06 14:10       ` Kevin Wolf
2010-09-06 16:45         ` Anthony Liguori
2010-09-06 12:45   ` Anthony Liguori
2010-09-10 23:49     ` H. Peter Anvin
2010-09-06 11:18 ` [Qemu-devel] " Daniel P. Berrange
2010-09-06 12:52   ` Anthony Liguori
2010-09-06 13:35     ` Daniel P. Berrange
2010-09-06 16:38       ` Anthony Liguori
2010-09-06 13:06 ` Anthony Liguori
2010-09-07 14:51   ` Avi Kivity
2010-09-07 15:40     ` Anthony Liguori
2010-09-07 16:09       ` Avi Kivity
2010-09-07 16:25         ` Anthony Liguori
2010-09-07 22:27           ` Anthony Liguori
2010-09-08  8:23             ` Avi Kivity
2010-09-08  8:41               ` Alexander Graf
2010-09-08  8:53                 ` Avi Kivity
2010-09-08 11:15                   ` Stefan Hajnoczi
2010-09-08 15:38                     ` Christoph Hellwig
2010-09-08 16:30                       ` Anthony Liguori
2010-09-08 20:23                         ` Christoph Hellwig
2010-09-08 20:28                           ` Anthony Liguori
2010-09-09  2:35                             ` Christoph Hellwig
2010-09-09  6:24                               ` Avi Kivity
2010-09-09 21:01                                 ` Christoph Hellwig
2010-09-10 11:15                                   ` Avi Kivity
2010-09-09  6:53                     ` Avi Kivity
2010-09-10 21:22                     ` Jamie Lokier
2010-09-14 10:46                       ` Stefan Hajnoczi
2010-09-14 11:08                         ` Stefan Hajnoczi
2010-09-14 12:54                         ` Anthony Liguori
2010-09-08 12:55                   ` Anthony Liguori
2010-09-09  6:30                     ` Avi Kivity
2010-09-08 12:48               ` Anthony Liguori
2010-09-08 13:20                 ` Kevin Wolf
2010-09-08 13:26                   ` Anthony Liguori
2010-09-08 13:46                     ` Kevin Wolf
2010-09-09  6:45                 ` Avi Kivity
2010-09-09  6:48                   ` Avi Kivity
2010-09-09 12:49                   ` Anthony Liguori
2010-09-09 16:48                     ` [Qemu-devel] " Paolo Bonzini
2010-09-09 17:02                       ` Anthony Liguori
2010-09-09 20:56                         ` Christoph Hellwig
2010-09-10 10:53                         ` Avi Kivity
2010-09-10 11:14                     ` [Qemu-devel] " Avi Kivity
2010-09-10 11:25                       ` Avi Kivity
2010-09-10 11:33                         ` Stefan Hajnoczi
2010-09-10 11:43                           ` Avi Kivity
2010-09-10 13:22                             ` Anthony Liguori
2010-09-10 13:48                               ` Christoph Hellwig
2010-09-10 15:02                                 ` Anthony Liguori
2010-09-10 15:18                                   ` Kevin Wolf
2010-09-10 15:53                                     ` Anthony Liguori
2010-09-10 16:05                                       ` Kevin Wolf
2010-09-10 17:10                                         ` Anthony Liguori
2010-09-10 17:44                                           ` Kevin Wolf
2010-09-10 17:46                                           ` Miguel Di Ciurcio Filho
2010-09-10 14:02                               ` Avi Kivity
2010-09-10 13:47                           ` Christoph Hellwig
2010-09-10 14:05                             ` Avi Kivity
2010-09-10 14:12                               ` Christoph Hellwig
2010-09-10 14:24                                 ` Avi Kivity
2010-09-10 13:16                         ` Anthony Liguori
2010-09-10 14:06                           ` Avi Kivity
2010-09-10 11:43                       ` Stefan Hajnoczi
2010-09-10 12:06                         ` Avi Kivity
2010-09-10 13:28                           ` Anthony Liguori
2010-09-10 12:12                         ` Kevin Wolf
2010-09-10 12:35                           ` Stefan Hajnoczi
2010-09-10 12:47                             ` Avi Kivity
2010-09-10 13:10                               ` Stefan Hajnoczi
2010-09-10 13:19                                 ` Avi Kivity
2010-09-10 13:39                               ` Anthony Liguori
2010-09-10 13:52                                 ` Christoph Hellwig
2010-09-10 13:56                                 ` Avi Kivity
2010-09-10 13:48                             ` Kevin Wolf
2010-09-10 13:14                       ` Anthony Liguori
2010-09-10 13:47                         ` Avi Kivity
2010-09-10 14:56                           ` Anthony Liguori
2010-09-10 15:49                             ` Avi Kivity
2010-09-10 17:07                               ` Anthony Liguori
2010-09-10 17:42                                 ` Kevin Wolf
2010-09-10 19:33                                   ` Anthony Liguori
2010-09-13 10:41                                     ` Kevin Wolf
2010-09-12 13:24                                 ` Avi Kivity
2010-09-12 15:13                                   ` Anthony Liguori
2010-09-12 15:56                                     ` Avi Kivity
2010-09-12 17:09                                       ` Anthony Liguori
2010-09-12 17:51                                         ` Avi Kivity
2010-09-12 20:18                                           ` Anthony Liguori [this message]
2010-09-13  9:24                                             ` Avi Kivity
2010-09-13 11:28                                         ` Kevin Wolf
2010-09-13 11:34                                           ` Avi Kivity
2010-09-13 11:48                                             ` Kevin Wolf
2010-09-13 13:19                                               ` Anthony Liguori
2010-09-13 13:12                                           ` Anthony Liguori
2010-09-13 11:03                                       ` Kevin Wolf
2010-09-13 13:07                                         ` Anthony Liguori
2010-09-13 13:24                                           ` Kevin Wolf
2010-09-07 16:12     ` Anthony Liguori
2010-09-07 21:35       ` Christoph Hellwig
2010-09-07 22:29         ` Anthony Liguori
2010-09-07 22:40           ` Christoph Hellwig
2010-09-08 15:07     ` Stefan Hajnoczi
2010-09-09  6:59       ` Avi Kivity
2010-09-09 17:43         ` Anthony Liguori
2010-09-09 20:46           ` Christoph Hellwig
2010-09-10 11:22           ` Avi Kivity
2010-09-10 11:29             ` Stefan Hajnoczi
2010-09-10 11:37               ` Avi Kivity
2010-09-07 13:58 ` Avi Kivity
2010-09-07 19:25 ` Blue Swirl
2010-09-07 20:41   ` Anthony Liguori
2010-09-08  7:48     ` Kevin Wolf
2010-09-08 15:37   ` Stefan Hajnoczi
2010-09-08 18:24     ` Blue Swirl
2010-09-08 18:35       ` Anthony Liguori
2010-09-08 18:56         ` Blue Swirl
2010-09-08 19:19           ` Anthony Liguori
2010-09-15 21:01 ` [Qemu-devel] " Michael S. Tsirkin
2010-09-15 21:12   ` Anthony Liguori
  -- strict thread matches above, loose matches on Subject: below --
2010-09-17  3:51 [Qemu-devel] " Khoa Huynh

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4C8D3583.3020907@codemonkey.ws \
    --to=anthony@codemonkey.ws \
    --cc=avi@redhat.com \
    --cc=kwolf@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=stefanha@linux.vnet.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.