From: Anthony Liguori <anthony@codemonkey.ws>
To: Avi Kivity <avi@redhat.com>
Cc: Kevin Wolf <kwolf@redhat.com>,
Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>,
qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format
Date: Tue, 07 Sep 2010 17:27:55 -0500
Message-ID: <4C86BC6B.5010809@codemonkey.ws>
In-Reply-To: <4C866773.2030103@codemonkey.ws>

On 09/07/2010 11:25 AM, Anthony Liguori wrote:
> On 09/07/2010 11:09 AM, Avi Kivity wrote:
>> On 09/07/2010 06:40 PM, Anthony Liguori wrote:
>>>>
>>>> Need a checksum for the header.
>>>
>>> Is that not a bit overkill for what we're doing? What's the benefit?
>>
>> Make sure we're not looking at a header write interrupted by a crash.
>
> Couldn't hurt I guess. I don't think it's actually needed for L1/L2
> tables FWIW.
>
>>>>> The L2 link '''should''' be made after the data is in place on
>>>>> storage. However, when no ordering is enforced the worst case
>>>>> scenario is an L2 link to an unwritten cluster.
>>>>
>>>> Or it may cause corruption if the physical file size is not
>>>> committed, and L2 now points at a free cluster.
>>>
>>> An fsync() will make sure the physical file size is committed. The
>>> metadata does not carry any additional integrity guarantees over the
>>> actual disk data except that in order to avoid internal corruption,
>>> we have to order the L2 and L1 writes.
>>
>> I was referring to "when no ordering is enforced, the worst case
>> scenario is an L2 link to an unwritten cluster". This isn't true -
>> worst case you point to an unallocated cluster which can then be
>> claimed by data or metadata.
>
> Right, it's necessary to do an fsync to protect against this. To make
> this user friendly, we could have a dirty bit in the header which gets
> set on first metadata write and then cleared on clean shutdown.
>
> Upon startup, if the dirty bit is set, we do an fsck.
>
>>>> We can remove this requirement by copying-on-write any metadata
>>>> write, and keeping two copies of the header (with version numbers
>>>> and checksums).
>>>
>>> QED has a property today that all metadata or cluster locations have
>>> a single location on the disk format that is immutable. Defrag
>>> would relax this but defrag can be slow.
>>>
>>> Having an immutable on-disk location is a powerful property which
>>> eliminates a lot of complexity with respect to reference counting
>>> and dealing with free lists.
>>
>> However, it exposes the format to "writes may corrupt overwritten data".
>
> No, you never write an L2 entry once it's been set. If an L2 entry
> isn't set, the contents of the cluster are all zeros.
>
> If you write data to allocate an L2 entry, until you do a flush(), the
> data can either be what was written or all zeros.
>
>>> For the initial design I would avoid introducing something like
>>> this. One of the nice things about features is that we can
>>> introduce multi-level trees as a future feature if we really think
>>> it's the right thing to do.
>>>
>>> But we should start at a simple design with high confidence and high
>>> performance, and then introduce features with the burden that we're
>>> absolutely sure that we don't regress integrity or performance.
>>
>> For most things, yes. Metadata checksums should be designed in
>> though (since we need to double the pointer size).
>>
>> Variable height trees have the nice property that you don't need
>> multi cluster allocation. It's nice to avoid large L2s for very
>> large disks.
>
> FWIW, L2s are 256K at the moment and with a two level table, it can
> support 5PB of data.

I clearly suck at basic math today. The image format currently supports
64TB. Dropping to 128K tables would reduce that to 16TB, and 64K tables
would give 4TB.
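
A quick sketch of the arithmetic behind those figures. It assumes 64K
clusters and 8-byte table entries, neither of which is spelled out above,
and the function name is made up for illustration:

```python
KIB = 1024
TIB = 1024 ** 4

def max_image_size(table_size, cluster_size=64 * KIB, entry_size=8):
    """Bytes addressable through a two-level (L1 -> L2 -> cluster) table.

    cluster_size and entry_size are assumptions, not values from the thread.
    """
    entries = table_size // entry_size     # pointers that fit in one table
    l2_coverage = entries * cluster_size   # data reachable from one L2 table
    return entries * l2_coverage           # one L1 table full of L2 pointers

for table_kib in (256, 128, 64):
    size = max_image_size(table_kib * KIB)
    print(f"{table_kib}K tables -> {size // TIB}TB")
```

Under those assumptions, halving the table size halves both the number of
L1 entries and the coverage of each L2 table, so capacity drops by a
factor of four each time.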

BTW, I don't think your checksumming idea is sound. If you store a
64-bit checksum alongside each pointer, it becomes necessary to update
the parent pointer every time the table changes. This introduces an
ordering requirement, which means you need to fsync() the file every
time you update an L2 entry.

Today, we only need to fsync() when we first allocate an L2 table
(because table locations never change). From a performance perspective,
it's the difference between an fsync() every 64K vs. every 2GB.
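
To put rough numbers on that, here is a deliberately simplified model of a
sequential allocating write. It assumes 64K clusters, 256K tables with
8-byte entries (so one L2 table covers 2GB), and exactly one ordering
fsync() per triggering event. The names are hypothetical and this is not
how the driver counts anything:

```python
KIB = 1024
GIB = 1024 ** 3

CLUSTER_SIZE = 64 * KIB                  # assumed cluster size
L2_ENTRIES = (256 * KIB) // 8            # assumed 8-byte entries, 256K table
L2_COVERAGE = L2_ENTRIES * CLUSTER_SIZE  # 2GB of data per L2 table

def fsyncs_needed(bytes_written, checksummed_pointers):
    """Count ordering fsyncs for a sequential allocating write from offset 0."""
    clusters = -(-bytes_written // CLUSTER_SIZE)   # ceiling division
    l2_tables = -(-bytes_written // L2_COVERAGE)
    # With a checksum stored next to each pointer, every new L2 entry changes
    # the table contents, so the parent must be updated in order each time.
    # Without it, only linking a newly allocated L2 table needs ordering.
    return clusters if checksummed_pointers else l2_tables

print(fsyncs_needed(8 * GIB, checksummed_pointers=False))
print(fsyncs_needed(8 * GIB, checksummed_pointers=True))
```

In this model, filling 8GB costs a handful of syncs without per-pointer
checksums and one sync per allocated cluster with them.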

Plus, doesn't btrfs do block-level checksumming? IOW, if you run a
workload where you care about this level of data integrity validation,
you would be fine with btrfs + qed.

Since the majority of file systems don't do metadata checksumming, it's
not obvious to me that we should be. I think one of the critical flaws
in qcow2 was trying to invent a better filesystem within qemu instead of
just sticking to a very simple and obviously correct format and letting
the FS folks do the really fancy stuff.

Regards,

Anthony Liguori