From: Avi Kivity <avi@redhat.com>
To: Anthony Liguori <anthony@codemonkey.ws>
Cc: Kevin Wolf <kwolf@redhat.com>,
Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>,
qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format
Date: Tue, 07 Sep 2010 17:51:51 +0300
Message-ID: <4C865187.6090508@redhat.com>
In-Reply-To: <4C84E738.3020802@codemonkey.ws>
On 09/06/2010 04:06 PM, Anthony Liguori wrote:
>
> Another point worth mentioning is that our intention is to have a
> formal specification of the format before merging. A start of that is
> located at http://wiki.qemu.org/Features/QED
>
> =Specification=
>
> The file format looks like this:
>
> +---------+---------+---------+-----+
> | extent0 | extent1 | extent2 | ... |
> +---------+---------+---------+-----+
>
> The first extent contains a header. The header contains information
> about the first data extent. A data extent may be a data cluster, an
> L2, or an L1 table. L1 and L2 tables are composed of one or more
> contiguous extents.
>
> ==Header==
> Header {
> uint32_t magic; /* QED\0 */
Endianness?
>
> uint32_t cluster_size; /* in bytes */
Does cluster == extent? If so, use the same terminology. If not, explain.
Usually extent is a variable size structure.
> uint32_t table_size; /* table size, in clusters */
Presumably L1 table size? Or any table size?
Hm. It would be nicer not to require contiguous sectors anywhere. How
about a variable- or fixed-height tree?
> uint32_t first_cluster; /* in clusters */
First cluster of what?
>
> uint64_t features; /* format feature bits */
> uint64_t compat_features; /* compat feature bits */
> uint64_t l1_table_offset; /* L1 table offset, in clusters */
> uint64_t image_size; /* total image size, in clusters */
Logical, yes?
Is the physical image size always derived from the host file metadata?
Is this always safe?
> /* if (features & QED_F_BACKING_FILE) */
> uint32_t backing_file_offset; /* in bytes from start of header */
> uint32_t backing_file_size; /* in bytes */
It's really the filename size, not the file size. Also, make a note
that it is not zero terminated.
>
> /* if (compat_features & QED_CF_BACKING_FORMAT) */
> uint32_t backing_fmt_offset; /* in bytes from start of header */
> uint32_t backing_fmt_size; /* in bytes */
Why not make it mandatory?
> }
Need a checksum for the header.
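To make the endianness and checksum points concrete, here is a minimal sketch. It assumes little-endian on-disk fields and a CRC-32 stored in the last four bytes of the header; neither is in the proposal, and read_le32(), crc32_ieee() and header_checksum_ok() are names I made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode a 32-bit little-endian field regardless of host byte order. */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Bitwise CRC-32 (IEEE 802.3 polynomial, reflected). Slow but tiny. */
static uint32_t crc32_ieee(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int bit = 0; bit < 8; bit++) {
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

/*
 * Hypothetical layout: the last 4 bytes of the header hold the CRC of
 * everything before them. Returns 1 if the header is intact.
 */
static int header_checksum_ok(const uint8_t *hdr, size_t hdr_len)
{
    if (hdr_len < 4) {
        return 0;
    }
    return crc32_ieee(hdr, hdr_len - 4) == read_le32(hdr + hdr_len - 4);
}
```

Decoding byte by byte avoids any dependence on host byte order, and the same check works for the table checksums mentioned below.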
>
> ==Extent table==
>
> #define TABLE_NOFFSETS (table_size * cluster_size / sizeof(uint64_t))
>
> Table {
> uint64_t offsets[TABLE_NOFFSETS];
> }
It's fashionable to put checksums here.
Do we want a real extent-based format like modern filesystems? So after
defragmentation a full image has O(1) metadata?
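For comparison, a real extent-based mapping might look like the sketch below. The struct and function names are hypothetical and not part of the proposal. Each record maps a run of contiguous clusters, so a fully defragmented image needs a single record.

```c
#include <stddef.h>
#include <stdint.h>

/* One record maps `length` contiguous logical clusters to physical ones. */
struct extent {
    uint64_t logical_start;   /* first logical cluster covered */
    uint64_t physical_start;  /* physical cluster backing logical_start */
    uint64_t length;          /* number of clusters in the run */
};

/*
 * Binary search over extents sorted by logical_start (non-overlapping).
 * Returns 1 and stores the physical cluster on a hit, 0 if unallocated.
 */
static int extent_lookup(const struct extent *map, size_t count,
                         uint64_t logical, uint64_t *physical)
{
    size_t lo = 0, hi = count;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const struct extent *e = &map[mid];

        if (logical < e->logical_start) {
            hi = mid;
        } else if (logical - e->logical_start >= e->length) {
            lo = mid + 1;
        } else {
            *physical = e->physical_start + (logical - e->logical_start);
            return 1;
        }
    }
    return 0;
}
```

Lookup cost depends on the number of extents rather than on the image size, which is where the O(1) metadata for a defragmented image comes from.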
>
> The extent tables are organized as follows:
>
> +----------+
> | L1 table |
> +----------+
> ,------' | '------.
> +----------+ | +----------+
> | L2 table | ... | L2 table |
> +----------+ +----------+
> ,------' | '------.
> +----------+ | +----------+
> | Data | ... | Data |
> +----------+ +----------+
>
> The table_size field allows tables to be multiples of the cluster
> size. For example, cluster_size=64 KB and table_size=4 results in 256
> KB tables.
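Working through the numbers for that example, as a sketch built on the TABLE_NOFFSETS definition quoted above (the helper names are mine, and I am assuming every L1 entry points at a full L2 table):

```c
#include <stdint.h>

/* Entries per table: table_size * cluster_size / sizeof(uint64_t). */
static uint64_t table_noffsets(uint32_t table_size, uint32_t cluster_size)
{
    return (uint64_t)table_size * cluster_size / sizeof(uint64_t);
}

/* Bytes of guest data reachable through one L2 table. */
static uint64_t l2_coverage(uint32_t table_size, uint32_t cluster_size)
{
    return table_noffsets(table_size, cluster_size) * cluster_size;
}

/* Largest image, in bytes, that a single L1 table can map. */
static uint64_t max_image_size(uint32_t table_size, uint32_t cluster_size)
{
    return table_noffsets(table_size, cluster_size) *
           l2_coverage(table_size, cluster_size);
}
```

With cluster_size=64 KB and table_size=4 this gives 32768 entries per table, 2 GB of data per L2 table, and a 64 TB limit per image.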
>
> =Operations=
>
> ==Read==
> # If L2 table is not present in L1, read from backing image.
> # If data cluster is not present in L2, read from backing image.
> # Otherwise read data from cluster.
If not in the backing image either, provide zeros.
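The read rules, including the zero case, can be sketched as below. This assumes the tables are already loaded in memory and that an offset of 0 means "unallocated"; both are my assumptions, and resolve_read() is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

enum read_source { READ_CLUSTER, READ_BACKING, READ_ZEROS };

/*
 * Decide where a read of logical cluster `index` is served from.
 * l1[i] points at an in-memory L2 table, or is NULL if unallocated;
 * an L2 entry of 0 means "no data cluster". `n` is entries per table.
 */
static enum read_source resolve_read(uint64_t *const *l1, size_t n,
                                     uint64_t index, int has_backing,
                                     uint64_t *cluster_offset)
{
    const uint64_t *l2 = (index / n < n) ? l1[index / n] : NULL;

    if (l2 != NULL && l2[index % n] != 0) {
        *cluster_offset = l2[index % n];
        return READ_CLUSTER;
    }
    /* Unallocated: fall through to the backing image, else zeros. */
    return has_backing ? READ_BACKING : READ_ZEROS;
}
```

A read that falls beyond the end of a shorter backing image would need the same zero treatment; the sketch leaves that to the caller.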
>
> ==Write==
> # If L2 table is not present in L1, allocate new cluster and L2.
> Perform L2 and L1 link after writing data.
> # If data cluster is not present in L2, allocate new cluster. Perform
> L1 link after writing data.
> # Otherwise overwrite data cluster.
Detail copy-on-write from backing image.
On a partial write without a backing file, do we recommend zero-filling
the cluster (to avoid intra-cluster fragmentation)?
>
> The L2 link '''should''' be made after the data is in place on
> storage. However, when no ordering is enforced the worst case
> scenario is an L2 link to an unwritten cluster.
Or it may cause corruption if the physical file size is not committed,
and L2 now points at a free cluster.
>
> The L1 link '''must''' be made after the L2 cluster is in place on
> storage. If the order is reversed then the L1 table may point to a
> bogus L2 table. (Is this a problem since clusters are allocated at
> the end of the file?)
>
> ==Grow==
> # If table_size * TABLE_NOFFSETS < new_image_size, fail -EOVERFLOW.
> The L1 table is not big enough.
With a variable-height tree, we allocate a new root, link its first
entry to the old root, and write the new header with updated root and
height.
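Roughly, as an in-memory sketch (the struct layout, the fan-out of 4 and tree_grow() are made up for illustration; on disk the last two assignments would be a single header write):

```c
#include <stdint.h>
#include <stdlib.h>

struct node {
    struct node *child[4];   /* tiny fan-out, for illustration only */
};

struct tree {
    struct node *root;
    unsigned height;         /* number of index levels above the data */
};

/*
 * Add one level: allocate a new root, link its first entry to the old
 * root, then publish the new root and height (the "header write").
 * Returns 0 on success, -1 if allocation fails.
 */
static int tree_grow(struct tree *t)
{
    struct node *new_root = calloc(1, sizeof(*new_root));

    if (new_root == NULL) {
        return -1;
    }
    new_root->child[0] = t->root;
    t->root = new_root;
    t->height++;
    return 0;
}
```

Because the old root becomes entry 0 of the new one, existing logical addresses stay valid and no table needs to be reallocated contiguously.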
> # Write new image_size header field.
>
> =Data integrity=
> ==Write==
> Writes that complete before a flush must be stable when the flush
> completes.
>
> If storage is interrupted (e.g. power outage) then writes in progress
> may be lost, stable, or partially completed. The storage must not be
> otherwise corrupted or inaccessible after it is restarted.
We can remove this requirement by copying-on-write any metadata write,
and keeping two copies of the header (with version numbers and
checksums). Enterprise storage will not corrupt on writes, but
commodity storage may.
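Choosing between the two header copies on open could look like this sketch. The struct, the toy checksum and pick_header() are placeholders of mine; a real format would use a proper CRC over the full header.

```c
#include <stddef.h>
#include <stdint.h>

struct header_copy {
    uint64_t version;    /* incremented on every header update */
    uint32_t payload;    /* stand-in for the real header fields */
    uint32_t checksum;   /* checksum over version and payload */
};

/* Toy checksum for illustration; a real format would use a CRC. */
static uint32_t copy_checksum(const struct header_copy *h)
{
    return (uint32_t)(h->version ^ (h->version >> 32)) ^ h->payload ^ 0x5145u;
}

/*
 * Pick the valid copy with the highest version. A torn write to one
 * copy fails its checksum, so the other (older) copy is used instead.
 * Returns NULL if both copies are damaged.
 */
static const struct header_copy *pick_header(const struct header_copy *a,
                                             const struct header_copy *b)
{
    int a_ok = copy_checksum(a) == a->checksum;
    int b_ok = copy_checksum(b) == b->checksum;

    if (a_ok && b_ok) {
        return a->version >= b->version ? a : b;
    }
    if (a_ok) {
        return a;
    }
    return b_ok ? b : NULL;
}
```

Updates always overwrite the older copy, so at most one copy is in flight at a time and an interrupted write leaves the previous header usable.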
--
error compiling committee.c: too many arguments to function