From: Anthony Liguori <anthony@codemonkey.ws>
To: Blue Swirl <blauwirbel@gmail.com>
Cc: Kevin Wolf <kwolf@redhat.com>,
Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>,
qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format
Date: Tue, 07 Sep 2010 15:41:55 -0500
Message-ID: <4C86A393.8000109@codemonkey.ws>
In-Reply-To: <AANLkTi=ZU83BYDkgT3=mEcx1HjRiH20fT3q6HFR-uS2A@mail.gmail.com>
On 09/07/2010 02:25 PM, Blue Swirl wrote:
> On Mon, Sep 6, 2010 at 10:04 AM, Stefan Hajnoczi
> <stefanha@linux.vnet.ibm.com> wrote:
>
>> QEMU Enhanced Disk format is a disk image format that forgoes features
>> found in qcow2 in favor of better levels of performance and data
>> integrity. Due to its simpler on-disk layout, it is possible to safely
>> perform metadata updates more efficiently.
>>
>> Installations, suspend-to-disk, and other allocation-heavy I/O workloads
>> will see increased performance due to fewer I/Os and syncs. Workloads
>> that do not cause new clusters to be allocated will perform similar to
>> raw images due to in-memory metadata caching.
>>
>> The format supports sparse disk images. It does not rely on the host
>> filesystem holes feature, making it a good choice for sparse disk images
>> that need to be transferred over channels where holes are not supported.
>>
>> Backing files are supported so only deltas against a base image can be
>> stored.
>>
>> The file format is extensible so that additional features can be added
>> later with graceful compatibility handling.
>>
>> Internal snapshots are not supported. This eliminates the need for
>> additional metadata to track copy-on-write clusters.
>>
> It would be nice to support external snapshots, so another file
> besides the disk images can store the snapshots. Then snapshotting
> would be available even with raw or QED disk images. This is of course
> not QED specific.
>
There are two types of snapshots here, and I think they can cause confusion:
CPU/device state snapshots and block device snapshots.

qcow2 and qed both support block device snapshots. qed only supports
external snapshots (via backing_file) whereas qcow2 supports external
and internal snapshots. The internal snapshots are the source of an
incredible amount of complexity in the format.

qcow2 can also store CPU/device state snapshots and correlate them to
block device snapshots (within a single block device). It only supports
doing non-live CPU/device state snapshots.

OTOH, qemu can support live snapshotting via live migration. Today, it
can be used to snapshot CPU/device state to a file on the filesystem
with minimal downtime.

Combined with an external block snapshot and correlating data, this
could be used to implement a single "snapshot" command that would behave
like savevm but would not pause a guest's execution.

It's really just a matter of plumbing to expose an interface for this
today. We have all of the infrastructure we need.

>> + *
>> + * +--------+----------+----------+----------+-----+
>> + * | header | L1 table | cluster0 | cluster1 | ... |
>> + * +--------+----------+----------+----------+-----+
>> + *
>> + * There is a 2-level pagetable for cluster allocation:
>> + *
>> + * +----------+
>> + * | L1 table |
>> + * +----------+
>> + * ,------' | '------.
>> + * +----------+ | +----------+
>> + * | L2 table | ... | L2 table |
>> + * +----------+ +----------+
>> + * ,------' | '------.
>> + * +----------+ | +----------+
>> + * | Data | ... | Data |
>> + * +----------+ +----------+
>> + *
>> + * The L1 table is fixed size and always present. L2 tables are allocated on
>> + * demand. The L1 table size determines the maximum possible image size; it
>> + * can be influenced using the cluster_size and table_size values.
>>
> The formula for calculating the maximum size would be nice.
table_entries = table_size * cluster_size / 8
max_size = table_entries * table_entries * cluster_size

It's a hell of a lot easier to do the math in powers of two, though. With the
default table_size = 2^2 clusters, cluster_size = 2^16 bytes, and 8-byte
(2^3) table entries:

table_entries = 2^2 * 2^16 / 2^3 = 2^15
max_size = 2^15 * 2^15 * 2^16 = 2^46 = 64TB
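
For anyone who wants to plug in other values, the arithmetic above can be
sketched in C. The helper names are hypothetical and not taken from the
patch; the only assumption beyond the formula is that each table entry is a
64-bit offset, as in QEDTable.

```c
#include <stdint.h>

/* Hypothetical helpers for illustration; not part of the patch. */

/* Number of 8-byte offsets that fit in one table of table_size clusters. */
static uint64_t qed_table_entries(uint32_t cluster_size, uint32_t table_size)
{
    return (uint64_t)table_size * cluster_size / sizeof(uint64_t);
}

/* L1 entries * L2 entries per table * bytes per data cluster. */
static uint64_t qed_max_image_size(uint32_t cluster_size, uint32_t table_size)
{
    uint64_t entries = qed_table_entries(cluster_size, table_size);

    return entries * entries * cluster_size;
}
```

With cluster_size = 65536 and table_size = 4 this gives 2^15 entries per
table and a 2^46 byte limit, matching the numbers above.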
> Is the
> image_size the limit?
No.
> How many clusters can there be?
table_entries * table_entries
> What happens if
> the image_size is not equal to multiple of cluster size?
The code checks this and fails at open() or create() time.
> Wouldn't
> image_size be redundant if cluster_size and table_size determine the
> image size?
>
In a two-level table, if you make table_size the determining factor, the
image size has to be a multiple of the space spanned by one L2 table, which
in the default case for qed is 2GB.
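
To make the 2GB figure concrete, here is a rough sketch of the granularity
you would be stuck with if the image size came from table geometry alone.
The function names are made up for illustration, and 8-byte table entries
are assumed.

```c
#include <stdint.h>

/* Hypothetical helpers for illustration; not part of the patch. */

/* Bytes of guest data addressed by one completely filled L2 table. */
static uint64_t qed_l2_span(uint32_t cluster_size, uint32_t table_size)
{
    uint64_t entries = (uint64_t)table_size * cluster_size / sizeof(uint64_t);

    return entries * cluster_size;
}

/* Smallest multiple of the L2 span that can hold image_size bytes. */
static uint64_t qed_round_up_to_l2_span(uint64_t image_size,
                                        uint32_t cluster_size,
                                        uint32_t table_size)
{
    uint64_t span = qed_l2_span(cluster_size, table_size);

    return (image_size + span - 1) / span * span;
}
```

With the default geometry a 3GB image would have to be presented as 4GB,
which is why image_size is kept as a separate header field.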
>> + *
>> + * All fields are little-endian on disk.
>> + */
>> +
>> +typedef struct {
>> + uint32_t magic; /* QED */
>> +
>> + uint32_t cluster_size; /* in bytes */
>>
> Doesn't cluster_size need to be a power of two?
>
Yes. It's enforced at open() and create() time but needs to be in the spec.
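
Roughly, the kind of check I mean looks like the sketch below. This is not
the code from the patch: the function names are hypothetical, and the real
open()/create() paths may enforce further limits that are not shown here.
It covers the power-of-two rule and the image_size alignment rule mentioned
earlier.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers for illustration; not part of the patch. */

static bool qed_is_power_of_two(uint64_t v)
{
    /* A power of two has exactly one bit set. */
    return v != 0 && (v & (v - 1)) == 0;
}

static bool qed_geometry_is_valid(uint32_t cluster_size, uint64_t image_size)
{
    if (!qed_is_power_of_two(cluster_size)) {
        return false;
    }
    /* cluster_size is a power of two, so a mask tests the alignment. */
    return (image_size & (cluster_size - 1)) == 0;
}
```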
>> + uint32_t table_size; /* table size, in clusters */
>> + uint32_t first_cluster; /* first usable cluster */
>>
> This introduces some limits to the location of first cluster, with 4k
> clusters it must reside within the first 16TB. I guess it doesn't
> matter.
>
first_cluster is a bad name. It should be header_size and yeah, there
is a limit on header_size.
>> +
>> + uint64_t features; /* format feature bits */
>> + uint64_t compat_features; /* compatible feature bits */
>> + uint64_t l1_table_offset; /* L1 table offset, in bytes */
>> + uint64_t image_size; /* total image size, in bytes */
>> +
>> + uint32_t backing_file_offset; /* in bytes from start of header */
>> + uint32_t backing_file_size; /* in bytes */
>> + uint32_t backing_fmt_offset; /* in bytes from start of header */
>> + uint32_t backing_fmt_size; /* in bytes */
>> +} QEDHeader;
>> +
>> +typedef struct {
>> + uint64_t offsets[0]; /* in bytes */
>> +} QEDTable;
>>
> Is this for both L1 and L2 tables?
>
Yes, which has the nice advantage of simplifying the code quite a bit.
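
Because both levels share one layout, a lookup can index either table the
same way. As a sketch only, assuming the two-level layout from the quoted
comment and 8-byte entries, a guest byte position could be split like this.
QEDPosition and qed_split_position are hypothetical names, not something in
the patch.

```c
#include <stdint.h>

/* Hypothetical type and helper for illustration; not part of the patch. */
typedef struct {
    uint64_t l1_index;       /* which L2 table, i.e. index into the L1 table */
    uint64_t l2_index;       /* which data cluster, i.e. index into that L2 table */
    uint64_t cluster_offset; /* byte offset inside the data cluster */
} QEDPosition;

static QEDPosition qed_split_position(uint64_t pos, uint32_t cluster_size,
                                      uint32_t table_size)
{
    uint64_t entries = (uint64_t)table_size * cluster_size / sizeof(uint64_t);
    uint64_t cluster_nr = pos / cluster_size;
    QEDPosition p;

    p.l1_index = cluster_nr / entries;
    p.l2_index = cluster_nr % entries;
    p.cluster_offset = pos % cluster_size;
    return p;
}
```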
Regards,
Anthony Liguori