From: Anthony Liguori <anthony@codemonkey.ws>
To: Blue Swirl <blauwirbel@gmail.com>
Cc: Kevin Wolf <kwolf@redhat.com>,
Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>,
qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format
Date: Tue, 07 Sep 2010 15:41:55 -0500
Message-ID: <4C86A393.8000109@codemonkey.ws>
In-Reply-To: <AANLkTi=ZU83BYDkgT3=mEcx1HjRiH20fT3q6HFR-uS2A@mail.gmail.com>
On 09/07/2010 02:25 PM, Blue Swirl wrote:
> On Mon, Sep 6, 2010 at 10:04 AM, Stefan Hajnoczi
> <stefanha@linux.vnet.ibm.com> wrote:
>
>> QEMU Enhanced Disk format is a disk image format that forgoes features
>> found in qcow2 in favor of better levels of performance and data
>> integrity. Due to its simpler on-disk layout, it is possible to safely
>> perform metadata updates more efficiently.
>>
>> Installations, suspend-to-disk, and other allocation-heavy I/O workloads
>> will see increased performance due to fewer I/Os and syncs. Workloads
>> that do not cause new clusters to be allocated will perform similar to
>> raw images due to in-memory metadata caching.
>>
>> The format supports sparse disk images. It does not rely on the host
>> filesystem holes feature, making it a good choice for sparse disk images
>> that need to be transferred over channels where holes are not supported.
>>
>> Backing files are supported so only deltas against a base image can be
>> stored.
>>
>> The file format is extensible so that additional features can be added
>> later with graceful compatibility handling.
>>
>> Internal snapshots are not supported. This eliminates the need for
>> additional metadata to track copy-on-write clusters.
>>
> It would be nice to support external snapshots, so another file
> besides the disk images can store the snapshots. Then snapshotting
> would be available even with raw or QED disk images. This is of course
> not QED specific.
>
There are two types of snapshots that I think can cause confusion:
CPU/device state snapshots and block device snapshots.

qcow2 and qed both support block device snapshots. qed only supports
external snapshots (via backing_file) whereas qcow2 supports external
and internal snapshots. The internal snapshots are the source of an
incredible amount of complexity in the format.

qcow2 can also store CPU/device state snapshots and correlate them to
block device snapshots (within a single block device). It only supports
doing non-live CPU/device state snapshots.

OTOH, qemu can support live snapshotting via live migration. Today, it
can be used to snapshot CPU/device state to a file on the filesystem
with minimal downtime.

Combined with an external block snapshot and correlating data, this
could be used to implement a single "snapshot" command that would behave
like savevm but would not pause a guest's execution.

It's really just a matter of plumbing to expose an interface for this
today. We have all of the infrastructure we need.
>> + *
>> + * +--------+----------+----------+----------+-----+
>> + * | header | L1 table | cluster0 | cluster1 | ... |
>> + * +--------+----------+----------+----------+-----+
>> + *
>> + * There is a 2-level pagetable for cluster allocation:
>> + *
>> + *                     +----------+
>> + *                     | L1 table |
>> + *                     +----------+
>> + *                ,------'  |  '------.
>> + *           +----------+   |    +----------+
>> + *           | L2 table |  ...   | L2 table |
>> + *           +----------+        +----------+
>> + *       ,------'  |  '------.
>> + *  +----------+   |    +----------+
>> + *  |   Data   |  ...   |   Data   |
>> + *  +----------+        +----------+
>> + *
>> + * The L1 table is fixed size and always present. L2 tables are allocated on
>> + * demand. The L1 table size determines the maximum possible image size; it
>> + * can be influenced using the cluster_size and table_size values.
>>
> The formula for calculating the maximum size would be nice.
Each table entry is an 8-byte offset, so:

table_entries = table_size * cluster_size / 8
max_size = table_entries * table_entries * cluster_size

It's a hell of a lot easier to do powers-of-two math though. With a
table_size of 4 clusters and a cluster_size of 64KB:

table_entries = 2^2 * 2^16 / 2^3 = 2^15
max_size = 2^15 * 2^15 * 2^16 = 2^46 = 64TB
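
As a rough sketch, the arithmetic above could be written in C as follows.
The function names are made up for illustration and are not from the
patch; only cluster_size and table_size correspond to header fields, and
the sketch assumes the product does not overflow 64 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Number of 8-byte offsets that fit in one table (hypothetical helper). */
static uint64_t qed_table_entries(uint32_t cluster_size, uint32_t table_size)
{
    assert(cluster_size != 0 && table_size != 0);
    /* table_size is in clusters; each entry is a uint64_t offset. */
    return ((uint64_t)table_size * cluster_size) / sizeof(uint64_t);
}

/* Largest image the two-level table can address (hypothetical helper). */
static uint64_t qed_max_image_size(uint32_t cluster_size, uint32_t table_size)
{
    uint64_t entries = qed_table_entries(cluster_size, table_size);

    /* L1 entries * L2 entries per table * bytes per cluster.
     * No overflow checking here; real code would need it for large inputs. */
    return entries * entries * cluster_size;
}
```

Calling qed_max_image_size(65536, 4) reproduces the 2^46 figure worked
out by hand above.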
> Is the
> image_size the limit?
No.
> How many clusters can there be?
table_entries * table_entries
> What happens if
> the image_size is not equal to multiple of cluster size?
The code checks this and fails at open() or create() time.
> Wouldn't
> image_size be redundant if cluster_size and table_size determine the
> image size?
>
In a two-level table, if you make table_size the determining factor, the
image has to be a multiple of the space spanned by one L2 table, which
in the default case for qed is 2GB (2^15 entries * 64KB clusters).
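
To illustrate why image_size is not redundant, here is a small sketch
that computes how much guest data one L2 table spans and how many L2
tables a given image needs. The helper names are hypothetical and not
taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of guest data addressed by a single L2 table (hypothetical helper). */
static uint64_t qed_l2_span(uint32_t cluster_size, uint32_t table_size)
{
    uint64_t entries = ((uint64_t)table_size * cluster_size) / sizeof(uint64_t);

    return entries * cluster_size;
}

/* Number of L2 tables needed to map image_size bytes, rounding up. */
static uint64_t qed_l2_tables_needed(uint64_t image_size,
                                     uint32_t cluster_size,
                                     uint32_t table_size)
{
    uint64_t span = qed_l2_span(cluster_size, table_size);

    assert(span != 0);
    return image_size / span + (image_size % span != 0);
}
```

With 64KB clusters and a table_size of 4, a 3GB image needs two L2
tables, which together span 4GB. Without a separate image_size field the
image could only be described as 4GB.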
>> + *
>> + * All fields are little-endian on disk.
>> + */
>> +
>> +typedef struct {
>> + uint32_t magic; /* QED */
>> +
>> + uint32_t cluster_size; /* in bytes */
>>
> Doesn't cluster_size need to be a power of two?
>
Yes. It's enforced at open() and create() time but needs to be in the spec.
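
The two checks mentioned in this mail (cluster_size is a power of two,
image_size is a multiple of cluster_size) could look roughly like the
sketch below. This is not the code from the patch; the function names
are invented for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if v has exactly one bit set. */
static bool is_power_of_two(uint64_t v)
{
    return v != 0 && (v & (v - 1)) == 0;
}

/* Hypothetical open()/create() time sanity check on the header geometry. */
static bool qed_geometry_ok(uint32_t cluster_size, uint64_t image_size)
{
    if (!is_power_of_two(cluster_size)) {
        return false;
    }
    /* For a power-of-two cluster_size, masking is equivalent to modulo. */
    return (image_size & ((uint64_t)cluster_size - 1)) == 0;
}
```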
>> + uint32_t table_size; /* table size, in clusters */
>> + uint32_t first_cluster; /* first usable cluster */
>>
> This introduces some limits to the location of first cluster, with 4k
> clusters it must reside within the first 16TB. I guess it doesn't
> matter.
>
first_cluster is a bad name. It should be header_size and yeah, there
is a limit on header_size.
>> +
>> + uint64_t features; /* format feature bits */
>> + uint64_t compat_features; /* compatible feature bits */
>> + uint64_t l1_table_offset; /* L1 table offset, in bytes */
>> + uint64_t image_size; /* total image size, in bytes */
>> +
>> + uint32_t backing_file_offset; /* in bytes from start of header */
>> + uint32_t backing_file_size; /* in bytes */
>> + uint32_t backing_fmt_offset; /* in bytes from start of header */
>> + uint32_t backing_fmt_size; /* in bytes */
>> +} QEDHeader;
>> +
>> +typedef struct {
>> + uint64_t offsets[0]; /* in bytes */
>> +} QEDTable;
>>
> Is this for both L1 and L2 tables?
>
Yes, which has the nice advantage of simplifying the code quite a bit.
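
Because both levels are plain arrays of 64-bit offsets, splitting a guest
offset into table indices is the same kind of arithmetic at each level.
The sketch below is an assumption about how the indices could be derived
from the layout in the diagram. The mail does not spell this out, and the
QEDPos type and function name are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: where a guest offset lands in the tables. */
typedef struct {
    uint64_t l1_index;          /* which L2 table, via the L1 table */
    uint64_t l2_index;          /* which cluster within that L2 table */
    uint64_t offset_in_cluster; /* byte offset inside the data cluster */
} QEDPos;

static QEDPos qed_split_offset(uint64_t pos, uint32_t cluster_size,
                               uint32_t table_size)
{
    /* Entries per table: table bytes divided by the 8-byte entry size. */
    uint64_t entries = ((uint64_t)table_size * cluster_size) / sizeof(uint64_t);
    QEDPos p;

    assert(cluster_size != 0 && entries != 0);
    p.l1_index = pos / (entries * cluster_size);
    p.l2_index = (pos / cluster_size) % entries;
    p.offset_in_cluster = pos % cluster_size;
    return p;
}
```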
Regards,
Anthony Liguori