From: Kevin Wolf <kwolf@redhat.com>
To: Chunqiang Tang <ctang@us.ibm.com>
Cc: Stefan Hajnoczi <stefanha@gmail.com>,
qemu-devel@nongnu.org, Markus Armbruster <armbru@redhat.com>,
Aurelien Jarno <aurelien@aurel32.net>
Subject: Re: [Qemu-devel] Re: Strategic decision: COW format
Date: Mon, 14 Mar 2011 11:12:35 +0100
Message-ID: <4D7DEA13.6060801@redhat.com>
In-Reply-To: <OF03D5ED77.46666A58-ON85257852.0013F083-85257852.001FF9A6@us.ibm.com>
On 13.03.2011 06:51, Chunqiang Tang wrote:
> After the heated debate, I thought more about the right approach of
> implementing snapshot, and it becomes clear to me that there are major
> limitations with both VMDK's external snapshot approach (which stores each
> snapshot as a separate CoW file) and QCOW2's internal snapshot approach
> (which stores all snapshots in one file and uses a reference count table
> to keep track of them). I just posted to the mailing list a patch that
> implements internal snapshot in FVD but does it in a way without the
> limitations of VMDK and QCOW2.
>
> Let's first list the properties of an ideal virtual disk snapshot
> solution, and then discuss how to achieve them.
>
> G1: Do no harm (or avoid being a misfeature), i.e., the added snapshot
> code should not slow down the runtime performance of an image that has no
> snapshots. This implies that an image without snapshot should not cache
> the reference count table in memory and should not update the on-disk
> reference count table.
>
> G2: Even better, an image with 1 snapshot runs as fast as an image without
> snapshot.
>
> G3: Even even better, an image with 1,000 snapshots runs as fast as an
> image without snapshot. This basically means getting the snapshot feature
> for free.
>
> G4: An image with 1,000 snapshots consumes no more memory than an image
> without snapshot. This again means getting the snapshot feature for free.
>
> G5: Regardless of the number of existing snapshots, creating a new
> snapshot is fast, e.g., taking no more than 1 second.
>
> G6: Regardless of the number of existing snapshots, deleting a snapshot is
> fast, e.g., taking no more than 1 second.
>
> Now let's evaluate VMDK and QCOW2 against these ideal properties.
>
> G1: VMDK good; QCOW2 poor
> G2: VMDK ok; QCOW2 poor
> G3: VMDK very poor; QCOW2 poor
> G4: VMDK very poor; QCOW2 poor
> G5: VMDK good; QCOW2 good
> G6: VMDK poor; QCOW2 good
Okay. I don't think I agree with all of these. I'm not entirely sure how
VMDK works, so I'll read it as "random image format that uses backing
files" (so it also applies to qcow2 with backing files, which I hope
isn't too confusing).

G1: VMDK good; QCOW2 poor for cache=writethrough, ok otherwise; QCOW3 good
G2: VMDK ok; QCOW2 good
G3: VMDK poor; QCOW2 good
G4: VMDK very poor; QCOW2 ok
G5: VMDK good; QCOW2 good
G6: VMDK very poor; QCOW2 good

Also, let me add another property which I believe is an important factor
in the decision between internal and external snapshots:

G7: Loading/Reverting to a snapshot is fast

G7: VMDK good; QCOW2 ok
> On the other hand, QCOW2's internal snapshot has two major limitations that
> hurt runtime performance: caching the reference count table in memory and
> updating the on-disk reference count table. If we can eliminate both, then
> it is an ideal solution.
It's not even necessary to get rid of the refcount table completely.
What hurts is writing the additional metadata. So if you can delay
writing the metadata and only write out a refcount block once you need
to load the next one into memory, the overhead is lost in the noise
(remember, even with 64k clusters, a refcount block covers 2 GB of
virtual disk space).

We already do that for qcow2 in all writeback cache modes. We can't do
it yet for cache=writethrough, but we were planning to allow using QED's
dirty flag approach, which would get rid of the writes in writethrough
modes as well.

I think this explains my estimation for G1.
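
The 2 GB figure and the effect of delaying the write-out can be checked
with a small sketch. It assumes 16-bit refcount entries and a refcount
block that occupies exactly one cluster; the function names and the
one-block cache are made up for illustration and are not taken from the
qcow2 driver.

```python
# Assumptions (not stated in the message): refcount entries are 16 bits
# wide and a refcount block occupies exactly one cluster.
REFCOUNT_ENTRY_BYTES = 2


def refcount_block_coverage(cluster_size: int) -> int:
    """Bytes of image space described by one refcount block."""
    entries_per_block = cluster_size // REFCOUNT_ENTRY_BYTES
    return entries_per_block * cluster_size


def count_refcount_writes(num_allocations: int, cluster_size: int,
                          delayed: bool) -> int:
    """Count refcount block writes for sequential cluster allocations.

    delayed=False models writing the refcount block on every update.
    delayed=True models keeping one dirty block in memory and writing it
    only when the next block has to be loaded, plus one final flush.
    """
    entries_per_block = cluster_size // REFCOUNT_ENTRY_BYTES
    writes = 0
    cached_block = None  # index of the refcount block held in memory
    dirty = False
    for cluster_index in range(num_allocations):
        block = cluster_index // entries_per_block
        if not delayed:
            writes += 1
            continue
        if cached_block is not None and block != cached_block and dirty:
            writes += 1  # write back the old block before switching
        cached_block = block
        dirty = True
    if delayed and dirty:
        writes += 1  # final flush, e.g. when the image is closed
    return writes


if __name__ == "__main__":
    cluster_size = 64 * 1024
    print(refcount_block_coverage(cluster_size) // 2**30, "GiB per block")
    allocations = 100_000  # roughly 6.1 GiB of newly allocated clusters
    print(count_refcount_writes(allocations, cluster_size, delayed=False))
    print(count_refcount_writes(allocations, cluster_size, delayed=True))
```

With these assumptions, 100,000 sequential allocations cost 100,000
refcount block writes when every update is written immediately, and 4
when the write is delayed. The sketch ignores crash consistency, which
is the reason the delay is not simply applied to cache=writethrough.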
For G2 and G3, I'm not sure why you think that having internal snapshots
slows down operation. It's basically just data that sits in the image
file and is unused. After startup or after deleting a snapshot you
probably have to look at all of the refcount table again for cluster
allocations. Is this what you mean?

For G4, the size of snapshots in memory, the only overhead of internal
snapshots that I can think of is the snapshot table. I would hardly
rate this as "poor".

For G5 and G6 I basically agree with your estimation, except that I
think that the overhead of deleting an external snapshot is _really_
bad. This is one of the major problems we have with external snapshots
today.
> In an internal snapshot implementation, the reference count table is used
> to track used blocks and free blocks. It serves no other purposes. In FVD,
> its "static" reference count table only tracks blocks used by (static)
> snapshots, and it does not track blocks (dynamically) allocated (on a
> write) or freed (on a trim) for the running VM. This is a simple but
> fundamental difference w.r.t. QCOW2, whose reference count table tracks
> both the static content and the dynamic content. Because data blocks used
> by snapshots are static and do not change unless a snapshot is created or
> deleted, there is no need to update FVD's "static" reference count table
> when a VM runs, and actually there is even no need to cache it in memory.
> Data blocks that are dynamically allocated or freed for a running VM are
> already tracked by FVD's one-level lookup table (which is similar to
> QCOW2's two-level table, but in FVD it is much smaller and faster) even
> before introducing the snapshot feature, and hence it comes for free.
> Updating FVD's one-level lookup table is efficient because of FVD's
> journal.
So when is a cluster considered free? Only if its refcount is 0 and it
is also not referenced by a used lookup table entry?

How do you check the latter condition without scanning the whole lookup
table?
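
A rough sketch of that check makes the cost visible. The data layout
here is an assumption for illustration only (a list of per-cluster
refcounts and a lookup table that maps virtual chunks to physical
clusters, with None for unallocated chunks); it is not FVD's actual
on-disk format.

```python
from typing import List, Optional


def is_cluster_free(cluster: int,
                    static_refcounts: List[int],
                    lookup_table: List[Optional[int]]) -> bool:
    """Return True if no snapshot and no live mapping uses the cluster."""
    if static_refcounts[cluster] != 0:
        return False  # still referenced by at least one snapshot
    # Without a reverse index, the second condition needs a linear scan
    # over the whole lookup table.
    return all(entry != cluster for entry in lookup_table)


if __name__ == "__main__":
    static_refcounts = [1, 0, 0, 2]   # one entry per physical cluster
    lookup_table = [None, 2, None]    # virtual chunk -> physical cluster
    print([is_cluster_free(c, static_refcounts, lookup_table)
           for c in range(len(static_refcounts))])
```

In this toy layout only cluster 1 is free: clusters 0 and 3 are held by
snapshots, and cluster 2 has refcount 0 but is mapped by the running
image.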
> When the VM boots, FVD scans the reference count table once to build a
> so-called free-block-bitmap in memory, which identifies blocks not used by
> static snapshots. The reference count table is then thrown away and never
> updated when the VM runs.
This is an implementation detail and not related to the format.
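
For illustration, the one-time scan described in the quoted paragraph
could look roughly like the sketch below. The names and the in-memory
layout are hypothetical, and marking the blocks mapped by the running
image's lookup table as used is an added assumption, not something the
quoted text states.

```python
from typing import Iterable, List, Optional


def build_free_block_bitmap(static_refcounts: List[int],
                            lookup_table: Iterable[Optional[int]] = ()
                            ) -> List[bool]:
    """One pass at open time; True means the block is free."""
    free = [count == 0 for count in static_refcounts]
    # Assumption: blocks mapped by the running image also have to be
    # marked used, otherwise they could be handed out a second time.
    for entry in lookup_table:
        if entry is not None:
            free[entry] = False
    return free


def allocate_block(free: List[bool]) -> int:
    """Take the first free block and mark it used, in memory only."""
    index = free.index(True)  # raises ValueError when nothing is free
    free[index] = False
    return index


if __name__ == "__main__":
    bitmap = build_free_block_bitmap([1, 0, 0, 2], [None, 2, None])
    print(bitmap)
    print(allocate_block(bitmap))
```

After the scan, allocation touches only the in-memory bitmap, so the
refcount table is neither read nor written while the VM runs.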
Kevin