From: Avi Kivity <avi@redhat.com>
To: Anthony Liguori <anthony@codemonkey.ws>
Cc: Kevin Wolf <kwolf@redhat.com>,
Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>,
qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format
Date: Thu, 09 Sep 2010 09:45:27 +0300 [thread overview]
Message-ID: <4C888287.8020209@redhat.com> (raw)
In-Reply-To: <4C87860A.3060904@codemonkey.ws>
On 09/08/2010 03:48 PM, Anthony Liguori wrote:
> On 09/08/2010 03:23 AM, Avi Kivity wrote:
>> On 09/08/2010 01:27 AM, Anthony Liguori wrote:
>>>> FWIW, L2s are 256K at the moment and with a two level table, it can
>>>> support 5PB of data.
>>>
>>>
>>> I clearly suck at basic math today. The image supports 64TB today.
>>> Dropping to 128K tables would reduce it to 16TB and 64k tables would
>>> be 4TB.
>>
>> Maybe we should do three levels then. Some users are bound to
>> complain about 64TB.
>
> That's just the default size. The table size and cluster sizes are
> configurable. Without changing the cluster size, the image can
> support up to 1PB.
Loading very large L2 tables on demand will result in very long
latencies. Increasing cluster size will result in very long first write
latencies. Adding an extra level results in an extra random write every
4TB.
>>
>>> Today, we only need to sync() when we first allocate an L2 entry
>>> (because their locations never change). From a performance
>>> perspective, it's the difference between an fsync() every 64k vs.
>>> every 2GB.
>>
>> Yup. From a correctness perspective, it's the difference between a
>> corrupted filesystem on almost every crash and a corrupted filesystem
>> in some very rare cases.
>
>
> I'm not sure I understand you're corruption comment. Are you claiming
> that without checksumming, you'll often get corruption or are you
> claiming that without checksums, if you don't sync metadata updates
> you'll get corruption?
No, I'm claiming that with checksums but without allocate-on-write you
will have frequent (detected) data loss after power failures. Checksums
need to go hand-in-hand with allocate-on-write (which happens to be the
principle underlying zfs and btrfs).
>
> qed is very careful about ensuring that we don't need to do syncs and
> we don't get corruption because of data loss. I don't necessarily buy
> your checksumming argument.
The requirement for checksumming comes from a different place. For
decades we've enjoyed very low undetected bit error rates. However the
actual amount of data is increasing to the point that it makes an
undetectable bit error likely, just by throwing a huge amount of bits at
storage. Write ordering doesn't address this issue.
Virtualization is one of the uses where you have a huge number of bits.
btrfs addresses this, but if you have (working) btrfs you don't need
qed. Another problem is nfs; TCP and UDP checksums are incredibly weak
and it is easy for a failure to bypass them. Ethernet CRCs are better,
but they only work if the error is introduced after the CRC is taken and
before it is verified.
>>
>> Well, if we introduce a minimal format, we need to make sure it isn't
>> too minimal.
>>
>> I'm still not sold on the idea. What we're doing now is pushing the
>> qcow2 complexity to users. We don't have to worry about refcounts
>> now, but users have to worry whether they're the machine they're
>> copying the image to supports qed or not.
>>
>> The performance problems with qcow2 are solvable. If we preallocate
>> clusters, the performance characteristics become essentially the same
>> as qed.
>
> By creating two code paths within qcow2.
You're creating two code paths for users.
> It's not just the reference counts, it's the lack of guaranteed
> alignment, compression, and some of the other poor decisions in the
> format.
>
> If you have two code paths in qcow2, you have non-deterministic
> performance because users that do reasonable things with their images
> will end up getting catastrophically bad performance.
We can address that in the tools. "By enabling compression, you may
reduce performance for multithreaded workloads. Abort/Retry/Ignore?"
>
> A new format doesn't introduce much additional complexity. We provide
> image conversion tool and we can almost certainly provide an in-place
> conversion tool that makes the process very fast.
It requires users to make a decision. By the time qed is ready for mass
deployment, 1-2 years will have passed. How many qcow2 images will be
in the wild then? How much scheduled downtime will be needed? How much
user confusion will be caused?
Virtualization is about compatibility. In-guest compatibility first,
but keeping the external environment stable is also important. We
really need to exhaust the possibilities with qcow2 before giving up on it.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
next prev parent reply other threads:[~2010-09-09 6:45 UTC|newest]
Thread overview: 132+ messages / expand[flat|nested] mbox.gz Atom feed top
2010-09-06 10:04 [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format Stefan Hajnoczi
2010-09-06 10:25 ` Alexander Graf
2010-09-06 10:31 ` Stefan Hajnoczi
2010-09-06 14:21 ` Luca Tettamanti
2010-09-06 14:24 ` Alexander Graf
2010-09-06 16:27 ` Anthony Liguori
2010-09-06 10:27 ` [Qemu-devel] " Kevin Wolf
2010-09-06 12:40 ` Stefan Hajnoczi
2010-09-06 12:57 ` Anthony Liguori
2010-09-06 13:02 ` Stefan Hajnoczi
2010-09-06 14:10 ` Kevin Wolf
2010-09-06 16:45 ` Anthony Liguori
2010-09-06 12:45 ` Anthony Liguori
2010-09-10 23:49 ` H. Peter Anvin
2010-09-06 11:18 ` [Qemu-devel] " Daniel P. Berrange
2010-09-06 12:52 ` Anthony Liguori
2010-09-06 13:35 ` Daniel P. Berrange
2010-09-06 16:38 ` Anthony Liguori
2010-09-06 13:06 ` Anthony Liguori
2010-09-07 14:51 ` Avi Kivity
2010-09-07 15:40 ` Anthony Liguori
2010-09-07 16:09 ` Avi Kivity
2010-09-07 16:25 ` Anthony Liguori
2010-09-07 22:27 ` Anthony Liguori
2010-09-08 8:23 ` Avi Kivity
2010-09-08 8:41 ` Alexander Graf
2010-09-08 8:53 ` Avi Kivity
2010-09-08 11:15 ` Stefan Hajnoczi
2010-09-08 15:38 ` Christoph Hellwig
2010-09-08 16:30 ` Anthony Liguori
2010-09-08 20:23 ` Christoph Hellwig
2010-09-08 20:28 ` Anthony Liguori
2010-09-09 2:35 ` Christoph Hellwig
2010-09-09 6:24 ` Avi Kivity
2010-09-09 21:01 ` Christoph Hellwig
2010-09-10 11:15 ` Avi Kivity
2010-09-09 6:53 ` Avi Kivity
2010-09-10 21:22 ` Jamie Lokier
2010-09-14 10:46 ` Stefan Hajnoczi
2010-09-14 11:08 ` Stefan Hajnoczi
2010-09-14 12:54 ` Anthony Liguori
2010-09-08 12:55 ` Anthony Liguori
2010-09-09 6:30 ` Avi Kivity
2010-09-08 12:48 ` Anthony Liguori
2010-09-08 13:20 ` Kevin Wolf
2010-09-08 13:26 ` Anthony Liguori
2010-09-08 13:46 ` Kevin Wolf
2010-09-09 6:45 ` Avi Kivity [this message]
2010-09-09 6:48 ` Avi Kivity
2010-09-09 12:49 ` Anthony Liguori
2010-09-09 16:48 ` [Qemu-devel] " Paolo Bonzini
2010-09-09 17:02 ` Anthony Liguori
2010-09-09 20:56 ` Christoph Hellwig
2010-09-10 10:53 ` Avi Kivity
2010-09-10 11:14 ` [Qemu-devel] " Avi Kivity
2010-09-10 11:25 ` Avi Kivity
2010-09-10 11:33 ` Stefan Hajnoczi
2010-09-10 11:43 ` Avi Kivity
2010-09-10 13:22 ` Anthony Liguori
2010-09-10 13:48 ` Christoph Hellwig
2010-09-10 15:02 ` Anthony Liguori
2010-09-10 15:18 ` Kevin Wolf
2010-09-10 15:53 ` Anthony Liguori
2010-09-10 16:05 ` Kevin Wolf
2010-09-10 17:10 ` Anthony Liguori
2010-09-10 17:44 ` Kevin Wolf
2010-09-10 17:46 ` Miguel Di Ciurcio Filho
2010-09-10 14:02 ` Avi Kivity
2010-09-10 13:47 ` Christoph Hellwig
2010-09-10 14:05 ` Avi Kivity
2010-09-10 14:12 ` Christoph Hellwig
2010-09-10 14:24 ` Avi Kivity
2010-09-10 13:16 ` Anthony Liguori
2010-09-10 14:06 ` Avi Kivity
2010-09-10 11:43 ` Stefan Hajnoczi
2010-09-10 12:06 ` Avi Kivity
2010-09-10 13:28 ` Anthony Liguori
2010-09-10 12:12 ` Kevin Wolf
2010-09-10 12:35 ` Stefan Hajnoczi
2010-09-10 12:47 ` Avi Kivity
2010-09-10 13:10 ` Stefan Hajnoczi
2010-09-10 13:19 ` Avi Kivity
2010-09-10 13:39 ` Anthony Liguori
2010-09-10 13:52 ` Christoph Hellwig
2010-09-10 13:56 ` Avi Kivity
2010-09-10 13:48 ` Kevin Wolf
2010-09-10 13:14 ` Anthony Liguori
2010-09-10 13:47 ` Avi Kivity
2010-09-10 14:56 ` Anthony Liguori
2010-09-10 15:49 ` Avi Kivity
2010-09-10 17:07 ` Anthony Liguori
2010-09-10 17:42 ` Kevin Wolf
2010-09-10 19:33 ` Anthony Liguori
2010-09-13 10:41 ` Kevin Wolf
2010-09-12 13:24 ` Avi Kivity
2010-09-12 15:13 ` Anthony Liguori
2010-09-12 15:56 ` Avi Kivity
2010-09-12 17:09 ` Anthony Liguori
2010-09-12 17:51 ` Avi Kivity
2010-09-12 20:18 ` Anthony Liguori
2010-09-13 9:24 ` Avi Kivity
2010-09-13 11:28 ` Kevin Wolf
2010-09-13 11:34 ` Avi Kivity
2010-09-13 11:48 ` Kevin Wolf
2010-09-13 13:19 ` Anthony Liguori
2010-09-13 13:12 ` Anthony Liguori
2010-09-13 11:03 ` Kevin Wolf
2010-09-13 13:07 ` Anthony Liguori
2010-09-13 13:24 ` Kevin Wolf
2010-09-07 16:12 ` Anthony Liguori
2010-09-07 21:35 ` Christoph Hellwig
2010-09-07 22:29 ` Anthony Liguori
2010-09-07 22:40 ` Christoph Hellwig
2010-09-08 15:07 ` Stefan Hajnoczi
2010-09-09 6:59 ` Avi Kivity
2010-09-09 17:43 ` Anthony Liguori
2010-09-09 20:46 ` Christoph Hellwig
2010-09-10 11:22 ` Avi Kivity
2010-09-10 11:29 ` Stefan Hajnoczi
2010-09-10 11:37 ` Avi Kivity
2010-09-07 13:58 ` Avi Kivity
2010-09-07 19:25 ` Blue Swirl
2010-09-07 20:41 ` Anthony Liguori
2010-09-08 7:48 ` Kevin Wolf
2010-09-08 15:37 ` Stefan Hajnoczi
2010-09-08 18:24 ` Blue Swirl
2010-09-08 18:35 ` Anthony Liguori
2010-09-08 18:56 ` Blue Swirl
2010-09-08 19:19 ` Anthony Liguori
2010-09-15 21:01 ` [Qemu-devel] " Michael S. Tsirkin
2010-09-15 21:12 ` Anthony Liguori
-- strict thread matches above, loose matches on Subject: below --
2010-09-17 3:51 [Qemu-devel] " Khoa Huynh
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=4C888287.8020209@redhat.com \
--to=avi@redhat.com \
--cc=anthony@codemonkey.ws \
--cc=kwolf@redhat.com \
--cc=qemu-devel@nongnu.org \
--cc=stefanha@linux.vnet.ibm.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.