qemu-devel.nongnu.org archive mirror
From: "Denis V. Lunev" <den@openvz.org>
To: Kevin Wolf <kwolf@redhat.com>
Cc: John Snow <jsnow@redhat.com>,
	qemu-block@nongnu.org, qemu-devel@nongnu.org,
	Max Reitz <mreitz@redhat.com>,
	Stefan Hajnoczi <stefanha@redhat.com>
Subject: Re: [Qemu-devel] [Qemu-block] [RFC] Proposed qcow2 extension: subcluster allocation
Date: Tue, 18 Apr 2017 20:30:17 +0300
Message-ID: <7e278fc7-675a-e42f-94d9-bff3f347a2c4@openvz.org>
In-Reply-To: <20170418112215.GC9236@noname.redhat.com>

On 04/18/2017 02:22 PM, Kevin Wolf wrote:
> Am 14.04.2017 um 06:17 hat Denis V. Lunev geschrieben:
>> [skipped...]
>>
>>> Hi Denis,
>>>
>>> I've read this entire thread now and I really like Berto's summary which
>>> I think is one of the best recaps of existing qcow2 problems and this
>>> discussion so far.
>>>
>>> I understand your opinion that we should focus on compatible changes
>>> before incompatible ones, and I also understand that you are very
>>> concerned about physical fragmentation for reducing long-term IO.
>>>
>>> What I don't understand is why you think that subclusters will increase
>>> fragmentation. If we admit that fragmentation is a problem now, surely
>>> increasing cluster sizes to 1 or 2 MB will only help to *reduce*
>>> physical fragmentation, right?
>>>
>>> As far as I understand them, subclusters will not actually be
>>> located at virtually disparate locations; we will continue to
>>> allocate whole clusters as normal -- we'll just keep track of which
>>> portions of the cluster we've written to, to help us optimize COW*.
>>>
>>> So if we have a 1MB cluster with 64k subclusters as a hypothetical, if
>>> we write just the first subcluster, we'll have a map like:
>>>
>>> X---------------
>>>
>>> Whatever actually happens to exist in this space, whether it be a hole
>>> we punched via fallocate or literal zeroes, this space is known to the
>>> filesystem to be contiguous.
>>>
>>> If we write to the last subcluster, we'll get:
>>>
>>> X--------------X
>>>
>>> And again, maybe the dashes are a fallocate hole, maybe they're
>>> zeroes. The last subcluster is located virtually exactly 15
>>> subclusters after the first, so the two are not physically
>>> contiguous, but we've saved the space between them. Future
>>> out-of-order writes won't contribute to any fragmentation, at least
>>> at this level.
>>>
>>> You might be able to reduce COW from 5 IOPs to 3 IOPs, but if we tune
>>> the subclusters right, we'll have *zero*, won't we?
>>>
>>> As far as I can tell, this lets us do a lot of good things all at once:
>>>
>>> (1) We get some COW optimizations (for reasons Berto and Kevin have
>>> detailed)
>> Yes, COW itself is fine. But let us assume that, after the COW, the
>> guest issues a read of the entire cluster in the situation
>>
>> X--------------X
>>
>> with a backing store. This is possible even with a 1-2 MB cluster
>> size; I have seen 4-5 MB requests from guests in real life. In this
>> case we will have 3 IOPs:
>>     read the left X area, read from the backing file, read the right X area.
>> This is the real drawback of the approach when the sub-cluster size is
>> small (as it should be for optimal COW): we get random I/O in the host
>> instead of sequential I/O in the guest. We have optimized COW at the
>> cost of long-term read performance. This is what worries me, since a
>> lot of such reads can happen before the data changes again.
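The read split described above can be sketched as follows. This is a
plain illustration, not QEMU code; the cluster and subcluster sizes are
assumptions, and the bitmap stands in for the per-subcluster allocation
state:

```python
# Rough sketch (not QEMU code): how a full-cluster guest read splits
# into host I/Os when only the first and last subclusters are allocated.

CLUSTER = 2 * 1024 * 1024        # assumed 2 MB cluster
SUBCLUSTER = 64 * 1024           # assumed 64 KB subcluster
N = CLUSTER // SUBCLUSTER        # 32 subclusters per cluster

def split_read(alloc_bitmap):
    """Group consecutive subclusters with the same allocation state;
    each group becomes one host I/O (image file or backing file)."""
    ops = []
    start = 0
    for i in range(1, N + 1):
        if i == N or alloc_bitmap[i] != alloc_bitmap[start]:
            src = "image" if alloc_bitmap[start] else "backing"
            ops.append((src, start * SUBCLUSTER, (i - start) * SUBCLUSTER))
            start = i
    return ops

# X--------------X : only the first and last subclusters were written
bitmap = [False] * N
bitmap[0] = bitmap[N - 1] = True
for src, offset, length in split_read(bitmap):
    print(src, offset, length)
# 3 host I/Os: image / backing / image
```

A guest that issued one sequential read thus costs the host three
scattered requests, which is exactly the long-term penalty being
discussed.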
> So just to avoid misunderstandings about what you're comparing here:
> You get these 3 iops for 2 MB clusters with 64k subclusters, whereas you
> would get only a single operation for 2 MB clusters without subclusters.
> Today's 64k clusters without subclusters behave no better than the
> 2M/64k version, but that's not what you're comparing.
>
> Yes, you are correct about this observation. But it is a tradeoff that
> you're intentionally making when using backing files. In the extreme,
> there is an alternative that performs so much better: Instead of using a
> backing file, use 'qemu-img convert' to copy (and defragment) the whole
> image upfront. No COW whatsoever, no fragmentation, fast reads. The
> downside is that it takes a while to copy the whole image upfront, and
> it also costs quite a bit of disk space.
In general, this is a total pain for production environments. We have a
lot of customers with TB-sized images, and free space is a real problem
for them too.


> So once we acknowledge that we're dealing with a tradeoff here, it
> becomes obvious that neither the extreme setup for performance (copy the
> whole image upfront) nor the extreme setup for sparseness (COW on a
> sector level) are the right default for the average case, nor is
> optimising one-sidedly a good idea. It is good if we can provide
> solutions for extreme cases, but by default we need to cater for the
> average case, which cares both about reasonable performance and disk
> usage.
Yes, I agree. But a 64 KB cluster size by default for big images (not
for backups!) is another extreme ;) Who cares about a few wasted MBs
with a 1 TB or 10 TB image?

Please note that 1 MB is a better default cluster size, because at that
size a random write costs about the same as a sequential write on
non-SSD disks.
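A toy model makes the point about block size and positioning overhead;
all disk parameters below are assumptions picked for illustration, not
measurements:

```python
# Back-of-envelope model (assumed HDD parameters): the fraction of
# sequential throughput a random-write workload achieves grows with the
# block size, because the fixed positioning cost is amortized over more
# transferred bytes.

SEEK_MS = 8.0          # assumed average seek time
ROTATE_MS = 4.17       # assumed rotational latency (7200 rpm, half turn)
STREAM_MBPS = 150.0    # assumed sequential transfer rate

def random_write_efficiency(block_bytes):
    """Fraction of sequential throughput when every write of
    `block_bytes` pays one seek plus rotational latency."""
    transfer_ms = block_bytes / (STREAM_MBPS * 1e6) * 1e3
    return transfer_ms / (SEEK_MS + ROTATE_MS + transfer_ms)

for size in (64 * 1024, 1024 * 1024):
    eff = random_write_efficiency(size)
    print(f"{size // 1024:5d} KB: {eff:.0%} of sequential throughput")
# With these assumed numbers, 64 KB random writes reach only a few
# percent of sequential speed, while 1 MB writes recover a large part
# of it.
```

The exact figures depend on the drive, but the trend is what matters
here: the larger the cluster, the smaller the penalty for random
placement on rotating media.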

Den

