qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
From: Max Reitz <mreitz@redhat.com>
To: Eric Blake <eblake@redhat.com>, Alberto Garcia <berto@igalia.com>,
	qemu-devel@nongnu.org
Cc: Kevin Wolf <kwolf@redhat.com>,
	qemu-block@nongnu.org, Stefan Hajnoczi <stefanha@redhat.com>
Subject: Re: [Qemu-devel] proposed qcow2 extension: cluster reservations [was: [RFC] Proposed qcow2 extension: subcluster allocation
Date: Sat, 22 Apr 2017 19:56:57 +0200	[thread overview]
Message-ID: <1877dc76-87e0-650c-d76a-44e6e7754a2b@redhat.com> (raw)
In-Reply-To: <382b0707-233d-ab05-e054-3da43617460d@redhat.com>

[-- Attachment #1: Type: text/plain, Size: 5565 bytes --]

On 21.04.2017 23:09, Eric Blake wrote:
> On 04/06/2017 11:40 AM, Eric Blake wrote:
> 
>>> === Changes to the on-disk format ===
>>>
>>> The qcow2 on-disk format needs to change so each L2 entry has a bitmap
>>> indicating the allocation status of each subcluster. There are three
>>> possible states (unallocated, allocated, all zeroes), so we need two
>>> bits per subcluster.
>>
>> You also have to add a new incompatible feature bit, so that older tools
>> know they can't read the new image correctly, and therefore don't
>> accidentally corrupt it.
> 
> As long as we are talking about incompatible feature bits, I had some
> thoughts about image mapping today.
> 
> tl;dr: summary> As long as we are considering incompatible features, maybe we should
> make it easier to have an image file that explicitly preserves
> guest=>host mapping, such that nothing the guest can do will reorder the
> mapping.  This way, it would be possible to fully preallocate an image
> such that all guest offsets are adjusted by a constant offset to become
> the corresponding host offset (basically, no qcow2 metadata is
> interleaved in the middle of guest data).
> 
> I don't see any way to do it on current qcow2 images, but with
> subclusters, you get it for free by having a cluster with an offset but
> with all subclusters marked as unallocated.  But perhaps it is something
> orthogonal enough that we would want a separate incompatible feature bit
> for just this, without having subclusters at the same time.
> 
> In the process of exploring the topic, I expose a couple of existing
> bugs in our qcow2 handling.
> 
> ========
> 
> Longer version:
> 
> If I'm reading qcow2-clusters.c and qcow2-refcount.c correctly, our
> current implementation of bdrv_discard() says that except for clusters
> already marked QCOW2_CLUSTER_ZERO, we will unconditionally remove the L2
> mapping of the address.

As I've said, I think the ZERO bit is just a bug and we should free
preallocated zero clusters.

>                          Whether we actually punch a hole in the
> underlying image, or merely add it to a list of free clusters for use in
> subsequent allocations, is later decided based on
> s->discard_passthrough[type] (that is, the caller can pass different
> type levels that control whether to never punch, always punch, or let
> the blockdev parameters of the caller control whether to punch).
> 
> Presumably, if I've preallocated my image because I want to guarantee
> enough host space, then I would use blockdev parameters that ensure that
> guest actions never punch a hole.  But based on my reading, I still have
> the situation that if I initially preallocated things so that guest
> cluster 0 and 1 are consecutive clusters in the host, and the guest
> triggers bdrv_pdiscard() over both clusters, then writes to cluster 1
> before cluster 0, then even though I'm not changing the underlying
> allocation of the host file, I _am_ resulting in fragmentation in the
> qcow2 file, where cluster 1 in the guest now falls prior to cluster 0.

[...]

> But if we can preserve mappings of clusters that are explicitly marked
> zero, I started to wonder if it would also be reasonable to have a mode
> where we could ALWAYS preserve mappings.  Adding another bit that says
> that a cluster has a reserved mapping, but still defers to the backing
> file for its current data, would let us extend the existing behavior of
> map-preservation on write zeros to work with ALL writes, when writing to
> a fully pre-allocated image.

Yes, and it also means that we may want to think about implementation a
preallocation mode in qemu which puts all of the data into a single
consecutive chunk (as you have hinted at somewhere above).

> When I chatted with Max on IRC about the idea, we said this:
> 
> 
> <mreitz> I mean, sure, we can add both, but I'd still want them two be
> two incompatible bits
> <eblake> if you want the features to be orthogonal (with exponentially
> more cases to check), then having multiple incompatible bits is okay
> <mreitz> Because FEATURE_BIT_UNALLOCATED_AND_SUBCLUSTERS sounds weird
> and FEATURE_BIT_EXTENDED_L2_ENTRIES a bit pretentious
> <mreitz> Well, orthogonal is a good question. If we want to have an
> UNALLOCATED flag we should think so before adding subclusters
> <mreitz> Because then we should at least make clear that the ZERO flag
> for a subcluster requires the ALLOCATED flag to be set or something
> <mreitz> So we can reserve ZERO/!ALLOCATED for the case where you want
> to fall through to the backing file
> 
> So, if you got this far in reading, the question becomes whether having
> a mode where you can mark a cluster as mapping-reserved-but-unallocated
> has enough use case to be worth pursuing, knowing that it will burn an
> incompatible feature bit; or if it should be rolled into the subcluster
> proposal, or whether it's not a feature that anyone needs after all.

I just forgot that just saying !ALLOCATED will be enough, regardless of
the ZERO flag... Yeah, subclusters will give us this for free, and I
think it's therefore reasonable to just require them if you want this
feature (preallocation with a backing file).

> And meanwhile, it looks like I have some patches to propose (and
> qemu-iotests to write) if I can help fix the bugs I've pointed out.

You mean these?
https://github.com/XanClic/qemu/commits/random-qcow2-stuff-v1

;-)

I'll send them once I've written tests.

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 512 bytes --]

  reply	other threads:[~2017-04-22 17:57 UTC|newest]

Thread overview: 64+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-04-06 15:01 [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation Alberto Garcia
2017-04-06 16:40 ` Eric Blake
2017-04-07  8:49   ` Alberto Garcia
2017-04-07 12:41   ` Kevin Wolf
2017-04-07 14:24     ` Alberto Garcia
2017-04-21 21:09   ` [Qemu-devel] proposed qcow2 extension: cluster reservations [was: " Eric Blake
2017-04-22 17:56     ` Max Reitz [this message]
2017-04-24 11:45       ` Kevin Wolf
2017-04-24 12:46       ` Alberto Garcia
2017-04-07 12:20 ` [Qemu-devel] " Stefan Hajnoczi
2017-04-07 12:24   ` Alberto Garcia
2017-04-07 13:01   ` Kevin Wolf
2017-04-10 15:32     ` Stefan Hajnoczi
2017-04-07 17:10 ` Max Reitz
2017-04-10  8:42   ` Kevin Wolf
2017-04-10 15:03     ` Max Reitz
2017-04-11 12:56   ` Alberto Garcia
2017-04-11 14:04     ` Max Reitz
2017-04-11 14:31       ` Alberto Garcia
2017-04-11 14:45         ` [Qemu-devel] [Qemu-block] " Eric Blake
2017-04-12 12:41           ` Alberto Garcia
2017-04-12 14:10             ` Max Reitz
2017-04-13  8:05               ` Alberto Garcia
2017-04-13  9:02                 ` Kevin Wolf
2017-04-13  9:05                   ` Alberto Garcia
2017-04-11 14:49         ` [Qemu-devel] " Kevin Wolf
2017-04-11 14:58           ` Eric Blake
2017-04-11 14:59           ` Max Reitz
2017-04-11 15:08             ` Eric Blake
2017-04-11 15:18               ` Max Reitz
2017-04-11 15:29                 ` Kevin Wolf
2017-04-11 15:29                   ` Max Reitz
2017-04-11 15:30                 ` Eric Blake
2017-04-11 15:34                   ` Max Reitz
2017-04-12 12:47           ` Alberto Garcia
2017-04-12 16:54 ` Denis V. Lunev
2017-04-13 11:58   ` Alberto Garcia
2017-04-13 12:44     ` Denis V. Lunev
2017-04-13 13:05       ` Kevin Wolf
2017-04-13 13:09         ` Denis V. Lunev
2017-04-13 13:36           ` Alberto Garcia
2017-04-13 14:06             ` Denis V. Lunev
2017-04-13 13:21       ` Alberto Garcia
2017-04-13 13:30         ` Denis V. Lunev
2017-04-13 13:59           ` Kevin Wolf
2017-04-13 15:04           ` Alberto Garcia
2017-04-13 15:17             ` Denis V. Lunev
2017-04-18 11:52               ` Alberto Garcia
2017-04-18 17:27                 ` Denis V. Lunev
2017-04-13 13:51         ` Kevin Wolf
2017-04-13 14:15           ` Alberto Garcia
2017-04-13 14:27             ` Kevin Wolf
2017-04-13 16:42               ` [Qemu-devel] [Qemu-block] " Roman Kagan
2017-04-13 14:42           ` [Qemu-devel] " Denis V. Lunev
2017-04-12 17:55 ` Denis V. Lunev
2017-04-12 18:20   ` Eric Blake
2017-04-12 19:02     ` Denis V. Lunev
2017-04-13  9:44       ` Kevin Wolf
2017-04-13 10:19         ` Denis V. Lunev
2017-04-14  1:06           ` [Qemu-devel] [Qemu-block] " John Snow
2017-04-14  4:17             ` Denis V. Lunev
2017-04-18 11:22               ` Kevin Wolf
2017-04-18 17:30                 ` Denis V. Lunev
2017-04-14  7:40             ` Roman Kagan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1877dc76-87e0-650c-d76a-44e6e7754a2b@redhat.com \
    --to=mreitz@redhat.com \
    --cc=berto@igalia.com \
    --cc=eblake@redhat.com \
    --cc=kwolf@redhat.com \
    --cc=qemu-block@nongnu.org \
    --cc=qemu-devel@nongnu.org \
    --cc=stefanha@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).