From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:51265)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <eblake@redhat.com>) id 1XqyCB-0005JI-7r
	for qemu-devel@nongnu.org; Wed, 19 Nov 2014 00:52:56 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <eblake@redhat.com>) id 1XqyC6-0000wc-3i
	for qemu-devel@nongnu.org; Wed, 19 Nov 2014 00:52:51 -0500
Received: from mx1.redhat.com ([209.132.183.28]:51216)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <eblake@redhat.com>) id 1XqyC5-0000wY-Pi
	for qemu-devel@nongnu.org; Wed, 19 Nov 2014 00:52:46 -0500
Message-ID: <546C3022.9070701@redhat.com>
Date: Tue, 18 Nov 2014 22:52:34 -0700
From: Eric Blake <eblake@redhat.com>
MIME-Version: 1.0
References: <1415970374-16811-1-git-send-email-mreitz@redhat.com>	<1415970374-16811-22-git-send-email-mreitz@redhat.com>	<5467682E.1030604@redhat.com>
	<5469E4C0.8040509@redhat.com> <546BAB8E.3090200@redhat.com>
In-Reply-To: <546BAB8E.3090200@redhat.com>
Content-Type: multipart/signed; micalg=pgp-sha256;
	protocol="application/pgp-signature";
	boundary="ivE7BT3ckQkLNe03DSOnHPNScuhvDXAvS"
Subject: Re: [Qemu-devel] [PATCH v2 21/21] iotests: Add test for different
 refcount widths
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Max Reitz <mreitz@redhat.com>, qemu-devel@nongnu.org
Cc: Kevin Wolf <kwolf@redhat.com>, Peter Lieven <pl@kamp.de>, Stefan Hajnoczi <stefanha@redhat.com>

This is an OpenPGP/MIME signed message (RFC 4880 and 3156)
--ivE7BT3ckQkLNe03DSOnHPNScuhvDXAvS
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable

On 11/18/2014 01:26 PM, Eric Blake wrote:

> Now, in response to your question about some other 3-pass inducing
> pattern, let's think back to v1, where you questioned what would happen=

> if a hole in the reftable gets turned into data due to a later
> allocation.  Let's see if I can come up with a scenario for that...
>=20
> Let's stick with a cluster size of 512, and use 32-bit and 64-bit width=
s
> as our two sizes.  If we downsize from 64 to 32 bits, then every two
> refblock clusters in the old table results in one call to operation()
> for the new table; conversely, if we upsize, then every refblock cluste=
r
> in the old table gives two calls to operation() in the new table.  The
> trick at hand is to come up with some image where we punch a hole so
> that on the first pass, we call operation() with refblock_empty true fo=
r
> one iteration (necessarily a call later than the first, since the image=

> header guarantees the first refblock is not empty), but where we have
> data after the hole, where it is the later data that triggers the
> allocation that will finally start to fill the hole.
>=20
> How about starting with an image that occupies between 1.5 and 2
> refblocks worth of 32-width clusters (so an image anywhere between 193
> and 256 clusters, or between 98816 and 131072 bytes).  You should be
> able to figure out how many clusters this consumes for L1, L2, plus 1
> for header, reftable, and 2 for refblocks, in order to figure out how
> many remaining clusters are dedicated to data; ideally, the data
> clusters are contiguous, and occupy a swath that covers at least
> clusters 126 through 192.  Widening to 64-bit width will require 4
> refblocks instead of 2, if all refblocks are needed.  But the whole ide=
a
> of punching a hole is that we don't need a refblock if it will be
> all-zero entries.  So take this original image, and discard the data
> clusters from physical index 126 through 192, (this is NOT the data
> visible at guest offset 31744, but whatever actual offset of guest data=

> that maps to physical offset 31744).  The old reftable now looks like {=

> refblock_o1 [0-125 occupied, 126 and 127 empty], refblock_o2 [128-191
> empty, 192-whatever occupied, tail empty] }.  With no allocations
> required, this would in turn would map to the following new refblocks: =
{
> refblock_n1 [0-64 occupied], refblock_n2 [65-125 occupied, 126-127
> empty], NULL, refblock_n4 [192-whatever occupied] }.  Note that we do
> not need to allocate refblock_n3 because of the hole in the old
> refblock; we DO end up allocating three refblocks, but in the sequence
> of things, refblock_n1 and refblock_n2 are allocated while we are
> visiting refblock_o1 and still fit in refblock_o1, while refblock_n4 is=

> not allocated until after we have already passed over the first half of=

> refblock_o2.
>=20
> Thus, the second walk over the image will see that we need to allocate
> refblock_n3 because it now contains entries (in particular, the entry
> for refblock_n4, but also the 1-cluster entry for the proposed reftable=

> that is allocated between the walks).  So, while your test used the
> allocation of the reftable as the spillover point, my scenario here use=
s
> the allocation of later refblocks as the spillover point that got misse=
d
> during the first iteration.
>=20

Oops,...

>=20
> which means the reftable now looks like { refblock1, NULL, refblock3,
> NULL... }; and where refblock1 now has at least two free entries
> (possibly three, if the just-freed refblock2 happened to live before
> cluster 62).  is we can also free refblock2
>=20

=2E..forgot to delete these random thoughts that I typed up but no longer=

needed after reworking the above text.

At any rate, I'm not certain we can come up with a four-pass scenario;
if it is even possible, it would be quite complex.  Back in v1, I
questioned what would happen with a completely full reftable, where the
mere allocation of the new reftable causes a spillover not only in the
size of the new reftable, but also to the old.  Let's try to find a
scenario where the reftable does NOT spill on the first pass, but DOES
spill on the second.  A 32-bit width image spills from 1 to 2 clusters
for the reftable at the boundary of 64->65 refblocks (8192->8193
clusters); at that same cluster boundary, the reftable for a 64-bit
width table spills from 2 to 3 clusters.  Furthermore, a 64-bit width
refblock misses an allocation on the first pass if there is a hole of 64
aligned clusters.

First try:

We want an image that has roughly 128 free clusters on the first pass,
including a hole of at least 64 aligned clusters, to where a second pass
is needed to cover the refblock that was treated as a hole on the first
pass.  Furthermore, we want the second pass to completely fill the
image, so that the reftable allocation of 2 clusters after the walk is
the trigger of the spill, then the third pass will allocate another
refblock because of the spill, and a fourth pass is required to ensure
no allocations before the final pass of assigning refcounts.

Start with a 32-bit width image with clusters 0-125 allocated, 126-191
empty, then 192-8127 allocated (then 8128-8191 are unallocated to pad
out to cluster boundary) - describing a file of size 4161536.  This
image has 130 clusters spare before it spills, with a hole of 66
clusters near the front, and a tail of 64 clusters; since there is no
block of 128 aligned unallocated clusters, all 64 refblocks are in use.
 Now widen this image to 64-bit width refcounts.

On the first walk, we allocate two refblocks for clusters 0-127 (using
free slots 126 and 127), then leave an unallocated refblock for clusters
128-191, then allocate 124 more refblocks for clusters 192-8127 (using
free slots 128-191, as well as 8128-8187); this walk also picks up a
refblock for clusters 8128-8191 (using free slot 8188) because that area
of the file is allocated before the walk reaches that point, even if it
was a large enough hole to not need a refblock at the beginning of the
walk.  So we end the walk with 3 free clusters, and proceed to allocate
a 2-cluster reftable for the 127 refblocks that we created (using free
slots 8189-8190).

On the second walk, we allocate a refblock for clusters 128-191 (slot
8191). At the end of the walk, we free the 2-cluster reftable, then
reallocate; but as we still only need 2 clusters, we reuse slots
8189-8190.  Bummer - a third pass doesn't need allocation.

Second try:

Tweak the input by one cluster: 0-125 allocated, 126-191 empty, 192-8128
allocated (8129-8191 unallocated); file size 4162048.  129 spare
clusters, with hole of 66 and tail of 63.  On the first walk, we
allocate 127 refblocks (ending with the use of free slots 8129-8189),
then allocate a 2-cluster reftable (free slots 8190-8191).  On the
second walk, we allocate a refblock for clusters 128-191 (slot 8192 -
which triggers a spill of the old reftable), and a refblock for clusters
8192-8255 (since we already spilled).  After the walk, we now allocate a
3-cluster reftable, but it fits nicely within 8192-8255.  Bummer - a
third pass still doesn't need an allocation.

Third try:

Tweak the input by an entire reftable cluster.  Let's start with a
32-bit image with a 2-cluster reftable, and try for the spill of a
64-bit image from 3->4 clusters (happens at the boundary from cluster
12287->12288).  We want around 192 free clusters, plus the space for the
reftable, and still want a hole of at least 64 aligned clusters to
trigger the second pass as the one that fills to the spilling point.

So the image has clusters 0-125 allocated, 126-191 empty, then 192-12157
allocated (then 12158-16383 are unallocated to pad out to cluster
boundary) - describing a file of size 6224896.  This image has 196
clusters spare before it spills the 64-bit reftable, with a hole of 66
clusters near the front, and a tail of 130 clusters.

On the first walk, we allocate two refblocks for clusters 0-127 (using
free slots 126 and 127), then leave an unallocated refblock for clusters
128-191, then allocate 187 more refblocks for clusters 192-12157 (using
free slots 128-191, as well as 12158-12280); this walk also triggers a
refblock allocation for the 32-bit table (slot 12281 for clusters
12160-12287) and 2 refblocks for the 64-bit table (slots 12282-12283 for
clusters 12158-12287) because that area of the file is allocated before
the walk reaches that point.  So we end the walk with 4 free clusters,
and proceed to allocate a 3-cluster reftable for the 191 refblocks that
we created (using free slots 12284-12286).

On the second walk, we allocate a refblock for clusters 128-191 (slot
12287). At the end of the walk, we free the 3-cluster reftable, then
reallocate; but as we still only need 3 clusters, we reuse slots
12284-12286.  Bummer - a third pass doesn't need allocation.

I'm sensing a pattern here - trying to go for reftable spills isn't
going to help - by the second pass, we already have a reftable
reservation (which we free and then reallocate), and the only time we'd
need a larger reftable after the second walk is if we encountered extra
refblocks during the walk, but the only way to get extra refblocks
during the walk is to spill the old table during the second walk, but in
that case the second walk picks up the spill.

Fourth try:

Even trying to play with larger files with larger holes won't easily
help.  Suppose I have a file that has 0-62 allocated, then 63-4159
unallocated, followed by 262144 clusters allocated (an image over 128M
in size).  On the first walk, we allocate a refblock (slot 63) for
clusters 0-63, then pass over 64 refblocks that don't need allocation,
then allocate 4096 refblocks (slots 64-4159).  Not even considering the
slots that are added in the old table, this means the second pass will
allocate an additional 64 refblocks.  But this allocation will NOT spill
the size of the reftable, because the reftable is already sized large
enough to cover the refblocks for the 128M tail of the file.

What if we have a larger hole, and also tweak the input file so that
there are no missing refblocks of the old table (the holes are only
possible in the new table).  That is, start with 0-62 allocated, 63-127
free, then a repeating pattern of 64 allocated clusters and 64 free,
over the course of 130 repetitions.  Follow that with 262144 allocated
clusters.  This means we have 65 + 130*64 =3D=3D 8385 free clusters, to
dedicate solely to the new reftable; while the image occupies
(131*128)+262144 =3D=3D 278912 clusters (requires 4358 refblock entries,
which in turn requires 69 contiguous clusters for the reftable).

On the first walk, we alternate between allocating a refblock and
leaving a hole for a repetition of 131 times, then allocate 4096
refblocks.  Thus, we have consumed 4227 allocations out of the 8385 free
clusters; where the old table is now using the space of 67 of the
allocation holes.  Furthermore, the reftable allocation is too large to
fit in any of the holes, so it gets appended at the end of the image
(the image is now 278981 clusters, requiring 4360 refblock entries, but
still only 69 clusters for the reftable).  On the second walk, we
allocate 67 more refblocks for the refcounts in the old table that we
first treated as holes, plus 2 more refblocks for the tail of the file
holding the new reftable; these allocations still fit in the holes, and
are sufficient to fill up another hole.  However, the question remains -
does the hole in the old table get filled before or after the new table
has already passed by the hole?  If it is after, a third iteration would
allocate yet another refblock or two; but if it is before, then the 2nd
walk will already allocate refblocks when it encounters that spot in the
old reftable.  Remember, on the first iteration, we allocated a refblock
for 0-63 (slot 63), for 128-191 (slot 64), and so on - the reason the
second pass allocates a refblock for 64-127 is because the first pass
didn't populate that area of the file until it was already beyond
cluster 127.  But on the second pass, the first refblock we allocate for
clusters 64-127 will live somewhere around slot 8576, so it will be seen
during the second walk.  Even the refblock for clusters 8512-8575 will
end up being allocated somewhere around slot 8640, which still gets
visited during the second walk.  You may still be able to tweak things
to the point that we trigger an allocating third walk, but I'm not
readily seeing what that tweak would be.

At this point, I've spent far too long writing this email.  I haven't
completely ruled out the possibility of a corner case needing four
passes through the do loop, but the image sizes required to get there
are starting to be quite large compared to your simpler test of needing
three passes through the do loop.  I won't be bothered if we call it
good, and quit trying to come up with any other "interesting" allocation
sequencing.

--=20
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


--ivE7BT3ckQkLNe03DSOnHPNScuhvDXAvS
Content-Type: application/pgp-signature; name="signature.asc"
Content-Description: OpenPGP digital signature
Content-Disposition: attachment; filename="signature.asc"

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
Comment: Public key at http://people.redhat.com/eblake/eblake.gpg

iQEcBAEBCAAGBQJUbDAiAAoJEKeha0olJ0Nql98IAIGvRXHBuI8dSQDOxBZC0r56
aCQVji7utGcvpKJECmhGWN2UnA7UfAmyt7jZzAbiqRnNErIUJd9mfgs9Gl5MeQMI
u4pUWu6/8Zg5ZNe2qZKrdS7+mOVNhm/Df2cyByuJHBzEBrFIGoyNHR5ojkbRqIBJ
SWGNQBK/XPBoH9je4ap+jCeVg/Llw6xrUQN8gVPiyhzsQRXgVvruYYv+4OzAK5Fu
3G1CwKhHsWFYwU5tJ1cK+nA3I2KJ1oaqamWmtnShCqkQbiTW2NNIsckcG8qckMnf
/wbF5gycyYauxMgtNxvbJ0n54ecFld/zJZLJolxzaziFQ9E0e7IGOLdnvKw1kYk=
=kFZv
-----END PGP SIGNATURE-----

--ivE7BT3ckQkLNe03DSOnHPNScuhvDXAvS--