From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:39074)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <mreitz@redhat.com>) id 1XrS5t-0004Xz-83
	for qemu-devel@nongnu.org; Thu, 20 Nov 2014 08:48:27 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <mreitz@redhat.com>) id 1XrS5n-0006K8-0i
	for qemu-devel@nongnu.org; Thu, 20 Nov 2014 08:48:21 -0500
Received: from mx1.redhat.com ([209.132.183.28]:35602)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <mreitz@redhat.com>) id 1XrS5m-0006K0-Oh
	for qemu-devel@nongnu.org; Thu, 20 Nov 2014 08:48:14 -0500
Message-ID: <546DF113.70403@redhat.com>
Date: Thu, 20 Nov 2014 14:48:03 +0100
From: Max Reitz <mreitz@redhat.com>
MIME-Version: 1.0
References: <1415970374-16811-1-git-send-email-mreitz@redhat.com>
	<1415970374-16811-22-git-send-email-mreitz@redhat.com>
	<5467682E.1030604@redhat.com> <5469E4C0.8040509@redhat.com>
	<546BAB8E.3090200@redhat.com>
In-Reply-To: <546BAB8E.3090200@redhat.com>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: quoted-printable
Subject: Re: [Qemu-devel] [PATCH v2 21/21] iotests: Add test for different
	refcount widths
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Eric Blake <eblake@redhat.com>, qemu-devel@nongnu.org
Cc: Kevin Wolf <kwolf@redhat.com>, Peter Lieven <pl@kamp.de>, Stefan Hajnoczi <stefanha@redhat.com>

On 2014-11-18 at 21:26, Eric Blake wrote:
> On 11/17/2014 05:06 AM, Max Reitz wrote:
>
>>> Umm, that sounds backwards from what you document.  It's a good test =
of
>>> the _new_ reftable needing a second round of allocations.  So keep it
>>> with corrected comments.  But I think you _intended_ to write a test
>>> that starts with a refcount_width=3D64 image and resize to a
>>> refcount_width=3D1, where the _old_ reftable then suffers a reallocat=
ion
>>> as part of allocating refblocks for the new table.  It may even help =
if
>>> you add a tracepoint for every iteration through the walk function
>>> callback, to prove we are indeed executing it 3 times instead of the
>>> usual 2, for these test cases.
>> I'm currently thinking about a way to test the old reftable reallocati=
on
>> issue, and I can't find any. So, for the old reftable to require a
>> reallocation it must grow. For it to grow we need some allocation beyo=
nd
>> what it can currently represent. For this to happen during the refbloc=
k
>> allocation walk, this allocation must be the allocation of a new refbl=
ock.
>>
>> If the refblock is allocated beyond the current reftable's limit, this
>> means that either all clusters between free_cluster_index and that poi=
nt
>> are already taken. If the reftable is then reallocated, it will
>> therefore *always* be allocated behind that refblock, which is beyond
>> its old limit. Therefore, that walk through the old reftable will neve=
r
>> miss that new allocation.
>>
>> So the issue can only occur if the old reftable is resized after the
>> walk through it, that is, when allocating the new reftable. That is
>> indeed an issue but I think it manifests itself basically like the iss=
ue
>> I'm testing here: There is now an area in the old refcount structures
>> which was free before but has is used now, and the allocation causing
>> that was the allocation of the new reftable. The only difference is
>> whether the it's the old or the new reftable that resides in the
>> previously free area. Thus, I think I'll leave it at this test =E2=80=93=
 but if
>> you can describe to me how to create an image for a different "rewalk"
>> path, I'm all ears.
> =3D=3D=3D=3D=3D
> The test you wrote does:
>
> original image, pre-walk:
> reftable is one cluster; with one refblock and 63 zero entries
>   that refblock holds 4096 width-1 refcounts; of those, the first 63 ar=
e
> non-zero, the remaining are zero. Image is 32256 bytes long
>
> During the first walk, we call operation() 64 times - the first time
> with refblock_empty false, the remaining 63 times with refblock_empty t=
rue.
>
> after first walk but before reftable allocation, we have allocated one
> refblock that holds 64 width-64 refcounts (all zero, because we don't
> populate them until the final walk); and the old table now has 64
> refcounts populated. Image is 32768 bytes long.
>
> Then we allocating a new reftable; so far, we only created one refblock
> for it to hold, so one cluster is sufficient. The allocation causes the
> old table to now have 65 refcounts populated. Image is now 33280 bytes =
long.
>
> On the second pass, we call operation() 64 times; now the first two
> walks have refblock_empty as false, which means we allocate a new
> refblock.  This allocation causes the old table to now have 66 refcount=
s
> populated. Image is now 33792 bytes long.
>
> So we free our first attempt at a new reftable, and allocate another (a
> single cluster is still sufficient to hold two refblocks); I'm not sure
> whether this free/realloc will reuse cluster 65 or if it will pick up
> cluster 67 and leave a hole in 65.  [I guess it depends on whether
> cluster allocation is done by first-fit analysis or whether it blindly
> favors allocating at the end of the image].

There is a free_cluster_index to speed up finding the first fit. It's=20
reset when freeing clusters before that index, therefore cluster 65=20
should be reused.

> Either way, we have to do a
> third iteration, because the second iteration allocated a refblock and
> "reallocated" a reftable.
>
> On the third pass, operation() is still called 64 times, but because th=
e
> only two calls with refblock_empty as false already have an allocated
> refblock, no further allocations are needed, and we are done with the d=
o
> loop; the fourth walk can set refcounts.
>
> =3D=3D=3D=3D=3D
> The test I thought you were writing would start
>
> original image, pre-walk:
> reftable is one cluster; with one refblock and 63 zero entries
>   that refblock holds 64 width-64 refcounts; of those, the first 63 are
> non-zero, the remaining are zero. Image is 32256 bytes long
>
> During the first walk, we call operation() 1 time, with refblock_empty
> false.
>
> after first walk but before reftable allocation, we have allocated one
> refblock that holds 4096 width-1 refcounts (all zero, because we don't
> populate them until the final walk); and the old table now has 64
> refcounts populated. Image is 32768 bytes long.
>
> Then we allocating a new reftable; so far, we only created one refblock
> for it to hold, so one cluster is sufficient. The allocation causes the
> old table to now have 66 refcounts populated (one for the new refblock,
> but also one for an additional refblock in the old table because the
> first refblock was full). Image is now 33792 bytes long.
>
> On the second pass, we call operation() 1 time with refblock_empty as
> false, so we don't need any allocation.
>
> Which means the test you wrote is correct, while my idea does NOT
> trigger the third walk, at least not for the initial file size of 32256.
>   You've been vindicated, you did it correctly :)
>
>
> =3D=3D=3D=3D=3D
> Now, in response to your question about some other 3-pass inducing
> pattern, let's think back to v1, where you questioned what would happen
> if a hole in the reftable gets turned into data due to a later
> allocation.  Let's see if I can come up with a scenario for that...
>
> Let's stick with a cluster size of 512, and use 32-bit and 64-bit width=
s
> as our two sizes.  If we downsize from 64 to 32 bits, then every two
> refblock clusters in the old table results in one call to operation()
> for the new table; conversely, if we upsize, then every refblock cluste=
r
> in the old table gives two calls to operation() in the new table.  The
> trick at hand is to come up with some image where we punch a hole so
> that on the first pass, we call operation() with refblock_empty true fo=
r
> one iteration (necessarily a call later than the first, since the image
> header guarantees the first refblock is not empty), but where we have
> data after the hole, where it is the later data that triggers the
> allocation that will finally start to fill the hole.
>
> How about starting with an image that occupies between 1.5 and 2
> refblocks worth of 32-width clusters (so an image anywhere between 193
> and 256 clusters, or between 98816 and 131072 bytes).  You should be
> able to figure out how many clusters this consumes for L1, L2, plus 1
> for header, reftable, and 2 for refblocks, in order to figure out how
> many remaining clusters are dedicated to data; ideally, the data
> clusters are contiguous, and occupy a swath that covers at least
> clusters 126 through 192.  Widening to 64-bit width will require 4
> refblocks instead of 2, if all refblocks are needed.  But the whole ide=
a
> of punching a hole is that we don't need a refblock if it will be
> all-zero entries.  So take this original image, and discard the data
> clusters from physical index 126 through 192, (this is NOT the data
> visible at guest offset 31744, but whatever actual offset of guest data
> that maps to physical offset 31744).  The old reftable now looks like {
> refblock_o1 [0-125 occupied, 126 and 127 empty], refblock_o2 [128-191
> empty, 192-whatever occupied, tail empty] }.  With no allocations
> required, this would in turn would map to the following new refblocks: =
{
> refblock_n1 [0-64 occupied], refblock_n2 [65-125 occupied, 126-127
> empty], NULL, refblock_n4 [192-whatever occupied] }.  Note that we do
> not need to allocate refblock_n3 because of the hole in the old
> refblock; we DO end up allocating three refblocks, but in the sequence
> of things, refblock_n1 and refblock_n2 are allocated while we are
> visiting refblock_o1 and still fit in refblock_o1, while refblock_n4 is
> not allocated until after we have already passed over the first half of
> refblock_o2.
>
> Thus, the second walk over the image will see that we need to allocate
> refblock_n3 because it now contains entries (in particular, the entry
> for refblock_n4, but also the 1-cluster entry for the proposed reftable
> that is allocated between the walks).  So, while your test used the
> allocation of the reftable as the spillover point, my scenario here use=
s
> the allocation of later refblocks as the spillover point that got misse=
d
> during the first iteration.

Sounds good, the only problem is that I'd have to hand-craft the image=20
myself, because qemu generally uses self-references for refblocks (when=20
allocating new refblocks, they will contain their own refcount).

I think this already would be too much effort (I'll reply to your second=20
email right away ;-)). There is no fundamental difference in how new=20
allocations for the new reftable and for the new refblocks are treated:=20
If there's a new allocation, respin. If that works for the new reftable,=20
that's enough to convince me it will work for new refblocks as well.

Max