Message-ID: <548EAD43.90407@redhat.com>
Date: Mon, 15 Dec 2014 10:43:31 +0100
From: Max Reitz
In-Reply-To: <1416844620-17717-1-git-send-email-mreitz@redhat.com>
Subject: Re: [Qemu-devel] [PATCH v2 00/12] qcow2: Add new overlap check functions
To: qemu-devel@nongnu.org
Cc: Kevin Wolf, Peter Lieven, Stefan Hajnoczi

On 2014-11-24 at 16:56, Max Reitz wrote:
> As has been requested, this series adds new overlap check functions to
> the qcow2 code. My local branch is called "qcow2-improved-overlap-v1",
> but I am not so sure whether it is actually an improvement; that is left
> for you to decide, dear reviewers.
>
> See patch 1 for an explanation of why this series exists and what it
> does. Patch 1 is basically the core of this series; the rest just
> employs the functions introduced there.
>
> In a later patch, we may want to change the meaning of the "constant"
> overlap checking option to mean the same as "cached", which is
> everything except for inactive L2 tables. This series does make
> checking for overlaps with inactive L2 tables at runtime just as cheap
> as everything else (constant time plus caching), but using these checks
> means qemu has to read all the snapshot L1 tables when opening a qcow2
> file. This does not take long, of course, but it does result in a bit of
> overhead, so I did not want to enable it by default.
>
> I think just enabling all overlap checks by default after this series
> should be fine and useful, though.
>
>
> Benchmarks
> ==========
>
> First, let me repeat the results I wrote as a reply to the cover letter
> of v1:
>
> I had a 1 GB qcow2 image in tmpfs (sorry, I "only" have 4 GB of RAM in
> this laptop, and a 2 GB image will kill it) which I accessed from a
> guest inside qemu (Arch Linux, to be exact, because its image just
> dumps you into a console and that's all I need). I ran
>
> $ for i in $(seq 1 42); do \
>       dd if=/dev/zero of=/dev/vda bs=65536; \
>   done
>
> because it does not matter what data is written to the image; I just
> need to write to a lot of clusters fast to maximize pressure on the
> overlap check.
>
> The results were 13.5/10.5/12.41 % of CPU usage (recorded using perf) in
> qcow2_check_metadata_overlap() with the current implementation, and
> 0.08/0.09 % with this series applied (0.08 % with overlap-check=cached,
> 0.09 % with overlap-check=all).
>
> Now as a table, just to have one here (and because it is useful when
> skimming through this text):
>
> Current implementation:   12.14 (± 1.519) % (n = 3)
> With this series applied:  0.08 %           (n = 1)
> overlap-check=all:         0.09 %           (n = 1)
>
>
> Now I did some other benchmarks.
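>
> (A note on how the in-guest perf numbers above were obtained: I sampled
> the qemu process on the host while the dd loop was running in the guest,
> roughly along these lines; I do not have the exact invocation at hand,
> so treat this as an illustration only:
>
> $ perf record -g -p "$(pidof qemu-system-x86_64)" -- sleep 60
> $ perf report --stdio | grep qcow2_check_metadata_overlap
>
> The percentages given are the fraction of recorded cycles attributed to
> qcow2_check_metadata_overlap().)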
>
> The first of those is just running (on the host, obviously):
>
> $ perf record -ag \
>       qemu-img create -f qcow2 -o preallocation=metadata,cluster_size=512 \
>       /tmp/test.qcow2 2G
>
> I ran ten rounds with the current implementation, ten rounds with these
> patches applied, and five rounds with cache_size set to 0 in
> qcow2_create_empty_metadata_list() (which results in only one bitmap
> existing at a time, and therefore a lot of conversions between the
> bitmap and the list representation).
>
> The current implementation had a CPU usage (fraction of cycles) in
> qcow2_check_metadata_overlap() of 47.65 (± 4.24) %. There was one rather
> extreme outlier (36.31 %); with that removed, it is 48.91 (± 1.54) %.
>
> With this series applied, the usage dropped to 0.149 (± 0.028) %.
> Additionally, qcow2_metadata_list_enter() took 0.046 (± 0.021) %. Both
> together thus took 0.195 (± 0.040) %.
>
> With cache_size set to 0, the usage was 0.130 % (± 0.012 %) in
> qcow2_check_metadata_overlap(), 0.056 % (± 0.011 %) in
> qcow2_metadata_list_enter(), and 0.186 % (± 0.021 %) in total. That is
> a drop, but still within the standard deviation. Thus, heavy conversion
> between the bitmap and list representations (caused in normal use cases
> by random accesses) should not be a problem at all.
>
> As a table:
>
> Current implementation:   48.91 (± 1.537) % (n = 9)
> With this series applied:  0.195 (± 0.040) % (n = 10)
> Only one bitmap cached:    0.186 (± 0.021) % (n = 5)
>
> And as a really noticeable consequence: Before this series, creating
> the image like that took 16.624 s (15.97 s user). With this series, it
> takes 4.502 s (3.94 s user).
>
> Because statistics are fun, I collected the number of metadata list
> operations for the second and the third tests:
>
> Second test (default of 15 bitmaps cached):
>
> List to bitmap conversions:   1053
> Bitmap to list conversions:   1038
> Bitmaps deallocated:          1038
>
> Insert operations:           82266
> Remove operations:              14
> Queries:                    312265
>
> Third test (only one bitmap cached):
>
> List to bitmap conversions: 135943
> Bitmap to list conversions:  66546
> Bitmaps deallocated:        135942
>
> Insert operations:           82266
> Remove operations:              14
> Queries:                    312265
>
> As expected, the number of conversions is much higher. Interestingly, in
> the former case, every bitmap deallocation also meant a bitmap to list
> conversion; that is, every bitmap was modified at least once while it
> was cached. In contrast, in the latter case, only every second
> deallocation resulted in a conversion, which means that half of the
> cached bitmaps were only read (which is bad considering that all that
> was done was allocating metadata structures).
>
> But for performance, there is no real information here (we only see that
> setting cache_size to 0 did indeed increase the number of conversions).
>
>
> The second benchmark I ran today was a program which opened /dev/vda in
> the guest (again Arch Linux) with O_DIRECT | O_SYNC | O_RDWR. It then
> wrote an uninitialized 512 byte buffer to random 512-byte-aligned
> offsets of a 1 GB image freshly created before VM launch with metadata
> preallocation and 512 byte clusters. The intention was to break bitmap
> caching and force conversions back and forth between the list and
> bitmap representations.
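>
> For reference, that program was essentially the following (a
> reconstructed sketch rather than the exact source I used; in particular,
> the iteration count and the use of rand() are arbitrary here):
>
> #define _GNU_SOURCE           /* for O_DIRECT */
> #include <fcntl.h>
> #include <stdint.h>
> #include <stdlib.h>
> #include <unistd.h>
>
> #define DISK_SIZE   (1ULL << 30)   /* 1 GB guest disk (/dev/vda) */
> #define SECTOR_SIZE 512ULL
>
> int main(void)
> {
>     int fd = open("/dev/vda", O_DIRECT | O_SYNC | O_RDWR);
>     void *buf;
>     int i;
>
>     if (fd < 0) {
>         return 1;
>     }
>     /* O_DIRECT requires an aligned buffer; its contents do not matter */
>     if (posix_memalign(&buf, SECTOR_SIZE, SECTOR_SIZE)) {
>         return 1;
>     }
>
>     for (i = 0; i < 1000000; i++) {
>         /* pick a random 512-byte-aligned offset within the 1 GB disk */
>         uint64_t sector = (uint64_t)rand() % (DISK_SIZE / SECTOR_SIZE);
>
>         if (pwrite(fd, buf, SECTOR_SIZE, sector * SECTOR_SIZE) < 0) {
>             return 1;
>         }
>     }
>
>     free(buf);
>     close(fd);
>     return 0;
> }
>
> Because of O_DIRECT | O_SYNC, none of these writes are absorbed by the
> guest page cache, so every single request reaches qemu's qcow2 layer.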
> The results are as follows:
>
> Current implementation:   18.73 (± 0.157) % (n = 10)
> With this series applied:  0.068 (± 0.009) % (n = 10)
> Only one bitmap cached:    0.062 (± 0.008) % (n = 5)
>
>
> Runtime conclusion
> ------------------
>
> As can be seen from the benchmarks, runtime CPU usage by the metadata
> overlap checks is greatly decreased by this series (to 1/150 in the
> first benchmark, to 1/250 in the second, and to 1/275 in the third).
>
>
> Where is the caveat?
> --------------------
>
> The problem with this series is obviously that all the metadata
> information is read when opening a qcow2 image. iotest 044 is a good
> test for this. I personally have not seen a problem here, but I cannot
> rule one out. I never noticed any delays when opening images.
>
> Approaching this theoretically, it becomes clear that there should not
> be any problems here. With the default of overlap-check=cached, only
> information that is available anyway is used to build up the metadata
> list (the same information that is iterated through for every overlap
> check in the current implementation). Since building the list simply
> means setting a bitmask in a bitmap and then transforming that bitmap
> into a list (which the second benchmark has shown not to be a
> performance issue), there should not be any problem here.
>
> So, the only caveat I do see is increased code complexity. I initially
> thought it might be too complex for its purpose; but after having done
> the benchmarks, it became apparent to me that there is a problem with
> the current implementation and that this series does fix it.
>
> And RAM usage may pose a problem.
>
>
> RAM usage
> =========
>
> So I have looked at my 2 GB image above, and the list uses 40 kB, which
> may or may not be too much (it sounds completely fine to me for an image
> with 512 byte clusters); but it is at least a number I can use to test
> the following theoretical inspection:
>
> So, once again, we assume the worst. Every metadata structure needs its
> own entry in the list. Actually, the worst case would be that every
> cluster needs its own entry, but that only makes a difference for L1
> tables and the refcount table, so we can ignore it. In fact, we can
> completely ignore those tables; it makes things much easier.
>
> Let's name the cluster size "CS" and the image size in bytes "IS".
>
> For a refcount width of 16 bits per entry, we can describe CS / 2
> clusters per refblock, which means we need (IS / CS) / (CS / 2) =
> 2 * IS / (CS * CS) refblocks. Let's be generous and say that the L2
> tables are capable of describing the complete image size (which in
> practice they do not, because the guest disk size is smaller than the
> physical size). This assumption also has the neat side effect of not
> having to care about snapshots: if we have a lot of internal snapshots
> with copied L2 tables, IS grows, and therefore the next formula knows
> that the number of L2 tables grows as well. Nice. So, as every L2 table
> can describe CS / 8 (guest) clusters, we need 8 * IS / (CS * CS) L2
> tables.
>
> Therefore, ignoring the image header, the L1 tables, the reftable and
> the snapshot table, we have about the following number of metadata
> clusters per image (with 16 bit refcount entries):
>
> (2 + 8) * IS / (CS * CS) = 10 * IS / (CS * CS)
>
> We said that for every metadata cluster, we would need one entry in the
> list. Every list entry takes 32 bits (4 bytes).
> Thus, we need approximately up to
>
> 40 * IS / (CS * CS)
>
> bytes for the metadata list (please ignore units and just append "bytes"
> to the resulting number...).
>
> Let's test that for the above image, which has a disk size of 266 MB:
>
> 40 * 266M / (512 * 512) = 42 kB
>
> Great! It works.
>
>
> So, now let's set CS to 64 kB, because that is basically the only
> cluster size we really care about. For a 1 TB image, we need 10 kB for
> the list. Sounds great to me. For a 1 PB image, we will need 10 MB. Fair
> enough. (Note that you don't need 10 MB of RAM to create a 1 PB image;
> you only need that once the disk size of the image has reached 1 PB.)
>
> And 1 TB with 512 byte clusters? 160 MB. Urgh, that is a lot. But then
> again, you can switch off the overlap check with overlap-check=off; and
> trying to actually use a 1 TB image with 512 byte clusters is crazy in
> itself (have you tried just creating one without preallocation? It takes
> forever). So I can live with that.
>
>
> tl;dr
> =====
>
> * CPU usage at runtime decreased by a factor of 150 to 275 on
>   overlap-check-heavy tasks
> * No additional performance problems at loading time (in theory, loading
>   has the same runtime complexity as a single overlap check has right
>   now; in practice, I could not find any problems)
> * Decent RAM usage (10 kB for a 1 TB image with 64 kB clusters, 10 MB
>   for a 1 PB image, and so on)
>
>
> As of this version, this series depends on
> "[PATCH v6] qcow2: Buffer L1 table in snapshot refcount update".
>
>
> v2:
> - Patch 1:
>   - Fixed a mistake regarding the value of
>     Qcow2MetadataFragment::relative_start; this value should be relative
>     to the window start, not relative to the end of the last fragment
>     (destroy_window_bitmap() wrote the latter there in v1)
>   - Use uint64_t for the window index everywhere
> - Patch 8: Said "qcow2: Buffer L1 table in snapshot refcount update"
>   removes l1_allocated in qcow2_update_snapshot_refcount() and basically
>   replaces it with active_l1 (where active_l1 = !l1_allocated). In v1,
>   the condition it was used in was actually wrong; this is fixed here,
>   too (the active L2 cluster type should be removed from the list if we
>   are working on the active L1 table, not if we are working on an
>   inactive L1 table).
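>
> In case anyone wants to play with the different check modes while
> reviewing: the overlap-check setting used throughout this mail is a
> per-drive runtime option, selected for example like this (illustrative
> command line, not my exact test setup):
>
> $ qemu-system-x86_64 -m 1G \
>     -drive file=/tmp/test.qcow2,format=qcow2,overlap-check=cached
>
> The other modes mentioned above are selected the same way.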
>
>
> git-backport-diff against v2:
>
> Key:
> [----] : patches are identical
> [####] : number of functional differences between upstream/downstream patch
> [down] : patch is downstream-only
> The flags [FC] indicate (F)unctional and (C)ontextual differences, respectively
>
> 001/12:[0010] [FC] 'qcow2: Add new overlap check functions'
> 002/12:[----] [--] 'qcow2: Pull up overlap check option evaluation'
> 003/12:[----] [--] 'qcow2: Create metadata list'
> 004/12:[----] [--] 'qcow2/overlaps: Protect image header'
> 005/12:[----] [--] 'qcow2/overlaps: Protect refcount table'
> 006/12:[----] [--] 'qcow2/overlaps: Protect refcount blocks'
> 007/12:[----] [--] 'qcow2/overlaps: Protect active L1 table'
> 008/12:[0002] [FC] 'qcow2/overlaps: Protect active L2 tables'
> 009/12:[----] [--] 'qcow2/overlaps: Protect snapshot table'
> 010/12:[----] [--] 'qcow2/overlaps: Protect inactive L1 tables'
> 011/12:[----] [-C] 'qcow2/overlaps: Protect inactive L2 tables'
> 012/12:[----] [--] 'qcow2: Use new metadata overlap check function'
>
>
> Max Reitz (12):
>   qcow2: Add new overlap check functions
>   qcow2: Pull up overlap check option evaluation
>   qcow2: Create metadata list
>   qcow2/overlaps: Protect image header
>   qcow2/overlaps: Protect refcount table
>   qcow2/overlaps: Protect refcount blocks
>   qcow2/overlaps: Protect active L1 table
>   qcow2/overlaps: Protect active L2 tables
>   qcow2/overlaps: Protect snapshot table
>   qcow2/overlaps: Protect inactive L1 tables
>   qcow2/overlaps: Protect inactive L2 tables
>   qcow2: Use new metadata overlap check function
>
>  block/Makefile.objs    |   3 +-
>  block/qcow2-cluster.c  |  13 ++
>  block/qcow2-overlap.c  | 400 +++++++++++++++++++++++++++++++++++++++++++++++
>  block/qcow2-refcount.c | 202 ++++++++++---------------
>  block/qcow2-snapshot.c | 105 ++++++++++++-
>  block/qcow2.c          | 130 ++++++++++------
>  block/qcow2.h          |  13 ++
>  7 files changed, 694 insertions(+), 172 deletions(-)
>  create mode 100644 block/qcow2-overlap.c

Ping