From: Max Reitz <mreitz@redhat.com>
To: qemu-devel@nongnu.org
Cc: Kevin Wolf <kwolf@redhat.com>, Peter Lieven <pl@kamp.de>,
Stefan Hajnoczi <stefanha@redhat.com>
Subject: Re: [Qemu-devel] [PATCH v2 00/12] qcow2: Add new overlap check functions
Date: Mon, 15 Dec 2014 10:43:31 +0100
Message-ID: <548EAD43.90407@redhat.com>
In-Reply-To: <1416844620-17717-1-git-send-email-mreitz@redhat.com>
On 2014-11-24 at 16:56, Max Reitz wrote:
> As has been requested, this series adds new overlap check functions to
> the qcow2 code. My local branch is called "qcow2-improved-overlap-v1",
> but I am not so sure whether it is actually an improvement; that is left
> for you to decide, dear reviewers.
>
> See patch 1 for an explanation of why this series exists and what it
> does. Patch 1 is basically the core of this series, the rest just
> employs the functions introduced there.
>
> In a later patch, we may want to change the meaning of the "constant"
> overlap checking option to mean the same as "cached", which is
> everything except for inactive L2 tables. This series does make
> checking for overlaps with inactive L2 tables at runtime just as cheap
> as everything else (constant time plus caching), but using these checks
> means qemu has to read all the snapshot L1 tables when opening a qcow2
> file. This does not take long, of course, but it does result in a bit of
> overhead so I did not want to enable it by default.
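>
> For reference: the mode is selected via the qcow2 "overlap-check"
> runtime option. Assuming the option spelling stays as it is today, a
> typical invocation would look like this:
>
> $ qemu-system-x86_64 \
>     -drive file=test.qcow2,format=qcow2,overlap-check=cached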
>
> I think just enabling all overlap checks by default after this series
> should be fine and useful, though.
>
>
> Benchmarks
> ==========
>
> First, let me repeat the results I wrote as a reply to the cover letter
> of v1:
>
> I had a 1 GB qcow2 image in tmpfs (sorry, I "only" have 4 GB of RAM in
> this laptop, and a 2 GB image would kill it) which I accessed from a
> guest running in qemu (Arch Linux, to be exact, because its image just
> dumps you into a console and that is all I need). I ran
>
> $ for i in $(seq 1 42); do \
>       dd if=/dev/zero of=/dev/vda bs=65536; \
>   done
>
> because it does not matter what data is written to the image, I just
> need to write to a lot of clusters fast to maximize pressure on the
> overlap check.
>
> The results were 13.5/10.5/12.41 % of CPU usage (recorded using perf) in
> qcow2_check_metadata_overlap() with the current implementation and
> 0.08/0.09 % with this series applied (0.08 % with overlap-check=cached,
> 0.09 % with overlap-check=all).
>
> Now as a table, just to have one here (and because it is useful when
> skipping through this text):
>
> Current implementation: 12.14 (± 1.519) % (n = 3)
> With this series applied: 0.08 % (n = 1)
> overlap-check=all: 0.09 % (n = 1)
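>
> For reproduction, something along these lines should do (a sketch; the
> exact perf invocation used may have differed):
>
> $ perf record -ag -- sleep 60   # while the dd loop runs in the guest
> $ perf report --stdio | grep qcow2_check_metadata_overlap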
>
>
> Now I did some other benchmarks.
>
> The first of those is just running (on the host, obviously):
>
> $ perf record -ag \
> qemu-img create -f qcow2 -o preallocation=metadata,cluster_size=512 \
> /tmp/test.qcow2 2G
>
> I ran ten rounds with the current implementation, ten rounds with these
> patches applied and five rounds with cache_size set to 0 in
> qcow2_create_empty_metadata_list() (which results in only one bitmap
> existing at one time, and therefore a lot of conversions between the
> bitmap and the list representation).
>
> The current implementation had a CPU usage (fraction of cycles) in
> qcow2_check_metadata_overlap() of 47.65 (± 4.24) %. There was one rather
> extreme outlier (36.31 %); with that removed, it is 48.91 (± 1.54) %.
>
> With this series applied, the usage dropped to 0.149 (± 0.028) %.
> Additionally, qcow2_metadata_list_enter() took 0.046 (± 0.021) %. Both
> together took thus 0.195 (± 0.040) %.
>
> With cache_size set to 0, the usage was 0.130 % (± 0.012 %) in
> qcow2_check_metadata_overlap(), 0.056 % (± 0.011 %) in
> qcow2_metadata_list_enter(), and 0.186 % (± 0.021 %) for both together.
> The usage dropped, but the difference is still within one standard
> deviation. Thus, heavy conversion between the bitmap and list
> representations (which random accesses in normal use cases will cause)
> should not be a problem at all.
>
> As a table:
>
> Current implementation: 48.91 (± 1.537) % (n = 9)
> With this series applied: 0.195 (± 0.040) % (n = 10)
> Only one bitmap cached: 0.186 (± 0.021) % (n = 5)
>
> And as a really noticeable consequence: Before this series, creating
> the image like that took 16.624 s (15.97 s user). With this series, it
> takes 4.502 s (3.94 s user).
>
> Because statistics are fun, I collected the number of metadata list
> operations for the second and the third tests:
>
> Second test (default of 15 bitmaps cached):
>
> List to bitmap conversions: 1053
> Bitmap to list conversions: 1038
> Bitmaps deallocated: 1038
>
> Insert operations: 82266
> Remove operations: 14
> Queries: 312265
>
> Third test (only one bitmap cached):
>
> List to bitmap conversions: 135943
> Bitmap to list conversions: 66546
> Bitmaps deallocated: 135942
>
> Insert operations: 82266
> Remove operations: 14
> Queries: 312265
>
> As expected, the number of conversions is much higher. Interestingly, in
> the first case, every bitmap deallocation also meant a bitmap to list
> conversion; that means every bitmap was modified at least once while it
> was cached. In contrast to that, in the second case, only every second
> deallocation resulted in a conversion, which means that half of the
> cached bitmaps were only read (which is bad, considering all that was
> being done was allocating metadata structures).
>
> But for performance, there is no real information here (we only see that
> setting cache_size to 0 actually did increase the number of
> conversions).
>
>
> The second benchmark I ran today was a program which opened /dev/vda in
> qemu (again on Arch) with O_DIRECT | O_SYNC | O_RDWR. It then wrote an
> uninitialized 512 byte buffer to random 512-byte-aligned offsets in a
> 1 GB image freshly created before VM launch with metadata preallocation
> and 512 byte clusters. The intention was to break bitmap caching and to
> force conversions back and forth between the list and bitmap
> representations.
> The results are as follows:
>
> Current implementation: 18.73 (± 0.157) % (n = 10)
> With this series applied: 0.068 (± 0.009) % (n = 10)
> Only one bitmap cached: 0.062 (± 0.008) % (n = 5)
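>
> A sketch of what such a program can look like (not the exact one used;
> the device path and iteration count are placeholders, and error
> handling is omitted):
>
> #define _GNU_SOURCE /* for O_DIRECT */
> #include <fcntl.h>
> #include <stdint.h>
> #include <stdlib.h>
> #include <unistd.h>
>
> int main(void)
> {
>     void *buf;
>     /* O_DIRECT requires an aligned buffer; it stays uninitialized */
>     posix_memalign(&buf, 512, 512);
>
>     int fd = open("/dev/vda", O_DIRECT | O_SYNC | O_RDWR);
>     uint64_t sectors = (1ull << 30) / 512; /* 1 GB disk */
>
>     for (long i = 0; i < 1000000; i++) {
>         uint64_t ofs = ((uint64_t)rand() % sectors) * 512;
>         pwrite(fd, buf, 512, ofs);
>     }
>
>     close(fd);
>     return 0;
> }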
>
>
> Runtime conclusion
> ------------------
>
> As can be seen from the benchmarks, runtime CPU usage by the metadata
> overlap checks is greatly decreased by this series (to 1/150 in the
> first benchmark, to 1/250 in the second, and to 1/275 in the third).
>
>
> Where is the caveat?
> --------------------
>
> The problem with this series is obviously that all the metadata
> information is read when reading a qcow2 image. iotest 044 is a good
> test for this. I personally haven't seen a problem here, but I cannot
> rule any out. I never noticed any waits when opening images.
>
> Approaching this theoretically, it becomes clear that there should not
> be any problems here. With the default of overlap-check=cached, only
> information which is available anyway is used to build up the metadata
> list (the same information that is iterated through for every overlap
> check in the current implementation). Building the list simply means
> setting bits in a bitmap and then transforming that bitmap into a list,
> which the second benchmark has shown not to pose any performance
> issues.
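>
> To illustrate the principle (this is just the idea; the series' actual
> code with its windows and fragment types looks different): the bitmap
> has one bit per cluster marking metadata, and converting it to a list
> means collapsing runs of set bits into (start, count) fragments, over
> which an overlap check is then a simple search:
>
> #include <stdbool.h>
> #include <stddef.h>
> #include <stdint.h>
>
> #define BITS_PER_LONG (sizeof(unsigned long) * 8)
>
> typedef struct Fragment {
>     uint64_t start; /* first metadata cluster of this run */
>     uint64_t count; /* number of consecutive metadata clusters */
> } Fragment;
>
> static size_t bitmap_to_list(const unsigned long *bm,
>                              uint64_t nb_clusters, Fragment *list)
> {
>     size_t n = 0;
>     for (uint64_t i = 0; i < nb_clusters; i++) {
>         if (!(bm[i / BITS_PER_LONG] & (1ul << (i % BITS_PER_LONG)))) {
>             continue;
>         }
>         if (n && list[n - 1].start + list[n - 1].count == i) {
>             list[n - 1].count++; /* extend the current run */
>         } else {
>             list[n++] = (Fragment){ .start = i, .count = 1 };
>         }
>     }
>     return n;
> }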
>
> So, the only caveat I do see is increased code complexity. I initially
> thought it might be too complex for its purpose; but after having done
> the benchmarks, it became apparent to me that there is a problem with
> the current implementation and that this series does fix it.
>
> And RAM usage may pose a problem.
>
>
> RAM usage
> =========
>
> So I have looked at my 2 GB image above, and the list uses 40 kB, which
> may or may not be too much (sounds completely fine to me for an image
> with 512 byte clusters); but it is at least a number I can use for
> testing the following theoretical inspection:
>
>
> So, once again, we assume the worst. Every metadata structure needs its
> own entry in the list - actually, the worst would be that every cluster
> needs its own entry, but that only makes a difference for L1 tables and
> the refcount table, so we can ignore that. In fact, we can completely
> ignore those tables; it makes things much easier.
>
> Let's name the cluster size "CS", and name the image size in bytes "IS".
>
> So, for a refcount width of 16 bits per entry, we can describe CS / 2
> clusters per refblock; which means we need (IS / CS) / (CS / 2) =
> 2 * IS / (CS * CS) refblocks. Let's be generous and say that the L2
> tables are capable of describing the complete image size (which in
> practice they do not because the guest disk size is smaller than the
> physical size). This assumption also has the neat side-effect of not
> having to care about snapshots. If we have a lot of internal snapshots
> with copied L2 tables, IS grows and therefore the next formula knows
> that the number of L2 tables grows as well. Nice. So, as every L2 table
> can describe CS / 8 (guest) clusters, we need 8 * IS / (CS * CS) L2
> tables.
>
> Therefore, ignoring the image header, L1 tables, the reftable and the
> snapshot table, we have about the following amount of metadata clusters
> per image (with 16 bit refcount entries):
>
> (2 + 8) * IS / (CS * CS) = 10 * IS / (CS * CS)
>
> We said that for every metadata cluster, we would need one entry in the
> list. Every list entry takes 32 bits (4 bytes). Thus, we need
> approximately up to
>
> 40 * IS / (CS * CS)
>
> bytes for the metadata list (please ignore units and simply append
> "bytes" to the resulting number...).
>
> Let's test that for the above image, which has a disk size of 266 MB:
>
> 40 * 266M / (512 * 512) = 42 kB
>
> Great! It works.
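>
> The same estimate as a throwaway helper, for playing with the numbers
> below (hypothetical; not part of the series):
>
> #include <inttypes.h>
> #include <stdint.h>
> #include <stdio.h>
>
> /* 2 * IS / CS^2 refblocks plus 8 * IS / CS^2 L2 tables, at 4 bytes
>  * of list entry per metadata cluster */
> static uint64_t list_size_estimate(uint64_t is, uint64_t cs)
> {
>     return 40 * is / (cs * cs);
> }
>
> int main(void)
> {
>     printf("%" PRIu64 "\n", list_size_estimate(266ull << 20, 512));
>     /* prints 42560, i.e. circa 42 kB */
>     return 0;
> }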
>
>
> So, now let's set CS to 64 kB, because that is basically the only
> cluster size we really care about. For a 1 TB image, we need 10 kB for
> the list. Sounds great to me. For a 1 PB image, we will need 10 MB. Fair
> enough. (Note that you don't need 10 MB of RAM to create a 1 PB image;
> you only need that once the disk size of the image has reached 1 PB).
>
> And 1 TB with 512 byte clusters? 160 MB. Urgh, that is a lot. But then
> again, you can switch off the overlap check with overlap-check=off; and
> trying to actually use a 1 TB image with 512 byte clusters is crazy in
> itself (have you tried just creating one without preallocation? It takes
> forever). So I can live with that.
>
>
> tl;dr
> =====
>
> * CPU usage at runtime decreased by a factor of 150 to 275 on
> overlap-check-heavy tasks
> * No additional performance problems at loading time (in theory, it
> has the same runtime complexity as a single overlap check right now;
> in practice, I could not find any problems)
> * Decent RAM usage (10 kB for a 1 TB image with 64 kB clusters; 10 MB
> for a 1 PB image, and so on)
>
>
>
> As of this version, this series depends on
> "[PATCH v6] qcow2: Buffer L1 table in snapshot refcount update".
>
>
> v2:
> - Patch 1:
> - Fixed a mistake regarding the value of
> Qcow2MetadataFragment::relative_start; this value should be relative
> to the window start, not relative to the end of the last fragment
> (destroy_window_bitmap() wrote the latter there in v1)
> - Use uint64_t for the window index everywhere
> - Patch 8: Said "qcow2: Buffer L1 table in snapshot refcount update"
> removes l1_allocated in qcow2_update_snapshot_refcount() and basically
> replaces it by active_l1 (where active_l1 = !l1_allocated). In v1, the
> condition it was used in was actually wrong; this is fixed here, too
> (the active L2 cluster type should be removed from the list if we are
> working on the active L1 table, not if we are working on an inactive
> L1 table).
>
>
> git-backport-diff against v1:
>
> Key:
> [----] : patches are identical
> [####] : number of functional differences between upstream/downstream patch
> [down] : patch is downstream-only
> The flags [FC] indicate (F)unctional and (C)ontextual differences, respectively
>
> 001/12:[0010] [FC] 'qcow2: Add new overlap check functions'
> 002/12:[----] [--] 'qcow2: Pull up overlap check option evaluation'
> 003/12:[----] [--] 'qcow2: Create metadata list'
> 004/12:[----] [--] 'qcow2/overlaps: Protect image header'
> 005/12:[----] [--] 'qcow2/overlaps: Protect refcount table'
> 006/12:[----] [--] 'qcow2/overlaps: Protect refcount blocks'
> 007/12:[----] [--] 'qcow2/overlaps: Protect active L1 table'
> 008/12:[0002] [FC] 'qcow2/overlaps: Protect active L2 tables'
> 009/12:[----] [--] 'qcow2/overlaps: Protect snapshot table'
> 010/12:[----] [--] 'qcow2/overlaps: Protect inactive L1 tables'
> 011/12:[----] [-C] 'qcow2/overlaps: Protect inactive L2 tables'
> 012/12:[----] [--] 'qcow2: Use new metadata overlap check function'
>
>
> Max Reitz (12):
> qcow2: Add new overlap check functions
> qcow2: Pull up overlap check option evaluation
> qcow2: Create metadata list
> qcow2/overlaps: Protect image header
> qcow2/overlaps: Protect refcount table
> qcow2/overlaps: Protect refcount blocks
> qcow2/overlaps: Protect active L1 table
> qcow2/overlaps: Protect active L2 tables
> qcow2/overlaps: Protect snapshot table
> qcow2/overlaps: Protect inactive L1 tables
> qcow2/overlaps: Protect inactive L2 tables
> qcow2: Use new metadata overlap check function
>
> block/Makefile.objs | 3 +-
> block/qcow2-cluster.c | 13 ++
> block/qcow2-overlap.c | 400 +++++++++++++++++++++++++++++++++++++++++++++++++
> block/qcow2-refcount.c | 202 ++++++++++---------------
> block/qcow2-snapshot.c | 105 ++++++++++++-
> block/qcow2.c | 130 ++++++++++------
> block/qcow2.h | 13 ++
> 7 files changed, 694 insertions(+), 172 deletions(-)
> create mode 100644 block/qcow2-overlap.c
Ping