Message-ID: <548EAD43.90407@redhat.com>
Date: Mon, 15 Dec 2014 10:43:31 +0100
From: Max Reitz
In-Reply-To: <1416844620-17717-1-git-send-email-mreitz@redhat.com>
Subject: Re: [Qemu-devel] [PATCH v2 00/12] qcow2: Add new overlap check functions
To: qemu-devel@nongnu.org
Cc: Kevin Wolf, Peter Lieven, Stefan Hajnoczi

On 2014-11-24 at 16:56, Max Reitz wrote:
> As has been requested, this series adds new overlap check functions to
> the qcow2 code. My local branch is called "qcow2-improved-overlap-v1",
> but I am not so sure whether it is actually an improvement; that is left
> for you to decide, dear reviewers.
>
> See patch 1 for an explanation of why this series exists and what it
> does. Patch 1 is basically the core of this series; the rest just
> employs the functions introduced there.
>
> In a later patch, we may want to change the meaning of the "constant"
> overlap checking option to mean the same as "cached", which is
> everything except for inactive L2 tables. This series does make
> checking for overlaps with inactive L2 tables at runtime just as cheap
> as everything else (constant time plus caching), but using these checks
> means qemu has to read all the snapshot L1 tables when opening a qcow2
> file. This does not take long, of course, but it does result in a bit of
> overhead, so I did not want to enable it by default.
>
> I think just enabling all overlap checks by default after this series
> should be fine and useful, though.
>
>
> Benchmarks
> ==========
>
> First, let me repeat the results I wrote as a reply to the cover letter
> of v1:
>
> I had a 1 GB qcow2 image in tmpfs (sorry, I "only" have 4 GB of RAM in
> this laptop, and a 2 GB image will kill it) which I accessed from a
> guest inside qemu (Arch Linux, to be exact, because its image just
> dumps you into a console and that's all I need). I ran
>
> $ for i in $(seq 1 42); do \
>       dd if=/dev/zero of=/dev/vda bs=65536; \
>   done
>
> because it does not matter what data is written to the image; I just
> need to write to a lot of clusters fast to maximize pressure on the
> overlap check.
>
> The results were 13.5/10.5/12.41 % of CPU usage (recorded using perf) in
> qcow2_check_metadata_overlap() with the current implementation, and
> 0.08/0.09 % with this series applied (0.08 % with overlap-check=cached,
> 0.09 % with overlap-check=all).
>
> Now as a table, just to have one here (and because it is useful when
> skimming through this text):
>
> Current implementation:   12.14 (± 1.519) % (n = 3)
> With this series applied:  0.08 %           (n = 1)
> overlap-check=all:         0.09 %           (n = 1)
>
>
> Now I did some other benchmarks.
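>
> (A note on how the in-guest perf numbers above were obtained: I sampled
> the qemu process on the host while the dd loop was running in the guest,
> roughly along these lines; I do not have the exact invocation at hand,
> so treat this as an illustration only:
>
> $ perf record -g -p "$(pidof qemu-system-x86_64)" -- sleep 60
> $ perf report --stdio | grep qcow2_check_metadata_overlap
>
> The percentages given are the fraction of recorded cycles attributed to
> qcow2_check_metadata_overlap().)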
>
> The first of those is just running (on the host, obviously):
>
> $ perf record -ag \
>       qemu-img create -f qcow2 -o preallocation=metadata,cluster_size=512 \
>       /tmp/test.qcow2 2G
>
> I ran ten rounds with the current implementation, ten rounds with these
> patches applied, and five rounds with cache_size set to 0 in
> qcow2_create_empty_metadata_list() (which results in only one bitmap
> existing at a time, and therefore a lot of conversions between the
> bitmap and the list representation).
>
> The current implementation had a CPU usage (fraction of cycles) in
> qcow2_check_metadata_overlap() of 47.65 (± 4.24) %. There was one rather
> extreme outlier (36.31 %); with that removed, it is 48.91 (± 1.54) %.
>
> With this series applied, the usage dropped to 0.149 (± 0.028) %.
> Additionally, qcow2_metadata_list_enter() took 0.046 (± 0.021) %. Both
> together thus took 0.195 (± 0.040) %.
>
> With cache_size set to 0, the usage was 0.130 % (± 0.012 %) in
> qcow2_check_metadata_overlap(), 0.056 % (± 0.011 %) in
> qcow2_metadata_list_enter(), and 0.186 % (± 0.021 %) in total. That is
> a drop, but still within the standard deviation. Thus, heavy conversion
> between the bitmap and list representations (caused in normal use cases
> by random accesses) should not be a problem at all.
>
> As a table:
>
> Current implementation:   48.91 (± 1.537) % (n = 9)
> With this series applied:  0.195 (± 0.040) % (n = 10)
> Only one bitmap cached:    0.186 (± 0.021) % (n = 5)
>
> And as a really noticeable consequence: Before this series, creating
> the image like that took 16.624 s (15.97 s user). With this series, it
> takes 4.502 s (3.94 s user).
>
> Because statistics are fun, I collected the number of metadata list
> operations for the second and the third tests:
>
> Second test (default of 15 bitmaps cached):
>
> List to bitmap conversions:   1053
> Bitmap to list conversions:   1038
> Bitmaps deallocated:          1038
>
> Insert operations:           82266
> Remove operations:              14
> Queries:                    312265
>
> Third test (only one bitmap cached):
>
> List to bitmap conversions: 135943
> Bitmap to list conversions:  66546
> Bitmaps deallocated:        135942
>
> Insert operations:           82266
> Remove operations:              14
> Queries:                    312265
>
> As expected, the number of conversions is much higher. Interestingly, in
> the former case, every bitmap deallocation also meant a bitmap to list
> conversion; that is, every bitmap was modified at least once while it
> was cached. In contrast, in the latter case, only every second
> deallocation resulted in a conversion, which means that half of the
> cached bitmaps were only read (which is bad considering that all that
> was done was allocating metadata structures).
>
> But for performance, there is no real information here (we only see that
> setting cache_size to 0 did indeed increase the number of conversions).
>
>
> The second benchmark I ran today was a program which opened /dev/vda in
> the guest (again Arch Linux) with O_DIRECT | O_SYNC | O_RDWR. It then
> wrote an uninitialized 512 byte buffer to random 512-byte-aligned
> offsets of a 1 GB image freshly created before VM launch with metadata
> preallocation and 512 byte clusters. The intention was to break bitmap
> caching and force conversions back and forth between the list and
> bitmap representations.
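>
> For reference, that program was essentially the following (a
> reconstructed sketch rather than the exact source I used; in particular,
> the iteration count and the use of rand() are arbitrary here):
>
> #define _GNU_SOURCE           /* for O_DIRECT */
> #include <fcntl.h>
> #include <stdint.h>
> #include <stdlib.h>
> #include <unistd.h>
>
> #define DISK_SIZE   (1ULL << 30)   /* 1 GB guest disk (/dev/vda) */
> #define SECTOR_SIZE 512ULL
>
> int main(void)
> {
>     int fd = open("/dev/vda", O_DIRECT | O_SYNC | O_RDWR);
>     void *buf;
>     int i;
>
>     if (fd < 0) {
>         return 1;
>     }
>     /* O_DIRECT requires an aligned buffer; its contents do not matter */
>     if (posix_memalign(&buf, SECTOR_SIZE, SECTOR_SIZE)) {
>         return 1;
>     }
>
>     for (i = 0; i < 1000000; i++) {
>         /* pick a random 512-byte-aligned offset within the 1 GB disk */
>         uint64_t sector = (uint64_t)rand() % (DISK_SIZE / SECTOR_SIZE);
>
>         if (pwrite(fd, buf, SECTOR_SIZE, sector * SECTOR_SIZE) < 0) {
>             return 1;
>         }
>     }
>
>     free(buf);
>     close(fd);
>     return 0;
> }
>
> Because of O_DIRECT | O_SYNC, none of these writes are absorbed by the
> guest page cache, so every single request reaches qemu's qcow2 layer.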
> The results are as follows:
>
> Current implementation:   18.73 (± 0.157) % (n = 10)
> With this series applied:  0.068 (± 0.009) % (n = 10)
> Only one bitmap cached:    0.062 (± 0.008) % (n = 5)
>
>
> Runtime conclusion
> ------------------
>
> As can be seen from the benchmarks, runtime CPU usage by the metadata
> overlap checks is greatly decreased by this series (to 1/150 in the
> first benchmark, to 1/250 in the second, and to 1/275 in the third).
>
>
> Where is the caveat?
> --------------------
>
> The problem with this series is obviously that all the metadata
> information is read when opening a qcow2 image. iotest 044 is a good
> test for this. I personally have not seen a problem here, but I cannot
> rule one out. I never noticed any delays when opening images.
>
> Approaching this theoretically, it becomes clear that there should not
> be any problems here. With the default of overlap-check=cached, only
> information that is available anyway is used to build up the metadata
> list (the same information that is iterated through for every overlap
> check in the current implementation). Since building the list simply
> means setting a bitmask in a bitmap and then transforming that bitmap
> into a list (which the second benchmark has shown not to be a
> performance issue), there should not be any problem here.
>
> So, the only caveat I do see is increased code complexity. I initially
> thought it might be too complex for its purpose; but after having done
> the benchmarks, it became apparent to me that there is a problem with
> the current implementation and that this series does fix it.
>
> And RAM usage may pose a problem.
>
>
> RAM usage
> =========
>
> So I have looked at my 2 GB image above, and the list uses 40 kB, which
> may or may not be too much (it sounds completely fine to me for an image
> with 512 byte clusters); but it is at least a number I can use to test
> the following theoretical inspection:
>
> So, once again, we assume the worst. Every metadata structure needs its
> own entry in the list. Actually, the worst case would be that every
> cluster needs its own entry, but that only makes a difference for L1
> tables and the refcount table, so we can ignore it. In fact, we can
> completely ignore those tables; it makes things much easier.
>
> Let's name the cluster size "CS" and the image size in bytes "IS".
>
> For a refcount width of 16 bits per entry, we can describe CS / 2
> clusters per refblock, which means we need (IS / CS) / (CS / 2) =
> 2 * IS / (CS * CS) refblocks. Let's be generous and say that the L2
> tables are capable of describing the complete image size (which in
> practice they do not, because the guest disk size is smaller than the
> physical size). This assumption also has the neat side effect of not
> having to care about snapshots: if we have a lot of internal snapshots
> with copied L2 tables, IS grows, and therefore the next formula knows
> that the number of L2 tables grows as well. Nice. So, as every L2 table
> can describe CS / 8 (guest) clusters, we need 8 * IS / (CS * CS) L2
> tables.
>
> Therefore, ignoring the image header, the L1 tables, the reftable and
> the snapshot table, we have about the following number of metadata
> clusters per image (with 16 bit refcount entries):
>
> (2 + 8) * IS / (CS * CS) = 10 * IS / (CS * CS)
>
> We said that for every metadata cluster, we would need one entry in the
> list. Every list entry takes 32 bits (4 bytes).
> Thus, we need approximately up to
>
> 40 * IS / (CS * CS)
>
> bytes for the metadata list (please ignore units and just append "bytes"
> to the resulting number...).
>
> Let's test that for the above image, which has a disk size of 266 MB:
>
> 40 * 266M / (512 * 512) = 42 kB
>
> Great! It works.
>
>
> So, now let's set CS to 64 kB, because that is basically the only
> cluster size we really care about. For a 1 TB image, we need 10 kB for
> the list. Sounds great to me. For a 1 PB image, we will need 10 MB. Fair
> enough. (Note that you don't need 10 MB of RAM to create a 1 PB image;
> you only need that once the disk size of the image has reached 1 PB.)
>
> And 1 TB with 512 byte clusters? 160 MB. Urgh, that is a lot. But then
> again, you can switch off the overlap check with overlap-check=off; and
> trying to actually use a 1 TB image with 512 byte clusters is crazy in
> itself (have you tried just creating one without preallocation? It takes
> forever). So I can live with that.
>
>
> tl;dr
> =====
>
> * CPU usage at runtime decreased by a factor of 150 to 275 on
>   overlap-check-heavy tasks
> * No additional performance problems at loading time (in theory, loading
>   has the same runtime complexity as a single overlap check has right
>   now; in practice, I could not find any problems)
> * Decent RAM usage (10 kB for a 1 TB image with 64 kB clusters, 10 MB
>   for a 1 PB image, and so on)
>
>
> As of this version, this series depends on
> "[PATCH v6] qcow2: Buffer L1 table in snapshot refcount update".
>
>
> v2:
> - Patch 1:
>   - Fixed a mistake regarding the value of
>     Qcow2MetadataFragment::relative_start; this value should be relative
>     to the window start, not relative to the end of the last fragment
>     (destroy_window_bitmap() wrote the latter there in v1)
>   - Use uint64_t for the window index everywhere
> - Patch 8: Said "qcow2: Buffer L1 table in snapshot refcount update"
>   removes l1_allocated in qcow2_update_snapshot_refcount() and basically
>   replaces it with active_l1 (where active_l1 = !l1_allocated). In v1,
>   the condition it was used in was actually wrong; this is fixed here,
>   too (the active L2 cluster type should be removed from the list if we
>   are working on the active L1 table, not if we are working on an
>   inactive L1 table).
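>
> In case anyone wants to play with the different check modes while
> reviewing: the overlap-check setting used throughout this mail is a
> per-drive runtime option, selected for example like this (illustrative
> command line, not my exact test setup):
>
> $ qemu-system-x86_64 -m 1G \
>     -drive file=/tmp/test.qcow2,format=qcow2,overlap-check=cached
>
> The other modes mentioned above are selected the same way.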
>
>
> git-backport-diff against v2:
>
> Key:
> [----] : patches are identical
> [####] : number of functional differences between upstream/downstream patch
> [down] : patch is downstream-only
> The flags [FC] indicate (F)unctional and (C)ontextual differences, respectively
>
> 001/12:[0010] [FC] 'qcow2: Add new overlap check functions'
> 002/12:[----] [--] 'qcow2: Pull up overlap check option evaluation'
> 003/12:[----] [--] 'qcow2: Create metadata list'
> 004/12:[----] [--] 'qcow2/overlaps: Protect image header'
> 005/12:[----] [--] 'qcow2/overlaps: Protect refcount table'
> 006/12:[----] [--] 'qcow2/overlaps: Protect refcount blocks'
> 007/12:[----] [--] 'qcow2/overlaps: Protect active L1 table'
> 008/12:[0002] [FC] 'qcow2/overlaps: Protect active L2 tables'
> 009/12:[----] [--] 'qcow2/overlaps: Protect snapshot table'
> 010/12:[----] [--] 'qcow2/overlaps: Protect inactive L1 tables'
> 011/12:[----] [-C] 'qcow2/overlaps: Protect inactive L2 tables'
> 012/12:[----] [--] 'qcow2: Use new metadata overlap check function'
>
>
> Max Reitz (12):
>   qcow2: Add new overlap check functions
>   qcow2: Pull up overlap check option evaluation
>   qcow2: Create metadata list
>   qcow2/overlaps: Protect image header
>   qcow2/overlaps: Protect refcount table
>   qcow2/overlaps: Protect refcount blocks
>   qcow2/overlaps: Protect active L1 table
>   qcow2/overlaps: Protect active L2 tables
>   qcow2/overlaps: Protect snapshot table
>   qcow2/overlaps: Protect inactive L1 tables
>   qcow2/overlaps: Protect inactive L2 tables
>   qcow2: Use new metadata overlap check function
>
>  block/Makefile.objs    |   3 +-
>  block/qcow2-cluster.c  |  13 ++
>  block/qcow2-overlap.c  | 400 +++++++++++++++++++++++++++++++++++++++++++++++
>  block/qcow2-refcount.c | 202 ++++++++++---------------
>  block/qcow2-snapshot.c | 105 ++++++++++++-
>  block/qcow2.c          | 130 ++++++++++------
>  block/qcow2.h          |  13 ++
>  7 files changed, 694 insertions(+), 172 deletions(-)
>  create mode 100644 block/qcow2-overlap.c

Ping