From: Max Reitz <mreitz@redhat.com>
To: Stefan Hajnoczi <stefanha@gmail.com>, Zhang Haoyu <zhanghy@sangfor.com>
Cc: kvm <kvm@vger.kernel.org>, qemu-devel <qemu-devel@nongnu.org>,
Zhang Haoyu <ahzhanghaoyu@gmail.com>,
Christian Borntraeger <borntraeger@de.ibm.com>,
Amit Shah <amit.shah@redhat.com>,
Paolo Bonzini <pbonzini@redhat.com>
Subject: Re: [Qemu-devel] [question] virtio-blk performance degradation happened with virito-serial
Date: Sat, 13 Sep 2014 19:22:59 +0200
Message-ID: <54147D73.1080609@redhat.com>
In-Reply-To: <20140912123857.GA6207@stefanha-thinkpad.redhat.com>
On 12.09.2014 14:38, Stefan Hajnoczi wrote:
> Max: Unrelated to this performance issue but I notice that the qcow2
> metadata overlap check is high in the host CPU profile. Have you had
> any thoughts about optimizing the check?
>
> Stefan
In fact, I have given it some thought (albeit only briefly). Instead of
gathering all the information in the overlap check function itself, we could
either keep a generic list of typed ranges (e.g. "cluster 0: header",
"clusters 1 to 5: L1 table", etc.) or a not-really-bitmap with 4 bits per
entry specifying the cluster type (header, L1 table, free or data cluster,
etc.).
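To make the two representations concrete, here is a minimal sketch in C (not
actual QEMU code; all names are made up for illustration):

#include <stdint.h>

/* Variant 1: a list of typed ranges, e.g. "clusters 1 to 5: L1 table".
 * The list is kept sorted by first_cluster. */
typedef enum {
    MD_NONE = 0,        /* free or data cluster */
    MD_HEADER,
    MD_L1_TABLE,
    MD_L2_TABLE,
    MD_REFCOUNT_TABLE,
    MD_REFCOUNT_BLOCK,
    /* ... */
} MetadataType;

typedef struct {
    uint64_t first_cluster;   /* index of the first cluster in the range */
    uint32_t nb_clusters;     /* length of the range in clusters */
    MetadataType type;        /* kind of metadata occupying the range */
} MetadataRange;

/* Variant 2: the "not-really-bitmap", 4 bits per cluster, indexed by
 * cluster number; the 4-bit values are MetadataType codes. */
typedef struct {
    uint8_t *nibbles;         /* two clusters per byte */
    uint64_t nb_clusters;
} MetadataNibbleMap;

static inline unsigned nibble_map_get(const MetadataNibbleMap *m,
                                      uint64_t cluster)
{
    uint8_t b = m->nibbles[cluster / 2];
    return (cluster & 1) ? (b >> 4) : (b & 0x0f);
}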
The disadvantage of the former would be that in its simplest form we'd
have to run through the whole list to find out whether a cluster is
already reserved for metadata or not. We could easily optimize this by
keeping the list in order and then performing a binary search.
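As a sketch of that binary search (again illustrative only, building on the
MetadataRange list above, assumed sorted by first_cluster and
non-overlapping):

#include <stddef.h>

/* Returns the metadata type covering @cluster, or MD_NONE if the cluster
 * is not part of any recorded metadata range. */
static MetadataType range_list_lookup(const MetadataRange *ranges,
                                      size_t nb_ranges, uint64_t cluster)
{
    size_t lo = 0, hi = nb_ranges;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (cluster < ranges[mid].first_cluster) {
            hi = mid;
        } else if (cluster >= ranges[mid].first_cluster +
                              ranges[mid].nb_clusters) {
            lo = mid + 1;
        } else {
            return ranges[mid].type;
        }
    }
    return MD_NONE;
}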
The disadvantage of the latter would obviously be its memory footprint: for
a 1 TB image with 64 kB clusters, it would be 8 MB in size. That might be
considered acceptable, but I deem it too large. The advantage would be
constant access time, of course.
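(The arithmetic behind that figure: 1 TB / 64 kB = 2^24, i.e. about 16.8
million clusters; at 4 bits per cluster that is 2^23 bytes = 8 MB.)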
We could combine both approaches, that is, use the bitmap as a cache:
whenever a cluster is overlap-checked, the corresponding bitmap range (or
"bitmap window") is requested; if it is not available, it is generated from
the range list and then put into the cache.
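A sketch of that cache (illustrative only; WINDOW_CLUSTERS and the
direct-mapped cache policy are arbitrary choices here, and
range_list_lookup() is the binary search sketched above):

#include <stdbool.h>

#define WINDOW_CLUSTERS 4096                /* clusters per bitmap window */

typedef struct {
    uint64_t window_index;                  /* which window this covers */
    uint8_t nibbles[WINDOW_CLUSTERS / 2];   /* 4 bits per cluster */
    bool valid;
} BitmapWindow;

/* Rebuild one window from the sorted range list. */
static void window_generate(BitmapWindow *slot, uint64_t win,
                            const MetadataRange *ranges, size_t nb_ranges)
{
    uint64_t start = win * WINDOW_CLUSTERS;

    for (uint64_t off = 0; off < WINDOW_CLUSTERS; off += 2) {
        uint8_t lo = range_list_lookup(ranges, nb_ranges, start + off) & 0x0f;
        uint8_t hi = range_list_lookup(ranges, nb_ranges, start + off + 1) & 0x0f;
        slot->nibbles[off / 2] = (uint8_t)(hi << 4) | lo;
    }
    slot->window_index = win;
    slot->valid = true;
}

/* Overlap check for a single cluster, going through the window cache. */
static MetadataType overlap_check_cluster(BitmapWindow *cache, size_t cache_size,
                                          const MetadataRange *ranges,
                                          size_t nb_ranges, uint64_t cluster)
{
    uint64_t win = cluster / WINDOW_CLUSTERS;
    BitmapWindow *slot = &cache[win % cache_size];   /* direct-mapped */

    if (!slot->valid || slot->window_index != win) {
        window_generate(slot, win, ranges, nb_ranges);
    }

    uint64_t off = cluster % WINDOW_CLUSTERS;
    uint8_t b = slot->nibbles[off / 2];
    return (off & 1) ? (b >> 4) : (b & 0x0f);
}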
The remaining question is how large the range list would be in memory.
Basically, its size would be comparable to that of an RLE version of the
bitmap. In contrast to a raw RLE version, however, we'd have to add the
start cluster to each entry in order to be able to perform a binary search,
and we'd omit free and/or data clusters. So we'd have 4 bits for the cluster
type, let's say 12 bits for the cluster count, and of course 64 bits for the
first cluster index. Or, for maximum efficiency, we'd have 64 - 9 - 1 = 54
bits for the cluster index, 4 bits for the type and 6 bits for the cluster
count. The first variant gives us 10 bytes per metadata range, the second 8.

Considering that one refcount block covers cluster_size / 2 entries and one
L2 table covers cluster_size / 8 entries, we have (for images with a cluster
size of 64 kB) a ratio of about 1/32768 refcount blocks per cluster and
1/8192 L2 tables per cluster. I'd therefore guess a metadata ratio of about
1/6000. In the worst case, each metadata cluster requires its own range list
entry, which at 10 bytes per entry means less than 30 kB for the list of a
1 TB image with 64 kB clusters. I think that's acceptable.
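For what it's worth, the denser 8-byte encoding described above could look
roughly like this (again only a sketch; the field layout is one possible
choice):

/* 54 bits first cluster index | 4 bits type | 6 bits cluster count,
 * packed into a single uint64_t (the 10-byte variant is analogous, just
 * with a 12-bit count in a separate field). */
typedef uint64_t PackedRange;

static inline PackedRange packed_range(uint64_t first_cluster,
                                       unsigned type, unsigned count)
{
    return (first_cluster << 10) | ((uint64_t)(type & 0xf) << 6) |
           (count & 0x3f);
}

static inline uint64_t packed_first_cluster(PackedRange r) { return r >> 10; }
static inline unsigned packed_type(PackedRange r) { return (r >> 6) & 0xf; }
static inline unsigned packed_count(PackedRange r) { return r & 0x3f; }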
We could compress that list even more by making it a real RLE version of
the bitmap, removing the cluster index from each entry; remember that
for this mixed range list/bitmap approach we no longer need to be able
to perform exact binary search but only need to be able to quickly seek
to the beginning of a bitmap window. This can be achieved by forcing
breaks in the range list at every window border and keeping track of
those offsets along with the corresponding bitmap window index. When we
want to generate a bitmap window, we look up the start offset in the
range list (constant time), then generate the window (linear in the window
size) and can then perform constant-time lookups for each overlap check in that
window.
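In code, that compressed variant might look roughly like this (illustrative
only; it assumes runs are forced to break at window borders so that each
window's runs exactly tile it, and it reuses the BitmapWindow type and
WINDOW_CLUSTERS from the sketch above):

typedef struct {
    uint8_t  type;            /* 4 bits would suffice in a packed layout */
    uint16_t run_length;      /* clusters in this run, <= WINDOW_CLUSTERS */
} RleEntry;

typedef struct {
    RleEntry *entries;        /* RLE stream of the whole bitmap */
    size_t   *window_start;   /* window_start[w]: first entry of window w */
    uint64_t  nb_windows;
} RleMetadataMap;

/* Rebuild one bitmap window: constant-time seek via window_start[],
 * then a pass that is linear in the window size. */
static void rle_window_generate(const RleMetadataMap *map, uint64_t win,
                                BitmapWindow *slot)
{
    size_t e = map->window_start[win];
    uint64_t off = 0;

    while (off < WINDOW_CLUSTERS) {
        const RleEntry *r = &map->entries[e++];

        for (uint16_t i = 0; i < r->run_length; i++, off++) {
            uint8_t *b = &slot->nibbles[off / 2];
            if (off & 1) {
                *b = (uint8_t)((*b & 0x0f) | (r->type << 4));
            } else {
                *b = (uint8_t)((*b & 0xf0) | (r->type & 0x0f));
            }
        }
    }
    slot->window_index = win;
    slot->valid = true;
}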
I think that could greatly speed things up and also allow us to always
perform range checks on data structures not kept in memory (inactive L1
and L2 tables). The only question remaining for me is whether that caching
is actually feasible, or whether a binary search in the range list (which
would then have to include the cluster index for each entry) would be faster
than generating bitmap windows, which might suffer from ping-pong effects.
Max