From: Eric Blake <eblake@redhat.com>
To: Anton Nefedov <anton.nefedov@virtuozzo.com>, qemu-devel@nongnu.org
Cc: kwolf@redhat.com, "Denis V. Lunev" <den@openvz.org>,
den@virtuozzo.com, mreitz@redhat.com
Subject: Re: [Qemu-devel] [PATCH v1 01/13] qcow2: alloc space for COW in one chunk
Date: Mon, 22 May 2017 14:00:53 -0500
Message-ID: <f628e214-2faa-0cf2-c5d2-39b0b6505fc2@redhat.com>
In-Reply-To: <1495186480-114192-2-git-send-email-anton.nefedov@virtuozzo.com>
On 05/19/2017 04:34 AM, Anton Nefedov wrote:
> From: "Denis V. Lunev" <den@openvz.org>
>
> Currently each single write operation can result in 3 write operations
> if guest offsets are not cluster aligned. One write is performed for the
> real payload and two for COW-ed areas. Thus the data possibly lays
> non-contiguously on the host filesystem. This will reduce further
> sequential read performance significantly.
>
> The patch allocates the space in the file with cluster granularity,
> ensuring
> 1. better host offset locality
> 2. less space allocation operations
> (which can be expensive on distributed storages)
s/storages/storage/
>
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
> ---
> block/qcow2.c | 32 +++++++++++++++++++++++++++++++-
> 1 file changed, 31 insertions(+), 1 deletion(-)
>
> diff --git a/block/qcow2.c b/block/qcow2.c
> index a8d61f0..2e6a0ec 100644
> --- a/block/qcow2.c
> +++ b/block/qcow2.c
> @@ -1575,6 +1575,32 @@ fail:
> return ret;
> }
>
> +static void handle_alloc_space(BlockDriverState *bs, QCowL2Meta *l2meta)
> +{
> + BDRVQcow2State *s = bs->opaque;
> + BlockDriverState *file = bs->file->bs;
> + QCowL2Meta *m;
> + int ret;
> +
> + for (m = l2meta; m != NULL; m = m->next) {
> + uint64_t bytes = m->nb_clusters << s->cluster_bits;
> +
> + if (m->cow_start.nb_bytes == 0 && m->cow_end.nb_bytes == 0) {
> + continue;
> + }
> +
> + /* try to alloc host space in one chunk for better locality */
> + ret = file->drv->bdrv_co_pwrite_zeroes(file, m->alloc_offset, bytes, 0);
Are we guaranteed that this is a fast operation? (That is, it either
results in a hole or an error, and doesn't waste time tediously writing
actual zeroes)
> +
> + if (ret != 0) {
> + continue;
> + }
Suppose we are using a file system that doesn't support holes: then
ret will be nonzero, and we end up not allocating anything after all.
Is it a problem that we just blindly continue the loop in reaction to
the error?

/reads further

I guess not - you aren't reacting to any error, but merely relying on
a side effect: when the call works, the allocation has already happened,
which helps speed; when it fails, you ignore the failure and get the old
behavior of the write() itself causing the allocation.
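
For illustration only, the best-effort pattern described above can be
sketched outside of QEMU with plain POSIX calls. This is a minimal
sketch, not the patch's code: try_prealloc() and write_with_hint() are
hypothetical names, and posix_fallocate() stands in for the driver's
write-zeroes callback purely as an assumption for the example.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* expose pwrite()/posix_fallocate() on glibc */
#endif
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not QEMU code): try to reserve
 * [offset, offset + len) in a single request.
 * posix_fallocate() returns 0 on success or an errno value on failure.
 */
static int try_prealloc(int fd, off_t offset, off_t len)
{
    assert(len > 0);    /* a zero-length request is a caller bug */
    return posix_fallocate(fd, offset, len);
}

/*
 * Write a payload into a cluster. The preallocation result is
 * deliberately ignored: if it failed, the pwrite() below triggers the
 * allocation itself, which is simply the old behavior.
 */
static ssize_t write_with_hint(int fd, off_t cluster_start, off_t cluster_len,
                               const void *buf, size_t count, off_t offset)
{
    (void)try_prealloc(fd, cluster_start, cluster_len);
    return pwrite(fd, buf, count, offset);
}
```

The point of the sketch is only that the hint's return value never
changes the outcome of the write; whether the hint is cheap depends on
the file system, which is the question raised earlier.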
> +
> + file->total_sectors = MAX(file->total_sectors,
> + (m->alloc_offset + bytes) / BDRV_SECTOR_SIZE);
> + }
> +}
> +
> static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset,
> uint64_t bytes, QEMUIOVector *qiov,
> int flags)
> @@ -1656,8 +1682,12 @@ static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset,
> if (ret < 0) {
> goto fail;
> }
> -
> qemu_co_mutex_unlock(&s->lock);
> +
> + if (bs->file->bs->drv->bdrv_co_pwrite_zeroes != NULL) {
> + handle_alloc_space(bs, l2meta);
> + }
Is it really a good idea to be modifying the underlying protocol image
outside of the mutex?

At any rate, it looks like your patch is doing a best-effort write
zeroes as an attempt to trigger consecutive allocation of the entire
cluster in the underlying protocol right after a cluster has been
allocated at the qcow2 format layer. That means there are more
syscalls now than there were previously, but now when we do three
write() calls at offsets B, A, C, those three calls are into file space
that was allocated earlier by the write zeroes, rather than fresh calls
into unallocated space that is likely to trigger up to three disjoint
allocations.
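
To make the three regions concrete, here is a small arithmetic sketch of
how an unaligned guest write splits a cluster range into a head COW
area, the payload, and a tail COW area. The struct and function names
are hypothetical and are not QEMU's QCowL2Meta; the sketch also assumes
the cluster size is a power of two.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical type for illustration; not QEMU's QCowL2Meta. */
struct cow_split {
    uint64_t head_offset, head_bytes;   /* COW area before the payload */
    uint64_t tail_offset, tail_bytes;   /* COW area after the payload */
};

/*
 * Split a write of `bytes` at `offset` against cluster boundaries.
 * The head runs from the start of the first touched cluster up to the
 * payload; the tail runs from the end of the payload up to the end of
 * the last touched cluster.
 */
static struct cow_split split_unaligned_write(uint64_t offset, uint64_t bytes,
                                              uint64_t cluster_size)
{
    /* cluster_size must be a nonzero power of two for the masks below */
    assert(cluster_size && (cluster_size & (cluster_size - 1)) == 0);

    uint64_t start = offset & ~(cluster_size - 1);
    uint64_t end = (offset + bytes + cluster_size - 1) & ~(cluster_size - 1);

    struct cow_split s = {
        .head_offset = start,
        .head_bytes = offset - start,
        .tail_offset = offset + bytes,
        .tail_bytes = end - (offset + bytes),
    };
    return s;
}
```

With 64 KiB clusters, a 4096-byte write at offset 70000 gives a head of
4464 bytes starting at 65536 and a tail of 56976 bytes starting at
74096. A cluster-aligned, cluster-sized write gives an empty head and
tail, which matches the case the patch skips.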

As a discussion point, wouldn't we achieve the same reduction in
fragmentation if we instead collect our data into a bounce buffer, and
only then do a single write() (or more likely, a writev() where the iov
is set up to reconstruct a single buffer on the syscall, but where the
source data is still at different offsets)? We'd be avoiding the extra
syscalls of pre-allocating the cluster, and while our write() call is
still causing allocations, at least it is now one cluster-aligned
write() rather than three sub-cluster out-of-order write()s.
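
As a rough illustration of the writev() idea (not a proposal for the
actual qcow2 code path), the three pieces can be handed to the kernel as
one gathered write. write_cluster_gathered() is a hypothetical helper,
and the sketch assumes plain file descriptors and pwritev() as provided
by glibc on Linux rather than QEMU's block layer and QEMUIOVector.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* expose pwritev() on glibc */
#endif
#include <assert.h>
#include <fcntl.h>          /* open() flags, for callers of the helper */
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper (not QEMU code): write head COW data, guest
 * payload, and tail COW data as one contiguous write starting at
 * cluster_start. The three source buffers may live anywhere in memory;
 * the iovec stitches them together for a single syscall.
 */
static ssize_t write_cluster_gathered(int fd, off_t cluster_start,
                                      void *head, size_t head_len,
                                      void *payload, size_t payload_len,
                                      void *tail, size_t tail_len)
{
    assert(payload != NULL && payload_len > 0);

    struct iovec iov[3] = {
        { .iov_base = head,    .iov_len = head_len },
        { .iov_base = payload, .iov_len = payload_len },
        { .iov_base = tail,    .iov_len = tail_len },
    };
    return pwritev(fd, iov, 3, cluster_start);
}
```

A caller would still need to handle short writes and errors; the sketch
only shows that no bounce-buffer copy is required to issue one
cluster-aligned write.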
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org