From: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
To: Kevin Wolf <kwolf@redhat.com>, Eric Blake <eblake@redhat.com>
Cc: "Denis V. Lunev" <den@openvz.org>,
anton.nefedov@virtuozzo.com, qemu-devel@nongnu.org,
qemu-block@nongnu.org, mreitz@redhat.com
Subject: Re: [PATCH] qcow2: Reduce write_zeroes size in handle_alloc_space()
Date: Tue, 9 Jun 2020 18:29:22 +0300
Message-ID: <177d9401-e040-13bd-2e77-26bfeda4a3d6@virtuozzo.com>
In-Reply-To: <20200609151810.GD11003@linux.fritz.box>
09.06.2020 18:18, Kevin Wolf wrote:
> Am 09.06.2020 um 16:46 hat Eric Blake geschrieben:
>> On 6/9/20 9:28 AM, Vladimir Sementsov-Ogievskiy wrote:
>>> 09.06.2020 17:08, Kevin Wolf wrote:
>>>> Since commit c8bb23cbdbe, handle_alloc_space() is called for newly
>>>> allocated clusters to efficiently initialise the COW areas with zeros if
>>>> necessary. It skips the whole operation if both start_cow and end_cow
>>>> are empty. However, it requests zeroing the whole request size (possibly
>>>> multiple megabytes) even if only one end of the request actually needs
>>>> this.
>>>>
>>>> This patch reduces the write_zeroes request size in this case so that we
>>>> don't unnecessarily zero-initialise a region that we're going to
>>>> overwrite immediately.
>>>>
>>
>>>
>>> Hmm, I'm afraid that this may make things worse in some cases, as with
>>> one big write-zero request we preallocate the data region in the
>>> protocol file, so we have better locality for the clusters we are going
>>> to write. And, at the same time, with the BDRV_REQ_NO_FALLBACK flag
>>> write-zero must be fast anyway (especially in comparison with the
>>> following write request).
>>>
>>>>          /*
>>>>           * instead of writing zero COW buffers,
>>>>           * efficiently zero out the whole clusters
>>>>           */
>>>> -        ret = qcow2_pre_write_overlap_check(bs, 0, m->alloc_offset,
>>>> -                                            m->nb_clusters * s->cluster_size,
>>>> -                                            true);
>>>> +        ret = qcow2_pre_write_overlap_check(bs, 0, start, len, true);
>>>>          if (ret < 0) {
>>>>              return ret;
>>>>          }
>>>>          BLKDBG_EVENT(bs->file, BLKDBG_CLUSTER_ALLOC_SPACE);
>>>> -        ret = bdrv_co_pwrite_zeroes(s->data_file, m->alloc_offset,
>>>> -                                    m->nb_clusters * s->cluster_size,
>>>> +        ret = bdrv_co_pwrite_zeroes(s->data_file, start, len,
>>>>                                      BDRV_REQ_NO_FALLBACK);
>>
>> Good point. If we weren't using BDRV_REQ_NO_FALLBACK, then avoiding a
>> pre-zero pass over the middle is essential. But since we are insisting that
>> the pre-zero pass be fast or else immediately fail, the time spent in
>> pre-zeroing should not be a concern. Do you have benchmark numbers stating
>> otherwise?
>
> I stumbled across this behaviour (write_zeros for 2 MB, then overwrite
> almost everything) in the context of a different bug, and it just didn't
> make much sense to me. Is there really a file system where fragmentation
> is introduced by not zeroing the area first and then overwriting it?
>
> I'm not insisting on making this change because the behaviour is
> harmless if odd, but if we think that writing twice to some blocks is an
> optimisation, maybe we should actually measure and document this.
Not to the same blocks: first we do write-zeroes to the area aligned up to the cluster boundary, so it's more probable that the resulting clusters will be contiguous on the file system. With your patch the allocation may be split into two parts. (This is a bit too theoretical; I'd better prove it by example.)

Also, we (Virtuozzo) have to support a custom distributed fs where allocation itself is expensive, so an additional benefit of the first (larger) write-zero request is that we have one allocation request instead of two (with your patch) or three (if we decided to issue two write-zero operations).
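To make the comparison concrete, here is a minimal sketch of the two zeroing ranges under discussion. It is not the QEMU code: struct alloc_desc, struct zero_range and both functions are hypothetical simplifications, loosely modelled on the fields the patch touches, and the reduced variant only reflects what the commit message describes (shrinking the request when just one end has a COW area).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the allocation metadata. */
struct alloc_desc {
    uint64_t alloc_offset;  /* host offset of the first new cluster */
    uint64_t nb_clusters;   /* number of newly allocated clusters */
    uint64_t cow_start_len; /* bytes of head COW area, 0 if none */
    uint64_t cow_end_off;   /* tail COW offset, relative to alloc_offset */
    uint64_t cow_end_len;   /* bytes of tail COW area, 0 if none */
};

struct zero_range {
    uint64_t start;
    uint64_t len;           /* 0 means there is nothing to zero */
};

/* Current behaviour: one request covering all allocated clusters. */
struct zero_range zero_range_whole(const struct alloc_desc *m,
                                   uint64_t cluster_size)
{
    struct zero_range r = { m->alloc_offset, 0 };

    assert(cluster_size > 0);
    if (m->cow_start_len || m->cow_end_len) {
        r.len = m->nb_clusters * cluster_size;
    }
    return r;
}

/* Assumed reduced behaviour: cover only the one end that has a COW area. */
struct zero_range zero_range_reduced(const struct alloc_desc *m,
                                     uint64_t cluster_size)
{
    struct zero_range r = zero_range_whole(m, cluster_size);

    if (m->cow_start_len && !m->cow_end_len) {
        r.len = m->cow_start_len;
    } else if (!m->cow_start_len && m->cow_end_len) {
        r.start = m->alloc_offset + m->cow_end_off;
        r.len = m->cow_end_len;
    }
    /* With COW areas at both ends the whole range is kept. */
    return r;
}
```

For a 2 MiB allocation with 64 KiB clusters and only a 4 KiB tail COW area, the whole-cluster variant zeroes 2 MiB starting at alloc_offset, while the reduced variant zeroes just the last 4 KiB. The rest of the range is then first touched by the data write, which is where the second allocation request I mentioned would come from.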
>
>
> Anyway, let's talk about the reported bug that made me look at the
> strace that showed this behaviour because I feel it supports my last
> point. It's a bit messy, but anyway:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1666864
>
> So initially, bad performance on a fragmented image file was reported.
> Not much to do there, but then in comment 16, QA reported a performance
> regression in this case between 4.0 and 4.2. And this change was caused by
> c8bb23cbdbe, i.e. the commit that introduced handle_alloc_space().
>
> Turns out that BDRV_REQ_NO_FALLBACK doesn't always guarantee that it's
> _really_ fast. fallocate(FALLOC_FL_ZERO_RANGE) causes some kind of flush
> on XFS and buffered writes don't. So with the old code, qemu-img convert
> to a file on a very full filesystem that will cause fragmentation, was
> much faster with writing a zero buffer than with write_zeroes (because
> it didn't flush the result).
>
> I don't fully understand why this is and hope that XFS can do something
> about it. I also don't really think we should revert the change in QEMU,
> though I'm not completely sure. But I just wanted to share this to show
> that "obvious" characteristics of certain types of requests aren't
> always true and doing obscure optimisations based on what we think
> filesystems may do can actually achieve the opposite in some cases.
>
> Kevin
>
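Since the thread keeps coming back to what BDRV_REQ_NO_FALLBACK promises, here is a small sketch of the "fast or fail immediately" contract as I understand it from the discussion above. Everything in it is hypothetical (fast_zero_fn, the two mock backends, try_zero_alloc); it models the image as an in-memory buffer and is not how QEMU's block layer is implemented.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical backend hook: zero [off, off + len) cheaply, or refuse. */
typedef int (*fast_zero_fn)(unsigned char *img, size_t off, size_t len);

/* Mock backend without an efficient zeroing primitive. */
int fast_zero_unsupported(unsigned char *img, size_t off, size_t len)
{
    (void)img;
    (void)off;
    (void)len;
    return -ENOTSUP;
}

/* Mock backend with an efficient zeroing primitive. */
int fast_zero_supported(unsigned char *img, size_t off, size_t len)
{
    memset(img + off, 0, len);
    return 0;
}

/*
 * Returns 1 if the range was zeroed by the fast path, so the caller may
 * skip writing zero-filled COW buffers; 0 if the backend refused and the
 * caller has to write the COW buffers itself; a negative errno value on
 * any other error.
 */
int try_zero_alloc(fast_zero_fn fast_zero, unsigned char *img,
                   size_t off, size_t len)
{
    int ret = fast_zero(img, off, len);

    if (ret == -ENOTSUP) {
        return 0;   /* no slow fallback here, the caller decides */
    }
    if (ret < 0) {
        return ret;
    }
    return 1;
}
```

Note that this only captures the contract, not the actual cost: as the XFS case above shows, a backend may accept the request and still be slow.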
--
Best regards,
Vladimir