From: Eric Blake <eblake@redhat.com>
To: Kevin Wolf <kwolf@redhat.com>
Cc: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>,
qemu-block@nongnu.org, qemu-devel@nongnu.org, mreitz@redhat.com,
anton.nefedov@virtuozzo.com, "Denis V. Lunev" <den@openvz.org>
Subject: Re: [PATCH] qcow2: Reduce write_zeroes size in handle_alloc_space()
Date: Tue, 9 Jun 2020 11:19:49 -0500
Message-ID: <b2c59302-2c14-474b-3bb8-3b48806f2689@redhat.com>
In-Reply-To: <20200609151810.GD11003@linux.fritz.box>
On 6/9/20 10:18 AM, Kevin Wolf wrote:
>>>> - ret = bdrv_co_pwrite_zeroes(s->data_file, m->alloc_offset,
>>>> - m->nb_clusters * s->cluster_size,
>>>> + ret = bdrv_co_pwrite_zeroes(s->data_file, start, len,
>>>> BDRV_REQ_NO_FALLBACK);
>>
>> Good point. If we weren't using BDRV_REQ_NO_FALLBACK, then avoiding a
>> pre-zero pass over the middle is essential. But since we are insisting that
>> the pre-zero pass be fast or else immediately fail, the time spent in
>> pre-zeroing should not be a concern. Do you have benchmark numbers stating
>> otherwise?
>
> I stumbled across this behaviour (write_zeroes for 2 MB, then overwrite
> almost everything) in the context of a different bug, and it just didn't
> make much sense to me. Is there really a file system where fragmentation
> is introduced by not zeroing the area first and then overwriting it?
>
> I'm not insisting on making this change because the behaviour is
> harmless if odd, but if we think that writing twice to some blocks is an
> optimisation, maybe we should actually measure and document this.
>
>
> Anyway, let's talk about the reported bug that made me look at the
> strace that showed this behaviour because I feel it supports my last
> point. It's a bit messy, but anyway:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1666864
>
> So initially, bad performance on a fragmented image file was reported.
> Not much to do there, but then in comment 16, QA reported a performance
> regression in this case between 4.0 and 4.2. And this change was caused by
> c8bb23cbdbe, i.e. the commit that introduced handle_alloc_space().
>
> Turns out that BDRV_REQ_NO_FALLBACK doesn't always guarantee that it's
> _really_ fast. fallocate(FALLOC_FL_ZERO_RANGE) causes some kind of flush
> on XFS and buffered writes don't. So with the old code, qemu-img convert
> to a file on a very full filesystem that will cause fragmentation, was
> much faster with writing a zero buffer than with write_zeroes (because
> it didn't flush the result).
Wow. That makes it sound like we should NOT attempt
fallocate(FALLOC_FL_ZERO_RANGE) on the fast path, because we don't have
guarantees that it is fast.

I really wish the kernel would give us
fallocate(FALLOC_FL_ZERO_RANGE|FALLOC_FL_NO_FALLBACK) which would fail
fast rather than doing a flush or other slow fallback.
>
> I don't fully understand why this is and hope that XFS can do something
> about it. I also don't really think we should revert the change in QEMU,
> though I'm not completely sure. But I just wanted to share this to show
> that "obvious" characteristics of certain types of requests aren't
> always true and doing obscure optimisations based on what we think
> filesystems may do can actually achieve the opposite in some cases.
It also goes to show us that the kernel does NOT yet give us enough
fine-grained control over what we really want (which is: 'pre-zero this
if it is fast, but don't waste time if it is not'). Most of the kernel
interfaces end up being 'pre-zero this, and it might be fast, fail fast,
or even fall back to something safe but slow, and you can't tell the
difference short of trying'.
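
To make that concrete, here is a rough sketch in plain Linux C (not QEMU
code; try_prezero(), zero_range() and the PrezeroResult enum are names
made up purely for illustration) of everything a caller can observe from
fallocate(FALLOC_FL_ZERO_RANGE). Note that none of the outcomes says
"this worked, but it was slow":

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE          /* needed for fallocate() in glibc */
#endif
#include <errno.h>
#include <fcntl.h>
#include <linux/falloc.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical result type; not a QEMU or kernel name. */
typedef enum {
    PREZERO_DONE,        /* range reads as zeroes; may still have been slow */
    PREZERO_UNSUPPORTED, /* filesystem refused outright */
    PREZERO_FAILED       /* any other error, see errno */
} PrezeroResult;

/* Hypothetical helper: ask the kernel to zero [offset, offset + len). */
static PrezeroResult try_prezero(int fd, off_t offset, off_t len)
{
    int ret;

    do {
        ret = fallocate(fd, FALLOC_FL_ZERO_RANGE, offset, len);
    } while (ret != 0 && errno == EINTR);

    if (ret == 0) {
        /* Success tells us nothing about whether a flush happened. */
        return PREZERO_DONE;
    }
    if (errno == EOPNOTSUPP || errno == ENOSYS) {
        return PREZERO_UNSUPPORTED;
    }
    return PREZERO_FAILED;
}

/*
 * Hypothetical helper: zero the range, falling back to writing a zero
 * buffer when the filesystem refuses. Returns 0 on success, -1 on error.
 */
static int zero_range(int fd, off_t offset, size_t len)
{
    static const char zeroes[4096];

    switch (try_prezero(fd, offset, (off_t)len)) {
    case PREZERO_DONE:
        return 0;
    case PREZERO_FAILED:
        return -1;
    case PREZERO_UNSUPPORTED:
        break;
    }

    while (len > 0) {
        size_t chunk = len < sizeof(zeroes) ? len : sizeof(zeroes);
        ssize_t n = pwrite(fd, zeroes, chunk, offset);

        if (n < 0 && errno == EINTR) {
            continue;
        }
        if (n <= 0) {
            return -1;
        }
        offset += n;
        len -= (size_t)n;
    }
    return 0;
}
```

The only way to learn that the PREZERO_DONE path was slow is to time the
call after the fact, which is exactly the "short of trying" problem.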
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3226
Virtualization: qemu.org | libvirt.org