qemu-devel.nongnu.org archive mirror
From: Eric Blake <eblake@redhat.com>
To: Kevin Wolf <kwolf@redhat.com>
Cc: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>,
	qemu-block@nongnu.org, qemu-devel@nongnu.org, mreitz@redhat.com,
	anton.nefedov@virtuozzo.com, "Denis V. Lunev" <den@openvz.org>
Subject: Re: [PATCH] qcow2: Reduce write_zeroes size in handle_alloc_space()
Date: Tue, 9 Jun 2020 11:19:49 -0500
Message-ID: <b2c59302-2c14-474b-3bb8-3b48806f2689@redhat.com>
In-Reply-To: <20200609151810.GD11003@linux.fritz.box>

On 6/9/20 10:18 AM, Kevin Wolf wrote:

>>>> -        ret = bdrv_co_pwrite_zeroes(s->data_file, m->alloc_offset,
>>>> -                                    m->nb_clusters * s->cluster_size,
>>>> +        ret = bdrv_co_pwrite_zeroes(s->data_file, start, len,
>>>>                                        BDRV_REQ_NO_FALLBACK);
>>
>> Good point.  If we weren't using BDRV_REQ_NO_FALLBACK, then avoiding a
>> pre-zero pass over the middle is essential.  But since we are insisting that
>> the pre-zero pass be fast or else immediately fail, the time spent in
>> pre-zeroing should not be a concern.  Do you have benchmark numbers stating
>> otherwise?
> 
> I stumbled across this behaviour (write_zeroes for 2 MB, then overwrite
> almost everything) in the context of a different bug, and it just didn't
> make much sense to me. Is there really a file system where fragmentation
> is introduced by not zeroing the area first and then overwriting it?
> 
> I'm not insisting on making this change because the behaviour is
> harmless if odd, but if we think that writing twice to some blocks is an
> optimisation, maybe we should actually measure and document this.
> 
> 
> Anyway, let's talk about the reported bug that made me look at the
> strace that showed this behaviour because I feel it supports my last
> point. It's a bit messy, but anyway:
> 
>      https://bugzilla.redhat.com/show_bug.cgi?id=1666864
> 
> So initially, bad performance on a fragmented image file was reported.
> Not much to do there, but then in comment 16, QA reported a performance
> regression in this case between 4.0 and 4.2. And this regression was caused by
> c8bb23cbdbe, i.e. the commit that introduced handle_alloc_space().
> 
> It turns out that BDRV_REQ_NO_FALLBACK doesn't always guarantee that the
> request is _really_ fast: fallocate(FALLOC_FL_ZERO_RANGE) causes some kind
> of flush on XFS, while buffered writes don't. So with the old code,
> qemu-img convert to a file on a very full filesystem (which causes
> fragmentation) was much faster writing a zero buffer than calling
> write_zeroes, because the buffered write didn't flush the result.

Wow. That makes it sound like we should NOT attempt 
fallocate(FALLOC_FL_ZERO_RANGE) on the fast path, because we don't have 
guarantees that it is fast.
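Whether it is fast on a given filesystem can only be found out by timing it. The following is a minimal sketch, not QEMU code: it assumes Linux and a C library that declares fallocate() under _GNU_SOURCE, and the helpers now_ns(), time_zero_range() and time_zero_write() are hypothetical names made up for this example. The numbers only mean something on the filesystem in question (e.g. the nearly full XFS from the bug report), and, as Kevin describes, part of the fallocate() cost may be a flush that the buffered write never pays.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

#ifndef FALLOC_FL_ZERO_RANGE
#define FALLOC_FL_ZERO_RANGE 0x10 /* value from <linux/falloc.h> */
#endif

/* Monotonic clock in nanoseconds (hypothetical helper). */
static int64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000 + ts.tv_nsec;
}

/* Time one fallocate(FALLOC_FL_ZERO_RANGE) call.
 * Returns elapsed ns, or -1 if the call failed (errno is left set). */
static int64_t time_zero_range(int fd, off_t offset, off_t len)
{
    int64_t start = now_ns();
    if (fallocate(fd, FALLOC_FL_ZERO_RANGE, offset, len) < 0) {
        return -1;
    }
    return now_ns() - start;
}

/* Time a buffered pwrite() of len zero bytes. No fsync() is issued,
 * mirroring a plain buffered write. Returns elapsed ns, or -1 on error. */
static int64_t time_zero_write(int fd, off_t offset, size_t len)
{
    assert(len > 0);
    char *buf = calloc(1, len);
    if (!buf) {
        return -1;
    }
    int64_t start = now_ns();
    size_t done = 0;
    while (done < len) {
        ssize_t n = pwrite(fd, buf + done, len - done, offset + (off_t)done);
        if (n < 0 && errno == EINTR) {
            continue;
        }
        if (n <= 0) {
            free(buf);
            return -1;
        }
        done += (size_t)n;
    }
    int64_t elapsed = now_ns() - start;
    free(buf);
    return elapsed;
}
```

Running both helpers over the same offset and length on the target filesystem, and comparing the results, is one way to get the kind of measurement Kevin asks for before relying on pre-zeroing as an optimisation.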

I really wish the kernel would give us 
fallocate(FALLOC_FL_ZERO_RANGE|FALLOC_FL_NO_FALLBACK) which would fail 
fast rather than doing a flush or other slow fallback.

> 
> I don't fully understand why this is and hope that XFS can do something
> about it. I also don't really think we should revert the change in QEMU,
> though I'm not completely sure. But I just wanted to share this to show
> that "obvious" characteristics of certain types of requests aren't
> always true and doing obscure optimisations based on what we think
> filesystems may do can actually achieve the opposite in some cases.

It also goes to show that the kernel does NOT yet give us enough
fine-grained control over what we really want (which is: 'pre-zero this
if it is fast, but don't waste time if it is not').  Most of the kernel
interfaces end up being 'pre-zero this, and it might be fast, fail fast, 
or even fall back to something safe but slow, and you can't tell the 
difference short of trying'.
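Those outcomes can be sketched in C. This is an illustration only, assuming Linux; zero_range() and enum zero_outcome are hypothetical names, not QEMU or kernel APIs. The caller learns only whether fallocate() succeeded, never whether it was fast: a successful FALLOC_FL_ZERO_RANGE may still have triggered the kind of flush seen on XFS in the bug report.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef FALLOC_FL_ZERO_RANGE
#define FALLOC_FL_ZERO_RANGE 0x10 /* value from <linux/falloc.h> */
#endif

/* Hypothetical names, for illustration only. */
enum zero_outcome {
    ZERO_VIA_FALLOCATE, /* kernel zeroed the range (not necessarily fast) */
    ZERO_VIA_WRITE,     /* kernel refused; explicit zeroes were written */
    ZERO_FAILED         /* some other error; errno is left set */
};

static enum zero_outcome zero_range(int fd, off_t offset, off_t len)
{
    assert(offset >= 0 && len >= 0);

    /* Attempt the "might be fast" path first. */
    if (fallocate(fd, FALLOC_FL_ZERO_RANGE, offset, len) == 0) {
        return ZERO_VIA_FALLOCATE;
    }
    /* Only an "unsupported" error is a fast failure worth falling back from. */
    if (errno != EOPNOTSUPP && errno != ENOSYS) {
        return ZERO_FAILED;
    }

    /* Safe but slow fallback: write explicit zeroes in 4 KiB chunks. */
    static const char zeroes[4096];
    while (len > 0) {
        size_t chunk = len < (off_t)sizeof(zeroes) ? (size_t)len : sizeof(zeroes);
        ssize_t n = pwrite(fd, zeroes, chunk, offset);
        if (n < 0 && errno == EINTR) {
            continue;
        }
        if (n <= 0) {
            return ZERO_FAILED;
        }
        offset += n;
        len -= n;
    }
    return ZERO_VIA_WRITE;
}
```

A caller that wanted true "fail fast" semantics would return the error instead of taking the manual fallback, which is roughly what BDRV_REQ_NO_FALLBACK requests at the QEMU block layer; but even then, a successful fallocate() says nothing about how long it took.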

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

Thread overview: 10+ messages
2020-06-09 14:08 [PATCH] qcow2: Reduce write_zeroes size in handle_alloc_space() Kevin Wolf
2020-06-09 14:28 ` Vladimir Sementsov-Ogievskiy
2020-06-09 14:46   ` Eric Blake
2020-06-09 15:18     ` Kevin Wolf
2020-06-09 15:29       ` Vladimir Sementsov-Ogievskiy
2020-06-10  8:38         ` Vladimir Sementsov-Ogievskiy
2020-06-09 16:19       ` Eric Blake [this message]
2020-06-10  6:50         ` Vladimir Sementsov-Ogievskiy
2020-06-10 11:25           ` Kevin Wolf
2020-06-09 14:43 ` Eric Blake