From: "Denis V. Lunev" <den@openvz.org>
To: Kevin Wolf <kwolf@redhat.com>
Cc: qemu-devel@nongnu.org, Max Reitz <mreitz@redhat.com>
Subject: Re: [Qemu-devel] [PATCH for 2.7 1/1] qcow2: improve qcow2_co_write_zeroes()
Date: Wed, 27 Apr 2016 10:07:35 +0300 [thread overview]
Message-ID: <57206537.4040706@openvz.org> (raw)
In-Reply-To: <20160426101941.GB4213@noname.str.redhat.com>
On 04/26/2016 01:19 PM, Kevin Wolf wrote:
[skipped]
> Did you ever talk to the kernel people?
>
> We can try to make the best out of suboptimal requests in qemu, but it
> looks to me as if the real fix was in the kernel, and if we don't get it
> fixed there, we'll see more and more of this kind of problems. I think
> this is relevant not only for VMs, but probably on real hardware as
> well.
it is difficult and could be done partially only :(
>> Locally with QEMU-IO the reproduction is simple. We can repeat above
>> requests or could simple do the following:
>> qemu-io -c "write 0xff 32k 1M" 1.img
> I assume you mean "-z" instead of "0xff"?
sure
>> The code without the patch will allocate 2 blocks for guest offsets
>> 0-64k and 1M-(1M+64k) and performs writes there. The code with
>> the patch will skip creation of blocks if possible.
> Okay, that's the second of the optimisations you mentioned in your
> commit message. I can see how this adds something that the generic block
> layer can't easily add, if it can be made safe (I think it can, even
> though your patch doesn't get it completely right yet, see below).
>
>> I have recorded parameters in qcow2_co_do_write_zeroes for the
>> reference (1 MB is written, memory is not aligned as sent in the
>> letter above, [sudo ./a.out /dev/nbd3]):
>>
>> qcow2_co_write_zeroes off=0 size=10000
>> qcow2_co_write_zeroes off=1fe00 size=200
>> qcow2_co_write_zeroes off=3fe00 size=200
>> qcow2_co_write_zeroes off=5fe00 size=200
>> qcow2_co_write_zeroes off=7fe00 size=200
>> qcow2_co_write_zeroes off=9fe00 size=200
>> qcow2_co_write_zeroes off=bfe00 size=200
>> qcow2_co_write_zeroes off=dfe00 size=200
>> qcow2_co_write_zeroes off=ffe00 size=200
>> qcow2_co_write_zeroes off=10000 size=fe00
>> qcow2_co_write_zeroes off=20000 size=10000
>> qcow2_co_write_zeroes off=30000 size=fe00
>> qcow2_co_write_zeroes off=60000 size=10000
>> qcow2_co_write_zeroes off=70000 size=fe00
>> qcow2_co_write_zeroes off=c0000 size=10000
>> qcow2_co_write_zeroes off=d0000 size=fe00
>> qcow2_co_write_zeroes off=e0000 size=10000
>> qcow2_co_write_zeroes off=f0000 size=fe00
>> qcow2_co_write_zeroes off=80000 size=10000
>> qcow2_co_write_zeroes off=90000 size=fe00
>> qcow2_co_write_zeroes off=a0000 size=10000
>> qcow2_co_write_zeroes off=b0000 size=fe00
>> qcow2_co_write_zeroes off=40000 size=10000
>> qcow2_co_write_zeroes off=50000 size=fe00
> I don't see any requests here where your code actually ends up splitting
> the request into head, aligned part and tail. Which is expected because
> bdrv_co_do_write_zeroes() already does that.
>
> What I can't see here is whether this actually happened (10000 + fe00
> could be a split request) or whether it already came in this way over
> NBD.
bdrv_aligned_pwritev off=0 size=1fe00
qcow2_co_write_zeroes off=0 size=10000
qcow2_co_write_zeroes off=10000 size=fe00
bdrv_aligned_pwritev off=1fe00 size=20000
qcow2_co_write_zeroes off=1fe00 size=200
qcow2_co_write_zeroes off=20000 size=10000
qcow2_co_write_zeroes off=30000 size=fe00
bdrv_aligned_pwritev off=3fe00 size=20000
qcow2_co_write_zeroes off=3fe00 size=200
qcow2_co_write_zeroes off=40000 size=10000
qcow2_co_write_zeroes off=50000 size=fe00
bdrv_aligned_pwritev off=5fe00 size=20000
qcow2_co_write_zeroes off=5fe00 size=200
qcow2_co_write_zeroes off=60000 size=10000
qcow2_co_write_zeroes off=70000 size=fe00
bdrv_aligned_pwritev off=7fe00 size=20000
qcow2_co_write_zeroes off=7fe00 size=200
qcow2_co_write_zeroes off=80000 size=10000
qcow2_co_write_zeroes off=90000 size=fe00
bdrv_aligned_pwritev off=9fe00 size=20000
qcow2_co_write_zeroes off=9fe00 size=200
qcow2_co_write_zeroes off=a0000 size=10000
qcow2_co_write_zeroes off=b0000 size=fe00
bdrv_aligned_pwritev off=bfe00 size=20000
qcow2_co_write_zeroes off=bfe00 size=200
qcow2_co_write_zeroes off=c0000 size=10000
qcow2_co_write_zeroes off=d0000 size=fe00
bdrv_aligned_pwritev off=dfe00 size=20000
qcow2_co_write_zeroes off=dfe00 size=200
qcow2_co_write_zeroes off=e0000 size=10000
qcow2_co_write_zeroes off=f0000 size=fe00
bdrv_aligned_pwritev off=ffe00 size=200
qcow2_co_write_zeroes off=ffe00 size=200
[skipped]
> Here is the trick that I think will save us:
>
> On a misaligned call, we call bdrv_get_block_status_above() for the
> whole cluster that we're in. We know that it's only a single cluster
> because bdrv_co_do_write_zeroes() splits things this way; only aligned
> requests can be longer than a sector (we can even assert this).
>
> If the result is that the cluster already reads as zero, instead of
> doing nothing and possibly breaking backing chain manipulations, we
> simply extend the write zeroes operation to the whole cluster and
> continue as normal with an aligned request. This way we end up with a
> zero cluster instead of an unallocated one, and that should be safe.
>
> If the result is that the cluster isn't completed zeroed, return
> -ENOTSUP as you did in the snippet above.
>
> That approach should probably result in an (even) simpler patch, too.
let's start from the simple thing. I'll send a patch in a couple of minutes.
It really looks MUCH better than the original one. Thank you for this cool
suggestion.
>
> Hm... Or actually, if we want something more complex that will help all
> block drivers, extending the range of the request could even be done in
> bdrv_co_do_write_zeroes(), I guess. I won't insist on it, though.
I do not think that this should be done now. This patch should go into
our own stable and thus it would better be as simple as possible.
Den
next prev parent reply other threads:[~2016-04-27 7:07 UTC|newest]
Thread overview: 11+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-04-23 12:05 [Qemu-devel] [PATCH for 2.7 1/1] qcow2: improve qcow2_co_write_zeroes() Denis V. Lunev
2016-04-25 9:05 ` Kevin Wolf
2016-04-25 10:20 ` Denis V. Lunev
2016-04-25 19:35 ` Eric Blake
2016-04-25 21:00 ` Denis V. Lunev
2016-04-26 8:23 ` Kevin Wolf
2016-04-26 9:35 ` Denis V. Lunev
2016-04-26 10:19 ` Kevin Wolf
2016-04-27 7:07 ` Denis V. Lunev [this message]
2016-04-27 8:12 ` Kevin Wolf
2016-04-27 8:32 ` Denis V. Lunev
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=57206537.4040706@openvz.org \
--to=den@openvz.org \
--cc=kwolf@redhat.com \
--cc=mreitz@redhat.com \
--cc=qemu-devel@nongnu.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).