From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:59221)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <den@virtuozzo.com>) id 1avJZZ-00083k-G2
	for qemu-devel@nongnu.org; Wed, 27 Apr 2016 03:07:46 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <den@virtuozzo.com>) id 1avJZW-0008AS-7k
	for qemu-devel@nongnu.org; Wed, 27 Apr 2016 03:07:45 -0400
Received: from mail-am1on0121.outbound.protection.outlook.com
	([157.56.112.121]:46880
	helo=emea01-am1-obe.outbound.protection.outlook.com)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <den@virtuozzo.com>) id 1avJZV-0008AG-Jx
	for qemu-devel@nongnu.org; Wed, 27 Apr 2016 03:07:42 -0400
References: <1461413127-2594-1-git-send-email-den@openvz.org>
	<20160425090553.GA5293@noname.str.redhat.com>
	<571DEF7D.1010304@openvz.org>
	<20160426082321.GA4213@noname.str.redhat.com>
	<571F3653.2040007@openvz.org>
	<20160426101941.GB4213@noname.str.redhat.com>
From: "Denis V. Lunev" <den@openvz.org>
Message-ID: <57206537.4040706@openvz.org>
Date: Wed, 27 Apr 2016 10:07:35 +0300
MIME-Version: 1.0
In-Reply-To: <20160426101941.GB4213@noname.str.redhat.com>
Content-Type: text/plain; charset="windows-1252"; format=flowed
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] [PATCH for 2.7 1/1] qcow2: improve
 qcow2_co_write_zeroes()
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel/>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Kevin Wolf <kwolf@redhat.com>
Cc: qemu-devel@nongnu.org, Max Reitz <mreitz@redhat.com>

On 04/26/2016 01:19 PM, Kevin Wolf wrote:

[skipped]
> Did you ever talk to the kernel people?
>
> We can try to make the best out of suboptimal requests in qemu, but it
> looks to me as if the real fix was in the kernel, and if we don't get it
> fixed there, we'll see more and more of this kind of problems. I think
> this is relevant not only for VMs, but probably on real hardware as
> well.
it is difficult and could be done partially only :(


>> Locally with QEMU-IO the reproduction is simple. We can repeat above
>> requests or could simple do the following:
>>      qemu-io -c "write 0xff 32k 1M" 1.img
> I assume you mean "-z" instead of "0xff"?
sure


>> The code without the patch will allocate 2 blocks for guest offsets
>> 0-64k and 1M-(1M+64k) and performs writes there. The code with
>> the patch will skip creation of blocks if possible.
> Okay, that's the second of the optimisations you mentioned in your
> commit message. I can see how this adds something that the generic block
> layer can't easily add, if it can be made safe (I think it can, even
> though your patch doesn't get it completely right yet, see below).
>
>> I have recorded parameters in qcow2_co_do_write_zeroes for the
>> reference (1 MB is written, memory is not aligned as sent in the
>> letter above, [sudo ./a.out /dev/nbd3]):
>>
>> qcow2_co_write_zeroes off=0 size=10000
>> qcow2_co_write_zeroes off=1fe00 size=200
>> qcow2_co_write_zeroes off=3fe00 size=200
>> qcow2_co_write_zeroes off=5fe00 size=200
>> qcow2_co_write_zeroes off=7fe00 size=200
>> qcow2_co_write_zeroes off=9fe00 size=200
>> qcow2_co_write_zeroes off=bfe00 size=200
>> qcow2_co_write_zeroes off=dfe00 size=200
>> qcow2_co_write_zeroes off=ffe00 size=200
>> qcow2_co_write_zeroes off=10000 size=fe00
>> qcow2_co_write_zeroes off=20000 size=10000
>> qcow2_co_write_zeroes off=30000 size=fe00
>> qcow2_co_write_zeroes off=60000 size=10000
>> qcow2_co_write_zeroes off=70000 size=fe00
>> qcow2_co_write_zeroes off=c0000 size=10000
>> qcow2_co_write_zeroes off=d0000 size=fe00
>> qcow2_co_write_zeroes off=e0000 size=10000
>> qcow2_co_write_zeroes off=f0000 size=fe00
>> qcow2_co_write_zeroes off=80000 size=10000
>> qcow2_co_write_zeroes off=90000 size=fe00
>> qcow2_co_write_zeroes off=a0000 size=10000
>> qcow2_co_write_zeroes off=b0000 size=fe00
>> qcow2_co_write_zeroes off=40000 size=10000
>> qcow2_co_write_zeroes off=50000 size=fe00
> I don't see any requests here where your code actually ends up splitting
> the request into head, aligned part and tail. Which is expected because
> bdrv_co_do_write_zeroes() already does that.
>
> What I can't see here is whether this actually happened (10000 + fe00
> could be a split request) or whether it already came in this way over
> NBD.

bdrv_aligned_pwritev off=0 size=1fe00
qcow2_co_write_zeroes off=0 size=10000
qcow2_co_write_zeroes off=10000 size=fe00
bdrv_aligned_pwritev off=1fe00 size=20000
qcow2_co_write_zeroes off=1fe00 size=200
qcow2_co_write_zeroes off=20000 size=10000
qcow2_co_write_zeroes off=30000 size=fe00
bdrv_aligned_pwritev off=3fe00 size=20000
qcow2_co_write_zeroes off=3fe00 size=200
qcow2_co_write_zeroes off=40000 size=10000
qcow2_co_write_zeroes off=50000 size=fe00
bdrv_aligned_pwritev off=5fe00 size=20000
qcow2_co_write_zeroes off=5fe00 size=200
qcow2_co_write_zeroes off=60000 size=10000
qcow2_co_write_zeroes off=70000 size=fe00
bdrv_aligned_pwritev off=7fe00 size=20000
qcow2_co_write_zeroes off=7fe00 size=200
qcow2_co_write_zeroes off=80000 size=10000
qcow2_co_write_zeroes off=90000 size=fe00
bdrv_aligned_pwritev off=9fe00 size=20000
qcow2_co_write_zeroes off=9fe00 size=200
qcow2_co_write_zeroes off=a0000 size=10000
qcow2_co_write_zeroes off=b0000 size=fe00
bdrv_aligned_pwritev off=bfe00 size=20000
qcow2_co_write_zeroes off=bfe00 size=200
qcow2_co_write_zeroes off=c0000 size=10000
qcow2_co_write_zeroes off=d0000 size=fe00
bdrv_aligned_pwritev off=dfe00 size=20000
qcow2_co_write_zeroes off=dfe00 size=200
qcow2_co_write_zeroes off=e0000 size=10000
qcow2_co_write_zeroes off=f0000 size=fe00
bdrv_aligned_pwritev off=ffe00 size=200
qcow2_co_write_zeroes off=ffe00 size=200


[skipped]

> Here is the trick that I think will save us:
>
> On a misaligned call, we call bdrv_get_block_status_above() for the
> whole cluster that we're in. We know that it's only a single cluster
> because bdrv_co_do_write_zeroes() splits things this way; only aligned
> requests can be longer than a sector (we can even assert this).
>
> If the result is that the cluster already reads as zero, instead of
> doing nothing and possibly breaking backing chain manipulations, we
> simply extend the write zeroes operation to the whole cluster and
> continue as normal with an aligned request. This way we end up with a
> zero cluster instead of an unallocated one, and that should be safe.
>
> If the result is that the cluster isn't completed zeroed, return
> -ENOTSUP as you did in the snippet above.
>
> That approach should probably result in an (even) simpler patch, too.

let's start from the simple thing. I'll send a patch in a couple of minutes.
It really looks MUCH better than the original one. Thank you for this cool
suggestion.

>
> Hm... Or actually, if we want something more complex that will help all
> block drivers, extending the range of the request could even be done in
> bdrv_co_do_write_zeroes(), I guess. I won't insist on it, though.
I do not think that this should be done now. This patch should go into
our own stable and thus it would better be as simple as possible.

Den