From: "Denis V. Lunev"
To: Kevin Wolf
Cc: qemu-devel@nongnu.org, Max Reitz
Subject: Re: [Qemu-devel] [PATCH for 2.7 1/1] qcow2: improve qcow2_co_write_zeroes()
Date: Tue, 26 Apr 2016 12:35:15 +0300
Message-ID: <571F3653.2040007@openvz.org>
In-Reply-To: <20160426082321.GA4213@noname.str.redhat.com>
References: <1461413127-2594-1-git-send-email-den@openvz.org> <20160425090553.GA5293@noname.str.redhat.com> <571DEF7D.1010304@openvz.org> <20160426082321.GA4213@noname.str.redhat.com>

On 04/26/2016 11:23 AM, Kevin Wolf wrote:
> On 25.04.2016 at 12:20, Denis V. Lunev wrote:
>> On 04/25/2016 12:05 PM, Kevin Wolf wrote:
>>> On 23.04.2016 at 14:05, Denis V. Lunev wrote:
>>>> Unfortunately the Linux kernel can send non-aligned requests to qemu-nbd
>>>> if the caller is using O_DIRECT and does not align its in-memory data to
>>>> the page size. Thus qemu-nbd will call the block layer with non-aligned
>>>> requests.
>>>>
>>>> qcow2_co_write_zeroes forcibly asks the caller to supply block-aligned
>>>> data. Otherwise it rejects the request with ENOTSUP, which is properly
>>>> handled on the upper level. The problem is that this grows the image.
>>>>
>>>> This could be optimized a bit:
>>>> - a particular request could be split into a block-aligned part and a
>>>>   head/tail, which could be handled separately
>>> In fact, this is what bdrv_co_do_write_zeroes() is already supposed to
>>> do. qcow2 exposes its cluster size as bs->bl.write_zeroes_alignment, so
>>> block/io.c should split the request in three.
>>>
>>> If you see something different happening, we may have a bug there.
>>>
>> Pls look at the commit
>>
>> commit 459b4e66129d091a11e9886ecc15a8bf9f7f3d92
>> Author: Denis V. Lunev
>> Date:   Tue May 12 17:30:56 2015 +0300
>>
>>     block: align bounce buffers to page
>>
>> The situation is exactly like the one described there. The user of
>> /dev/nbd0 writes with O_DIRECT and has buffers that are not aligned
>> to the page size. Thus the real operations at the qemu-nbd layer
>> become unaligned to the block size.
> I don't understand the connection to this patch. Unaligned buffers on
> the NBD client shouldn't even be visible in the server, unless they
> already result in the client requesting different things. If so, what is
> the difference in the NBD requests? And can we reproduce the same
> locally with qemu-io and no NBD involved?

The NBD device is mapped by the kernel as /dev/nbd0. The descriptor for
/dev/nbd0 is opened with O_DIRECT by the program. The program performs a
write of 1 MB at offset 0 of the device.
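Schematically, the test program looks something like this (a sketch only,
not the real reproducer; the zero-filled payload, the command-line device
path and the exact alignment constant are assumptions based on the
description in this thread):

/* Hypothetical reproducer sketch: open the NBD device with O_DIRECT and
 * write 1 MB of zeroes at offset 0 from a buffer that is aligned to
 * 512 bytes only, i.e. not to the 4096-byte page size. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const size_t len = 1024 * 1024;     /* 1 MB written at offset 0 */
    void *buf;
    int fd;

    if (argc < 2) {
        fprintf(stderr, "usage: %s /dev/nbdX\n", argv[0]);
        return 1;
    }

    /* Case (1) below: 512-byte alignment only.  Using 4096 here instead
     * gives the well-behaved case (2). */
    if (posix_memalign(&buf, 512, len) != 0) {
        return 1;
    }
    memset(buf, 0, len);

    fd = open(argv[1], O_WRONLY | O_DIRECT);
    if (fd < 0 || pwrite(fd, buf, len, 0) != (ssize_t)len) {
        perror("write");
        return 1;
    }
    close(fd);
    free(buf);
    return 0;
}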
There are two cases:
(1) the program's buffer is aligned to 512 bytes
(2) the program's buffer is aligned to 4096 bytes

The kernel splits these writes before passing them to the elevator
DIFFERENTLY for the two cases.

In case (2) the request is split into 256 KB chunks, so we get 4 requests
of 256 KB each at offsets 0, 256 KB, 512 KB and 768 KB. In this case the
NBD and QCOW2 drivers behave fine.

In case (1) the kernel splits the packets in a very lame way. For each
256 KB chunk several requests are actually generated, like this:

9,0 11 1 0.000000000 11151 Q WS 312737792 + 1023 [qemu-img]
9,0 11 2 0.000007938 11151 Q WS 312738815 + 8 [qemu-img]
9,0 11 3 0.000030735 11151 Q WS 312738823 + 1016 [qemu-img]
9,0 11 4 0.000032482 11151 Q WS 312739839 + 8 [qemu-img]
9,0 11 5 0.000041379 11151 Q WS 312739847 + 1016 [qemu-img]
9,0 11 6 0.000042818 11151 Q WS 312740863 + 8 [qemu-img]
9,0 11 7 0.000051236 11151 Q WS 312740871 + 1017 [qemu-img]
9,0  5 1 0.169071519 11151 Q WS 312741888 + 1023 [qemu-img]

These requests are passed from the kernel VFS to the kernel NBD client.
Thus we get requests like this in the NBD client, and subsequently in
QEMU (offsets in bytes):

  0..261632 (256k - 512)
  261632..261632
  etc

This is how request splitting works in the VFS :( and this is the
problem, which cannot be fixed easily.

Locally, with qemu-io, the reproduction is simple. We can repeat the
requests above or simply do the following:

  qemu-io -c "write 0xff 32k 1M" 1.img

The code without the patch allocates 2 blocks for guest offsets 0-64k and
1M-(1M+64k) and performs writes there. The code with the patch skips the
creation of blocks where possible.

I have recorded the parameters passed to qcow2_co_write_zeroes for
reference (1 MB is written, the memory is not aligned, as described in
the letter above, [sudo ./a.out /dev/nbd3]):

qcow2_co_write_zeroes off=0 size=10000
qcow2_co_write_zeroes off=1fe00 size=200
qcow2_co_write_zeroes off=3fe00 size=200
qcow2_co_write_zeroes off=5fe00 size=200
qcow2_co_write_zeroes off=7fe00 size=200
qcow2_co_write_zeroes off=9fe00 size=200
qcow2_co_write_zeroes off=bfe00 size=200
qcow2_co_write_zeroes off=dfe00 size=200
qcow2_co_write_zeroes off=ffe00 size=200
qcow2_co_write_zeroes off=10000 size=fe00
qcow2_co_write_zeroes off=20000 size=10000
qcow2_co_write_zeroes off=30000 size=fe00
qcow2_co_write_zeroes off=60000 size=10000
qcow2_co_write_zeroes off=70000 size=fe00
qcow2_co_write_zeroes off=c0000 size=10000
qcow2_co_write_zeroes off=d0000 size=fe00
qcow2_co_write_zeroes off=e0000 size=10000
qcow2_co_write_zeroes off=f0000 size=fe00
qcow2_co_write_zeroes off=80000 size=10000
qcow2_co_write_zeroes off=90000 size=fe00
qcow2_co_write_zeroes off=a0000 size=10000
qcow2_co_write_zeroes off=b0000 size=fe00
qcow2_co_write_zeroes off=40000 size=10000
qcow2_co_write_zeroes off=50000 size=fe00
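The pattern above is just the "split the request in three" from block/io.c
applied to each of the short requests coming from the NBD client. Roughly
like this (a sketch that paraphrases the alignment handling in
bdrv_co_do_write_zeroes(); it is an illustration, not the exact QEMU code,
and the two example requests and the 64 KB cluster size are inferred from
the trace above):

/* Illustration of how block/io.c chops one zero-write request around the
 * driver callback, using bs->bl.write_zeroes_alignment (one qcow2
 * cluster, 0x10000 bytes = 128 sectors here).  Sector-based, as in
 * QEMU 2.6. */
#include <stdint.h>
#include <stdio.h>

static void split_write_zeroes(int64_t sector_num, int nb_sectors, int align)
{
    while (nb_sectors > 0) {
        int num = nb_sectors;

        if (align && num > align) {
            if (sector_num % align) {
                /* Small piece up to the first aligned sector. */
                num = align - sector_num % align;
            } else if ((sector_num + num) % align) {
                /* Shorten the piece to end on an aligned sector. */
                num -= (sector_num + num) % align;
            }
        }

        printf("qcow2_co_write_zeroes off=%llx size=%x\n",
               (unsigned long long)sector_num * 512, (unsigned)(num * 512));

        sector_num += num;
        nb_sectors -= num;
    }
}

int main(void)
{
    int align = 0x10000 / 512;              /* 64 KB qcow2 cluster */

    /* Two consecutive requests as they could arrive from the NBD client:
     * [0, 0x1fe00) and [0x1fe00, 0x20000). */
    split_write_zeroes(0, 0x1fe00 / 512, align);
    split_write_zeroes(0x1fe00 / 512, 0x200 / 512, align);
    return 0;
}

This produces exactly the off=0 size=10000, off=10000 size=fe00 and
off=1fe00 size=200 entries seen above (the trace interleaves them in a
different order). So the splitting in block/io.c does happen; the trouble
is that the unaligned leftovers still reach qcow2_co_write_zeroes, get
-ENOTSUP there, and are then written out as literal zero buffers, which is
what allocates the clusters and grows the image.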
>> Thus bdrv_co_do_write_zeroes is helpless here, unfortunately.
> How can qcow2 fix something that bdrv_co_do_write_zeroes() can't
> possibly fix?
Yes. We are writing zeroes. If the block is not allocated, we could skip
the operation entirely, as long as there is no backing file or there is
no block at this guest offset anywhere in the backing chain.

> In particular, why does splitting the request in head, tail and aligned
> part help when done by qcow2, but the same thing doesn't help when done
> by bdrv_co_do_write_zeroes()?
The operation is skipped, as you can see above. Maybe we could just
return ENOTSUP if the block is allocated, to let the upper level do its
work. Something like:

+static int write_zeroes_chunk(BlockDriverState *bs, int64_t sector_num, int nr)
+{
+    int ret, count;
+    BlockDriverState *file;
+
+    ret = bdrv_get_block_status_above(bs, NULL, sector_num, nr, &count, &file);
+    if (ret > 0 && (ret & BDRV_BLOCK_ZERO) && count == nr) {
+        /* Nothing to do. The area is zeroed already.
+           Worth checking to avoid image expansion for non-aligned reqs. */
+        return 0;
+    }
+    return -ENOTSUP;
+}

> I'd actually be interested in both parts of the answer, because I'm not
> sure how _memory_ alignment on the client can possibly be fixed in
> qcow2; but if it's about _disk_ alignment, I don't understand why it
> can't be fixed in bdrv_co_do_write_zeroes().
The question is: "why write zeroes if we know that we will read zeroes on
the next attempt?" We could skip this write. This is the idea, see above.

>>>> - writes could be omitted when we know that the image already contains
>>>>   zeroes at the offsets being written
>>> I don't think this is a valid shortcut. The semantics of a write_zeroes
>>> operation is that the zeros (literal or as flags) are stored in this
>>> layer and that the backing file isn't involved any more for the given
>>> sectors. For example, a streaming operation or qemu-img rebase may
>>> involve write_zeroes operations, and relying on the backing file would
>>> cause corruption there (because the whole point of the operation is that
>>> the backing file can be removed).
>> This is not a problem. The block will be absent and thus it will be
>> read as zeroes.
> Removing a backing file doesn't mean that there won't still be another
> backing file. You may have only removed one node in the backing file
> chain, or in the case of rebase, you switch to another backing file.
Hmmm... We are on tricky ground here. We may read zeroes now but cannot
be sure we will read zeroes on the next attempt, especially if the
backing chain is changed.

Den