From: "Denis V. Lunev"
Date: Fri, 13 May 2016 19:09:39 +0300
Subject: Re: [Qemu-devel] [PATCH for 2.7 v3 1/1] qcow2: improve qcow2_co_write_zeroes()
To: Kevin Wolf
Cc: qemu-devel@nongnu.org, Max Reitz

On 05/12/2016 01:37 PM, Kevin Wolf wrote:
> Am 12.05.2016 um 11:00 hat Denis V. Lunev geschrieben:
>> On 05/11/2016 02:28 PM, Kevin Wolf wrote:
>>> Am 11.05.2016 um 09:00 hat Denis V. Lunev geschrieben:
>>>> There is a possibility that qcow2_co_write_zeroes() will be called
>>>> with a partial block. This can be triggered synthetically with
>>>>     qemu-io -c "write -z 32k 4k"
>>>> and can happen in real life in qemu-nbd. The latter happens under
>>>> the following conditions:
>>>> (1) qemu-nbd is started with --detect-zeroes=on and is connected to
>>>>     the kernel NBD client
>>>> (2) a third-party program opens the kernel NBD device with O_DIRECT
>>>> (3) the third-party program performs a write operation with a memory
>>>>     buffer not aligned to the page size
>>>> In this case qcow2_co_write_zeroes() is unable to perform the
>>>> operation, cannot mark the entire cluster as zeroed, and returns
>>>> ENOTSUP. Thus the caller falls back to the non-optimized version and
>>>> writes real zeroes to the disk.
>>>>
>>>> The patch creates a shortcut. If the block reads as zeroes, e.g. if
>>>> it is unallocated, the request is extended to cover the full block.
>>>> The user-visible content of the block is unchanged. Before the patch
>>>> the block was filled in the image with real zeroes; after the patch
>>>> it is marked as zeroed in the metadata. Thus any subsequent changes
>>>> in the backing store chain do not affect it.
>>>>
>>>> Kevin, thank you for a cool suggestion.
>>>>
>>>> Signed-off-by: Denis V. Lunev
>>>> Reviewed-by: Roman Kagan
>>>> CC: Kevin Wolf
>>>> CC: Max Reitz
>>>> ---
>>>> Changes from v2:
>>>> - checked head/tail clusters separately (one can be zeroed, one
>>>>   unallocated)
>>>> - fixed range calculations
>>>> - fixed a race when the block can become used just after the check
>>>> - fixed zero cluster detection
>>>> - minor tweaks in the description
>>>>
>>>> Changes from v1:
>>>> - description rewritten completely
>>>> - new approach suggested by Kevin implemented
>>>>
>>>>  block/qcow2.c | 65 +++++++++++++++++++++++++++++++++++++++++++++++++++++------
>>>>  1 file changed, 59 insertions(+), 6 deletions(-)

oops, the patch got committed... that is unexpected but great ;)
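To make the shortcut concrete, the core of the logic looks roughly like
this (a simplified sketch assembled from the fragments quoted below, not
the verbatim committed code; is_zero_cluster() is quoted further down):

    static coroutine_fn int qcow2_co_write_zeroes(BlockDriverState *bs,
        int64_t sector_num, int nb_sectors, BdrvRequestFlags flags)
    {
        BDRVQcow2State *s = bs->opaque;
        int head = sector_num % s->cluster_sectors;
        int tail = (sector_num + nb_sectors) % s->cluster_sectors;

        if (head != 0 || tail != 0) {
            /* Round the request out to cluster boundaries. */
            sector_num -= head;
            nb_sectors += head;
            if (tail != 0) {
                nb_sectors += s->cluster_sectors - tail;
            }

            /* Widening is only legal if the widened area already reads
             * as zeroes; otherwise fall back (-ENOTSUP) to writing real
             * zeroes. The real code also checks the tail cluster
             * separately, since head and tail can land in two different
             * clusters. */
            if (!is_zero_cluster(bs, sector_num)) {
                return -ENOTSUP;
            }
        }

        /* ... take s->lock, re-check, and mark the covered clusters as
         * zeroed in the metadata ... */
    }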
>>>> diff --git a/block/qcow2.c b/block/qcow2.c
>>>> index 470734b..c2474c1 100644
>>>> --- a/block/qcow2.c
>>>> +++ b/block/qcow2.c
>>>> @@ -2411,21 +2411,74 @@ finish:
>>>>      return ret;
>>>>  }
>>>>
>>>> +static bool is_zero_cluster(BlockDriverState *bs, int64_t start)
>>>> +{
>>>> +    BDRVQcow2State *s = bs->opaque;
>>>> +    int nr;
>>>> +    BlockDriverState *file;
>>>> +    int64_t res = bdrv_get_block_status_above(bs, NULL, start,
>>>> +                                              s->cluster_sectors, &nr, &file);
>>>> +    return res >= 0 && ((res & BDRV_BLOCK_ZERO) || !(res & BDRV_BLOCK_DATA));
>>> Why did you add the !(res & BDRV_BLOCK_DATA) condition? This means
>>> that all unallocated clusters return true, even if the backing file
>>> contains non-zero data for them.
>> this is correct. From my POV this means that this area is unallocated
>> in the entire backing chain and thus it will be read as zeroes, so we
>> can cover it with zeroes.
> You're right that I made a mistake, I was thinking of the non-recursive
> bdrv_get_block_status().
>
> However, I still think that we may not assume that !BDRV_BLOCK_DATA
> means zero data, even though that affects only more obscure cases. We
> have bdrv_unallocated_blocks_are_zero() to check whether the assumption
> is true. However, bdrv_co_get_block_status() already checks this
> internally and sets BDRV_BLOCK_ZERO in this case, so just checking
> BDRV_BLOCK_ZERO in qcow2 should be good.
>
> Did you find a case where you got !DATA, but not ZERO, and assuming
> zeroes was valid? If so, we may need to fix bdrv_co_get_block_status().

actually we may have the following (artificial) case:
- assume the backing store does not have bdrv_has_zero_init
- and qcow2 sits on top of this file
- reading from an unallocated block should return 0 (there is no data
  in either place), so the qcow2 layer will return 0

It looks like we will have this situation.

[skipped]

> Hm, I see:
>
>     if (bs->bl.write_zeroes_alignment
>         && num > bs->bl.write_zeroes_alignment) {
>
> Removing the second part should fix this, i.e. it would split a request
> into two unaligned halves even if there is no aligned "bulk" in the
> middle.
>
> I think it would match my expectations better, but maybe that's just
> me. What do you think?

actually the code here will not be significantly better (I presume),
but I'll give it a try

>>>> +        cl_end = sector_num + nb_sectors - s->cluster_sectors;
>>>> +        if (!is_zero_cluster(bs, cl_end)) {
>>>> +            return -ENOTSUP;
>>>> +        }
>>>> +    }
>>>> +
>>>> +    qemu_co_mutex_lock(&s->lock);
>>>> +    /* We can have new write after previous check */
>>>> +    if (!is_zero_cluster_top_locked(bs, sector_num) ||
>>>> +        (cl_end > 0 && !is_zero_cluster_top_locked(bs, cl_end))) {
>>>> +        qemu_co_mutex_unlock(&s->lock);
>>>> +        return -ENOTSUP;
>>>> +    }
>>> Just lock the mutex before the check, the possible optimisation for
>>> the emulation case (which is slow anyway) isn't worth the additional
>>> code complexity.
>> bdrv_get_block_status_above(bs) takes s->lock inside. This lock is not
>> recursive, thus the code would hang. This is the problem that the
>> split of the checks tries to address.
>>
>> Maybe we could make the lock recursive...
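For reference, the locked recheck sidesteps the deadlock because it
consults only the top layer's L2 metadata and never re-enters the
block-status machinery. A sketch along the lines of the patch
(paraphrased, not a verbatim quote):

    static bool is_zero_cluster_top_locked(BlockDriverState *bs,
                                           int64_t start)
    {
        BDRVQcow2State *s = bs->opaque;
        int nr = s->cluster_sectors;
        uint64_t off;
        int ret;

        /* qcow2_get_cluster_offset() touches only this layer's metadata
         * and does not take s->lock itself, so it is safe to call while
         * we hold the lock. */
        ret = qcow2_get_cluster_offset(bs, start << BDRV_SECTOR_BITS,
                                       &nr, &off);
        return ret == QCOW2_CLUSTER_UNALLOCATED ||
               ret == QCOW2_CLUSTER_ZERO;
    }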
> Maybe your version is not far from the best we can do then. It
> deserves a comment, though, because it's not completely obvious.
>
> The other option that we have, and that looks reasonable enough to me,
> is checking is_zero_cluster_top_locked() first and only if that
> returns false, checking the block status of the backing chain,
> starting at bs->backing->bs. This way we would bypass the recursive
> call and could take the lock from the beginning. If we go that way, it
> deserves a comment as well.
>
> Kevin

OK. I'll send at least improved comments and (maybe) the removal of
"&& num > bs->bl.write_zeroes_alignment" as a follow-up.

thank you for the ideas ;)

Den
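For that follow-up, Kevin's alternative might look something like the
sketch below. This is hypothetical and untested, and
is_zero_cluster_locked() is an invented name. The key point is that the
block-status query starts at bs->backing->bs, so the non-recursive
s->lock of this BDS can be held across the whole check:

    /* Hypothetical sketch (untested): a single zero-check that is safe
     * under s->lock. The block-status query starts at bs->backing->bs,
     * so it never re-enters this BDS and never retakes s->lock. */
    static bool is_zero_cluster_locked(BlockDriverState *bs,
                                       int64_t start)
    {
        BDRVQcow2State *s = bs->opaque;
        BlockDriverState *file;
        int nr = s->cluster_sectors;
        uint64_t off;
        int64_t res;
        int ret;

        ret = qcow2_get_cluster_offset(bs, start << BDRV_SECTOR_BITS,
                                       &nr, &off);
        if (ret == QCOW2_CLUSTER_ZERO) {
            return true;        /* explicitly zeroed in this layer */
        }
        if (ret != QCOW2_CLUSTER_UNALLOCATED) {
            return false;       /* real data in the top layer */
        }
        if (!bs->backing) {
            return true;        /* unallocated with no backing file */
        }

        /* Unallocated in the top layer: the answer depends on the
         * backing chain, which can be queried without recursion. */
        res = bdrv_get_block_status_above(bs->backing->bs, NULL, start,
                                          s->cluster_sectors, &nr, &file);
        return res >= 0 && (res & BDRV_BLOCK_ZERO);
    }

Checking only BDRV_BLOCK_ZERO on the backing chain follows the
reasoning above that bdrv_co_get_block_status() already folds the
unallocated-reads-as-zero case into BDRV_BLOCK_ZERO.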