From: "Denis V. Lunev"
To: Kevin Wolf
Cc: qemu-devel@nongnu.org, Max Reitz
Subject: Re: [Qemu-devel] [PATCH for 2.7 v3 1/1] qcow2: improve qcow2_co_write_zeroes()
Date: Fri, 13 May 2016 19:37:05 +0300
Message-ID: <573602B1.6060801@openvz.org>
In-Reply-To: <20160513162449.GD5529@noname.redhat.com>
References: <1462950014-1821-1-git-send-email-den@openvz.org>
 <20160511112830.GB4524@noname.str.redhat.com>
 <57344646.7050803@openvz.org>
 <20160512103730.GD4794@noname.redhat.com>
 <5735FC43.9000700@openvz.org>
 <20160513162449.GD5529@noname.redhat.com>

On 05/13/2016 07:24 PM, Kevin Wolf wrote:
> Am 13.05.2016 um 18:09 hat Denis V. Lunev geschrieben:
>> oops, the patch got committed... that is unexpected but great ;)
> Oops, that was not intended, I just forgot to remove it from the queue
> when I realised that it's not ready yet.
>
> Should I stage a revert or do you prefer to fix it on top?
I'd better do this on top. You will have the fix tomorrow.

>>>>>> diff --git a/block/qcow2.c b/block/qcow2.c
>>>>>> index 470734b..c2474c1 100644
>>>>>> --- a/block/qcow2.c
>>>>>> +++ b/block/qcow2.c
>>>>>> @@ -2411,21 +2411,74 @@ finish:
>>>>>>      return ret;
>>>>>>  }
>>>>>> +
>>>>>> +static bool is_zero_cluster(BlockDriverState *bs, int64_t start)
>>>>>> +{
>>>>>> +    BDRVQcow2State *s = bs->opaque;
>>>>>> +    int nr;
>>>>>> +    BlockDriverState *file;
>>>>>> +    int64_t res = bdrv_get_block_status_above(bs, NULL, start,
>>>>>> +                                              s->cluster_sectors, &nr, &file);
>>>>>> +    return res >= 0 && ((res & BDRV_BLOCK_ZERO) || !(res & BDRV_BLOCK_DATA));
>>>>> Why did you add the !(res & BDRV_BLOCK_DATA) condition? This means that
>>>>> all unallocated clusters return true, even if the backing file contains
>>>>> non-zero data for them.
>>>> this is correct. From my POV this means that this area is unallocated
>>>> in the entire backing chain and thus it will be read as zeroes. Thus
>>>> we could cover it with zeroes.
>>> You're right that I made a mistake, I was thinking of the non-recursive
>>> bdrv_get_block_status().
>>>
>>> However, I still think that we may not assume that !BDRV_BLOCK_DATA
>>> means zero data, even though that affects only more obscure cases. We
>>> have bdrv_unallocated_blocks_are_zero() to check whether the assumption
>>> is true. However, bdrv_co_get_block_status() already checks this
>>> internally and sets BDRV_BLOCK_ZERO in this case, so just checking
>>> BDRV_BLOCK_ZERO in qcow2 should be good.
>>>
>>> Did you find a case where you got !DATA, but not ZERO, and assuming
>>> zeroes was valid? If so, we may need to fix bdrv_co_get_block_status().
>> actually we may have the following (artificial) case:
>> - assume the backing store does not have bdrv_has_zero_init
>> - and a qcow2 image on top of this file
>> - reading from an unallocated block should return 0 (no data in either
>>   place), so the qcow2 layer will return 0
>>
>> It looks like we will have this situation.
> qcow2 sets bdi->unallocated_blocks_are_zero = true, so in this case you
> should correctly get BDRV_BLOCK_ZERO from bdrv_get_block_status().
I will check that.

>>> Hm, I see:
>>>
>>>     if (bs->bl.write_zeroes_alignment
>>>         && num > bs->bl.write_zeroes_alignment) {
>>>
>>> Removing the second part should fix this, i.e. it would split a request
>>> into two unaligned halves even if there is no aligned "bulk" in the
>>> middle.
>>>
>>> I think it would match my expectations better, but maybe that's just me.
>>> What do you think?
>> actually the code here will not be significantly better (I presume),
>> but I'll give it a try
> Yes, I agree that it won't make the qcow2 code significantly simpler. I
> just think that it would be less surprising semantics.
>
>>>>>> +        cl_end = sector_num + nb_sectors - s->cluster_sectors;
>>>>>> +        if (!is_zero_cluster(bs, cl_end)) {
>>>>>> +            return -ENOTSUP;
>>>>>> +        }
>>>>>> +    }
>>>>>> +
>>>>>> +    qemu_co_mutex_lock(&s->lock);
>>>>>> +    /* We can have new write after previous check */
>>>>>> +    if (!is_zero_cluster_top_locked(bs, sector_num) ||
>>>>>> +        (cl_end > 0 && !is_zero_cluster_top_locked(bs, cl_end))) {
>>>>>> +        qemu_co_mutex_unlock(&s->lock);
>>>>>> +        return -ENOTSUP;
>>>>>> +    }
>>>>> Just lock the mutex before the check; the possible optimisation for the
>>>>> emulation case (which is slow anyway) isn't worth the additional code
>>>>> complexity.
>>>> bdrv_get_block_status_above(bs) takes s->lock inside. This lock is not
>>>> recursive, thus the code will hang. This is the problem being
>>>> addressed with this split of checks.
>>>>
>>>> Maybe we could make the lock recursive...
>>> Maybe your version is not far from the best we can do then. It deserves a
>>> comment, though, because it's not completely obvious.
>>>
>>> The other option that we have and that looks reasonable enough to me is
>>> checking is_zero_cluster_top_locked() first and only if that returns
>>> false, we check the block status of the backing chain, starting at
>>> bs->backing->bs. This way we would bypass the recursive call and could
>>> take the lock from the beginning. If we go that way, it deserves a
>>> comment as well.
>>>
>>> Kevin
>> OK. I'll send at least improved comments and (maybe)
>> removal of "&& num > bs->bl.write_zeroes_alignment"
>> as a follow-up.
> The most important part is actually fixing is_zero_cluster(), because
> that's a real bug that can corrupt data in the !has_zero_init case. This
> is also the reason why I would revert the patch if we don't have a fix
> for this before my next pull request.
>
> The rest is just making things a bit nicer, so follow-ups are very
> welcome, but not as critical.
>
> Kevin
You will have the fixup tomorrow. I don't want to touch this too late.

Den
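
For reference, the is_zero_cluster() fix being discussed here -- trusting
only BDRV_BLOCK_ZERO, which the block layer already sets for unallocated
clusters when bdrv_unallocated_blocks_are_zero() holds for the chain --
could look roughly like this. This is a sketch against the 2.6-era block
API, not the actual follow-up commit:

    static bool is_zero_cluster(BlockDriverState *bs, int64_t start)
    {
        BDRVQcow2State *s = bs->opaque;
        int nr;
        BlockDriverState *file;
        int64_t res = bdrv_get_block_status_above(bs, NULL, start,
                                                  s->cluster_sectors,
                                                  &nr, &file);

        /* Only trust BDRV_BLOCK_ZERO here. bdrv_co_get_block_status()
         * already sets it for unallocated clusters when the chain
         * guarantees they read as zeroes, so a bare !BDRV_BLOCK_DATA
         * must not be treated as "reads as zeroes". */
        return res >= 0 && (res & BDRV_BLOCK_ZERO);
    }

The only behavioural change compared to the committed version is dropping
the "|| !(res & BDRV_BLOCK_DATA)" part of the return expression.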
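
The reason Kevin expects BDRV_BLOCK_ZERO rather than a bare !BDRV_BLOCK_DATA
even in the artificial no-zero-init case is the translation done inside
bdrv_co_get_block_status() itself. From memory of the 2.6-era block/io.c,
the relevant part is approximately the following (quoted from memory, so
treat it as an illustration rather than the exact upstream code):

    /* inside bdrv_co_get_block_status(), after the driver callback */
    if (!(ret & BDRV_BLOCK_DATA) && !(ret & BDRV_BLOCK_ZERO)) {
        if (bdrv_unallocated_blocks_are_zero(bs)) {
            /* qcow2 reports unallocated_blocks_are_zero = true,
             * so unallocated clusters are promoted to ZERO here */
            ret |= BDRV_BLOCK_ZERO;
        } else if (bs->backing) {
            BlockDriverState *bs2 = bs->backing->bs;
            int64_t nb_sectors2 = bdrv_nb_sectors(bs2);
            if (nb_sectors2 >= 0 && sector_num >= nb_sectors2) {
                /* reads beyond the end of the backing file are zero */
                ret |= BDRV_BLOCK_ZERO;
            }
        }
    }

So for a qcow2 image, an unallocated cluster with no data anywhere in the
chain should come back with BDRV_BLOCK_ZERO set, which is why checking
only that flag in qcow2 is sufficient.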