From: "Denis V. Lunev"
To: Kevin Wolf
Cc: qemu-devel@nongnu.org, Max Reitz
Subject: Re: [Qemu-devel] [PATCH for 2.7 v3 1/1] qcow2: improve qcow2_co_write_zeroes()
Date: Fri, 13 May 2016 19:37:05 +0300
Message-ID: <573602B1.6060801@openvz.org>
In-Reply-To: <20160513162449.GD5529@noname.redhat.com>
References: <1462950014-1821-1-git-send-email-den@openvz.org>
 <20160511112830.GB4524@noname.str.redhat.com>
 <57344646.7050803@openvz.org>
 <20160512103730.GD4794@noname.redhat.com>
 <5735FC43.9000700@openvz.org>
 <20160513162449.GD5529@noname.redhat.com>

On 05/13/2016 07:24 PM, Kevin Wolf wrote:
> Am 13.05.2016 um 18:09 hat Denis V. Lunev geschrieben:
>> oops, the patch got committed... that is unexpected but great ;)
> Oops, that was not intended, I just forgot to remove it from the queue
> when I realised that it's not ready yet.
>
> Should I stage a revert or do you prefer to fix it on top?
I'd better do this on top. You will have the fix tomorrow.

>>>>>> diff --git a/block/qcow2.c b/block/qcow2.c
>>>>>> index 470734b..c2474c1 100644
>>>>>> --- a/block/qcow2.c
>>>>>> +++ b/block/qcow2.c
>>>>>> @@ -2411,21 +2411,74 @@ finish:
>>>>>>      return ret;
>>>>>>  }
>>>>>> +
>>>>>> +static bool is_zero_cluster(BlockDriverState *bs, int64_t start)
>>>>>> +{
>>>>>> +    BDRVQcow2State *s = bs->opaque;
>>>>>> +    int nr;
>>>>>> +    BlockDriverState *file;
>>>>>> +    int64_t res = bdrv_get_block_status_above(bs, NULL, start,
>>>>>> +                                              s->cluster_sectors, &nr, &file);
>>>>>> +    return res >= 0 && ((res & BDRV_BLOCK_ZERO) || !(res & BDRV_BLOCK_DATA));
>>>>> Why did you add the !(res & BDRV_BLOCK_DATA) condition? This means that
>>>>> all unallocated clusters return true, even if the backing file contains
>>>>> non-zero data for them.
>>>> this is correct. From my POV this means that this area is unallocated
>>>> in the entire backing chain and thus it will be read as zeroes. Thus
>>>> we could cover it with zeroes.
>>> You're right that I made a mistake, I was thinking of the non-recursive
>>> bdrv_get_block_status().
>>>
>>> However, I still think that we may not assume that !BDRV_BLOCK_DATA
>>> means zero data, even though that affects only more obscure cases. We
>>> have bdrv_unallocated_blocks_are_zero() to check whether the assumption
>>> is true. However, bdrv_co_get_block_status() already checks this
>>> internally and sets BDRV_BLOCK_ZERO in this case, so just checking
>>> BDRV_BLOCK_ZERO in qcow2 should be good.
>>>
>>> Did you find a case where you got !DATA, but not ZERO, and assuming
>>> zeroes was valid? If so, we may need to fix bdrv_co_get_block_status().
>> actually we may have the following (artificial) case:
>> - assume the backing store does not have bdrv_has_zero_init
>> - and a qcow2 image on top of this file
>> - reading from an unallocated block should return 0 (no data in either
>>   place), so the qcow2 layer will return 0
>>
>> It looks like we will have this situation.
> qcow2 sets bdi->unallocated_blocks_are_zero = true, so in this case you
> should correctly get BDRV_BLOCK_ZERO from bdrv_get_block_status().
I will check that.

>>> Hm, I see:
>>>
>>>     if (bs->bl.write_zeroes_alignment
>>>         && num > bs->bl.write_zeroes_alignment) {
>>>
>>> Removing the second part should fix this, i.e. it would split a request
>>> into two unaligned halves even if there is no aligned "bulk" in the
>>> middle.
>>>
>>> I think it would match my expectations better, but maybe that's just me.
>>> What do you think?
>> actually the code here will not be significantly better (I presume),
>> but I'll give it a try
> Yes, I agree that it won't make the qcow2 code significantly simpler. I
> just think that it would be less surprising semantics.
>
>>>>>> +        cl_end = sector_num + nb_sectors - s->cluster_sectors;
>>>>>> +        if (!is_zero_cluster(bs, cl_end)) {
>>>>>> +            return -ENOTSUP;
>>>>>> +        }
>>>>>> +    }
>>>>>> +
>>>>>> +    qemu_co_mutex_lock(&s->lock);
>>>>>> +    /* We can have new write after previous check */
>>>>>> +    if (!is_zero_cluster_top_locked(bs, sector_num) ||
>>>>>> +        (cl_end > 0 && !is_zero_cluster_top_locked(bs, cl_end))) {
>>>>>> +        qemu_co_mutex_unlock(&s->lock);
>>>>>> +        return -ENOTSUP;
>>>>>> +    }
>>>>> Just lock the mutex before the check; the possible optimisation for the
>>>>> emulation case (which is slow anyway) isn't worth the additional code
>>>>> complexity.
>>>> bdrv_get_block_status_above(bs) takes s->lock inside. This lock is not
>>>> recursive, thus the code will hang. This is the problem being
>>>> addressed with this split of checks.
>>>>
>>>> Maybe we could make the lock recursive...
>>> Maybe your version is not far from the best we can do then. It deserves a
>>> comment, though, because it's not completely obvious.
>>>
>>> The other option that we have and that looks reasonable enough to me is
>>> checking is_zero_cluster_top_locked() first and only if that returns
>>> false, we check the block status of the backing chain, starting at
>>> bs->backing->bs. This way we would bypass the recursive call and could
>>> take the lock from the beginning. If we go that way, it deserves a
>>> comment as well.
>>>
>>> Kevin
>> OK. I'll send at least improved comments and (maybe)
>> removal of "&& num > bs->bl.write_zeroes_alignment"
>> as a follow-up.
> The most important part is actually fixing is_zero_cluster(), because
> that's a real bug that can corrupt data in the !has_zero_init case. This
> is also the reason why I would revert the patch if we don't have a fix
> for this before my next pull request.
>
> The rest is just making things a bit nicer, so follow-ups are very
> welcome, but not as critical.
>
> Kevin
You will have the fixup tomorrow. I don't want to touch this too late.

Den
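
For reference, the is_zero_cluster() fix being discussed here -- trusting
only BDRV_BLOCK_ZERO, which the block layer already sets for unallocated
clusters when bdrv_unallocated_blocks_are_zero() holds for the chain --
could look roughly like this. This is a sketch against the 2.6-era block
API, not the actual follow-up commit:

    static bool is_zero_cluster(BlockDriverState *bs, int64_t start)
    {
        BDRVQcow2State *s = bs->opaque;
        int nr;
        BlockDriverState *file;
        int64_t res = bdrv_get_block_status_above(bs, NULL, start,
                                                  s->cluster_sectors,
                                                  &nr, &file);

        /* Only trust BDRV_BLOCK_ZERO here. bdrv_co_get_block_status()
         * already sets it for unallocated clusters when the chain
         * guarantees they read as zeroes, so a bare !BDRV_BLOCK_DATA
         * must not be treated as "reads as zeroes". */
        return res >= 0 && (res & BDRV_BLOCK_ZERO);
    }

The only behavioural change compared to the committed version is dropping
the "|| !(res & BDRV_BLOCK_DATA)" part of the return expression.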
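
The reason Kevin expects BDRV_BLOCK_ZERO rather than a bare !BDRV_BLOCK_DATA
even in the artificial no-zero-init case is the translation done inside
bdrv_co_get_block_status() itself. From memory of the 2.6-era block/io.c,
the relevant part is approximately the following (quoted from memory, so
treat it as an illustration rather than the exact upstream code):

    /* inside bdrv_co_get_block_status(), after the driver callback */
    if (!(ret & BDRV_BLOCK_DATA) && !(ret & BDRV_BLOCK_ZERO)) {
        if (bdrv_unallocated_blocks_are_zero(bs)) {
            /* qcow2 reports unallocated_blocks_are_zero = true,
             * so unallocated clusters are promoted to ZERO here */
            ret |= BDRV_BLOCK_ZERO;
        } else if (bs->backing) {
            BlockDriverState *bs2 = bs->backing->bs;
            int64_t nb_sectors2 = bdrv_nb_sectors(bs2);
            if (nb_sectors2 >= 0 && sector_num >= nb_sectors2) {
                /* reads beyond the end of the backing file are zero */
                ret |= BDRV_BLOCK_ZERO;
            }
        }
    }

So for a qcow2 image, an unallocated cluster with no data anywhere in the
chain should come back with BDRV_BLOCK_ZERO set, which is why checking
only that flag in qcow2 is sufficient.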