From: "Denis V. Lunev"
Date: Fri, 13 May 2016 19:09:39 +0300
Subject: Re: [Qemu-devel] [PATCH for 2.7 v3 1/1] qcow2: improve qcow2_co_write_zeroes()
To: Kevin Wolf
Cc: qemu-devel@nongnu.org, Max Reitz

On 05/12/2016 01:37 PM, Kevin Wolf wrote:
> Am 12.05.2016 um 11:00 hat Denis V. Lunev geschrieben:
>> On 05/11/2016 02:28 PM, Kevin Wolf wrote:
>>> Am 11.05.2016 um 09:00 hat Denis V. Lunev geschrieben:
>>>> There is a possibility that qcow2_co_write_zeroes() will be called
>>>> with a partial block. This can be triggered synthetically with
>>>>     qemu-io -c "write -z 32k 4k"
>>>> and can happen in real life in qemu-nbd. The latter happens under
>>>> the following conditions:
>>>> (1) qemu-nbd is started with --detect-zeroes=on and is connected to
>>>>     the kernel NBD client
>>>> (2) a third-party program opens the kernel NBD device with O_DIRECT
>>>> (3) the third-party program performs a write operation with a memory
>>>>     buffer not aligned to the page size
>>>> In this case qcow2_co_write_zeroes() is unable to perform the
>>>> operation, cannot mark the entire cluster as zeroed, and returns
>>>> ENOTSUP. Thus the caller falls back to the non-optimized version and
>>>> writes real zeroes to the disk.
>>>>
>>>> The patch creates a shortcut. If the block reads as zeroes, e.g. if
>>>> it is unallocated, the request is extended to cover the full block.
>>>> The user-visible content of the block is unchanged. Before the patch
>>>> the block was filled in the image with real zeroes; after the patch
>>>> it is marked as zeroed in the metadata. Thus any subsequent changes
>>>> in the backing store chain do not affect it.
>>>>
>>>> Kevin, thank you for a cool suggestion.
>>>>
>>>> Signed-off-by: Denis V. Lunev
>>>> Reviewed-by: Roman Kagan
>>>> CC: Kevin Wolf
>>>> CC: Max Reitz
>>>> ---
>>>> Changes from v2:
>>>> - checked head/tail clusters separately (one can be zeroed, one
>>>>   unallocated)
>>>> - fixed range calculations
>>>> - fixed a race when the block can become used just after the check
>>>> - fixed zero cluster detection
>>>> - minor tweaks in the description
>>>>
>>>> Changes from v1:
>>>> - description rewritten completely
>>>> - new approach suggested by Kevin implemented
>>>>
>>>>  block/qcow2.c | 65 +++++++++++++++++++++++++++++++++++++++++++++++++++++------
>>>>  1 file changed, 59 insertions(+), 6 deletions(-)

oops, the patch got committed... that is unexpected but great ;)
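To make the shortcut concrete, the core of the logic looks roughly like
this (a simplified sketch assembled from the fragments quoted below, not
the verbatim committed code; is_zero_cluster() is quoted further down):

    static coroutine_fn int qcow2_co_write_zeroes(BlockDriverState *bs,
        int64_t sector_num, int nb_sectors, BdrvRequestFlags flags)
    {
        BDRVQcow2State *s = bs->opaque;
        int head = sector_num % s->cluster_sectors;
        int tail = (sector_num + nb_sectors) % s->cluster_sectors;

        if (head != 0 || tail != 0) {
            /* Round the request out to cluster boundaries. */
            sector_num -= head;
            nb_sectors += head;
            if (tail != 0) {
                nb_sectors += s->cluster_sectors - tail;
            }

            /* Widening is only legal if the widened area already reads
             * as zeroes; otherwise fall back (-ENOTSUP) to writing real
             * zeroes. The real code also checks the tail cluster
             * separately, since head and tail can land in two different
             * clusters. */
            if (!is_zero_cluster(bs, sector_num)) {
                return -ENOTSUP;
            }
        }

        /* ... take s->lock, re-check, and mark the covered clusters as
         * zeroed in the metadata ... */
    }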
>>>> diff --git a/block/qcow2.c b/block/qcow2.c
>>>> index 470734b..c2474c1 100644
>>>> --- a/block/qcow2.c
>>>> +++ b/block/qcow2.c
>>>> @@ -2411,21 +2411,74 @@ finish:
>>>>      return ret;
>>>>  }
>>>>
>>>> +static bool is_zero_cluster(BlockDriverState *bs, int64_t start)
>>>> +{
>>>> +    BDRVQcow2State *s = bs->opaque;
>>>> +    int nr;
>>>> +    BlockDriverState *file;
>>>> +    int64_t res = bdrv_get_block_status_above(bs, NULL, start,
>>>> +                                              s->cluster_sectors, &nr, &file);
>>>> +    return res >= 0 && ((res & BDRV_BLOCK_ZERO) || !(res & BDRV_BLOCK_DATA));
>>> Why did you add the !(res & BDRV_BLOCK_DATA) condition? This means
>>> that all unallocated clusters return true, even if the backing file
>>> contains non-zero data for them.
>> this is correct. From my POV this means that this area is unallocated
>> in the entire backing chain and thus it will be read as zeroes, so we
>> can cover it with zeroes.
> You're right that I made a mistake, I was thinking of the non-recursive
> bdrv_get_block_status().
>
> However, I still think that we may not assume that !BDRV_BLOCK_DATA
> means zero data, even though that affects only more obscure cases. We
> have bdrv_unallocated_blocks_are_zero() to check whether the assumption
> is true. However, bdrv_co_get_block_status() already checks this
> internally and sets BDRV_BLOCK_ZERO in this case, so just checking
> BDRV_BLOCK_ZERO in qcow2 should be good.
>
> Did you find a case where you got !DATA, but not ZERO, and assuming
> zeroes was valid? If so, we may need to fix bdrv_co_get_block_status().

actually we may have the following (artificial) case:
- assume the backing store does not have bdrv_has_zero_init
- and qcow2 sits on top of this file
- reading from an unallocated block should return 0 (there is no data
  in either place), so the qcow2 layer will return 0

It looks like we will have this situation.

[skipped]

> Hm, I see:
>
>     if (bs->bl.write_zeroes_alignment
>         && num > bs->bl.write_zeroes_alignment) {
>
> Removing the second part should fix this, i.e. it would split a request
> into two unaligned halves even if there is no aligned "bulk" in the
> middle.
>
> I think it would match my expectations better, but maybe that's just
> me. What do you think?

actually the code here will not be significantly better (I presume),
but I'll give it a try

>>>> +        cl_end = sector_num + nb_sectors - s->cluster_sectors;
>>>> +        if (!is_zero_cluster(bs, cl_end)) {
>>>> +            return -ENOTSUP;
>>>> +        }
>>>> +    }
>>>> +
>>>> +    qemu_co_mutex_lock(&s->lock);
>>>> +    /* We can have new write after previous check */
>>>> +    if (!is_zero_cluster_top_locked(bs, sector_num) ||
>>>> +        (cl_end > 0 && !is_zero_cluster_top_locked(bs, cl_end))) {
>>>> +        qemu_co_mutex_unlock(&s->lock);
>>>> +        return -ENOTSUP;
>>>> +    }
>>> Just lock the mutex before the check, the possible optimisation for
>>> the emulation case (which is slow anyway) isn't worth the additional
>>> code complexity.
>> bdrv_get_block_status_above(bs) takes s->lock inside. This lock is not
>> recursive, thus the code would hang. This is the problem that the
>> split of the checks tries to address.
>>
>> Maybe we could make the lock recursive...
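For reference, the locked recheck sidesteps the deadlock because it
consults only the top layer's L2 metadata and never re-enters the
block-status machinery. A sketch along the lines of the patch
(paraphrased, not a verbatim quote):

    static bool is_zero_cluster_top_locked(BlockDriverState *bs,
                                           int64_t start)
    {
        BDRVQcow2State *s = bs->opaque;
        int nr = s->cluster_sectors;
        uint64_t off;
        int ret;

        /* qcow2_get_cluster_offset() touches only this layer's metadata
         * and does not take s->lock itself, so it is safe to call while
         * we hold the lock. */
        ret = qcow2_get_cluster_offset(bs, start << BDRV_SECTOR_BITS,
                                       &nr, &off);
        return ret == QCOW2_CLUSTER_UNALLOCATED ||
               ret == QCOW2_CLUSTER_ZERO;
    }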
> Maybe your version is not far from the best we can do then. It
> deserves a comment, though, because it's not completely obvious.
>
> The other option that we have, and that looks reasonable enough to me,
> is checking is_zero_cluster_top_locked() first and only if that
> returns false, checking the block status of the backing chain,
> starting at bs->backing->bs. This way we would bypass the recursive
> call and could take the lock from the beginning. If we go that way, it
> deserves a comment as well.
>
> Kevin

OK. I'll send at least improved comments and (maybe) the removal of
"&& num > bs->bl.write_zeroes_alignment" as a follow-up.

thank you for the ideas ;)

Den
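For that follow-up, Kevin's alternative might look something like the
sketch below. This is hypothetical and untested, and
is_zero_cluster_locked() is an invented name. The key point is that the
block-status query starts at bs->backing->bs, so the non-recursive
s->lock of this BDS can be held across the whole check:

    /* Hypothetical sketch (untested): a single zero-check that is safe
     * under s->lock. The block-status query starts at bs->backing->bs,
     * so it never re-enters this BDS and never retakes s->lock. */
    static bool is_zero_cluster_locked(BlockDriverState *bs,
                                       int64_t start)
    {
        BDRVQcow2State *s = bs->opaque;
        BlockDriverState *file;
        int nr = s->cluster_sectors;
        uint64_t off;
        int64_t res;
        int ret;

        ret = qcow2_get_cluster_offset(bs, start << BDRV_SECTOR_BITS,
                                       &nr, &off);
        if (ret == QCOW2_CLUSTER_ZERO) {
            return true;        /* explicitly zeroed in this layer */
        }
        if (ret != QCOW2_CLUSTER_UNALLOCATED) {
            return false;       /* real data in the top layer */
        }
        if (!bs->backing) {
            return true;        /* unallocated with no backing file */
        }

        /* Unallocated in the top layer: the answer depends on the
         * backing chain, which can be queried without recursion. */
        res = bdrv_get_block_status_above(bs->backing->bs, NULL, start,
                                          s->cluster_sectors, &nr, &file);
        return res >= 0 && (res & BDRV_BLOCK_ZERO);
    }

Checking only BDRV_BLOCK_ZERO on the backing chain follows the
reasoning above that bdrv_co_get_block_status() already folds the
unallocated-reads-as-zero case into BDRV_BLOCK_ZERO.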