From: "Denis V. Lunev"
Date: Wed, 15 Jun 2016 16:18:08 +0300
Subject: Re: [Qemu-devel] [PATCH 4/9] mirror: efficiently zero out target
To: Eric Blake, qemu-devel@nongnu.org, qemu-block@nongnu.org
Cc: vsementsov@virtuozzo.com, Stefan Hajnoczi, Fam Zheng, Kevin Wolf, Max Reitz, Jeff Cody
Message-ID: <57615590.2010801@openvz.org>
In-Reply-To: <57614B5A.4030902@redhat.com>
References: <1465917916-22348-1-git-send-email-den@openvz.org> <1465917916-22348-5-git-send-email-den@openvz.org> <5760C4CB.6080703@redhat.com> <576115F9.8020109@openvz.org> <57614B5A.4030902@redhat.com>

On 06/15/2016 03:34 PM, Eric Blake wrote:
> On 06/15/2016 02:46 AM, Denis V. Lunev wrote:
>> On 06/15/2016 06:00 AM, Eric Blake wrote:
>>> On 06/14/2016 09:25 AM, Denis V. Lunev wrote:
>>>> With a bdrv_co_write_zeroes method on a target BDS, zeroes will not
>>>> be placed on the wire, so the target can be zeroed out very
>>>> efficiently. This should be done with the largest chunk possible.
>>>>
>>> Probably nicer to track this in bytes. And do you really want a
>>> hard-coded arbitrary limit, or is it better to live with
>>> MIN_NON_ZERO(target_bs->bl.max_pwrite_zeroes, INT_MAX)?
>> Unfortunately we need the hard-coded limit: INT_MAX is not aligned
>> as required. Maybe we should align INT_MAX down to fulfill the
>> write_zeroes alignment instead.
>>
>> Hmm, maybe we can indeed align INT_MAX down properly. OK,
>> I'll try to do that gracefully.
> It's fairly easy to round a max_transfer or max_pwrite_zeroes down to
> an aligned value; we already have code in io.c that does that in
> bdrv_co_do_pwrite_zeroes().

ok
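Something like the following standalone sketch is what I have in mind
(illustration only, not the actual io.c code; min_non_zero() and
align_down() below just mimic QEMU's MIN_NON_ZERO() and
QEMU_ALIGN_DOWN(), and the 64k alignment is an arbitrary example):

#include <assert.h>
#include <inttypes.h>
#include <limits.h>
#include <stdint.h>
#include <stdio.h>

/* Mimics MIN_NON_ZERO(): a value of 0 means "no limit reported". */
static int64_t min_non_zero(int64_t a, int64_t b)
{
    if (a == 0) {
        return b;
    }
    if (b == 0) {
        return a;
    }
    return a < b ? a : b;
}

/* Mimics QEMU_ALIGN_DOWN() for positive alignments. */
static int64_t align_down(int64_t x, int64_t alignment)
{
    return (x / alignment) * alignment;
}

int main(void)
{
    int64_t max_pwrite_zeroes = 0;              /* driver gave no limit */
    int64_t write_zeroes_alignment = 64 * 1024; /* e.g. 64k cluster size */

    /* Cap at INT_MAX when the driver reports no limit ... */
    int64_t limit = min_non_zero(max_pwrite_zeroes, INT_MAX);
    /* ... then round down so a maximal zeroing request stays aligned. */
    limit = align_down(limit, write_zeroes_alignment);

    assert(limit > 0 && limit % write_zeroes_alignment == 0);
    printf("largest aligned zero request: %" PRId64 " bytes\n", limit);
    return 0;
}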
>>>> @@ -512,7 +513,8 @@ static int mirror_dirty_init(MirrorBlockJob *s)
>>>>       end = s->bdev_length / BDRV_SECTOR_SIZE;
>>>> -    if (base == NULL && !bdrv_has_zero_init(target_bs)) {
>>>> +    if (base == NULL && !bdrv_has_zero_init(target_bs) &&
>>>> +        target_bs->drv->bdrv_co_write_zeroes == NULL) {
>>> Indentation is off, although if checkpatch.pl doesn't complain I
>>> guess it doesn't matter that much.
>>>
>>> Why should you care whether the target_bs->drv implements a callback?
>>> Can't you just rely on the normal bdrv_*() functions to do the dirty
>>> work of picking the most efficient implementation without you having
>>> to bypass the block layer? In fact, isn't that the whole goal of
>>> bdrv_make_zero() - why not call that instead of reimplementing it?
>> That is the idea of the patch, actually. If the callback is not
>> implemented, the zeroes will actually be written or sent over the
>> wire. In that case there is not much sense in doing this: the amount
>> of data actually written would increase significantly (some areas
>> would be written twice - first with zeroes and then with the actual
>> data).
>>
> But that's where bdrv_can_write_zeroes_with_unmap() comes in handy -
> you can use the public interface to learn whether bdrv_make_zero()
> will be efficient or not, without having to probe what the backend
> supports.

bool bdrv_can_write_zeroes_with_unmap(BlockDriverState *bs)
{
    BlockDriverInfo bdi;

    if (bs->backing || !(bs->open_flags & BDRV_O_UNMAP)) {
        return false;
    }

    if (bdrv_get_info(bs, &bdi) == 0) {
        return bdi.can_write_zeroes_with_unmap;
    }

    return false;
}

This function looks rotten. We CAN efficiently zero out QCOW2 images
even when a backing store is present. At the same time, the availability
of bdrv_co_write_zeroes does not guarantee that it actually works (NFS,
CIFS, etc. for raw-posix.c).

>> On the other hand, if the callback is implemented, only a very small
>> amount of data goes over the wire and is actually written, so we get
>> a real benefit. I am trying to avoid very small chunks of data. Here
>> (during the migration process) the data is sent in 10 MB chunks,
>> which takes a LOT of time with NBD. We can currently send 1.5 GB
>> chunks; they occupy the same 26 bytes on the transport layer.
> I agree that we don't want to pre-initialize the device to zero unless
> write zeroes is an efficient operation, but I don't think that the
> existence of bs->drv->bdrv_co_[p]write_zeroes is the right way to find
> that out.
>
> I also think that we need to push harder on the NBD list that under
> the new block limits proposal, we WANT to be able to advertise when
> the new NBD_CMD_WRITE_ZEROES command will accept a larger size than
> NBD_CMD_WRITE (as currently written, the BLOCK_INFO extension proposal
> states that if a server advertises a max transaction size to the
> client, then the client must honor that size for all commands
> including NBD_CMD_WRITE_ZEROES, which would mean your 1.5G request [or
> my proposed 2G - 4k request] is invalid and would have to be a bunch
> of 32M requests).
> https://sourceforge.net/p/nbd/mailman/message/35081223/
>
I see...
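Coming back to the hunk quoted above: in diff form the suggestion would
roughly amount to something like this (an untested sketch, and it
inherits the qcow2-with-backing-file limitation of
bdrv_can_write_zeroes_with_unmap() noted earlier):

-    if (base == NULL && !bdrv_has_zero_init(target_bs) &&
-        target_bs->drv->bdrv_co_write_zeroes == NULL) {
+    if (base == NULL && !bdrv_has_zero_init(target_bs) &&
+        !bdrv_can_write_zeroes_with_unmap(target_bs)) {

That would keep mirror from peeking at target_bs->drv while still
skipping the pre-zeroing pass whenever writing the zeroes out explicitly
would only inflate the amount of data written.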