From: "Denis V. Lunev"
Date: Wed, 15 Jun 2016 16:18:08 +0300
Subject: Re: [Qemu-devel] [PATCH 4/9] mirror: efficiently zero out target
To: Eric Blake, qemu-devel@nongnu.org, qemu-block@nongnu.org
Cc: vsementsov@virtuozzo.com, Stefan Hajnoczi, Fam Zheng, Kevin Wolf, Max Reitz, Jeff Cody
Message-ID: <57615590.2010801@openvz.org>
In-Reply-To: <57614B5A.4030902@redhat.com>
References: <1465917916-22348-1-git-send-email-den@openvz.org> <1465917916-22348-5-git-send-email-den@openvz.org> <5760C4CB.6080703@redhat.com> <576115F9.8020109@openvz.org> <57614B5A.4030902@redhat.com>

On 06/15/2016 03:34 PM, Eric Blake wrote:
> On 06/15/2016 02:46 AM, Denis V. Lunev wrote:
>> On 06/15/2016 06:00 AM, Eric Blake wrote:
>>> On 06/14/2016 09:25 AM, Denis V. Lunev wrote:
>>>> With a bdrv_co_write_zeroes method on a target BDS, zeroes will not
>>>> be placed on the wire, so the target can be zeroed out very
>>>> efficiently. This should be done with the largest chunk possible.
>>>>
>>> Probably nicer to track this in bytes. And do you really want a
>>> hard-coded arbitrary limit, or is it better to live with
>>> MIN_NON_ZERO(target_bs->bl.max_pwrite_zeroes, INT_MAX)?
>> Unfortunately we need the hard-coded limit: INT_MAX is not aligned
>> as required. Maybe we should align INT_MAX down to fulfill the
>> write_zeroes alignment instead.
>>
>> Hmm, maybe we can indeed align INT_MAX down properly. OK,
>> I'll try to do that gracefully.
> It's fairly easy to round a max_transfer or max_pwrite_zeroes down to
> an aligned value; we already have code in io.c that does that in
> bdrv_co_do_pwrite_zeroes().

ok
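Something like the following standalone sketch is what I have in mind
(illustration only, not the actual io.c code; min_non_zero() and
align_down() below just mimic QEMU's MIN_NON_ZERO() and
QEMU_ALIGN_DOWN(), and the 64k alignment is an arbitrary example):

#include <assert.h>
#include <inttypes.h>
#include <limits.h>
#include <stdint.h>
#include <stdio.h>

/* Mimics MIN_NON_ZERO(): a value of 0 means "no limit reported". */
static int64_t min_non_zero(int64_t a, int64_t b)
{
    if (a == 0) {
        return b;
    }
    if (b == 0) {
        return a;
    }
    return a < b ? a : b;
}

/* Mimics QEMU_ALIGN_DOWN() for positive alignments. */
static int64_t align_down(int64_t x, int64_t alignment)
{
    return (x / alignment) * alignment;
}

int main(void)
{
    int64_t max_pwrite_zeroes = 0;              /* driver gave no limit */
    int64_t write_zeroes_alignment = 64 * 1024; /* e.g. 64k cluster size */

    /* Cap at INT_MAX when the driver reports no limit ... */
    int64_t limit = min_non_zero(max_pwrite_zeroes, INT_MAX);
    /* ... then round down so a maximal zeroing request stays aligned. */
    limit = align_down(limit, write_zeroes_alignment);

    assert(limit > 0 && limit % write_zeroes_alignment == 0);
    printf("largest aligned zero request: %" PRId64 " bytes\n", limit);
    return 0;
}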
>>>> @@ -512,7 +513,8 @@ static int mirror_dirty_init(MirrorBlockJob *s)
>>>>       end = s->bdev_length / BDRV_SECTOR_SIZE;
>>>> -    if (base == NULL && !bdrv_has_zero_init(target_bs)) {
>>>> +    if (base == NULL && !bdrv_has_zero_init(target_bs) &&
>>>> +        target_bs->drv->bdrv_co_write_zeroes == NULL) {
>>> Indentation is off, although if checkpatch.pl doesn't complain I
>>> guess it doesn't matter that much.
>>>
>>> Why should you care whether the target_bs->drv implements a callback?
>>> Can't you just rely on the normal bdrv_*() functions to do the dirty
>>> work of picking the most efficient implementation without you having
>>> to bypass the block layer? In fact, isn't that the whole goal of
>>> bdrv_make_zero() - why not call that instead of reimplementing it?
>> That is the idea of the patch, actually. If the callback is not
>> implemented, the zeroes will actually be written or sent over the
>> wire. In that case there is not much sense in doing this: the amount
>> of data actually written would increase significantly (some areas
>> would be written twice - first with zeroes and then with the actual
>> data).
>>
> But that's where bdrv_can_write_zeroes_with_unmap() comes in handy -
> you can use the public interface to learn whether bdrv_make_zero()
> will be efficient or not, without having to probe what the backend
> supports.

bool bdrv_can_write_zeroes_with_unmap(BlockDriverState *bs)
{
    BlockDriverInfo bdi;

    if (bs->backing || !(bs->open_flags & BDRV_O_UNMAP)) {
        return false;
    }

    if (bdrv_get_info(bs, &bdi) == 0) {
        return bdi.can_write_zeroes_with_unmap;
    }

    return false;
}

This function looks rotten. We CAN efficiently zero out QCOW2 images
even when a backing store is present. At the same time, the availability
of bdrv_co_write_zeroes does not guarantee that it actually works (NFS,
CIFS, etc. for raw-posix.c).

>> On the other hand, if the callback is implemented, only a very small
>> amount of data goes over the wire and is actually written, so we get
>> a real benefit. I am trying to avoid very small chunks of data. Here
>> (during the migration process) the data is sent in 10 MB chunks,
>> which takes a LOT of time with NBD. We can currently send 1.5 GB
>> chunks; they occupy the same 26 bytes on the transport layer.
> I agree that we don't want to pre-initialize the device to zero unless
> write zeroes is an efficient operation, but I don't think that the
> existence of bs->drv->bdrv_co_[p]write_zeroes is the right way to find
> that out.
>
> I also think that we need to push harder on the NBD list that under
> the new block limits proposal, we WANT to be able to advertise when
> the new NBD_CMD_WRITE_ZEROES command will accept a larger size than
> NBD_CMD_WRITE (as currently written, the BLOCK_INFO extension proposal
> states that if a server advertises a max transaction size to the
> client, then the client must honor that size for all commands
> including NBD_CMD_WRITE_ZEROES, which would mean your 1.5G request [or
> my proposed 2G - 4k request] is invalid and would have to be a bunch
> of 32M requests).
> https://sourceforge.net/p/nbd/mailman/message/35081223/
>
I see...
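Coming back to the hunk quoted above: in diff form the suggestion would
roughly amount to something like this (an untested sketch, and it
inherits the qcow2-with-backing-file limitation of
bdrv_can_write_zeroes_with_unmap() noted earlier):

-    if (base == NULL && !bdrv_has_zero_init(target_bs) &&
-        target_bs->drv->bdrv_co_write_zeroes == NULL) {
+    if (base == NULL && !bdrv_has_zero_init(target_bs) &&
+        !bdrv_can_write_zeroes_with_unmap(target_bs)) {

That would keep mirror from peeking at target_bs->drv while still
skipping the pre-zeroing pass whenever writing the zeroes out explicitly
would only inflate the amount of data written.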