From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:60306)
	by lists.gnu.org with esmtp (Exim 4.71) (envelope-from <pl@kamp.de>)
	id 1WHsjn-0000bn-JT
	for qemu-devel@nongnu.org; Mon, 24 Feb 2014 05:26:21 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <pl@kamp.de>) id 1WHsjh-0002p1-Uj
	for qemu-devel@nongnu.org; Mon, 24 Feb 2014 05:26:15 -0500
Received: from mx.ipv6.kamp.de ([2a02:248:0:51::16]:52744 helo=mx01.kamp.de)
	by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from <pl@kamp.de>)
	id 1WHsjh-0002ot-K4
	for qemu-devel@nongnu.org; Mon, 24 Feb 2014 05:26:09 -0500
Message-ID: <530B1E3D.8050204@kamp.de>
Date: Mon, 24 Feb 2014 11:26:05 +0100
From: Peter Lieven <pl@kamp.de>
MIME-Version: 1.0
References: <1393074022-32388-1-git-send-email-pl@kamp.de>
	<20140224101152.GE3775@dhcp-200-207.str.redhat.com>
In-Reply-To: <20140224101152.GE3775@dhcp-200-207.str.redhat.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] [RFC PATCH] block: optimize zero writes with
	bdrv_write_zeroes
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Kevin Wolf <kwolf@redhat.com>
Cc: pbonzini@redhat.com, qemu-devel@nongnu.org, stefanha@redhat.com, mreitz@redhat.com

On 24.02.2014 11:11, Kevin Wolf wrote:
> Am 22.02.2014 um 14:00 hat Peter Lieven geschrieben:
>> this patch tries to optimize zero write requests
>> by automatically using bdrv_write_zeroes if it is
>> supported by the format.
>>
>> i know that there is a lot of potential for discussion, but i would
>> like to know what the others think.
>>
>> this should significantly speed up file system initialization and
>> should speed zero write test used to test backend storage performance.
>>
>> the difference can simply be tested by e.g.
>>
>> dd if=/dev/zero of=/dev/vdX bs=1M
>>
>> Signed-off-by: Peter Lieven <pl@kamp.de>
> As you probably have expected, there's no way I can let the patch in in
> this form. The least you need to introduce is a boolean option to enable
> or disable the zero check. (The default would probably be disabled, but
> we can discuss this.)
I would have been really suprised *g*. As you and Fam already pointed
out, the desired behaviour is heavily dependant on the use case.

I personally do not need this for QCOW2 but for iSCSI. Here the optimization
is basically saved bandwidth since a zero write becomes a WRITESAME.
Unless the user specifies unmap=on there is no change in what is written to disk.

I would be fine with a default off boolean variable. For my case it would also be
sufficient to have boolean flag in the BlockDriver that indicates if this optimization
is a good idea. For iSCSI I think it is. I think also for GlusterFS. In those both
cases I basically saves bandwidth and let the backend storage more efficiently
write zeroes if it is capable. A third use case would be a raw device on an SSD.
In all cases if unmap=on it would additionally save disk space.

>
>>   block.c               |    8 ++++++++
>>   include/qemu-common.h |    1 +
>>   util/iov.c            |   20 ++++++++++++++++++++
>>   3 files changed, 29 insertions(+)
>>
>> diff --git a/block.c b/block.c
>> index 6f4baca..505888e 100644
>> --- a/block.c
>> +++ b/block.c
>> @@ -3145,6 +3145,14 @@ static int coroutine_fn bdrv_aligned_pwritev(BlockDriverState *bs,
>>   
>>       ret = notifier_with_return_list_notify(&bs->before_write_notifiers, req);
>>   
>> +    if (!ret && !(flags & BDRV_REQ_ZERO_WRITE) &&
>> +        drv->bdrv_co_write_zeroes && qemu_iovec_is_zero(qiov)) {
>> +        flags |= BDRV_REQ_ZERO_WRITE;
>> +        /* if the device was not opened with discard=on the below flag
>> +         * is immediately cleared again in bdrv_co_do_write_zeroes */
>> +        flags |= BDRV_REQ_MAY_UNMAP;
> I'm not sure about this one. I think it is reasonable to expect that
> after an explicit write of a buffer filled with zeros the block is
> allocated.
>
> In a simple qcow2-on-file case, we basically have three options for
> handling all-zero writes:
>
> - Allocate the cluster on a qcow2 and file level and write literal zeros
>    to it. No metadata updates involved in the next write to the cluster.
>
> - Set the qcow2 zero flag, but leave the allocation in place. The next
>    write in theory just needs to remove the zero flag (I think in
>    practice we're doing an unnecessary COW) from the L2 table and that's
>    it.
>
> - Set the qcow2 zero flag and unmap the cluster on both the qcow2 and
>    the filesystem layer. The next write causes new allocations in both
>    layers, which means multiple metadata updates and possibly added
>    fragmentation. The upside is that we use less disk space if there is
>    no next write to this cluster.
>
> I think it's pretty clear that the right behaviour depends on your use
> case and we can't find an one-size-fits-all solution.
I wouldn't mind have this optimization only work on raw format for
the moment.

Peter
>
>> +    }
>> +
>>       if (ret < 0) {
>>           /* Do nothing, write notifier decided to fail this request */
>>       } else if (flags & BDRV_REQ_ZERO_WRITE) {
> Kevin