From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:60306) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1WHsjn-0000bn-JT for qemu-devel@nongnu.org; Mon, 24 Feb 2014 05:26:21 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1WHsjh-0002p1-Uj for qemu-devel@nongnu.org; Mon, 24 Feb 2014 05:26:15 -0500 Received: from mx.ipv6.kamp.de ([2a02:248:0:51::16]:52744 helo=mx01.kamp.de) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1WHsjh-0002ot-K4 for qemu-devel@nongnu.org; Mon, 24 Feb 2014 05:26:09 -0500 Message-ID: <530B1E3D.8050204@kamp.de> Date: Mon, 24 Feb 2014 11:26:05 +0100 From: Peter Lieven MIME-Version: 1.0 References: <1393074022-32388-1-git-send-email-pl@kamp.de> <20140224101152.GE3775@dhcp-200-207.str.redhat.com> In-Reply-To: <20140224101152.GE3775@dhcp-200-207.str.redhat.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Subject: Re: [Qemu-devel] [RFC PATCH] block: optimize zero writes with bdrv_write_zeroes List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Kevin Wolf Cc: pbonzini@redhat.com, qemu-devel@nongnu.org, stefanha@redhat.com, mreitz@redhat.com On 24.02.2014 11:11, Kevin Wolf wrote: > Am 22.02.2014 um 14:00 hat Peter Lieven geschrieben: >> this patch tries to optimize zero write requests >> by automatically using bdrv_write_zeroes if it is >> supported by the format. >> >> i know that there is a lot of potential for discussion, but i would >> like to know what the others think. >> >> this should significantly speed up file system initialization and >> should speed zero write test used to test backend storage performance. >> >> the difference can simply be tested by e.g. >> >> dd if=/dev/zero of=/dev/vdX bs=1M >> >> Signed-off-by: Peter Lieven > As you probably have expected, there's no way I can let the patch in in > this form. The least you need to introduce is a boolean option to enable > or disable the zero check. (The default would probably be disabled, but > we can discuss this.) I would have been really suprised *g*. As you and Fam already pointed out, the desired behaviour is heavily dependant on the use case. I personally do not need this for QCOW2 but for iSCSI. Here the optimization is basically saved bandwidth since a zero write becomes a WRITESAME. Unless the user specifies unmap=on there is no change in what is written to disk. I would be fine with a default off boolean variable. For my case it would also be sufficient to have boolean flag in the BlockDriver that indicates if this optimization is a good idea. For iSCSI I think it is. I think also for GlusterFS. In those both cases I basically saves bandwidth and let the backend storage more efficiently write zeroes if it is capable. A third use case would be a raw device on an SSD. In all cases if unmap=on it would additionally save disk space. > >> block.c | 8 ++++++++ >> include/qemu-common.h | 1 + >> util/iov.c | 20 ++++++++++++++++++++ >> 3 files changed, 29 insertions(+) >> >> diff --git a/block.c b/block.c >> index 6f4baca..505888e 100644 >> --- a/block.c >> +++ b/block.c >> @@ -3145,6 +3145,14 @@ static int coroutine_fn bdrv_aligned_pwritev(BlockDriverState *bs, >> >> ret = notifier_with_return_list_notify(&bs->before_write_notifiers, req); >> >> + if (!ret && !(flags & BDRV_REQ_ZERO_WRITE) && >> + drv->bdrv_co_write_zeroes && qemu_iovec_is_zero(qiov)) { >> + flags |= BDRV_REQ_ZERO_WRITE; >> + /* if the device was not opened with discard=on the below flag >> + * is immediately cleared again in bdrv_co_do_write_zeroes */ >> + flags |= BDRV_REQ_MAY_UNMAP; > I'm not sure about this one. I think it is reasonable to expect that > after an explicit write of a buffer filled with zeros the block is > allocated. > > In a simple qcow2-on-file case, we basically have three options for > handling all-zero writes: > > - Allocate the cluster on a qcow2 and file level and write literal zeros > to it. No metadata updates involved in the next write to the cluster. > > - Set the qcow2 zero flag, but leave the allocation in place. The next > write in theory just needs to remove the zero flag (I think in > practice we're doing an unnecessary COW) from the L2 table and that's > it. > > - Set the qcow2 zero flag and unmap the cluster on both the qcow2 and > the filesystem layer. The next write causes new allocations in both > layers, which means multiple metadata updates and possibly added > fragmentation. The upside is that we use less disk space if there is > no next write to this cluster. > > I think it's pretty clear that the right behaviour depends on your use > case and we can't find an one-size-fits-all solution. I wouldn't mind have this optimization only work on raw format for the moment. Peter > >> + } >> + >> if (ret < 0) { >> /* Do nothing, write notifier decided to fail this request */ >> } else if (flags & BDRV_REQ_ZERO_WRITE) { > Kevin