From: Peter Lieven <pl@kamp.de>
To: Kevin Wolf <kwolf@redhat.com>
Cc: pbonzini@redhat.com, qemu-devel@nongnu.org, stefanha@redhat.com,
mreitz@redhat.com
Subject: Re: [Qemu-devel] [RFC PATCH] block: optimize zero writes with bdrv_write_zeroes
Date: Mon, 24 Feb 2014 11:26:05 +0100
Message-ID: <530B1E3D.8050204@kamp.de>
In-Reply-To: <20140224101152.GE3775@dhcp-200-207.str.redhat.com>
On 24.02.2014 11:11, Kevin Wolf wrote:
> Am 22.02.2014 um 14:00 hat Peter Lieven geschrieben:
>> This patch tries to optimize zero write requests
>> by automatically using bdrv_write_zeroes if it is
>> supported by the format.
>>
>> I know that there is a lot of potential for discussion, but I would
>> like to know what the others think.
>>
>> This should significantly speed up file system initialization and
>> should speed up the zero write tests used to measure backend storage
>> performance.
>>
>> the difference can simply be tested by e.g.
>>
>> dd if=/dev/zero of=/dev/vdX bs=1M
>>
>> Signed-off-by: Peter Lieven <pl@kamp.de>
> As you have probably expected, there's no way I can merge the patch in
> this form. The least you need to introduce is a boolean option to enable
> or disable the zero check. (The default would probably be disabled, but
> we can discuss this.)
I would have been really surprised *g*. As you and Fam already pointed
out, the desired behaviour is heavily dependent on the use case.
I personally do not need this for QCOW2 but for iSCSI. Here the optimization
basically saves bandwidth, since a zero write becomes a WRITE SAME.
Unless the user specifies unmap=on, there is no change in what is written
to disk.
I would be fine with a boolean option that defaults to off. For my case it
would also be sufficient to have a boolean flag in the BlockDriver that
indicates whether this optimization is a good idea. For iSCSI I think it is,
and I think the same holds for GlusterFS. In both cases it basically saves
bandwidth and lets the backend storage write zeroes more efficiently if it
is capable. A third use case would be a raw device on an SSD.
In all cases, if unmap=on, it would additionally save disk space.
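For reference, the qemu_iovec_is_zero() helper the patch adds to util/iov.c
just scans every element of the vector. A minimal standalone sketch of the
same idea (using the POSIX struct iovec rather than QEMU's QEMUIOVector, so
the names here are illustrative and not taken from the patch itself):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/* Return true if every byte covered by the vector is zero. */
static bool iov_is_zero(const struct iovec *iov, int niov)
{
    for (int i = 0; i < niov; i++) {
        const uint8_t *p = iov[i].iov_base;
        for (size_t off = 0; off < iov[i].iov_len; off++) {
            if (p[off]) {
                return false;
            }
        }
    }
    return true;
}

A real implementation would of course want a word-sized (or vectorized)
inner loop instead of the byte-by-byte check above; this only shows the
contract the write path relies on.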
>
>> block.c | 8 ++++++++
>> include/qemu-common.h | 1 +
>> util/iov.c | 20 ++++++++++++++++++++
>> 3 files changed, 29 insertions(+)
>>
>> diff --git a/block.c b/block.c
>> index 6f4baca..505888e 100644
>> --- a/block.c
>> +++ b/block.c
>> @@ -3145,6 +3145,14 @@ static int coroutine_fn bdrv_aligned_pwritev(BlockDriverState *bs,
>>
>> ret = notifier_with_return_list_notify(&bs->before_write_notifiers, req);
>>
>> + if (!ret && !(flags & BDRV_REQ_ZERO_WRITE) &&
>> + drv->bdrv_co_write_zeroes && qemu_iovec_is_zero(qiov)) {
>> + flags |= BDRV_REQ_ZERO_WRITE;
>> + /* if the device was not opened with discard=on the below flag
>> + * is immediately cleared again in bdrv_co_do_write_zeroes */
>> + flags |= BDRV_REQ_MAY_UNMAP;
> I'm not sure about this one. I think it is reasonable to expect that
> after an explicit write of a buffer filled with zeros the block is
> allocated.
>
> In a simple qcow2-on-file case, we basically have three options for
> handling all-zero writes:
>
> - Allocate the cluster on a qcow2 and file level and write literal zeros
> to it. No metadata updates involved in the next write to the cluster.
>
> - Set the qcow2 zero flag, but leave the allocation in place. The next
> write in theory just needs to remove the zero flag from the L2 table
> (though I think in practice we're doing an unnecessary COW) and that's
> it.
>
> - Set the qcow2 zero flag and unmap the cluster on both the qcow2 and
> the filesystem layer. The next write causes new allocations in both
> layers, which means multiple metadata updates and possibly added
> fragmentation. The upside is that we use less disk space if there is
> no next write to this cluster.
>
> I think it's pretty clear that the right behaviour depends on your use
> case and we can't find a one-size-fits-all solution.
I wouldn't mind having this optimization only work on the raw format for
the moment.
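
To make the opt-in concrete, the hunk above could be gated on a per-device
setting along these lines (only a sketch; the detect_zeroes field and its
enum are hypothetical names, and the default would be off):

/* Hypothetical tri-state setting, configurable per device: */
typedef enum {
    BDRV_DETECT_ZEROES_OFF,   /* never rewrite zero writes (default) */
    BDRV_DETECT_ZEROES_ON,    /* use bdrv_write_zeroes, keep allocation */
    BDRV_DETECT_ZEROES_UNMAP, /* additionally allow unmapping */
} BdrvDetectZeroes;

if (!ret && bs->detect_zeroes != BDRV_DETECT_ZEROES_OFF &&
    !(flags & BDRV_REQ_ZERO_WRITE) &&
    drv->bdrv_co_write_zeroes && qemu_iovec_is_zero(qiov)) {
    flags |= BDRV_REQ_ZERO_WRITE;
    if (bs->detect_zeroes == BDRV_DETECT_ZEROES_UNMAP) {
        /* Only unmap when the user explicitly opted in; this also
         * addresses the allocation-semantics concern above. */
        flags |= BDRV_REQ_MAY_UNMAP;
    }
}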
Peter
>
>> + }
>> +
>> if (ret < 0) {
>> /* Do nothing, write notifier decided to fail this request */
>> } else if (flags & BDRV_REQ_ZERO_WRITE) {
> Kevin
Thread overview: 18+ messages
2014-02-22 13:00 [Qemu-devel] [RFC PATCH] block: optimize zero writes with bdrv_write_zeroes Peter Lieven
2014-02-22 16:45 ` Fam Zheng
2014-02-23 19:10 ` Peter Lieven
2014-02-24 1:01 ` Fam Zheng
2014-02-24 10:39 ` Paolo Bonzini
2014-02-24 11:33 ` Fam Zheng
2014-02-24 11:51 ` Paolo Bonzini
2014-02-24 12:04 ` Fam Zheng
2014-02-24 12:07 ` Kevin Wolf
2014-02-24 12:10 ` Paolo Bonzini
2014-02-24 12:22 ` Kevin Wolf
2014-02-24 10:11 ` Kevin Wolf
2014-02-24 10:26 ` Peter Lieven [this message]
2014-02-24 10:38 ` Paolo Bonzini
2014-02-24 11:50 ` Peter Lieven
2014-02-24 13:01 ` Peter Lieven
2014-02-25 13:41 ` Kevin Wolf
2014-02-25 17:03 ` Peter Lieven