qemu-devel.nongnu.org archive mirror
From: Peter Lieven <pl@kamp.de>
To: Kevin Wolf <kwolf@redhat.com>
Cc: pbonzini@redhat.com, qemu-devel@nongnu.org, stefanha@redhat.com,
	mreitz@redhat.com
Subject: Re: [Qemu-devel] [RFC PATCH] block: optimize zero writes with bdrv_write_zeroes
Date: Mon, 24 Feb 2014 11:26:05 +0100	[thread overview]
Message-ID: <530B1E3D.8050204@kamp.de> (raw)
In-Reply-To: <20140224101152.GE3775@dhcp-200-207.str.redhat.com>

On 24.02.2014 11:11, Kevin Wolf wrote:
> Am 22.02.2014 um 14:00 hat Peter Lieven geschrieben:
>> This patch tries to optimize zero write requests
>> by automatically using bdrv_write_zeroes if it is
>> supported by the format.
>>
>> I know that there is a lot of potential for discussion, but I would
>> like to know what the others think.
>>
>> This should significantly speed up file system initialization and
>> should also speed up zero-write tests used to measure backend storage
>> performance.
>>
>> the difference can simply be tested by e.g.
>>
>> dd if=/dev/zero of=/dev/vdX bs=1M
>>
>> Signed-off-by: Peter Lieven <pl@kamp.de>
> As you probably have expected, there's no way I can let the patch in in
> this form. The least you need to introduce is a boolean option to enable
> or disable the zero check. (The default would probably be disabled, but
> we can discuss this.)
I would have been really surprised *g*. As you and Fam already pointed
out, the desired behaviour is heavily dependent on the use case.

I personally do not need this for QCOW2, but for iSCSI. Here the optimization
basically saves bandwidth, since a zero write becomes a WRITE SAME.
Unless the user specifies unmap=on, there is no change in what is written to disk.

I would be fine with a default-off boolean variable. For my case it would also be
sufficient to have a boolean flag in the BlockDriver that indicates whether this
optimization is a good idea. For iSCSI I think it is, and I think the same holds for
GlusterFS. In both cases it basically saves bandwidth and lets the backend storage
write zeroes more efficiently if it is capable. A third use case would be a raw
device on an SSD. In all cases, if unmap=on, it would additionally save disk space.
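For illustration, the whole optimization hinges on cheaply detecting that a write buffer contains only zeroes. The following is a minimal sketch of such a check, not QEMU's actual implementation (the patch adds a qemu_iovec_is_zero() helper in util/iov.c); the function name buffer_is_zero here is purely illustrative. It compares an unaligned prefix and tail byte-wise and the aligned middle in word-sized chunks:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative sketch only: returns true iff every byte in buf is zero.
 * Scans byte-wise until the pointer is word-aligned, then word-wise,
 * then byte-wise over the remaining tail. */
static bool buffer_is_zero(const void *buf, size_t len)
{
    const uint8_t *p = buf;
    size_t i = 0;

    /* Byte-wise until p + i is aligned to an unsigned long. */
    for (; i < len && (uintptr_t)(p + i) % sizeof(unsigned long) != 0; i++) {
        if (p[i] != 0) {
            return false;
        }
    }
    /* Word-wise over the aligned middle of the buffer. */
    for (; i + sizeof(unsigned long) <= len; i += sizeof(unsigned long)) {
        unsigned long v;
        memcpy(&v, p + i, sizeof(v));
        if (v != 0) {
            return false;
        }
    }
    /* Byte-wise tail. */
    for (; i < len; i++) {
        if (p[i] != 0) {
            return false;
        }
    }
    return true;
}
```

The check is O(n) in the request size either way; the word-sized inner loop just keeps the per-byte cost low enough that scanning a 1M write before submitting it is cheap compared to sending it over the wire.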

>
>>   block.c               |    8 ++++++++
>>   include/qemu-common.h |    1 +
>>   util/iov.c            |   20 ++++++++++++++++++++
>>   3 files changed, 29 insertions(+)
>>
>> diff --git a/block.c b/block.c
>> index 6f4baca..505888e 100644
>> --- a/block.c
>> +++ b/block.c
>> @@ -3145,6 +3145,14 @@ static int coroutine_fn bdrv_aligned_pwritev(BlockDriverState *bs,
>>   
>>       ret = notifier_with_return_list_notify(&bs->before_write_notifiers, req);
>>   
>> +    if (!ret && !(flags & BDRV_REQ_ZERO_WRITE) &&
>> +        drv->bdrv_co_write_zeroes && qemu_iovec_is_zero(qiov)) {
>> +        flags |= BDRV_REQ_ZERO_WRITE;
>> +        /* if the device was not opened with discard=on the below flag
>> +         * is immediately cleared again in bdrv_co_do_write_zeroes */
>> +        flags |= BDRV_REQ_MAY_UNMAP;
> I'm not sure about this one. I think it is reasonable to expect that
> after an explicit write of a buffer filled with zeros the block is
> allocated.
>
> In a simple qcow2-on-file case, we basically have three options for
> handling all-zero writes:
>
> - Allocate the cluster on a qcow2 and file level and write literal zeros
>    to it. No metadata updates involved in the next write to the cluster.
>
> - Set the qcow2 zero flag, but leave the allocation in place. The next
>    write in theory just needs to remove the zero flag (I think in
>    practice we're doing an unnecessary COW) from the L2 table and that's
>    it.
>
> - Set the qcow2 zero flag and unmap the cluster on both the qcow2 and
>    the filesystem layer. The next write causes new allocations in both
>    layers, which means multiple metadata updates and possibly added
>    fragmentation. The upside is that we use less disk space if there is
>    no next write to this cluster.
>
> I think it's pretty clear that the right behaviour depends on your use
> case and we can't find a one-size-fits-all solution.
I wouldn't mind having this optimization only work on the raw format for
the moment.

Peter
>
>> +    }
>> +
>>       if (ret < 0) {
>>           /* Do nothing, write notifier decided to fail this request */
>>       } else if (flags & BDRV_REQ_ZERO_WRITE) {
> Kevin

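The diffstat in the quoted patch shows 20 lines added to util/iov.c for the qemu_iovec_is_zero() helper that bdrv_aligned_pwritev calls. The hunk for that helper is not quoted above; as a rough sketch under stated assumptions, a zero check over a scattered I/O vector could look like the following (using plain POSIX struct iovec rather than QEMU's QEMUIOVector, and a simple byte-wise scan rather than an optimized one):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/* Illustrative sketch, not QEMU's qemu_iovec_is_zero(): returns true
 * iff every byte of every segment in the iovec array is zero. A write
 * request can only be converted to a write_zeroes operation when this
 * holds for the whole vector, not just some segments. */
static bool iov_is_zero(const struct iovec *iov, int niov)
{
    for (int i = 0; i < niov; i++) {
        const uint8_t *p = iov[i].iov_base;
        for (size_t j = 0; j < iov[i].iov_len; j++) {
            if (p[j] != 0) {
                return false;
            }
        }
    }
    return true;
}
```

With such a helper in place, the write path in the patch only needs the quoted block.c hunk: if the check passes and the driver implements bdrv_co_write_zeroes, the BDRV_REQ_ZERO_WRITE flag is set and the data payload never has to travel to the backend.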
Thread overview: 18+ messages
2014-02-22 13:00 [Qemu-devel] [RFC PATCH] block: optimize zero writes with bdrv_write_zeroes Peter Lieven
2014-02-22 16:45 ` Fam Zheng
2014-02-23 19:10   ` Peter Lieven
2014-02-24  1:01     ` Fam Zheng
2014-02-24 10:39       ` Paolo Bonzini
2014-02-24 11:33         ` Fam Zheng
2014-02-24 11:51           ` Paolo Bonzini
2014-02-24 12:04             ` Fam Zheng
2014-02-24 12:07             ` Kevin Wolf
2014-02-24 12:10               ` Paolo Bonzini
2014-02-24 12:22                 ` Kevin Wolf
2014-02-24 10:11 ` Kevin Wolf
2014-02-24 10:26   ` Peter Lieven [this message]
2014-02-24 10:38     ` Paolo Bonzini
2014-02-24 11:50       ` Peter Lieven
2014-02-24 13:01       ` Peter Lieven
2014-02-25 13:41         ` Kevin Wolf
2014-02-25 17:03           ` Peter Lieven