From: Peter Lieven <pl@kamp.de>
To: Kevin Wolf <kwolf@redhat.com>
Cc: pbonzini@redhat.com, qemu-devel@nongnu.org, stefanha@redhat.com,
	mreitz@redhat.com
Subject: Re: [Qemu-devel] [RFC PATCH] block: optimize zero writes with bdrv_write_zeroes
Date: Mon, 24 Feb 2014 11:26:05 +0100
Message-ID: <530B1E3D.8050204@kamp.de>
In-Reply-To: <20140224101152.GE3775@dhcp-200-207.str.redhat.com>

On 24.02.2014 11:11, Kevin Wolf wrote:
> On 22.02.2014 at 14:00, Peter Lieven wrote:
>> this patch tries to optimize zero write requests
>> by automatically using bdrv_write_zeroes if it is
>> supported by the format.
>>
>> i know that there is a lot of potential for discussion, but i would
>> like to know what the others think.
>>
>> this should significantly speed up file system initialization and
>> should speed up zero-write tests used to test backend storage performance.
>>
>> the difference can simply be tested by e.g.
>>
>> dd if=/dev/zero of=/dev/vdX bs=1M
>>
>> Signed-off-by: Peter Lieven <pl@kamp.de>
> As you probably have expected, there's no way I can let the patch in in
> this form. The least you need to introduce is a boolean option to enable
> or disable the zero check. (The default would probably be disabled, but
> we can discuss this.)
I would have been really surprised *g*. As you and Fam already pointed
out, the desired behaviour is heavily dependent on the use case.

I personally do not need this for QCOW2 but for iSCSI. Here the optimization
basically saves bandwidth, since a zero write becomes a WRITESAME.
Unless the user specifies unmap=on there is no change in what is written to disk.
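
To make the zero check concrete, here is a minimal sketch of how an all-zero
test over a scatter/gather list could look. This is not the
qemu_iovec_is_zero() from the util/iov.c hunk of the patch (that hunk is not
quoted in this mail); buf_is_zero() and iov_is_zero() are hypothetical names,
and the sketch assumes a plain POSIX struct iovec rather than a QEMUIOVector.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/uio.h>

/* Hypothetical helper: true if every byte of buf[0..len) is zero.
 * An empty buffer counts as all-zero. */
static bool buf_is_zero(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (buf[i] != 0) {
            return false;
        }
    }
    return true;
}

/* Hypothetical helper: true if all niov segments contain only zero bytes.
 * Stops at the first segment that holds a non-zero byte. */
static bool iov_is_zero(const struct iovec *iov, int niov)
{
    assert(niov == 0 || iov != NULL);
    for (int i = 0; i < niov; i++) {
        if (!buf_is_zero(iov[i].iov_base, iov[i].iov_len)) {
            return false;
        }
    }
    return true;
}
```

The byte-wise loop is only for clarity; a real implementation would probably
compare in larger words, since this check would run on every write request.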

I would be fine with a boolean option that defaults to off. For my case it would also be
sufficient to have a boolean flag in the BlockDriver that indicates whether this optimization
is a good idea. For iSCSI I think it is, and I think also for GlusterFS. In both
cases it basically saves bandwidth and lets the backend storage write zeroes more
efficiently if it is capable. A third use case would be a raw device on an SSD.
In all cases, unmap=on would additionally save disk space.
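
As a rough illustration of the two switches discussed here, the decision
could be sketched as below. Everything in it is an assumption for the sake of
the example: struct zero_write_policy, its field names and use_write_zeroes()
do not exist in QEMU, and combining the user option and the driver hint with
a logical OR is just one possible choice.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-device settings; none of these fields exist in QEMU. */
struct zero_write_policy {
    bool user_enabled;            /* user-visible option, default off */
    bool driver_prefers;          /* driver hint, e.g. set by iSCSI */
    bool driver_has_write_zeroes; /* driver implements a write_zeroes call */
};

/* Decide whether an ordinary write should be turned into a write_zeroes
 * request. The caller passes in whether the payload was found to be
 * all-zero. */
static bool use_write_zeroes(const struct zero_write_policy *p,
                             bool request_is_zero)
{
    assert(p != NULL);
    return request_is_zero &&
           p->driver_has_write_zeroes &&
           (p->user_enabled || p->driver_prefers);
}
```

With this shape, a driver without write_zeroes support never takes the fast
path, and a format such as qcow2 would keep its current behaviour unless the
user opts in.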

>
>>   block.c               |    8 ++++++++
>>   include/qemu-common.h |    1 +
>>   util/iov.c            |   20 ++++++++++++++++++++
>>   3 files changed, 29 insertions(+)
>>
>> diff --git a/block.c b/block.c
>> index 6f4baca..505888e 100644
>> --- a/block.c
>> +++ b/block.c
>> @@ -3145,6 +3145,14 @@ static int coroutine_fn bdrv_aligned_pwritev(BlockDriverState *bs,
>>   
>>       ret = notifier_with_return_list_notify(&bs->before_write_notifiers, req);
>>   
>> +    if (!ret && !(flags & BDRV_REQ_ZERO_WRITE) &&
>> +        drv->bdrv_co_write_zeroes && qemu_iovec_is_zero(qiov)) {
>> +        flags |= BDRV_REQ_ZERO_WRITE;
>> +        /* if the device was not opened with discard=on the below flag
>> +         * is immediately cleared again in bdrv_co_do_write_zeroes */
>> +        flags |= BDRV_REQ_MAY_UNMAP;
> I'm not sure about this one. I think it is reasonable to expect that
> after an explicit write of a buffer filled with zeros the block is
> allocated.
>
> In a simple qcow2-on-file case, we basically have three options for
> handling all-zero writes:
>
> - Allocate the cluster on a qcow2 and file level and write literal zeros
>    to it. No metadata updates involved in the next write to the cluster.
>
> - Set the qcow2 zero flag, but leave the allocation in place. The next
>    write in theory just needs to remove the zero flag (I think in
>    practice we're doing an unnecessary COW) from the L2 table and that's
>    it.
>
> - Set the qcow2 zero flag and unmap the cluster on both the qcow2 and
>    the filesystem layer. The next write causes new allocations in both
>    layers, which means multiple metadata updates and possibly added
>    fragmentation. The upside is that we use less disk space if there is
>    no next write to this cluster.
>
> I think it's pretty clear that the right behaviour depends on your use
> case and we can't find a one-size-fits-all solution.
I wouldn't mind having this optimization work only on the raw format for
the moment.
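
The three options listed above could be expressed as a small policy enum that
maps to request flags. This is only a sketch under stated assumptions: the
enum, the REQ_* constants and flags_for_zero_write() are hypothetical
stand-ins, and the constant values are not the real values of
BDRV_REQ_ZERO_WRITE and BDRV_REQ_MAY_UNMAP.

```c
#include <assert.h>

/* Hypothetical stand-ins for BDRV_REQ_ZERO_WRITE / BDRV_REQ_MAY_UNMAP;
 * the numeric values are chosen for this example only. */
#define REQ_ZERO_WRITE 0x1u
#define REQ_MAY_UNMAP  0x2u

/* One entry per option for handling an all-zero write. */
enum zero_write_mode {
    ZW_WRITE_LITERAL,   /* allocate and write literal zeros */
    ZW_FLAG_KEEP_ALLOC, /* set the zero flag, keep the allocation */
    ZW_FLAG_AND_UNMAP,  /* set the zero flag and unmap the cluster */
};

/* Return the request flags to use for an all-zero write under the given
 * mode. Existing bits in flags are preserved. */
static unsigned flags_for_zero_write(enum zero_write_mode mode, unsigned flags)
{
    switch (mode) {
    case ZW_WRITE_LITERAL:
        break;
    case ZW_FLAG_KEEP_ALLOC:
        flags |= REQ_ZERO_WRITE;
        break;
    case ZW_FLAG_AND_UNMAP:
        flags |= REQ_ZERO_WRITE | REQ_MAY_UNMAP;
        break;
    default:
        assert(0 && "unknown zero_write_mode");
        break;
    }
    return flags;
}
```

In these terms the RFC patch always picks the third mode and relies on the
unmap flag being cleared again when the device was not opened with
discard=on.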

Peter
>
>> +    }
>> +
>>       if (ret < 0) {
>>           /* Do nothing, write notifier decided to fail this request */
>>       } else if (flags & BDRV_REQ_ZERO_WRITE) {
> Kevin

