All of lore.kernel.org
 help / color / mirror / Atom feed
From: Peter Lieven <pl@kamp.de>
To: Kevin Wolf <kwolf@redhat.com>
Cc: famz@redhat.com, qemu-devel@nongnu.org, mreitz@redhat.com,
	stefanha@redhat.com, pbonzini@redhat.com
Subject: Re: [Qemu-devel] [PATCHv4] block: optimize zero writes with bdrv_write_zeroes
Date: Thu, 15 May 2014 07:16:26 +0200	[thread overview]
Message-ID: <53744DAA.9030001@kamp.de> (raw)
In-Reply-To: <20140514114104.GH3610@noname.redhat.com>

Am 14.05.2014 13:41, schrieb Kevin Wolf:
> Am 08.05.2014 um 18:22 hat Peter Lieven geschrieben:
>> this patch tries to optimize zero write requests
>> by automatically using bdrv_write_zeroes if it is
>> supported by the format.
>>
>> This significantly speeds up file system initialization and
>> should speed zero write test used to test backend storage
>> performance.
>>
>> I ran the following 2 tests on my internal SSD with a
>> 50G QCOW2 container and on an attached iSCSI storage.
>>
>> a) mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 /dev/vdX
>>
>> QCOW2         [off]     [on]     [unmap]
>> -----
>> runtime:       14secs    1.1secs  1.1secs
>> filesize:      937M      18M      18M
>>
>> iSCSI         [off]     [on]     [unmap]
>> ----
>> runtime:       9.3s      0.9s     0.9s
>>
>> b) dd if=/dev/zero of=/dev/vdX bs=1M oflag=direct
>>
>> QCOW2         [off]     [on]     [unmap]
>> -----
>> runtime:       246secs   18secs   18secs
>> filesize:      51G       192K     192K
>> throughput:    203M/s    2.3G/s   2.3G/s
>>
>> iSCSI*        [off]     [on]     [unmap]
>> ----
>> runtime:       8mins     45secs   33secs
>> throughput:    106M/s    1.2G/s   1.6G/s
>> allocated:     100%      100%     0%
>>
>> * The storage was connected via an 1Gbit interface.
>>   It seems to internally handle writing zeroes
>>   via WRITESAME16 very fast.
>>
>> Signed-off-by: Peter Lieven <pl@kamp.de>
>> ---
>> v3->v4: - use QAPI generated enum and lookup table [Kevin]
>>         - added more details about the options in the comments
>>           of the qapi-schema [Eric]
>>         - changed the type of detect_zeroes from str to
>>           BlockdevDetectZeroesOptions. I left the name
>>           as is because it is consistent with e.g.
>>           BlockdevDiscardOptions or BlockdevAioOptions [Eric]
>>         - changed the parse function in blockdev_init to
>>           be generic usable for other enum parameters
>>         
>> v2->v3: - moved parameter parsing to blockdev_init
>>         - added per device detect_zeroes status to 
>>           hmp (info block -v) and qmp (query-block) [Eric]
>>         - added support to enable detect-zeroes also
>>           for hot added devices [Eric]
>>         - added missing entry to qemu_common_drive_opts
>>         - fixed description of qemu_iovec_is_zero [Fam]
>>
>> v1->v2: - added tests to commit message (Markus)
>> RFCv2->v1: - fixed paramter parsing strncmp -> strcmp (Eric)
>>            - fixed typo (choosen->chosen) (Eric)
>>            - added second example to commit msg
>>
>> RFCv1->RFCv2: - add detect-zeroes=off|on|unmap knob to drive cmdline parameter
>>               - call zero detection only for format (bs->file != NULL)
>>
>>  block.c                   |   11 ++++++++++
>>  block/qapi.c              |    1 +
>>  blockdev.c                |   34 +++++++++++++++++++++++++++++
>>  hmp.c                     |    5 +++++
>>  include/block/block_int.h |    1 +
>>  include/qemu-common.h     |    1 +
>>  qapi-schema.json          |   52 ++++++++++++++++++++++++++++++++-------------
>>  qemu-options.hx           |    6 ++++++
>>  qmp-commands.hx           |    3 +++
>>  util/iov.c                |   21 ++++++++++++++++++
>>  10 files changed, 120 insertions(+), 15 deletions(-)
>>
>> diff --git a/block.c b/block.c
>> index b749d31..aea4c77 100644
>> --- a/block.c
>> +++ b/block.c
>> @@ -3244,6 +3244,17 @@ static int coroutine_fn bdrv_aligned_pwritev(BlockDriverState *bs,
>>  
>>      ret = notifier_with_return_list_notify(&bs->before_write_notifiers, req);
>>  
>> +    if (!ret && bs->detect_zeroes != BLOCKDEV_DETECT_ZEROES_OPTIONS_OFF &&
>> +        !(flags & BDRV_REQ_ZERO_WRITE) && bs->file &&
>> +        drv->bdrv_co_write_zeroes && qemu_iovec_is_zero(qiov)) {
> Pretty long condition. :-)
>
> Looks like most is obviously needed, but I wonder what the bs->file part
> is good for? That looks rather arbitrary.

What I wanted to achieve is that this condition is only true if we handle
the format (e.g. raw, qcow2, vmdk etc.). If e.g. qcow2 then sends a
zero write this should always end on the disk and should not be optimizable.


>
>> +        flags |= BDRV_REQ_ZERO_WRITE;
>> +        /* if the device was not opened with discard=on the below flag
>> +         * is immediately cleared again in bdrv_co_do_write_zeroes */
> Is it? I only see it being cleared in bdrv_co_write_zeroes(), but that's
> not a function that seems to be called from here.

You are right. Question, do we want to support detect_zeroes = unmap
if discard = ignore? If yes, I have to update the docs. Otherwise
I have to check for BDRV_O_DISCARD before setting BDRV_REQ_MAY_UNMAP.

>
>> +        if (bs->detect_zeroes == BLOCKDEV_DETECT_ZEROES_OPTIONS_UNMAP) {
>> +            flags |= BDRV_REQ_MAY_UNMAP;
>> +        }
>> +    }
>> +
>>      if (ret < 0) {
>>          /* Do nothing, write notifier decided to fail this request */
>>      } else if (flags & BDRV_REQ_ZERO_WRITE) {
>> diff --git a/block/qapi.c b/block/qapi.c
>> index af11445..75f44f1 100644
>> --- a/block/qapi.c
>> +++ b/block/qapi.c
>> @@ -50,6 +50,7 @@ BlockDeviceInfo *bdrv_block_device_info(BlockDriverState *bs)
>>      }
>>  
>>      info->backing_file_depth = bdrv_get_backing_file_depth(bs);
>> +    info->detect_zeroes = bs->detect_zeroes;
>>  
>>      if (bs->io_limits_enabled) {
>>          ThrottleConfig cfg;
>> diff --git a/blockdev.c b/blockdev.c
>> index 7810e9f..955bd49 100644
>> --- a/blockdev.c
>> +++ b/blockdev.c
>> @@ -288,6 +288,23 @@ static int parse_block_error_action(const char *buf, bool is_read, Error **errp)
>>      }
>>  }
>>  
>> +static int parse_enum_option(const char *lookup[], const char *buf, int max,
>> +                             int def, Error **errp)
>> +{
>> +    int i;
>> +    if (!buf) {
>> +        return def;
>> +    }
>> +    for (i = 0; i < max; i++) {
>> +        if (!strcmp(buf, lookup[i])) {
>> +            return i;
>> +        }
>> +    }
>> +    error_setg(errp, "invalid parameter value: %s",
>> +               buf);
>> +    return def;
>> +}
>> +
>>  static bool check_throttle_config(ThrottleConfig *cfg, Error **errp)
>>  {
>>      if (throttle_conflicting(cfg)) {
> This hunk could have been a patch of its own (not a reason for a respin,
> but if you need to respin to address one of the other comments, please
> split the patch).

Yes, will do. Erik also already suggested this.

>
>> @@ -324,6 +341,7 @@ static DriveInfo *blockdev_init(const char *file, QDict *bs_opts,
>>      QemuOpts *opts;
>>      const char *id;
>>      bool has_driver_specific_opts;
>> +    BlockdevDetectZeroesOptions detect_zeroes;
>>      BlockDriver *drv = NULL;
>>  
>>      /* Check common options by copying from bs_opts to opts, all other options
>> @@ -452,6 +470,17 @@ static DriveInfo *blockdev_init(const char *file, QDict *bs_opts,
>>          }
>>      }
>>  
>> +    detect_zeroes =
>> +        parse_enum_option(BlockdevDetectZeroesOptions_lookup,
>> +                          qemu_opt_get(opts, "detect-zeroes"),
>> +                          BLOCKDEV_DETECT_ZEROES_OPTIONS_MAX,
>> +                          BLOCKDEV_DETECT_ZEROES_OPTIONS_OFF,
>> +                          &error);
>> +    if (error) {
>> +        error_propagate(errp, error);
>> +        goto early_err;
>> +    }
>> +
>>      /* init */
>>      dinfo = g_malloc0(sizeof(*dinfo));
>>      dinfo->id = g_strdup(qemu_opts_id(opts));
>> @@ -462,6 +491,7 @@ static DriveInfo *blockdev_init(const char *file, QDict *bs_opts,
>>      }
>>      dinfo->bdrv->open_flags = snapshot ? BDRV_O_SNAPSHOT : 0;
>>      dinfo->bdrv->read_only = ro;
>> +    dinfo->bdrv->detect_zeroes = detect_zeroes;
>>      dinfo->refcount = 1;
>>      if (serial != NULL) {
>>          dinfo->serial = g_strdup(serial);
>> @@ -2455,6 +2485,10 @@ QemuOptsList qemu_common_drive_opts = {
>>              .name = "copy-on-read",
>>              .type = QEMU_OPT_BOOL,
>>              .help = "copy read data from backing file into image file",
>> +        },{
>> +            .name = "detect-zeroes",
>> +            .type = QEMU_OPT_STRING,
>> +            .help = "try to optimize zero writes",
>>          },
>>          { /* end of list */ }
>>      },
>> diff --git a/hmp.c b/hmp.c
>> index ca869ba..a541e7d 100644
>> --- a/hmp.c
>> +++ b/hmp.c
>> @@ -336,6 +336,11 @@ void hmp_info_block(Monitor *mon, const QDict *qdict)
>>                             info->value->inserted->backing_file_depth);
>>          }
>>  
>> +        if (verbose) {
>> +            monitor_printf(mon, "    Detect zeroes:    %s\n",
>> +                           BlockdevDetectZeroesOptions_lookup[info->value->inserted->detect_zeroes]);
>> +        }
> Perhaps we should always display the information if zero detection is
> enabled. This would be similar to things like I/O throttling limits,
> which are only printed if throttling is enabled.

Ok

>
>>          if (info->value->inserted->bps
>>              || info->value->inserted->bps_rd
>>              || info->value->inserted->bps_wr
>> diff --git a/include/block/block_int.h b/include/block/block_int.h
>> index 9ffcb69..b8cc926 100644
>> --- a/include/block/block_int.h
>> +++ b/include/block/block_int.h
>> @@ -364,6 +364,7 @@ struct BlockDriverState {
>>      BlockJob *job;
>>  
>>      QDict *options;
>> +    BlockdevDetectZeroesOptions detect_zeroes;
>>  };
>>  
>>  int get_tmp_filename(char *filename, int size);
>> diff --git a/include/qemu-common.h b/include/qemu-common.h
>> index a998e8d..8e3d6eb 100644
>> --- a/include/qemu-common.h
>> +++ b/include/qemu-common.h
>> @@ -330,6 +330,7 @@ void qemu_iovec_concat(QEMUIOVector *dst,
>>  void qemu_iovec_concat_iov(QEMUIOVector *dst,
>>                             struct iovec *src_iov, unsigned int src_cnt,
>>                             size_t soffset, size_t sbytes);
>> +bool qemu_iovec_is_zero(QEMUIOVector *qiov);
>>  void qemu_iovec_destroy(QEMUIOVector *qiov);
>>  void qemu_iovec_reset(QEMUIOVector *qiov);
>>  size_t qemu_iovec_to_buf(QEMUIOVector *qiov, size_t offset,
>> diff --git a/qapi-schema.json b/qapi-schema.json
>> index 0b00427..d1620b1 100644
>> --- a/qapi-schema.json
>> +++ b/qapi-schema.json
>> @@ -937,6 +937,8 @@
>>  # @encryption_key_missing: true if the backing device is encrypted but an
>>  #                          valid encryption key is missing
>>  #
>> +# @detect_zeroes: detect and optimize zero writes (Since 2.1)
>> +#
>>  # @bps: total throughput limit in bytes per second is specified
>>  #
>>  # @bps_rd: read throughput limit in bytes per second is specified
>> @@ -972,6 +974,7 @@
>>    'data': { 'file': 'str', '*node-name': 'str', 'ro': 'bool', 'drv': 'str',
>>              '*backing_file': 'str', 'backing_file_depth': 'int',
>>              'encrypted': 'bool', 'encryption_key_missing': 'bool',
>> +            'detect_zeroes': 'BlockdevDetectZeroesOptions',
>>              'bps': 'int', 'bps_rd': 'int', 'bps_wr': 'int',
>>              'iops': 'int', 'iops_rd': 'int', 'iops_wr': 'int',
>>              'image': 'ImageInfo',
>> @@ -4250,6 +4253,22 @@
>>    'data': [ 'ignore', 'unmap' ] }
>>  
>>  ##
>> +# @BlockdevDetectZeroesOptions
>> +#
>> +# Describes the operation mode for the automatic conversion of plain
>> +# zero writes by the OS to driver specific optimized zero write commands.
>> +#
>> +# @off:      Disabled (default)
>> +# @on:       Enabled
>> +# @unmap:    Enabled and even try to unmap blocks if possible. This requires
>> +#            also that @BlockdevDiscardOptions is set to unmap for this device.
>> +#
>> +# Since: 2.1
>> +##
>> +{ 'enum': 'BlockdevDetectZeroesOptions',
>> +  'data': [ 'off', 'on', 'unmap' ] }
>> +
>> +##
>>  # @BlockdevAioOptions
>>  #
>>  # Selects the AIO backend to handle I/O requests
>> @@ -4301,20 +4320,22 @@
>>  # Options that are available for all block devices, independent of the block
>>  # driver.
>>  #
>> -# @driver:      block driver name
>> -# @id:          #optional id by which the new block device can be referred to.
>> -#               This is a required option on the top level of blockdev-add, and
>> -#               currently not allowed on any other level.
>> -# @node-name:   #optional the name of a block driver state node (Since 2.0)
>> -# @discard:     #optional discard-related options (default: ignore)
>> -# @cache:       #optional cache-related options
>> -# @aio:         #optional AIO backend (default: threads)
>> -# @rerror:      #optional how to handle read errors on the device
>> -#               (default: report)
>> -# @werror:      #optional how to handle write errors on the device
>> -#               (default: enospc)
>> -# @read-only:   #optional whether the block device should be read-only
>> -#               (default: false)
>> +# @driver:        block driver name
>> +# @id:            #optional id by which the new block device can be referred to.
>> +#                 This is a required option on the top level of blockdev-add, and
>> +#                 currently not allowed on any other level.
>> +# @node-name:     #optional the name of a block driver state node (Since 2.0)
>> +# @discard:       #optional discard-related options (default: ignore)
>> +# @cache:         #optional cache-related options
>> +# @aio:           #optional AIO backend (default: threads)
>> +# @rerror:        #optional how to handle read errors on the device
>> +#                 (default: report)
>> +# @werror:        #optional how to handle write errors on the device
>> +#                 (default: enospc)
>> +# @read-only:     #optional whether the block device should be read-only
>> +#                 (default: false)
>> +# @detect-zeroes: #optional detect and optimize zero writes (Since 2.1)
>> +#                 (default: off)
>>  #
>>  # Since: 1.7
>>  ##
>> @@ -4327,7 +4348,8 @@
>>              '*aio': 'BlockdevAioOptions',
>>              '*rerror': 'BlockdevOnError',
>>              '*werror': 'BlockdevOnError',
>> -            '*read-only': 'bool' } }
>> +            '*read-only': 'bool',
>> +            '*detect-zeroes': 'BlockdevDetectZeroesOptions' } }
>>  
>>  ##
>>  # @BlockdevOptionsFile
>> diff --git a/qemu-options.hx b/qemu-options.hx
>> index 781af14..5ee94ea 100644
>> --- a/qemu-options.hx
>> +++ b/qemu-options.hx
>> @@ -414,6 +414,7 @@ DEF("drive", HAS_ARG, QEMU_OPTION_drive,
>>      "       [,serial=s][,addr=A][,rerror=ignore|stop|report]\n"
>>      "       [,werror=ignore|stop|report|enospc][,id=name][,aio=threads|native]\n"
>>      "       [,readonly=on|off][,copy-on-read=on|off]\n"
>> +    "       [,detect-zeroes=on|off|unmap]\n"
>>      "       [[,bps=b]|[[,bps_rd=r][,bps_wr=w]]]\n"
>>      "       [[,iops=i]|[[,iops_rd=r][,iops_wr=w]]]\n"
>>      "       [[,bps_max=bm]|[[,bps_rd_max=rm][,bps_wr_max=wm]]]\n"
>> @@ -475,6 +476,11 @@ Open drive @option{file} as read-only. Guest write attempts will fail.
>>  @item copy-on-read=@var{copy-on-read}
>>  @var{copy-on-read} is "on" or "off" and enables whether to copy read backing
>>  file sectors into the image file.
>> +@item detect-zeroes=@var{detect-zeroes}
>> +@var{detect-zeroes} is "off", "on" or "unmap" and enables the automatic
>> +conversion of plain zero writes by the OS to driver specific optimized
>> +zero write commands. If "unmap" is chosen and @var{discard} is "on"
>> +a zero write may even be converted to an UNMAP operation.
>>  @end table
>>  
>>  By default, the @option{cache=writeback} mode is used. It will report data
>> diff --git a/qmp-commands.hx b/qmp-commands.hx
>> index ed3ab92..a535955 100644
>> --- a/qmp-commands.hx
>> +++ b/qmp-commands.hx
>> @@ -2032,6 +2032,8 @@ Each json-object contain the following:
>>           - "iops_rd_max":  read I/O operations max (json-int)
>>           - "iops_wr_max":  write I/O operations max (json-int)
>>           - "iops_size": I/O size when limiting by iops (json-int)
>> +         - "detect_zeroes": detect and optimize zero writing (json-string)
>> +             - Possible values: "off", "on", "unmap"
>>           - "image": the detail of the image, it is a json-object containing
>>              the following:
>>               - "filename": image file name (json-string)
>> @@ -2108,6 +2110,7 @@ Example:
>>                 "iops_rd_max": 0,
>>                 "iops_wr_max": 0,
>>                 "iops_size": 0,
>> +               "detect_zeroes": "on",
>>                 "image":{
>>                    "filename":"disks/test.qcow2",
>>                    "format":"qcow2",
>> diff --git a/util/iov.c b/util/iov.c
>> index 6569b5a..f8c49a1 100644
>> --- a/util/iov.c
>> +++ b/util/iov.c
>> @@ -335,6 +335,27 @@ void qemu_iovec_concat(QEMUIOVector *dst,
>>      qemu_iovec_concat_iov(dst, src->iov, src->niov, soffset, sbytes);
>>  }
>>  
>> +/*
>> + *  check if the contents of the iovecs are all zero
>> + */
>> +bool qemu_iovec_is_zero(QEMUIOVector *qiov)
>> +{
>> +    int i;
>> +    for (i = 0; i < qiov->niov; i++) {
>> +        size_t offs = qiov->iov[i].iov_len & ~(4 * sizeof(long) - 1);
> I think it gets a bit more readable like this:
>
> size_t offs = QEMU_ALIGN_DOWN(qiov->iov[i].iov_len, 4 * sizeof(long));

Yes, indeed.

>
>> +        uint8_t *ptr = qiov->iov[i].iov_base;
>> +        if (offs && !buffer_is_zero(qiov->iov[i].iov_base, offs)) {
>> +            return false;
>> +        }
>> +        for (; offs < qiov->iov[i].iov_len; offs++) {
>> +            if (ptr[offs]) {
>> +                return false;
>> +            }
>> +        }
>> +    }
>> +    return true;
>> +}
>> +
> This function could be a separate patch as well.

Ok, will split as well.

Thank you for your and Erik for his comments,
Peter

  reply	other threads:[~2014-05-15  5:17 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-05-08 16:22 [Qemu-devel] [PATCHv4] block: optimize zero writes with bdrv_write_zeroes Peter Lieven
2014-05-12 20:28 ` Eric Blake
2014-05-13 12:06   ` Peter Lieven
2014-05-14 13:15     ` Eric Blake
2014-05-14 11:41 ` Kevin Wolf
2014-05-15  5:16   ` Peter Lieven [this message]
2014-05-15  9:54     ` Kevin Wolf
2014-05-15 21:20       ` Peter Lieven

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=53744DAA.9030001@kamp.de \
    --to=pl@kamp.de \
    --cc=famz@redhat.com \
    --cc=kwolf@redhat.com \
    --cc=mreitz@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=stefanha@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.