Re: [Qemu-devel] [PATCHv3] block: add event when disk usage exceeds threshold

From: Eric Blake <eblake@redhat.com>
To: Francesco Romani <fromani@redhat.com>, qemu-devel@nongnu.org
Cc: kwolf@redhat.com, mdroth@linux.vnet.ibm.com, stefanha@redhat.com,
	lcapitulino@redhat.com
Subject: Re: [Qemu-devel] [PATCHv3] block: add event when disk usage exceeds threshold
Date: Mon, 01 Dec 2014 14:07:38 -0700
Message-ID: <547CD89A.2090903@redhat.com>
In-Reply-To: <1417177867-27918-2-git-send-email-fromani@redhat.com>

On 11/28/2014 05:31 AM, Francesco Romani wrote:
> Managing applications, like oVirt (http://www.ovirt.org), make extensive
> use of thin-provisioned disk images.
> To let the guest run smoothly and be not unnecessarily paused, oVirt sets
> a disk usage threshold (so called 'high water mark') based on the occupation
> of the device,  and automatically extends the image once the threshold
> is reached or exceeded.
> 
> In order to detect the crossing of the threshold, oVirt has no choice but
> aggressively polling the QEMU monitor using the query-blockstats command.
> This lead to unnecessary system load, and is made even worse under scale:
> deployments with hundreds of VMs are no longer rare.
> 
> To fix this, this patch adds:
> * A new monitor command to set a mark for a given block device.
> * A new event to report if a block device usage exceeds the threshold.
> 
> This will allow the managing application to drop the polling
> altogether and just wait for a watermark crossing event.

I like the idea!

Question - what happens if management misses the event (for example, if
libvirtd is restarted)?  Does the existing 'query-blockstats' and/or
'query-named-block-nodes' still work to query the current threshold and
whether it has been exceeded, as a poll-once command executed when
reconnecting to the monitor?
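
To make the question concrete, here is a minimal sketch of the poll-once
check a management application might run after reconnecting.  It assumes
that some query command reports the currently configured threshold (0 when
disabled) and that the highest written offset is available from the block
statistics; the names reconnect_action, ReconnectAction and the parameters
are hypothetical and not part of the patch.

```c
#include <stdint.h>

/* Hypothetical actions for a management application after reconnecting. */
typedef enum {
    RECONNECT_KEEP_WAITING, /* threshold still armed and not crossed */
    RECONNECT_REARM,        /* threshold no longer set; register it again */
    RECONNECT_EXTEND        /* writes already reached the mark; extend now */
} ReconnectAction;

/*
 * desired:  threshold the application wants armed, in bytes (non-zero)
 * reported: threshold read back from the monitor, 0 meaning disabled
 *           (assumed to be exposed by a query command)
 * highest:  highest offset written so far, in bytes (assumed to be
 *           available from the block statistics)
 */
static ReconnectAction reconnect_action(uint64_t desired, uint64_t reported,
                                        uint64_t highest)
{
    /* Be conservative: treat reaching the mark the same as crossing it. */
    if (highest >= desired) {
        return RECONNECT_EXTEND;
    }
    /* The event may have been missed, or the threshold was never set. */
    if (reported != desired) {
        return RECONNECT_REARM;
    }
    return RECONNECT_KEEP_WAITING;
}
```

If the query output cannot provide both values, a missed event cannot be
told apart from one that never fired, which is why I am asking.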

> 
> Signed-off-by: Francesco Romani <fromani@redhat.com>
> ---

No need for a 0/1 cover letter on a 1-patch series; you have the option
of just putting the side-band information here and sending it as a
single mail.  But the cover letter approach doesn't hurt either, and I
can see how it can be easier for some workflows to always send a cover
letter than to special-case a 1-patch series.

> +static int coroutine_fn before_write_notify(NotifierWithReturn *notifier,
> +                                            void *opaque)
> +{
> +    BdrvTrackedRequest *req = opaque;
> +    BlockDriverState *bs = req->bs;
> +    uint64_t amount = 0;
> +
> +    amount = bdrv_usage_threshold_exceeded(bs, req);
> +    if (amount > 0) {
> +        qapi_event_send_block_usage_threshold(
> +            bs->node_name,
> +            amount,
> +            bs->write_threshold_offset,
> +            &error_abort);
> +
> +        /* autodisable to avoid to flood the monitor */

s/to flood/flooding/

> +/*
> + * bdrv_usage_threshold_is_set
> + *
> + * Tell if an usage threshold is set for a given BDS.

s/an usage/a usage/

(in English, the choice between "a" and "an" depends on whether the next
word begins with a consonant sound or a vowel sound; in this case, "usage"
begins with a consonant "y" sound, as in "yoo-sage".  It may help to
remember "an umbrella for a unicorn")

> +++ b/qapi/block-core.json
> @@ -239,6 +239,9 @@
>  #
>  # @iops_size: #optional an I/O size in bytes (Since 1.7)
>  #
> +# @write-threshold: configured write threshold for the device.
> +#                   0 if disabled. (Since 2.3)
> +#
>  # Since: 0.14.0
>  #
>  ##
> @@ -253,7 +256,7 @@
>              '*bps_max': 'int', '*bps_rd_max': 'int',
>              '*bps_wr_max': 'int', '*iops_max': 'int',
>              '*iops_rd_max': 'int', '*iops_wr_max': 'int',
> -            '*iops_size': 'int' } }
> +            '*iops_size': 'int', 'write-threshold': 'uint64' } }

In QMP specs, 'uint64' and 'int' are practically synonyms.  I can live
with either spelling, although 'int' is more common.

Bikeshed on naming: Although we prefer '-' over '_' in new interfaces,
we also favor consistency, and BlockDeviceInfo is one of those dinosaur
types that used '_' everywhere until your addition.  So naming this
field 'write_threshold' would be more consistent.

> +##
> +# @BLOCK_USAGE_THRESHOLD
> +#
> +# Emitted when writes on block device reaches or exceeds the
> +# configured threshold. For thin-provisioned devices, this
> +# means the device should be extended to avoid pausing for
> +# disk exaustion.

s/exaustion/exhaustion/

> +#
> +# @node-name: graph node name on which the threshold was exceeded.
> +#
> +# @amount-exceeded: amount of data which exceeded the threshold, in bytes.
> +#
> +# @offset-threshold: last configured threshold, in bytes.
> +#

Might want to mention that this event is one-shot; after it triggers, a
user must re-register a threshold to get the event again.
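
As an illustration of the one-shot behavior worth documenting, here is a
small standalone model.  It is only a sketch: ThresholdState and check_write
are hypothetical stand-ins for the BlockDriverState field and the helper in
the patch, and like the quoted notifier it reports only when the returned
amount is greater than zero.

```c
#include <stdint.h>

/* Hypothetical stand-in for the per-device threshold state. */
typedef struct {
    uint64_t write_threshold_offset; /* 0 means no threshold is armed */
} ThresholdState;

/*
 * Returns the number of bytes by which the write [offset, offset + bytes)
 * goes past the armed threshold, or 0 if nothing should be reported.
 * A non-zero return disarms the threshold, so later writes stay silent
 * until the caller sets a new one.  Assumes offset + bytes does not
 * overflow.
 */
static uint64_t check_write(ThresholdState *st, uint64_t offset,
                            uint64_t bytes)
{
    uint64_t end = offset + bytes;
    uint64_t amount;

    if (st->write_threshold_offset == 0 ||
        end <= st->write_threshold_offset) {
        return 0;
    }
    amount = end - st->write_threshold_offset;
    st->write_threshold_offset = 0; /* one-shot: auto-disable after firing */
    return amount;
}
```

With a threshold of 100, a write of 20 bytes at offset 90 reports 10 bytes
exceeded and disarms the threshold; any later write reports nothing until a
new threshold is registered.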

> +# Since: 2.3
> +##
> +{ 'event': 'BLOCK_USAGE_THRESHOLD',
> +  'data': { 'node-name': 'str',
> +	    'amount-exceeded': 'uint64',

TAB damage.  Please use spaces.  ./scripts/checkpatch.pl will catch some
offenders (although I didn't test if it will catch this one).

However, here you are correct in using '-' for naming :)

> +	    'threshold': 'uint64' } }
> +
> +##
> +# @block-set-threshold
> +#
> +# Change usage threshold for a block drive. An event will be delivered
> +# if a write to this block drive crosses the configured threshold.
> +# This is useful to transparently resize thin-provisioned drives without
> +# the guest OS noticing.
> +#
> +# @node-name: graph node name on which the threshold must be set.
> +#
> +# @write-threshold: configured threshold for the block device, bytes.
> +#                   Use 0 to disable the threshold.
> +#
> +# Returns: Nothing on success
> +#          If @node name is not found on the block device graph,
> +#          DeviceNotFound
> +#
> +# Since: 2.3
> +##
> +{ 'command': 'block-set-threshold',
> +  'data': { 'node-name': 'str', 'threshold': 'uint64' } }

Again, 'int' instead of 'uint64' is more typical, but shouldn't hurt
with either spelling.

> +SQMP
> +block-set-threshold
> +------------
> +
> +Change the write threshold for a block drive. The threshold is an offset,
> +thus must be non-negative. Default is not write threshold.

s/not/no/

> +To set the threshold to zero disables it.

s/To set/Setting/

Missing the change to the 'query-block' and 'query-named-block-nodes'
examples to show the new always-present output field.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

