From: Eric Blake <eblake@redhat.com>
To: Francesco Romani <fromani@redhat.com>, qemu-devel@nongnu.org
Cc: kwolf@redhat.com, mdroth@linux.vnet.ibm.com, stefanha@redhat.com,
lcapitulino@redhat.com
Subject: Re: [Qemu-devel] [PATCHv3] block: add event when disk usage exceeds threshold
Date: Mon, 01 Dec 2014 14:07:38 -0700 [thread overview]
Message-ID: <547CD89A.2090903@redhat.com> (raw)
In-Reply-To: <1417177867-27918-2-git-send-email-fromani@redhat.com>
[-- Attachment #1: Type: text/plain, Size: 6161 bytes --]
On 11/28/2014 05:31 AM, Francesco Romani wrote:
> Managing applications, like oVirt (http://www.ovirt.org), make extensive
> use of thin-provisioned disk images.
> To let the guest run smoothly and be not unnecessarily paused, oVirt sets
> a disk usage threshold (so called 'high water mark') based on the occupation
> of the device, and automatically extends the image once the threshold
> is reached or exceeded.
>
> In order to detect the crossing of the threshold, oVirt has no choice but
> aggressively polling the QEMU monitor using the query-blockstats command.
> This lead to unnecessary system load, and is made even worse under scale:
> deployments with hundreds of VMs are no longer rare.
>
> To fix this, this patch adds:
> * A new monitor command to set a mark for a given block device.
> * A new event to report if a block device usage exceeds the threshold.
>
> This will allow the managing application to drop the polling
> altogether and just wait for a watermark crossing event.
I like the idea!
Question - what happens if management misses the event (for example, if
libvirtd is restarted)? Does the existing 'query-blockstats' and/or
'query-named-block-nodes' still work to query the current threshold and
whether it has been exceeded, as a poll-once command executed when
reconnecting to the monitor?
>
> Signed-off-by: Francesco Romani <fromani@redhat.com>
> ---
No need for a 0/1 cover letter on a 1-patch series; you have the option
of just putting the side-band information here and sending it as a
single mail. But the cover letter approach doesn't hurt either, and I
can see how it can be easier for some workflows to always send a cover
letter than to special-case a 1-patch series.
> +static int coroutine_fn before_write_notify(NotifierWithReturn *notifier,
> + void *opaque)
> +{
> + BdrvTrackedRequest *req = opaque;
> + BlockDriverState *bs = req->bs;
> + uint64_t amount = 0;
> +
> + amount = bdrv_usage_threshold_exceeded(bs, req);
> + if (amount > 0) {
> + qapi_event_send_block_usage_threshold(
> + bs->node_name,
> + amount,
> + bs->write_threshold_offset,
> + &error_abort);
> +
> + /* autodisable to avoid to flood the monitor */
s/to flood/flooding/
> +/*
> + * bdrv_usage_threshold_is_set
> + *
> + * Tell if an usage threshold is set for a given BDS.
s/an usage/a usage/
(in English, the difference between "a" and "an" is whether the leading
sound of the next word is pronounced or not; in this case, "usage" is
pronounced with a hard "yoo-sage". It may help to remember "an umbrella
for a unicorn")
> +++ b/qapi/block-core.json
> @@ -239,6 +239,9 @@
> #
> # @iops_size: #optional an I/O size in bytes (Since 1.7)
> #
> +# @write-threshold: configured write threshold for the device.
> +# 0 if disabled. (Since 2.3)
> +#
> # Since: 0.14.0
> #
> ##
> @@ -253,7 +256,7 @@
> '*bps_max': 'int', '*bps_rd_max': 'int',
> '*bps_wr_max': 'int', '*iops_max': 'int',
> '*iops_rd_max': 'int', '*iops_wr_max': 'int',
> - '*iops_size': 'int' } }
> + '*iops_size': 'int', 'write-threshold': 'uint64' } }
In QMP specs, 'uint64' and 'int' are practically synonyms. I can live
with either spelling, although 'int' is more common.
Bikeshed on naming: Although we prefer '-' over '_' in new interfaces,
we also favor consistency, and BlockDeviceInfo is one of those dinosaur
commands that uses _ everywhere until your addition. So naming this
field 'write_threshold' would be more consistent.
> +##
> +# @BLOCK_USAGE_THRESHOLD
> +#
> +# Emitted when writes on block device reaches or exceeds the
> +# configured threshold. For thin-provisioned devices, this
> +# means the device should be extended to avoid pausing for
> +# disk exaustion.
s/exaustion/exhaustion/
> +#
> +# @node-name: graph node name on which the threshold was exceeded.
> +#
> +# @amount-exceeded: amount of data which exceeded the threshold, in bytes.
> +#
> +# @offset-threshold: last configured threshold, in bytes.
> +#
Might want to mention that this event is one-shot; after it triggers, a
user must re-register a threshold to get the event again.
> +# Since: 2.3
> +##
> +{ 'event': 'BLOCK_USAGE_THRESHOLD',
> + 'data': { 'node-name': 'str',
> + 'amount-exceeded': 'uint64',
TAB damage. Please use spaces. ./scripts/checkpatch.pl will catch some
offenders (although I didn't test if it will catch this one).
However, here you are correct in using '-' for naming :)
> + 'threshold': 'uint64' } }
> +
> +##
> +# @block-set-threshold
> +#
> +# Change usage threshold for a block drive. An event will be delivered
> +# if a write to this block drive crosses the configured threshold.
> +# This is useful to transparently resize thin-provisioned drives without
> +# the guest OS noticing.
> +#
> +# @node-name: graph node name on which the threshold must be set.
> +#
> +# @write-threshold: configured threshold for the block device, bytes.
> +# Use 0 to disable the threshold.
> +#
> +# Returns: Nothing on success
> +# If @node name is not found on the block device graph,
> +# DeviceNotFound
> +#
> +# Since: 2.3
> +##
> +{ 'command': 'block-set-threshold',
> + 'data': { 'node-name': 'str', 'threshold': 'uint64' } }
Again, 'int' instead of 'uint64' is more typical, but shouldn't hurt
with either spelling.
> +SQMP
> +block-set-threshold
> +------------
> +
> +Change the write threshold for a block drive. The threshold is an offset,
> +thus must be non-negative. Default is not write threshold.
s/not/no/
> +To set the threshold to zero disables it.
s/To set/Setting/
Missing the change to the 'query-block' and 'query-named-block-nodes'
examples to show the new always-present output field.
--
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library http://libvirt.org
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 539 bytes --]
next prev parent reply other threads:[~2014-12-01 21:07 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
2014-11-28 12:31 [Qemu-devel] [PATCH v3] add write threshold reporting for block devices Francesco Romani
2014-11-28 12:31 ` [Qemu-devel] [PATCHv3] block: add event when disk usage exceeds threshold Francesco Romani
2014-12-01 21:07 ` Eric Blake [this message]
2014-12-02 7:47 ` Francesco Romani
2014-12-02 8:07 ` Francesco Romani
2014-12-02 15:01 ` Stefan Hajnoczi
2014-12-02 15:11 ` Francesco Romani
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=547CD89A.2090903@redhat.com \
--to=eblake@redhat.com \
--cc=fromani@redhat.com \
--cc=kwolf@redhat.com \
--cc=lcapitulino@redhat.com \
--cc=mdroth@linux.vnet.ibm.com \
--cc=qemu-devel@nongnu.org \
--cc=stefanha@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.