From: Eric Blake <eblake@redhat.com>
To: Francesco Romani <fromani@redhat.com>, qemu-devel@nongnu.org
Cc: kwolf@redhat.com, mdroth@linux.vnet.ibm.com, stefanha@redhat.com,
lcapitulino@redhat.com
Subject: Re: [Qemu-devel] [PATCHv3] block: add event when disk usage exceeds threshold
Date: Mon, 01 Dec 2014 14:07:38 -0700 [thread overview]
Message-ID: <547CD89A.2090903@redhat.com> (raw)
In-Reply-To: <1417177867-27918-2-git-send-email-fromani@redhat.com>
[-- Attachment #1: Type: text/plain, Size: 6161 bytes --]
On 11/28/2014 05:31 AM, Francesco Romani wrote:
> Managing applications, like oVirt (http://www.ovirt.org), make extensive
> use of thin-provisioned disk images.
> To let the guest run smoothly and be not unnecessarily paused, oVirt sets
> a disk usage threshold (so called 'high water mark') based on the occupation
> of the device, and automatically extends the image once the threshold
> is reached or exceeded.
>
> In order to detect the crossing of the threshold, oVirt has no choice but
> aggressively polling the QEMU monitor using the query-blockstats command.
> This lead to unnecessary system load, and is made even worse under scale:
> deployments with hundreds of VMs are no longer rare.
>
> To fix this, this patch adds:
> * A new monitor command to set a mark for a given block device.
> * A new event to report if a block device usage exceeds the threshold.
>
> This will allow the managing application to drop the polling
> altogether and just wait for a watermark crossing event.
I like the idea!
Question - what happens if management misses the event (for example, if
libvirtd is restarted)? Does the existing 'query-blockstats' and/or
'query-named-block-nodes' still work to query the current threshold and
whether it has been exceeded, as a poll-once command executed when
reconnecting to the monitor?
>
> Signed-off-by: Francesco Romani <fromani@redhat.com>
> ---
No need for a 0/1 cover letter on a 1-patch series; you have the option
of just putting the side-band information here and sending it as a
single mail. But the cover letter approach doesn't hurt either, and I
can see how it can be easier for some workflows to always send a cover
letter than to special-case a 1-patch series.
> +static int coroutine_fn before_write_notify(NotifierWithReturn *notifier,
> + void *opaque)
> +{
> + BdrvTrackedRequest *req = opaque;
> + BlockDriverState *bs = req->bs;
> + uint64_t amount = 0;
> +
> + amount = bdrv_usage_threshold_exceeded(bs, req);
> + if (amount > 0) {
> + qapi_event_send_block_usage_threshold(
> + bs->node_name,
> + amount,
> + bs->write_threshold_offset,
> + &error_abort);
> +
> + /* autodisable to avoid to flood the monitor */
s/to flood/flooding/
> +/*
> + * bdrv_usage_threshold_is_set
> + *
> + * Tell if an usage threshold is set for a given BDS.
s/an usage/a usage/
(in English, the difference between "a" and "an" is whether the leading
sound of the next word is pronounced or not; in this case, "usage" is
pronounced with a hard "yoo-sage". It may help to remember "an umbrella
for a unicorn")
> +++ b/qapi/block-core.json
> @@ -239,6 +239,9 @@
> #
> # @iops_size: #optional an I/O size in bytes (Since 1.7)
> #
> +# @write-threshold: configured write threshold for the device.
> +# 0 if disabled. (Since 2.3)
> +#
> # Since: 0.14.0
> #
> ##
> @@ -253,7 +256,7 @@
> '*bps_max': 'int', '*bps_rd_max': 'int',
> '*bps_wr_max': 'int', '*iops_max': 'int',
> '*iops_rd_max': 'int', '*iops_wr_max': 'int',
> - '*iops_size': 'int' } }
> + '*iops_size': 'int', 'write-threshold': 'uint64' } }
In QMP specs, 'uint64' and 'int' are practically synonyms. I can live
with either spelling, although 'int' is more common.
Bikeshed on naming: Although we prefer '-' over '_' in new interfaces,
we also favor consistency, and BlockDeviceInfo is one of those dinosaur
commands that uses _ everywhere until your addition. So naming this
field 'write_threshold' would be more consistent.
> +##
> +# @BLOCK_USAGE_THRESHOLD
> +#
> +# Emitted when writes on block device reaches or exceeds the
> +# configured threshold. For thin-provisioned devices, this
> +# means the device should be extended to avoid pausing for
> +# disk exaustion.
s/exaustion/exhaustion/
> +#
> +# @node-name: graph node name on which the threshold was exceeded.
> +#
> +# @amount-exceeded: amount of data which exceeded the threshold, in bytes.
> +#
> +# @offset-threshold: last configured threshold, in bytes.
> +#
Might want to mention that this event is one-shot; after it triggers, a
user must re-register a threshold to get the event again.
> +# Since: 2.3
> +##
> +{ 'event': 'BLOCK_USAGE_THRESHOLD',
> + 'data': { 'node-name': 'str',
> + 'amount-exceeded': 'uint64',
TAB damage. Please use spaces. ./scripts/checkpatch.pl will catch some
offenders (although I didn't test if it will catch this one).
However, here you are correct in using '-' for naming :)
> + 'threshold': 'uint64' } }
> +
> +##
> +# @block-set-threshold
> +#
> +# Change usage threshold for a block drive. An event will be delivered
> +# if a write to this block drive crosses the configured threshold.
> +# This is useful to transparently resize thin-provisioned drives without
> +# the guest OS noticing.
> +#
> +# @node-name: graph node name on which the threshold must be set.
> +#
> +# @write-threshold: configured threshold for the block device, bytes.
> +# Use 0 to disable the threshold.
> +#
> +# Returns: Nothing on success
> +# If @node name is not found on the block device graph,
> +# DeviceNotFound
> +#
> +# Since: 2.3
> +##
> +{ 'command': 'block-set-threshold',
> + 'data': { 'node-name': 'str', 'threshold': 'uint64' } }
Again, 'int' instead of 'uint64' is more typical, but shouldn't hurt
with either spelling.
> +SQMP
> +block-set-threshold
> +------------
> +
> +Change the write threshold for a block drive. The threshold is an offset,
> +thus must be non-negative. Default is not write threshold.
s/not/no/
> +To set the threshold to zero disables it.
s/To set/Setting/
Missing the change to the 'query-block' and 'query-named-block-nodes'
examples to show the new always-present output field.
--
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library http://libvirt.org
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 539 bytes --]
next prev parent reply other threads:[~2014-12-01 21:07 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
2014-11-28 12:31 [Qemu-devel] [PATCH v3] add write threshold reporting for block devices Francesco Romani
2014-11-28 12:31 ` [Qemu-devel] [PATCHv3] block: add event when disk usage exceeds threshold Francesco Romani
2014-12-01 21:07 ` Eric Blake [this message]
2014-12-02 7:47 ` Francesco Romani
2014-12-02 8:07 ` Francesco Romani
2014-12-02 15:01 ` Stefan Hajnoczi
2014-12-02 15:11 ` Francesco Romani
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=547CD89A.2090903@redhat.com \
--to=eblake@redhat.com \
--cc=fromani@redhat.com \
--cc=kwolf@redhat.com \
--cc=lcapitulino@redhat.com \
--cc=mdroth@linux.vnet.ibm.com \
--cc=qemu-devel@nongnu.org \
--cc=stefanha@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).