From: "Benoît Canet" <benoit.canet@irqsave.net>
To: Markus Armbruster <armbru@redhat.com>
Cc: "Benoît Canet" <benoit.canet@irqsave.net>,
kwolf@redhat.com, libvir-list@redhat.com, qemu-devel@nongnu.org,
stefanha@redhat.com, anshul.makkar@profitbricks.com
Subject: Re: [Qemu-devel] [libvirt] IO accounting overhaul
Date: Mon, 1 Sep 2014 12:44:38 +0200 [thread overview]
Message-ID: <20140901104437.GA15673@irqsave.net> (raw)
In-Reply-To: <8761h7fz27.fsf@blackfin.pond.sub.org>
On Monday, 01 Sep 2014 at 11:52:00 (+0200), Markus Armbruster wrote:
> Cc'ing libvirt following Stefan's lead.
>
> Benoît Canet <benoit.canet@irqsave.net> writes:
>
> > Hi,
> >
> > I collected some items of a cloud provider wishlist regarding I/O accounting.
>
> Feedback from real power-users, lovely!
>
> > In a cloud, I/O accounting can serve three purposes: billing, helping the
> > customers, and doing metrology to help the cloud provider find hidden costs.
> >
> > I'll cover the first two topics in this mail because they are the most
> > important business-wise.
> >
> > 1) preferred place to collect billing I/O accounting data:
> > --------------------------------------------------------
> > For billing purposes the collected data must be as close as possible to what
> > the customer would see by running iostat in his VM.
>
> Good point.
>
> > The first conclusion we can draw is that the choice of collecting I/O
> > accounting data used for billing in the block device models is right.
>
> Slightly rephrasing: doing I/O accounting in the block device models is
> right for billing.
>
> There may be other uses for I/O accounting, with different preferences.
> For instance, data on how exactly guest I/O gets translated to host I/O
> as it flows through the nodes in the block graph could be useful.
I think this is the third point, the one I called metrology.
Basically it boils down to "where are the hidden I/O costs of the QEMU block layer?".
>
> Doesn't diminish the need for accurate billing information, of course.
>
> > 2) what to do with occurrences of rare events:
> > ---------------------------------------------
> >
> > Another point is that QEMU developers agree that they don't know which policy
> > to apply to some I/O accounting events.
> > Must QEMU discard an invalid write I/O or account it as done?
> > Must QEMU count a failed read I/O as done?
> >
> > When discussing this with a cloud provider, the following appeared: these
> > decisions are really specific to each cloud provider and QEMU should not
> > implement them.
>
> Good point, consistent with the old advice to avoid baking policy into
> inappropriately low levels of the stack.
>
> > The right thing to do is to add accounting counters to collect these events.
> >
> > Moreover, these rare events are precious troubleshooting data, so that's an
> > additional reason not to toss them.
>
> Another good point.
>
> > 3) list of block I/O accounting metrics wanted for billing and helping
> > the customers
> > -----------------------------------------------------------------------------------
> >
> > Basic I/O accounting data will end up making the customers' bills.
> > Extra I/O accounting information would be a precious help for the cloud
> > provider to implement a monitoring panel like Amazon CloudWatch.
>
> These are the first two from your list of three purposes, i.e. the ones
> you promised to cover here.
>
> > Here is the list of counters and statistics I would like to help implement
> > in QEMU.
> >
> > This is the most important part of the mail and the one I would like the
> > community to review the most.
> >
> > Once this list is settled I would proceed to implement the required
> > infrastructure in QEMU before using it in the device models.
>
> For context, let me recap how I/O accounting works now.
>
> The BlockDriverState abstract data type (short: BDS) can hold the
> following accounting data:
>
>     uint64_t nr_bytes[BDRV_MAX_IOTYPE];
>     uint64_t nr_ops[BDRV_MAX_IOTYPE];
>     uint64_t total_time_ns[BDRV_MAX_IOTYPE];
>     uint64_t wr_highest_sector;
>
> where BDRV_MAX_IOTYPE enumerates read, write, flush.
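(For context, the enum behind BDRV_MAX_IOTYPE looks roughly like this --
paraphrased from memory, not copied verbatim from block.h:)

    enum BlockAcctType {
        BDRV_ACCT_READ,
        BDRV_ACCT_WRITE,
        BDRV_ACCT_FLUSH,
        BDRV_MAX_IOTYPE,
    };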
>
> wr_highest_sector is a high watermark updated by the block layer as it
> writes sectors.
>
> The other three are *not* touched by the block layer. Instead, the
> block layer provides a pair of functions for device models to update
> them:
>
>     void bdrv_acct_start(BlockDriverState *bs, BlockAcctCookie *cookie,
>                          int64_t bytes, enum BlockAcctType type);
>     void bdrv_acct_done(BlockDriverState *bs, BlockAcctCookie *cookie);
>
> bdrv_acct_start() initializes cookie for a read, write, or flush
> operation of a certain size. The size of a flush is always zero.
>
> bdrv_acct_done() adds the operation to the BDS's accounting data.
> total_time_ns is incremented by the time between _start() and _done().
>
> You may call _start() without calling _done(). That's a feature.
> Device models use it to avoid accounting some requests.
>
> Device models are not supposed to mess with cookie directly, only
> through these two functions.
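For illustration, here is roughly how a device model wires this in. This is
only a sketch: ExampleRequest and the callback names are hypothetical, while
the bdrv_acct_*() and bdrv_aio_readv() calls match the API described above.

    static void example_read_cb(void *opaque, int ret)
    {
        ExampleRequest *req = opaque;

        if (ret == 0) {
            /* success: fold the request into the BDS accounting data */
            bdrv_acct_done(req->bs, &req->acct);
        }
        /* on failure the cookie is simply dropped -- the "feature"
         * mentioned above, and exactly where device models disagree */
    }

    static void example_submit_read(BlockDriverState *bs, ExampleRequest *req)
    {
        bdrv_acct_start(bs, &req->acct, req->nb_sectors * 512, BDRV_ACCT_READ);
        bdrv_aio_readv(bs, req->sector_num, &req->qiov, req->nb_sectors,
                       example_read_cb, req);
    }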
>
> Some device models implement accounting, some don't. The ones that do
> don't agree on how to count invalid guest requests (the ones not passed
> to block layer) and failed requests (passed to block layer and failed
> there). It's a mess in part caused by us never writing down what
> exactly device models are expected to do.
>
> Accounting data is used by "query-blockstats", and nothing else.
>
> Corollary: even though every BDS holds accounting data, only the ones in
> "top" BDSes ever get used. This is a common block layer blemish, and
> we're working on cleaning it up.
>
> If a device model doesn't implement accounting, query-blockstats lies.
> Fortunately, its lies are pretty transparent (everything's zero) as long
> as you don't do things like connecting a backend to a device model that
> doesn't implement accounting after disconnecting it from a device model
> that does. Still, I'd welcome a more honest QMP interface.
>
> For me, this accounting data belongs to the device model, not the BDS.
> Naturally, the block device models should use common infrastructure. I
> guess they use the block layer only because it's obvious infrastructure
> they share. Clumsy design.
>
> > /* volume of data transferred by the I/Os */
> > read_bytes
> > write_bytes
>
> This is nr_bytes[BDRV_ACCT_READ] and nr_bytes[BDRV_ACCT_WRITE].
>
> nr_bytes[BDRV_ACCT_FLUSH] is always zero.
>
> Should this count only actual I/O, i.e. accumulated size of successful
> operations?
>
> > /* operation count */
> > read_ios
> > write_ios
> > flush_ios
> >
> > /* how many invalid I/Os the guest submits */
> > invalid_read_ios
> > invalid_write_ios
> > invalid_flush_ios
> >
> > /* how many I/O errors happened */
> > read_ios_error
> > write_ios_error
> > flush_ios_error
>
> This is nr_ops[BDRV_ACCT_READ], nr_ops[BDRV_ACCT_WRITE],
> nr_ops[BDRV_ACCT_FLUSH] split up into successful, invalid and failed.
>
> > /* account the time spent doing I/Os */
> > total_read_time
> > total_write_time
> > total_flush_time
>
> This is total_time_ns[BDRV_ACCT_READ], total_time_ns[BDRV_ACCT_WRITE],
> total_time_ns[BDRV_ACCT_FLUSH].
>
> I guess this should count both successful and failed I/O. Could throw
> in invalid, too, but it's probably too quick to matter.
>
> Please specify the unit clearly. Either total_FOO_time_ns or total_FOO_ns
> would work for me.
Yes, _ns is fine for me too.
>
> > /* since when the volume is iddle */
> > qvolume_iddleness_time
>
> "idle"
>
> The obvious way to maintain this information with the current code would
> be to save the value of get_clock() in bdrv_acct_done().
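A minimal sketch of that idea -- the field name is hypothetical, while
get_clock() is the existing qemu-timer helper:

    /* in bdrv_acct_done(), after updating the counters: */
    bs->last_io_time_ns = get_clock();    /* hypothetical field */

    /* query side: how long the volume has been idle, in ns */
    idle_ns = get_clock() - bs->last_io_time_ns;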
>
> > /* the following would compute latencies for slices of 1 second, then toss
> >  * the result and start a new slice. A weighted summation of the instant
> >  * latencies could help to implement this.
> >  */
> > 1s_read_average_latency
> > 1s_write_average_latency
> > 1s_flush_average_latency
> >
> > /* the former three numbers could be used to further compute a 1
> > minute slice value */
> > 1m_read_average_latency
> > 1m_write_average_latency
> > 1m_flush_average_latency
> >
> > /* the former three numbers could be used to further compute a 1 hour
> > slice value */
> > 1h_read_average_latency
> > 1h_write_average_latency
> > 1h_flush_average_latency
>
> This is something like "what we added to total_FOO_time in the last
> completed 1s / 1m / 1h time slice divided by the number of additions".
> Just another way to accumulate the same raw data, thus no worries.
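A sketch of such a slice accumulator (all names hypothetical; one instance
per BDS and I/O type, driven by a 1s timer):

    typedef struct AcctSlice {
        uint64_t total_time_ns;  /* latency summed over the current slice */
        uint64_t count;          /* completions in the current slice */
        uint64_t avg_ns;         /* last published 1s average */
    } AcctSlice;

    /* called from the completion path with the request's latency */
    static void slice_account(AcctSlice *s, uint64_t latency_ns)
    {
        s->total_time_ns += latency_ns;
        s->count++;
    }

    /* called by a 1s timer: publish the average, start a new slice */
    static void slice_tick(AcctSlice *s)
    {
        s->avg_ns = s->count ? s->total_time_ns / s->count : 0;
        s->total_time_ns = 0;
        s->count = 0;
    }

The 1m and 1h values then fall out of averaging the last 60 / 3600 published
slices, or of the weighted running sum suggested above.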
>
> > /* 1 second average number of requests in flight */
> > 1s_read_queue_depth
> > 1s_write_queue_depth
> >
> > /* 1 minute average number of requests in flight */
> > 1m_read_queue_depth
> > 1m_write_queue_depth
> >
> > /* 1 hour average number of requests in flight */
> > 1h_read_queue_depth
> > 1h_write_queue_depth
>
> I guess this involves counting bdrv_acct_start() and bdrv_acct_done().
> The "you need not call bdrv_acct_done()" feature may get in the way.
> Solvable.
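The counting could look like this (a sketch, hypothetical names): increment
on _start(), decrement on completion or failure, and let the same periodic
timer sample the value:

    typedef struct DepthStats {
        unsigned in_flight;   /* ++ in bdrv_acct_start(), -- on completion */
        uint64_t depth_sum;   /* sum of samples in the current slice */
        unsigned samples;
    } DepthStats;

    /* called by the periodic timer */
    static void depth_sample(DepthStats *d)
    {
        d->depth_sum += d->in_flight;
        d->samples++;
        /* published average = depth_sum / samples, reset per slice */
    }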
>
> Permit me a short detour into the other use for I/O accounting I
> mentioned: data on how exactly guest I/O gets translated to host I/O as
> it flows through the nodes in the block graph. Do you think this would
> be pretty much the same data, just collected at different points?
That's something I would like to take care of in a further subproject:
optionally collecting the same data for each BDS of the graph.
>
> > 4) Making this happen
> > -------------------------
> >
> > Outscale wants to make these I/O stats happen and gave me the go-ahead to
> > do whatever grunt work is required to do so.
> > That said, we could collaborate on some parts of the work.
>
> Cool!
>
> A quick stab at tasks:
>
> * QMP interface, either a compatible extension of query-blockstats or a
> new one.
I would like to extend query-blockstats at first, but I would also like to
postpone the QMP interface changes and just write the shared infrastructure
and deploy it in the device models.
>
> * Rough idea on how to do the shared infrastructure.
- API-wise I think about adding bdrv_acct_invalid() and bdrv_acct_failed(),
  and systematically issuing bdrv_acct_start() as soon as possible.
- To calculate the averages I think about a global timer firing every second
  and iterating over the BDS list to make the computations even when there
  is no I/O activity. Is it acceptable to have a QemuMutex per statistics
  structure?
>
> * Implement (can be split up into several tasks if desired)
First I would like to write a series implementing a backward-compatible API
and get it merged.
Then the deployment of the new API specifics in the device models can be
split up and parallelized.
Best regards
Benoît
>
> --
> libvir-list mailing list
> libvir-list@redhat.com
> https://www.redhat.com/mailman/listinfo/libvir-list
Thread overview: 14+ messages
2014-08-28 14:38 [Qemu-devel] IO accounting overhaul Benoît Canet
2014-08-29 16:04 ` Stefan Hajnoczi
2014-08-29 16:32 ` Benoît Canet
2014-09-01 9:52 ` Markus Armbruster
2014-09-01 10:44 ` Benoît Canet [this message]
2014-09-01 11:41 ` [Qemu-devel] [libvirt] " Markus Armbruster
2014-09-01 13:38 ` Benoît Canet
2014-09-02 13:59 ` Markus Armbruster
2014-09-05 14:30 ` Kevin Wolf
2014-09-05 14:56 ` Benoît Canet
2014-09-05 14:57 ` Benoît Canet
2014-09-05 15:24 ` Benoît Canet
2014-09-08 7:12 ` Markus Armbruster
2014-09-08 9:12 ` Kevin Wolf