From: "Benoît Canet" <benoit.canet@irqsave.net>
To: Markus Armbruster <armbru@redhat.com>
Cc: "Benoît Canet" <benoit.canet@irqsave.net>,
kwolf@redhat.com, libvir-list@redhat.com, qemu-devel@nongnu.org,
stefanha@redhat.com, anshul.makkar@profitbricks.com
Subject: Re: [Qemu-devel] [libvirt] IO accounting overhaul
Date: Mon, 1 Sep 2014 12:44:38 +0200 [thread overview]
Message-ID: <20140901104437.GA15673@irqsave.net> (raw)
In-Reply-To: <8761h7fz27.fsf@blackfin.pond.sub.org>
On Monday, 01 Sep 2014 at 11:52:00 (+0200), Markus Armbruster wrote:
> Cc'ing libvirt following Stefan's lead.
>
> Benoît Canet <benoit.canet@irqsave.net> writes:
>
> > Hi,
> >
> > I collected some items of a cloud provider wishlist regarding I/O accounting.
>
> Feedback from real power-users, lovely!
>
> > In a cloud, I/O accounting can serve three purposes: billing, helping the
> > customers, and doing metrology to help the cloud provider find hidden costs.
> >
> > I'll cover the first two topics in this mail because they are the most
> > important business-wise.
> >
> > 1) preferred place to collect billing I/O accounting data:
> > --------------------------------------------------------
> > For billing purposes the collected data must be as close as possible to what
> > the customer would see by running iostat in his VM.
>
> Good point.
>
> > The first conclusion we can draw is that the choice of collecting I/O
> > accounting data used for billing in the block device models is right.
>
> Slightly rephrasing: doing I/O accounting in the block device models is
> right for billing.
>
> There may be other uses for I/O accounting, with different preferences.
> For instance, data on how exactly guest I/O gets translated to host I/O
> as it flows through the nodes in the block graph could be useful.
I think this is the third point, the one I called metrology.
Basically it boils down to "where are the hidden I/O costs of the QEMU block layer?".
>
> Doesn't diminish the need for accurate billing information, of course.
>
> > 2) what to do with occurrences of rare events:
> > ---------------------------------------------
> >
> > Another point is that QEMU developers agree that they don't know which policy
> > to apply to some I/O accounting events.
> > Must QEMU discard an invalid write I/O or account it as done?
> > Must QEMU count a failed read I/O as done?
> >
> > When discussing this with a cloud provider, the following appeared: these
> > decisions are really specific to each cloud provider and QEMU should not
> > implement them.
>
> Good point, consistent with the old advice to avoid baking policy into
> inappropriately low levels of the stack.
>
> > The right thing to do is to add accounting counters to collect these events.
> >
> > Moreover, these rare events are precious troubleshooting data, so that's an
> > additional reason not to toss them.
>
> Another good point.
>
> > 3) list of block I/O accounting metrics wanted for billing and helping
> > the customers
> > -----------------------------------------------------------------------------------
> >
> > Basic I/O accounting data will end up making the customers' bills.
> > Extra I/O accounting information would be a precious help for the cloud
> > provider to implement a monitoring panel like Amazon CloudWatch.
>
> These are the first two from your list of three purposes, i.e. the ones
> you promised to cover here.
>
> > Here is the list of counters and statistics I would like to help implement
> > in QEMU.
> >
> > This is the most important part of the mail and the one I would like the
> > community to review the most.
> >
> > Once this list is settled I would proceed to implement the required
> > infrastructure in QEMU before using it in the device models.
>
> For context, let me recap how I/O accounting works now.
>
> The BlockDriverState abstract data type (short: BDS) can hold the
> following accounting data:
>
>     uint64_t nr_bytes[BDRV_MAX_IOTYPE];
>     uint64_t nr_ops[BDRV_MAX_IOTYPE];
>     uint64_t total_time_ns[BDRV_MAX_IOTYPE];
>     uint64_t wr_highest_sector;
>
> where BDRV_MAX_IOTYPE enumerates read, write, flush.
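(For context, the enum behind BDRV_MAX_IOTYPE looks roughly like this --
paraphrased from memory, not copied verbatim from block.h:)

    enum BlockAcctType {
        BDRV_ACCT_READ,
        BDRV_ACCT_WRITE,
        BDRV_ACCT_FLUSH,
        BDRV_MAX_IOTYPE,
    };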
>
> wr_highest_sector is a high watermark updated by the block layer as it
> writes sectors.
>
> The other three are *not* touched by the block layer. Instead, the
> block layer provides a pair of functions for device models to update
> them:
>
>     void bdrv_acct_start(BlockDriverState *bs, BlockAcctCookie *cookie,
>                          int64_t bytes, enum BlockAcctType type);
>     void bdrv_acct_done(BlockDriverState *bs, BlockAcctCookie *cookie);
>
> bdrv_acct_start() initializes cookie for a read, write, or flush
> operation of a certain size. The size of a flush is always zero.
>
> bdrv_acct_done() adds the operation to the BDS's accounting data.
> total_time_ns is incremented by the time between _start() and _done().
>
> You may call _start() without calling _done(). That's a feature.
> Device models use it to avoid accounting some requests.
>
> Device models are not supposed to mess with cookie directly, only
> through these two functions.
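For illustration, here is roughly how a device model wires this in. This is
only a sketch: ExampleRequest and the callback names are hypothetical, while
the bdrv_acct_*() and bdrv_aio_readv() calls match the API described above.

    static void example_read_cb(void *opaque, int ret)
    {
        ExampleRequest *req = opaque;

        if (ret == 0) {
            /* success: fold the request into the BDS accounting data */
            bdrv_acct_done(req->bs, &req->acct);
        }
        /* on failure the cookie is simply dropped -- the "feature"
         * mentioned above, and exactly where device models disagree */
    }

    static void example_submit_read(BlockDriverState *bs, ExampleRequest *req)
    {
        bdrv_acct_start(bs, &req->acct, req->nb_sectors * 512, BDRV_ACCT_READ);
        bdrv_aio_readv(bs, req->sector_num, &req->qiov, req->nb_sectors,
                       example_read_cb, req);
    }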
>
> Some device models implement accounting, some don't. The ones that do
> don't agree on how to count invalid guest requests (the ones not passed
> to block layer) and failed requests (passed to block layer and failed
> there). It's a mess in part caused by us never writing down what
> exactly device models are expected to do.
>
> Accounting data is used by "query-blockstats", and nothing else.
>
> Corollary: even though every BDS holds accounting data, only the ones in
> "top" BDSes ever get used. This is a common block layer blemish, and
> we're working on cleaning it up.
>
> If a device model doesn't implement accounting, query-blockstats lies.
> Fortunately, its lies are pretty transparent (everything's zero) as long
> as you don't do things like connecting a backend to a device model that
> doesn't implement accounting after disconnecting it from a device model
> that does. Still, I'd welcome a more honest QMP interface.
>
> For me, this accounting data belongs to the device model, not the BDS.
> Naturally, the block device models should use common infrastructure. I
> guess they use the block layer only because it's obvious infrastructure
> they share. Clumsy design.
>
> > /* volume of data transferred by the I/Os */
> > read_bytes
> > write_bytes
>
> This is nr_bytes[BDRV_ACCT_READ] and nr_bytes[BDRV_ACCT_WRITE].
>
> nr_bytes[BDRV_ACCT_FLUSH] is always zero.
>
> Should this count only actual I/O, i.e. accumulated size of successful
> operations?
>
> > /* operation count */
> > read_ios
> > write_ios
> > flush_ios
> >
> > /* how many invalid I/Os the guest submits */
> > invalid_read_ios
> > invalid_write_ios
> > invalid_flush_ios
> >
> > /* how many I/O errors happened */
> > read_ios_error
> > write_ios_error
> > flush_ios_error
>
> This is nr_ops[BDRV_ACCT_READ], nr_ops[BDRV_ACCT_WRITE],
> nr_ops[BDRV_ACCT_FLUSH] split up into successful, invalid and failed.
>
> > /* account the time spent doing I/Os */
> > total_read_time
> > total_write_time
> > total_flush_time
>
> This is total_time_ns[BDRV_ACCT_READ], total_time_ns[BDRV_ACCT_WRITE],
> total_time_ns[BDRV_ACCT_FLUSH].
>
> I guess this should count both successful and failed I/O. Could throw
> in invalid, too, but it's probably too quick to matter.
>
> Please specify the unit clearly. Either total_FOO_time_ns or total_FOO_ns
> would work for me.
Yes, _ns is fine for me too.
>
> > /* since when the volume is iddle */
> > qvolume_iddleness_time
>
> "idle"
>
> The obvious way to maintain this information with the current code would
> be to save the value of get_clock() in bdrv_acct_done().
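A minimal sketch of that idea -- the field name is hypothetical, while
get_clock() is the existing qemu-timer helper:

    /* in bdrv_acct_done(), after updating the counters: */
    bs->last_io_time_ns = get_clock();    /* hypothetical field */

    /* query side: how long the volume has been idle, in ns */
    idle_ns = get_clock() - bs->last_io_time_ns;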
>
> > /* the following would compute latencies for slices of 1 second, then toss
> >  * the result and start a new slice. A weighted summation of the instant
> >  * latencies could help to implement this.
> >  */
> > 1s_read_average_latency
> > 1s_write_average_latency
> > 1s_flush_average_latency
> >
> > /* the former three numbers could be used to further compute a 1
> > minute slice value */
> > 1m_read_average_latency
> > 1m_write_average_latency
> > 1m_flush_average_latency
> >
> > /* the former three numbers could be used to further compute a 1 hour
> > slice value */
> > 1h_read_average_latency
> > 1h_write_average_latency
> > 1h_flush_average_latency
>
> This is something like "what we added to total_FOO_time in the last
> completed 1s / 1m / 1h time slice divided by the number of additions".
> Just another way to accumulate the same raw data, thus no worries.
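A sketch of such a slice accumulator (all names hypothetical; one instance
per BDS and I/O type, driven by a 1s timer):

    typedef struct AcctSlice {
        uint64_t total_time_ns;  /* latency summed over the current slice */
        uint64_t count;          /* completions in the current slice */
        uint64_t avg_ns;         /* last published 1s average */
    } AcctSlice;

    /* called from the completion path with the request's latency */
    static void slice_account(AcctSlice *s, uint64_t latency_ns)
    {
        s->total_time_ns += latency_ns;
        s->count++;
    }

    /* called by a 1s timer: publish the average, start a new slice */
    static void slice_tick(AcctSlice *s)
    {
        s->avg_ns = s->count ? s->total_time_ns / s->count : 0;
        s->total_time_ns = 0;
        s->count = 0;
    }

The 1m and 1h values then fall out of averaging the last 60 / 3600 published
slices, or of the weighted running sum suggested above.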
>
> > /* 1 second average number of requests in flight */
> > 1s_read_queue_depth
> > 1s_write_queue_depth
> >
> > /* 1 minute average number of requests in flight */
> > 1m_read_queue_depth
> > 1m_write_queue_depth
> >
> > /* 1 hour average number of requests in flight */
> > 1h_read_queue_depth
> > 1h_write_queue_depth
>
> I guess this involves counting bdrv_acct_start() and bdrv_acct_done().
> The "you need not call bdrv_acct_done()" feature may get in the way.
> Solvable.
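The counting could look like this (a sketch, hypothetical names): increment
on _start(), decrement on completion or failure, and let the same periodic
timer sample the value:

    typedef struct DepthStats {
        unsigned in_flight;   /* ++ in bdrv_acct_start(), -- on completion */
        uint64_t depth_sum;   /* sum of samples in the current slice */
        unsigned samples;
    } DepthStats;

    /* called by the periodic timer */
    static void depth_sample(DepthStats *d)
    {
        d->depth_sum += d->in_flight;
        d->samples++;
        /* published average = depth_sum / samples, reset per slice */
    }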
>
> Permit me a short detour into the other use for I/O accounting I
> mentioned: data on how exactly guest I/O gets translated to host I/O as
> it flows through the nodes in the block graph. Do you think this would
> be pretty much the same data, just collected at different points?
That's something I would like to take care of in a further subproject:
optionally collecting the same data for each BDS of the graph.
>
> > 4) Making this happen
> > -------------------------
> >
> > Outscale wants to make these I/O stats happen and gave me the go-ahead to
> > do whatever grunt work is required to do so.
> > That said, we could collaborate on some parts of the work.
>
> Cool!
>
> A quick stab at tasks:
>
> * QMP interface, either a compatible extension of query-blockstats or a
> new one.
I would like to extend query-blockstats at first, but I would also like to
postpone the QMP interface changes and just write the shared infrastructure
and deploy it in the device models.
>
> * Rough idea on how to do the shared infrastructure.
- API-wise I think about adding bdrv_acct_invalid() and bdrv_acct_failed(),
  and systematically issuing bdrv_acct_start() as soon as possible.
- To calculate the averages I think about a global timer firing every second
  and iterating over the BDS list to make the computations even when there
  is no I/O activity. Is it acceptable to have a QemuMutex per statistics
  structure?
>
> * Implement (can be split up into several tasks if desired)
First I would like to write a series implementing a backward-compatible API
and get it merged.
Then the deployment of the new API specifics in the device models can be
split up and parallelized.
Best regards
Benoît
>
> --
> libvir-list mailing list
> libvir-list@redhat.com
> https://www.redhat.com/mailman/listinfo/libvir-list
Thread overview: 14+ messages
2014-08-28 14:38 [Qemu-devel] IO accounting overhaul Benoît Canet
2014-08-29 16:04 ` Stefan Hajnoczi
2014-08-29 16:32 ` Benoît Canet
2014-09-01 9:52 ` Markus Armbruster
2014-09-01 10:44 ` Benoît Canet [this message]
2014-09-01 11:41 ` [Qemu-devel] [libvirt] " Markus Armbruster
2014-09-01 13:38 ` Benoît Canet
2014-09-02 13:59 ` Markus Armbruster
2014-09-05 14:30 ` Kevin Wolf
2014-09-05 14:56 ` Benoît Canet
2014-09-05 14:57 ` Benoît Canet
2014-09-05 15:24 ` Benoît Canet
2014-09-08 7:12 ` Markus Armbruster
2014-09-08 9:12 ` Kevin Wolf