Re: [PATCH 0/4] block: Per-partition block IO performance histograms


From: Jens Axboe <jens.axboe@oracle.com>
To: Divyesh Shah <dpshah@google.com>
Cc: linux-kernel@vger.kernel.org, nauman@google.com, rickyb@google.com
Subject: Re: [PATCH 0/4] block: Per-partition block IO performance histograms
Date: Thu, 15 Apr 2010 12:29:56 +0200
Message-ID: <20100415102956.GV27497@kernel.dk>
In-Reply-To: <20100415054057.15836.17897.stgit@austin.mtv.corp.google.com>

On Wed, Apr 14 2010, Divyesh Shah wrote:
> The following patchset implements per partition 2-d histograms for IO to block
> devices. The 3 types of histograms added are:
> 
> 1) request histograms - 2-d histogram of total request time in ms (queueing +
>    service) broken down by IO size (in bytes).
> 2) dma histograms - 2-d histogram of total service time in ms broken down by
>    IO size (in bytes).
> 3) seek histograms - 1-d histogram of seek distance
> 
> All of these histograms are per-partition. The first 2 are further divided into
> separate read and write histograms. The buckets for these histograms are
> configurable via config options as well as at runtime (per-device).
> 
> These histograms have proven very valuable to us over the years to understand
> the seek distribution of IOs over our production machines, detect large
> queueing delays, find latency outliers, etc. by being used as part of an
> always-on monitoring system.
> 
> They can be reset by writing any value to them which makes them useful for
> tests and debugging too.
> 
> This was initially written by Edward Falk in 2006 and I've forward ported
> and improved it a few times across kernel versions.
> 
> He had also sent a very old version of this patchset (minus some features like
> runtime configurable buckets) back then to lkml - see
> http://lkml.indiana.edu/hypermail/linux/kernel/0611.1/2684.html
> Some of the reasons mentioned for not including these patches are given below.
> 
> I'm requesting re-consideration for this patchset in light of the following
> arguments.
> 
> 1) This can be done with blktrace too, why add another API?
> 
> Yes blktrace can be used to get this kind of information w/ some help from
> userspace post-processing. However, to use this as an always-on monitoring tool
> w/ blktrace and have negligible performance overhead is difficult to achieve.
> I did a quick 10-thread iozone direct IO write phase run w/ and w/o blktrace
> on a traditional rotational disk to get a feel of the impact on throughput.
> This was kernel built from Jens' for-2.6.35 branch and did not have these new
> block histogram patches.
>   o w/o blktrace:
>         Children see throughput for 10 initial writers  =   95211.22 KB/sec
>         Parent sees throughput for 10 initial writers   =   37593.20 KB/sec
>         Min throughput per thread                       =    9078.65 KB/sec
>         Max throughput per thread                       =   10055.59 KB/sec
>         Avg throughput per thread                       =    9521.12 KB/sec
>         Min xfer                                        =  462848.00 KB
> 
>   o w/ blktrace:
>         Children see throughput for 10 initial writers  =   93527.98 KB/sec
>         Parent sees throughput for 10 initial writers   =   38594.47 KB/sec
>         Min throughput per thread                       =    9197.06 KB/sec
>         Max throughput per thread                       =    9640.09 KB/sec
>         Avg throughput per thread                       =    9352.80 KB/sec
>         Min xfer                                        =  490496.00 KB
> 
> This is about 1.8% average throughput loss per thread.
> The extra cpu time spent with blktrace is in addition to this loss of
> throughput. This overhead will only go up on faster SSDs.
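
As a sanity check on the 1.8% figure quoted above, here is a minimal
sketch that recomputes the relative loss from the two "Avg throughput
per thread" values in the iozone output. The function and variable names
are illustrative only; nothing here comes from the patchset.

```python
# Average per-thread throughput figures quoted in the mail, in KB/sec.
avg_without_blktrace = 9521.12
avg_with_blktrace = 9352.80


def relative_loss_percent(baseline, measured):
    """Return the drop from baseline to measured as a percentage of baseline."""
    return (baseline - measured) / baseline * 100.0


loss = relative_loss_percent(avg_without_blktrace, avg_with_blktrace)
print(f"average per-thread throughput loss: {loss:.2f}%")
```

The result rounds to the "about 1.8%" stated in the quoted text. It
covers throughput only; the extra CPU time mentioned above is not
captured by this calculation.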

blktrace definitely has a bit of overhead, even if I tried to keep it at
a minimum. I'm not too crazy about adding all this extra accounting for
something we can already get with the tracing that we have available.

The above blktrace run, I take it that was just a regular unmasked run?
Did you try and tailor the information logged? If you restricted to
logging just the particular event(s) that you need to generate this data,
the overhead would be a LOT smaller.
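
To illustrate the kind of userspace post-processing this implies, here
is a rough sketch that bins per-request (size, time) records into a 2-d
histogram. It assumes the trace events have already been paired into
one record per request, and the bucket boundaries are hypothetical; the
patchset's actual defaults are not shown in this thread.

```python
from bisect import bisect_left
from collections import defaultdict

# Hypothetical bucket upper bounds (inclusive); not the patchset's values.
SIZE_BOUNDS_BYTES = [4096, 16384, 65536, 262144]
TIME_BOUNDS_MS = [1, 10, 100, 1000]


def bucket_index(value, bounds):
    """Index of the first bucket whose upper bound is >= value.

    A result equal to len(bounds) means the overflow bucket.
    """
    return bisect_left(bounds, value)


def build_histogram(requests):
    """Count requests per (size bucket, time bucket) cell.

    `requests` is an iterable of (size_in_bytes, time_in_ms) pairs.
    """
    hist = defaultdict(int)
    for size_bytes, time_ms in requests:
        cell = (bucket_index(size_bytes, SIZE_BOUNDS_BYTES),
                bucket_index(time_ms, TIME_BOUNDS_MS))
        hist[cell] += 1
    return dict(hist)


# Made-up sample records, for illustration only.
sample = [(4096, 0.5), (4096, 12.0), (131072, 8.0), (1048576, 2500.0)]
histogram = build_histogram(sample)
for cell in sorted(histogram):
    print(cell, histogram[cell])
```

Keeping separate read and write histograms, as the patchset describes,
would just mean running this once per direction.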

> 2) sysfs should be only for one value per file. There are some exceptions but we
>    are working on fixing them. Please don't add new ones.
> 
> There are exceptions like meminfo, etc. that violate this guideline (I'm not
> sure if it's an enforced rule) and some actually make sense since there is no way
> of representing structured data. Though these block histograms are multi-valued
> one can also interpret them as one logical piece of information.

Not a problem in my book. There's also the case of giving a real
snapshot of the information as opposed to collecting from several files.

-- 
Jens Axboe
