Re: sard/iostat disk I/O statistics/accounting for 2.5.8-pre3

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

From: Don Dupuis <ddupuissprint@earthlink.net>
To: zlatko.calusic@iskon.hr
Cc: linux-kernel@vger.kernel.org,
	"Stephen C. Tweedie" <sct@redhat.com>, Jens Axboe <axboe@suse.de>,
	Alan Cox <alan@lxorguk.ukuu.org.uk>
Subject: Re: sard/iostat disk I/O statistics/accounting for 2.5.8-pre3
Date: 11 Apr 2002 21:23:20 -0500	[thread overview]
Message-ID: <1018578201.4395.22.camel@linux-ath.linux.dev.com> (raw)
In-Reply-To: <dnu1qia3zg.fsf@magla.zg.iskon.hr>

I have a program called cspm at http://sourceforge.net/projects/cspm
that uses Stephen's sard patches.  This program allows showing stats on
all disks, controllers, and on a systemwide basis.  It will show ios,
uses, merges, and blocks per second.  This program was written in QT.
Please have a look at it and provide some feedback.

Thanks.

Don Dupuis

On Thu, 2002-04-11 at 06:06, Zlatko Calusic wrote:
> Hi!
> 
> I have ported Stephen's excellent sard I/O accounting patch to
> 2.5.8-pre3. Additionaly, I have adapted/cleaned up iostat utility
> (originally written by Greg Franks) and created a man page for it.
> 
> All this gives Linux a much needed functionality - extensive
> measurements and reporting of disk device performance/bottlenecks.
> No server should be without it. See below for a real life story.
> 
> The patch, iostat utility and man page can be found at
> 
>  <URL:http://linux.inet.hr/>
> 
> Also, find the patch appended to this mail, for quicker browsing.
> 
> 
> Some observations:
> -----------------
> 
> The patch is quite well tested and stable (that is, it shouldn't eat
> your machine :)). Both on UP & SMP configurations, but I have tested
> it only on IDE hardware. Please test it on your configuration
> (especially SCSI) and report results.
> 
> I had to make some changes, as kernel internals have changed since the
> time patch was originally written. Alan has also included this patch
> in his 2.4.x-ac kernel series, but it is not working well.
> 
> First problem is that somehow, misteriously, ios_in_flight variable
> drops to value of -1 when disks are idle. Of course, this skews lots
> of other numbers and iostat reports garbage. I tried to find the cause
> of this behaviour, but failed (looks like we have a request fired on
> each disk, whose start is never accounted but completion is?!). So I
> resolved it this way
> 
> if (hd->ios_in_flight)
>                --hd->ios_in_flight;
> 
> which works well, but I would still love to know how number of I/Os
> can drop below zero. :)
> 
> Second problem/nuisance is that blk_partition_remap() destroys
> partition information from the bio->bi_dev before the request is
> queued. That's why -ac kernel doesn't report per-partition
> information. To resolve that, I had to put additional code in
> locate_hd_struct() to extract partition information back again which
> is kind of ugly (first we destroy it, then we recreate it again). The
> other way we could solve this is to put something like bio->bi_acctdev
> in bio structure and keep original device there, but that is not only
> a duplication of info (which is also ugly), but would also waste
> precious memory. I don't know if block re-mapping which is happening
> in the abovementioned function is strictly necessary. Jens, could you
> help with that one and/or give a suggestion which way to go with that
> one?
> 
> 
> TODO:
> ----
> 
> Please check the patch on various hardware you're equipped with and
> report success/failure. Somebody should really go through the man page
> and correct grammar mistakes :), maybe make it more readable in
> places. By the time this patch is ready for inclusion in the Linus
> kernel, old disk statistics counters should be removed (as they are
> buggy, incomplete and obsolete).
> 
> This patch w/ iostat utility is probably a necessary ingredient to
> make quality measurement of the new IDE-TCQ code. Too bad I have a
> Maxtor drive on a Promise controller here (any hope of me running TCQ
> on that combination in the future?).
> 
> 
> Real life story:
> ---------------
> 
> So, I had this applications heavily inserting data in the Oracle
> database. And its performance mattered to me. As the disk rattled all
> the time inserts were going on, I decided I needed to spend some time
> reordering datafiles to achieve the peak performance (disk I/O was a
> performance bottleneck, obviously). But after an hour or two I got
> only few percents improvement. How could that happen?!
> 
> Well, it happened that the database and application were already tuned
> quite well, and the load on database was (correctly) CPU based. As I
> have an SMP machine, I forgot that 40% free CPU could be on that
> second processor, obviously unusable in the app in question. And disk
> rattling was normal, lots of fsyncs, some seeks...
> 
> So, when I finally managed to port this patch to 2.5.x (we are
> speaking of a home development database, 2.5.x kernels are OK for
> those, right?! :)) iostat showed only ~20% disk utilization!  That is,
> there was _NO_ heavy I/O contention and disks were _NOT_ a culprit.
> Live and learn.
> 
> If only I had known that before I wasted my time needlessly moving big
> datafiles around and fixing configuration... :(
> 
> 
> Regards,
> 
> Zlatko
> 
> 
> Index: 7.12/include/linux/blkdev.h
> --- 7.12/include/linux/blkdev.h Wed, 10 Apr 2002 09:50:17 +0200 zcalusic (linux25/n/b/1_blkdev.h 1.1.1.1.1.1.1.2.2.1 644)
> +++ 7.14/include/linux/blkdev.h Wed, 10 Apr 2002 10:02:44 +0200 zcalusic (linux25/n/b/1_blkdev.h 1.1.1.1.1.1.1.2.2.2 644)
> @@ -34,6 +34,7 @@
>  	int rq_status;	/* should split this into a few status bits */
>  	kdev_t rq_dev;
>  	int errors;
> +	unsigned long start_time;
>  	sector_t sector;
>  	unsigned long nr_sectors;
>  	unsigned long hard_sector;	/* the hard_* are block layer
> Index: 7.12/include/linux/genhd.h
> --- 7.12/include/linux/genhd.h Tue, 15 Jan 2002 22:59:02 +0100 zcalusic (linux25/o/b/18_genhd.h 1.1.1.1.1.1 644)
> +++ 7.14/include/linux/genhd.h Wed, 10 Apr 2002 10:02:44 +0200 zcalusic (linux25/o/b/18_genhd.h 1.1.1.1.1.1.2.1 644)
> @@ -61,6 +61,22 @@
>  	unsigned long start_sect;
>  	unsigned long nr_sects;
>  	devfs_handle_t de;              /* primary (master) devfs entry  */
> +
> +	/* Performance stats: */
> +	unsigned int ios_in_flight;
> +	unsigned int io_ticks;
> +	unsigned int last_idle_time;
> +	unsigned int last_queue_change;
> +	unsigned int aveq;
> +
> +	unsigned int rd_ios;
> +	unsigned int rd_merges;
> +	unsigned int rd_ticks;
> +	unsigned int rd_sectors;
> +	unsigned int wr_ios;
> +	unsigned int wr_merges;
> +	unsigned int wr_ticks;
> +	unsigned int wr_sectors;
>  	int number;                     /* stupid old code wastes space  */
>  };
>  
> @@ -238,6 +254,19 @@
>  #ifdef __KERNEL__
>  
>  char *disk_name (struct gendisk *hd, int minor, char *buf);
> +
> +/*
> + * disk_round_stats is used to round off the IO statistics for a disk
> + * for a complete clock tick.
> + */
> +void disk_round_stats(struct hd_struct *hd);
> +
> +/* 
> + * Account for the completion of an IO request (used by drivers which 
> + * bypass the normal end_request processing) 
> + */
> +struct request;
> +void req_finished_io(struct request *);
>  
>  extern void devfs_register_partitions (struct gendisk *dev, int minor,
>  				       int unregister);
> Index: 7.12/drivers/scsi/scsi_lib.c
> --- 7.12/drivers/scsi/scsi_lib.c Tue, 12 Feb 2002 09:26:34 +0100 zcalusic (linux25/j/c/26_scsi_lib.c 1.1.1.1.1.2 644)
> +++ 7.14/drivers/scsi/scsi_lib.c Wed, 10 Apr 2002 10:02:44 +0200 zcalusic (linux25/j/c/26_scsi_lib.c 1.1.1.1.1.2.1.1 644)
> @@ -382,6 +382,7 @@
>  	 */
>  	if (req->waiting)
>  		complete(req->waiting);
> +	req_finished_io(req);
>  
>  	add_blkdev_randomness(major(req->rq_dev));
>  
> Index: 7.12/drivers/block/genhd.c
> --- 7.12/drivers/block/genhd.c Wed, 10 Apr 2002 09:50:17 +0200 zcalusic (linux25/x/c/3_genhd.c 1.1.1.1.1.1.3.1 644)
> +++ 7.14/drivers/block/genhd.c Wed, 10 Apr 2002 10:02:44 +0200 zcalusic (linux25/x/c/3_genhd.c 1.1.1.1.1.1.3.2 644)
> @@ -168,6 +168,10 @@
>  	read_unlock(&gendisk_lock);
>  }
>  
> +
> +/* Normalise the disk performance stats to a notional timer tick of 1ms. */
> +#define MSEC(x) ((x) * 1000 / HZ)
> +
>  static int show_partition(struct seq_file *part, void *v)
>  {
>  	struct gendisk *sgp = v;
> @@ -175,19 +179,34 @@
>  	char buf[64];
>  
>  	if (sgp == gendisk_head)
> -		seq_puts(part, "major minor  #blocks  name\n\n");
> +		seq_puts(part, "major minor  #blocks  name "
> +			 "rio rmerge rsect ruse "
> +			 "wio wmerge wsect wuse "
> +			 "running use aveq\n\n");
>  
>  	/* show all non-0 size partitions of this disk */
>  	for (n = 0; n < (sgp->nr_real << sgp->minor_shift); n++) {
> -		if (sgp->part[n].nr_sects == 0)
> +		struct hd_struct *hd = &sgp->part[n];
> +
> +		if (!hd->nr_sects)
>  			continue;
> -		seq_printf(part, "%4d  %4d %10d %s\n",
> -			sgp->major, n, sgp->sizes[n],
> -			disk_name(sgp, n, buf));
> +		disk_round_stats(hd);
> +		seq_printf(part, "%4d  %4d %10d %s "
> +			   "%d %d %d %d %d %d %d %d %d %d %d\n",
> +			   sgp->major, n, sgp->sizes[n],
> +			   disk_name(sgp, n, buf),
> +			   hd->rd_ios, hd->rd_merges,
> +			   hd->rd_sectors, MSEC(hd->rd_ticks),
> +			   hd->wr_ios, hd->wr_merges,
> +			   hd->wr_sectors, MSEC(hd->wr_ticks),
> +			   hd->ios_in_flight, MSEC(hd->io_ticks),
> +			   MSEC(hd->aveq));
>  	}
>  
>  	return 0;
>  }
> +
> +#undef MSEC
>  
>  struct seq_operations partitions_op = {
>  	start:	part_start,
> Index: 7.12/drivers/block/ll_rw_blk.c
> --- 7.12/drivers/block/ll_rw_blk.c Wed, 10 Apr 2002 09:50:17 +0200 zcalusic (linux25/x/c/9_ll_rw_blk. 1.1.1.1.1.1.1.1.1.1.3.1.1.1 644)
> +++ 7.14/drivers/block/ll_rw_blk.c Wed, 10 Apr 2002 18:37:24 +0200 zcalusic (linux25/x/c/9_ll_rw_blk. 1.1.1.1.1.1.1.1.1.1.3.1.1.3 644)
> @@ -944,6 +944,136 @@
>  		printk(KERN_ERR "drive_stat_acct: cmd not R/W?\n");
>  }
>  
> +/* Return up to two hd_structs on which to do IO accounting for a given
> + * request.  On a partitioned device, we want to account both against
> + * the partition and against the whole disk.
> + */
> +static void locate_hd_struct(struct request *req, 
> +			     struct hd_struct **hd1,
> +			     struct hd_struct **hd2)
> +{
> +	int p, n_part;
> +	struct gendisk *gd;
> +	struct hd_struct *hd;
> +
> +	*hd1 = NULL;
> +	*hd2 = NULL;
> +
> +	gd = get_gendisk(req->rq_dev);
> +	if (gd && gd->part) {
> +		*hd1 = &gd->part[0];
> +
> +		/* Try to reconstruct the partition */
> +		n_part = 1 << gd->minor_shift;
> +		for (p = 1; p < n_part; p++) {
> +			hd = &gd->part[p];
> +			/* skip empty and extended partitions */
> +			if (hd->nr_sects < 3)
> +				continue;
> +			if (req->sector >= hd->start_sect
> +			    && req->sector < hd->start_sect + hd->nr_sects)
> +				break;
> +		}
> +		if (p < n_part)
> +			*hd2 = hd;
> +	}
> +}
> +
> +/* Round off the performance stats on an hd_struct.  The average IO
> + * queue length and utilisation statistics are maintained by observing
> + * the current state of the queue length and the amount of time it has
> + * been in this state for.  Normally, that accounting is done on IO
> + * completion, but that can result in more than a second's worth of IO
> + * being accounted for within any one second, leading to >100%
> + * utilisation.  To deal with that, we do a round-off before returning
> + * the results when reading /proc/partitions, accounting immediately for
> + * all queue usage up to the current jiffies and restarting the counters
> + * again.
> + */
> +void disk_round_stats(struct hd_struct *hd)
> +{
> +	unsigned long now = jiffies;
> +
> +	hd->aveq += hd->ios_in_flight * (now - hd->last_queue_change);
> +	hd->last_queue_change = now;
> +
> +	if (hd->ios_in_flight)
> +		hd->io_ticks += (now - hd->last_idle_time);
> +	hd->last_idle_time = now;
> +}
> +
> +static inline void down_ios(struct hd_struct *hd)
> +{
> +	disk_round_stats(hd);
> +	if (hd->ios_in_flight)
> +		--hd->ios_in_flight;
> +}
> +
> +static inline void up_ios(struct hd_struct *hd)
> +{
> +	disk_round_stats(hd);
> +	++hd->ios_in_flight;
> +}
> +
> +static void account_io_start(struct hd_struct *hd, struct request *req,
> +			     int merge, int sectors)
> +{
> +	switch (rq_data_dir(req)) {
> +	case READ:
> +		if (merge)
> +			hd->rd_merges++;
> +		hd->rd_sectors += sectors;
> +		break;
> +	case WRITE:
> +		if (merge)
> +			hd->wr_merges++;
> +		hd->wr_sectors += sectors;
> +		break;
> +	default:
> +	}
> +	if (!merge)
> +		up_ios(hd);
> +}
> +
> +static void account_io_end(struct hd_struct *hd, struct request *req)
> +{
> +	unsigned long duration = jiffies - req->start_time;
> +	switch (rq_data_dir(req)) {
> +	case READ:
> +		hd->rd_ticks += duration;
> +		hd->rd_ios++;
> +		break;
> +	case WRITE:
> +		hd->wr_ticks += duration;
> +		hd->wr_ios++;
> +		break;
> +	default:
> +	}
> +	down_ios(hd);
> +}
> +
> +void req_new_io(struct request *req, int merge, int sectors)
> +{
> +	struct hd_struct *hd1, *hd2;
> +
> +	locate_hd_struct(req, &hd1, &hd2);
> +	if (hd1)
> +		account_io_start(hd1, req, merge, sectors);
> +	if (hd2)
> +		account_io_start(hd2, req, merge, sectors);
> +}
> +
> +void req_finished_io(struct request *req)
> +{
> +	struct hd_struct *hd1, *hd2;
> +
> +	locate_hd_struct(req, &hd1, &hd2);
> +	if (hd1)
> +		account_io_end(hd1, req);
> +	if (hd2)
> +		account_io_end(hd2, req);
> +}
> +
>  /*
>   * add-request adds a request to the linked list.
>   * queue lock is held and interrupts disabled, as we muck with the
> @@ -1018,6 +1148,8 @@
>  	 * counts here.
>  	 */
>  	if (q->merge_requests_fn(q, req, next)) {
> +		struct hd_struct *hd1, *hd2;
> +
>  		elv_merge_requests(q, req, next);
>  
>  		blkdev_dequeue_request(next);
> @@ -1027,6 +1159,15 @@
>  
>  		req->nr_sectors = req->hard_nr_sectors += next->hard_nr_sectors;
>  
> +		/* One last thing: we have removed a request, so we now have
> +		 * one less expected IO to complete for accounting purposes.
> +		 */
> +		locate_hd_struct(req, &hd1, &hd2);
> +		if (hd1)
> +			down_ios(hd1);
> +		if (hd2)	
> +			down_ios(hd2);
> +
>  		blkdev_release_request(next);
>  	}
>  }
> @@ -1117,6 +1258,7 @@
>  			req->biotail = bio;
>  			req->nr_sectors = req->hard_nr_sectors += nr_sectors;
>  			drive_stat_acct(req, nr_sectors, 0);
> +			req_new_io(req, 1, nr_sectors);
>  			attempt_back_merge(q, req);
>  			goto out;
>  
> @@ -1142,6 +1284,7 @@
>  			req->sector = req->hard_sector = sector;
>  			req->nr_sectors = req->hard_nr_sectors += nr_sectors;
>  			drive_stat_acct(req, nr_sectors, 0);
> +			req_new_io(req, 1, nr_sectors);
>  			attempt_front_merge(q, req);
>  			goto out;
>  
> @@ -1210,6 +1353,8 @@
>  	req->waiting = NULL;
>  	req->bio = req->biotail = bio;
>  	req->rq_dev = bio->bi_dev;
> +	req->start_time = jiffies;
> +	req_new_io(req, 0, nr_sectors);
>  	add_request(q, req, insert_here);
>  out:
>  	if (freereq)
> @@ -1649,6 +1794,7 @@
>  {
>  	if (req->waiting)
>  		complete(req->waiting);
> +	req_finished_io(req);
>  
>  	blkdev_release_request(req);
>  }
> 
> -- 
> Zlatko
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

next prev parent reply	other threads:[~2002-04-12  2:22 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2002-04-11 11:06 sard/iostat disk I/O statistics/accounting for 2.5.8-pre3 Zlatko Calusic
2002-04-11 14:02 ` Christoph Hellwig
2002-04-11 17:20   ` Zlatko Calusic
2002-04-11 21:09     ` Mike Fedyk
2002-04-11 23:06       ` Zlatko Calusic
2002-04-12 10:28         ` Jochen Suckfuell
2002-04-12  2:23 ` Don Dupuis [this message]
2002-04-14 19:06   ` Zlatko Calusic
  -- strict thread matches above, loose matches on Subject: below --
2002-04-25 11:51 Thomas Tonino

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1018578201.4395.22.camel@linux-ath.linux.dev.com \
    --to=ddupuissprint@earthlink.net \
    --cc=alan@lxorguk.ukuu.org.uk \
    --cc=axboe@suse.de \
    --cc=linux-kernel@vger.kernel.org \
    --cc=sct@redhat.com \
    --cc=zlatko.calusic@iskon.hr \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox