All of lore.kernel.org
 help / color / mirror / Atom feed
From: Don Dupuis <ddupuissprint@earthlink.net>
To: zlatko.calusic@iskon.hr
Cc: linux-kernel@vger.kernel.org,
	"Stephen C. Tweedie" <sct@redhat.com>, Jens Axboe <axboe@suse.de>,
	Alan Cox <alan@lxorguk.ukuu.org.uk>
Subject: Re: sard/iostat disk I/O statistics/accounting for 2.5.8-pre3
Date: 11 Apr 2002 21:23:20 -0500	[thread overview]
Message-ID: <1018578201.4395.22.camel@linux-ath.linux.dev.com> (raw)
In-Reply-To: <dnu1qia3zg.fsf@magla.zg.iskon.hr>

I have a program called cspm at http://sourceforge.net/projects/cspm
that uses Stephen's sard patches.  This program allows showing stats on
all disks, controllers, and on a systemwide basis.  It will show ios,
uses, merges, and blocks per second.  This program was written in QT.
Please have a look at it and provide some feedback.

Thanks.

Don Dupuis

On Thu, 2002-04-11 at 06:06, Zlatko Calusic wrote:
> Hi!
> 
> I have ported Stephen's excellent sard I/O accounting patch to
> 2.5.8-pre3. Additionaly, I have adapted/cleaned up iostat utility
> (originally written by Greg Franks) and created a man page for it.
> 
> All this gives Linux a much needed functionality - extensive
> measurements and reporting of disk device performance/bottlenecks.
> No server should be without it. See below for a real life story.
> 
> The patch, iostat utility and man page can be found at
> 
>  <URL:http://linux.inet.hr/>
> 
> Also, find the patch appended to this mail, for quicker browsing.
> 
> 
> Some observations:
> -----------------
> 
> The patch is quite well tested and stable (that is, it shouldn't eat
> your machine :)). Both on UP & SMP configurations, but I have tested
> it only on IDE hardware. Please test it on your configuration
> (especially SCSI) and report results.
> 
> I had to make some changes, as kernel internals have changed since the
> time patch was originally written. Alan has also included this patch
> in his 2.4.x-ac kernel series, but it is not working well.
> 
> First problem is that somehow, misteriously, ios_in_flight variable
> drops to value of -1 when disks are idle. Of course, this skews lots
> of other numbers and iostat reports garbage. I tried to find the cause
> of this behaviour, but failed (looks like we have a request fired on
> each disk, whose start is never accounted but completion is?!). So I
> resolved it this way
> 
> if (hd->ios_in_flight)
>                --hd->ios_in_flight;
> 
> which works well, but I would still love to know how number of I/Os
> can drop below zero. :)
> 
> Second problem/nuisance is that blk_partition_remap() destroys
> partition information from the bio->bi_dev before the request is
> queued. That's why -ac kernel doesn't report per-partition
> information. To resolve that, I had to put additional code in
> locate_hd_struct() to extract partition information back again which
> is kind of ugly (first we destroy it, then we recreate it again). The
> other way we could solve this is to put something like bio->bi_acctdev
> in bio structure and keep original device there, but that is not only
> a duplication of info (which is also ugly), but would also waste
> precious memory. I don't know if block re-mapping which is happening
> in the abovementioned function is strictly necessary. Jens, could you
> help with that one and/or give a suggestion which way to go with that
> one?
> 
> 
> TODO:
> ----
> 
> Please check the patch on various hardware you're equipped with and
> report success/failure. Somebody should really go through the man page
> and correct grammar mistakes :), maybe make it more readable in
> places. By the time this patch is ready for inclusion in the Linus
> kernel, old disk statistics counters should be removed (as they are
> buggy, incomplete and obsolete).
> 
> This patch w/ iostat utility is probably a necessary ingredient to
> make quality measurement of the new IDE-TCQ code. Too bad I have a
> Maxtor drive on a Promise controller here (any hope of me running TCQ
> on that combination in the future?).
> 
> 
> Real life story:
> ---------------
> 
> So, I had this applications heavily inserting data in the Oracle
> database. And its performance mattered to me. As the disk rattled all
> the time inserts were going on, I decided I needed to spend some time
> reordering datafiles to achieve the peak performance (disk I/O was a
> performance bottleneck, obviously). But after an hour or two I got
> only few percents improvement. How could that happen?!
> 
> Well, it happened that the database and application were already tuned
> quite well, and the load on database was (correctly) CPU based. As I
> have an SMP machine, I forgot that 40% free CPU could be on that
> second processor, obviously unusable in the app in question. And disk
> rattling was normal, lots of fsyncs, some seeks...
> 
> So, when I finally managed to port this patch to 2.5.x (we are
> speaking of a home development database, 2.5.x kernels are OK for
> those, right?! :)) iostat showed only ~20% disk utilization!  That is,
> there was _NO_ heavy I/O contention and disks were _NOT_ a culprit.
> Live and learn.
> 
> If only I had known that before I wasted my time needlessly moving big
> datafiles around and fixing configuration... :(
> 
> 
> Regards,
> 
> Zlatko
> 
> 
> Index: 7.12/include/linux/blkdev.h
> --- 7.12/include/linux/blkdev.h Wed, 10 Apr 2002 09:50:17 +0200 zcalusic (linux25/n/b/1_blkdev.h 1.1.1.1.1.1.1.2.2.1 644)
> +++ 7.14/include/linux/blkdev.h Wed, 10 Apr 2002 10:02:44 +0200 zcalusic (linux25/n/b/1_blkdev.h 1.1.1.1.1.1.1.2.2.2 644)
> @@ -34,6 +34,7 @@
>  	int rq_status;	/* should split this into a few status bits */
>  	kdev_t rq_dev;
>  	int errors;
> +	unsigned long start_time;
>  	sector_t sector;
>  	unsigned long nr_sectors;
>  	unsigned long hard_sector;	/* the hard_* are block layer
> Index: 7.12/include/linux/genhd.h
> --- 7.12/include/linux/genhd.h Tue, 15 Jan 2002 22:59:02 +0100 zcalusic (linux25/o/b/18_genhd.h 1.1.1.1.1.1 644)
> +++ 7.14/include/linux/genhd.h Wed, 10 Apr 2002 10:02:44 +0200 zcalusic (linux25/o/b/18_genhd.h 1.1.1.1.1.1.2.1 644)
> @@ -61,6 +61,22 @@
>  	unsigned long start_sect;
>  	unsigned long nr_sects;
>  	devfs_handle_t de;              /* primary (master) devfs entry  */
> +
> +	/* Performance stats: */
> +	unsigned int ios_in_flight;
> +	unsigned int io_ticks;
> +	unsigned int last_idle_time;
> +	unsigned int last_queue_change;
> +	unsigned int aveq;
> +
> +	unsigned int rd_ios;
> +	unsigned int rd_merges;
> +	unsigned int rd_ticks;
> +	unsigned int rd_sectors;
> +	unsigned int wr_ios;
> +	unsigned int wr_merges;
> +	unsigned int wr_ticks;
> +	unsigned int wr_sectors;
>  	int number;                     /* stupid old code wastes space  */
>  };
>  
> @@ -238,6 +254,19 @@
>  #ifdef __KERNEL__
>  
>  char *disk_name (struct gendisk *hd, int minor, char *buf);
> +
> +/*
> + * disk_round_stats is used to round off the IO statistics for a disk
> + * for a complete clock tick.
> + */
> +void disk_round_stats(struct hd_struct *hd);
> +
> +/* 
> + * Account for the completion of an IO request (used by drivers which 
> + * bypass the normal end_request processing) 
> + */
> +struct request;
> +void req_finished_io(struct request *);
>  
>  extern void devfs_register_partitions (struct gendisk *dev, int minor,
>  				       int unregister);
> Index: 7.12/drivers/scsi/scsi_lib.c
> --- 7.12/drivers/scsi/scsi_lib.c Tue, 12 Feb 2002 09:26:34 +0100 zcalusic (linux25/j/c/26_scsi_lib.c 1.1.1.1.1.2 644)
> +++ 7.14/drivers/scsi/scsi_lib.c Wed, 10 Apr 2002 10:02:44 +0200 zcalusic (linux25/j/c/26_scsi_lib.c 1.1.1.1.1.2.1.1 644)
> @@ -382,6 +382,7 @@
>  	 */
>  	if (req->waiting)
>  		complete(req->waiting);
> +	req_finished_io(req);
>  
>  	add_blkdev_randomness(major(req->rq_dev));
>  
> Index: 7.12/drivers/block/genhd.c
> --- 7.12/drivers/block/genhd.c Wed, 10 Apr 2002 09:50:17 +0200 zcalusic (linux25/x/c/3_genhd.c 1.1.1.1.1.1.3.1 644)
> +++ 7.14/drivers/block/genhd.c Wed, 10 Apr 2002 10:02:44 +0200 zcalusic (linux25/x/c/3_genhd.c 1.1.1.1.1.1.3.2 644)
> @@ -168,6 +168,10 @@
>  	read_unlock(&gendisk_lock);
>  }
>  
> +
> +/* Normalise the disk performance stats to a notional timer tick of 1ms. */
> +#define MSEC(x) ((x) * 1000 / HZ)
> +
>  static int show_partition(struct seq_file *part, void *v)
>  {
>  	struct gendisk *sgp = v;
> @@ -175,19 +179,34 @@
>  	char buf[64];
>  
>  	if (sgp == gendisk_head)
> -		seq_puts(part, "major minor  #blocks  name\n\n");
> +		seq_puts(part, "major minor  #blocks  name "
> +			 "rio rmerge rsect ruse "
> +			 "wio wmerge wsect wuse "
> +			 "running use aveq\n\n");
>  
>  	/* show all non-0 size partitions of this disk */
>  	for (n = 0; n < (sgp->nr_real << sgp->minor_shift); n++) {
> -		if (sgp->part[n].nr_sects == 0)
> +		struct hd_struct *hd = &sgp->part[n];
> +
> +		if (!hd->nr_sects)
>  			continue;
> -		seq_printf(part, "%4d  %4d %10d %s\n",
> -			sgp->major, n, sgp->sizes[n],
> -			disk_name(sgp, n, buf));
> +		disk_round_stats(hd);
> +		seq_printf(part, "%4d  %4d %10d %s "
> +			   "%d %d %d %d %d %d %d %d %d %d %d\n",
> +			   sgp->major, n, sgp->sizes[n],
> +			   disk_name(sgp, n, buf),
> +			   hd->rd_ios, hd->rd_merges,
> +			   hd->rd_sectors, MSEC(hd->rd_ticks),
> +			   hd->wr_ios, hd->wr_merges,
> +			   hd->wr_sectors, MSEC(hd->wr_ticks),
> +			   hd->ios_in_flight, MSEC(hd->io_ticks),
> +			   MSEC(hd->aveq));
>  	}
>  
>  	return 0;
>  }
> +
> +#undef MSEC
>  
>  struct seq_operations partitions_op = {
>  	start:	part_start,
> Index: 7.12/drivers/block/ll_rw_blk.c
> --- 7.12/drivers/block/ll_rw_blk.c Wed, 10 Apr 2002 09:50:17 +0200 zcalusic (linux25/x/c/9_ll_rw_blk. 1.1.1.1.1.1.1.1.1.1.3.1.1.1 644)
> +++ 7.14/drivers/block/ll_rw_blk.c Wed, 10 Apr 2002 18:37:24 +0200 zcalusic (linux25/x/c/9_ll_rw_blk. 1.1.1.1.1.1.1.1.1.1.3.1.1.3 644)
> @@ -944,6 +944,136 @@
>  		printk(KERN_ERR "drive_stat_acct: cmd not R/W?\n");
>  }
>  
> +/* Return up to two hd_structs on which to do IO accounting for a given
> + * request.  On a partitioned device, we want to account both against
> + * the partition and against the whole disk.
> + */
> +static void locate_hd_struct(struct request *req, 
> +			     struct hd_struct **hd1,
> +			     struct hd_struct **hd2)
> +{
> +	int p, n_part;
> +	struct gendisk *gd;
> +	struct hd_struct *hd;
> +
> +	*hd1 = NULL;
> +	*hd2 = NULL;
> +
> +	gd = get_gendisk(req->rq_dev);
> +	if (gd && gd->part) {
> +		*hd1 = &gd->part[0];
> +
> +		/* Try to reconstruct the partition */
> +		n_part = 1 << gd->minor_shift;
> +		for (p = 1; p < n_part; p++) {
> +			hd = &gd->part[p];
> +			/* skip empty and extended partitions */
> +			if (hd->nr_sects < 3)
> +				continue;
> +			if (req->sector >= hd->start_sect
> +			    && req->sector < hd->start_sect + hd->nr_sects)
> +				break;
> +		}
> +		if (p < n_part)
> +			*hd2 = hd;
> +	}
> +}
> +
> +/* Round off the performance stats on an hd_struct.  The average IO
> + * queue length and utilisation statistics are maintained by observing
> + * the current state of the queue length and the amount of time it has
> + * been in this state for.  Normally, that accounting is done on IO
> + * completion, but that can result in more than a second's worth of IO
> + * being accounted for within any one second, leading to >100%
> + * utilisation.  To deal with that, we do a round-off before returning
> + * the results when reading /proc/partitions, accounting immediately for
> + * all queue usage up to the current jiffies and restarting the counters
> + * again.
> + */
> +void disk_round_stats(struct hd_struct *hd)
> +{
> +	unsigned long now = jiffies;
> +
> +	hd->aveq += hd->ios_in_flight * (now - hd->last_queue_change);
> +	hd->last_queue_change = now;
> +
> +	if (hd->ios_in_flight)
> +		hd->io_ticks += (now - hd->last_idle_time);
> +	hd->last_idle_time = now;
> +}
> +
> +static inline void down_ios(struct hd_struct *hd)
> +{
> +	disk_round_stats(hd);
> +	if (hd->ios_in_flight)
> +		--hd->ios_in_flight;
> +}
> +
> +static inline void up_ios(struct hd_struct *hd)
> +{
> +	disk_round_stats(hd);
> +	++hd->ios_in_flight;
> +}
> +
> +static void account_io_start(struct hd_struct *hd, struct request *req,
> +			     int merge, int sectors)
> +{
> +	switch (rq_data_dir(req)) {
> +	case READ:
> +		if (merge)
> +			hd->rd_merges++;
> +		hd->rd_sectors += sectors;
> +		break;
> +	case WRITE:
> +		if (merge)
> +			hd->wr_merges++;
> +		hd->wr_sectors += sectors;
> +		break;
> +	default:
> +	}
> +	if (!merge)
> +		up_ios(hd);
> +}
> +
> +static void account_io_end(struct hd_struct *hd, struct request *req)
> +{
> +	unsigned long duration = jiffies - req->start_time;
> +	switch (rq_data_dir(req)) {
> +	case READ:
> +		hd->rd_ticks += duration;
> +		hd->rd_ios++;
> +		break;
> +	case WRITE:
> +		hd->wr_ticks += duration;
> +		hd->wr_ios++;
> +		break;
> +	default:
> +	}
> +	down_ios(hd);
> +}
> +
> +void req_new_io(struct request *req, int merge, int sectors)
> +{
> +	struct hd_struct *hd1, *hd2;
> +
> +	locate_hd_struct(req, &hd1, &hd2);
> +	if (hd1)
> +		account_io_start(hd1, req, merge, sectors);
> +	if (hd2)
> +		account_io_start(hd2, req, merge, sectors);
> +}
> +
> +void req_finished_io(struct request *req)
> +{
> +	struct hd_struct *hd1, *hd2;
> +
> +	locate_hd_struct(req, &hd1, &hd2);
> +	if (hd1)
> +		account_io_end(hd1, req);
> +	if (hd2)
> +		account_io_end(hd2, req);
> +}
> +
>  /*
>   * add-request adds a request to the linked list.
>   * queue lock is held and interrupts disabled, as we muck with the
> @@ -1018,6 +1148,8 @@
>  	 * counts here.
>  	 */
>  	if (q->merge_requests_fn(q, req, next)) {
> +		struct hd_struct *hd1, *hd2;
> +
>  		elv_merge_requests(q, req, next);
>  
>  		blkdev_dequeue_request(next);
> @@ -1027,6 +1159,15 @@
>  
>  		req->nr_sectors = req->hard_nr_sectors += next->hard_nr_sectors;
>  
> +		/* One last thing: we have removed a request, so we now have
> +		 * one less expected IO to complete for accounting purposes.
> +		 */
> +		locate_hd_struct(req, &hd1, &hd2);
> +		if (hd1)
> +			down_ios(hd1);
> +		if (hd2)	
> +			down_ios(hd2);
> +
>  		blkdev_release_request(next);
>  	}
>  }
> @@ -1117,6 +1258,7 @@
>  			req->biotail = bio;
>  			req->nr_sectors = req->hard_nr_sectors += nr_sectors;
>  			drive_stat_acct(req, nr_sectors, 0);
> +			req_new_io(req, 1, nr_sectors);
>  			attempt_back_merge(q, req);
>  			goto out;
>  
> @@ -1142,6 +1284,7 @@
>  			req->sector = req->hard_sector = sector;
>  			req->nr_sectors = req->hard_nr_sectors += nr_sectors;
>  			drive_stat_acct(req, nr_sectors, 0);
> +			req_new_io(req, 1, nr_sectors);
>  			attempt_front_merge(q, req);
>  			goto out;
>  
> @@ -1210,6 +1353,8 @@
>  	req->waiting = NULL;
>  	req->bio = req->biotail = bio;
>  	req->rq_dev = bio->bi_dev;
> +	req->start_time = jiffies;
> +	req_new_io(req, 0, nr_sectors);
>  	add_request(q, req, insert_here);
>  out:
>  	if (freereq)
> @@ -1649,6 +1794,7 @@
>  {
>  	if (req->waiting)
>  		complete(req->waiting);
> +	req_finished_io(req);
>  
>  	blkdev_release_request(req);
>  }
> 
> -- 
> Zlatko
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/



  parent reply	other threads:[~2002-04-12  2:22 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2002-04-11 11:06 sard/iostat disk I/O statistics/accounting for 2.5.8-pre3 Zlatko Calusic
2002-04-11 14:02 ` Christoph Hellwig
2002-04-11 17:20   ` Zlatko Calusic
2002-04-11 21:09     ` Mike Fedyk
2002-04-11 23:06       ` Zlatko Calusic
2002-04-12 10:28         ` Jochen Suckfuell
2002-04-12  2:23 ` Don Dupuis [this message]
2002-04-14 19:06   ` Zlatko Calusic
  -- strict thread matches above, loose matches on Subject: below --
2002-04-25 11:51 Thomas Tonino

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1018578201.4395.22.camel@linux-ath.linux.dev.com \
    --to=ddupuissprint@earthlink.net \
    --cc=alan@lxorguk.ukuu.org.uk \
    --cc=axboe@suse.de \
    --cc=linux-kernel@vger.kernel.org \
    --cc=sct@redhat.com \
    --cc=zlatko.calusic@iskon.hr \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.