All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jens Axboe <axboe@suse.de>
To: Chris Mason <mason@suse.com>
Cc: Andrea Arcangeli <andrea@suse.de>, Andrew Morton <akpm@osdl.org>,
	marcelo@conectiva.com.br, linux-kernel@vger.kernel.org
Subject: Re: RFC on io-stalls patch
Date: Tue, 15 Jul 2003 16:29:50 +0200	[thread overview]
Message-ID: <20030715142950.GZ833@suse.de> (raw)
In-Reply-To: <1058278688.4012.57.camel@tiny.suse.com>

On Tue, Jul 15 2003, Chris Mason wrote:
> On Tue, 2003-07-15 at 05:17, Jens Axboe wrote:
> 
> > > BTW, the contest run times vary pretty wildy.  My 3 compiles with
> > > io_load running on 2.4.21 were 603s, 443s and 515s.  This doesn't make
> > > the average of the 3 numbers invalid, but we need a more stable metric.
> > 
> > Mine are pretty consistent [1], I'd suspect that it isn't contest but your
> > drive tcq skewing things. But it would be nice to test with other things
> > as well, I just used contest because it was at hand.
> 
> I hacked up my random io generator a bit, combined with tiobench it gets
> pretty consistent numbers.  rio is attached, it does lots of different
> things but the basic idea is to create a sparse file and then do io of
> random sizes into random places in the file.
> 
> So, random writes with a large max record size can starve readers pretty
> well.  I've been running it like so:
> 
> #!/bin/sh
> #
> # rio sparse file size of 2G, writes only, max record size 1MB
> #
> rio -s 2G -W -m 1m &
> kid=$!
> sleep 1
> tiobench.pl --threads 8
> kill $kid
> wait
> 
> tiobench is nice because it also gives latency numbers, I'm playing a
> bit to see how the number of tiobench threads changes things, but the
> contest output is much easier to compare.  After a little formatting:
> 
> Sequential Reads
>             File  Blk   Num                   Avg      Maximum      Lat%     Lat%    CPU
> Identifier  Size  Size  Thr   Rate  (CPU%)  Latency    Latency      >2s      >10s    Eff
> ------------ ------ ----- ---  ------ ------ --------- -----------  -------- -------- -----
> 2.4.22-pre5 1792  4096    8    6.92 2.998%     4.363     9666.86   0.02245  0.00000   231
> 2.4.21      1792  4096    8    8.40 3.275%     3.052     3249.79   0.00131  0.00000   256
> 
> Random Reads
>             File  Blk   Num                   Avg      Maximum      Lat%     Lat%    CPU
> Identifier  Size  Size  Thr   Rate  (CPU%)  Latency    Latency      >2s      >10s    Eff
> ------------ ------ ----- ---  ------ ------ --------- -----------  -------- -------- -----
> 2.4.22-pre5 1792  4096    8    1.41 0.540%    20.932      604.13   0.00000  0.00000   260
> 2.4.21      1792  4096    8    0.65 0.540%    41.458     2689.96   0.05000  0.00000   120
> 
> Sequential Writes
>             File  Blk   Num                   Avg      Maximum      Lat%     Lat%    CPU
> Identifier  Size  Size  Thr   Rate  (CPU%)  Latency    Latency      >2s      >10s    Eff
> ------------ ------ ----- ---  ------ ------ --------- -----------  -------- -------- -----
> 2.4.22-pre5 1792  4096    8   13.77 8.793%     1.550     3416.72   0.00567  0.00000   157
> 2.4.21      1792  4096    8   15.38 8.847%     1.169    47134.93   0.00719  0.00305   174
> 
> Random Writes
>             File  Blk   Num                   Avg      Maximum      Lat%     Lat%    CPU
> Identifier  Size  Size  Thr   Rate  (CPU%)  Latency    Latency      >2s      >10s    Eff
> ------------ ------ ----- ---  ------ ------ --------- -----------  -------- -------- -----
> 2.4.22-pre5 1792  4096    8    0.68 1.470%     0.027       12.91   0.00000  0.00000    46
> 2.4.21      1792  4096    8    0.67 0.598%     0.043       67.21   0.00000  0.00000   112
> 
> rio output:
> 2.4.22-pre5 total io 2683.087 MB, 428.000 seconds 6.269 MB/s
> 2.4.21      total io 3231.576 MB, 381.000 seconds 8.482 MB/s
> 2.4.22-pre5 writes: 2683.087 MB, 6.269 MB/s
> 2.4.21      writes: 3231.576 MB, 8.482 MB/s
> 
> Without breaking tiobench apart to skip the write phase it is hard to
> tell where the rio throughput loss is.  My guess is that most of it was
> lost during the sequential write tiobench phase.
> 
> Since jens is getting consistent contest numbers, this isn't meant to
> replace it.  Just another data point worth looking at.  Jens if you want
> to send me your current patch I can give it a quick spin here.

Very nice Chris! I'm still getting the ide numbers here, the patch I'm
using at present looks like the following.

===== drivers/block/ll_rw_blk.c 1.47 vs edited =====
--- 1.47/drivers/block/ll_rw_blk.c	Fri Jul 11 10:30:54 2003
+++ edited/drivers/block/ll_rw_blk.c	Tue Jul 15 11:09:38 2003
@@ -458,6 +458,7 @@
   
  	INIT_LIST_HEAD(&q->rq.free);
 	q->rq.count = 0;
+	q->rq.pending[READ] = q->rq.pending[WRITE] = 0;
 	q->nr_requests = 0;
 
 	si_meminfo(&si);
@@ -549,18 +550,33 @@
 static struct request *get_request(request_queue_t *q, int rw)
 {
 	struct request *rq = NULL;
-	struct request_list *rl;
+	struct request_list *rl = &q->rq;
 
-	rl = &q->rq;
-	if (!list_empty(&rl->free) && !blk_oversized_queue(q)) {
+	if (blk_oversized_queue(q)) {
+		int rlim = q->nr_requests >> 5;
+
+		if (rlim < 4)
+			rlim = 4;
+
+		/*
+		 * if its a write, or we have more than a handful of reads
+		 * pending, bail out
+		 */
+		if ((rw == WRITE) || (rw == READ && rl->pending[READ] > rlim))
+			return NULL;
+	}
+	
+	if (!list_empty(&rl->free)) {
 		rq = blkdev_free_rq(&rl->free);
 		list_del(&rq->queue);
 		rl->count--;
+		rl->pending[rw]++;
 		rq->rq_status = RQ_ACTIVE;
 		rq->cmd = rw;
 		rq->special = NULL;
 		rq->q = q;
 	}
+
 	return rq;
 }
 
@@ -866,13 +882,25 @@
 	 * assume it has free buffers and check waiters
 	 */
 	if (q) {
+		struct request_list *rl = &q->rq;
 		int oversized_batch = 0;
 
 		if (q->can_throttle)
 			oversized_batch = blk_oversized_queue_batch(q);
-		q->rq.count++;
-		list_add(&req->queue, &q->rq.free);
-		if (q->rq.count >= q->batch_requests && !oversized_batch) {
+		rl->count++;
+		/*
+		 * paranoia check
+		 */
+		if (req->cmd == READ || req->cmd == WRITE)
+			rl->pending[req->cmd]--;
+		if (rl->pending[READ] > q->nr_requests)
+			printf("reads: %u\n", rl->pending[READ]);
+		if (rl->pending[WRITE] > q->nr_requests)
+			printf("writes: %u\n", rl->pending[WRITE]);
+		if (rl->pending[READ] + rl->pending[WRITE] > q->nr_requests)
+			printf("too many: %u + %u > %u\n", rl->pending[READ], rl->pending[WRITE], q->nr_requests);
+		list_add(&req->queue, &rl->free);
+		if (rl->count >= q->batch_requests && !oversized_batch) {
 			smp_mb();
 			if (waitqueue_active(&q->wait_for_requests))
 				wake_up(&q->wait_for_requests);
@@ -947,7 +975,7 @@
 static int __make_request(request_queue_t * q, int rw,
 				  struct buffer_head * bh)
 {
-	unsigned int sector, count;
+	unsigned int sector, count, sync;
 	int max_segments = MAX_SEGMENTS;
 	struct request * req, *freereq = NULL;
 	int rw_ahead, max_sectors, el_ret;
@@ -958,6 +986,7 @@
 
 	count = bh->b_size >> 9;
 	sector = bh->b_rsector;
+	sync = test_and_clear_bit(BH_Sync, &bh->b_state);
 
 	rw_ahead = 0;	/* normal case; gets changed below for READA */
 	switch (rw) {
@@ -1124,6 +1153,8 @@
 		blkdev_release_request(freereq);
 	if (should_wake)
 		get_request_wait_wakeup(q, rw);
+	if (sync)
+		__generic_unplug_device(q);
 	spin_unlock_irq(&io_request_lock);
 	return 0;
 end_io:
===== fs/buffer.c 1.89 vs edited =====
--- 1.89/fs/buffer.c	Tue Jun 24 23:11:29 2003
+++ edited/fs/buffer.c	Mon Jul 14 16:59:59 2003
@@ -1142,6 +1142,7 @@
 	bh = getblk(dev, block, size);
 	if (buffer_uptodate(bh))
 		return bh;
+	set_bit(BH_Sync, &bh->b_state);
 	ll_rw_block(READ, 1, &bh);
 	wait_on_buffer(bh);
 	if (buffer_uptodate(bh))
===== include/linux/blkdev.h 1.24 vs edited =====
--- 1.24/include/linux/blkdev.h	Fri Jul  4 19:18:12 2003
+++ edited/include/linux/blkdev.h	Tue Jul 15 10:06:04 2003
@@ -66,6 +66,7 @@
 
 struct request_list {
 	unsigned int count;
+	unsigned int pending[2];
 	struct list_head free;
 };
 
===== include/linux/fs.h 1.82 vs edited =====
--- 1.82/include/linux/fs.h	Wed Jul  9 20:42:38 2003
+++ edited/include/linux/fs.h	Tue Jul 15 10:13:13 2003
@@ -221,6 +221,7 @@
 	BH_Launder,	/* 1 if we can throttle on this buffer */
 	BH_Attached,	/* 1 if b_inode_buffers is linked into a list */
 	BH_JBD,		/* 1 if it has an attached journal_head */
+	BH_Sync,	/* 1 if the buffer is a sync read */
 
 	BH_PrivateStart,/* not a state bit, but the first bit available
 			 * for private allocation by other entities

-- 
Jens Axboe


  reply	other threads:[~2003-07-15 14:15 UTC|newest]

Thread overview: 68+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2003-07-08 20:06 RFC on io-stalls patch Marcelo Tosatti
2003-07-10 13:57 ` Jens Axboe
2003-07-11 14:13   ` Chris Mason
2003-07-12  0:20     ` Nick Piggin
2003-07-12 18:37       ` Chris Mason
2003-07-12  7:37     ` Jens Axboe
2003-07-12  7:48       ` Jens Axboe
2003-07-12 18:32       ` Chris Mason
2003-07-13  0:33         ` Andrea Arcangeli
2003-07-13  9:01         ` Jens Axboe
2003-07-13 16:20           ` Chris Mason
2003-07-13 16:45             ` Jeff Garzik
2003-07-13 19:33               ` Andrea Arcangeli
2003-07-13 17:47             ` Jens Axboe
2003-07-13 19:35               ` Andrea Arcangeli
2003-07-14  0:36                 ` Chris Mason
2003-07-13 19:19           ` Andrea Arcangeli
2003-07-14  5:49             ` Jens Axboe
2003-07-14 12:23               ` Marcelo Tosatti
2003-07-14 13:12                 ` Jens Axboe
2003-07-14 19:51                   ` Jens Axboe
2003-07-14 20:09                     ` Chris Mason
2003-07-14 20:19                       ` Andrea Arcangeli
2003-07-14 21:24                         ` Chris Mason
2003-07-15  5:46                       ` Jens Axboe
2003-07-14 20:09                     ` Marcelo Tosatti
2003-07-14 20:24                       ` Andrea Arcangeli
2003-07-14 20:34                         ` Chris Mason
2003-07-15  5:35                           ` Jens Axboe
     [not found]                           ` <20030714224528.GU16313@dualathlon.random>
2003-07-15  5:40                             ` Jens Axboe
     [not found]                             ` <1058229360.13317.364.camel@tiny.suse.com>
2003-07-15  5:43                               ` Jens Axboe
     [not found]                               ` <20030714175238.3eaddd9a.akpm@osdl.org>
     [not found]                                 ` <20030715020706.GC16313@dualathlon.random>
2003-07-15  5:45                                   ` Jens Axboe
2003-07-15  6:01                                     ` Andrea Arcangeli
2003-07-15  6:08                                       ` Jens Axboe
2003-07-15  7:03                                         ` Andrea Arcangeli
2003-07-15  8:28                                           ` Jens Axboe
2003-07-15  9:12                                             ` Chris Mason
2003-07-15  9:17                                               ` Jens Axboe
2003-07-15  9:18                                                 ` Jens Axboe
2003-07-15  9:30                                                   ` Chris Mason
2003-07-15 10:03                                                   ` Andrea Arcangeli
2003-07-15 10:11                                                     ` Jens Axboe
2003-07-15 14:18                                                 ` Chris Mason
2003-07-15 14:29                                                   ` Jens Axboe [this message]
2003-07-16 17:06                                                   ` Chris Mason
2003-07-15  9:22                                               ` Chris Mason
2003-07-15  9:59                                               ` Andrea Arcangeli
2003-07-15  9:48                                             ` Andrea Arcangeli
2003-07-14 20:16                     ` Andrea Arcangeli
2003-07-14 20:17                       ` Marcelo Tosatti
2003-07-14 20:27                         ` Andrea Arcangeli
2003-07-15  5:26                       ` Jens Axboe
2003-07-15  5:48                         ` Andrea Arcangeli
2003-07-15  6:01                           ` Jens Axboe
2003-07-15  6:33                             ` Andrea Arcangeli
2003-07-15 11:22                         ` Alan Cox
2003-07-15 11:27                           ` Jens Axboe
2003-07-16 12:43                             ` Andrea Arcangeli
2003-07-16 12:46                               ` Jens Axboe
2003-07-16 12:59                                 ` Andrea Arcangeli
2003-07-16 13:04                                   ` Jens Axboe
2003-07-16 13:11                                     ` Andrea Arcangeli
2003-07-16 13:21                                       ` Jens Axboe
2003-07-16 13:44                                         ` Andrea Arcangeli
2003-07-16 14:00                                           ` Jens Axboe
2003-07-16 14:24                                             ` Andrea Arcangeli
2003-07-16 16:49                                     ` Andrew Morton
  -- strict thread matches above, loose matches on Subject: below --
2003-07-15 18:47 Shane Shrybman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20030715142950.GZ833@suse.de \
    --to=axboe@suse.de \
    --cc=akpm@osdl.org \
    --cc=andrea@suse.de \
    --cc=linux-kernel@vger.kernel.org \
    --cc=marcelo@conectiva.com.br \
    --cc=mason@suse.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.