public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Jens Axboe <axboe@suse.de>
To: Jeff Garzik <jgarzik@pobox.com>
Cc: Andrea Arcangeli <andrea@suse.de>,
	Bartlomiej Zolnierkiewicz <B.Zolnierkiewicz@elka.pw.edu.pl>,
	William Lee Irwin III <wli@holomorphy.com>,
	Nick Piggin <nickpiggin@yahoo.com.au>,
	linux-ide@vger.kernel.org,
	Linux Kernel <linux-kernel@vger.kernel.org>,
	Andrew Morton <akpm@osdl.org>
Subject: Re: [PATCH] speed up SATA
Date: Tue, 30 Mar 2004 13:09:28 +0200	[thread overview]
Message-ID: <20040330110928.GR24370@suse.de> (raw)
In-Reply-To: <40687CF0.3040206@pobox.com>

On Mon, Mar 29 2004, Jeff Garzik wrote:
> Andrea Arcangeli wrote:
> >Once you change the API too, then you can set the hardwre limit in your
> >driver and relay on the highlevel blkdev code to find the optimal runtime
> >dma size for bulk I/O, but today it's your driver that is enforcing a
> >runtime bulk dma size, not the maximim limit of the controller, and so
> >you should code your patch accordingly.
> 
> This magic 512k or 1M limit has nothing to do with SATA.  It's a magic 
> number whose definition is "we don't want PATA or SATA or SCSI disks 
> doing larger requests than this for latency, VM, and similar reasons."
> 
> That definition belongs outside the low-level SATA driver.
> 
> This 512k/1M value is sure to change over time, which involves 
> needlessly plugging the same value into multiple places.  And when such 
> changes occurs, you must be careful not to exceed hardware- and 
> errata-based limits -- of which there is no distinction in the code.
> 
> I think the length of this discussion alone clearly implies that the 
> low-level driver should not be responsible for selecting this value, if 
> nothing else ;-)

Here's a quickly done patch that attempts to adjust the value based on a
previous range of completed requests. It changes ->max_sectors to be a
hardware limit, adding ->optimal_sectors to be our max issued io target.
It is split on READ and WRITE. The target is to keep request execution
time under BLK_IORATE_TARGET, which is 50ms in this patch. read-ahead
max window is kept within a single request in size.

So this is pretty half-assed, but it gets the point across. Things that
should be looked at (meaning - should be done, but I didn't want to
waste time on them now):

- I added the hook to see how many in-drive queued requests you have, so
  there's the possibility to add tcq knowledge as well.

- The optimal_sectors calculation is just an average of previous maximum
  with current maximum. The scheme has a number of break down points,
  for instance writes starving reads will give you a bad read execution
  time, further limiting ->optimal_sectors[READ]

- HZ == 1000 is hardcoded

- bdi->ra_pages setting is just best effort, there's no attempt made to
  synchronize this with the issuer. Would require too much effort for
  probably zero real gain.

===== drivers/block/elevator.c 1.53 vs edited =====
--- 1.53/drivers/block/elevator.c	Mon Jan 19 07:38:36 2004
+++ edited/drivers/block/elevator.c	Tue Mar 30 12:59:57 2004
@@ -150,6 +150,15 @@
 void elv_requeue_request(request_queue_t *q, struct request *rq)
 {
 	/*
+	 * it already went through dequeue, we need to decrement the
+	 * in_flight count again
+	 */
+	if (blk_account_rq(rq)) {
+		WARN_ON(q->in_flight == 0);
+		q->in_flight--;
+	}
+
+	/*
 	 * if iosched has an explicit requeue hook, then use that. otherwise
 	 * just put the request at the front of the queue
 	 */
@@ -221,6 +230,9 @@
 		}
 	}
 
+	if (rq)
+		rq->issue_time = jiffies;
+
 	return rq;
 }
 
@@ -229,6 +241,21 @@
 	elevator_t *e = &q->elevator;
 
 	/*
+	 * the time frame between a request being removed from the lists
+	 * and to it is freed is accounted as io that is in progress at
+	 * the driver side. note that we only account requests that the
+	 * driver has seen (REQ_STARTED set), to avoid false accounting
+	 * for request-request merges
+	 */
+	if (blk_account_rq(rq)) {
+		q->in_flight++;
+#if 0
+		q->max_in_flight = max(q->in_flight, q->max_in_flight);
+#endif
+		WARN_ON(q->in_flight > 2 * q->nr_requests);
+	}
+
+	/*
 	 * the main clearing point for q->last_merge is on retrieval of
 	 * request by driver (it calls elv_next_request()), but it _can_
 	 * also happen here if a request is added to the queue but later
@@ -316,6 +343,14 @@
 void elv_completed_request(request_queue_t *q, struct request *rq)
 {
 	elevator_t *e = &q->elevator;
+
+	/*
+	 * request is released from the driver, io must be done
+	 */
+	if (blk_account_rq(rq)) {
+		WARN_ON(q->in_flight == 0);
+		q->in_flight--;
+	}
 
 	if (e->elevator_completed_req_fn)
 		e->elevator_completed_req_fn(q, rq);
===== drivers/block/ll_rw_blk.c 1.237 vs edited =====
--- 1.237/drivers/block/ll_rw_blk.c	Tue Mar 16 11:29:58 2004
+++ edited/drivers/block/ll_rw_blk.c	Tue Mar 30 12:58:51 2004
@@ -313,7 +313,7 @@
  *    Enables a low level driver to set an upper limit on the size of
  *    received requests.
  **/
-void blk_queue_max_sectors(request_queue_t *q, unsigned short max_sectors)
+void blk_queue_max_sectors(request_queue_t *q, unsigned int max_sectors)
 {
 	if ((max_sectors << 9) < PAGE_CACHE_SIZE) {
 		max_sectors = 1 << (PAGE_CACHE_SHIFT - 9);
@@ -321,6 +321,14 @@
 	}
 
 	q->max_sectors = max_sectors;
+
+	/*
+	 * use a max of 1MB as a starting point
+	 */
+	q->optimal_sectors[READ] = max_sectors;
+	if (q->optimal_sectors[READ] > 2048)
+		q->optimal_sectors[READ] = 2048;
+	q->optimal_sectors[WRITE] = q->optimal_sectors[READ];
 }
 
 EXPORT_SYMBOL(blk_queue_max_sectors);
@@ -1018,10 +1026,12 @@
 	return 1;
 }
 
-static int ll_back_merge_fn(request_queue_t *q, struct request *req, 
+static int ll_back_merge_fn(request_queue_t *q, struct request *req,
 			    struct bio *bio)
 {
-	if (req->nr_sectors + bio_sectors(bio) > q->max_sectors) {
+	const int rw = rq_data_dir(req);
+
+	if (req->nr_sectors + bio_sectors(bio) > q->optimal_sectors[rw]) {
 		req->flags |= REQ_NOMERGE;
 		q->last_merge = NULL;
 		return 0;
@@ -1033,10 +1043,12 @@
 	return ll_new_hw_segment(q, req, bio);
 }
 
-static int ll_front_merge_fn(request_queue_t *q, struct request *req, 
+static int ll_front_merge_fn(request_queue_t *q, struct request *req,
 			     struct bio *bio)
 {
-	if (req->nr_sectors + bio_sectors(bio) > q->max_sectors) {
+	const int rw = rq_data_dir(req);
+
+	if (req->nr_sectors + bio_sectors(bio) > q->optimal_sectors[rw]) {
 		req->flags |= REQ_NOMERGE;
 		q->last_merge = NULL;
 		return 0;
@@ -1053,6 +1065,7 @@
 {
 	int total_phys_segments = req->nr_phys_segments +next->nr_phys_segments;
 	int total_hw_segments = req->nr_hw_segments + next->nr_hw_segments;
+	const int rw = rq_data_dir(req);
 
 	/*
 	 * First check if the either of the requests are re-queued
@@ -1064,7 +1077,7 @@
 	/*
 	 * Will it become to large?
 	 */
-	if ((req->nr_sectors + next->nr_sectors) > q->max_sectors)
+	if ((req->nr_sectors + next->nr_sectors) > q->optimal_sectors[rw])
 		return 0;
 
 	total_phys_segments = req->nr_phys_segments + next->nr_phys_segments;
@@ -2312,9 +2325,9 @@
 		__blk_put_request(q, freereq);
 
 	if (blk_queue_plugged(q)) {
-		int nr_queued = q->rq.count[READ] + q->rq.count[WRITE];
+		int nrq = q->rq.count[READ] + q->rq.count[WRITE] - q->in_flight;
 
-		if (nr_queued == q->unplug_thresh)
+		if (nrq == q->unplug_thresh)
 			__generic_unplug_device(q);
 	}
 	spin_unlock_irq(q->queue_lock);
@@ -2575,6 +2588,74 @@
 	rq->nr_hw_segments = nr_hw_segs;
 }
 
+#define BLK_IORATE_SAMPLES	(512)
+#define BLK_IORATE_ALIGNMASK	(7)	/* 4kb alignment */
+#define BLK_IORATE_TARGET	(50)	/* 50ms latency */
+#define BLK_IORATE_ALIGN(x)	(x) = (((x) + BLK_IORATE_ALIGNMASK) & ~BLK_IORATE_ALIGNMASK)
+
+static inline void blk_calc_stream_rate(struct request *rq, int nbytes)
+{
+	request_queue_t *q = rq->q;
+	unsigned int iorate, max_sectors;
+	unsigned long duration;
+	struct blk_iorate *ior;
+	int rw;
+
+	if (!blk_fs_request(rq))
+		return;
+	if (unlikely(!q))
+		return;
+
+	rw = rq_data_dir(rq);
+	ior = &q->iorates[rw];
+	duration = jiffies - rq->issue_time;
+
+	ior->io_rate_samples++;
+	if (!(ior->io_rate_samples & (BLK_IORATE_SAMPLES - 1))) {
+		ior->io_rate_samples = 0;
+		ior->best_io_rate = 0;
+	}
+			
+	duration = max(jiffies - rq->start_time, 1UL);
+	iorate = nbytes / duration;
+	if (iorate > ior->best_io_rate)
+		ior->best_io_rate = iorate;
+
+	/*
+	 * average old optimal sectors with best data rate
+	 */
+	if (!ior->io_rate_samples) {
+		struct backing_dev_info *bdi = &q->backing_dev_info;
+		int ra_sec = bdi->ra_pages << (PAGE_CACHE_SHIFT - 9);
+		int ra_sec_max = VM_MAX_READAHEAD << 9;
+
+		max_sectors = (ior->best_io_rate * BLK_IORATE_TARGET) >> 9;
+		q->optimal_sectors[rw] = (q->optimal_sectors[rw] + max_sectors) / 2;
+		BLK_IORATE_ALIGN(q->optimal_sectors[rw]);
+
+		/*
+		 * don't exceed the hardware limit
+		 */
+		if (q->optimal_sectors[rw] > q->max_sectors)
+			q->optimal_sectors[rw] = q->max_sectors;
+
+		/*
+		 * let read-ahead be optimal io size aligned, and no more than
+		 * a single request
+		 */
+		if (rw == READ) {
+			if (ra_sec > q->optimal_sectors[READ])
+				ra_sec = q->optimal_sectors[READ];
+			if (ra_sec > ra_sec_max)
+				ra_sec = ra_sec_max;
+
+			bdi->ra_pages = ra_sec >> (PAGE_CACHE_SHIFT - 9);
+		}
+
+		printk("%s: %s optimal sectors %u (ra %lu)\n", rq->rq_disk->disk_name, rw == WRITE ? "WRITE" : "READ", q->optimal_sectors[rw], bdi->ra_pages);
+	}
+}
+
 void blk_recalc_rq_sectors(struct request *rq, int nsect)
 {
 	if (blk_fs_request(rq)) {
@@ -2628,7 +2709,8 @@
 			printk("end_request: I/O error, dev %s, sector %llu\n",
 				req->rq_disk ? req->rq_disk->disk_name : "?",
 				(unsigned long long)req->sector);
-	}
+	} else
+		blk_calc_stream_rate(req, nr_bytes);
 
 	total_bytes = bio_nbytes = 0;
 	while ((bio = req->bio)) {
===== include/linux/blkdev.h 1.138 vs edited =====
--- 1.138/include/linux/blkdev.h	Fri Mar 12 10:33:07 2004
+++ edited/include/linux/blkdev.h	Tue Mar 30 13:00:51 2004
@@ -122,6 +122,7 @@
 	struct gendisk *rq_disk;
 	int errors;
 	unsigned long start_time;
+	unsigned long issue_time;
 
 	/* Number of scatter-gather DMA addr+len pairs after
 	 * physical address coalescing is performed.
@@ -267,6 +268,11 @@
 	atomic_t refcnt;		/* map can be shared */
 };
 
+struct blk_iorate {
+	unsigned int		io_rate_samples;
+	unsigned int		best_io_rate;
+};
+
 struct request_queue
 {
 	/*
@@ -337,11 +343,12 @@
 	 */
 	unsigned long		nr_requests;	/* Max # of requests */
 
-	unsigned short		max_sectors;
 	unsigned short		max_phys_segments;
 	unsigned short		max_hw_segments;
 	unsigned short		hardsect_size;
 	unsigned int		max_segment_size;
+	unsigned int		max_sectors;
+	unsigned int		optimal_sectors[2];
 
 	unsigned long		seg_boundary_mask;
 	unsigned int		dma_alignment;
@@ -350,6 +357,10 @@
 
 	atomic_t		refcnt;
 
+	unsigned short		in_flight;
+
+	struct blk_iorate	iorates[2];
+
 	/*
 	 * sg stuff
 	 */
@@ -378,6 +389,9 @@
 #define blk_fs_request(rq)	((rq)->flags & REQ_CMD)
 #define blk_pc_request(rq)	((rq)->flags & REQ_BLOCK_PC)
 #define blk_noretry_request(rq)	((rq)->flags & REQ_FAILFAST)
+#define blk_rq_started(rq)	((rq)->flags & REQ_STARTED)
+
+#define blk_account_rq(rq)	(blk_rq_started(rq) && blk_fs_request(rq))
 
 #define blk_pm_suspend_request(rq)	((rq)->flags & REQ_PM_SUSPEND)
 #define blk_pm_resume_request(rq)	((rq)->flags & REQ_PM_RESUME)
@@ -558,7 +572,7 @@
 extern void blk_cleanup_queue(request_queue_t *);
 extern void blk_queue_make_request(request_queue_t *, make_request_fn *);
 extern void blk_queue_bounce_limit(request_queue_t *, u64);
-extern void blk_queue_max_sectors(request_queue_t *, unsigned short);
+extern void blk_queue_max_sectors(request_queue_t *, unsigned int);
 extern void blk_queue_max_phys_segments(request_queue_t *, unsigned short);
 extern void blk_queue_max_hw_segments(request_queue_t *, unsigned short);
 extern void blk_queue_max_segment_size(request_queue_t *, unsigned int);

-- 
Jens Axboe


  reply	other threads:[~2004-03-30 11:11 UTC|newest]

Thread overview: 115+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2004-03-27 22:37 [PATCH] speed up SATA Jeff Garzik
2004-03-27 23:04 ` Stefan Smietanowski
2004-03-27 23:11   ` Jeff Garzik
2004-03-28  7:23     ` Stefan Smietanowski
2004-03-28 15:37       ` Bartlomiej Zolnierkiewicz
2004-03-27 23:32   ` Bartlomiej Zolnierkiewicz
2004-03-27 23:36     ` Jeff Garzik
2004-03-27 23:40       ` Jeff Garzik
2004-03-28  0:13         ` Bartlomiej Zolnierkiewicz
2004-03-28  0:08           ` Jeff Garzik
2004-03-29 11:42       ` Pavel Machek
2004-03-27 23:37 ` Nick Piggin
2004-03-27 23:44   ` Jeff Garzik
2004-03-27 23:47     ` Nick Piggin
2004-03-27 23:59       ` Jeff Garzik
2004-03-28 14:10         ` Jens Axboe
2004-03-28 17:31           ` Jeff Garzik
2004-03-28 17:35             ` Jens Axboe
2004-03-28 17:48               ` Jeff Garzik
2004-03-28 17:54                 ` Jens Axboe
2004-03-28 18:08                   ` Jamie Lokier
2004-03-28 18:15                     ` Jens Axboe
2004-03-28 18:55                       ` Jeff Garzik
2004-03-29  8:09                         ` Jens Axboe
2004-03-29 12:41                           ` Jamie Lokier
2004-03-29 12:44                             ` Jens Axboe
2004-03-29 12:50                               ` Jamie Lokier
2004-03-29 13:05                               ` Arjan van de Ven
2004-03-29 13:08                                 ` Jens Axboe
2004-03-30  8:13                                   ` Kurt Garloff
2004-03-30 11:40                                     ` Jens Axboe
2004-03-29 17:19                                 ` Craig I. Hagan
2004-03-29 18:19                           ` Jeff Garzik
2004-03-28 19:06                       ` Jeff Garzik
2004-03-28 18:12                   ` William Lee Irwin III
2004-03-28 18:17                     ` Jens Axboe
2004-03-28 18:30                     ` Bartlomiej Zolnierkiewicz
2004-03-28 18:30                       ` Jens Axboe
2004-03-28 18:45                         ` Bartlomiej Zolnierkiewicz
2004-03-28 18:59                           ` Jeff Garzik
2004-03-28 20:32                             ` Andrew Morton
2004-03-28 20:45                               ` Jeff Garzik
2004-03-29  0:55                             ` Andrea Arcangeli
2004-03-29  4:02                               ` Jeff Garzik
2004-03-29 13:04                                 ` Andrea Arcangeli
2004-03-29 19:45                                   ` Jeff Garzik
2004-03-30 11:09                                     ` Jens Axboe [this message]
2004-03-30 15:54                                       ` Timothy Miller
2004-03-30 16:20                                       ` Jeff Garzik
2004-03-30 18:05                                       ` Timothy Miller
2004-03-30 17:50                                         ` Jeff Garzik
2004-03-30 18:19                                           ` Timothy Miller
2004-03-29  4:29                               ` Wim Coekaerts
2004-03-29  7:32                                 ` Denis Vlasenko
2004-03-29  8:13                                 ` Jens Axboe
2004-03-29 13:05                                 ` Andrea Arcangeli
2004-03-29  4:31                               ` William Lee Irwin III
2004-03-29  4:57                               ` Jeff Garzik
2004-03-28 19:52                       ` Nuno Silva
2004-03-28 20:02                         ` Jeff Garzik
2004-03-28  0:06     ` Jeff Garzik
2004-03-28  0:15       ` Nick Piggin
2004-03-28  0:49         ` Jeff Garzik
2004-03-28  1:02           ` Andrew Morton
2004-03-28  1:09             ` Jeff Garzik
2004-03-28 13:59               ` Jens Axboe
2004-03-28 17:29                 ` Jeff Garzik
2004-03-28 17:31                   ` Jens Axboe
2004-03-28 13:51           ` Jamie Lokier
2004-03-28 17:24             ` Jeff Garzik
2004-03-28 17:36               ` Jamie Lokier
2004-03-28 17:54                 ` Jeff Garzik
2004-03-28 20:50                   ` Eric D. Mudama
2004-04-02 10:11                 ` Jeremy Higdon
2004-04-02 16:11                   ` Jamie Lokier
2004-04-03 10:48                     ` Jeremy Higdon
2004-04-03 13:49                       ` Jamie Lokier
2004-03-28 17:40               ` Jens Axboe
2004-03-28 17:49                 ` Jeff Garzik
2004-03-28 17:55                   ` Jens Axboe
2004-03-28 18:04                     ` Jeff Garzik
2004-03-28 18:09                       ` Jens Axboe
2004-03-28 20:12                         ` Jeff Garzik
2004-03-28 20:54                           ` Eric D. Mudama
2004-03-28  7:32       ` Stefan Smietanowski
2004-03-28 20:25         ` Jeff Garzik
2004-03-28 21:16           ` Stefan Smietanowski
2004-03-28 21:26             ` Jeff Garzik
2004-03-28 14:08     ` Jens Axboe
2004-03-28 17:38       ` Jeff Garzik
2004-03-28 17:45         ` Jens Axboe
2004-03-28 20:21       ` Jeff Garzik
2004-03-28  0:07   ` Andrew Morton
2004-03-28  0:21     ` Nick Piggin
2004-03-28  4:40   ` Eric D. Mudama
2004-03-28  6:56     ` Nick Piggin
2004-03-28 20:33       ` Eric D. Mudama
2004-03-28 20:59         ` Eric D. Mudama
2004-03-29  1:30           ` Nick Piggin
2004-03-29  5:24             ` Eric D. Mudama
2004-03-29 13:03               ` Jamie Lokier
2004-03-29 11:36 ` Pavel Machek
2004-03-29 18:46   ` David Lang
2004-03-29 20:13     ` Jeff Garzik
2004-03-30  5:55   ` Eric D. Mudama
2004-03-30 11:54 ` Marc Bevand
2004-03-30 13:07   ` Jens Axboe
2004-03-30 13:48     ` Marc Bevand
2004-03-30 13:49       ` Jens Axboe
2004-03-30 15:31   ` Jeff Garzik
2004-03-30 17:42   ` Jeff Garzik
2004-03-31  9:12     ` Marc Bevand
  -- strict thread matches above, loose matches on Subject: below --
2004-03-31  5:47 Marcus Hartig
2004-03-31  6:56 ` Jeff Garzik
2004-03-31 16:07   ` Marcus Hartig

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20040330110928.GR24370@suse.de \
    --to=axboe@suse.de \
    --cc=B.Zolnierkiewicz@elka.pw.edu.pl \
    --cc=akpm@osdl.org \
    --cc=andrea@suse.de \
    --cc=jgarzik@pobox.com \
    --cc=linux-ide@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=nickpiggin@yahoo.com.au \
    --cc=wli@holomorphy.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox