From: Jens Axboe <axboe@suse.de>
To: Jeff Garzik <jgarzik@pobox.com>
Cc: Andrea Arcangeli <andrea@suse.de>,
Bartlomiej Zolnierkiewicz <B.Zolnierkiewicz@elka.pw.edu.pl>,
William Lee Irwin III <wli@holomorphy.com>,
Nick Piggin <nickpiggin@yahoo.com.au>,
linux-ide@vger.kernel.org,
Linux Kernel <linux-kernel@vger.kernel.org>,
Andrew Morton <akpm@osdl.org>
Subject: Re: [PATCH] speed up SATA
Date: Tue, 30 Mar 2004 13:09:28 +0200 [thread overview]
Message-ID: <20040330110928.GR24370@suse.de> (raw)
In-Reply-To: <40687CF0.3040206@pobox.com>
On Mon, Mar 29 2004, Jeff Garzik wrote:
> Andrea Arcangeli wrote:
> >Once you change the API too, then you can set the hardwre limit in your
> >driver and relay on the highlevel blkdev code to find the optimal runtime
> >dma size for bulk I/O, but today it's your driver that is enforcing a
> >runtime bulk dma size, not the maximim limit of the controller, and so
> >you should code your patch accordingly.
>
> This magic 512k or 1M limit has nothing to do with SATA. It's a magic
> number whose definition is "we don't want PATA or SATA or SCSI disks
> doing larger requests than this for latency, VM, and similar reasons."
>
> That definition belongs outside the low-level SATA driver.
>
> This 512k/1M value is sure to change over time, which involves
> needlessly plugging the same value into multiple places. And when such
> changes occurs, you must be careful not to exceed hardware- and
> errata-based limits -- of which there is no distinction in the code.
>
> I think the length of this discussion alone clearly implies that the
> low-level driver should not be responsible for selecting this value, if
> nothing else ;-)
Here's a quickly done patch that attempts to adjust the value based on a
previous range of completed requests. It changes ->max_sectors to be a
hardware limit, adding ->optimal_sectors to be our max issued io target.
It is split on READ and WRITE. The target is to keep request execution
time under BLK_IORATE_TARGET, which is 50ms in this patch. read-ahead
max window is kept within a single request in size.
So this is pretty half-assed, but it gets the point across. Things that
should be looked at (meaning - should be done, but I didn't want to
waste time on them now):
- I added the hook to see how many in-drive queued requests you have, so
there's the possibility to add tcq knowledge as well.
- The optimal_sectors calculation is just an average of previous maximum
with current maximum. The scheme has a number of break down points,
for instance writes starving reads will give you a bad read execution
time, further limiting ->optimal_sectors[READ]
- HZ == 1000 is hardcoded
- bdi->ra_pages setting is just best effort, there's no attempt made to
synchronize this with the issuer. Would require too much effort for
probably zero real gain.
===== drivers/block/elevator.c 1.53 vs edited =====
--- 1.53/drivers/block/elevator.c Mon Jan 19 07:38:36 2004
+++ edited/drivers/block/elevator.c Tue Mar 30 12:59:57 2004
@@ -150,6 +150,15 @@
void elv_requeue_request(request_queue_t *q, struct request *rq)
{
/*
+ * it already went through dequeue, we need to decrement the
+ * in_flight count again
+ */
+ if (blk_account_rq(rq)) {
+ WARN_ON(q->in_flight == 0);
+ q->in_flight--;
+ }
+
+ /*
* if iosched has an explicit requeue hook, then use that. otherwise
* just put the request at the front of the queue
*/
@@ -221,6 +230,9 @@
}
}
+ if (rq)
+ rq->issue_time = jiffies;
+
return rq;
}
@@ -229,6 +241,21 @@
elevator_t *e = &q->elevator;
/*
+ * the time frame between a request being removed from the lists
+ * and to it is freed is accounted as io that is in progress at
+ * the driver side. note that we only account requests that the
+ * driver has seen (REQ_STARTED set), to avoid false accounting
+ * for request-request merges
+ */
+ if (blk_account_rq(rq)) {
+ q->in_flight++;
+#if 0
+ q->max_in_flight = max(q->in_flight, q->max_in_flight);
+#endif
+ WARN_ON(q->in_flight > 2 * q->nr_requests);
+ }
+
+ /*
* the main clearing point for q->last_merge is on retrieval of
* request by driver (it calls elv_next_request()), but it _can_
* also happen here if a request is added to the queue but later
@@ -316,6 +343,14 @@
void elv_completed_request(request_queue_t *q, struct request *rq)
{
elevator_t *e = &q->elevator;
+
+ /*
+ * request is released from the driver, io must be done
+ */
+ if (blk_account_rq(rq)) {
+ WARN_ON(q->in_flight == 0);
+ q->in_flight--;
+ }
if (e->elevator_completed_req_fn)
e->elevator_completed_req_fn(q, rq);
===== drivers/block/ll_rw_blk.c 1.237 vs edited =====
--- 1.237/drivers/block/ll_rw_blk.c Tue Mar 16 11:29:58 2004
+++ edited/drivers/block/ll_rw_blk.c Tue Mar 30 12:58:51 2004
@@ -313,7 +313,7 @@
* Enables a low level driver to set an upper limit on the size of
* received requests.
**/
-void blk_queue_max_sectors(request_queue_t *q, unsigned short max_sectors)
+void blk_queue_max_sectors(request_queue_t *q, unsigned int max_sectors)
{
if ((max_sectors << 9) < PAGE_CACHE_SIZE) {
max_sectors = 1 << (PAGE_CACHE_SHIFT - 9);
@@ -321,6 +321,14 @@
}
q->max_sectors = max_sectors;
+
+ /*
+ * use a max of 1MB as a starting point
+ */
+ q->optimal_sectors[READ] = max_sectors;
+ if (q->optimal_sectors[READ] > 2048)
+ q->optimal_sectors[READ] = 2048;
+ q->optimal_sectors[WRITE] = q->optimal_sectors[READ];
}
EXPORT_SYMBOL(blk_queue_max_sectors);
@@ -1018,10 +1026,12 @@
return 1;
}
-static int ll_back_merge_fn(request_queue_t *q, struct request *req,
+static int ll_back_merge_fn(request_queue_t *q, struct request *req,
struct bio *bio)
{
- if (req->nr_sectors + bio_sectors(bio) > q->max_sectors) {
+ const int rw = rq_data_dir(req);
+
+ if (req->nr_sectors + bio_sectors(bio) > q->optimal_sectors[rw]) {
req->flags |= REQ_NOMERGE;
q->last_merge = NULL;
return 0;
@@ -1033,10 +1043,12 @@
return ll_new_hw_segment(q, req, bio);
}
-static int ll_front_merge_fn(request_queue_t *q, struct request *req,
+static int ll_front_merge_fn(request_queue_t *q, struct request *req,
struct bio *bio)
{
- if (req->nr_sectors + bio_sectors(bio) > q->max_sectors) {
+ const int rw = rq_data_dir(req);
+
+ if (req->nr_sectors + bio_sectors(bio) > q->optimal_sectors[rw]) {
req->flags |= REQ_NOMERGE;
q->last_merge = NULL;
return 0;
@@ -1053,6 +1065,7 @@
{
int total_phys_segments = req->nr_phys_segments +next->nr_phys_segments;
int total_hw_segments = req->nr_hw_segments + next->nr_hw_segments;
+ const int rw = rq_data_dir(req);
/*
* First check if the either of the requests are re-queued
@@ -1064,7 +1077,7 @@
/*
* Will it become to large?
*/
- if ((req->nr_sectors + next->nr_sectors) > q->max_sectors)
+ if ((req->nr_sectors + next->nr_sectors) > q->optimal_sectors[rw])
return 0;
total_phys_segments = req->nr_phys_segments + next->nr_phys_segments;
@@ -2312,9 +2325,9 @@
__blk_put_request(q, freereq);
if (blk_queue_plugged(q)) {
- int nr_queued = q->rq.count[READ] + q->rq.count[WRITE];
+ int nrq = q->rq.count[READ] + q->rq.count[WRITE] - q->in_flight;
- if (nr_queued == q->unplug_thresh)
+ if (nrq == q->unplug_thresh)
__generic_unplug_device(q);
}
spin_unlock_irq(q->queue_lock);
@@ -2575,6 +2588,74 @@
rq->nr_hw_segments = nr_hw_segs;
}
+#define BLK_IORATE_SAMPLES (512)
+#define BLK_IORATE_ALIGNMASK (7) /* 4kb alignment */
+#define BLK_IORATE_TARGET (50) /* 50ms latency */
+#define BLK_IORATE_ALIGN(x) (x) = (((x) + BLK_IORATE_ALIGNMASK) & ~BLK_IORATE_ALIGNMASK)
+
+static inline void blk_calc_stream_rate(struct request *rq, int nbytes)
+{
+ request_queue_t *q = rq->q;
+ unsigned int iorate, max_sectors;
+ unsigned long duration;
+ struct blk_iorate *ior;
+ int rw;
+
+ if (!blk_fs_request(rq))
+ return;
+ if (unlikely(!q))
+ return;
+
+ rw = rq_data_dir(rq);
+ ior = &q->iorates[rw];
+ duration = jiffies - rq->issue_time;
+
+ ior->io_rate_samples++;
+ if (!(ior->io_rate_samples & (BLK_IORATE_SAMPLES - 1))) {
+ ior->io_rate_samples = 0;
+ ior->best_io_rate = 0;
+ }
+
+ duration = max(jiffies - rq->start_time, 1UL);
+ iorate = nbytes / duration;
+ if (iorate > ior->best_io_rate)
+ ior->best_io_rate = iorate;
+
+ /*
+ * average old optimal sectors with best data rate
+ */
+ if (!ior->io_rate_samples) {
+ struct backing_dev_info *bdi = &q->backing_dev_info;
+ int ra_sec = bdi->ra_pages << (PAGE_CACHE_SHIFT - 9);
+ int ra_sec_max = VM_MAX_READAHEAD << 9;
+
+ max_sectors = (ior->best_io_rate * BLK_IORATE_TARGET) >> 9;
+ q->optimal_sectors[rw] = (q->optimal_sectors[rw] + max_sectors) / 2;
+ BLK_IORATE_ALIGN(q->optimal_sectors[rw]);
+
+ /*
+ * don't exceed the hardware limit
+ */
+ if (q->optimal_sectors[rw] > q->max_sectors)
+ q->optimal_sectors[rw] = q->max_sectors;
+
+ /*
+ * let read-ahead be optimal io size aligned, and no more than
+ * a single request
+ */
+ if (rw == READ) {
+ if (ra_sec > q->optimal_sectors[READ])
+ ra_sec = q->optimal_sectors[READ];
+ if (ra_sec > ra_sec_max)
+ ra_sec = ra_sec_max;
+
+ bdi->ra_pages = ra_sec >> (PAGE_CACHE_SHIFT - 9);
+ }
+
+ printk("%s: %s optimal sectors %u (ra %lu)\n", rq->rq_disk->disk_name, rw == WRITE ? "WRITE" : "READ", q->optimal_sectors[rw], bdi->ra_pages);
+ }
+}
+
void blk_recalc_rq_sectors(struct request *rq, int nsect)
{
if (blk_fs_request(rq)) {
@@ -2628,7 +2709,8 @@
printk("end_request: I/O error, dev %s, sector %llu\n",
req->rq_disk ? req->rq_disk->disk_name : "?",
(unsigned long long)req->sector);
- }
+ } else
+ blk_calc_stream_rate(req, nr_bytes);
total_bytes = bio_nbytes = 0;
while ((bio = req->bio)) {
===== include/linux/blkdev.h 1.138 vs edited =====
--- 1.138/include/linux/blkdev.h Fri Mar 12 10:33:07 2004
+++ edited/include/linux/blkdev.h Tue Mar 30 13:00:51 2004
@@ -122,6 +122,7 @@
struct gendisk *rq_disk;
int errors;
unsigned long start_time;
+ unsigned long issue_time;
/* Number of scatter-gather DMA addr+len pairs after
* physical address coalescing is performed.
@@ -267,6 +268,11 @@
atomic_t refcnt; /* map can be shared */
};
+struct blk_iorate {
+ unsigned int io_rate_samples;
+ unsigned int best_io_rate;
+};
+
struct request_queue
{
/*
@@ -337,11 +343,12 @@
*/
unsigned long nr_requests; /* Max # of requests */
- unsigned short max_sectors;
unsigned short max_phys_segments;
unsigned short max_hw_segments;
unsigned short hardsect_size;
unsigned int max_segment_size;
+ unsigned int max_sectors;
+ unsigned int optimal_sectors[2];
unsigned long seg_boundary_mask;
unsigned int dma_alignment;
@@ -350,6 +357,10 @@
atomic_t refcnt;
+ unsigned short in_flight;
+
+ struct blk_iorate iorates[2];
+
/*
* sg stuff
*/
@@ -378,6 +389,9 @@
#define blk_fs_request(rq) ((rq)->flags & REQ_CMD)
#define blk_pc_request(rq) ((rq)->flags & REQ_BLOCK_PC)
#define blk_noretry_request(rq) ((rq)->flags & REQ_FAILFAST)
+#define blk_rq_started(rq) ((rq)->flags & REQ_STARTED)
+
+#define blk_account_rq(rq) (blk_rq_started(rq) && blk_fs_request(rq))
#define blk_pm_suspend_request(rq) ((rq)->flags & REQ_PM_SUSPEND)
#define blk_pm_resume_request(rq) ((rq)->flags & REQ_PM_RESUME)
@@ -558,7 +572,7 @@
extern void blk_cleanup_queue(request_queue_t *);
extern void blk_queue_make_request(request_queue_t *, make_request_fn *);
extern void blk_queue_bounce_limit(request_queue_t *, u64);
-extern void blk_queue_max_sectors(request_queue_t *, unsigned short);
+extern void blk_queue_max_sectors(request_queue_t *, unsigned int);
extern void blk_queue_max_phys_segments(request_queue_t *, unsigned short);
extern void blk_queue_max_hw_segments(request_queue_t *, unsigned short);
extern void blk_queue_max_segment_size(request_queue_t *, unsigned int);
--
Jens Axboe
next prev parent reply other threads:[~2004-03-30 11:10 UTC|newest]
Thread overview: 116+ messages / expand[flat|nested] mbox.gz Atom feed top
2004-03-27 22:37 [PATCH] speed up SATA Jeff Garzik
2004-03-27 23:04 ` Stefan Smietanowski
2004-03-27 23:11 ` Jeff Garzik
2004-03-28 7:23 ` Stefan Smietanowski
2004-03-28 15:37 ` Bartlomiej Zolnierkiewicz
2004-03-27 23:32 ` Bartlomiej Zolnierkiewicz
2004-03-27 23:36 ` Jeff Garzik
2004-03-27 23:40 ` Jeff Garzik
2004-03-28 0:13 ` Bartlomiej Zolnierkiewicz
2004-03-28 0:08 ` Jeff Garzik
2004-03-29 11:42 ` Pavel Machek
2004-03-27 23:37 ` Nick Piggin
2004-03-27 23:44 ` Jeff Garzik
2004-03-27 23:47 ` Nick Piggin
2004-03-27 23:59 ` Jeff Garzik
2004-03-28 14:10 ` Jens Axboe
2004-03-28 17:31 ` Jeff Garzik
2004-03-28 17:35 ` Jens Axboe
2004-03-28 17:48 ` Jeff Garzik
2004-03-28 17:54 ` Jens Axboe
2004-03-28 18:08 ` Jamie Lokier
2004-03-28 18:15 ` Jens Axboe
2004-03-28 18:55 ` Jeff Garzik
2004-03-29 8:09 ` Jens Axboe
2004-03-29 12:41 ` Jamie Lokier
2004-03-29 12:44 ` Jens Axboe
2004-03-29 12:50 ` Jamie Lokier
2004-03-29 13:05 ` Arjan van de Ven
2004-03-29 13:08 ` Jens Axboe
2004-03-30 8:13 ` Kurt Garloff
2004-03-30 11:40 ` Jens Axboe
2004-03-29 17:19 ` Craig I. Hagan
2004-03-29 18:19 ` Jeff Garzik
2004-03-28 19:06 ` Jeff Garzik
2004-03-28 18:12 ` William Lee Irwin III
2004-03-28 18:17 ` Jens Axboe
2004-03-28 18:30 ` Bartlomiej Zolnierkiewicz
2004-03-28 18:30 ` Jens Axboe
2004-03-28 18:45 ` Bartlomiej Zolnierkiewicz
2004-03-28 18:59 ` Jeff Garzik
2004-03-28 20:32 ` Andrew Morton
2004-03-28 20:45 ` Jeff Garzik
2004-03-29 0:55 ` Andrea Arcangeli
2004-03-29 4:02 ` Jeff Garzik
2004-03-29 13:04 ` Andrea Arcangeli
2004-03-29 19:45 ` Jeff Garzik
2004-03-30 11:09 ` Jens Axboe [this message]
2004-03-30 15:54 ` Timothy Miller
2004-03-30 16:20 ` Jeff Garzik
2004-03-30 18:05 ` Timothy Miller
2004-03-30 17:50 ` Jeff Garzik
2004-03-30 18:19 ` Timothy Miller
2004-03-29 4:29 ` Wim Coekaerts
2004-03-29 7:32 ` Denis Vlasenko
2004-03-29 8:13 ` Jens Axboe
2004-03-29 13:05 ` Andrea Arcangeli
2004-03-29 4:31 ` William Lee Irwin III
2004-03-29 4:57 ` Jeff Garzik
2004-03-28 19:52 ` Nuno Silva
2004-03-28 20:02 ` Jeff Garzik
2004-03-28 0:06 ` Jeff Garzik
2004-03-28 0:15 ` Nick Piggin
2004-03-28 0:49 ` Jeff Garzik
2004-03-28 1:02 ` Andrew Morton
2004-03-28 1:09 ` Jeff Garzik
2004-03-28 13:59 ` Jens Axboe
2004-03-28 17:29 ` Jeff Garzik
2004-03-28 17:31 ` Jens Axboe
2004-03-28 13:51 ` Jamie Lokier
2004-03-28 17:24 ` Jeff Garzik
2004-03-28 17:36 ` Jamie Lokier
2004-03-28 17:54 ` Jeff Garzik
2004-03-28 20:50 ` Eric D. Mudama
2004-04-02 10:11 ` Jeremy Higdon
2004-04-02 16:11 ` Jamie Lokier
2004-04-03 10:48 ` Jeremy Higdon
2004-04-03 13:49 ` Jamie Lokier
2004-03-28 17:40 ` Jens Axboe
2004-03-28 17:49 ` Jeff Garzik
2004-03-28 17:55 ` Jens Axboe
2004-03-28 18:04 ` Jeff Garzik
2004-03-28 18:09 ` Jens Axboe
2004-03-28 20:12 ` Jeff Garzik
2004-03-28 20:54 ` Eric D. Mudama
2004-03-28 7:32 ` Stefan Smietanowski
2004-03-28 20:25 ` Jeff Garzik
2004-03-28 21:16 ` Stefan Smietanowski
2004-03-28 21:26 ` Jeff Garzik
2004-03-28 14:08 ` Jens Axboe
2004-03-28 17:38 ` Jeff Garzik
2004-03-28 17:45 ` Jens Axboe
2004-03-28 20:21 ` Jeff Garzik
2004-03-28 0:07 ` Andrew Morton
2004-03-28 0:21 ` Nick Piggin
2004-03-28 4:40 ` Eric D. Mudama
2004-03-28 6:56 ` Nick Piggin
2004-03-28 20:33 ` Eric D. Mudama
2004-03-28 20:59 ` Eric D. Mudama
2004-03-29 1:30 ` Nick Piggin
2004-03-29 5:24 ` Eric D. Mudama
2004-03-29 13:03 ` Jamie Lokier
2004-03-29 11:36 ` Pavel Machek
2004-03-29 18:46 ` David Lang
2004-03-29 20:13 ` Jeff Garzik
2004-03-30 5:55 ` Eric D. Mudama
2004-03-30 11:54 ` Marc Bevand
2004-03-30 13:07 ` Jens Axboe
2004-03-30 13:48 ` Marc Bevand
2004-03-30 13:49 ` Jens Axboe
2004-03-30 15:31 ` Jeff Garzik
2004-03-30 17:42 ` Jeff Garzik
2004-03-31 9:12 ` Marc Bevand
2004-03-30 12:16 ` Marc Bevand
-- strict thread matches above, loose matches on Subject: below --
2004-03-31 5:47 Marcus Hartig
2004-03-31 6:56 ` Jeff Garzik
2004-03-31 16:07 ` Marcus Hartig
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20040330110928.GR24370@suse.de \
--to=axboe@suse.de \
--cc=B.Zolnierkiewicz@elka.pw.edu.pl \
--cc=akpm@osdl.org \
--cc=andrea@suse.de \
--cc=jgarzik@pobox.com \
--cc=linux-ide@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=nickpiggin@yahoo.com.au \
--cc=wli@holomorphy.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.