* [PATCH] cfq-iosched: rework seeky detection
@ 2010-01-09 15:59 Corrado Zoccolo
2010-01-11 1:47 ` Shaohua Li
` (2 more replies)
0 siblings, 3 replies; 21+ messages in thread
From: Corrado Zoccolo @ 2010-01-09 15:59 UTC (permalink / raw)
To: Jens Axboe
Cc: Linux-Kernel, Jeff Moyer, Vivek Goyal, Shaohua Li, Gui Jianfeng,
Yanmin Zhang, Corrado Zoccolo
Current seeky detection is based on average seek length.
This is suboptimal, since the average will not distinguish between:
* a process doing medium sized seeks
* a process doing some sequential requests interleaved with larger seeks
and even a medium seek can take a lot of time, if the requested sector
happens to be behind the disk head in the rotation (50% probability).
Therefore, we change the seeky queue detection to work as follows:
* each request can be classified as sequential if it is very close to
the current head position, i.e. it is likely in the disk cache (disks
usually read more data than requested, and put it in cache for
subsequent reads). Otherwise, the request is classified as seeky.
* a history window of the last 32 requests is kept, storing the
classification result.
* A queue is marked as seeky if more than 1/8 of the last 32 requests
were seeky.
This patch fixes a regression reported by Yanmin, on mmap 64k random
reads.
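For illustration, the history-based classification described above can be sketched in plain C as a userspace approximation (not the kernel code itself; the kernel's hweight32() is stood in for by GCC's __builtin_popcount(), and the threshold constant mirrors the patch's CFQQ_SEEK_THR):

```c
#include <stdint.h>

/* Userspace sketch of the detection scheme (illustrative only).
 * Distances are in sectors; one bit of history is kept per request. */
#define SEEK_THR  ((uint64_t)(8 * 100))             /* as in the patch */
#define SEEKY(h)  (__builtin_popcount(h) > 32 / 8)  /* > 1/8 of last 32 */

static uint32_t record_request(uint32_t history, uint64_t dist)
{
	history <<= 1;                  /* oldest of the 32 samples drops out */
	history |= (dist > SEEK_THR);   /* 1 = seeky, 0 = sequential */
	return history;
}
```

A queue stays non-seeky as long as at most 4 of its last 32 requests landed far from the current head position.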
Reported-by: Yanmin Zhang <yanmin_zhang@linux.intel.com>
Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com>
---
block/cfq-iosched.c | 54 +++++++++++++-------------------------------------
1 files changed, 14 insertions(+), 40 deletions(-)
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index c6d5678..4e203c4 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -133,9 +133,7 @@ struct cfq_queue {
unsigned short ioprio, org_ioprio;
unsigned short ioprio_class, org_ioprio_class;
- unsigned int seek_samples;
- u64 seek_total;
- sector_t seek_mean;
+ u32 seek_history;
sector_t last_request_pos;
unsigned long seeky_start;
@@ -1658,22 +1656,13 @@ static inline sector_t cfq_dist_from_last(struct cfq_data *cfqd,
return cfqd->last_position - blk_rq_pos(rq);
}
-#define CFQQ_SEEK_THR 8 * 1024
-#define CFQQ_SEEKY(cfqq) ((cfqq)->seek_mean > CFQQ_SEEK_THR)
+#define CFQQ_SEEK_THR (sector_t)(8 * 100)
+#define CFQQ_SEEKY(cfqq) (hweight32(cfqq->seek_history) > 32/8)
static inline int cfq_rq_close(struct cfq_data *cfqd, struct cfq_queue *cfqq,
struct request *rq, bool for_preempt)
{
- sector_t sdist = cfqq->seek_mean;
-
- if (!sample_valid(cfqq->seek_samples))
- sdist = CFQQ_SEEK_THR;
-
- /* if seek_mean is big, using it as close criteria is meaningless */
- if (sdist > CFQQ_SEEK_THR && !for_preempt)
- sdist = CFQQ_SEEK_THR;
-
- return cfq_dist_from_last(cfqd, rq) <= sdist;
+ return cfq_dist_from_last(cfqd, rq) <= CFQQ_SEEK_THR;
}
static struct cfq_queue *cfqq_close(struct cfq_data *cfqd,
@@ -2971,30 +2960,16 @@ static void
cfq_update_io_seektime(struct cfq_data *cfqd, struct cfq_queue *cfqq,
struct request *rq)
{
- sector_t sdist;
- u64 total;
-
- if (!cfqq->last_request_pos)
- sdist = 0;
- else if (cfqq->last_request_pos < blk_rq_pos(rq))
- sdist = blk_rq_pos(rq) - cfqq->last_request_pos;
- else
- sdist = cfqq->last_request_pos - blk_rq_pos(rq);
-
- /*
- * Don't allow the seek distance to get too large from the
- * odd fragment, pagein, etc
- */
- if (cfqq->seek_samples <= 60) /* second&third seek */
- sdist = min(sdist, (cfqq->seek_mean * 4) + 2*1024*1024);
- else
- sdist = min(sdist, (cfqq->seek_mean * 4) + 2*1024*64);
+ sector_t sdist = 0;
+ if (cfqq->last_request_pos) {
+ if (cfqq->last_request_pos < blk_rq_pos(rq))
+ sdist = blk_rq_pos(rq) - cfqq->last_request_pos;
+ else
+ sdist = cfqq->last_request_pos - blk_rq_pos(rq);
+ }
- cfqq->seek_samples = (7*cfqq->seek_samples + 256) / 8;
- cfqq->seek_total = (7*cfqq->seek_total + (u64)256*sdist) / 8;
- total = cfqq->seek_total + (cfqq->seek_samples/2);
- do_div(total, cfqq->seek_samples);
- cfqq->seek_mean = (sector_t)total;
+ cfqq->seek_history <<= 1;
+ cfqq->seek_history |= (sdist > CFQQ_SEEK_THR);
/*
* If this cfqq is shared between multiple processes, check to
@@ -3032,8 +3007,7 @@ cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
cfq_mark_cfqq_deep(cfqq);
if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
- (!cfq_cfqq_deep(cfqq) && sample_valid(cfqq->seek_samples)
- && CFQQ_SEEKY(cfqq)))
+ (!cfq_cfqq_deep(cfqq) && CFQQ_SEEKY(cfqq)))
enable_idle = 0;
else if (sample_valid(cic->ttime_samples)) {
if (cic->ttime_mean > cfqd->cfq_slice_idle)
--
1.6.4.4
^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH] cfq-iosched: rework seeky detection
  2010-01-09 15:59 [PATCH] cfq-iosched: rework seeky detection Corrado Zoccolo
@ 2010-01-11 1:47 ` Shaohua Li
  2010-01-11 2:53   ` Gui Jianfeng
  2010-01-11 14:46   ` Corrado Zoccolo
  2010-01-11 16:29 ` Vivek Goyal
  2010-01-12 19:12 ` Vivek Goyal
  2 siblings, 2 replies; 21+ messages in thread
From: Shaohua Li @ 2010-01-11 1:47 UTC (permalink / raw)
  To: Corrado Zoccolo
  Cc: Jens Axboe, Linux-Kernel, Jeff Moyer, Vivek Goyal, Gui Jianfeng,
	Yanmin Zhang

On Sat, Jan 09, 2010 at 11:59:17PM +0800, Corrado Zoccolo wrote:
> Current seeky detection is based on average seek lenght.
> This is suboptimal, since the average will not distinguish between:
> * a process doing medium sized seeks
> * a process doing some sequential requests interleaved with larger seeks
> and even a medium seek can take lot of time, if the requested sector
> happens to be behind the disk head in the rotation (50% probability).
>
> Therefore, we change the seeky queue detection to work as follows:
> * each request can be classified as sequential if it is very close to
> the current head position, i.e. it is likely in the disk cache (disks
> usually read more data than requested, and put it in cache for
> subsequent reads). Otherwise, the request is classified as seeky.
> * an history window of the last 32 requests is kept, storing the
> classification result.
> * A queue is marked as seeky if more than 1/8 of the last 32 requests
> were seeky.
>
> This patch fixes a regression reported by Yanmin, on mmap 64k random
> reads.
Can we not count a big request (say the request data is >= 32k) as seeky
regardless of the seek distance? In this way we can also make a 64k random
sync read not count as seeky.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [PATCH] cfq-iosched: rework seeky detection
  2010-01-11 1:47 ` Shaohua Li
@ 2010-01-11 2:53   ` Gui Jianfeng
  2010-01-11 14:20     ` Jeff Moyer
  1 sibling, 1 reply; 21+ messages in thread
From: Gui Jianfeng @ 2010-01-11 2:53 UTC (permalink / raw)
  To: Shaohua Li, Corrado Zoccolo
  Cc: Jens Axboe, Linux-Kernel, Jeff Moyer, Vivek Goyal, Yanmin Zhang

Shaohua Li wrote:
> On Sat, Jan 09, 2010 at 11:59:17PM +0800, Corrado Zoccolo wrote:
>> Current seeky detection is based on average seek lenght.
>> This is suboptimal, since the average will not distinguish between:
>> * a process doing medium sized seeks
>> * a process doing some sequential requests interleaved with larger seeks
>> and even a medium seek can take lot of time, if the requested sector
>> happens to be behind the disk head in the rotation (50% probability).
>>
>> Therefore, we change the seeky queue detection to work as follows:
>> * each request can be classified as sequential if it is very close to
>> the current head position, i.e. it is likely in the disk cache (disks
>> usually read more data than requested, and put it in cache for
>> subsequent reads). Otherwise, the request is classified as seeky.
>> * an history window of the last 32 requests is kept, storing the
>> classification result.
>> * A queue is marked as seeky if more than 1/8 of the last 32 requests
>> were seeky.
>>
>> This patch fixes a regression reported by Yanmin, on mmap 64k random
>> reads.
> Can we not count a big request (say the request data is >= 32k) as seeky
> regardless the seek distance? In this way we can also make a 64k random sync
> read not as seeky.

Or maybe we can rely on a *dynamic* CFQQ_SEEK_THR, in terms of data length,
to determine whether a request should be a seeky one.

>
> Thanks,
> Shaohua
>
>

--
Regards
Gui Jianfeng

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [PATCH] cfq-iosched: rework seeky detection
  2010-01-11 2:53   ` Gui Jianfeng
@ 2010-01-11 14:20     ` Jeff Moyer
  0 siblings, 0 replies; 21+ messages in thread
From: Jeff Moyer @ 2010-01-11 14:20 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: Shaohua Li, Corrado Zoccolo, Jens Axboe, Linux-Kernel,
	Vivek Goyal, Yanmin Zhang

Gui Jianfeng <guijianfeng@cn.fujitsu.com> writes:
> Shaohua Li wrote:
>> On Sat, Jan 09, 2010 at 11:59:17PM +0800, Corrado Zoccolo wrote:
>>> Current seeky detection is based on average seek lenght.
>>> This is suboptimal, since the average will not distinguish between:
>>> * a process doing medium sized seeks
>>> * a process doing some sequential requests interleaved with larger seeks
>>> and even a medium seek can take lot of time, if the requested sector
>>> happens to be behind the disk head in the rotation (50% probability).
>>>
>>> Therefore, we change the seeky queue detection to work as follows:
>>> * each request can be classified as sequential if it is very close to
>>> the current head position, i.e. it is likely in the disk cache (disks
>>> usually read more data than requested, and put it in cache for
>>> subsequent reads). Otherwise, the request is classified as seeky.
>>> * an history window of the last 32 requests is kept, storing the
>>> classification result.
>>> * A queue is marked as seeky if more than 1/8 of the last 32 requests
>>> were seeky.
>>>
>>> This patch fixes a regression reported by Yanmin, on mmap 64k random
>>> reads.
>> Can we not count a big request (say the request data is >= 32k) as seeky
>> regardless the seek distance? In this way we can also make a 64k random sync
>> read not as seeky.
>
> Or maybe we can rely on *dynamic* CFQQ_SEEK_THR in terms of data lenght to
> determine whether a request should be a seeky one.

I'm not sure I understand the question, but it sounds like you're
assuming that the last_position tracks the beginning of the last I/O.
That's not the case.  It tracks the end of the last I/O, and so it
should not matter what the request size is.

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [PATCH] cfq-iosched: rework seeky detection
  2010-01-11 1:47 ` Shaohua Li
  2010-01-11 2:53   ` Gui Jianfeng
@ 2010-01-11 14:46   ` Corrado Zoccolo
  2010-01-12 1:49     ` Shaohua Li
  1 sibling, 1 reply; 21+ messages in thread
From: Corrado Zoccolo @ 2010-01-11 14:46 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Jens Axboe, Linux-Kernel, Jeff Moyer, Vivek Goyal, Gui Jianfeng,
	Yanmin Zhang

Hi,
On Mon, Jan 11, 2010 at 2:47 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> On Sat, Jan 09, 2010 at 11:59:17PM +0800, Corrado Zoccolo wrote:
>> Current seeky detection is based on average seek lenght.
>> This is suboptimal, since the average will not distinguish between:
>> * a process doing medium sized seeks
>> * a process doing some sequential requests interleaved with larger seeks
>> and even a medium seek can take lot of time, if the requested sector
>> happens to be behind the disk head in the rotation (50% probability).
>>
>> Therefore, we change the seeky queue detection to work as follows:
>> * each request can be classified as sequential if it is very close to
>> the current head position, i.e. it is likely in the disk cache (disks
>> usually read more data than requested, and put it in cache for
>> subsequent reads). Otherwise, the request is classified as seeky.
>> * an history window of the last 32 requests is kept, storing the
>> classification result.
>> * A queue is marked as seeky if more than 1/8 of the last 32 requests
>> were seeky.
>>
>> This patch fixes a regression reported by Yanmin, on mmap 64k random
>> reads.
> Can we not count a big request (say the request data is >= 32k) as seeky
> regardless the seek distance? In this way we can also make a 64k random sync
> read not as seeky.
I think I understand what you are proposing, but I don't think request
size should matter at all for a rotational disk.
Usually, the disk firmware will load a big chunk of data into its cache
even when requested to read a single sector, and will provide the
following ones from the cache if you read them sequentially.

Now, in CFQ, what we really mean by saying that a queue is seeky is that
waiting a bit in order to serve another request from this queue doesn't
give any benefit w.r.t. switching to another queue.
So, if you read a single 64k block from disk and then seek, then you can
service any other request without losing bandwidth.
Instead, if you are reading 4k, then the next ones (and so on up to 64k,
as happens with mmap when you fault in a single page at a time), then it
is convenient to wait for the next request, since it has a 3/4 chance of
being sequential, and so of being serviced from the cache.

I'm currently testing a patch to consider request size in SSDs, instead.
In SSDs, the location of the request doesn't mean anything, but the size
is meaningful. Therefore, submitting together many small requests from
different queues can improve the overall performance.

Thanks,
Corrado

>
> Thanks,
> Shaohua
>

^ permalink raw reply	[flat|nested] 21+ messages in thread
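Corrado's mmap argument can be made concrete with a small sketch. The numbers are assumptions for illustration (a 64k block faulting in as 16 consecutive 4k requests, only the first of which seeks); the 1/8-of-32 rule is the patch's, the helper names are made up, and the kernel's hweight32() is approximated with __builtin_popcount():

```c
#include <stdint.h>

/* 1 = seeky, 0 = sequential, pushed into a 32-bit history window */
#define SEEKY(h) (__builtin_popcount(h) > 32 / 8)

static uint32_t push(uint32_t h, int is_seeky)
{
	return (h << 1) | (is_seeky != 0);
}

/* mmap 64k random reads: each block faults in as one seeking 4k request
 * followed by 15 requests the disk cache satisfies sequentially. */
static int mmap_64k_random_marked_seeky(void)
{
	uint32_t h = 0;

	for (int block = 0; block < 4; block++) {
		h = push(h, 1);                 /* jump to a random block */
		for (int i = 0; i < 15; i++)
			h = push(h, 0);         /* fault in the rest of it */
	}
	/* only 2 of the last 32 requests were seeky, under the 4-bit limit */
	return SEEKY(h);
}
```

So the queue is not marked seeky, and CFQ keeps idling on it, which is what fixes the reported regression.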
* Re: [PATCH] cfq-iosched: rework seeky detection
  2010-01-11 14:46   ` Corrado Zoccolo
@ 2010-01-12 1:49     ` Shaohua Li
  2010-01-12 8:52       ` Corrado Zoccolo
  0 siblings, 1 reply; 21+ messages in thread
From: Shaohua Li @ 2010-01-12 1:49 UTC (permalink / raw)
  To: Corrado Zoccolo
  Cc: Jens Axboe, Linux-Kernel, Jeff Moyer, Vivek Goyal, Gui Jianfeng,
	Yanmin Zhang

On Mon, Jan 11, 2010 at 10:46:23PM +0800, Corrado Zoccolo wrote:
> Hi,
> On Mon, Jan 11, 2010 at 2:47 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> > On Sat, Jan 09, 2010 at 11:59:17PM +0800, Corrado Zoccolo wrote:
> >> Current seeky detection is based on average seek lenght.
> >> This is suboptimal, since the average will not distinguish between:
> >> * a process doing medium sized seeks
> >> * a process doing some sequential requests interleaved with larger seeks
> >> and even a medium seek can take lot of time, if the requested sector
> >> happens to be behind the disk head in the rotation (50% probability).
> >>
> >> Therefore, we change the seeky queue detection to work as follows:
> >> * each request can be classified as sequential if it is very close to
> >> the current head position, i.e. it is likely in the disk cache (disks
> >> usually read more data than requested, and put it in cache for
> >> subsequent reads). Otherwise, the request is classified as seeky.
> >> * an history window of the last 32 requests is kept, storing the
> >> classification result.
> >> * A queue is marked as seeky if more than 1/8 of the last 32 requests
> >> were seeky.
> >>
> >> This patch fixes a regression reported by Yanmin, on mmap 64k random
> >> reads.
> > Can we not count a big request (say the request data is >= 32k) as seeky
> > regardless the seek distance? In this way we can also make a 64k random sync
> > read not as seeky.
> I think I understand what you are proposing, but I don't think request
> size should
> matter at all for rotational disk.
randread with a 32k bs definitely has better throughput than with a 4k bs,
so the request size does matter. From an iops point of view, 64k and 4k
might make no difference to the device, but from a performance point of
view they make a big difference.
> Usually, the disk firmware will load a big chunk of data in its cache even when
> requested to read a single sector, and will provide following ones
> from the cache
> if you read them sequentially.
>
> Now, in CFQ, what we really mean by saying that a queue is seeky is that
> waiting a bit in order to serve an other request from this queue doesn't
> give any benefit w.r.t. switching to an other queue.
If there is no idling, we might switch to a random 4k access or any other
kind of queue. Compared to continuing the big-request access and switching
to another queue with small blocks, not switching does give a benefit.
> So, if you read a single 64k block from disk and then seek, then you can service
> any other request without losing bandwidth.
But the 64k bs queue loses its slice, which means the device serves more
4k accesses. As a result, bandwidth is reduced.
> Instead, if you are reading 4k, then the next ones (and so on up to 64k, as it
> happens with mmap when you fault in a single page at a time), then it
> is convenient
> to wait for the next request, since it has 3/4 of changes to be
> sequential, so be
> serviced by cache.
>
> I'm currently testing a patch to consider request size in SSDs, instead.
> In SSDs, the location of the request doesn't mean anything, but the
> size is meaningful.
> Therefore, submitting together many small requests from different
> queues can improve
> the overall performance.
Agree.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [PATCH] cfq-iosched: rework seeky detection
  2010-01-12 1:49     ` Shaohua Li
@ 2010-01-12 8:52       ` Corrado Zoccolo
  2010-01-13 3:45         ` Shaohua Li
  0 siblings, 1 reply; 21+ messages in thread
From: Corrado Zoccolo @ 2010-01-12 8:52 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Jens Axboe, Linux-Kernel, Jeff Moyer, Vivek Goyal, Gui Jianfeng,
	Yanmin Zhang

Hi
On Tue, Jan 12, 2010 at 2:49 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> On Mon, Jan 11, 2010 at 10:46:23PM +0800, Corrado Zoccolo wrote:
>> Hi,
>> On Mon, Jan 11, 2010 at 2:47 AM, Shaohua Li <shaohua.li@intel.com> wrote:
>> > On Sat, Jan 09, 2010 at 11:59:17PM +0800, Corrado Zoccolo wrote:
>> >> Current seeky detection is based on average seek lenght.
>> >> This is suboptimal, since the average will not distinguish between:
>> >> * a process doing medium sized seeks
>> >> * a process doing some sequential requests interleaved with larger seeks
>> >> and even a medium seek can take lot of time, if the requested sector
>> >> happens to be behind the disk head in the rotation (50% probability).
>> >>
>> >> Therefore, we change the seeky queue detection to work as follows:
>> >> * each request can be classified as sequential if it is very close to
>> >> the current head position, i.e. it is likely in the disk cache (disks
>> >> usually read more data than requested, and put it in cache for
>> >> subsequent reads). Otherwise, the request is classified as seeky.
>> >> * an history window of the last 32 requests is kept, storing the
>> >> classification result.
>> >> * A queue is marked as seeky if more than 1/8 of the last 32 requests
>> >> were seeky.
>> >>
>> >> This patch fixes a regression reported by Yanmin, on mmap 64k random
>> >> reads.
>> > Can we not count a big request (say the request data is >= 32k) as seeky
>> > regardless the seek distance? In this way we can also make a 64k random sync
>> > read not as seeky.
>> I think I understand what you are proposing, but I don't think request
>> size should
>> matter at all for rotational disk.
> randread a 32k bs definitely has better throughput than a 4k bs. So the request
> size does matter. From iops point of view, 64k and 4k might not have difference
> in device, but from performance point of view, they have big difference.
Assume we have two queues, one with 64k requests and another with 4k
requests, and that our ideal disk will service them with the same IOPS 'v'.
Then, servicing the first for 100ms and then the second for 100ms, we will
have, averaging over the 200ms period of the schedule:
  first queue IOPS  = v * 100/200 = v/2
  second queue IOPS = v * 100/200 = v/2
Now the bandwidth will simply be IOPS * request size.
If, instead, you service one request from one queue and one from the other
(and keep switching for 200ms), with v IOPS each queue will again obtain
v/2 IOPS, i.e. exactly the same numbers.

But if we have a 2-disk RAID 0 with stripe >= 64k, and the 64k accesses
are aligned (do not cross the stripe), we will have a 50% probability that
the requests from the 2 queues are serviced in parallel, thus increasing
the total IOPS and bandwidth. This cannot happen if you service a single
depth-1 seeky queue for 100ms.

>
>> Usually, the disk firmware will load a big chunk of data in its cache even when
>> requested to read a single sector, and will provide following ones
>> from the cache
>> if you read them sequentially.
>>
>> Now, in CFQ, what we really mean by saying that a queue is seeky is that
>> waiting a bit in order to serve an other request from this queue doesn't
>> give any benefit w.r.t. switching to an other queue.
> If no idle, we might switch to a random 4k access or any kind of queues. Compared
> to continue big request access and switch to other queue with small block, no switching
> does give benefit.
CFQ in 2.6.33 works differently than it did before.
Now, seeky queues have an aggregate time slice, and within this time slice
you will switch between seeky queues fairly. So it cannot happen that a
seeky queue loses its time slice.

>
>> So, if you read a single 64k block from disk and then seek, then you can service
>> any other request without losing bandwidth.
> But the 64k bs queue loses its slice, which might means device serves more 4k access.
> As a result, reduce bandwidth.
If both queues are backlogged and at the same priority, they will be
serviced fairly. If one queue has a large think time (or lower priority),
the other will be serviced more often.

>
>> Instead, if you are reading 4k, then the next ones (and so on up to 64k, as it
>> happens with mmap when you fault in a single page at a time), then it
>> is convenient
>> to wait for the next request, since it has 3/4 of changes to be
>> sequential, so be
>> serviced by cache.
>>
>> I'm currently testing a patch to consider request size in SSDs, instead.
>> In SSDs, the location of the request doesn't mean anything, but the
>> size is meaningful.
>> Therefore, submitting together many small requests from different
>> queues can improve
>> the overall performance.
> Agree.
>
> Thanks,
> Shaohua
>
Thanks,
Corrado

^ permalink raw reply	[flat|nested] 21+ messages in thread
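The averaging argument above can be checked with toy numbers (all values illustrative: an idealized disk doing a fixed 'v' IOPS regardless of request size, and two depth-1 queues):

```c
/* Either policy - 100ms alternating slices or per-request interleaving
 * over the same 200ms period - yields v/2 IOPS per queue, so only the
 * request size separates the resulting bandwidths. */
static double iops_time_sliced(double v)
{
	return v * 100.0 / 200.0;       /* 100ms out of every 200ms */
}

static double iops_interleaved(double v)
{
	return v / 2.0;                 /* one request each, alternating */
}

static double bandwidth_kb(double iops, double req_kb)
{
	return iops * req_kb;           /* bandwidth = IOPS * request size */
}
```

Under both policies the 64k queue simply gets 16x the bytes of the 4k queue; the scheduling choice changes nothing on a single-spindle ideal disk, which is the RAID 0 point's premise.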
* Re: [PATCH] cfq-iosched: rework seeky detection
  2010-01-12 8:52       ` Corrado Zoccolo
@ 2010-01-13 3:45         ` Shaohua Li
  2010-01-13 7:09           ` Corrado Zoccolo
  0 siblings, 1 reply; 21+ messages in thread
From: Shaohua Li @ 2010-01-13 3:45 UTC (permalink / raw)
  To: Corrado Zoccolo
  Cc: Jens Axboe, Linux-Kernel, Jeff Moyer, Vivek Goyal, Gui Jianfeng,
	Yanmin Zhang

On Tue, Jan 12, 2010 at 04:52:59PM +0800, Corrado Zoccolo wrote:
> Hi
> On Tue, Jan 12, 2010 at 2:49 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> > On Mon, Jan 11, 2010 at 10:46:23PM +0800, Corrado Zoccolo wrote:
> >> Hi,
> >> On Mon, Jan 11, 2010 at 2:47 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> >> > On Sat, Jan 09, 2010 at 11:59:17PM +0800, Corrado Zoccolo wrote:
> >> >> Current seeky detection is based on average seek lenght.
> >> >> This is suboptimal, since the average will not distinguish between:
> >> >> * a process doing medium sized seeks
> >> >> * a process doing some sequential requests interleaved with larger seeks
> >> >> and even a medium seek can take lot of time, if the requested sector
> >> >> happens to be behind the disk head in the rotation (50% probability).
> >> >>
> >> >> Therefore, we change the seeky queue detection to work as follows:
> >> >> * each request can be classified as sequential if it is very close to
> >> >> the current head position, i.e. it is likely in the disk cache (disks
> >> >> usually read more data than requested, and put it in cache for
> >> >> subsequent reads). Otherwise, the request is classified as seeky.
> >> >> * an history window of the last 32 requests is kept, storing the
> >> >> classification result.
> >> >> * A queue is marked as seeky if more than 1/8 of the last 32 requests
> >> >> were seeky.
> >> >>
> >> >> This patch fixes a regression reported by Yanmin, on mmap 64k random
> >> >> reads.
> >> > Can we not count a big request (say the request data is >= 32k) as seeky
> >> > regardless the seek distance? In this way we can also make a 64k random sync
> >> > read not as seeky.
> >> I think I understand what you are proposing, but I don't think request
> >> size should
> >> matter at all for rotational disk.
> > randread a 32k bs definitely has better throughput than a 4k bs. So the request
> > size does matter. From iops point of view, 64k and 4k might not have difference
> > in device, but from performance point of view, they have big difference.
> Assume we have two queues, one with 64k requests, and an other with 4k requests,
> and that our ideal disk will service them with the same IOPS 'v'.
> Then, servicing for 100ms the first, and then for 100ms the second, we
> will have, averaging on the
> 200ms period of the schedule:
> first queue IOPS = v * 100/200 = v/2
> second queue IOPS = v * 100/200 = v/2
> Now the bandwidth will be simply IOPS * request size.
> If instead, you service one request from one queue, and one from the
> other (and keep switching for 200ms),
> with v IOPS, each queue will obtain again v/2 IOPS, i.e. exactly the
> same numbers.
>
> But, instead, if we have a 2-disk RAID 0, with stripe >= 64k, and the
> 64k accesses are aligned (do not cross the stripe), we will have 50%
> probability that the requests from the 2 queues are serviced in
> parallel, thus increasing the total IOPS and bandwidth. This cannot
> happen if you service for 100ms a single depth-1 seeky queue.
>
> >
> >> Usually, the disk firmware will load a big chunk of data in its cache even when
> >> requested to read a single sector, and will provide following ones
> >> from the cache
> >> if you read them sequentially.
> >>
> >> Now, in CFQ, what we really mean by saying that a queue is seeky is that
> >> waiting a bit in order to serve an other request from this queue doesn't
> >> give any benefit w.r.t. switching to an other queue.
> > If no idle, we might switch to a random 4k access or any kind of queues. Compared
> > to continue big request access and switch to other queue with small block, no switching
> > does give benefit.
> CFQ in 2.6.33 works differently than it worked before.
> Now, seeky queues have an aggregate time slice, and within this time
> slice, you will switch
> between seeky queues fairly. So it cannot happen that a seeky queue
> loses its time slice.
Sorry for my ignorance here; from the code, I know we have a forced slice
for a domain and a service tree, but for a queue it appears we don't have
an aggregate time slice. From my understanding, we don't add a queue's
remaining slice to its next run, and a queue might not even initialize its
slice if it is preempted (without timing out) before it finishes its first
request, which is normal for a seeky queue on an NCQ device.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [PATCH] cfq-iosched: rework seeky detection
  2010-01-13 3:45         ` Shaohua Li
@ 2010-01-13 7:09           ` Corrado Zoccolo
  2010-01-13 8:00             ` Shaohua Li
  0 siblings, 1 reply; 21+ messages in thread
From: Corrado Zoccolo @ 2010-01-13 7:09 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Jens Axboe, Linux-Kernel, Jeff Moyer, Vivek Goyal, Gui Jianfeng,
	Yanmin Zhang

On Wed, Jan 13, 2010 at 4:45 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> On Tue, Jan 12, 2010 at 04:52:59PM +0800, Corrado Zoccolo wrote:
>> Hi
>> On Tue, Jan 12, 2010 at 2:49 AM, Shaohua Li <shaohua.li@intel.com> wrote:
>> > On Mon, Jan 11, 2010 at 10:46:23PM +0800, Corrado Zoccolo wrote:
>> >> Hi,
>> >> On Mon, Jan 11, 2010 at 2:47 AM, Shaohua Li <shaohua.li@intel.com> wrote:
>> >> > On Sat, Jan 09, 2010 at 11:59:17PM +0800, Corrado Zoccolo wrote:
>> >> >> Current seeky detection is based on average seek lenght.
>> >> >> This is suboptimal, since the average will not distinguish between:
>> >> >> * a process doing medium sized seeks
>> >> >> * a process doing some sequential requests interleaved with larger seeks
>> >> >> and even a medium seek can take lot of time, if the requested sector
>> >> >> happens to be behind the disk head in the rotation (50% probability).
>> >> >>
>> >> >> Therefore, we change the seeky queue detection to work as follows:
>> >> >> * each request can be classified as sequential if it is very close to
>> >> >> the current head position, i.e. it is likely in the disk cache (disks
>> >> >> usually read more data than requested, and put it in cache for
>> >> >> subsequent reads). Otherwise, the request is classified as seeky.
>> >> >> * an history window of the last 32 requests is kept, storing the
>> >> >> classification result.
>> >> >> * A queue is marked as seeky if more than 1/8 of the last 32 requests
>> >> >> were seeky.
>> >> >>
>> >> >> This patch fixes a regression reported by Yanmin, on mmap 64k random
>> >> >> reads.
>> >> > Can we not count a big request (say the request data is >= 32k) as seeky
>> >> > regardless the seek distance? In this way we can also make a 64k random sync
>> >> > read not as seeky.
>> >> I think I understand what you are proposing, but I don't think request
>> >> size should
>> >> matter at all for rotational disk.
>> > randread a 32k bs definitely has better throughput than a 4k bs. So the request
>> > size does matter. From iops point of view, 64k and 4k might not have difference
>> > in device, but from performance point of view, they have big difference.
>> Assume we have two queues, one with 64k requests, and an other with 4k requests,
>> and that our ideal disk will service them with the same IOPS 'v'.
>> Then, servicing for 100ms the first, and then for 100ms the second, we
>> will have, averaging on the
>> 200ms period of the schedule:
>> first queue IOPS = v * 100/200 = v/2
>> second queue IOPS = v * 100/200 = v/2
>> Now the bandwidth will be simply IOPS * request size.
>> If instead, you service one request from one queue, and one from the
>> other (and keep switching for 200ms),
>> with v IOPS, each queue will obtain again v/2 IOPS, i.e. exactly the
>> same numbers.
>>
>> But, instead, if we have a 2-disk RAID 0, with stripe >= 64k, and the
>> 64k accesses are aligned (do not cross the stripe), we will have 50%
>> probability that the requests from the 2 queues are serviced in
>> parallel, thus increasing the total IOPS and bandwidth. This cannot
>> happen if you service for 100ms a single depth-1 seeky queue.
>>
>> >
>> >> Usually, the disk firmware will load a big chunk of data in its cache even when
>> >> requested to read a single sector, and will provide following ones
>> >> from the cache
>> >> if you read them sequentially.
>> >>
>> >> Now, in CFQ, what we really mean by saying that a queue is seeky is that
>> >> waiting a bit in order to serve an other request from this queue doesn't
>> >> give any benefit w.r.t. switching to an other queue.
>> > If no idle, we might switch to a random 4k access or any kind of queues. Compared
>> > to continue big request access and switch to other queue with small block, no switching
>> > does give benefit.
>> CFQ in 2.6.33 works differently than it worked before.
>> Now, seeky queues have an aggregate time slice, and within this time
>> slice, you will switch
>> between seeky queues fairly. So it cannot happen that a seeky queue
>> loses its time slice.
> Sorry for my ignorance here, from the code, I know we have a forced slice for a domain and
> service tree, but for a queue, it appears we haven't an aggregate time slice.
By aggregate time slice for seeky queues, I mean the time slice assigned
to the sync-noidle service tree.
> From my understanding,
> we don't add a queue's remaining slice to its next run, and queue might not even init its slice if
> it's non-timedout preempted before it finishes its first request, which is normal for a seeky
> queue with a ncq device.
Exactly for this reason, a seeky queue has no private time slice (it would
be meaningless, since we want multiple seeky queues working in parallel),
but it participates fairly in the service tree's slice. The service tree's
slice is computed proportionally to the number of seeky queues w.r.t. all
queues in the domain, so you also have that seeky queues are serviced
fairly w.r.t. the other queues as well.

Thanks,
Corrado

>
> Thanks,
> Shaohua
>

--
__________________________________________________________________________
dott. Corrado Zoccolo                          mailto:czoccolo@gmail.com
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------
The self-confidence of a warrior is not the self-confidence of the average
man. The average man seeks certainty in the eyes of the onlooker and calls
that self-confidence. The warrior seeks impeccability in his own eyes and
calls that humbleness.
                               Tales of Power - C. Castaneda

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [PATCH] cfq-iosched: rework seeky detection 2010-01-13 7:09 ` Corrado Zoccolo @ 2010-01-13 8:00 ` Shaohua Li 2010-01-13 8:09 ` Corrado Zoccolo 0 siblings, 1 reply; 21+ messages in thread From: Shaohua Li @ 2010-01-13 8:00 UTC (permalink / raw) To: Corrado Zoccolo Cc: Jens Axboe, Linux-Kernel, Jeff Moyer, Vivek Goyal, Gui Jianfeng, Yanmin Zhang On Wed, Jan 13, 2010 at 03:09:31PM +0800, Corrado Zoccolo wrote: > On Wed, Jan 13, 2010 at 4:45 AM, Shaohua Li <shaohua.li@intel.com> wrote: > > On Tue, Jan 12, 2010 at 04:52:59PM +0800, Corrado Zoccolo wrote: > >> Hi > >> On Tue, Jan 12, 2010 at 2:49 AM, Shaohua Li <shaohua.li@intel.com> wrote: > >> > On Mon, Jan 11, 2010 at 10:46:23PM +0800, Corrado Zoccolo wrote: > >> >> Hi, > >> >> On Mon, Jan 11, 2010 at 2:47 AM, Shaohua Li <shaohua.li@intel.com> wrote: > >> >> > On Sat, Jan 09, 2010 at 11:59:17PM +0800, Corrado Zoccolo wrote: > >> >> >> Current seeky detection is based on average seek lenght. > >> >> >> This is suboptimal, since the average will not distinguish between: > >> >> >> * a process doing medium sized seeks > >> >> >> * a process doing some sequential requests interleaved with larger seeks > >> >> >> and even a medium seek can take lot of time, if the requested sector > >> >> >> happens to be behind the disk head in the rotation (50% probability). > >> >> >> > >> >> >> Therefore, we change the seeky queue detection to work as follows: > >> >> >> * each request can be classified as sequential if it is very close to > >> >> >> the current head position, i.e. it is likely in the disk cache (disks > >> >> >> usually read more data than requested, and put it in cache for > >> >> >> subsequent reads). Otherwise, the request is classified as seeky. > >> >> >> * an history window of the last 32 requests is kept, storing the > >> >> >> classification result. > >> >> >> * A queue is marked as seeky if more than 1/8 of the last 32 requests > >> >> >> were seeky. 
> >> >> >> > >> >> >> This patch fixes a regression reported by Yanmin, on mmap 64k random > >> >> >> reads. > >> >> > Can we not count a big request (say the request data is >= 32k) as seeky > >> >> > regardless the seek distance? In this way we can also make a 64k random sync > >> >> > read not as seeky. > >> >> I think I understand what you are proposing, but I don't think request > >> >> size should > >> >> matter at all for rotational disk. > >> > randread a 32k bs definitely has better throughput than a 4k bs. So the request > >> > size does matter. From iops point of view, 64k and 4k might not have difference > >> > in device, but from performance point of view, they have big difference. > >> Assume we have two queues, one with 64k requests, and an other with 4k requests, > >> and that our ideal disk will service them with the same IOPS 'v'. > >> Then, servicing for 100ms the first, and then for 100ms the second, we > >> will have, averaging on the > >> 200ms period of the schedule: > >> first queue IOPS = v * 100/200 = v/2 > >> second queue IOPS = v * 100/200 = v/2 > >> Now the bandwidth will be simply IOPS * request size. > >> If instead, you service one request from one queue, and one from the > >> other (and keep switching for 200ms), > >> with v IOPS, each queue will obtain again v/2 IOPS, i.e. exactly the > >> same numbers. > >> > >> But, instead, if we have a 2-disk RAID 0, with stripe >= 64k, and the > >> 64k accesses are aligned (do not cross the stripe), we will have 50% > >> probability that the requests from the 2 queues are serviced in > >> parallel, thus increasing the total IOPS and bandwidth. This cannot > >> happen if you service for 100ms a single depth-1 seeky queue. > >> > >> > > >> >> Usually, the disk firmware will load a big chunk of data in its cache even when > >> >> requested to read a single sector, and will provide following ones > >> >> from the cache > >> >> if you read them sequentially. 
> >> >>
> >> >> Now, in CFQ, what we really mean by saying that a queue is seeky is that
> >> >> waiting a bit in order to serve another request from this queue doesn't
> >> >> give any benefit w.r.t. switching to another queue.
> >> > If there is no idling, we might switch to a random 4k access or any other kind of
> >> > queue. Compared to continuing the big-request access and then switching to another
> >> > queue with small blocks, not switching does give a benefit.
> >> CFQ in 2.6.33 works differently than it worked before.
> >> Now, seeky queues have an aggregate time slice, and within this time
> >> slice, you will switch between seeky queues fairly. So it cannot happen
> >> that a seeky queue loses its time slice.
> > Sorry for my ignorance here; from the code, I know we have a forced slice for a domain
> > and service tree, but for a queue, it appears we don't have an aggregate time slice.
> By aggregate time slice for seeky queues, I mean the time slice
> assigned to the sync-noidle service tree.
>
> > From my understanding,
> > we don't add a queue's remaining slice to its next run, and a queue might not even
> > init its slice if it's non-timedout preempted before it finishes its first request,
> > which is normal for a seeky queue with an NCQ device.
>
> Exactly for this reason, a seeky queue has no private time slice (it is
> meaningless, since we want multiple seeky queues working in parallel), but it
> participates fairly in the service tree's slice. The service tree's slice is
> computed proportionally to the number of seeky queues w.r.t. all queues in the
> domain, so seeky queues are also serviced fairly w.r.t. the other queues.

Ok, I got your point.
An off-topic issue: for a queue with iodepth 1 and a queue with iodepth 32, it
looks like this mechanism can't guarantee fairness: the queue with the big
iodepth can submit more requests on every switch.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [PATCH] cfq-iosched: rework seeky detection 2010-01-13 8:00 ` Shaohua Li @ 2010-01-13 8:09 ` Corrado Zoccolo 0 siblings, 0 replies; 21+ messages in thread From: Corrado Zoccolo @ 2010-01-13 8:09 UTC (permalink / raw) To: Shaohua Li Cc: Jens Axboe, Linux-Kernel, Jeff Moyer, Vivek Goyal, Gui Jianfeng, Yanmin Zhang On Wed, Jan 13, 2010 at 9:00 AM, Shaohua Li <shaohua.li@intel.com> wrote: > On Wed, Jan 13, 2010 at 03:09:31PM +0800, Corrado Zoccolo wrote: >> On Wed, Jan 13, 2010 at 4:45 AM, Shaohua Li <shaohua.li@intel.com> wrote: >> > On Tue, Jan 12, 2010 at 04:52:59PM +0800, Corrado Zoccolo wrote: >> >> Hi >> >> On Tue, Jan 12, 2010 at 2:49 AM, Shaohua Li <shaohua.li@intel.com> wrote: >> >> > On Mon, Jan 11, 2010 at 10:46:23PM +0800, Corrado Zoccolo wrote: >> >> >> Hi, >> >> >> On Mon, Jan 11, 2010 at 2:47 AM, Shaohua Li <shaohua.li@intel.com> wrote: >> >> >> > On Sat, Jan 09, 2010 at 11:59:17PM +0800, Corrado Zoccolo wrote: >> >> >> >> Current seeky detection is based on average seek lenght. >> >> >> >> This is suboptimal, since the average will not distinguish between: >> >> >> >> * a process doing medium sized seeks >> >> >> >> * a process doing some sequential requests interleaved with larger seeks >> >> >> >> and even a medium seek can take lot of time, if the requested sector >> >> >> >> happens to be behind the disk head in the rotation (50% probability). >> >> >> >> >> >> >> >> Therefore, we change the seeky queue detection to work as follows: >> >> >> >> * each request can be classified as sequential if it is very close to >> >> >> >> the current head position, i.e. it is likely in the disk cache (disks >> >> >> >> usually read more data than requested, and put it in cache for >> >> >> >> subsequent reads). Otherwise, the request is classified as seeky. >> >> >> >> * an history window of the last 32 requests is kept, storing the >> >> >> >> classification result. 
>> >> >> >> * A queue is marked as seeky if more than 1/8 of the last 32 requests >> >> >> >> were seeky. >> >> >> >> >> >> >> >> This patch fixes a regression reported by Yanmin, on mmap 64k random >> >> >> >> reads. >> >> >> > Can we not count a big request (say the request data is >= 32k) as seeky >> >> >> > regardless the seek distance? In this way we can also make a 64k random sync >> >> >> > read not as seeky. >> >> >> I think I understand what you are proposing, but I don't think request >> >> >> size should >> >> >> matter at all for rotational disk. >> >> > randread a 32k bs definitely has better throughput than a 4k bs. So the request >> >> > size does matter. From iops point of view, 64k and 4k might not have difference >> >> > in device, but from performance point of view, they have big difference. >> >> Assume we have two queues, one with 64k requests, and an other with 4k requests, >> >> and that our ideal disk will service them with the same IOPS 'v'. >> >> Then, servicing for 100ms the first, and then for 100ms the second, we >> >> will have, averaging on the >> >> 200ms period of the schedule: >> >> first queue IOPS = v * 100/200 = v/2 >> >> second queue IOPS = v * 100/200 = v/2 >> >> Now the bandwidth will be simply IOPS * request size. >> >> If instead, you service one request from one queue, and one from the >> >> other (and keep switching for 200ms), >> >> with v IOPS, each queue will obtain again v/2 IOPS, i.e. exactly the >> >> same numbers. >> >> >> >> But, instead, if we have a 2-disk RAID 0, with stripe >= 64k, and the >> >> 64k accesses are aligned (do not cross the stripe), we will have 50% >> >> probability that the requests from the 2 queues are serviced in >> >> parallel, thus increasing the total IOPS and bandwidth. This cannot >> >> happen if you service for 100ms a single depth-1 seeky queue. 
>> >> >> >> > >> >> >> Usually, the disk firmware will load a big chunk of data in its cache even when >> >> >> requested to read a single sector, and will provide following ones >> >> >> from the cache >> >> >> if you read them sequentially. >> >> >> >> >> >> Now, in CFQ, what we really mean by saying that a queue is seeky is that >> >> >> waiting a bit in order to serve an other request from this queue doesn't >> >> >> give any benefit w.r.t. switching to an other queue. >> >> > If no idle, we might switch to a random 4k access or any kind of queues. Compared >> >> > to continue big request access and switch to other queue with small block, no switching >> >> > does give benefit. >> >> CFQ in 2.6.33 works differently than it worked before. >> >> Now, seeky queues have an aggregate time slice, and within this time >> >> slice, you will switch >> >> between seeky queues fairly. So it cannot happen that a seeky queue >> >> loses its time slice. >> > Sorry for my ignorance here, from the code, I know we have a forced slice for a domain and >> > service tree, but for a queue, it appears we haven't an aggregate time slice. >> By aggregate time slice for seeky queues, I mean the time slice >> assigned to the sync-noidle service tree. >> >> > From my understanding, >> > we don't add a queue's remaining slice to its next run, and queue might not even init its slice if >> > it's non-timedout preempted before it finishes its first request, which is normal for a seeky >> > queue with a ncq device. >> >> Exactly for this reason, a seeky queue has no private time slice (it >> is meaningless, since we want multiple seeky queues working in >> parallel), but it participates fairly to the service tree's slice. The >> service tree's slice is computed proportionally to the number of seeky >> queues w.r.t. all queues in the domain, so you also have that seeky >> queues are serviced fairly w.r.t. other queues as well. > Ok, I got your point. 
> An off-topic issue: for a queue with iodepth 1 and a queue with iodepth 32,
> it looks like this mechanism can't guarantee fairness: the queue with the
> big iodepth can submit more requests on every switch.

Yes. In fact, a queue that reaches large I/O depths will be marked as
SYNC_IDLE, and will have its own dedicated time slice. Your testcase about
cfq_quantum falls in this category.

>
> Thanks,
> Shaohua
>

Thanks,
Corrado

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [PATCH] cfq-iosched: rework seeky detection
  2010-01-09 15:59 [PATCH] cfq-iosched: rework seeky detection Corrado Zoccolo
  2010-01-11  1:47 ` Shaohua Li
@ 2010-01-11 16:29 ` Vivek Goyal
  2010-01-11 16:52   ` Corrado Zoccolo
  2010-01-12 19:12 ` Vivek Goyal
  2 siblings, 1 reply; 21+ messages in thread
From: Vivek Goyal @ 2010-01-11 16:29 UTC (permalink / raw)
To: Corrado Zoccolo
Cc: Jens Axboe, Linux-Kernel, Jeff Moyer, Shaohua Li, Gui Jianfeng,
	Yanmin Zhang

On Sat, Jan 09, 2010 at 04:59:17PM +0100, Corrado Zoccolo wrote:
> Current seeky detection is based on average seek length.
> This is suboptimal, since the average will not distinguish between:
> * a process doing medium-sized seeks
> * a process doing some sequential requests interleaved with larger seeks
> and even a medium seek can take a lot of time, if the requested sector
> happens to be behind the disk head in the rotation (50% probability).
>
> Therefore, we change the seeky queue detection to work as follows:
> * each request can be classified as sequential if it is very close to
> the current head position, i.e. it is likely in the disk cache (disks
> usually read more data than requested, and put it in cache for
> subsequent reads). Otherwise, the request is classified as seeky.
> * a history window of the last 32 requests is kept, storing the
> classification result.
> * A queue is marked as seeky if more than 1/8 of the last 32 requests
> were seeky.
>

Because we are not relying on a long-term average and are looking only at
the last 32 requests, it looks like we will be switching between seeky and
non-seeky much more aggressively.

> This patch fixes a regression reported by Yanmin, on mmap 64k random
> reads.

We never changed the seek logic recently. So if it is a regression, it must
have been introduced by some other change, and we should look at and fix
that too. It is a separate matter that the seeky-queue detection logic
change also gave a performance improvement in this specific case.

IIUC, you are saying that bigger-block-size IO on mmapped files issues a
few 4K requests one after the other and then a big seek. So in such cases
you would rather mark the cfqq as sync-idle and idle on the queue, so that
we can serve the 16 (64K/4K) requests soon and only then incur a large seek.

So do these requests strictly come one after the other, and does request
merging take place?

>
> Reported-by: Yanmin Zhang <yanmin_zhang@linux.intel.com>
> Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com>
> ---
>  block/cfq-iosched.c |   54 +++++++++++++-------------------------------------
>  1 files changed, 14 insertions(+), 40 deletions(-)
>
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index c6d5678..4e203c4 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -133,9 +133,7 @@ struct cfq_queue {
> 	unsigned short ioprio, org_ioprio;
> 	unsigned short ioprio_class, org_ioprio_class;
>
> -	unsigned int seek_samples;
> -	u64 seek_total;
> -	sector_t seek_mean;
> +	u32 seek_history;
> 	sector_t last_request_pos;
> 	unsigned long seeky_start;
>
> @@ -1658,22 +1656,13 @@ static inline sector_t cfq_dist_from_last(struct cfq_data *cfqd,
> 	return cfqd->last_position - blk_rq_pos(rq);
> }
>
> -#define CFQQ_SEEK_THR		8 * 1024
> -#define CFQQ_SEEKY(cfqq)	((cfqq)->seek_mean > CFQQ_SEEK_THR)
> +#define CFQQ_SEEK_THR		(sector_t)(8 * 100)

What's the rationale behind changing CFQQ_SEEK_THR from 8*1024 to 8*100?
Vivek > +#define CFQQ_SEEKY(cfqq) (hweight32(cfqq->seek_history) > 32/8) > > static inline int cfq_rq_close(struct cfq_data *cfqd, struct cfq_queue *cfqq, > struct request *rq, bool for_preempt) > { > - sector_t sdist = cfqq->seek_mean; > - > - if (!sample_valid(cfqq->seek_samples)) > - sdist = CFQQ_SEEK_THR; > - > - /* if seek_mean is big, using it as close criteria is meaningless */ > - if (sdist > CFQQ_SEEK_THR && !for_preempt) > - sdist = CFQQ_SEEK_THR; > - > - return cfq_dist_from_last(cfqd, rq) <= sdist; > + return cfq_dist_from_last(cfqd, rq) <= CFQQ_SEEK_THR; > } > > static struct cfq_queue *cfqq_close(struct cfq_data *cfqd, > @@ -2971,30 +2960,16 @@ static void > cfq_update_io_seektime(struct cfq_data *cfqd, struct cfq_queue *cfqq, > struct request *rq) > { > - sector_t sdist; > - u64 total; > - > - if (!cfqq->last_request_pos) > - sdist = 0; > - else if (cfqq->last_request_pos < blk_rq_pos(rq)) > - sdist = blk_rq_pos(rq) - cfqq->last_request_pos; > - else > - sdist = cfqq->last_request_pos - blk_rq_pos(rq); > - > - /* > - * Don't allow the seek distance to get too large from the > - * odd fragment, pagein, etc > - */ > - if (cfqq->seek_samples <= 60) /* second&third seek */ > - sdist = min(sdist, (cfqq->seek_mean * 4) + 2*1024*1024); > - else > - sdist = min(sdist, (cfqq->seek_mean * 4) + 2*1024*64); > + sector_t sdist = 0; > + if (cfqq->last_request_pos) { > + if (cfqq->last_request_pos < blk_rq_pos(rq)) > + sdist = blk_rq_pos(rq) - cfqq->last_request_pos; > + else > + sdist = cfqq->last_request_pos - blk_rq_pos(rq); > + } > > - cfqq->seek_samples = (7*cfqq->seek_samples + 256) / 8; > - cfqq->seek_total = (7*cfqq->seek_total + (u64)256*sdist) / 8; > - total = cfqq->seek_total + (cfqq->seek_samples/2); > - do_div(total, cfqq->seek_samples); > - cfqq->seek_mean = (sector_t)total; > + cfqq->seek_history <<= 1; > + cfqq->seek_history |= (sdist > CFQQ_SEEK_THR); > > /* > * If this cfqq is shared between multiple processes, check to > @@ -3032,8 +3007,7 @@ 
cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq, > cfq_mark_cfqq_deep(cfqq); > > if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle || > - (!cfq_cfqq_deep(cfqq) && sample_valid(cfqq->seek_samples) > - && CFQQ_SEEKY(cfqq))) > + (!cfq_cfqq_deep(cfqq) && CFQQ_SEEKY(cfqq))) > enable_idle = 0; > else if (sample_valid(cic->ttime_samples)) { > if (cic->ttime_mean > cfqd->cfq_slice_idle) > -- > 1.6.4.4 ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH] cfq-iosched: rework seeky detection 2010-01-11 16:29 ` Vivek Goyal @ 2010-01-11 16:52 ` Corrado Zoccolo 0 siblings, 0 replies; 21+ messages in thread From: Corrado Zoccolo @ 2010-01-11 16:52 UTC (permalink / raw) To: Vivek Goyal Cc: Jens Axboe, Linux-Kernel, Jeff Moyer, Shaohua Li, Gui Jianfeng, Yanmin Zhang On Mon, Jan 11, 2010 at 5:29 PM, Vivek Goyal <vgoyal@redhat.com> wrote: > On Sat, Jan 09, 2010 at 04:59:17PM +0100, Corrado Zoccolo wrote: >> Current seeky detection is based on average seek lenght. >> This is suboptimal, since the average will not distinguish between: >> * a process doing medium sized seeks >> * a process doing some sequential requests interleaved with larger seeks >> and even a medium seek can take lot of time, if the requested sector >> happens to be behind the disk head in the rotation (50% probability). >> >> Therefore, we change the seeky queue detection to work as follows: >> * each request can be classified as sequential if it is very close to >> the current head position, i.e. it is likely in the disk cache (disks >> usually read more data than requested, and put it in cache for >> subsequent reads). Otherwise, the request is classified as seeky. >> * an history window of the last 32 requests is kept, storing the >> classification result. >> * A queue is marked as seeky if more than 1/8 of the last 32 requests >> were seeky. >> > > Because we are not relying on long term average and looking at only last > 32 requests, looks like we will be switching between seeky to non seeky > much more aggressively. Hi Vivek, I hope this is not the case. I remember you observed instability in your tests. If you can re-run such tests, we can see if the instability is reduced or increased. > >> This patch fixes a regression reported by Yanmin, on mmap 64k random >> reads. > > We never changed the seek logic recently. So if it is a regression it must > have been introduced by some other change and we should look and fix that > too. 
We didn't change the seek logic, but we changed other code that exposed the
bug in it. The cause of the regression is that in 2.6.32, with low_latency=1,
even queues that were marked as seeky still had an idle slice (so the metric
didn't matter), while now we jump immediately to a new queue.
So, since we removed the idling, we have to fix the bug in the metric
(exposed by my other code change) that caused this regression, by making the
seeky detection better.
BTW, the issue only shows up with a really large number of processes doing
I/O, so that the disk cache is reclaimed before the new request arrives.

>
> That's a different thing that the seeky-queue detection logic change also
> gave a performance improvement in this specific case.
>
> IIUC, you are saying that bigger-block-size IO on mmapped files issues a
> few 4K requests one after the other and then a big seek. So in such cases
> you would rather mark the cfqq as sync-idle and idle on the queue, so that
> we can serve the 16 (64K/4K) requests soon and only then incur a large seek.

Yes. Assuming a sequential request takes 1ms (the worst case for external USB
disks) and a random one takes 8ms, 15 sequential requests plus 1 seek complete
in 23ms, which is only ~44% more than the completion time of 16 sequential
requests. The threshold is defined as the point at which roughly 50% of the
disk bandwidth is achieved. This doesn't consider think time, though.

>
> So do these requests strictly come one after the other, and does request
> merging take place?

Request merging doesn't take place: since with mmap the read is triggered by
a page fault, the process is waiting for the first request to complete before
issuing the next one.
Thanks, Corrado > >> >> Reported-by: Yanmin Zhang <yanmin_zhang@linux.intel.com> >> Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com> >> --- >> block/cfq-iosched.c | 54 +++++++++++++------------------------------------- >> 1 files changed, 14 insertions(+), 40 deletions(-) >> >> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c >> index c6d5678..4e203c4 100644 >> --- a/block/cfq-iosched.c >> +++ b/block/cfq-iosched.c >> @@ -133,9 +133,7 @@ struct cfq_queue { >> unsigned short ioprio, org_ioprio; >> unsigned short ioprio_class, org_ioprio_class; >> >> - unsigned int seek_samples; >> - u64 seek_total; >> - sector_t seek_mean; >> + u32 seek_history; >> sector_t last_request_pos; >> unsigned long seeky_start; >> >> @@ -1658,22 +1656,13 @@ static inline sector_t cfq_dist_from_last(struct cfq_data *cfqd, >> return cfqd->last_position - blk_rq_pos(rq); >> } >> >> -#define CFQQ_SEEK_THR 8 * 1024 >> -#define CFQQ_SEEKY(cfqq) ((cfqq)->seek_mean > CFQQ_SEEK_THR) >> +#define CFQQ_SEEK_THR (sector_t)(8 * 100) > > What's the rational behind changing CFQQ_SEEK_THR from 8*1024 to 8*100? 
> > Vivek > >> +#define CFQQ_SEEKY(cfqq) (hweight32(cfqq->seek_history) > 32/8) >> >> static inline int cfq_rq_close(struct cfq_data *cfqd, struct cfq_queue *cfqq, >> struct request *rq, bool for_preempt) >> { >> - sector_t sdist = cfqq->seek_mean; >> - >> - if (!sample_valid(cfqq->seek_samples)) >> - sdist = CFQQ_SEEK_THR; >> - >> - /* if seek_mean is big, using it as close criteria is meaningless */ >> - if (sdist > CFQQ_SEEK_THR && !for_preempt) >> - sdist = CFQQ_SEEK_THR; >> - >> - return cfq_dist_from_last(cfqd, rq) <= sdist; >> + return cfq_dist_from_last(cfqd, rq) <= CFQQ_SEEK_THR; >> } >> >> static struct cfq_queue *cfqq_close(struct cfq_data *cfqd, >> @@ -2971,30 +2960,16 @@ static void >> cfq_update_io_seektime(struct cfq_data *cfqd, struct cfq_queue *cfqq, >> struct request *rq) >> { >> - sector_t sdist; >> - u64 total; >> - >> - if (!cfqq->last_request_pos) >> - sdist = 0; >> - else if (cfqq->last_request_pos < blk_rq_pos(rq)) >> - sdist = blk_rq_pos(rq) - cfqq->last_request_pos; >> - else >> - sdist = cfqq->last_request_pos - blk_rq_pos(rq); >> - >> - /* >> - * Don't allow the seek distance to get too large from the >> - * odd fragment, pagein, etc >> - */ >> - if (cfqq->seek_samples <= 60) /* second&third seek */ >> - sdist = min(sdist, (cfqq->seek_mean * 4) + 2*1024*1024); >> - else >> - sdist = min(sdist, (cfqq->seek_mean * 4) + 2*1024*64); >> + sector_t sdist = 0; >> + if (cfqq->last_request_pos) { >> + if (cfqq->last_request_pos < blk_rq_pos(rq)) >> + sdist = blk_rq_pos(rq) - cfqq->last_request_pos; >> + else >> + sdist = cfqq->last_request_pos - blk_rq_pos(rq); >> + } >> >> - cfqq->seek_samples = (7*cfqq->seek_samples + 256) / 8; >> - cfqq->seek_total = (7*cfqq->seek_total + (u64)256*sdist) / 8; >> - total = cfqq->seek_total + (cfqq->seek_samples/2); >> - do_div(total, cfqq->seek_samples); >> - cfqq->seek_mean = (sector_t)total; >> + cfqq->seek_history <<= 1; >> + cfqq->seek_history |= (sdist > CFQQ_SEEK_THR); >> >> /* >> * If this cfqq is shared 
between multiple processes, check to >> @@ -3032,8 +3007,7 @@ cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq, >> cfq_mark_cfqq_deep(cfqq); >> >> if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle || >> - (!cfq_cfqq_deep(cfqq) && sample_valid(cfqq->seek_samples) >> - && CFQQ_SEEKY(cfqq))) >> + (!cfq_cfqq_deep(cfqq) && CFQQ_SEEKY(cfqq))) >> enable_idle = 0; >> else if (sample_valid(cic->ttime_samples)) { >> if (cic->ttime_mean > cfqd->cfq_slice_idle) >> -- >> 1.6.4.4 > -- __________________________________________________________________________ dott. Corrado Zoccolo mailto:czoccolo@gmail.com PhD - Department of Computer Science - University of Pisa, Italy -------------------------------------------------------------------------- The self-confidence of a warrior is not the self-confidence of the average man. The average man seeks certainty in the eyes of the onlooker and calls that self-confidence. The warrior seeks impeccability in his own eyes and calls that humbleness. Tales of Power - C. Castaneda ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH] cfq-iosched: rework seeky detection 2010-01-09 15:59 [PATCH] cfq-iosched: rework seeky detection Corrado Zoccolo 2010-01-11 1:47 ` Shaohua Li 2010-01-11 16:29 ` Vivek Goyal @ 2010-01-12 19:12 ` Vivek Goyal 2010-01-12 20:05 ` Corrado Zoccolo 2 siblings, 1 reply; 21+ messages in thread From: Vivek Goyal @ 2010-01-12 19:12 UTC (permalink / raw) To: Corrado Zoccolo Cc: Jens Axboe, Linux-Kernel, Jeff Moyer, Shaohua Li, Gui Jianfeng, Yanmin Zhang On Sat, Jan 09, 2010 at 04:59:17PM +0100, Corrado Zoccolo wrote: > Current seeky detection is based on average seek lenght. > This is suboptimal, since the average will not distinguish between: > * a process doing medium sized seeks > * a process doing some sequential requests interleaved with larger seeks > and even a medium seek can take lot of time, if the requested sector > happens to be behind the disk head in the rotation (50% probability). > > Therefore, we change the seeky queue detection to work as follows: > * each request can be classified as sequential if it is very close to > the current head position, i.e. it is likely in the disk cache (disks > usually read more data than requested, and put it in cache for > subsequent reads). Otherwise, the request is classified as seeky. > * an history window of the last 32 requests is kept, storing the > classification result. > * A queue is marked as seeky if more than 1/8 of the last 32 requests > were seeky. > > This patch fixes a regression reported by Yanmin, on mmap 64k random > reads. > Ok, I did basic testing of this patch on my hardware. I got a RAID-0 configuration and there are 12 disks behind it. I ran 8 fio mmap random read processes with block size 64K and following are the results. 
Vanilla (3 runs)
===============
aggrb=3,564KB/s (cfq)
aggrb=3,600KB/s (cfq)
aggrb=3,607KB/s (cfq)

aggrb=3,992KB/s (deadline)
aggrb=3,953KB/s (deadline)
aggrb=3,991KB/s (deadline)

Patched kernel (3 runs)
=======================
aggrb=2,080KB/s (cfq)
aggrb=2,100KB/s (cfq)
aggrb=2,124KB/s (cfq)

My fio script
=============
[global]
directory=/mnt/sda/fio/
size=8G
direct=0
runtime=30
ioscheduler=cfq
exec_prerun="echo 3 > /proc/sys/vm/drop_caches"
group_reporting=1
ioengine=mmap
rw=randread
bs=64K

[randread]
numjobs=8
=================================

There seems to be more than a 45% regression in this case.

I have not run blktrace, but I suspect it must be coming from the fact that
we are now treating a random queue as sync-idle and hence driving queue depth
at 1. But the fact is that readahead must not be kicking in, so we are not
using the parallelism this striped set of disks can give us.

So treating this kind of cfqq as sync-idle seems to be a bad idea, at least
on configurations where multiple disks are in a RAID configuration.

For yanmin's case, he seems to be running a case where there is only a single
spindle and multiple processes doing IO on that spindle. So I guess it does
not suffer from the fact that we are driving queue depth at 1.

Why did I include some "deadline" numbers also? Just like that. Recently I
have also become interested in comparing CFQ with "deadline" for various
workloads to see which one performs better in what circumstances.
Thanks Vivek > Reported-by: Yanmin Zhang <yanmin_zhang@linux.intel.com> > Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com> > --- > block/cfq-iosched.c | 54 +++++++++++++------------------------------------- > 1 files changed, 14 insertions(+), 40 deletions(-) > > diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c > index c6d5678..4e203c4 100644 > --- a/block/cfq-iosched.c > +++ b/block/cfq-iosched.c > @@ -133,9 +133,7 @@ struct cfq_queue { > unsigned short ioprio, org_ioprio; > unsigned short ioprio_class, org_ioprio_class; > > - unsigned int seek_samples; > - u64 seek_total; > - sector_t seek_mean; > + u32 seek_history; > sector_t last_request_pos; > unsigned long seeky_start; > > @@ -1658,22 +1656,13 @@ static inline sector_t cfq_dist_from_last(struct cfq_data *cfqd, > return cfqd->last_position - blk_rq_pos(rq); > } > > -#define CFQQ_SEEK_THR 8 * 1024 > -#define CFQQ_SEEKY(cfqq) ((cfqq)->seek_mean > CFQQ_SEEK_THR) > +#define CFQQ_SEEK_THR (sector_t)(8 * 100) > +#define CFQQ_SEEKY(cfqq) (hweight32(cfqq->seek_history) > 32/8) > > static inline int cfq_rq_close(struct cfq_data *cfqd, struct cfq_queue *cfqq, > struct request *rq, bool for_preempt) > { > - sector_t sdist = cfqq->seek_mean; > - > - if (!sample_valid(cfqq->seek_samples)) > - sdist = CFQQ_SEEK_THR; > - > - /* if seek_mean is big, using it as close criteria is meaningless */ > - if (sdist > CFQQ_SEEK_THR && !for_preempt) > - sdist = CFQQ_SEEK_THR; > - > - return cfq_dist_from_last(cfqd, rq) <= sdist; > + return cfq_dist_from_last(cfqd, rq) <= CFQQ_SEEK_THR; > } > > static struct cfq_queue *cfqq_close(struct cfq_data *cfqd, > @@ -2971,30 +2960,16 @@ static void > cfq_update_io_seektime(struct cfq_data *cfqd, struct cfq_queue *cfqq, > struct request *rq) > { > - sector_t sdist; > - u64 total; > - > - if (!cfqq->last_request_pos) > - sdist = 0; > - else if (cfqq->last_request_pos < blk_rq_pos(rq)) > - sdist = blk_rq_pos(rq) - cfqq->last_request_pos; > - else > - sdist = cfqq->last_request_pos - 
blk_rq_pos(rq); > - > - /* > - * Don't allow the seek distance to get too large from the > - * odd fragment, pagein, etc > - */ > - if (cfqq->seek_samples <= 60) /* second&third seek */ > - sdist = min(sdist, (cfqq->seek_mean * 4) + 2*1024*1024); > - else > - sdist = min(sdist, (cfqq->seek_mean * 4) + 2*1024*64); > + sector_t sdist = 0; > + if (cfqq->last_request_pos) { > + if (cfqq->last_request_pos < blk_rq_pos(rq)) > + sdist = blk_rq_pos(rq) - cfqq->last_request_pos; > + else > + sdist = cfqq->last_request_pos - blk_rq_pos(rq); > + } > > - cfqq->seek_samples = (7*cfqq->seek_samples + 256) / 8; > - cfqq->seek_total = (7*cfqq->seek_total + (u64)256*sdist) / 8; > - total = cfqq->seek_total + (cfqq->seek_samples/2); > - do_div(total, cfqq->seek_samples); > - cfqq->seek_mean = (sector_t)total; > + cfqq->seek_history <<= 1; > + cfqq->seek_history |= (sdist > CFQQ_SEEK_THR); > > /* > * If this cfqq is shared between multiple processes, check to > @@ -3032,8 +3007,7 @@ cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq, > cfq_mark_cfqq_deep(cfqq); > > if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle || > - (!cfq_cfqq_deep(cfqq) && sample_valid(cfqq->seek_samples) > - && CFQQ_SEEKY(cfqq))) > + (!cfq_cfqq_deep(cfqq) && CFQQ_SEEKY(cfqq))) > enable_idle = 0; > else if (sample_valid(cic->ttime_samples)) { > if (cic->ttime_mean > cfqd->cfq_slice_idle) > -- > 1.6.4.4 ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH] cfq-iosched: rework seeky detection 2010-01-12 19:12 ` Vivek Goyal @ 2010-01-12 20:05 ` Corrado Zoccolo 2010-01-12 22:36 ` Vivek Goyal 0 siblings, 1 reply; 21+ messages in thread From: Corrado Zoccolo @ 2010-01-12 20:05 UTC (permalink / raw) To: Vivek Goyal Cc: Jens Axboe, Linux-Kernel, Jeff Moyer, Shaohua Li, Gui Jianfeng, Yanmin Zhang Hi Vivek, On Tue, Jan 12, 2010 at 8:12 PM, Vivek Goyal <vgoyal@redhat.com> wrote: > On Sat, Jan 09, 2010 at 04:59:17PM +0100, Corrado Zoccolo wrote: >> Current seeky detection is based on average seek lenght. >> This is suboptimal, since the average will not distinguish between: >> * a process doing medium sized seeks >> * a process doing some sequential requests interleaved with larger seeks >> and even a medium seek can take lot of time, if the requested sector >> happens to be behind the disk head in the rotation (50% probability). >> >> Therefore, we change the seeky queue detection to work as follows: >> * each request can be classified as sequential if it is very close to >> the current head position, i.e. it is likely in the disk cache (disks >> usually read more data than requested, and put it in cache for >> subsequent reads). Otherwise, the request is classified as seeky. >> * an history window of the last 32 requests is kept, storing the >> classification result. >> * A queue is marked as seeky if more than 1/8 of the last 32 requests >> were seeky. >> >> This patch fixes a regression reported by Yanmin, on mmap 64k random >> reads. >> > > Ok, I did basic testing of this patch on my hardware. I got a RAID-0 > configuration and there are 12 disks behind it. I ran 8 fio mmap random > read processes with block size 64K and following are the results. 
>
> Vanilla (3 runs)
> ===============
> aggrb=3,564KB/s (cfq)
> aggrb=3,600KB/s (cfq)
> aggrb=3,607KB/s (cfq)
>
> aggrb=3,992KB/s (deadline)
> aggrb=3,953KB/s (deadline)
> aggrb=3,991KB/s (deadline)
>
> Patched kernel (3 runs)
> =======================
> aggrb=2,080KB/s (cfq)
> aggrb=2,100KB/s (cfq)
> aggrb=2,124KB/s (cfq)
>
> My fio script
> =============
> [global]
> directory=/mnt/sda/fio/
> size=8G
> direct=0
> runtime=30
> ioscheduler=cfq
> exec_prerun="echo 3 > /proc/sys/vm/drop_caches"
> group_reporting=1
> ioengine=mmap
> rw=randread
> bs=64K
>
> [randread]
> numjobs=8
> =================================
>
> There seems to be more than 45% regression in this case.
>
> I have not run the blktrace, but I suspect it must be coming from the fact
> that we are now treating a random queue as sync-idle hence driving queue
> depth as 1. But the fact is that readahead must not be kicking in, so
> we are not using the power of parallel processing this striped set of
> disks can do for us.
>
Yes. Those results are expected, and are the other side of the coin.
If we handle those queues as sync-idle, we get better performance on a
single disk (and a regression on RAIDs), and vice versa if we handle
them as sync-noidle.

Note that this is limited to mmap with large block size. Normal
read/pread is not affected.

> So treating this kind of cfqq as sync-idle seems to be a bad idea at least
> on configurations where multiple disks are in a raid configuration.

The fact is, can we reliably determine which of those two setups we
have from cfq? Until we can, we should optimize for the most common case.

Also, does the performance drop when the number of processes
approaches 8*number of spindles? I think it should, so we will need to
identify exactly the number of spindles to be able to allow only the
right amount of parallelism.

> For yanmin's case, he seems to be running a case where there is only single
> spindle and multiple processes doing IO on that spindle.
So I guess it > does not suffer from the fact that we are driving queue depth as 1. Yes. Yanmin's configuration is a JBOD, i.e. a multi disk configuration in which no partition spans multiple disks, so you have at most 1 spindle per partition, possibly shared with others. > > Why did I include some "deadline" numbers also? Just like that. Recently > I have also become interested in comaparing CFQ with "deadline" for > various workloads and see which one performs better in what > circumstances. > Sounds sensible, especially for RAID configurations it can be a good comparison. Thanks, Corrado > Thanks > Vivek > >> Reported-by: Yanmin Zhang <yanmin_zhang@linux.intel.com> >> Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com> >> --- >> block/cfq-iosched.c | 54 +++++++++++++------------------------------------- >> 1 files changed, 14 insertions(+), 40 deletions(-) >> >> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c >> index c6d5678..4e203c4 100644 >> --- a/block/cfq-iosched.c >> +++ b/block/cfq-iosched.c >> @@ -133,9 +133,7 @@ struct cfq_queue { >> unsigned short ioprio, org_ioprio; >> unsigned short ioprio_class, org_ioprio_class; >> >> - unsigned int seek_samples; >> - u64 seek_total; >> - sector_t seek_mean; >> + u32 seek_history; >> sector_t last_request_pos; >> unsigned long seeky_start; >> >> @@ -1658,22 +1656,13 @@ static inline sector_t cfq_dist_from_last(struct cfq_data *cfqd, >> return cfqd->last_position - blk_rq_pos(rq); >> } >> >> -#define CFQQ_SEEK_THR 8 * 1024 >> -#define CFQQ_SEEKY(cfqq) ((cfqq)->seek_mean > CFQQ_SEEK_THR) >> +#define CFQQ_SEEK_THR (sector_t)(8 * 100) >> +#define CFQQ_SEEKY(cfqq) (hweight32(cfqq->seek_history) > 32/8) >> >> static inline int cfq_rq_close(struct cfq_data *cfqd, struct cfq_queue *cfqq, >> struct request *rq, bool for_preempt) >> { >> - sector_t sdist = cfqq->seek_mean; >> - >> - if (!sample_valid(cfqq->seek_samples)) >> - sdist = CFQQ_SEEK_THR; >> - >> - /* if seek_mean is big, using it as close criteria is 
meaningless */ >> - if (sdist > CFQQ_SEEK_THR && !for_preempt) >> - sdist = CFQQ_SEEK_THR; >> - >> - return cfq_dist_from_last(cfqd, rq) <= sdist; >> + return cfq_dist_from_last(cfqd, rq) <= CFQQ_SEEK_THR; >> } >> >> static struct cfq_queue *cfqq_close(struct cfq_data *cfqd, >> @@ -2971,30 +2960,16 @@ static void >> cfq_update_io_seektime(struct cfq_data *cfqd, struct cfq_queue *cfqq, >> struct request *rq) >> { >> - sector_t sdist; >> - u64 total; >> - >> - if (!cfqq->last_request_pos) >> - sdist = 0; >> - else if (cfqq->last_request_pos < blk_rq_pos(rq)) >> - sdist = blk_rq_pos(rq) - cfqq->last_request_pos; >> - else >> - sdist = cfqq->last_request_pos - blk_rq_pos(rq); >> - >> - /* >> - * Don't allow the seek distance to get too large from the >> - * odd fragment, pagein, etc >> - */ >> - if (cfqq->seek_samples <= 60) /* second&third seek */ >> - sdist = min(sdist, (cfqq->seek_mean * 4) + 2*1024*1024); >> - else >> - sdist = min(sdist, (cfqq->seek_mean * 4) + 2*1024*64); >> + sector_t sdist = 0; >> + if (cfqq->last_request_pos) { >> + if (cfqq->last_request_pos < blk_rq_pos(rq)) >> + sdist = blk_rq_pos(rq) - cfqq->last_request_pos; >> + else >> + sdist = cfqq->last_request_pos - blk_rq_pos(rq); >> + } >> >> - cfqq->seek_samples = (7*cfqq->seek_samples + 256) / 8; >> - cfqq->seek_total = (7*cfqq->seek_total + (u64)256*sdist) / 8; >> - total = cfqq->seek_total + (cfqq->seek_samples/2); >> - do_div(total, cfqq->seek_samples); >> - cfqq->seek_mean = (sector_t)total; >> + cfqq->seek_history <<= 1; >> + cfqq->seek_history |= (sdist > CFQQ_SEEK_THR); >> >> /* >> * If this cfqq is shared between multiple processes, check to >> @@ -3032,8 +3007,7 @@ cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq, >> cfq_mark_cfqq_deep(cfqq); >> >> if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle || >> - (!cfq_cfqq_deep(cfqq) && sample_valid(cfqq->seek_samples) >> - && CFQQ_SEEKY(cfqq))) >> + (!cfq_cfqq_deep(cfqq) && CFQQ_SEEKY(cfqq))) >> enable_idle = 
0; >> else if (sample_valid(cic->ttime_samples)) { >> if (cic->ttime_mean > cfqd->cfq_slice_idle) >> -- >> 1.6.4.4 > -- __________________________________________________________________________ dott. Corrado Zoccolo mailto:czoccolo@gmail.com PhD - Department of Computer Science - University of Pisa, Italy -------------------------------------------------------------------------- The self-confidence of a warrior is not the self-confidence of the average man. The average man seeks certainty in the eyes of the onlooker and calls that self-confidence. The warrior seeks impeccability in his own eyes and calls that humbleness. Tales of Power - C. Castaneda ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH] cfq-iosched: rework seeky detection 2010-01-12 20:05 ` Corrado Zoccolo @ 2010-01-12 22:36 ` Vivek Goyal 2010-01-12 23:17 ` Corrado Zoccolo 0 siblings, 1 reply; 21+ messages in thread From: Vivek Goyal @ 2010-01-12 22:36 UTC (permalink / raw) To: Corrado Zoccolo Cc: Jens Axboe, Linux-Kernel, Jeff Moyer, Shaohua Li, Gui Jianfeng, Yanmin Zhang On Tue, Jan 12, 2010 at 09:05:29PM +0100, Corrado Zoccolo wrote: > Hi Vivek, > On Tue, Jan 12, 2010 at 8:12 PM, Vivek Goyal <vgoyal@redhat.com> wrote: > > On Sat, Jan 09, 2010 at 04:59:17PM +0100, Corrado Zoccolo wrote: > >> Current seeky detection is based on average seek lenght. > >> This is suboptimal, since the average will not distinguish between: > >> * a process doing medium sized seeks > >> * a process doing some sequential requests interleaved with larger seeks > >> and even a medium seek can take lot of time, if the requested sector > >> happens to be behind the disk head in the rotation (50% probability). > >> > >> Therefore, we change the seeky queue detection to work as follows: > >> * each request can be classified as sequential if it is very close to > >> the current head position, i.e. it is likely in the disk cache (disks > >> usually read more data than requested, and put it in cache for > >> subsequent reads). Otherwise, the request is classified as seeky. > >> * an history window of the last 32 requests is kept, storing the > >> classification result. > >> * A queue is marked as seeky if more than 1/8 of the last 32 requests > >> were seeky. > >> > >> This patch fixes a regression reported by Yanmin, on mmap 64k random > >> reads. > >> > > > > Ok, I did basic testing of this patch on my hardware. I got a RAID-0 > > configuration and there are 12 disks behind it. I ran 8 fio mmap random > > read processes with block size 64K and following are the results. 
> > > > Vanilla (3 runs) > > =============== > > aggrb=3,564KB/s (cfq) > > aggrb=3,600KB/s (cfq) > > aggrb=3,607KB/s (cfq) > > > > aggrb=3,992KB/s,(deadline) > > aggrb=3,953KB/s (deadline) > > aggrb=3,991KB/s (deadline) > > > > Patched kernel (3 runs) > > ======================= > > aggrb=2,080KB/s (cfq) > > aggrb=2,100KB/s (cfq) > > aggrb=2,124KB/s (cfq) > > > > My fio script > > ============= > > [global] > > directory=/mnt/sda/fio/ > > size=8G > > direct=0 > > runtime=30 > > ioscheduler=cfq > > exec_prerun="echo 3 > /proc/sys/vm/drop_caches" > > group_reporting=1 > > ioengine=mmap > > rw=randread > > bs=64K > > > > [randread] > > numjobs=8 > > ================================= > > > > There seems to be around more than 45% regression in this case. > > > > I have not run the blktrace, but I suspect it must be coming from the fact > > that we are now treating a random queue as sync-idle hence driving queue > > depth as 1. But the fact is that read ahead much not be kicking in, so > > we are not using the power of parallel processing this striped set of > > disks can do for us. > > > Yes. Those results are expected, and are the other side of the medal. > If we handle those queues as sync-idle, we get better performance > on single disk (and regression on RAIDs), and viceversa if we handle > them as sync-noidle. > > Note that this is limited to mmap with large block size. Normal read/pread > is not affected. > > > So treating this kind of cfqq as sync-idle seems to be a bad idea atleast > > on configurations where multiple disks are in raid configuration. > > The fact is, can we reliably determine which of those two setups we > have from cfq? I have no idea at this point of time but it looks like determining this will help. May be something like keep a track of number of processes on "sync-noidle" tree and average read times when sync-noidle tree is being served. 
Over a period of time we need to monitor what's the number of processes
(threshold) after which the average read time goes up. For sync-noidle we
can then drive "queue_depth=nr_threshold" and once queue depth reaches
that, then idle on the process. So for a single spindle, I guess the
tipping point will be 2 processes and we can idle on the sync-noidle
process. For more spindles, the tipping point will be higher.

These are just some random thoughts.

> Until we can, we should optimize for the most common case.

Hard to say what's the common case: single rotational disks, or enterprise
storage with multiple disks behind RAID cards?

> Also, does the performance drop when the number of processes
> approaches 8*number of spindles?

I think here the performance drop will be limited by queue depth. So once
you have more than 32 processes driving queue depth 32, it should not
matter how many processes you launch in parallel.

I have collected some numbers for running 1, 2, 4, 8, 16, 32 and 64
threads in parallel to see how throughput varies with the vanilla kernel
and with your patch.

Vanilla kernel
==============
aggrb=2,771KB/s,
aggrb=2,779KB/s,
aggrb=3,084KB/s,
aggrb=3,623KB/s,
aggrb=3,847KB/s,
aggrb=3,940KB/s,
aggrb=4,216KB/s,

Patched kernel
==============
aggrb=2,778KB/s,
aggrb=2,447KB/s,
aggrb=2,240KB/s,
aggrb=2,182KB/s,
aggrb=2,082KB/s,
aggrb=2,033KB/s,
aggrb=1,672KB/s,

With the vanilla kernel, throughput rises as the number of threads doing
IO increases; with the patched kernel it falls as the number of threads
rises. This is not pretty.

Thanks
Vivek

> I think it should, so we will need to identify exactly the number of
> spindles to be able to
> allow only the right amount of parallelism.
>
> > For yanmin's case, he seems to be running a case where there is only single
> > spindle and multiple processes doing IO on that spindle. So I guess it
> > does not suffer from the fact that we are driving queue depth as 1.
>
> Yes. Yanmin's configuration is a JBOD, i.e.
a multi disk configuration > in which no partition spans multiple disks, so you have at most 1 spindle > per partition, possibly shared with others. > > > > > Why did I include some "deadline" numbers also? Just like that. Recently > > I have also become interested in comaparing CFQ with "deadline" for > > various workloads and see which one performs better in what > > circumstances. > > > Sounds sensible, especially for RAID configurations it can be a good comparison. > > Thanks, > Corrado > > > Thanks > > Vivek > > > >> Reported-by: Yanmin Zhang <yanmin_zhang@linux.intel.com> > >> Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com> > >> --- > >> block/cfq-iosched.c | 54 +++++++++++++------------------------------------- > >> 1 files changed, 14 insertions(+), 40 deletions(-) > >> > >> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c > >> index c6d5678..4e203c4 100644 > >> --- a/block/cfq-iosched.c > >> +++ b/block/cfq-iosched.c > >> @@ -133,9 +133,7 @@ struct cfq_queue { > >> unsigned short ioprio, org_ioprio; > >> unsigned short ioprio_class, org_ioprio_class; > >> > >> - unsigned int seek_samples; > >> - u64 seek_total; > >> - sector_t seek_mean; > >> + u32 seek_history; > >> sector_t last_request_pos; > >> unsigned long seeky_start; > >> > >> @@ -1658,22 +1656,13 @@ static inline sector_t cfq_dist_from_last(struct cfq_data *cfqd, > >> return cfqd->last_position - blk_rq_pos(rq); > >> } > >> > >> -#define CFQQ_SEEK_THR 8 * 1024 > >> -#define CFQQ_SEEKY(cfqq) ((cfqq)->seek_mean > CFQQ_SEEK_THR) > >> +#define CFQQ_SEEK_THR (sector_t)(8 * 100) > >> +#define CFQQ_SEEKY(cfqq) (hweight32(cfqq->seek_history) > 32/8) > >> > >> static inline int cfq_rq_close(struct cfq_data *cfqd, struct cfq_queue *cfqq, > >> struct request *rq, bool for_preempt) > >> { > >> - sector_t sdist = cfqq->seek_mean; > >> - > >> - if (!sample_valid(cfqq->seek_samples)) > >> - sdist = CFQQ_SEEK_THR; > >> - > >> - /* if seek_mean is big, using it as close criteria is meaningless */ > >> - 
if (sdist > CFQQ_SEEK_THR && !for_preempt) > >> - sdist = CFQQ_SEEK_THR; > >> - > >> - return cfq_dist_from_last(cfqd, rq) <= sdist; > >> + return cfq_dist_from_last(cfqd, rq) <= CFQQ_SEEK_THR; > >> } > >> > >> static struct cfq_queue *cfqq_close(struct cfq_data *cfqd, > >> @@ -2971,30 +2960,16 @@ static void > >> cfq_update_io_seektime(struct cfq_data *cfqd, struct cfq_queue *cfqq, > >> struct request *rq) > >> { > >> - sector_t sdist; > >> - u64 total; > >> - > >> - if (!cfqq->last_request_pos) > >> - sdist = 0; > >> - else if (cfqq->last_request_pos < blk_rq_pos(rq)) > >> - sdist = blk_rq_pos(rq) - cfqq->last_request_pos; > >> - else > >> - sdist = cfqq->last_request_pos - blk_rq_pos(rq); > >> - > >> - /* > >> - * Don't allow the seek distance to get too large from the > >> - * odd fragment, pagein, etc > >> - */ > >> - if (cfqq->seek_samples <= 60) /* second&third seek */ > >> - sdist = min(sdist, (cfqq->seek_mean * 4) + 2*1024*1024); > >> - else > >> - sdist = min(sdist, (cfqq->seek_mean * 4) + 2*1024*64); > >> + sector_t sdist = 0; > >> + if (cfqq->last_request_pos) { > >> + if (cfqq->last_request_pos < blk_rq_pos(rq)) > >> + sdist = blk_rq_pos(rq) - cfqq->last_request_pos; > >> + else > >> + sdist = cfqq->last_request_pos - blk_rq_pos(rq); > >> + } > >> > >> - cfqq->seek_samples = (7*cfqq->seek_samples + 256) / 8; > >> - cfqq->seek_total = (7*cfqq->seek_total + (u64)256*sdist) / 8; > >> - total = cfqq->seek_total + (cfqq->seek_samples/2); > >> - do_div(total, cfqq->seek_samples); > >> - cfqq->seek_mean = (sector_t)total; > >> + cfqq->seek_history <<= 1; > >> + cfqq->seek_history |= (sdist > CFQQ_SEEK_THR); > >> > >> /* > >> * If this cfqq is shared between multiple processes, check to > >> @@ -3032,8 +3007,7 @@ cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq, > >> cfq_mark_cfqq_deep(cfqq); > >> > >> if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle || > >> - (!cfq_cfqq_deep(cfqq) && sample_valid(cfqq->seek_samples) > >> - && 
CFQQ_SEEKY(cfqq))) > >> + (!cfq_cfqq_deep(cfqq) && CFQQ_SEEKY(cfqq))) > >> enable_idle = 0; > >> else if (sample_valid(cic->ttime_samples)) { > >> if (cic->ttime_mean > cfqd->cfq_slice_idle) > >> -- > >> 1.6.4.4 > > > > > > -- > __________________________________________________________________________ > > dott. Corrado Zoccolo mailto:czoccolo@gmail.com > PhD - Department of Computer Science - University of Pisa, Italy > -------------------------------------------------------------------------- > The self-confidence of a warrior is not the self-confidence of the average > man. The average man seeks certainty in the eyes of the onlooker and calls > that self-confidence. The warrior seeks impeccability in his own eyes and > calls that humbleness. > Tales of Power - C. Castaneda ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH] cfq-iosched: rework seeky detection 2010-01-12 22:36 ` Vivek Goyal @ 2010-01-12 23:17 ` Corrado Zoccolo 2010-01-13 8:05 ` Corrado Zoccolo 2010-01-13 20:10 ` Vivek Goyal 0 siblings, 2 replies; 21+ messages in thread From: Corrado Zoccolo @ 2010-01-12 23:17 UTC (permalink / raw) To: Vivek Goyal Cc: Jens Axboe, Linux-Kernel, Jeff Moyer, Shaohua Li, Gui Jianfeng, Yanmin Zhang On Tue, Jan 12, 2010 at 11:36 PM, Vivek Goyal <vgoyal@redhat.com> wrote: > On Tue, Jan 12, 2010 at 09:05:29PM +0100, Corrado Zoccolo wrote: >> Hi Vivek, >> On Tue, Jan 12, 2010 at 8:12 PM, Vivek Goyal <vgoyal@redhat.com> wrote: >> > On Sat, Jan 09, 2010 at 04:59:17PM +0100, Corrado Zoccolo wrote: >> >> Current seeky detection is based on average seek lenght. >> >> This is suboptimal, since the average will not distinguish between: >> >> * a process doing medium sized seeks >> >> * a process doing some sequential requests interleaved with larger seeks >> >> and even a medium seek can take lot of time, if the requested sector >> >> happens to be behind the disk head in the rotation (50% probability). >> >> >> >> Therefore, we change the seeky queue detection to work as follows: >> >> * each request can be classified as sequential if it is very close to >> >> the current head position, i.e. it is likely in the disk cache (disks >> >> usually read more data than requested, and put it in cache for >> >> subsequent reads). Otherwise, the request is classified as seeky. >> >> * an history window of the last 32 requests is kept, storing the >> >> classification result. >> >> * A queue is marked as seeky if more than 1/8 of the last 32 requests >> >> were seeky. >> >> >> >> This patch fixes a regression reported by Yanmin, on mmap 64k random >> >> reads. >> >> >> > >> > Ok, I did basic testing of this patch on my hardware. I got a RAID-0 >> > configuration and there are 12 disks behind it. I ran 8 fio mmap random >> > read processes with block size 64K and following are the results. 
>> > >> > Vanilla (3 runs) >> > =============== >> > aggrb=3,564KB/s (cfq) >> > aggrb=3,600KB/s (cfq) >> > aggrb=3,607KB/s (cfq) >> > >> > aggrb=3,992KB/s,(deadline) >> > aggrb=3,953KB/s (deadline) >> > aggrb=3,991KB/s (deadline) >> > >> > Patched kernel (3 runs) >> > ======================= >> > aggrb=2,080KB/s (cfq) >> > aggrb=2,100KB/s (cfq) >> > aggrb=2,124KB/s (cfq) >> > >> > My fio script >> > ============= >> > [global] >> > directory=/mnt/sda/fio/ >> > size=8G >> > direct=0 >> > runtime=30 >> > ioscheduler=cfq >> > exec_prerun="echo 3 > /proc/sys/vm/drop_caches" >> > group_reporting=1 >> > ioengine=mmap >> > rw=randread >> > bs=64K >> > >> > [randread] >> > numjobs=8 >> > ================================= >> > >> > There seems to be around more than 45% regression in this case. >> > >> > I have not run the blktrace, but I suspect it must be coming from the fact >> > that we are now treating a random queue as sync-idle hence driving queue >> > depth as 1. But the fact is that read ahead much not be kicking in, so >> > we are not using the power of parallel processing this striped set of >> > disks can do for us. >> > >> Yes. Those results are expected, and are the other side of the medal. >> If we handle those queues as sync-idle, we get better performance >> on single disk (and regression on RAIDs), and viceversa if we handle >> them as sync-noidle. >> >> Note that this is limited to mmap with large block size. Normal read/pread >> is not affected. >> >> > So treating this kind of cfqq as sync-idle seems to be a bad idea atleast >> > on configurations where multiple disks are in raid configuration. >> >> The fact is, can we reliably determine which of those two setups we >> have from cfq? > > I have no idea at this point of time but it looks like determining this > will help. > > May be something like keep a track of number of processes on "sync-noidle" > tree and average read times when sync-noidle tree is being served. 
Over a > period of time we need to monitor what's the number of processes > (threshold), after which average read time goes up. For sync-noidle we can > then drive "queue_depth=nr_thrshold" and once queue depth reaches that, > then idle on the process. So for single spindle, I guess tipping point > will be 2 processes and we can idle on sync-noidle process. For more > spindles, tipping point will be higher. > > These are just some random thoughts. It seems reasonable. Something similar to what we do to reduce depth for async writes. Can you see if you get similar BW improvements also for parallel sequential direct I/Os with block size < stripe size? > >> Until we can, we should optimize for the most common case. > > Hard to say what's the common case? Single rotational disks or enterprise > storage with multiple disks behind RAID cards. I think the pattern produced by mmap 64k is uncommon for reading data, while it is common for binaries. And binaries, even in enterprise machines, are usually not put on the large raids. > >> >> Also, does the performance drop when the number of processes >> approaches 8*number of spindles? > > I think here performance drop will be limited by queue depth. So once you > have more than 32 processes driving queue depth 32, it should not matter > how many processes you launch in parallel. Yes. With 12 disks it is unlikely to reach the saturation point. > > I have collected some numbers for running 1,2,4,8,32 and 64 threads in > parallel and see how throughput varies with vanilla kernel and with your > patch. 
> > Vanilla kernel > ============== > aggrb=2,771KB/s, > aggrb=2,779KB/s, > aggrb=3,084KB/s, > aggrb=3,623KB/s, > aggrb=3,847KB/s, > aggrb=3,940KB/s, > aggrb=4,216KB/s, > > Patched kernel > ============== > aggrb=2,778KB/s, > aggrb=2,447KB/s, > aggrb=2,240KB/s, > aggrb=2,182KB/s, > aggrb=2,082KB/s, > aggrb=2,033KB/s, > aggrb=1,672KB/s, > > With vanilla kernel, output is on the rise as number of threads doing IO > incrases and with patched kernel it is falling as number of threads > rise. This is not pretty. This is strange. we force the depth to be 1, but the BW should be stable. What happens if you disable low_latency? And can you compare it with 2.6.32? Thanks, Corrado > > Thanks > Vivek > >> I think it should, so we will need to identify exacly the number of >> spindles to be able to >> allow only the right amount of parallelism. >> >> > For yanmin's case, he seems to be running a case where there is only single >> > spindle and mulitple processes doing IO on that spindle. So I guess it >> > does not suffer from the fact that we are driving queue depth as 1. >> >> Yes. Yanmin's configuration is a JBOD, i.e. a multi disk configuration >> in which no partition spans multiple disks, so you have at most 1 spindle >> per partition, possibly shared with others. >> >> > >> > Why did I include some "deadline" numbers also? Just like that. Recently >> > I have also become interested in comaparing CFQ with "deadline" for >> > various workloads and see which one performs better in what >> > circumstances. >> > >> Sounds sensible, especially for RAID configurations it can be a good comparison. 
>> >> Thanks, >> Corrado >> >> > Thanks >> > Vivek >> > >> >> Reported-by: Yanmin Zhang <yanmin_zhang@linux.intel.com> >> >> Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com> >> >> --- >> >> block/cfq-iosched.c | 54 +++++++++++++------------------------------------- >> >> 1 files changed, 14 insertions(+), 40 deletions(-) >> >> >> >> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c >> >> index c6d5678..4e203c4 100644 >> >> --- a/block/cfq-iosched.c >> >> +++ b/block/cfq-iosched.c >> >> @@ -133,9 +133,7 @@ struct cfq_queue { >> >> unsigned short ioprio, org_ioprio; >> >> unsigned short ioprio_class, org_ioprio_class; >> >> >> >> - unsigned int seek_samples; >> >> - u64 seek_total; >> >> - sector_t seek_mean; >> >> + u32 seek_history; >> >> sector_t last_request_pos; >> >> unsigned long seeky_start; >> >> >> >> @@ -1658,22 +1656,13 @@ static inline sector_t cfq_dist_from_last(struct cfq_data *cfqd, >> >> return cfqd->last_position - blk_rq_pos(rq); >> >> } >> >> >> >> -#define CFQQ_SEEK_THR 8 * 1024 >> >> -#define CFQQ_SEEKY(cfqq) ((cfqq)->seek_mean > CFQQ_SEEK_THR) >> >> +#define CFQQ_SEEK_THR (sector_t)(8 * 100) >> >> +#define CFQQ_SEEKY(cfqq) (hweight32(cfqq->seek_history) > 32/8) >> >> >> >> static inline int cfq_rq_close(struct cfq_data *cfqd, struct cfq_queue *cfqq, >> >> struct request *rq, bool for_preempt) >> >> { >> >> - sector_t sdist = cfqq->seek_mean; >> >> - >> >> - if (!sample_valid(cfqq->seek_samples)) >> >> - sdist = CFQQ_SEEK_THR; >> >> - >> >> - /* if seek_mean is big, using it as close criteria is meaningless */ >> >> - if (sdist > CFQQ_SEEK_THR && !for_preempt) >> >> - sdist = CFQQ_SEEK_THR; >> >> - >> >> - return cfq_dist_from_last(cfqd, rq) <= sdist; >> >> + return cfq_dist_from_last(cfqd, rq) <= CFQQ_SEEK_THR; >> >> } >> >> >> >> static struct cfq_queue *cfqq_close(struct cfq_data *cfqd, >> >> @@ -2971,30 +2960,16 @@ static void >> >> cfq_update_io_seektime(struct cfq_data *cfqd, struct cfq_queue *cfqq, >> >> struct request *rq) >> >> 
{ >> >> - sector_t sdist; >> >> - u64 total; >> >> - >> >> - if (!cfqq->last_request_pos) >> >> - sdist = 0; >> >> - else if (cfqq->last_request_pos < blk_rq_pos(rq)) >> >> - sdist = blk_rq_pos(rq) - cfqq->last_request_pos; >> >> - else >> >> - sdist = cfqq->last_request_pos - blk_rq_pos(rq); >> >> - >> >> - /* >> >> - * Don't allow the seek distance to get too large from the >> >> - * odd fragment, pagein, etc >> >> - */ >> >> - if (cfqq->seek_samples <= 60) /* second&third seek */ >> >> - sdist = min(sdist, (cfqq->seek_mean * 4) + 2*1024*1024); >> >> - else >> >> - sdist = min(sdist, (cfqq->seek_mean * 4) + 2*1024*64); >> >> + sector_t sdist = 0; >> >> + if (cfqq->last_request_pos) { >> >> + if (cfqq->last_request_pos < blk_rq_pos(rq)) >> >> + sdist = blk_rq_pos(rq) - cfqq->last_request_pos; >> >> + else >> >> + sdist = cfqq->last_request_pos - blk_rq_pos(rq); >> >> + } >> >> >> >> - cfqq->seek_samples = (7*cfqq->seek_samples + 256) / 8; >> >> - cfqq->seek_total = (7*cfqq->seek_total + (u64)256*sdist) / 8; >> >> - total = cfqq->seek_total + (cfqq->seek_samples/2); >> >> - do_div(total, cfqq->seek_samples); >> >> - cfqq->seek_mean = (sector_t)total; >> >> + cfqq->seek_history <<= 1; >> >> + cfqq->seek_history |= (sdist > CFQQ_SEEK_THR); >> >> >> >> /* >> >> * If this cfqq is shared between multiple processes, check to >> >> @@ -3032,8 +3007,7 @@ cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq, >> >> cfq_mark_cfqq_deep(cfqq); >> >> >> >> if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle || >> >> - (!cfq_cfqq_deep(cfqq) && sample_valid(cfqq->seek_samples) >> >> - && CFQQ_SEEKY(cfqq))) >> >> + (!cfq_cfqq_deep(cfqq) && CFQQ_SEEKY(cfqq))) >> >> enable_idle = 0; >> >> else if (sample_valid(cic->ttime_samples)) { >> >> if (cic->ttime_mean > cfqd->cfq_slice_idle) >> >> -- >> >> 1.6.4.4 >> > >> >> >> >> -- >> __________________________________________________________________________ >> >> dott. 
Corrado Zoccolo mailto:czoccolo@gmail.com >> PhD - Department of Computer Science - University of Pisa, Italy >> -------------------------------------------------------------------------- >> The self-confidence of a warrior is not the self-confidence of the average >> man. The average man seeks certainty in the eyes of the onlooker and calls >> that self-confidence. The warrior seeks impeccability in his own eyes and >> calls that humbleness. >> Tales of Power - C. Castaneda > -- __________________________________________________________________________ dott. Corrado Zoccolo mailto:czoccolo@gmail.com PhD - Department of Computer Science - University of Pisa, Italy -------------------------------------------------------------------------- The self-confidence of a warrior is not the self-confidence of the average man. The average man seeks certainty in the eyes of the onlooker and calls that self-confidence. The warrior seeks impeccability in his own eyes and calls that humbleness. Tales of Power - C. Castaneda ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH] cfq-iosched: rework seeky detection 2010-01-12 23:17 ` Corrado Zoccolo @ 2010-01-13 8:05 ` Corrado Zoccolo 2010-01-13 20:19 ` Vivek Goyal 2010-01-13 20:10 ` Vivek Goyal 1 sibling, 1 reply; 21+ messages in thread From: Corrado Zoccolo @ 2010-01-13 8:05 UTC (permalink / raw) To: Vivek Goyal Cc: Jens Axboe, Linux-Kernel, Jeff Moyer, Shaohua Li, Gui Jianfeng, Yanmin Zhang On Wed, Jan 13, 2010 at 12:17 AM, Corrado Zoccolo <czoccolo@gmail.com> wrote: > On Tue, Jan 12, 2010 at 11:36 PM, Vivek Goyal <vgoyal@redhat.com> wrote: >>> The fact is, can we reliably determine which of those two setups we >>> have from cfq? >> >> I have no idea at this point of time but it looks like determining this >> will help. >> >> May be something like keep a track of number of processes on "sync-noidle" >> tree and average read times when sync-noidle tree is being served. Over a >> period of time we need to monitor what's the number of processes >> (threshold), after which average read time goes up. For sync-noidle we can >> then drive "queue_depth=nr_thrshold" and once queue depth reaches that, >> then idle on the process. So for single spindle, I guess tipping point >> will be 2 processes and we can idle on sync-noidle process. For more >> spindles, tipping point will be higher. >> >> These are just some random thoughts. > It seems reasonable. I think, though, that the implementation will be complex. We should limit this to request sizes that are <= stripe size (larger requests will hit more disks, and have a much lower optimal queue depth), so we need to add a new service_tree (they will become: SYNC_IDLE_LARGE, SYNC_IDLE_SMALL, SYNC_NOIDLE, ASYNC), and the optimization will apply only to the SYNC_IDLE_SMALL tree. Moreover, we can't just dispatch K queues and then idle on the last one. We need to have a set of K active queues, and wait on any of them. This makes this optimization very complex, and I think for little gain. 
In fact, usually we don't have sequential streams of small requests, unless we misuse mmap or direct I/O. BTW, the mmap problem could be easily fixed by adding madvise(MADV_WILLNEED) to the userspace program, when dealing with data. I think we only have to worry about binaries, here. > Something similar to what we do to reduce depth for async writes. > Can you see if you get similar BW improvements also for parallel > sequential direct I/Os with block size < stripe size? Thanks, Corrado ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH] cfq-iosched: rework seeky detection 2010-01-13 8:05 ` Corrado Zoccolo @ 2010-01-13 20:19 ` Vivek Goyal 0 siblings, 0 replies; 21+ messages in thread From: Vivek Goyal @ 2010-01-13 20:19 UTC (permalink / raw) To: Corrado Zoccolo Cc: Jens Axboe, Linux-Kernel, Jeff Moyer, Shaohua Li, Gui Jianfeng, Yanmin Zhang On Wed, Jan 13, 2010 at 09:05:21AM +0100, Corrado Zoccolo wrote: > On Wed, Jan 13, 2010 at 12:17 AM, Corrado Zoccolo <czoccolo@gmail.com> wrote: > > On Tue, Jan 12, 2010 at 11:36 PM, Vivek Goyal <vgoyal@redhat.com> wrote: > >>> The fact is, can we reliably determine which of those two setups we > >>> have from cfq? > >> > >> I have no idea at this point of time but it looks like determining this > >> will help. > >> > >> May be something like keep a track of number of processes on "sync-noidle" > >> tree and average read times when sync-noidle tree is being served. Over a > >> period of time we need to monitor what's the number of processes > >> (threshold), after which average read time goes up. For sync-noidle we can > >> then drive "queue_depth=nr_thrshold" and once queue depth reaches that, > >> then idle on the process. So for single spindle, I guess tipping point > >> will be 2 processes and we can idle on sync-noidle process. For more > >> spindles, tipping point will be higher. > >> > >> These are just some random thoughts. > > It seems reasonable. > I think, though, that the implementation will be complex. > We should limit this to request sizes that are <= stripe size (larger > requests will hit more disks, and have a much lower optimal queue > depth), so we need to add a new service_tree (they will become: > SYNC_IDLE_LARGE, SYNC_IDLE_SMALL, SYNC_NOIDLE, ASYNC), and the > optimization will apply only to the SYNC_IDLE_SMALL tree. > Moreover, we can't just dispatch K queues and then idle on the last > one. We need to have a set of K active queues, and wait on any of > them. 
This makes this optimization very complex, and I think for > little gain. In fact, usually we don't have sequential streams of > small requests, unless we misuse mmap or direct I/O. I guess one slightly simpler thing could be to determine whether the underlying media is a single disk/spindle or not. So if the optimal queue depth is more than 1, there is most likely more than one spindle and we can drive deeper queue depths and not idle on the mmap process. If the optimal queue depth is 1, then there is a single disk/spindle, and we can mark the mmap process as sync-idle. No need for an extra service tree. But I do agree that even determining the optimal queue depth might turn out to be complex. But in the long run it might be useful information to detect/know whether we are operating on a single disk or an array of disks. I will play around a bit with it if time permits. > BTW, the mmap problem could be easily fixed by adding madvise(MADV_WILLNEED) > to the userspace program, when dealing with data. > I think we only have to worry about binaries, here. > > > Something similar to what we do to reduce depth for async writes. > > Can you see if you get similar BW improvements also for parallel > > sequential direct I/Os with block size < stripe size? > > Thanks, > Corrado ^ permalink raw reply [flat|nested] 21+ messages in thread
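[Editor's note: the tipping-point idea could be prototyped as a pure heuristic over measured (queue depth, bandwidth) pairs. The sketch below only illustrates the detection Vivek describes, it is not code from cfq, and the gain cutoff is an arbitrary assumption.]

```c
#include <stddef.h>

/* Given bandwidth samples measured at increasing queue depths, return
 * the smallest depth after which raising the depth no longer improves
 * aggregate bandwidth by at least min_gain_pct percent. A single
 * spindle would typically plateau at depth 1-2; a striped array later. */
unsigned int optimal_depth(const unsigned int *depths,
			   const double *bw, size_t n,
			   double min_gain_pct)
{
	size_t i;
	for (i = 1; i < n; i++) {
		if (bw[i] < bw[i - 1] * (1.0 + min_gain_pct / 100.0))
			return depths[i - 1];
	}
	return depths[n - 1];
}
```

Note how sensitive the answer is to the cutoff: fed the deadline numbers from later in this thread (271, 385, 386, 385, 384, 356, 257 MB/s at depths 1-64), requiring a 5% gain reports depth 2 (Corrado's reading of the numbers), while tolerating small drops (min_gain_pct = -5) reports 16 (Vivek's). That disagreement is exactly the calibration problem being discussed.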
* Re: [PATCH] cfq-iosched: rework seeky detection 2010-01-12 23:17 ` Corrado Zoccolo 2010-01-13 8:05 ` Corrado Zoccolo @ 2010-01-13 20:10 ` Vivek Goyal [not found] ` <4e5e476b1001131324t148d195cp7ad92e7edf8325fb@mail.gmail.com> 1 sibling, 1 reply; 21+ messages in thread From: Vivek Goyal @ 2010-01-13 20:10 UTC (permalink / raw) To: Corrado Zoccolo Cc: Jens Axboe, Linux-Kernel, Jeff Moyer, Shaohua Li, Gui Jianfeng, Yanmin Zhang On Wed, Jan 13, 2010 at 12:17:16AM +0100, Corrado Zoccolo wrote: > On Tue, Jan 12, 2010 at 11:36 PM, Vivek Goyal <vgoyal@redhat.com> wrote: > > On Tue, Jan 12, 2010 at 09:05:29PM +0100, Corrado Zoccolo wrote: > >> Hi Vivek, > >> On Tue, Jan 12, 2010 at 8:12 PM, Vivek Goyal <vgoyal@redhat.com> wrote: > >> > On Sat, Jan 09, 2010 at 04:59:17PM +0100, Corrado Zoccolo wrote: > >> >> Current seeky detection is based on average seek length. > >> >> This is suboptimal, since the average will not distinguish between: > >> >> * a process doing medium-sized seeks > >> >> * a process doing some sequential requests interleaved with larger seeks > >> >> and even a medium seek can take a lot of time, if the requested sector > >> >> happens to be behind the disk head in the rotation (50% probability). > >> >> > >> >> Therefore, we change the seeky queue detection to work as follows: > >> >> * each request can be classified as sequential if it is very close to > >> >> the current head position, i.e. it is likely in the disk cache (disks > >> >> usually read more data than requested, and put it in cache for > >> >> subsequent reads). Otherwise, the request is classified as seeky. > >> >> * a history window of the last 32 requests is kept, storing the > >> >> classification result. > >> >> * A queue is marked as seeky if more than 1/8 of the last 32 requests > >> >> were seeky. > >> >> > >> >> This patch fixes a regression reported by Yanmin, on mmap 64k random > >> >> reads. > >> >> > >> > > >> > Ok, I did basic testing of this patch on my hardware.
I got a RAID-0 > >> > configuration and there are 12 disks behind it. I ran 8 fio mmap random > >> > read processes with block size 64K and the following are the results. > >> > > >> > Vanilla (3 runs) > >> > =============== > >> > aggrb=3,564KB/s (cfq) > >> > aggrb=3,600KB/s (cfq) > >> > aggrb=3,607KB/s (cfq) > >> > > >> > aggrb=3,992KB/s (deadline) > >> > aggrb=3,953KB/s (deadline) > >> > aggrb=3,991KB/s (deadline) > >> > > >> > Patched kernel (3 runs) > >> > ======================= > >> > aggrb=2,080KB/s (cfq) > >> > aggrb=2,100KB/s (cfq) > >> > aggrb=2,124KB/s (cfq) > >> > > >> > My fio script > >> > ============= > >> > [global] > >> > directory=/mnt/sda/fio/ > >> > size=8G > >> > direct=0 > >> > runtime=30 > >> > ioscheduler=cfq > >> > exec_prerun="echo 3 > /proc/sys/vm/drop_caches" > >> > group_reporting=1 > >> > ioengine=mmap > >> > rw=randread > >> > bs=64K > >> > > >> > [randread] > >> > numjobs=8 > >> > ================================= > >> > > >> > There seems to be more than a 45% regression in this case. > >> > > >> > I have not run the blktrace, but I suspect it must be coming from the fact > >> > that we are now treating a random queue as sync-idle hence driving queue > >> > depth as 1. But the fact is that readahead might not be kicking in, so > >> > we are not using the parallel processing power this striped set of > >> > disks can give us. > >> > > >> Yes. Those results are expected, and are the other side of the coin. > >> If we handle those queues as sync-idle, we get better performance > >> on single disk (and regression on RAIDs), and vice versa if we handle > >> them as sync-noidle. > >> > >> Note that this is limited to mmap with large block size. Normal read/pread > >> is not affected. > >> > >> > So treating this kind of cfqq as sync-idle seems to be a bad idea at least > >> > on configurations where multiple disks are in a RAID configuration.
> >> > >> The fact is, can we reliably determine which of those two setups we > >> have from cfq? > > > > I have no idea at this point of time but it looks like determining this > > will help. > > > > Maybe something like keeping track of the number of processes on the "sync-noidle" > > tree and the average read times when the sync-noidle tree is being served. Over a > > period of time we need to monitor what's the number of processes > > (threshold) after which the average read time goes up. For sync-noidle we can > > then drive "queue_depth=nr_threshold" and once queue depth reaches that, > > then idle on the process. So for a single spindle, I guess the tipping point > > will be 2 processes and we can idle on the sync-noidle process. For more > > spindles, the tipping point will be higher. > > > > These are just some random thoughts. > It seems reasonable. > Something similar to what we do to reduce depth for async writes. > Can you see if you get similar BW improvements also for parallel > sequential direct I/Os with block size < stripe size? > Hi Corrado, I have run some more tests. For direct sequential I/Os I do not see BW improvements as I increase the number of processes. This is kind of expected, as these are a sync-idle workload and we will continue to drive queue depth as 1. I do see that as the number of processes increases, BW goes down. Not sure why. Maybe some readahead data in hardware gets trashed? vanilla (1,2,4,8,16,32,64 processes, direct=1, seq, size=4G, bs=64K) ========= cfq --- aggrb=279MB/s, aggrb=277MB/s, aggrb=276MB/s, aggrb=263MB/s, aggrb=262MB/s, aggrb=214MB/s, aggrb=99MB/s, Especially look at the BW drop when numjobs=64. deadline's numbers look a lot better.
deadline ------------ aggrb=271MB/s, aggrb=385MB/s, aggrb=386MB/s, aggrb=385MB/s, aggrb=384MB/s, aggrb=356MB/s, aggrb=257MB/s, The above numbers can almost be met with cfq if slice_idle=0 cfq (slice_idle=0) ------------------ aggrb=278MB/s, aggrb=390MB/s, aggrb=384MB/s, aggrb=386MB/s, aggrb=383MB/s, aggrb=350MB/s, aggrb=261MB/s, > > > >> Until we can, we should optimize for the most common case. > > > > Hard to say what's the common case? Single rotational disks or enterprise > > storage with multiple disks behind RAID cards. > > I think the pattern produced by mmap 64k is uncommon for reading data, while > it is common for binaries. And binaries, even in enterprise machines, > are usually > not put on the large RAIDs. > Not large RAIDs, but root can very well be on a small RAID (3-4 disks). > > > >> > >> Also, does the performance drop when the number of processes > >> approaches 8*number of spindles? > > > > I think here the performance drop will be limited by queue depth. So once you > > have more than 32 processes driving queue depth 32, it should not matter > > how many processes you launch in parallel. > Yes. With 12 disks it is unlikely to reach the saturation point. > > > > I have collected some numbers for running 1,2,4,8,16,32 and 64 threads in > > parallel to see how throughput varies with the vanilla kernel and with your > > patch. > > > > Vanilla kernel > > ============== > > aggrb=2,771KB/s, > > aggrb=2,779KB/s, > > aggrb=3,084KB/s, > > aggrb=3,623KB/s, > > aggrb=3,847KB/s, > > aggrb=3,940KB/s, > > aggrb=4,216KB/s, > > > > Patched kernel > > ============== > > aggrb=2,778KB/s, > > aggrb=2,447KB/s, > > aggrb=2,240KB/s, > > aggrb=2,182KB/s, > > aggrb=2,082KB/s, > > aggrb=2,033KB/s, > > aggrb=1,672KB/s, > > > > With the vanilla kernel, output is on the rise as the number of threads doing IO > > increases, and with the patched kernel it is falling as the number of threads > > rises. This is not pretty. > This is strange. We force the depth to be 1, but the BW should be stable.
> What happens if you disable low_latency? > And can you compare it with 2.6.32? Disabling low_latency did not help much. cfq, low_latency=0 ------------------ aggrb=2,755KB/s, aggrb=2,374KB/s, aggrb=2,225KB/s, aggrb=2,174KB/s, aggrb=2,007KB/s, aggrb=1,904KB/s, aggrb=1,856KB/s, On a side note, I also did some tests with multiple buffered sequential read streams. Here are the results. Vanilla (buffered seq reads, size=4G, bs=64K, 1,2,4,8,16,32,64 processes) =================================================== cfq (low_latency=1) ------------------- aggrb=372MB/s, aggrb=326MB/s, aggrb=319MB/s, aggrb=272MB/s, aggrb=250MB/s, aggrb=200MB/s, aggrb=186MB/s, cfq (low_latency=0) ------------------ aggrb=370MB/s, aggrb=325MB/s, aggrb=330MB/s, aggrb=311MB/s, aggrb=206MB/s, aggrb=264MB/s, aggrb=157MB/s, cfq (slice_idle=0) ------------------ aggrb=372MB/s, aggrb=383MB/s, aggrb=387MB/s, aggrb=382MB/s, aggrb=378MB/s, aggrb=372MB/s, aggrb=230MB/s, deadline -------- aggrb=380MB/s, aggrb=381MB/s, aggrb=386MB/s, aggrb=383MB/s, aggrb=382MB/s, aggrb=370MB/s, aggrb=234MB/s, Notes (for this workload on this hardware): - It is hard to beat deadline. cfq with slice_idle=0 is almost there. - low_latency=0 is not significantly better than low_latency=1. - Driving queue depth 1 hurts on large RAIDs even for buffered sequential reads. Readahead can only help this much. It does not fully compensate for the fact that there are more spindles, and we can get more out of the array if we drive deeper queue depths. This is one data point for the discussion we were having in another thread where I was suspecting that driving a shallower queue depth might hurt on large arrays even with readahead. Thanks Vivek ^ permalink raw reply [flat|nested] 21+ messages in thread
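[Editor's note: for reference, the detection being benchmarked above is the one from the patch header: a 32-bit history of per-request classifications replacing the old seek-mean average. A user-space sketch of that logic follows, with __builtin_popcount standing in for the kernel's hweight32; the closeness threshold is an assumption here, the patch's actual constant may differ.]

```c
#include <stdint.h>

/* Distance (in sectors) below which a request counts as "very close to
 * the current head position", i.e. likely in the disk cache. The exact
 * value is an assumption, not taken verbatim from the patch. */
#define SEEK_THRESHOLD 1024u

/* Record one request in the 32-slot classification history:
 * shift in 1 if it was seeky, 0 if sequential. */
static inline void seek_history_add(uint32_t *history, uint64_t seek_distance)
{
	*history = (*history << 1) | (seek_distance > SEEK_THRESHOLD);
}

/* A queue is seeky if more than 1/8 of the last 32 requests were seeky,
 * i.e. more than 4 set bits in the history word. */
static inline int queue_is_seeky(uint32_t history)
{
	return __builtin_popcount(history) > 32 / 8;
}
```

This is why the patch trades the `seek_samples`/`seek_total`/`seek_mean` trio for a single `u32 seek_history`: the classification becomes a shift and a population count, and a few isolated large seeks no longer poison a mostly sequential queue.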
[parent not found: <4e5e476b1001131324t148d195cp7ad92e7edf8325fb@mail.gmail.com>]
* Re: [PATCH] cfq-iosched: rework seeky detection [not found] ` <4e5e476b1001131324t148d195cp7ad92e7edf8325fb@mail.gmail.com> @ 2010-01-13 22:21 ` Vivek Goyal 0 siblings, 0 replies; 21+ messages in thread From: Vivek Goyal @ 2010-01-13 22:21 UTC (permalink / raw) To: Corrado Zoccolo Cc: Jens Axboe, Linux-Kernel, Jeff Moyer, Shaohua Li, Gui Jianfeng, Yanmin Zhang On Wed, Jan 13, 2010 at 10:24:14PM +0100, Corrado Zoccolo wrote: > Hi Vivek, > On Wed, Jan 13, 2010 at 9:10 PM, Vivek Goyal <vgoyal@redhat.com> wrote: > > > Hi Corrado, > > > > I have run some more tests. For direct sequential I/Os I do not see BW > > improvements as I increase the number of processes. This is kind of expected, > > as these are a sync-idle workload and we will continue to drive queue depth > > as 1. > > > Ok. But the deadline numbers tell us what we could achieve if, for example, > we decided that queues issuing too small requests are marked as noidle. > Upon successful detection of a RAID, yes. Or if we do it irrespective of the underlying storage, then you will gain on RAIDs but can lose on a single spindle. We might see excessive seeks (direct sequential IO). > > > I do see that as the number of processes increases, BW goes down. Not > > sure why. Maybe some readahead data in hardware gets trashed? > > > > Likely, when cfq switches from one process to another, the disk's cache > still contains some useful data. If there are too many reads before the same > process comes back, the cache will be flushed, and you have to re-read that > data again. > This is why we should have a certain number of active queues, instead of > cycling all the queues. But then you will be starving non-active queues or will increase max latency. I guess a bigger slice for each queue should achieve a similar effect, but I did not see significant gains with low_latency=0.
> > vanilla (1,2,4,8,16,32,64 processses, direct=1, seq, size=4G, bs=64K) > > ========= > > cfq > > --- > > aggrb=279MB/s, > > aggrb=277MB/s, > > aggrb=276MB/s, > > aggrb=263MB/s, > > aggrb=262MB/s, > > aggrb=214MB/s, > > aggrb=99MB/s, > > > > Especially look at BW drop when numjobs=64. > > > > deadline's numbers look a lot better. > > > > deadline > > ------------ > > aggrb=271MB/s, > > aggrb=385MB/s, > > aggrb=386MB/s, > > aggrb=385MB/s, > > aggrb=384MB/s, > > aggrb=356MB/s, > > aggrb=257MB/s, > > > > > This shows that the optimal queue depth is around 2-4. I think much more than that. In case of deadline we saw performance drop in 32 processes. Till 16 processes it was just fine. So I would say 16 is the optimal queue depth in this case. This is further verified by cfq numbers with slice_idle=0 below. > > > > Above numbers can almost be met if slice_idle=0 with cfq > > > > cfq (slice_idle=0) > > ------------------ > > aggrb=278MB/s, > > aggrb=390MB/s, > > aggrb=384MB/s, > > aggrb=386MB/s, > > aggrb=383MB/s, > > aggrb=350MB/s, > > aggrb=261MB/s, > > > > > > > > > > >> Until we can, we should optimize for the most common case. > > > > > > > > Hard to say what's the common case? Single rotational disks or > > enterprise > > > > storage with multiple disks behind RAID cards. > > > > > > I think the pattern produced by mmap 64k is uncommon for reading data, > > while > > > it is common for binaries. And binaries, even in enterprise machines, > > > are usually > > > not put on the large raids. > > > > > > > Not large raids but root can very well be on small RAID (3-4 disks). > > > On 3-4 disks, probably the optimal queue depth is 1. > > > > > > > > >> > > > >> Also, does the performance drop when the number of processes > > > >> approaches 8*number of spindles? > > > > > > > > I think here performance drop will be limited by queue depth. 
So once > > you > > > > have more than 32 processes driving queue depth 32, it should not > > matter > > > > how many processes you launch in parallel. > > > Yes. With 12 disks it is unlikely to reach the saturation point. > > > > > > > > I have collected some numbers for running 1,2,4,8,32 and 64 threads in > > > > parallel and see how throughput varies with vanilla kernel and with > > your > > > > patch. > > > > > > > > Vanilla kernel > > > > ============== > > > > aggrb=2,771KB/s, > > > > aggrb=2,779KB/s, > > > > aggrb=3,084KB/s, > > > > aggrb=3,623KB/s, > > > > aggrb=3,847KB/s, > > > > aggrb=3,940KB/s, > > > > aggrb=4,216KB/s, > > > > > > > > Patched kernel > > > > ============== > > > > aggrb=2,778KB/s, > > > > aggrb=2,447KB/s, > > > > aggrb=2,240KB/s, > > > > aggrb=2,182KB/s, > > > > aggrb=2,082KB/s, > > > > aggrb=2,033KB/s, > > > > aggrb=1,672KB/s, > > > > > > > > With vanilla kernel, output is on the rise as number of threads doing > > IO > > > > incrases and with patched kernel it is falling as number of threads > > > > rise. This is not pretty. > > > This is strange. we force the depth to be 1, but the BW should be stable. > > > What happens if you disable low_latency? > > > And can you compare it with 2.6.32? > > > > Disabling low_latency did not help much. > > > > > cfq, low_latency=0 > > ------------------ > > aggrb=2,755KB/s, > > aggrb=2,374KB/s, > > aggrb=2,225KB/s, > > aggrb=2,174KB/s, > > aggrb=2,007KB/s, > > aggrb=1,904KB/s, > > aggrb=1,856KB/s, > > > > Looking at those numbers, it seems that an average seek costs you 23ms. Is > this possible? > (64kb / 2770 kb/s = 23ms) > Maybe the firmware of your RAID card implements idling internally? > I have no idea about firmware implementation. I ran mmap, bs=64K testcase with deadline also on same hardware. deadline ======== aggrb=2,754KB/s, aggrb=3,212KB/s, aggrb=3,838KB/s, aggrb=4,017KB/s, aggrb=3,974KB/s, aggrb=4,628KB/s, aggrb=6,124KB/s, Look with even 64 processes, BW is on the rise. 
So I guess there is no idling in firmware otherwise we would have seen BW kind of stabilized. But you never know. It is also baffling that I am not getting the same result with CFQ. Currently CFQ will mark mmap queues as sync-noidle and then we should be driving queue depth 32 like deadline and should have gotten the same numbers. But that does not seem to be happening. CFQ is a bit behind, especially in the case of 64 processes. > On a side note, I also did some tests with multiple buffered sequential > > read streams. Here are the results. > > > > Vanilla (buffered seq reads, size=4G, bs=64K, 1,2,4,8,16,32,64 processes) > > =================================================== > > cfq (low_latency=1) > > ------------------- > > aggrb=372MB/s, > > aggrb=326MB/s, > > aggrb=319MB/s, > > aggrb=272MB/s, > > aggrb=250MB/s, > > aggrb=200MB/s, > > aggrb=186MB/s, > > > > cfq (low_latency=0) > > ------------------ > > aggrb=370MB/s, > > aggrb=325MB/s, > > aggrb=330MB/s, > > aggrb=311MB/s, > > aggrb=206MB/s, > > aggrb=264MB/s, > > aggrb=157MB/s, > > > > cfq (slice_idle=0) > > ------------------ > > aggrb=372MB/s, > > aggrb=383MB/s, > > aggrb=387MB/s, > > aggrb=382MB/s, > > aggrb=378MB/s, > > aggrb=372MB/s, > > aggrb=230MB/s, > > > > deadline > > -------- > > aggrb=380MB/s, > > aggrb=381MB/s, > > aggrb=386MB/s, > > aggrb=383MB/s, > > aggrb=382MB/s, > > aggrb=370MB/s, > > aggrb=234MB/s, > > > > Notes (for this workload on this hardware): > > > > - It is hard to beat deadline. cfq with slice_idle=0 is almost there. > > > > When slice_idle = 0, cfq works much more like noop (just with better control > of write depth). > > - low_latency=0 is not significantly better than low_latency=1. > > > > Good. At least it doesn't introduce a regression in those workloads. > > > > - Driving queue depth 1 hurts on large RAIDs even for buffered sequential > > reads. Readahead can only help this much.
It does not fully compensate for > > the fact that there are more spindles, and we can get more out of the array if > > we drive deeper queue depths. > > > > This is one data point for the discussion we were having in another thread > > where I was suspecting that driving a shallower queue depth might hurt on > > large arrays even with readahead. > > > > Well, it is just a 2% improvement (look at the deadline numbers, 1:380, > best:386), so I think readahead is enough for the buffered case. But in cfq, as the number of processes increases, throughput drops. That does not happen with deadline. So driving deeper queue depths on RAIDs is good for throughput. What amuses me is that with 16 processes, deadline is still clocking 382MB/s and cfq is 250MB/s. Why this difference of 130MB/s? Can you think of anything else apart from shallower queue depths in cfq? Even low_latency=0 did not help. In fact for 16 processes throughput dropped to 206MB/s. So somehow giving bigger time slices did not help. > In case of small requests, though, we are paying too much. > I think that simply marking queues with too small requests as no-idle should > be a win here (when we can identify RAIDs reliably). > > Thanks, > Corrado > > Thanks > > Vivek ^ permalink raw reply [flat|nested] 21+ messages in thread
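[Editor's note: Corrado's closing proposal — mark queues issuing requests smaller than the stripe size as no-idle, once a RAID has been reliably identified — would reduce to a predicate like the following sketch. Both the stripe size and the detection flag are placeholders; cfq has no way to query either today.]

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder stripe size; on a real array this would come from the
 * RAID configuration, which the scheduler cannot currently query. */
#define ASSUMED_STRIPE_BYTES (64u * 1024u)

/* A request spanning less than one stripe hits a single spindle, so
 * idling for the next request from the same queue leaves the other
 * spindles unused; mark such queues no-idle, but only once a
 * multi-spindle backend has been detected. */
static inline bool mark_queue_noidle(uint32_t request_bytes,
				     bool backend_is_raid)
{
	return backend_is_raid && request_bytes < ASSUMED_STRIPE_BYTES;
}
```

On a single spindle the predicate never fires, preserving the idling that this very thread showed is a win there.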
Thread overview: 21+ messages
2010-01-09 15:59 [PATCH] cfq-iosched: rework seeky detection Corrado Zoccolo
2010-01-11 1:47 ` Shaohua Li
2010-01-11 2:53 ` Gui Jianfeng
2010-01-11 14:20 ` Jeff Moyer
2010-01-11 14:46 ` Corrado Zoccolo
2010-01-12 1:49 ` Shaohua Li
2010-01-12 8:52 ` Corrado Zoccolo
2010-01-13 3:45 ` Shaohua Li
2010-01-13 7:09 ` Corrado Zoccolo
2010-01-13 8:00 ` Shaohua Li
2010-01-13 8:09 ` Corrado Zoccolo
2010-01-11 16:29 ` Vivek Goyal
2010-01-11 16:52 ` Corrado Zoccolo
2010-01-12 19:12 ` Vivek Goyal
2010-01-12 20:05 ` Corrado Zoccolo
2010-01-12 22:36 ` Vivek Goyal
2010-01-12 23:17 ` Corrado Zoccolo
2010-01-13 8:05 ` Corrado Zoccolo
2010-01-13 20:19 ` Vivek Goyal
2010-01-13 20:10 ` Vivek Goyal
[not found] ` <4e5e476b1001131324t148d195cp7ad92e7edf8325fb@mail.gmail.com>
2010-01-13 22:21 ` Vivek Goyal