* [PATCH 1/2] cfq-iosched: rework seeky detection
       [not found] <1267296340-3820-1-git-send-email-czoccolo@gmail.com>
@ 2010-02-27 18:45 ` Corrado Zoccolo
  2010-02-27 18:45   ` [PATCH 2/2] cfq-iosched: rethink seeky detection for SSDs Corrado Zoccolo
  2010-02-28 18:41   ` [RFC, PATCH 0/2] Reworking seeky detection for 2.6.34 Jens Axboe
  2010-03-01 16:35   ` Vivek Goyal
  2 siblings, 1 reply; 16+ messages in thread
From: Corrado Zoccolo @ 2010-02-27 18:45 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linux-Kernel, Jeff Moyer, Vivek Goyal, Shaohua Li, Gui Jianfeng,
	Corrado Zoccolo

Current seeky detection is based on average seek length. This is
suboptimal, since the average will not distinguish between:
* a process doing medium sized seeks
* a process doing some sequential requests interleaved with larger seeks
and even a medium seek can take a lot of time, if the requested sector
happens to be behind the disk head in the rotation (50% probability).

Therefore, we change the seeky queue detection to work as follows:
* each request can be classified as sequential if it is very close to
  the current head position, i.e. it is likely in the disk cache (disks
  usually read more data than requested, and put it in cache for
  subsequent reads). Otherwise, the request is classified as seeky.
* a history window of the last 32 requests is kept, storing the
  classification result.
* a queue is marked as seeky if more than 1/8 of the last 32 requests
  were seeky.

This patch fixes a regression reported by Yanmin, on mmap 64k random
reads.
Reported-by: Yanmin Zhang <yanmin_zhang@linux.intel.com>
Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com>
---
 block/cfq-iosched.c |   54 +++++++++++++-------------------------------------
 1 files changed, 14 insertions(+), 40 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index eed649c..806d30b 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -46,8 +46,8 @@ static const int cfq_hist_divisor = 4;
 #define CFQ_HW_QUEUE_MIN	(5)
 #define CFQ_SERVICE_SHIFT       12

-#define CFQQ_SEEK_THR		8 * 1024
-#define CFQQ_SEEKY(cfqq)	((cfqq)->seek_mean > CFQQ_SEEK_THR)
+#define CFQQ_SEEK_THR		(sector_t)(8 * 100)
+#define CFQQ_SEEKY(cfqq)	(hweight32(cfqq->seek_history) > 32/8)

 #define RQ_CIC(rq)		\
	((struct cfq_io_context *) (rq)->elevator_private)
@@ -132,9 +132,7 @@ struct cfq_queue {

	pid_t pid;

-	unsigned int seek_samples;
-	u64 seek_total;
-	sector_t seek_mean;
+	u32 seek_history;
	sector_t last_request_pos;

	struct cfq_rb_root *service_tree;
@@ -1662,16 +1660,7 @@ static inline sector_t cfq_dist_from_last(struct cfq_data *cfqd,
 static inline int cfq_rq_close(struct cfq_data *cfqd, struct cfq_queue *cfqq,
			       struct request *rq, bool for_preempt)
 {
-	sector_t sdist = cfqq->seek_mean;
-
-	if (!sample_valid(cfqq->seek_samples))
-		sdist = CFQQ_SEEK_THR;
-
-	/* if seek_mean is big, using it as close criteria is meaningless */
-	if (sdist > CFQQ_SEEK_THR && !for_preempt)
-		sdist = CFQQ_SEEK_THR;
-
-	return cfq_dist_from_last(cfqd, rq) <= sdist;
+	return cfq_dist_from_last(cfqd, rq) <= CFQQ_SEEK_THR;
 }

 static struct cfq_queue *cfqq_close(struct cfq_data *cfqd,
@@ -2968,30 +2957,16 @@ static void
 cfq_update_io_seektime(struct cfq_data *cfqd, struct cfq_queue *cfqq,
		       struct request *rq)
 {
-	sector_t sdist;
-	u64 total;
-
-	if (!cfqq->last_request_pos)
-		sdist = 0;
-	else if (cfqq->last_request_pos < blk_rq_pos(rq))
-		sdist = blk_rq_pos(rq) - cfqq->last_request_pos;
-	else
-		sdist = cfqq->last_request_pos - blk_rq_pos(rq);
-
-	/*
-	 * Don't allow the seek distance to get too large from the
-	 * odd fragment, pagein, etc
-	 */
-	if (cfqq->seek_samples <= 60)	/* second&third seek */
-		sdist = min(sdist, (cfqq->seek_mean * 4) + 2*1024*1024);
-	else
-		sdist = min(sdist, (cfqq->seek_mean * 4) + 2*1024*64);
+	sector_t sdist = 0;
+	if (cfqq->last_request_pos) {
+		if (cfqq->last_request_pos < blk_rq_pos(rq))
+			sdist = blk_rq_pos(rq) - cfqq->last_request_pos;
+		else
+			sdist = cfqq->last_request_pos - blk_rq_pos(rq);
+	}

-	cfqq->seek_samples = (7*cfqq->seek_samples + 256) / 8;
-	cfqq->seek_total = (7*cfqq->seek_total + (u64)256*sdist) / 8;
-	total = cfqq->seek_total + (cfqq->seek_samples/2);
-	do_div(total, cfqq->seek_samples);
-	cfqq->seek_mean = (sector_t)total;
+	cfqq->seek_history <<= 1;
+	cfqq->seek_history |= (sdist > CFQQ_SEEK_THR);
 }

 /*
@@ -3016,8 +2991,7 @@ cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
		cfq_mark_cfqq_deep(cfqq);

	if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
-	    (!cfq_cfqq_deep(cfqq) && sample_valid(cfqq->seek_samples)
-	     && CFQQ_SEEKY(cfqq)))
+	    (!cfq_cfqq_deep(cfqq) && CFQQ_SEEKY(cfqq)))
		enable_idle = 0;
	else if (sample_valid(cic->ttime_samples)) {
		if (cic->ttime_mean > cfqd->cfq_slice_idle)
--
1.6.4.4

^ permalink raw reply related	[flat|nested] 16+ messages in thread
* [PATCH 2/2] cfq-iosched: rethink seeky detection for SSDs
  2010-02-27 18:45 ` [PATCH 1/2] cfq-iosched: rework seeky detection Corrado Zoccolo
@ 2010-02-27 18:45   ` Corrado Zoccolo
  2010-03-01 14:25     ` Vivek Goyal
  0 siblings, 1 reply; 16+ messages in thread
From: Corrado Zoccolo @ 2010-02-27 18:45 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linux-Kernel, Jeff Moyer, Vivek Goyal, Shaohua Li, Gui Jianfeng,
	Corrado Zoccolo

CFQ currently applies the same logic of detecting seeky queues and
grouping them together for rotational disks as well as SSDs.
For SSDs, the time to complete a request doesn't depend on the
request location, but only on the size.
This patch therefore changes the criterion to group queues by
request size in case of SSDs, in order to achieve better fairness.

Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com>
---
 block/cfq-iosched.c |    7 ++++++-
 1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 806d30b..f27e535 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -47,6 +47,7 @@ static const int cfq_hist_divisor = 4;
 #define CFQ_SERVICE_SHIFT       12

 #define CFQQ_SEEK_THR		(sector_t)(8 * 100)
+#define CFQQ_SECT_THR_NONROT	(sector_t)(2 * 32)
 #define CFQQ_SEEKY(cfqq)	(hweight32(cfqq->seek_history) > 32/8)

 #define RQ_CIC(rq)		\
@@ -2958,6 +2959,7 @@ cfq_update_io_seektime(struct cfq_data *cfqd, struct cfq_queue *cfqq,
		       struct request *rq)
 {
	sector_t sdist = 0;
+	sector_t n_sec = blk_rq_sectors(rq);
	if (cfqq->last_request_pos) {
		if (cfqq->last_request_pos < blk_rq_pos(rq))
			sdist = blk_rq_pos(rq) - cfqq->last_request_pos;
@@ -2966,7 +2968,10 @@ cfq_update_io_seektime(struct cfq_data *cfqd, struct cfq_queue *cfqq,
	}

	cfqq->seek_history <<= 1;
-	cfqq->seek_history |= (sdist > CFQQ_SEEK_THR);
+	if (blk_queue_nonrot(cfqd->queue))
+		cfqq->seek_history |= (n_sec < CFQQ_SECT_THR_NONROT);
+	else
+		cfqq->seek_history |= (sdist > CFQQ_SEEK_THR);
 }

 /*
--
1.6.4.4

^ permalink raw reply related	[flat|nested] 16+ messages in thread
* Re: [PATCH 2/2] cfq-iosched: rethink seeky detection for SSDs
  2010-02-27 18:45 ` [PATCH 2/2] cfq-iosched: rethink seeky detection for SSDs Corrado Zoccolo
@ 2010-03-01 14:25   ` Vivek Goyal
  2010-03-03 19:47     ` Corrado Zoccolo
  0 siblings, 1 reply; 16+ messages in thread
From: Vivek Goyal @ 2010-03-01 14:25 UTC (permalink / raw)
  To: Corrado Zoccolo
  Cc: Jens Axboe, Linux-Kernel, Jeff Moyer, Shaohua Li, Gui Jianfeng

On Sat, Feb 27, 2010 at 07:45:40PM +0100, Corrado Zoccolo wrote:
> CFQ currently applies the same logic of detecting seeky queues and
> grouping them together for rotational disks as well as SSDs.
> For SSDs, the time to complete a request doesn't depend on the
> request location, but only on the size.
> This patch therefore changes the criterion to group queues by
> request size in case of SSDs, in order to achieve better fairness.

Hi Corrado,

Can you give some numbers regarding how you are measuring fairness and
how you decided that we achieve better fairness?

In case of SSDs with NCQ, we will not idle on any of the queues (either
sync or sync-noidle (seeky queues)). So w.r.t. the code, what behavior
changes if we mark a queue as seeky/non-seeky on an SSD?

IOW, looking at this patch, any queue doing IO in chunks smaller than
32K on an SSD will now be marked as seeky. How does that change the
behavior in terms of fairness for the queue?
Thanks Vivek > > Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com> > --- > block/cfq-iosched.c | 7 ++++++- > 1 files changed, 6 insertions(+), 1 deletions(-) > > diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c > index 806d30b..f27e535 100644 > --- a/block/cfq-iosched.c > +++ b/block/cfq-iosched.c > @@ -47,6 +47,7 @@ static const int cfq_hist_divisor = 4; > #define CFQ_SERVICE_SHIFT 12 > > #define CFQQ_SEEK_THR (sector_t)(8 * 100) > +#define CFQQ_SECT_THR_NONROT (sector_t)(2 * 32) > #define CFQQ_SEEKY(cfqq) (hweight32(cfqq->seek_history) > 32/8) > > #define RQ_CIC(rq) \ > @@ -2958,6 +2959,7 @@ cfq_update_io_seektime(struct cfq_data *cfqd, struct cfq_queue *cfqq, > struct request *rq) > { > sector_t sdist = 0; > + sector_t n_sec = blk_rq_sectors(rq); > if (cfqq->last_request_pos) { > if (cfqq->last_request_pos < blk_rq_pos(rq)) > sdist = blk_rq_pos(rq) - cfqq->last_request_pos; > @@ -2966,7 +2968,10 @@ cfq_update_io_seektime(struct cfq_data *cfqd, struct cfq_queue *cfqq, > } > > cfqq->seek_history <<= 1; > - cfqq->seek_history |= (sdist > CFQQ_SEEK_THR); > + if (blk_queue_nonrot(cfqd->queue)) > + cfqq->seek_history |= (n_sec < CFQQ_SECT_THR_NONROT); > + else > + cfqq->seek_history |= (sdist > CFQQ_SEEK_THR); > } > > /* > -- > 1.6.4.4 ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 2/2] cfq-iosched: rethink seeky detection for SSDs
  2010-03-01 14:25 ` Vivek Goyal
@ 2010-03-03 19:47   ` Corrado Zoccolo
  2010-03-03 21:21     ` Vivek Goyal
  2010-03-03 23:28     ` Vivek Goyal
  0 siblings, 2 replies; 16+ messages in thread
From: Corrado Zoccolo @ 2010-03-03 19:47 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jens Axboe, Linux-Kernel, Jeff Moyer, Shaohua Li, Gui Jianfeng

[-- Attachment #1: Type: text/plain, Size: 5579 bytes --]

Hi Vivek,

On Mon, Mar 1, 2010 at 3:25 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Sat, Feb 27, 2010 at 07:45:40PM +0100, Corrado Zoccolo wrote:
>> CFQ currently applies the same logic of detecting seeky queues and
>> grouping them together for rotational disks as well as SSDs.
>> For SSDs, the time to complete a request doesn't depend on the
>> request location, but only on the size.
>> This patch therefore changes the criterion to group queues by
>> request size in case of SSDs, in order to achieve better fairness.
>
> Hi Corrado,
>
> Can you give some numbers regarding how are you measuring fairness and
> how did you decide that we achieve better fairness?
>
Please, see the attached fio script. It benchmarks pairs of processes
performing direct random I/O.
One is always fixed at bs=4k, while I vary the other from 8K to 64K:

test00: (g=0): rw=randread, bs=8K-8K/8K-8K, ioengine=sync, iodepth=1
test01: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1
test10: (g=1): rw=randread, bs=16K-16K/16K-16K, ioengine=sync, iodepth=1
test11: (g=1): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1
test20: (g=2): rw=randread, bs=32K-32K/32K-32K, ioengine=sync, iodepth=1
test21: (g=2): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1
test30: (g=3): rw=randread, bs=64K-64K/64K-64K, ioengine=sync, iodepth=1
test31: (g=3): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1

With unpatched cfq (2.6.33), on a flash card (non-NCQ), after running
a fio script with a high number of parallel readers to make sure NCQ
detection is stabilized, I get the following:

Run status group 0 (all jobs):
   READ: io=21528KiB, aggrb=4406KiB/s, minb=1485KiB/s, maxb=2922KiB/s, mint=5001msec, maxt=5003msec

Run status group 1 (all jobs):
   READ: io=31524KiB, aggrb=6452KiB/s, minb=1327KiB/s, maxb=5126KiB/s, mint=5002msec, maxt=5003msec

Run status group 2 (all jobs):
   READ: io=46544KiB, aggrb=9524KiB/s, minb=1031KiB/s, maxb=8493KiB/s, mint=5001msec, maxt=5004msec

Run status group 3 (all jobs):
   READ: io=64712KiB, aggrb=13242KiB/s, minb=761KiB/s, maxb=12486KiB/s, mint=5002msec, maxt=5004msec

As you can see from minb, the process with the smallest I/O size is
penalized (since both are marked as noidle, they both end up in the
noidle tree, where they are serviced round robin, so they get fairness
in terms of IOPS, but bandwidth varies a lot).

With my patches in place, I get:

Run status group 0 (all jobs):
   READ: io=21544KiB, aggrb=4409KiB/s, minb=1511KiB/s, maxb=2898KiB/s, mint=5002msec, maxt=5003msec

Run status group 1 (all jobs):
   READ: io=32000KiB, aggrb=6549KiB/s, minb=1277KiB/s, maxb=5274KiB/s, mint=5001msec, maxt=5003msec

Run status group 2 (all jobs):
   READ: io=39444KiB, aggrb=8073KiB/s, minb=1576KiB/s, maxb=6498KiB/s, mint=5002msec, maxt=5003msec

Run status group 3 (all jobs):
   READ: io=49180KiB, aggrb=10059KiB/s, minb=1512KiB/s, maxb=8548KiB/s, mint=5001msec, maxt=5006msec

The process doing smaller requests is now not penalized by running
concurrently with the other one, and the other still benefits from
larger requests because it makes better use of its time slice.

> In case of SSDs with NCQ, we will not idle on any of the queues (either
> sync or sync-noidle (seeky queues)). So w.r.t code, what behavior changes
> if we mark a queue as seeky/non-seeky on SSD?
>
I've not tested on an NCQ SSD, but I think at worst it will not harm,
and at best, it will provide similar fairness improvements when the
queue of processes submitting requests grows above the available NCQ
slots.

> IOW, looking at this patch, now any queue doing IO in smaller chunks than
> 32K on SSD will be marked as seeky. How does that change the behavior in
> terms of fairness for the queue?
>
Basically, we will have IOPS-based fairness for small requests, and
time-based fairness for larger requests.
Thanks, Corrado > Thanks > Vivek > >> >> Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com> >> --- >> block/cfq-iosched.c | 7 ++++++- >> 1 files changed, 6 insertions(+), 1 deletions(-) >> >> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c >> index 806d30b..f27e535 100644 >> --- a/block/cfq-iosched.c >> +++ b/block/cfq-iosched.c >> @@ -47,6 +47,7 @@ static const int cfq_hist_divisor = 4; >> #define CFQ_SERVICE_SHIFT 12 >> >> #define CFQQ_SEEK_THR (sector_t)(8 * 100) >> +#define CFQQ_SECT_THR_NONROT (sector_t)(2 * 32) >> #define CFQQ_SEEKY(cfqq) (hweight32(cfqq->seek_history) > 32/8) >> >> #define RQ_CIC(rq) \ >> @@ -2958,6 +2959,7 @@ cfq_update_io_seektime(struct cfq_data *cfqd, struct cfq_queue *cfqq, >> struct request *rq) >> { >> sector_t sdist = 0; >> + sector_t n_sec = blk_rq_sectors(rq); >> if (cfqq->last_request_pos) { >> if (cfqq->last_request_pos < blk_rq_pos(rq)) >> sdist = blk_rq_pos(rq) - cfqq->last_request_pos; >> @@ -2966,7 +2968,10 @@ cfq_update_io_seektime(struct cfq_data *cfqd, struct cfq_queue *cfqq, >> } >> >> cfqq->seek_history <<= 1; >> - cfqq->seek_history |= (sdist > CFQQ_SEEK_THR); >> + if (blk_queue_nonrot(cfqd->queue)) >> + cfqq->seek_history |= (n_sec < CFQQ_SECT_THR_NONROT); >> + else >> + cfqq->seek_history |= (sdist > CFQQ_SEEK_THR); >> } >> >> /* >> -- >> 1.6.4.4 > [-- Attachment #2: fair.fio --] [-- Type: application/octet-stream, Size: 483 bytes --] [global] size=1G ioengine=sync invalidate=1 runtime=5 time_based direct=1 [test00] rw=randread bs=8k filename=testfile1 [test01] rw=randread bs=4k filename=testfile2 [test10] rw=randread bs=16k filename=testfile1 stonewall [test11] rw=randread bs=4k filename=testfile2 [test20] rw=randread bs=32k filename=testfile1 stonewall [test21] rw=randread bs=4k filename=testfile2 [test30] rw=randread bs=64k filename=testfile1 stonewall [test31] rw=randread bs=4k filename=testfile2 ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 2/2] cfq-iosched: rethink seeky detection for SSDs 2010-03-03 19:47 ` Corrado Zoccolo @ 2010-03-03 21:21 ` Vivek Goyal 2010-03-03 23:28 ` Vivek Goyal 1 sibling, 0 replies; 16+ messages in thread From: Vivek Goyal @ 2010-03-03 21:21 UTC (permalink / raw) To: Corrado Zoccolo Cc: Jens Axboe, Linux-Kernel, Jeff Moyer, Shaohua Li, Gui Jianfeng On Wed, Mar 03, 2010 at 08:47:31PM +0100, Corrado Zoccolo wrote: > Hi Vivek, > On Mon, Mar 1, 2010 at 3:25 PM, Vivek Goyal <vgoyal@redhat.com> wrote: > > On Sat, Feb 27, 2010 at 07:45:40PM +0100, Corrado Zoccolo wrote: > >> CFQ currently applies the same logic of detecting seeky queues and > >> grouping them together for rotational disks as well as SSDs. > >> For SSDs, the time to complete a request doesn't depend on the > >> request location, but only on the size. > >> This patch therefore changes the criterion to group queues by > >> request size in case of SSDs, in order to achieve better fairness. > > > > Hi Corrado, > > > > Can you give some numbers regarding how are you measuring fairness and > > how did you decide that we achieve better fairness? > > > Please, see the attached fio script. It benchmarks pairs of processes > performing direct random I/O. 
> One is always fixed at bs=4k , while I vary the other from 8K to 64K > test00: (g=0): rw=randread, bs=8K-8K/8K-8K, ioengine=sync, iodepth=1 > test01: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1 > test10: (g=1): rw=randread, bs=16K-16K/16K-16K, ioengine=sync, iodepth=1 > test11: (g=1): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1 > test20: (g=2): rw=randread, bs=32K-32K/32K-32K, ioengine=sync, iodepth=1 > test21: (g=2): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1 > test30: (g=3): rw=randread, bs=64K-64K/64K-64K, ioengine=sync, iodepth=1 > test31: (g=3): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1 > > With unpatched cfq (2.6.33), on a flash card (non-ncq), after running > a fio script with high number of parallel readers to make sure ncq > detection is stabilized, I get the following: > Run status group 0 (all jobs): > READ: io=21528KiB, aggrb=4406KiB/s, minb=1485KiB/s, maxb=2922KiB/s, > mint=5001msec, maxt=5003msec > > Run status group 1 (all jobs): > READ: io=31524KiB, aggrb=6452KiB/s, minb=1327KiB/s, maxb=5126KiB/s, > mint=5002msec, maxt=5003msec > > Run status group 2 (all jobs): > READ: io=46544KiB, aggrb=9524KiB/s, minb=1031KiB/s, maxb=8493KiB/s, > mint=5001msec, maxt=5004msec > > Run status group 3 (all jobs): > READ: io=64712KiB, aggrb=13242KiB/s, minb=761KiB/s, > maxb=12486KiB/s, mint=5002msec, maxt=5004msec > > As you can see from minb, the process with smallest I/O size is > penalized (the fact is that being both marked as noidle, they both end > up in the noidle tree, where they are serviced round robin, so they > get fairness in term of IOPS, but bandwidth varies a lot. 
> 
> With my patches in place, I get:
> Run status group 0 (all jobs):
>    READ: io=21544KiB, aggrb=4409KiB/s, minb=1511KiB/s, maxb=2898KiB/s,
> mint=5002msec, maxt=5003msec
> 
> Run status group 1 (all jobs):
>    READ: io=32000KiB, aggrb=6549KiB/s, minb=1277KiB/s, maxb=5274KiB/s,
> mint=5001msec, maxt=5003msec
> 
> Run status group 2 (all jobs):
>    READ: io=39444KiB, aggrb=8073KiB/s, minb=1576KiB/s, maxb=6498KiB/s,
> mint=5002msec, maxt=5003msec
> 
> Run status group 3 (all jobs):
>    READ: io=49180KiB, aggrb=10059KiB/s, minb=1512KiB/s,
> maxb=8548KiB/s, mint=5001msec, maxt=5006msec
> 
> The process doing smaller requests is now not penalized by the fact
> that it is run concurrently with the other one, and the other still
> benefits from larger requests because it uses better its time slice.

Ok, so with this patch, bigger-size requests will be marked as
sync-idle, so that the 4K-size and 64K-size processes will now be on
separate service trees. But this will work only if we were idling on
the service tree (on SSD). I thought that on SSD we will not idle even
on the service tree; it looks like we have left a bug somewhere.
Otherwise, on an NCQ SSD we will suffer in terms of performance in
this kind of setup, especially if you increase the number of readers.
I will run your fio script on my NCQ SSD.

Or is it intentional to idle on the service tree with hw_tag=1 but not
idle with an NCQ hard disk? That makes sense, though. But looking at
cfq_should_idle(), it looks like we will always idle on a service tree
even on an NCQ SSD, even if cfq_cfqq_idle_window=0.

I think if I run the same test on an NCQ SSD, the bigger-size process
should now lose, because we will idle on the sync-noidle service tree
but not on the sync-idle service tree.

> 
> > In case of SSDs with NCQ, we will not idle on any of the queues (either
> > sync or sync-noidle (seeky queues)). So w.r.t code, what behavior changes
> > if we mark a queue as seeky/non-seeky on SSD?
> > > > I've not tested on NCQ SSD, but I think at worst it will not harm, and > at best, it will provide similar fairness improvements when the queue > of processes submitting requests grows above the available NCQ slots. > > > IOW, looking at this patch, now any queue doing IO in smaller chunks than > > 32K on SSD will be marked as seeky. How does that change the behavior in > > terms of fairness for the queue? > > > Basically, we will have IOPS based fairness for small requests, and > time based fairness for larger requests. > > Thanks, > Corrado > > > Thanks > > Vivek > > > >> > >> Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com> > >> --- > >> block/cfq-iosched.c | 7 ++++++- > >> 1 files changed, 6 insertions(+), 1 deletions(-) > >> > >> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c > >> index 806d30b..f27e535 100644 > >> --- a/block/cfq-iosched.c > >> +++ b/block/cfq-iosched.c > >> @@ -47,6 +47,7 @@ static const int cfq_hist_divisor = 4; > >> #define CFQ_SERVICE_SHIFT 12 > >> > >> #define CFQQ_SEEK_THR (sector_t)(8 * 100) > >> +#define CFQQ_SECT_THR_NONROT (sector_t)(2 * 32) > >> #define CFQQ_SEEKY(cfqq) (hweight32(cfqq->seek_history) > 32/8) > >> > >> #define RQ_CIC(rq) \ > >> @@ -2958,6 +2959,7 @@ cfq_update_io_seektime(struct cfq_data *cfqd, struct cfq_queue *cfqq, > >> struct request *rq) > >> { > >> sector_t sdist = 0; > >> + sector_t n_sec = blk_rq_sectors(rq); > >> if (cfqq->last_request_pos) { > >> if (cfqq->last_request_pos < blk_rq_pos(rq)) > >> sdist = blk_rq_pos(rq) - cfqq->last_request_pos; > >> @@ -2966,7 +2968,10 @@ cfq_update_io_seektime(struct cfq_data *cfqd, struct cfq_queue *cfqq, > >> } > >> > >> cfqq->seek_history <<= 1; > >> - cfqq->seek_history |= (sdist > CFQQ_SEEK_THR); > >> + if (blk_queue_nonrot(cfqd->queue)) > >> + cfqq->seek_history |= (n_sec < CFQQ_SECT_THR_NONROT); > >> + else > >> + cfqq->seek_history |= (sdist > CFQQ_SEEK_THR); > >> } > >> > >> /* > >> -- > >> 1.6.4.4 > > ^ permalink raw reply [flat|nested] 16+ 
messages in thread
* Re: [PATCH 2/2] cfq-iosched: rethink seeky detection for SSDs 2010-03-03 19:47 ` Corrado Zoccolo 2010-03-03 21:21 ` Vivek Goyal @ 2010-03-03 23:28 ` Vivek Goyal 2010-03-04 20:34 ` Corrado Zoccolo 1 sibling, 1 reply; 16+ messages in thread From: Vivek Goyal @ 2010-03-03 23:28 UTC (permalink / raw) To: Corrado Zoccolo Cc: Jens Axboe, Linux-Kernel, Jeff Moyer, Shaohua Li, Gui Jianfeng On Wed, Mar 03, 2010 at 08:47:31PM +0100, Corrado Zoccolo wrote: > Hi Vivek, > On Mon, Mar 1, 2010 at 3:25 PM, Vivek Goyal <vgoyal@redhat.com> wrote: > > On Sat, Feb 27, 2010 at 07:45:40PM +0100, Corrado Zoccolo wrote: > >> CFQ currently applies the same logic of detecting seeky queues and > >> grouping them together for rotational disks as well as SSDs. > >> For SSDs, the time to complete a request doesn't depend on the > >> request location, but only on the size. > >> This patch therefore changes the criterion to group queues by > >> request size in case of SSDs, in order to achieve better fairness. > > > > Hi Corrado, > > > > Can you give some numbers regarding how are you measuring fairness and > > how did you decide that we achieve better fairness? > > > Please, see the attached fio script. It benchmarks pairs of processes > performing direct random I/O. 
> One is always fixed at bs=4k , while I vary the other from 8K to 64K > test00: (g=0): rw=randread, bs=8K-8K/8K-8K, ioengine=sync, iodepth=1 > test01: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1 > test10: (g=1): rw=randread, bs=16K-16K/16K-16K, ioengine=sync, iodepth=1 > test11: (g=1): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1 > test20: (g=2): rw=randread, bs=32K-32K/32K-32K, ioengine=sync, iodepth=1 > test21: (g=2): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1 > test30: (g=3): rw=randread, bs=64K-64K/64K-64K, ioengine=sync, iodepth=1 > test31: (g=3): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1 > > With unpatched cfq (2.6.33), on a flash card (non-ncq), after running > a fio script with high number of parallel readers to make sure ncq > detection is stabilized, I get the following: > Run status group 0 (all jobs): > READ: io=21528KiB, aggrb=4406KiB/s, minb=1485KiB/s, maxb=2922KiB/s, > mint=5001msec, maxt=5003msec > > Run status group 1 (all jobs): > READ: io=31524KiB, aggrb=6452KiB/s, minb=1327KiB/s, maxb=5126KiB/s, > mint=5002msec, maxt=5003msec > > Run status group 2 (all jobs): > READ: io=46544KiB, aggrb=9524KiB/s, minb=1031KiB/s, maxb=8493KiB/s, > mint=5001msec, maxt=5004msec > > Run status group 3 (all jobs): > READ: io=64712KiB, aggrb=13242KiB/s, minb=761KiB/s, > maxb=12486KiB/s, mint=5002msec, maxt=5004msec > > As you can see from minb, the process with smallest I/O size is > penalized (the fact is that being both marked as noidle, they both end > up in the noidle tree, where they are serviced round robin, so they > get fairness in term of IOPS, but bandwidth varies a lot. 
> 
> With my patches in place, I get:
> Run status group 0 (all jobs):
>    READ: io=21544KiB, aggrb=4409KiB/s, minb=1511KiB/s, maxb=2898KiB/s,
> mint=5002msec, maxt=5003msec
> 
> Run status group 1 (all jobs):
>    READ: io=32000KiB, aggrb=6549KiB/s, minb=1277KiB/s, maxb=5274KiB/s,
> mint=5001msec, maxt=5003msec
> 
> Run status group 2 (all jobs):
>    READ: io=39444KiB, aggrb=8073KiB/s, minb=1576KiB/s, maxb=6498KiB/s,
> mint=5002msec, maxt=5003msec
> 
> Run status group 3 (all jobs):
>    READ: io=49180KiB, aggrb=10059KiB/s, minb=1512KiB/s,
> maxb=8548KiB/s, mint=5001msec, maxt=5006msec
> 
> The process doing smaller requests is now not penalized by the fact
> that it is run concurrently with the other one, and the other still
> benefits from larger requests because it uses better its time slice.
> 

Hi Corrado,

I ran your fio script on my SSD, which supports NCQ. In my case both
processes lose.

Vanilla kernel (2.6.33)
=======================
Run status group 0 (all jobs):
   READ: io=258MB, aggrb=52,903KB/s, minb=18,058KB/s, maxb=36,114KB/s, mint=5001msec, maxt=5001msec

Run status group 1 (all jobs):
   READ: io=382MB, aggrb=78,301KB/s, minb=16,037KB/s, maxb=64,143KB/s, mint=5001msec, maxt=5001msec

Run status group 2 (all jobs):
   READ: io=560MB, aggrb=112MB/s, minb=13,052KB/s, maxb=102MB/s, mint=5001msec, maxt=5001msec

Run status group 3 (all jobs):
   READ: io=774MB, aggrb=155MB/s, minb=9,595KB/s, maxb=149MB/s, mint=5001msec, maxt=5001msec

With two seek patches applied
=============================
Run status group 0 (all jobs):
   READ: io=260MB, aggrb=53,148KB/s, minb=18,140KB/s, maxb=36,283KB/s, mint=5001msec, maxt=5001msec

Run status group 1 (all jobs):
   READ: io=383MB, aggrb=78,484KB/s, minb=16,073KB/s, maxb=64,294KB/s, mint=5001msec, maxt=5001msec

Run status group 2 (all jobs):
   READ: io=359MB, aggrb=73,534KB/s, minb=8,367KB/s, maxb=66,931KB/s, mint=5001msec, maxt=5001msec

Run status group 3 (all jobs):
   READ: io=556MB, aggrb=111MB/s, minb=6,852KB/s, maxb=107MB/s, mint=5001msec, maxt=5001msec

Note how overall BW suffers for groups 2 and 3. This needs some
debugging to see why it is happening. I guess we are idling on the
sync-noidle tree and not idling on the sync-idle tree; that's probably
why the bigger request size process can't get enough IO pushed. But
then the smaller request size process should have got more IO done,
and that also does not seem to be happening. Both the small and the
big request size processes are losing.

Thanks
Vivek

> > In case of SSDs with NCQ, we will not idle on any of the queues (either
> > sync or sync-noidle (seeky queues)). So w.r.t code, what behavior changes
> > if we mark a queue as seeky/non-seeky on SSD?
> >
> 
> I've not tested on NCQ SSD, but I think at worst it will not harm, and
> at best, it will provide similar fairness improvements when the queue
> of processes submitting requests grows above the available NCQ slots.
> 
> > IOW, looking at this patch, now any queue doing IO in smaller chunks than
> > 32K on SSD will be marked as seeky. How does that change the behavior in
> > terms of fairness for the queue?
> >
> Basically, we will have IOPS based fairness for small requests, and
> time based fairness for larger requests.
> > Thanks, > Corrado > > > Thanks > > Vivek > > > >> > >> Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com> > >> --- > >> block/cfq-iosched.c | 7 ++++++- > >> 1 files changed, 6 insertions(+), 1 deletions(-) > >> > >> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c > >> index 806d30b..f27e535 100644 > >> --- a/block/cfq-iosched.c > >> +++ b/block/cfq-iosched.c > >> @@ -47,6 +47,7 @@ static const int cfq_hist_divisor = 4; > >> #define CFQ_SERVICE_SHIFT 12 > >> > >> #define CFQQ_SEEK_THR (sector_t)(8 * 100) > >> +#define CFQQ_SECT_THR_NONROT (sector_t)(2 * 32) > >> #define CFQQ_SEEKY(cfqq) (hweight32(cfqq->seek_history) > 32/8) > >> > >> #define RQ_CIC(rq) \ > >> @@ -2958,6 +2959,7 @@ cfq_update_io_seektime(struct cfq_data *cfqd, struct cfq_queue *cfqq, > >> struct request *rq) > >> { > >> sector_t sdist = 0; > >> + sector_t n_sec = blk_rq_sectors(rq); > >> if (cfqq->last_request_pos) { > >> if (cfqq->last_request_pos < blk_rq_pos(rq)) > >> sdist = blk_rq_pos(rq) - cfqq->last_request_pos; > >> @@ -2966,7 +2968,10 @@ cfq_update_io_seektime(struct cfq_data *cfqd, struct cfq_queue *cfqq, > >> } > >> > >> cfqq->seek_history <<= 1; > >> - cfqq->seek_history |= (sdist > CFQQ_SEEK_THR); > >> + if (blk_queue_nonrot(cfqd->queue)) > >> + cfqq->seek_history |= (n_sec < CFQQ_SECT_THR_NONROT); > >> + else > >> + cfqq->seek_history |= (sdist > CFQQ_SEEK_THR); > >> } > >> > >> /* > >> -- > >> 1.6.4.4 > > ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 2/2] cfq-iosched: rethink seeky detection for SSDs 2010-03-03 23:28 ` Vivek Goyal @ 2010-03-04 20:34 ` Corrado Zoccolo 2010-03-04 22:27 ` Vivek Goyal 0 siblings, 1 reply; 16+ messages in thread From: Corrado Zoccolo @ 2010-03-04 20:34 UTC (permalink / raw) To: Vivek Goyal Cc: Jens Axboe, Linux-Kernel, Jeff Moyer, Shaohua Li, Gui Jianfeng Hi Vivek, On Thu, Mar 4, 2010 at 12:28 AM, Vivek Goyal <vgoyal@redhat.com> wrote: > On Wed, Mar 03, 2010 at 08:47:31PM +0100, Corrado Zoccolo wrote: >> Hi Vivek, >> On Mon, Mar 1, 2010 at 3:25 PM, Vivek Goyal <vgoyal@redhat.com> wrote: >> > On Sat, Feb 27, 2010 at 07:45:40PM +0100, Corrado Zoccolo wrote: >> >> CFQ currently applies the same logic of detecting seeky queues and >> >> grouping them together for rotational disks as well as SSDs. >> >> For SSDs, the time to complete a request doesn't depend on the >> >> request location, but only on the size. >> >> This patch therefore changes the criterion to group queues by >> >> request size in case of SSDs, in order to achieve better fairness. >> > >> > Hi Corrado, >> > >> > Can you give some numbers regarding how are you measuring fairness and >> > how did you decide that we achieve better fairness? >> > >> Please, see the attached fio script. It benchmarks pairs of processes >> performing direct random I/O. 
>> One is always fixed at bs=4k , while I vary the other from 8K to 64K >> test00: (g=0): rw=randread, bs=8K-8K/8K-8K, ioengine=sync, iodepth=1 >> test01: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1 >> test10: (g=1): rw=randread, bs=16K-16K/16K-16K, ioengine=sync, iodepth=1 >> test11: (g=1): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1 >> test20: (g=2): rw=randread, bs=32K-32K/32K-32K, ioengine=sync, iodepth=1 >> test21: (g=2): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1 >> test30: (g=3): rw=randread, bs=64K-64K/64K-64K, ioengine=sync, iodepth=1 >> test31: (g=3): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1 >> >> With unpatched cfq (2.6.33), on a flash card (non-ncq), after running >> a fio script with high number of parallel readers to make sure ncq >> detection is stabilized, I get the following: >> Run status group 0 (all jobs): >> READ: io=21528KiB, aggrb=4406KiB/s, minb=1485KiB/s, maxb=2922KiB/s, >> mint=5001msec, maxt=5003msec >> >> Run status group 1 (all jobs): >> READ: io=31524KiB, aggrb=6452KiB/s, minb=1327KiB/s, maxb=5126KiB/s, >> mint=5002msec, maxt=5003msec >> >> Run status group 2 (all jobs): >> READ: io=46544KiB, aggrb=9524KiB/s, minb=1031KiB/s, maxb=8493KiB/s, >> mint=5001msec, maxt=5004msec >> >> Run status group 3 (all jobs): >> READ: io=64712KiB, aggrb=13242KiB/s, minb=761KiB/s, >> maxb=12486KiB/s, mint=5002msec, maxt=5004msec >> >> As you can see from minb, the process with smallest I/O size is >> penalized (the fact is that being both marked as noidle, they both end >> up in the noidle tree, where they are serviced round robin, so they >> get fairness in term of IOPS, but bandwidth varies a lot. 
>> >> With my patches in place, I get: >> Run status group 0 (all jobs): >> READ: io=21544KiB, aggrb=4409KiB/s, minb=1511KiB/s, maxb=2898KiB/s, >> mint=5002msec, maxt=5003msec >> >> Run status group 1 (all jobs): >> READ: io=32000KiB, aggrb=6549KiB/s, minb=1277KiB/s, maxb=5274KiB/s, >> mint=5001msec, maxt=5003msec >> >> Run status group 2 (all jobs): >> READ: io=39444KiB, aggrb=8073KiB/s, minb=1576KiB/s, maxb=6498KiB/s, >> mint=5002msec, maxt=5003msec >> >> Run status group 3 (all jobs): >> READ: io=49180KiB, aggrb=10059KiB/s, minb=1512KiB/s, >> maxb=8548KiB/s, mint=5001msec, maxt=5006msec >> >> The process doing smaller requests is now not penalized by the fact >> that it is run concurrently with the other one, and the other still >> benefits from larger requests because it uses better its time slice. >> > > Hi Corrado, > > I ran your fio script on my SSD which supports NCQ. In my case both the > processes lose. > > Vanilla kernel (2.6.33) > ======================= > Run status group 0 (all jobs): > READ: io=258MB, aggrb=52,903KB/s, minb=18,058KB/s, maxb=36,114KB/s, > mint=5001msec, maxt=5001msec > > Run status group 1 (all jobs): > READ: io=382MB, aggrb=78,301KB/s, minb=16,037KB/s, maxb=64,143KB/s, > mint=5001msec, maxt=5001msec > > Run status group 2 (all jobs): > READ: io=560MB, aggrb=112MB/s, minb=13,052KB/s, maxb=102MB/s, > mint=5001msec, maxt=5001msec > > Run status group 3 (all jobs): > READ: io=774MB, aggrb=155MB/s, minb=9,595KB/s, maxb=149MB/s, > mint=5001msec, maxt=5001msec > > With two seek patches applied > ============================= > Run status group 0 (all jobs): > READ: io=260MB, aggrb=53,148KB/s, minb=18,140KB/s, maxb=36,283KB/s, > mint=5001msec, maxt=5001msec > > Run status group 1 (all jobs): > READ: io=383MB, aggrb=78,484KB/s, minb=16,073KB/s, maxb=64,294KB/s, > mint=5001msec, maxt=5001msec > > Run status group 2 (all jobs): > READ: io=359MB, aggrb=73,534KB/s, minb=8,367KB/s, maxb=66,931KB/s, > mint=5001msec, maxt=5001msec > > Run status 
group 3 (all jobs):
> READ: io=556MB, aggrb=111MB/s, minb=6,852KB/s, maxb=107MB/s,
> mint=5001msec, maxt=5001msec
>
> Note, how overall BW suffers for group 2 and group 3. This needs some
> debugging why it is happening. I guess we are idling on sync-noidle
> tree and not idling on sync-idle tree, probably that's why bigger request
> size process can't get enough IO pushed. But then smaller request size
> process should have got more IO done but that also does not seem to
> be happening.

I think this is a preexisting bug. You can probably trigger it in vanilla 2.6.33 by having one process doing random and another doing sequential reads. I think the issue is:

* cfq_should_idle returns true for the last queue on a tree, even for NCQ SSDs:

static bool cfq_should_idle(struct cfq_data *cfqd, struct cfq_queue *cfqq)
{
	enum wl_prio_t prio = cfqq_prio(cfqq);
	struct cfq_rb_root *service_tree = cfqq->service_tree;

	BUG_ON(!service_tree);
	BUG_ON(!service_tree->count);

	/* We never do for idle class queues. */
	if (prio == IDLE_WORKLOAD)
		return false;

	/* We do for queues that were marked with idle window flag. */
	if (cfq_cfqq_idle_window(cfqq) &&
	    !(blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag))
		return true;

	/*
	 * Otherwise, we do only if they are the last ones
	 * in their service tree.
	 */
	return service_tree->count == 1 && cfq_cfqq_sync(cfqq);
}

* Even if we never activate a timer, we wait for request completion by keeping a queue that has requests in flight, when cfq_should_idle returns true (in two places):

	/*
	 * The active queue has run out of time, expire it and select new.
	 */
	if (cfq_slice_used(cfqq) && !cfq_cfqq_must_dispatch(cfqq)) {
		/*
		 * If slice had not expired at the completion of last request
		 * we might not have turned on wait_busy flag. Don't expire
		 * the queue yet. Allow the group to get backlogged.
		 *
		 * The very fact that we have used the slice, that means we
		 * have been idling all along on this queue and it should be
		 * ok to wait for this request to complete.
		 */
		if (cfqq->cfqg->nr_cfqq == 1 && RB_EMPTY_ROOT(&cfqq->sort_list)
		    && cfqq->dispatched && cfq_should_idle(cfqd, cfqq)) {
			cfqq = NULL;
			goto keep_queue;
		} else
			goto expire;
	}

	/*
	 * No requests pending. If the active queue still has requests in
	 * flight or is idling for a new request, allow either of these
	 * conditions to happen (or time out) before selecting a new queue.
	 */
	if (timer_pending(&cfqd->idle_slice_timer) ||
	    (cfqq->dispatched && cfq_should_idle(cfqd, cfqq))) {
		cfqq = NULL;
		goto keep_queue;
	}

Can you change cfq_should_idle to always return false for NCQ SSDs and retest?

Thanks
Corrado

> > Both small request size and big request size processes are losing.
> >
> > Thanks
> > Vivek
> >
>> > In case of SSDs with NCQ, we will not idle on any of the queues (either
>> > sync or sync-noidle (seeky queues)). So w.r.t code, what behavior changes
>> > if we mark a queue as seeky/non-seeky on SSD?
>> >
>> I've not tested on NCQ SSD, but I think at worst it will not harm, and
>> at best, it will provide similar fairness improvements when the queue
>> of processes submitting requests grows above the available NCQ slots.
>>
>> > IOW, looking at this patch, now any queue doing IO in smaller chunks than
>> > 32K on SSD will be marked as seeky. How does that change the behavior in
>> > terms of fairness for the queue?
>> >
>> Basically, we will have IOPS based fairness for small requests, and
>> time based fairness for larger requests.
>> >> Thanks, >> Corrado >> >> > Thanks >> > Vivek >> > >> >> >> >> Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com> >> >> --- >> >> block/cfq-iosched.c | 7 ++++++- >> >> 1 files changed, 6 insertions(+), 1 deletions(-) >> >> >> >> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c >> >> index 806d30b..f27e535 100644 >> >> --- a/block/cfq-iosched.c >> >> +++ b/block/cfq-iosched.c >> >> @@ -47,6 +47,7 @@ static const int cfq_hist_divisor = 4; >> >> #define CFQ_SERVICE_SHIFT 12 >> >> >> >> #define CFQQ_SEEK_THR (sector_t)(8 * 100) >> >> +#define CFQQ_SECT_THR_NONROT (sector_t)(2 * 32) >> >> #define CFQQ_SEEKY(cfqq) (hweight32(cfqq->seek_history) > 32/8) >> >> >> >> #define RQ_CIC(rq) \ >> >> @@ -2958,6 +2959,7 @@ cfq_update_io_seektime(struct cfq_data *cfqd, struct cfq_queue *cfqq, >> >> struct request *rq) >> >> { >> >> sector_t sdist = 0; >> >> + sector_t n_sec = blk_rq_sectors(rq); >> >> if (cfqq->last_request_pos) { >> >> if (cfqq->last_request_pos < blk_rq_pos(rq)) >> >> sdist = blk_rq_pos(rq) - cfqq->last_request_pos; >> >> @@ -2966,7 +2968,10 @@ cfq_update_io_seektime(struct cfq_data *cfqd, struct cfq_queue *cfqq, >> >> } >> >> >> >> cfqq->seek_history <<= 1; >> >> - cfqq->seek_history |= (sdist > CFQQ_SEEK_THR); >> >> + if (blk_queue_nonrot(cfqd->queue)) >> >> + cfqq->seek_history |= (n_sec < CFQQ_SECT_THR_NONROT); >> >> + else >> >> + cfqq->seek_history |= (sdist > CFQQ_SEEK_THR); >> >> } >> >> >> >> /* >> >> -- >> >> 1.6.4.4 >> > > > > -- __________________________________________________________________________ dott. Corrado Zoccolo mailto:czoccolo@gmail.com PhD - Department of Computer Science - University of Pisa, Italy -------------------------------------------------------------------------- The self-confidence of a warrior is not the self-confidence of the average man. The average man seeks certainty in the eyes of the onlooker and calls that self-confidence. The warrior seeks impeccability in his own eyes and calls that humbleness. 
Tales of Power - C. Castaneda ^ permalink raw reply [flat|nested] 16+ messages in thread
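The seek_history bookkeeping from patch 1, plus the SSD-specific classification from patch 2, can be sketched as a self-contained userspace model. The constants are copied from the patches; `__builtin_popcount` stands in for the kernel's `hweight32()`, and the struct is a simplified stand-in for `struct cfq_queue`:

```c
#include <stdint.h>

/* Userspace sketch of the reworked seeky detection: a 32-bit history of
 * per-request classifications replaces the old seek_mean average.
 * Thresholds are in sectors, as in the patches. */
#define CFQQ_SEEK_THR        (8 * 100)  /* rotational: "close to head" limit */
#define CFQQ_SECT_THR_NONROT (2 * 32)   /* SSD: small-request size limit */

struct cfqq_sim {
    uint32_t seek_history;     /* one bit per request, 1 = seeky */
    uint64_t last_request_pos; /* end sector of the previous request */
};

/* A queue is seeky if more than 1/8 of the last 32 requests were. */
static int cfqq_seeky(const struct cfqq_sim *q)
{
    return __builtin_popcount(q->seek_history) > 32 / 8;
}

static void update_seek_history(struct cfqq_sim *q, int nonrot,
                                uint64_t pos, uint64_t n_sec)
{
    uint64_t sdist = 0;

    if (q->last_request_pos)
        sdist = pos > q->last_request_pos ?
                pos - q->last_request_pos : q->last_request_pos - pos;

    q->seek_history <<= 1;
    if (nonrot)          /* SSD: classify by request size ...     */
        q->seek_history |= (n_sec < CFQQ_SECT_THR_NONROT);
    else                 /* rotational: ... by distance from head */
        q->seek_history |= (sdist > CFQQ_SEEK_THR);

    q->last_request_pos = pos + n_sec;
}
```

This also makes the hysteresis visible: a single medium seek in an otherwise sequential stream sets one history bit, and the queue only flips to seeky once more than 4 of the last 32 requests were classified that way.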
* Re: [PATCH 2/2] cfq-iosched: rethink seeky detection for SSDs 2010-03-04 20:34 ` Corrado Zoccolo @ 2010-03-04 22:27 ` Vivek Goyal 2010-03-05 22:31 ` Corrado Zoccolo 0 siblings, 1 reply; 16+ messages in thread From: Vivek Goyal @ 2010-03-04 22:27 UTC (permalink / raw) To: Corrado Zoccolo Cc: Jens Axboe, Linux-Kernel, Jeff Moyer, Shaohua Li, Gui Jianfeng On Thu, Mar 04, 2010 at 09:34:51PM +0100, Corrado Zoccolo wrote: > Hi Vivek, > On Thu, Mar 4, 2010 at 12:28 AM, Vivek Goyal <vgoyal@redhat.com> wrote: > > On Wed, Mar 03, 2010 at 08:47:31PM +0100, Corrado Zoccolo wrote: > >> Hi Vivek, > >> On Mon, Mar 1, 2010 at 3:25 PM, Vivek Goyal <vgoyal@redhat.com> wrote: > >> > On Sat, Feb 27, 2010 at 07:45:40PM +0100, Corrado Zoccolo wrote: > >> >> CFQ currently applies the same logic of detecting seeky queues and > >> >> grouping them together for rotational disks as well as SSDs. > >> >> For SSDs, the time to complete a request doesn't depend on the > >> >> request location, but only on the size. > >> >> This patch therefore changes the criterion to group queues by > >> >> request size in case of SSDs, in order to achieve better fairness. > >> > > >> > Hi Corrado, > >> > > >> > Can you give some numbers regarding how are you measuring fairness and > >> > how did you decide that we achieve better fairness? > >> > > >> Please, see the attached fio script. It benchmarks pairs of processes > >> performing direct random I/O. 
> >> One is always fixed at bs=4k , while I vary the other from 8K to 64K > >> test00: (g=0): rw=randread, bs=8K-8K/8K-8K, ioengine=sync, iodepth=1 > >> test01: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1 > >> test10: (g=1): rw=randread, bs=16K-16K/16K-16K, ioengine=sync, iodepth=1 > >> test11: (g=1): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1 > >> test20: (g=2): rw=randread, bs=32K-32K/32K-32K, ioengine=sync, iodepth=1 > >> test21: (g=2): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1 > >> test30: (g=3): rw=randread, bs=64K-64K/64K-64K, ioengine=sync, iodepth=1 > >> test31: (g=3): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1 > >> > >> With unpatched cfq (2.6.33), on a flash card (non-ncq), after running > >> a fio script with high number of parallel readers to make sure ncq > >> detection is stabilized, I get the following: > >> Run status group 0 (all jobs): > >> READ: io=21528KiB, aggrb=4406KiB/s, minb=1485KiB/s, maxb=2922KiB/s, > >> mint=5001msec, maxt=5003msec > >> > >> Run status group 1 (all jobs): > >> READ: io=31524KiB, aggrb=6452KiB/s, minb=1327KiB/s, maxb=5126KiB/s, > >> mint=5002msec, maxt=5003msec > >> > >> Run status group 2 (all jobs): > >> READ: io=46544KiB, aggrb=9524KiB/s, minb=1031KiB/s, maxb=8493KiB/s, > >> mint=5001msec, maxt=5004msec > >> > >> Run status group 3 (all jobs): > >> READ: io=64712KiB, aggrb=13242KiB/s, minb=761KiB/s, > >> maxb=12486KiB/s, mint=5002msec, maxt=5004msec > >> > >> As you can see from minb, the process with smallest I/O size is > >> penalized (the fact is that being both marked as noidle, they both end > >> up in the noidle tree, where they are serviced round robin, so they > >> get fairness in term of IOPS, but bandwidth varies a lot. 
> >> > >> With my patches in place, I get: > >> Run status group 0 (all jobs): > >> READ: io=21544KiB, aggrb=4409KiB/s, minb=1511KiB/s, maxb=2898KiB/s, > >> mint=5002msec, maxt=5003msec > >> > >> Run status group 1 (all jobs): > >> READ: io=32000KiB, aggrb=6549KiB/s, minb=1277KiB/s, maxb=5274KiB/s, > >> mint=5001msec, maxt=5003msec > >> > >> Run status group 2 (all jobs): > >> READ: io=39444KiB, aggrb=8073KiB/s, minb=1576KiB/s, maxb=6498KiB/s, > >> mint=5002msec, maxt=5003msec > >> > >> Run status group 3 (all jobs): > >> READ: io=49180KiB, aggrb=10059KiB/s, minb=1512KiB/s, > >> maxb=8548KiB/s, mint=5001msec, maxt=5006msec > >> > >> The process doing smaller requests is now not penalized by the fact > >> that it is run concurrently with the other one, and the other still > >> benefits from larger requests because it uses better its time slice. > >> > > > > Hi Corrado, > > > > I ran your fio script on my SSD which supports NCQ. In my case both the > > processes lose. > > > > Vanilla kernel (2.6.33) > > ======================= > > Run status group 0 (all jobs): > > READ: io=258MB, aggrb=52,903KB/s, minb=18,058KB/s, maxb=36,114KB/s, > > mint=5001msec, maxt=5001msec > > > > Run status group 1 (all jobs): > > READ: io=382MB, aggrb=78,301KB/s, minb=16,037KB/s, maxb=64,143KB/s, > > mint=5001msec, maxt=5001msec > > > > Run status group 2 (all jobs): > > READ: io=560MB, aggrb=112MB/s, minb=13,052KB/s, maxb=102MB/s, > > mint=5001msec, maxt=5001msec > > > > Run status group 3 (all jobs): > > READ: io=774MB, aggrb=155MB/s, minb=9,595KB/s, maxb=149MB/s, > > mint=5001msec, maxt=5001msec > > > > With two seek patches applied > > ============================= > > Run status group 0 (all jobs): > > READ: io=260MB, aggrb=53,148KB/s, minb=18,140KB/s, maxb=36,283KB/s, > > mint=5001msec, maxt=5001msec > > > > Run status group 1 (all jobs): > > READ: io=383MB, aggrb=78,484KB/s, minb=16,073KB/s, maxb=64,294KB/s, > > mint=5001msec, maxt=5001msec > > > > Run status group 2 (all jobs): > > 
READ: io=359MB, aggrb=73,534KB/s, minb=8,367KB/s, maxb=66,931KB/s,
> > mint=5001msec, maxt=5001msec
> >
> > Run status group 3 (all jobs):
> > READ: io=556MB, aggrb=111MB/s, minb=6,852KB/s, maxb=107MB/s,
> > mint=5001msec, maxt=5001msec
> >
> > Note, how overall BW suffers for group 2 and group 3. This needs some
> > debugging why it is happening. I guess we are idling on sync-noidle
> > tree and not idling on sync-idle tree, probably that's why bigger request
> > size process can't get enough IO pushed. But then smaller request size
> > process should have got more IO done but that also does not seem to
> > be happening.
>
> I think this is a preexisting bug. You can probably trigger it in
> vanilla 2.6.33 by having one process doing random and another doing
> sequential reads.
> I think the issue is:
> * cfq_should_idle returns true for the last queue on a tree, even for NCQ SSDs:

You are right. The way cfq_should_idle() is written, it will return true for both sync-idle and sync-noidle trees. With your patch two random readers go onto different service trees and now we start driving queue depth 1 because of the following:

	if (timer_pending(&cfqd->idle_slice_timer) ||
	    (cfqq->dispatched && cfq_should_idle(cfqd, cfqq))) {
		cfqq = NULL;
		goto keep_queue;
	}

I tested the following patch and now throughput is almost back at the same levels as without your patch.

---
 block/cfq-iosched.c |    7 +++++--
 1 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 40840da..296aa23 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1788,9 +1788,12 @@ static bool cfq_should_idle(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 	if (prio == IDLE_WORKLOAD)
 		return false;
 
+	/* Don't idle on NCQ SSDs */
+	if (blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag)
+		return false;
+
 	/* We do for queues that were marked with idle window flag. */
-	if (cfq_cfqq_idle_window(cfqq) &&
-	    !(blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag))
+	if (cfq_cfqq_idle_window(cfqq))
 		return true;
 
 	/*
-- 
1.6.2.5

Now I am wondering what the side effects of the above change are.

One advantage of idling on the service tree even on NCQ SSD is that READS get more protection from WRITES. Especially if there are SSDs which implement NCQ but really suck at WRITE speed or don't implement parallel IO channels.

I see cfq_should_idle() being used at four places.

- cfq_select_queue()
	if (cfqq->cfqg->nr_cfqq == 1 && RB_EMPTY_ROOT(&cfqq->sort_list)
	    && cfqq->dispatched && cfq_should_idle(cfqd, cfqq)) {
		cfqq = NULL;
		goto keep_queue;
	}

  This is for fairness in group scheduling. I think the impact here should not be much because this condition is primarily hit on media where we idle on the queue. On NCQ SSD, I don't think we get fairness in group scheduling much until and unless a group is generating enough traffic to keep the request queue full.

- cfq_select_queue()
	(cfqq->dispatched && cfq_should_idle(cfqd, cfqq))) {
		cfqq = NULL;

  Should we really wait for a dispatched request to complete on NCQ SSD? On an SSD with parallel channels, I think this will reduce throughput?

- cfq_may_dispatch()
	/*
	 * Drain async requests before we start sync IO
	 */
	if (cfq_should_idle(cfqd, cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC])
		return false;

  If an NCQ SSD is implementing parallel IO channels, probably there is no need to clear out WRITES before we start sync IO?

- cfq_arm_slice_timer()
	if (!cfqd->cfq_slice_idle || !cfq_should_idle(cfqd, cfqq))
		return;

  cfq_arm_slice_timer() already checks for NCQ SSD and does not arm the timer, so this function should remain unaffected by this change.

So the question is, should we wait on the service tree on an NCQ SSD or not?

Secondly, your patch of determining cfqq seeky nature based on request size on SSD helps only on non-NCQ SSD.

Thanks
Vivek
^ permalink raw reply related [flat|nested] 16+ messages in thread
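The before/after behavior of Vivek's change can be condensed into a pure predicate for tabletop reasoning. The parameter names below are simplified stand-ins for the kernel state (`cfqq_prio`, `blk_queue_nonrot`, `hw_tag`, `cfq_cfqq_idle_window`, `service_tree->count`, `cfq_cfqq_sync`), not real kernel signatures:

```c
/* Condensed decision logic of cfq_should_idle() with the proposed
 * "don't idle on NCQ SSDs" early return applied. All inputs are
 * boolean stand-ins for the kernel flags except tree_count, which
 * mirrors service_tree->count. */
static int should_idle(int idle_class, int nonrot, int hw_tag,
                       int idle_window, int tree_count, int is_sync)
{
    if (idle_class)
        return 0;                       /* never for idle-class queues */
    if (nonrot && hw_tag)
        return 0;                       /* proposed: never on NCQ SSDs */
    if (idle_window)
        return 1;                       /* queue was marked for idling */
    return tree_count == 1 && is_sync;  /* last sync queue on the tree */
}
```

The key behavioral difference from the unpatched predicate is the second early return: before the patch, an NCQ SSD queue that was the last sync queue on its service tree still made cfq_should_idle() return true, which is what drove queue depth down to 1.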
* Re: [PATCH 2/2] cfq-iosched: rethink seeky detection for SSDs 2010-03-04 22:27 ` Vivek Goyal @ 2010-03-05 22:31 ` Corrado Zoccolo 2010-03-08 14:08 ` Vivek Goyal 0 siblings, 1 reply; 16+ messages in thread From: Corrado Zoccolo @ 2010-03-05 22:31 UTC (permalink / raw) To: Vivek Goyal Cc: Jens Axboe, Linux-Kernel, Jeff Moyer, Shaohua Li, Gui Jianfeng On Thu, Mar 4, 2010 at 11:27 PM, Vivek Goyal <vgoyal@redhat.com> wrote: > On Thu, Mar 04, 2010 at 09:34:51PM +0100, Corrado Zoccolo wrote: >> Hi Vivek, >> On Thu, Mar 4, 2010 at 12:28 AM, Vivek Goyal <vgoyal@redhat.com> wrote: >> > On Wed, Mar 03, 2010 at 08:47:31PM +0100, Corrado Zoccolo wrote: >> >> Hi Vivek, >> >> On Mon, Mar 1, 2010 at 3:25 PM, Vivek Goyal <vgoyal@redhat.com> wrote: >> >> > On Sat, Feb 27, 2010 at 07:45:40PM +0100, Corrado Zoccolo wrote: >> >> >> CFQ currently applies the same logic of detecting seeky queues and >> >> >> grouping them together for rotational disks as well as SSDs. >> >> >> For SSDs, the time to complete a request doesn't depend on the >> >> >> request location, but only on the size. >> >> >> This patch therefore changes the criterion to group queues by >> >> >> request size in case of SSDs, in order to achieve better fairness. >> >> > >> >> > Hi Corrado, >> >> > >> >> > Can you give some numbers regarding how are you measuring fairness and >> >> > how did you decide that we achieve better fairness? >> >> > >> >> Please, see the attached fio script. It benchmarks pairs of processes >> >> performing direct random I/O. 
>> >> One is always fixed at bs=4k , while I vary the other from 8K to 64K >> >> test00: (g=0): rw=randread, bs=8K-8K/8K-8K, ioengine=sync, iodepth=1 >> >> test01: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1 >> >> test10: (g=1): rw=randread, bs=16K-16K/16K-16K, ioengine=sync, iodepth=1 >> >> test11: (g=1): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1 >> >> test20: (g=2): rw=randread, bs=32K-32K/32K-32K, ioengine=sync, iodepth=1 >> >> test21: (g=2): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1 >> >> test30: (g=3): rw=randread, bs=64K-64K/64K-64K, ioengine=sync, iodepth=1 >> >> test31: (g=3): rw=randread, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1 >> >> >> >> With unpatched cfq (2.6.33), on a flash card (non-ncq), after running >> >> a fio script with high number of parallel readers to make sure ncq >> >> detection is stabilized, I get the following: >> >> Run status group 0 (all jobs): >> >> READ: io=21528KiB, aggrb=4406KiB/s, minb=1485KiB/s, maxb=2922KiB/s, >> >> mint=5001msec, maxt=5003msec >> >> >> >> Run status group 1 (all jobs): >> >> READ: io=31524KiB, aggrb=6452KiB/s, minb=1327KiB/s, maxb=5126KiB/s, >> >> mint=5002msec, maxt=5003msec >> >> >> >> Run status group 2 (all jobs): >> >> READ: io=46544KiB, aggrb=9524KiB/s, minb=1031KiB/s, maxb=8493KiB/s, >> >> mint=5001msec, maxt=5004msec >> >> >> >> Run status group 3 (all jobs): >> >> READ: io=64712KiB, aggrb=13242KiB/s, minb=761KiB/s, >> >> maxb=12486KiB/s, mint=5002msec, maxt=5004msec >> >> >> >> As you can see from minb, the process with smallest I/O size is >> >> penalized (the fact is that being both marked as noidle, they both end >> >> up in the noidle tree, where they are serviced round robin, so they >> >> get fairness in term of IOPS, but bandwidth varies a lot. 
>> >> >> >> With my patches in place, I get: >> >> Run status group 0 (all jobs): >> >> READ: io=21544KiB, aggrb=4409KiB/s, minb=1511KiB/s, maxb=2898KiB/s, >> >> mint=5002msec, maxt=5003msec >> >> >> >> Run status group 1 (all jobs): >> >> READ: io=32000KiB, aggrb=6549KiB/s, minb=1277KiB/s, maxb=5274KiB/s, >> >> mint=5001msec, maxt=5003msec >> >> >> >> Run status group 2 (all jobs): >> >> READ: io=39444KiB, aggrb=8073KiB/s, minb=1576KiB/s, maxb=6498KiB/s, >> >> mint=5002msec, maxt=5003msec >> >> >> >> Run status group 3 (all jobs): >> >> READ: io=49180KiB, aggrb=10059KiB/s, minb=1512KiB/s, >> >> maxb=8548KiB/s, mint=5001msec, maxt=5006msec >> >> >> >> The process doing smaller requests is now not penalized by the fact >> >> that it is run concurrently with the other one, and the other still >> >> benefits from larger requests because it uses better its time slice. >> >> >> > >> > Hi Corrado, >> > >> > I ran your fio script on my SSD which supports NCQ. In my case both the >> > processes lose. 
>> > >> > Vanilla kernel (2.6.33) >> > ======================= >> > Run status group 0 (all jobs): >> > READ: io=258MB, aggrb=52,903KB/s, minb=18,058KB/s, maxb=36,114KB/s, >> > mint=5001msec, maxt=5001msec >> > >> > Run status group 1 (all jobs): >> > READ: io=382MB, aggrb=78,301KB/s, minb=16,037KB/s, maxb=64,143KB/s, >> > mint=5001msec, maxt=5001msec >> > >> > Run status group 2 (all jobs): >> > READ: io=560MB, aggrb=112MB/s, minb=13,052KB/s, maxb=102MB/s, >> > mint=5001msec, maxt=5001msec >> > >> > Run status group 3 (all jobs): >> > READ: io=774MB, aggrb=155MB/s, minb=9,595KB/s, maxb=149MB/s, >> > mint=5001msec, maxt=5001msec >> > >> > With two seek patches applied >> > ============================= >> > Run status group 0 (all jobs): >> > READ: io=260MB, aggrb=53,148KB/s, minb=18,140KB/s, maxb=36,283KB/s, >> > mint=5001msec, maxt=5001msec >> > >> > Run status group 1 (all jobs): >> > READ: io=383MB, aggrb=78,484KB/s, minb=16,073KB/s, maxb=64,294KB/s, >> > mint=5001msec, maxt=5001msec >> > >> > Run status group 2 (all jobs): >> > READ: io=359MB, aggrb=73,534KB/s, minb=8,367KB/s, maxb=66,931KB/s, >> > mint=5001msec, maxt=5001msec >> > >> > Run status group 3 (all jobs): >> > READ: io=556MB, aggrb=111MB/s, minb=6,852KB/s, maxb=107MB/s, >> > mint=5001msec, maxt=5001msec >> > >> > Note, how overall BW suffers for group 2 and group 3. This needs some >> > debugging why it is happening. I guess we are idling on sync-noidle >> > tree and not idling on sync-idle tree, probably that's why bigger request >> > size process can't get enough IO pushed. But then smaller request size >> > process should have got more IO done but that also does not seem to >> > be happening. >> >> I think this is a preexisting bug. You can probably trigger it in >> vanilla 2.6.33 by having one process doing random and an other doing >> sequential reads. >> I think the issue is: >> * cfq_should_idle returns true for the last queue on a tree, even for NCQ SSDs: > > you are right. 
The way cfq_should_idle() is written, it will return true > for bothe sync-idle and sync-noidle trees. With your patch two random > readers go onto different service trees and now we start driving queue > depth 1 because of follwing. > > if (timer_pending(&cfqd->idle_slice_timer) || > (cfqq->dispatched && cfq_should_idle(cfqd, cfqq))) { > cfqq = NULL; > goto keep_queue; > } > > > I tested following patch and now throughput is almost back at the same levels > as wihtout your patch. > > --- > block/cfq-iosched.c | 7 +++++-- > 1 files changed, 5 insertions(+), 2 deletions(-) > > diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c > index 40840da..296aa23 100644 > --- a/block/cfq-iosched.c > +++ b/block/cfq-iosched.c > @@ -1788,9 +1788,12 @@ static bool cfq_should_idle(struct cfq_data *cfqd, struct cfq_queue *cfqq) > if (prio == IDLE_WORKLOAD) > return false; > > + /* Don't idle on NCQ SSDs */ > + if (blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag) > + return false; > + > /* We do for queues that were marked with idle window flag. */ > - if (cfq_cfqq_idle_window(cfqq) && > - !(blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag)) > + if (cfq_cfqq_idle_window(cfqq)) > return true; > > /* > -- > 1.6.2.5 > > > Now I am wondering what are the side affects of above change. > > One advantage of idling on service tree even on NCQ SSD is that READS get > more protection from WRITES. Especially if there are SSDs which implement > NCQ but really suck at WRITE speed or don't implement parallel IO > channels. I think the following code: /* * If this is an async queue and we have sync IO in flight, let it wait */ if (cfqd->rq_in_flight[BLK_RW_SYNC] && !cfq_cfqq_sync(cfqq)) return false; already provides this protection, regardless of the result of cfq_should_idle(). > > I see cfq_should_idle() being used at four places. 
> > - cfq_select_queue() > if (cfqq->cfqg->nr_cfqq == 1 && RB_EMPTY_ROOT(&cfqq->sort_list) > && cfqq->dispatched && cfq_should_idle(cfqd, cfqq)) { > cfqq = NULL; > goto keep_queue; > } > > This is for fairness in group scheduling. I think that impact here should > not be much because this condition is primarily hit on media where we idle > on the queue. > > On NCQ SSD, I don't think we get fairness in group scheduling much until > and unless a group is generating enough traffic to keep the request queue > full. > > - cfq_select_queue() > (cfqq->dispatched && cfq_should_idle(cfqd, cfqq))) { > cfqq = NULL; > > Should we really wait for a dispatched request to complete on NCQ SSD? I don't think so. This is the cause for the bad performance we are looking at. And remember that, without my patch, you will still get the same bad behaviour, just with different conditions (one random and one sequential reader, regardless of request size). > On SSD with parallel channel, I think this will reduce throughput? > > - cfq_may_dispatch() > > /* > * Drain async requests before we start sync IO > */ > if (cfq_should_idle(cfqd, cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC]) > return false; > > If NCQ SSD is implementing parallel IO channels, probably there is no need > to clear out WRITES before we start sync IO? Right. I think this is present for crazy NCQ rotational disks, where a write could sit in cache forever if a stream of reads is coming. > > - cfq_arm_slice_timer() > > if (!cfqd->cfq_slice_idle || !cfq_should_idle(cfqd, cfqq)) > return; > > cfq_arm_slice_timer() already check for NCQ SSD and does not arm timer > so this function should remain unaffected with this change. > > So the question is, should we wait on service tree on an NCQ SSD or not? I don't see a good reason to wait, and bad results if we do. > > Secondly, your patch of determining cfqq seeky nature based on request > size on SSD, helps only on non NCQ SSD. We know it helps on non NCQ. 
If we fix the waiting bug, it could help also on NCQ, in scenarios where lot of processes (> NCQ depth) are submitting requests with largely different sizes. Thanks Corrado > > Thanks > Vivek > ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 2/2] cfq-iosched: rethink seeky detection for SSDs 2010-03-05 22:31 ` Corrado Zoccolo @ 2010-03-08 14:08 ` Vivek Goyal 0 siblings, 0 replies; 16+ messages in thread From: Vivek Goyal @ 2010-03-08 14:08 UTC (permalink / raw) To: Corrado Zoccolo Cc: Jens Axboe, Linux-Kernel, Jeff Moyer, Shaohua Li, Gui Jianfeng On Fri, Mar 05, 2010 at 11:31:43PM +0100, Corrado Zoccolo wrote: [..] > > --- > > block/cfq-iosched.c | 7 +++++-- > > 1 files changed, 5 insertions(+), 2 deletions(-) > > > > diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c > > index 40840da..296aa23 100644 > > --- a/block/cfq-iosched.c > > +++ b/block/cfq-iosched.c > > @@ -1788,9 +1788,12 @@ static bool cfq_should_idle(struct cfq_data *cfqd, struct cfq_queue *cfqq) > > if (prio == IDLE_WORKLOAD) > > return false; > > > > + /* Don't idle on NCQ SSDs */ > > + if (blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag) > > + return false; > > + > > /* We do for queues that were marked with idle window flag. */ > > - if (cfq_cfqq_idle_window(cfqq) && > > - !(blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag)) > > + if (cfq_cfqq_idle_window(cfqq)) > > return true; > > > > /* > > -- > > 1.6.2.5 > > > > > > Now I am wondering what are the side affects of above change. > > > > One advantage of idling on service tree even on NCQ SSD is that READS get > > more protection from WRITES. Especially if there are SSDs which implement > > NCQ but really suck at WRITE speed or don't implement parallel IO > > channels. > I think the following code: > /* > * If this is an async queue and we have sync IO in flight, let it wait > */ > if (cfqd->rq_in_flight[BLK_RW_SYNC] && !cfq_cfqq_sync(cfqq)) > return false; > already provides this protection, regardless of the result of cfq_should_idle(). It does but in some cases it does not. For example, consider a case of one writer and one reader (sequential or seeky) trying to do some IO. 
Now the reader will do one read, then the writer will immediately pump a few requests (4-5 in the dispatch queue), then the reader does one more read, and so on; this can add to the reader's latency on bad NCQ SSDs. > > > > > I see cfq_should_idle() being used at four places. > > > > - cfq_select_queue() > > if (cfqq->cfqg->nr_cfqq == 1 && RB_EMPTY_ROOT(&cfqq->sort_list) > > && cfqq->dispatched && cfq_should_idle(cfqd, cfqq)) { > > cfqq = NULL; > > goto keep_queue; > > } > > > > This is for fairness in group scheduling. I think that impact here should > > not be much because this condition is primarily hit on media where we idle > > on the queue. > > > > On NCQ SSD, I don't think we get fairness in group scheduling much until > > and unless a group is generating enough traffic to keep the request queue > > full. > > > > - cfq_select_queue() > > (cfqq->dispatched && cfq_should_idle(cfqd, cfqq))) { > > cfqq = NULL; > > > > Should we really wait for a dispatched request to complete on NCQ SSD? > I don't think so. This is the cause for the bad performance we are looking at. > And remember that, without my patch, you will still get the same bad > behaviour, just with different conditions (one random and one > sequential reader, regardless of request size). > > On SSD with parallel channel, I think this will reduce throughput? > > > > - cfq_may_dispatch() > > > > /* > > * Drain async requests before we start sync IO > > */ > > if (cfq_should_idle(cfqd, cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC]) > > return false; > > > > If NCQ SSD is implementing parallel IO channels, probably there is no need > > to clear out WRITES before we start sync IO? > Right. I think this is present for crazy NCQ rotational disks, where a > write could sit in cache forever if a stream of reads is coming.
> > > > - cfq_arm_slice_timer() > > > > if (!cfqd->cfq_slice_idle || !cfq_should_idle(cfqd, cfqq)) > > return; > > > > cfq_arm_slice_timer() already check for NCQ SSD and does not arm timer > > so this function should remain unaffected with this change. > > > > So the question is, should we wait on service tree on an NCQ SSD or not? > I don't see a good reason to wait, and bad results if we do. Actually with my patch, we will not get fairness on NCQ SSD (group scheduling) until and unless a group has enough traffic to keep the SSD busy (or dispatch queue full). Previously we used to wait on empty service tree on NCQ SSD and now we will not. But I guess, we can move wait on the group under the tunable group_isolation=1, so that if one prefers more isolation between groups even at the expense of total throughput, he can choose to do so. > > > > Secondly, your patch of determining cfqq seeky nature based on request > > size on SSD, helps only on non NCQ SSD. > We know it helps on non NCQ. If we fix the waiting bug, it could help > also on NCQ, in scenarios where lot of processes (> NCQ depth) are > submitting requests with largely different sizes. Ok, I will run some more tests with my patch and if I don't see any major regressions, I will send it to Jens. Thanks Vivek ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC, PATCH 0/2] Reworking seeky detection for 2.6.34 [not found] <1267296340-3820-1-git-send-email-czoccolo@gmail.com> 2010-02-27 18:45 ` [PATCH 1/2] cfq-iosched: rework seeky detection Corrado Zoccolo @ 2010-02-28 18:41 ` Jens Axboe 2010-03-01 16:35 ` Vivek Goyal 2 siblings, 0 replies; 16+ messages in thread From: Jens Axboe @ 2010-02-28 18:41 UTC (permalink / raw) To: Corrado Zoccolo Cc: Linux-Kernel, Jeff Moyer, Vivek Goyal, Shaohua Li, Gui Jianfeng On Sat, Feb 27 2010, Corrado Zoccolo wrote: > > Hi, I'm resending the rework seeky detection patch, together with > the companion patch for SSDs, in order to get some testing on more > hardware. > > The first patch in the series fixes a regression introduced in 2.6.33 > for random mmap reads of more than one page, when multiple processes > are competing for the disk. > There is at least one HW RAID controller where it reduces performance, > though (but this controller generally performs worse with CFQ than > with NOOP, probably because it is performing non-work-conserving > I/O scheduling inside), so more testing on RAIDs is appreciated. > > The second patch changes the seeky detection logic to be meaningful > also for SSDs. A seeky request is one that doesn't utilize the full > bandwidth for the device. For SSDs, this happens for small requests, > regardless of their location. > With this change, the grouping of "seeky" requests done by CFQ can > result in a fairer distribution of disk service time among processes. Thanks, I have applied this. -- Jens Axboe ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC, PATCH 0/2] Reworking seeky detection for 2.6.34
  [not found] <1267296340-3820-1-git-send-email-czoccolo@gmail.com>
  2010-02-27 18:45 ` [PATCH 1/2] cfq-iosched: rework seeky detection Corrado Zoccolo
  2010-02-28 18:41 ` [RFC, PATCH 0/2] Reworking seeky detection for 2.6.34 Jens Axboe
@ 2010-03-01 16:35 ` Vivek Goyal
  2010-03-01 19:45 ` Vivek Goyal
  2010-03-01 23:01 ` Corrado Zoccolo
  2 siblings, 2 replies; 16+ messages in thread
From: Vivek Goyal @ 2010-03-01 16:35 UTC (permalink / raw)
  To: Corrado Zoccolo
  Cc: Jens Axboe, Linux-Kernel, Jeff Moyer, Shaohua Li, Gui Jianfeng

On Sat, Feb 27, 2010 at 07:45:38PM +0100, Corrado Zoccolo wrote:
>
> Hi, I'm resending the rework seeky detection patch, together with
> the companion patch for SSDs, in order to get some testing on more
> hardware.
>
> The first patch in the series fixes a regression introduced in 2.6.33
> for random mmap reads of more than one page, when multiple processes
> are competing for the disk.
> There is at least one HW RAID controller where it reduces performance,
> though (but this controller generally performs worse with CFQ than
> with NOOP, probably because it is performing non-work-conserving
> I/O scheduling inside), so more testing on RAIDs is appreciated.
>

Hi Corrado,

This time I don't have the machine where I had previously reported
regressions. But somebody has exported two LUNs to me from a storage box
over SAN and I have done my testing on that. With this seek patch
applied, I still see the regressions.
iosched=cfq   Filesz=1G   bs=64K

                    2.6.33                 2.6.33-seek
workload Set NR RDBW(KB/s) WRBW(KB/s) RDBW(KB/s) WRBW(KB/s)  %Rd  %Wr
-------- --- -- ---------- ---------- ---------- ---------- ---- ----
brrmmap    3  1       7113          0       7044          0   0%   0%
brrmmap    3  2       6977          0       6774          0  -2%   0%
brrmmap    3  4       7410          0       6181          0 -16%   0%
brrmmap    3  8       9405          0       6020          0 -35%   0%
brrmmap    3 16      11445          0       5792          0 -49%   0%

                    2.6.33                 2.6.33-seek
workload Set NR RDBW(KB/s) WRBW(KB/s) RDBW(KB/s) WRBW(KB/s)  %Rd  %Wr
-------- --- -- ---------- ---------- ---------- ---------- ---- ----
drrmmap    3  1       7195          0       7337          0   1%   0%
drrmmap    3  2       7016          0       6855          0  -2%   0%
drrmmap    3  4       7438          0       6103          0 -17%   0%
drrmmap    3  8       9298          0       6020          0 -35%   0%
drrmmap    3 16      11576          0       5827          0 -49%   0%

I have run buffered random reads on mmaped files (brrmmap) and direct
random reads on mmaped files (drrmmap) using fio, for an increasing
number of threads. Each test was run 3 times and the average of the
three sets is reported.

I have used filesize 1G and bs=64K and ran each test sample for 30
seconds.

Because with the new seek logic we will mark the above type of cfqq as
non-seeky and will idle on it, I take a significant hit in performance
on storage boxes which have more than one spindle.

So basically, the regression is not only on that particular RAID card
but also on other kinds of devices which can support more than one
spindle.

I will run some tests on a single SATA disk also, where this patch
should benefit.

Based on testing results so far, I am not a big fan of marking these
mmap queues as sync-idle. I guess if this patch really benefits, then we
first need to put in place some kind of logic to detect whether the
device is a single-spindle SATA disk, and mark mmap queues as sync only
on those disks.

Apart from synthetic workloads, where in practice is this patch helping
you?

Thanks
Vivek

> The second patch changes the seeky detection logic to be meaningful
> also for SSDs. A seeky request is one that doesn't utilize the full
> bandwidth for the device. For SSDs, this happens for small requests,
> regardless of their location.
> With this change, the grouping of "seeky" requests done by CFQ can
> result in a fairer distribution of disk service time among processes.

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [RFC, PATCH 0/2] Reworking seeky detection for 2.6.34
  2010-03-01 16:35 ` Vivek Goyal
@ 2010-03-01 19:45 ` Vivek Goyal
  2010-03-01 23:01 ` Corrado Zoccolo
  1 sibling, 0 replies; 16+ messages in thread
From: Vivek Goyal @ 2010-03-01 19:45 UTC (permalink / raw)
  To: Corrado Zoccolo
  Cc: Jens Axboe, Linux-Kernel, Jeff Moyer, Shaohua Li, Gui Jianfeng

On Mon, Mar 01, 2010 at 11:35:52AM -0500, Vivek Goyal wrote:
> On Sat, Feb 27, 2010 at 07:45:38PM +0100, Corrado Zoccolo wrote:
> >
> > Hi, I'm resending the rework seeky detection patch, together with
> > the companion patch for SSDs, in order to get some testing on more
> > hardware.
> >
> > The first patch in the series fixes a regression introduced in 2.6.33
> > for random mmap reads of more than one page, when multiple processes
> > are competing for the disk.
> > There is at least one HW RAID controller where it reduces performance,
> > though (but this controller generally performs worse with CFQ than
> > with NOOP, probably because it is performing non-work-conserving
> > I/O scheduling inside), so more testing on RAIDs is appreciated.
> >
>
> Hi Corrado,
>
> This time I don't have the machine where I had previously reported
> regressions. But somebody has exported two LUNs to me from a storage box
> over SAN and I have done my testing on that. With this seek patch
> applied, I still see the regressions.
>
> iosched=cfq   Filesz=1G   bs=64K
>
>                     2.6.33                 2.6.33-seek
> workload Set NR RDBW(KB/s) WRBW(KB/s) RDBW(KB/s) WRBW(KB/s)  %Rd  %Wr
> -------- --- -- ---------- ---------- ---------- ---------- ---- ----
> brrmmap    3  1       7113          0       7044          0   0%   0%
> brrmmap    3  2       6977          0       6774          0  -2%   0%
> brrmmap    3  4       7410          0       6181          0 -16%   0%
> brrmmap    3  8       9405          0       6020          0 -35%   0%
> brrmmap    3 16      11445          0       5792          0 -49%   0%
>
>                     2.6.33                 2.6.33-seek
> workload Set NR RDBW(KB/s) WRBW(KB/s) RDBW(KB/s) WRBW(KB/s)  %Rd  %Wr
> -------- --- -- ---------- ---------- ---------- ---------- ---- ----
> drrmmap    3  1       7195          0       7337          0   1%   0%
> drrmmap    3  2       7016          0       6855          0  -2%   0%
> drrmmap    3  4       7438          0       6103          0 -17%   0%
> drrmmap    3  8       9298          0       6020          0 -35%   0%
> drrmmap    3 16      11576          0       5827          0 -49%   0%
>
> I have run buffered random reads on mmaped files (brrmmap) and direct
> random reads on mmaped files (drrmmap) using fio, for an increasing
> number of threads. Each test was run 3 times and the average of the
> three sets is reported.
>
> I have used filesize 1G and bs=64K and ran each test sample for 30
> seconds.
>
> Because with the new seek logic we will mark the above type of cfqq as
> non-seeky and will idle on it, I take a significant hit in performance
> on storage boxes which have more than one spindle.
>
> So basically, the regression is not only on that particular RAID card
> but also on other kinds of devices which can support more than one
> spindle.
>
> I will run some tests on a single SATA disk also, where this patch
> should benefit.
>

Ok, some more results on a single SATA disk.
iosched=cfq   Filesz=1G   bs=64K

                    2.6.33                 2.6.33-seek
workload Set NR RDBW(KB/s) WRBW(KB/s) RDBW(KB/s) WRBW(KB/s)  %Rd  %Wr
-------- --- -- ---------- ---------- ---------- ---------- ---- ----
brrmmap    3  1       4200          0       4200          0   0%   0%
brrmmap    3  2       4214          0       4246          0   0%   0%
brrmmap    3  4       3296          0       3868          0  17%   0%
brrmmap    3  8       2442          0       3117          0  27%   0%
brrmmap    3 16       1895          0       2510          0  32%   0%

                    2.6.33                 2.6.33-seek
workload Set NR RDBW(KB/s) WRBW(KB/s) RDBW(KB/s) WRBW(KB/s)  %Rd  %Wr
-------- --- -- ---------- ---------- ---------- ---------- ---- ----
drrmmap    3  1       5476          0       5494          0   0%   0%
drrmmap    3  2       5065          0       5070          0   0%   0%
drrmmap    3  4       3607          0       4213          0  16%   0%
drrmmap    3  8       2474          0       3198          0  29%   0%
drrmmap    3 16       1912          0       2418          0  26%   0%

So we see improvements on a single SATA disk, as expected. But we lose
more on higher-end storage/hardware RAID setups.

Also ran the same test with bs=32K on the SATA disk.

iosched=cfq   Filesz=1G   bs=32K

                    2.6.33                 2.6.33-seek
workload Set NR RDBW(KB/s) WRBW(KB/s) RDBW(KB/s) WRBW(KB/s)  %Rd  %Wr
-------- --- -- ---------- ---------- ---------- ---------- ---- ----
brrmmap    3  1       2408          0       2374          0  -1%   0%
brrmmap    3  2       2045          0       2304          0  12%   0%
brrmmap    3  4       1687          0       1753          0   3%   0%
brrmmap    3  8       1697          0       1562          0  -7%   0%
brrmmap    3 16       1604          0       1573          0  -1%   0%

                    2.6.33                 2.6.33-seek
workload Set NR RDBW(KB/s) WRBW(KB/s) RDBW(KB/s) WRBW(KB/s)  %Rd  %Wr
-------- --- -- ---------- ---------- ---------- ---------- ---- ----
drrmmap    3  1       3171          0       3145          0   0%   0%
drrmmap    3  2       2634          0       2838          0   7%   0%
drrmmap    3  4       1844          0       1935          0   4%   0%
drrmmap    3  8       1761          0       1609          0  -8%   0%
drrmmap    3 16       1602          0       1573          0  -1%   0%

I think in this case the cfqq is not being marked as sync-idle and
continues to be sync-noidle.

So in summary, yes, we gain on single SATA disks for this test case but
lose on multi-spindle setups. IMHO, we should enhance this patch with
some kind of single-spindle detection and enable this functionality only
on those disks, so that higher-end storage does not incur the penalty.

Thanks
Vivek

> Based on testing results so far, I am not a big fan of marking these
> mmap queues as sync-idle. I guess if this patch really benefits, then
> we first need to put in place some kind of logic to detect whether the
> device is a single-spindle SATA disk, and mark mmap queues as sync only
> on those disks.
>
> Apart from synthetic workloads, where in practice is this patch helping
> you?
>
> Thanks
> Vivek
>
> > The second patch changes the seeky detection logic to be meaningful
> > also for SSDs. A seeky request is one that doesn't utilize the full
> > bandwidth for the device. For SSDs, this happens for small requests,
> > regardless of their location.
> > With this change, the grouping of "seeky" requests done by CFQ can
> > result in a fairer distribution of disk service time among processes.

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [RFC, PATCH 0/2] Reworking seeky detection for 2.6.34
  2010-03-01 16:35 ` Vivek Goyal
  2010-03-01 19:45 ` Vivek Goyal
@ 2010-03-01 23:01 ` Corrado Zoccolo
  2010-03-03 22:39 ` Corrado Zoccolo
  1 sibling, 1 reply; 16+ messages in thread
From: Corrado Zoccolo @ 2010-03-01 23:01 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jens Axboe, Linux-Kernel, Jeff Moyer, Shaohua Li, Gui Jianfeng

Hi Vivek,
On Mon, Mar 1, 2010 at 5:35 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Sat, Feb 27, 2010 at 07:45:38PM +0100, Corrado Zoccolo wrote:
>>
>> Hi, I'm resending the rework seeky detection patch, together with
>> the companion patch for SSDs, in order to get some testing on more
>> hardware.
>>
>> The first patch in the series fixes a regression introduced in 2.6.33
>> for random mmap reads of more than one page, when multiple processes
>> are competing for the disk.
>> There is at least one HW RAID controller where it reduces performance,
>> though (but this controller generally performs worse with CFQ than
>> with NOOP, probably because it is performing non-work-conserving
>> I/O scheduling inside), so more testing on RAIDs is appreciated.
>>
>
> Hi Corrado,
>
> This time I don't have the machine where I had previously reported
> regressions. But somebody has exported two LUNs to me from a storage box
> over SAN and I have done my testing on that. With this seek patch
> applied, I still see the regressions.
>
> iosched=cfq   Filesz=1G   bs=64K
>
>                     2.6.33                 2.6.33-seek
> workload Set NR RDBW(KB/s) WRBW(KB/s) RDBW(KB/s) WRBW(KB/s)  %Rd  %Wr
> -------- --- -- ---------- ---------- ---------- ---------- ---- ----
> brrmmap    3  1       7113          0       7044          0   0%   0%
> brrmmap    3  2       6977          0       6774          0  -2%   0%
> brrmmap    3  4       7410          0       6181          0 -16%   0%
> brrmmap    3  8       9405          0       6020          0 -35%   0%
> brrmmap    3 16      11445          0       5792          0 -49%   0%
>
>                     2.6.33                 2.6.33-seek
> workload Set NR RDBW(KB/s) WRBW(KB/s) RDBW(KB/s) WRBW(KB/s)  %Rd  %Wr
> -------- --- -- ---------- ---------- ---------- ---------- ---- ----
> drrmmap    3  1       7195          0       7337          0   1%   0%
> drrmmap    3  2       7016          0       6855          0  -2%   0%
> drrmmap    3  4       7438          0       6103          0 -17%   0%
> drrmmap    3  8       9298          0       6020          0 -35%   0%
> drrmmap    3 16      11576          0       5827          0 -49%   0%
>
> I have run buffered random reads on mmaped files (brrmmap) and direct
> random reads on mmaped files (drrmmap) using fio, for an increasing
> number of threads. Each test was run 3 times and the average of the
> three sets is reported.
>
> I have used filesize 1G and bs=64K and ran each test sample for 30
> seconds.
>
> Because with the new seek logic we will mark the above type of cfqq as
> non-seeky and will idle on it, I take a significant hit in performance
> on storage boxes which have more than one spindle.

Thanks for testing on a different setup.
I wonder if the wrong part for multi-spindle is the 64kb threshold.
Can you run with a larger bs, and see if there is a value for which
idling is better?
For example, on a 2-disk RAID 0 I would expect that a bs larger than
the stripe will still benefit from idling.

> So basically, the regression is not only on that particular RAID card
> but also on other kinds of devices which can support more than one
> spindle.
>
> I will run some tests on a single SATA disk also, where this patch
> should benefit.
>
> Based on testing results so far, I am not a big fan of marking these
> mmap queues as sync-idle. I guess if this patch really benefits, then
> we first need to put in place some kind of logic to detect whether the
> device is a single-spindle SATA disk, and mark mmap queues as sync only
> on those disks.
>
> Apart from synthetic workloads, where in practice is this patch helping
> you?

The synthetic workload mimics the page fault patterns that can be seen
on program startup, and that is the target of my optimization. In
2.6.32, we went in the direction of enabling idling also for seeky
queues, while 2.6.33 tried to be more friendly to parallel storage by
usually allowing more parallel requests. Unfortunately, this impacted
this peculiar access pattern, so we need to fix it somehow.

Thanks,
Corrado

> Thanks
> Vivek
>
>> The second patch changes the seeky detection logic to be meaningful
>> also for SSDs. A seeky request is one that doesn't utilize the full
>> bandwidth for the device. For SSDs, this happens for small requests,
>> regardless of their location.
>> With this change, the grouping of "seeky" requests done by CFQ can
>> result in a fairer distribution of disk service time among processes.

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [RFC, PATCH 0/2] Reworking seeky detection for 2.6.34
  2010-03-01 23:01 ` Corrado Zoccolo
@ 2010-03-03 22:39 ` Corrado Zoccolo
  2010-03-03 23:11 ` Vivek Goyal
  0 siblings, 1 reply; 16+ messages in thread
From: Corrado Zoccolo @ 2010-03-03 22:39 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jens Axboe, Linux-Kernel, Jeff Moyer, Shaohua Li, Gui Jianfeng

On Tue, Mar 2, 2010 at 12:01 AM, Corrado Zoccolo <czoccolo@gmail.com> wrote:
> Hi Vivek,
> On Mon, Mar 1, 2010 at 5:35 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
>> On Sat, Feb 27, 2010 at 07:45:38PM +0100, Corrado Zoccolo wrote:
>>>
>>> Hi, I'm resending the rework seeky detection patch, together with
>>> the companion patch for SSDs, in order to get some testing on more
>>> hardware.
>>>
>>> The first patch in the series fixes a regression introduced in 2.6.33
>>> for random mmap reads of more than one page, when multiple processes
>>> are competing for the disk.
>>> There is at least one HW RAID controller where it reduces performance,
>>> though (but this controller generally performs worse with CFQ than
>>> with NOOP, probably because it is performing non-work-conserving
>>> I/O scheduling inside), so more testing on RAIDs is appreciated.
>>>
>>
>> Hi Corrado,
>>
>> This time I don't have the machine where I had previously reported
>> regressions. But somebody has exported two LUNs to me from a storage box
>> over SAN and I have done my testing on that. With this seek patch
>> applied, I still see the regressions.
>>
>> iosched=cfq   Filesz=1G   bs=64K
>>
>>                     2.6.33                 2.6.33-seek
>> workload Set NR RDBW(KB/s) WRBW(KB/s) RDBW(KB/s) WRBW(KB/s)  %Rd  %Wr
>> -------- --- -- ---------- ---------- ---------- ---------- ---- ----
>> brrmmap    3  1       7113          0       7044          0   0%   0%
>> brrmmap    3  2       6977          0       6774          0  -2%   0%
>> brrmmap    3  4       7410          0       6181          0 -16%   0%
>> brrmmap    3  8       9405          0       6020          0 -35%   0%
>> brrmmap    3 16      11445          0       5792          0 -49%   0%
>>
>>                     2.6.33                 2.6.33-seek
>> workload Set NR RDBW(KB/s) WRBW(KB/s) RDBW(KB/s) WRBW(KB/s)  %Rd  %Wr
>> -------- --- -- ---------- ---------- ---------- ---------- ---- ----
>> drrmmap    3  1       7195          0       7337          0   1%   0%
>> drrmmap    3  2       7016          0       6855          0  -2%   0%
>> drrmmap    3  4       7438          0       6103          0 -17%   0%
>> drrmmap    3  8       9298          0       6020          0 -35%   0%
>> drrmmap    3 16      11576          0       5827          0 -49%   0%
>>
>> I have run buffered random reads on mmaped files (brrmmap) and direct
>> random reads on mmaped files (drrmmap) using fio, for an increasing
>> number of threads. Each test was run 3 times and the average of the
>> three sets is reported.

BTW, I think O_DIRECT doesn't affect mmap operation.

>> I have used filesize 1G and bs=64K and ran each test sample for 30
>> seconds.
>>
>> Because with the new seek logic we will mark the above type of cfqq as
>> non-seeky and will idle on it, I take a significant hit in performance
>> on storage boxes which have more than one spindle.

Thinking about this, can you check if your disks have a non-zero
/sys/block/sda/queue/optimal_io_size ?
From the comment in blk-settings.c, I see this should be non-zero for
RAIDs, so it may help discriminate the cases we want to optimize for.
It could also help in identifying the correct threshold.

> Thanks for testing on a different setup.
> I wonder if the wrong part for multi-spindle is the 64kb threshold.
> Can you run with a larger bs, and see if there is a value for which
> idling is better?
> For example, on a 2-disk RAID 0 I would expect that a bs larger than
> the stripe will still benefit from idling.

>> So basically, the regression is not only on that particular RAID card
>> but also on other kinds of devices which can support more than one
>> spindle.

Ok, makes sense. If the number of sequential pages read before jumping
to a random address is smaller than the raid stripe, we are wasting
potential parallelism.

>> I will run some tests on a single SATA disk also, where this patch
>> should benefit.
>>
>> Based on testing results so far, I am not a big fan of marking these
>> mmap queues as sync-idle. I guess if this patch really benefits, then
>> we first need to put in place some kind of logic to detect whether the
>> device is a single-spindle SATA disk, and mark mmap queues as sync only
>> on those disks.
>>
>> Apart from synthetic workloads, where in practice is this patch helping
>> you?

> The synthetic workload mimics the page fault patterns that can be seen
> on program startup, and that is the target of my optimization. In
> 2.6.32, we went in the direction of enabling idling also for seeky
> queues, while 2.6.33 tried to be more friendly to parallel storage by
> usually allowing more parallel requests. Unfortunately, this impacted
> this peculiar access pattern, so we need to fix it somehow.
>
> Thanks,
> Corrado

>> Thanks
>> Vivek
>>
>>> The second patch changes the seeky detection logic to be meaningful
>>> also for SSDs. A seeky request is one that doesn't utilize the full
>>> bandwidth for the device. For SSDs, this happens for small requests,
>>> regardless of their location.
>>> With this change, the grouping of "seeky" requests done by CFQ can
>>> result in a fairer distribution of disk service time among processes.

^ permalink raw reply	[flat|nested] 16+ messages in thread
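[Editor's note: the check Corrado asks for above amounts to reading one sysfs attribute. A minimal user-space sketch might look as follows; the helper names are hypothetical, the sysfs path is the one quoted in the thread, and a missing or unreadable attribute is reported as -1 ("unknown") rather than treated as zero.]

```c
#include <stdio.h>

/* Parse the single integer that queue/optimal_io_size contains.
 * Split out from the sysfs access so the parsing is testable. */
static long parse_optimal_io_size(FILE *f)
{
    long val;

    if (!f || fscanf(f, "%ld", &val) != 1)
        return -1; /* unreadable => unknown */
    return val;
}

/* Read /sys/block/<disk>/queue/optimal_io_size; striped (RAID/LVM)
 * devices may export a non-zero value here, which could be used to
 * discriminate multi-spindle devices from single disks. */
static long optimal_io_size(const char *disk)
{
    char path[128];
    FILE *f;
    long val;

    snprintf(path, sizeof(path),
             "/sys/block/%s/queue/optimal_io_size", disk);
    f = fopen(path, "r");
    if (!f)
        return -1;
    val = parse_optimal_io_size(f);
    fclose(f);
    return val;
}
```

As Vivek's reply below shows, the heuristic is not universal: his multipath setup reports 0 at every level, so a scheduler could not rely on this attribute alone.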
* Re: [RFC, PATCH 0/2] Reworking seeky detection for 2.6.34
  2010-03-03 22:39 ` Corrado Zoccolo
@ 2010-03-03 23:11 ` Vivek Goyal
  0 siblings, 0 replies; 16+ messages in thread
From: Vivek Goyal @ 2010-03-03 23:11 UTC (permalink / raw)
  To: Corrado Zoccolo
  Cc: Jens Axboe, Linux-Kernel, Jeff Moyer, Shaohua Li, Gui Jianfeng

On Wed, Mar 03, 2010 at 11:39:05PM +0100, Corrado Zoccolo wrote:
> On Tue, Mar 2, 2010 at 12:01 AM, Corrado Zoccolo <czoccolo@gmail.com> wrote:
> > Hi Vivek,
> > On Mon, Mar 1, 2010 at 5:35 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> >> On Sat, Feb 27, 2010 at 07:45:38PM +0100, Corrado Zoccolo wrote:
> >>>
> >>> Hi, I'm resending the rework seeky detection patch, together with
> >>> the companion patch for SSDs, in order to get some testing on more
> >>> hardware.
> >>>
> >>> The first patch in the series fixes a regression introduced in 2.6.33
> >>> for random mmap reads of more than one page, when multiple processes
> >>> are competing for the disk.
> >>> There is at least one HW RAID controller where it reduces performance,
> >>> though (but this controller generally performs worse with CFQ than
> >>> with NOOP, probably because it is performing non-work-conserving
> >>> I/O scheduling inside), so more testing on RAIDs is appreciated.
> >>>
> >>
> >> Hi Corrado,
> >>
> >> This time I don't have the machine where I had previously reported
> >> regressions. But somebody has exported two LUNs to me from a storage
> >> box over SAN and I have done my testing on that. With this seek patch
> >> applied, I still see the regressions.
> >>
> >> iosched=cfq   Filesz=1G   bs=64K
> >>
> >>                     2.6.33                 2.6.33-seek
> >> workload Set NR RDBW(KB/s) WRBW(KB/s) RDBW(KB/s) WRBW(KB/s)  %Rd  %Wr
> >> -------- --- -- ---------- ---------- ---------- ---------- ---- ----
> >> brrmmap    3  1       7113          0       7044          0   0%   0%
> >> brrmmap    3  2       6977          0       6774          0  -2%   0%
> >> brrmmap    3  4       7410          0       6181          0 -16%   0%
> >> brrmmap    3  8       9405          0       6020          0 -35%   0%
> >> brrmmap    3 16      11445          0       5792          0 -49%   0%
> >>
> >>                     2.6.33                 2.6.33-seek
> >> workload Set NR RDBW(KB/s) WRBW(KB/s) RDBW(KB/s) WRBW(KB/s)  %Rd  %Wr
> >> -------- --- -- ---------- ---------- ---------- ---------- ---- ----
> >> drrmmap    3  1       7195          0       7337          0   1%   0%
> >> drrmmap    3  2       7016          0       6855          0  -2%   0%
> >> drrmmap    3  4       7438          0       6103          0 -17%   0%
> >> drrmmap    3  8       9298          0       6020          0 -35%   0%
> >> drrmmap    3 16      11576          0       5827          0 -49%   0%
> >>
> >> I have run buffered random reads on mmaped files (brrmmap) and direct
> >> random reads on mmaped files (drrmmap) using fio, for an increasing
> >> number of threads. Each test was run 3 times and the average of the
> >> three sets is reported.
>
> BTW, I think O_DIRECT doesn't affect mmap operation.

Yes, just for the sake of curiosity I tested the O_DIRECT case also.

> >> I have used filesize 1G and bs=64K and ran each test sample for 30
> >> seconds.
> >>
> >> Because with the new seek logic we will mark the above type of cfqq as
> >> non-seeky and will idle on it, I take a significant hit in performance
> >> on storage boxes which have more than one spindle.
>
> Thinking about this, can you check if your disks have a non-zero
> /sys/block/sda/queue/optimal_io_size ?
> From the comment in blk-settings.c, I see this should be non-zero for
> RAIDs, so it may help discriminate the cases we want to optimize for.
> It could also help in identifying the correct threshold.

I have got a multipath device setup. But I see optimal_io_size=0 both on
the higher-level multipath device as well as on the underlying component
devices.

> > Thanks for testing on a different setup.
> > I wonder if the wrong part for multi-spindle is the 64kb threshold.
> > Can you run with a larger bs, and see if there is a value for which
> > idling is better?
> > For example, on a 2-disk RAID 0 I would expect that a bs larger than
> > the stripe will still benefit from idling.
>
> >> So basically, the regression is not only on that particular RAID card
> >> but also on other kinds of devices which can support more than one
> >> spindle.
>
> Ok, makes sense. If the number of sequential pages read before jumping
> to a random address is smaller than the raid stripe, we are wasting
> potential parallelism.

Actually, even if we are doing an IO size bigger than the stripe size,
one request will probably keep only request_size/stripe_size spindles
busy. We are still not exploiting the parallelism of the rest of the
spindles.

Secondly, in this particular case, because you are issuing 4K page reads
at a time, you are for sure going to keep only one spindle busy.

Increasing the block size to 128K or 256K does bring down the % of
regression, but I think that primarily comes from the fact that we have
now made the workload less random and more sequential (one seek after
256/4=64 sequential reads as opposed to one seek after 64K/4=16
sequential reads).
With bs=128K
============

                    2.6.33                 2.6.33-seek
workload Set NR RDBW(KB/s) WRBW(KB/s) RDBW(KB/s) WRBW(KB/s)  %Rd  %Wr
-------- --- -- ---------- ---------- ---------- ---------- ---- ----
brrmmap    3  1       8338          0       8532          0   2%   0%
brrmmap    3  2       8724          0       8553          0  -1%   0%
brrmmap    3  4       9577          0       8002          0 -16%   0%
brrmmap    3  8      11806          0       7990          0 -32%   0%
brrmmap    3 16      13329          0       8101          0 -39%   0%

With bs=256K
============

                    2.6.33                 2.6.33-seek
workload Set NR RDBW(KB/s) WRBW(KB/s) RDBW(KB/s) WRBW(KB/s)  %Rd  %Wr
-------- --- -- ---------- ---------- ---------- ---------- ---- ----
brrmmap    3  1       9778          0       9572          0  -2%   0%
brrmmap    3  2      10321          0      10029          0  -2%   0%
brrmmap    3  4      11132          0       9675          0 -13%   0%
brrmmap    3  8      13111          0      10057          0 -23%   0%
brrmmap    3 16      13910          0      10366          0 -25%   0%

So if we can detect that there are multiple spindles underneath, we can
probably make the non-seeky definition stricter: instead of looking for
4 seeky requests per 32 samples, we could say 2 seeky requests per 64
samples, etc. That could help a bit on storage with multiple spindles
behind a single LUN.

Thanks
Vivek

> >> I will run some tests on a single SATA disk also, where this patch
> >> should benefit.
> >>
> >> Based on testing results so far, I am not a big fan of marking these
> >> mmap queues as sync-idle. I guess if this patch really benefits, then
> >> we first need to put in place some kind of logic to detect whether the
> >> device is a single-spindle SATA disk, and mark mmap queues as sync
> >> only on those disks.
> >>
> >> Apart from synthetic workloads, where in practice is this patch
> >> helping you?
>
> > The synthetic workload mimics the page fault patterns that can be seen
> > on program startup, and that is the target of my optimization. In
> > 2.6.32, we went in the direction of enabling idling also for seeky
> > queues, while 2.6.33 tried to be more friendly to parallel storage by
> > usually allowing more parallel requests. Unfortunately, this impacted
> > this peculiar access pattern, so we need to fix it somehow.
>
> > Thanks,
> > Corrado
>
> >> Thanks
> >> Vivek
>
> >>> The second patch changes the seeky detection logic to be meaningful
> >>> also for SSDs. A seeky request is one that doesn't utilize the full
> >>> bandwidth for the device. For SSDs, this happens for small requests,
> >>> regardless of their location.
> >>> With this change, the grouping of "seeky" requests done by CFQ can
> >>> result in a fairer distribution of disk service time among processes.

^ permalink raw reply	[flat|nested] 16+ messages in thread
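[Editor's note: the spindle-utilization arithmetic in Vivek's reply above can be made concrete with two hypothetical helpers. A 4 KB page size is assumed, and `spindles_touched` is only the rough upper bound the argument uses, ignoring stripe alignment.]

```c
#include <assert.h>

/* A random-mmap reader faults in bs/page_size sequential pages before
 * seeking to a new random offset; this is the "sequential run" length
 * discussed in the thread (64K/4 = 16, 256/4 = 64). */
static unsigned seq_pages_per_run(unsigned bs_kb, unsigned page_kb)
{
    return bs_kb / page_kb;
}

/* A single request of size bs touches at most ceil(bs/stripe) stripes,
 * so it keeps roughly request_size/stripe_size spindles busy; the
 * remaining spindles in the array stay idle while CFQ idles on the
 * queue. */
static unsigned spindles_touched(unsigned bs_kb, unsigned stripe_kb)
{
    unsigned n = (bs_kb + stripe_kb - 1) / stripe_kb; /* round up */
    return n ? n : 1;
}
```

This is the crux of the disagreement in the thread: raising bs lengthens the sequential run (reducing the randomness penalty on a single disk) but barely increases the number of spindles a single queue can keep busy, so idling still costs throughput on multi-spindle storage.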
end of thread, other threads:[~2010-03-08 14:08 UTC | newest]
Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <1267296340-3820-1-git-send-email-czoccolo@gmail.com>
2010-02-27 18:45 ` [PATCH 1/2] cfq-iosched: rework seeky detection Corrado Zoccolo
2010-02-27 18:45 ` [PATCH 2/2] cfq-iosched: rethink seeky detection for SSDs Corrado Zoccolo
2010-03-01 14:25 ` Vivek Goyal
2010-03-03 19:47 ` Corrado Zoccolo
2010-03-03 21:21 ` Vivek Goyal
2010-03-03 23:28 ` Vivek Goyal
2010-03-04 20:34 ` Corrado Zoccolo
2010-03-04 22:27 ` Vivek Goyal
2010-03-05 22:31 ` Corrado Zoccolo
2010-03-08 14:08 ` Vivek Goyal
2010-02-28 18:41 ` [RFC, PATCH 0/2] Reworking seeky detection for 2.6.34 Jens Axboe
2010-03-01 16:35 ` Vivek Goyal
2010-03-01 19:45 ` Vivek Goyal
2010-03-01 23:01 ` Corrado Zoccolo
2010-03-03 22:39 ` Corrado Zoccolo
2010-03-03 23:11 ` Vivek Goyal
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox