* [PATCH] cfq: Make use of service count to estimate the rb_key offset
@ 2009-11-26  6:20 Gui Jianfeng
  2009-11-26  8:14 ` Jens Axboe
  2009-11-26  9:08 ` Corrado Zoccolo

From: Gui Jianfeng @ 2009-11-26 6:20 UTC
To: Jens Axboe, czoccolo; +Cc: Vivek Goyal, linux-kernel

Hi Jens, Czoccolo

For the moment, different workload cfq queues are put into different
service trees. But CFQ still uses "busy_queues" to estimate rb_key
offset when inserting a cfq queue into a service tree. I think this
isn't appropriate, and it should make use of service tree count to do
this estimation. This patch is for for-2.6.33 branch.

Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
---
 block/cfq-iosched.c |    8 ++++++--
 1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 1bcbd8c..467981e 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -600,11 +600,15 @@ cfq_find_next_rq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 static unsigned long cfq_slice_offset(struct cfq_data *cfqd,
 				      struct cfq_queue *cfqq)
 {
+	struct cfq_rb_root *service_tree;
+
+	service_tree = service_tree_for(cfqq_prio(cfqq), cfqq_type(cfqq), cfqd);
+
 	/*
 	 * just an approximation, should be ok.
 	 */
-	return (cfqd->busy_queues - 1) * (cfq_prio_slice(cfqd, 1, 0) -
-		       cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio));
+	return service_tree->count * (cfq_prio_slice(cfqd, 1, 0) -
+		       cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio));
 }
 
 /*
-- 
1.5.4.rc3

-- 
Regards
Gui Jianfeng
* Re: [PATCH] cfq: Make use of service count to estimate the rb_key offset
From: Jens Axboe @ 2009-11-26 8:14 UTC
To: Gui Jianfeng; +Cc: czoccolo, Vivek Goyal, linux-kernel

On Thu, Nov 26 2009, Gui Jianfeng wrote:
> Hi Jens, Czoccolo
>
> For the moment, different workload cfq queues are put into different
> service trees. But CFQ still uses "busy_queues" to estimate rb_key
> offset when inserting a cfq queue into a service tree. I think this
> isn't appropriate, and it should make use of service tree count to do
> this estimation. This patch is for for-2.6.33 branch.

Makes sense, we've before discussed that using ->busy_queues isn't very
good, as it's not a stable count. service_tree count isn't that much
better, but for for-2.6.33 it does make more sense.

-- 
Jens Axboe
* Re: [PATCH] cfq: Make use of service count to estimate the rb_key offset
From: Corrado Zoccolo @ 2009-11-26 9:08 UTC
To: Gui Jianfeng; +Cc: Jens Axboe, Vivek Goyal, linux-kernel

Hi Gui, Jens

On Thu, Nov 26, 2009 at 7:20 AM, Gui Jianfeng
<guijianfeng@cn.fujitsu.com> wrote:
> Hi Jens, Czoccolo
>
> For the moment, different workload cfq queues are put into different
> service trees. But CFQ still uses "busy_queues" to estimate rb_key
> offset when inserting a cfq queue into a service tree. I think this
> isn't appropriate, and it should make use of service tree count to do
> this estimation. This patch is for for-2.6.33 branch.

In cfq_choose_wl, we rely on consistency of rb_keys across service
trees to compute the next workload to be serviced.

	for (i = 0; i < 3; ++i) {
		/* otherwise, select the one with lowest rb_key */
		queue = cfq_rb_first(service_tree_for(prio, i, cfqd));
		if (queue &&
		    (!key_valid || time_before(queue->rb_key, lowest_key))) {
			lowest_key = queue->rb_key;
			cur_best = i;
			key_valid = true;
		}
	}

Changing how the rb_key is computed (so it is no longer consistent
across service trees) without changing how it is used can introduce
problems.
Thanks,
Corrado

> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
> ---
>  block/cfq-iosched.c |    8 ++++++--
>  1 files changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index 1bcbd8c..467981e 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -600,11 +600,15 @@ cfq_find_next_rq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
>  static unsigned long cfq_slice_offset(struct cfq_data *cfqd,
>  				      struct cfq_queue *cfqq)
>  {
> +	struct cfq_rb_root *service_tree;
> +
> +	service_tree = service_tree_for(cfqq_prio(cfqq), cfqq_type(cfqq), cfqd);
> +
>  	/*
>  	 * just an approximation, should be ok.
>  	 */
> -	return (cfqd->busy_queues - 1) * (cfq_prio_slice(cfqd, 1, 0) -
> -		       cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio));
> +	return service_tree->count * (cfq_prio_slice(cfqd, 1, 0) -
> +		       cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio));
>  }
>
>  /*
> --
> 1.5.4.rc3
>
> --
> Regards
> Gui Jianfeng

-- 
__________________________________________________________________________

dott. Corrado Zoccolo                          mailto:czoccolo@gmail.com
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------
The self-confidence of a warrior is not the self-confidence of the average
man. The average man seeks certainty in the eyes of the onlooker and calls
that self-confidence. The warrior seeks impeccability in his own eyes and
calls that humbleness.
                       Tales of Power - C. Castaneda
* Re: [PATCH] cfq: Make use of service count to estimate the rb_key offset
From: Gui Jianfeng @ 2009-11-27 1:42 UTC
To: Corrado Zoccolo; +Cc: Jens Axboe, Vivek Goyal, linux-kernel

Corrado Zoccolo wrote:
> Hi Gui, Jens
> On Thu, Nov 26, 2009 at 7:20 AM, Gui Jianfeng
> <guijianfeng@cn.fujitsu.com> wrote:
>> Hi Jens, Czoccolo
>>
>> For the moment, different workload cfq queues are put into different
>> service trees. But CFQ still uses "busy_queues" to estimate rb_key
>> offset when inserting a cfq queue into a service tree. I think this
>> isn't appropriate, and it should make use of service tree count to do
>> this estimation. This patch is for for-2.6.33 branch.
>
> In cfq_choose_wl, we rely on consistency of rb_keys across service
> trees to compute the next workload to be serviced.
> 	for (i = 0; i < 3; ++i) {
> 		/* otherwise, select the one with lowest rb_key */
> 		queue = cfq_rb_first(service_tree_for(prio, i, cfqd));
> 		if (queue &&
> 		    (!key_valid || time_before(queue->rb_key, lowest_key))) {
> 			lowest_key = queue->rb_key;
> 			cur_best = i;
> 			key_valid = true;
> 		}
> 	}
>
> If you change how the rb_key is computed (so it is no longer
> consistent across service trees) without changing how it is used can
> introduce problems.

Ok, I think I was missing this part. This part still behaves like old
CFQ regardless of workload type. I'm wondering why you prefer starting
from sync no-idle only when priorities switched, after that, you do it
like old CFQ behavior? In order to improve latency for sync no-idle
workload, is it possible to take workload type into account, not only
rely on rb_keys across service trees?
Thanks, Gui > > Thanks, > Corrado > >> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com> >> --- >> block/cfq-iosched.c | 8 ++++++-- >> 1 files changed, 6 insertions(+), 2 deletions(-) >> >> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c >> index 1bcbd8c..467981e 100644 >> --- a/block/cfq-iosched.c >> +++ b/block/cfq-iosched.c >> @@ -600,11 +600,15 @@ cfq_find_next_rq(struct cfq_data *cfqd, struct cfq_queue *cfqq, >> static unsigned long cfq_slice_offset(struct cfq_data *cfqd, >> struct cfq_queue *cfqq) >> { >> + struct cfq_rb_root *service_tree; >> + >> + service_tree = service_tree_for(cfqq_prio(cfqq), cfqq_type(cfqq), cfqd); >> + >> /* >> * just an approximation, should be ok. >> */ >> - return (cfqd->busy_queues - 1) * (cfq_prio_slice(cfqd, 1, 0) - >> - cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio)); >> + return service_tree->count * (cfq_prio_slice(cfqd, 1, 0) - >> + cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio)); >> } >> >> /* >> -- >> 1.5.4.rc3 >> >> -- >> Regards >> Gui Jianfeng >> >> > > > ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] cfq: Make use of service count to estimate the rb_key offset
From: Corrado Zoccolo @ 2009-11-27 8:16 UTC
To: Gui Jianfeng; +Cc: Jens Axboe, Vivek Goyal, linux-kernel

Hi Gui,

On Fri, Nov 27, 2009 at 2:42 AM, Gui Jianfeng
<guijianfeng@cn.fujitsu.com> wrote:
> Corrado Zoccolo wrote:
>> In cfq_choose_wl, we rely on consistency of rb_keys across service
>> trees to compute the next workload to be serviced.
>>
>> If you change how the rb_key is computed (so it is no longer
>> consistent across service trees) without changing how it is used can
>> introduce problems.
>
> Ok, I think I was missing this part. This part still behaves like old
> CFQ regardless of workload type. I'm wondering why you prefer starting
> from sync no-idle only when priorities switched, after that, you do it
> like old CFQ behavior?

When switching priorities (e.g. from RT to BE), we may come from a
long stall. In this case, I think it is better to run no-idle first.
During normal operation, instead, we want a fair, starvation-free way
to switch between workloads, and I thought it was simpler to mimic old
CFQ behaviour instead of cooking up a different method.

The difference between new and old CFQ is that now, when we decide to
service one no-idle request, we will then service subsequent ones from
the same workload type. This allows processing them optimally on NCQ
hardware. Moreover, when no more no-idle requests are available, but
the timeslice for this workload did not expire yet, we will wait for
more. This guarantees fairness for the no-idle workload.

> In order to improve latency for sync no-idle workload, is it possible
> to take workload type into account, not only rely on rb_keys across
> service trees?

When loading a program into memory, your process will go through
various phases w.r.t. disk access pattern: some are seeky, some others
are sequential.

If you just improve latency for one workload, penalizing the others,
you won't get an overall improvement of the system. The new scheme
improves overall system behaviour because grouping no-idle requests
together gives a better utilization of the disk, and fairness also
allows processes making seeky requests to progress. Penalizing the
idle service tree, instead, will give you lower overall throughput
(forbidding progress to the processes that make sequential requests),
while penalizing writeback will leave you waiting for dirty pages to
be freed more often, maybe incurring OOM conditions.

Regarding the rb_key computation, I have done various experiments, and
found that the actual formula doesn't matter much on rotational
hardware, where the slice length has most importance. But I think it
is essential on NCQ SSDs, to obtain fairness. Unfortunately, I don't
have an NCQ SSD, so I can't test my improvement ideas.

Thanks,
Corrado

> Thanks,
> Gui
* Re: [PATCH] cfq: Make use of service count to estimate the rb_key offset 2009-11-27 8:16 ` Corrado Zoccolo @ 2009-11-30 3:02 ` Gui Jianfeng 2009-11-30 8:38 ` Jens Axboe 0 siblings, 1 reply; 11+ messages in thread From: Gui Jianfeng @ 2009-11-30 3:02 UTC (permalink / raw) To: Corrado Zoccolo, Jens Axboe; +Cc: Vivek Goyal, linux-kernel Corrado Zoccolo wrote: > Hi Guy, > On Fri, Nov 27, 2009 at 2:42 AM, Gui Jianfeng > <guijianfeng@cn.fujitsu.com> wrote: >> Corrado Zoccolo wrote: >>> Hi Gui, Jens >>> On Thu, Nov 26, 2009 at 7:20 AM, Gui Jianfeng >>> <guijianfeng@cn.fujitsu.com> wrote: >>>> Hi Jens, Czoccolo >>>> >>>> For the moment, different workload cfq queues are put into different >>>> service trees. But CFQ still uses "busy_queues" to estimate rb_key >>>> offset when inserting a cfq queue into a service tree. I think this >>>> isn't appropriate, and it should make use of service tree count to do >>>> this estimation. This patch is for for-2.6.33 branch. >>> In cfq_choose_wl, we rely on consistency of rb_keys across service >>> trees to compute the next workload to be serviced. >>> for (i = 0; i < 3; ++i) { >>> /* otherwise, select the one with lowest rb_key */ >>> queue = cfq_rb_first(service_tree_for(prio, i, cfqd)); >>> if (queue && >>> (!key_valid || time_before(queue->rb_key, lowest_key))) { >>> lowest_key = queue->rb_key; >>> cur_best = i; >>> key_valid = true; >>> } >>> } >>> >>> If you change how the rb_key is computed (so it is no longer >>> consistent across service trees) without changing how it is used can >>> introduce problems. >> Ok, I think I was missing this part. This part still behaves like old CFQ regardless >> of workload type. I'm wondering why you prefer starting from sync no-idle only when >> priorities switched, after that, you do it like old CFQ behavior? > > When switching priorities (e.g. from RT to BE), we may come from a > long stall. In this case, I think it is better to run no-idle first. 
> During normal operation, instead, we want a fair, starvation free way > to switch between workloads, and I thought it was simpler to mimic old > CFQ behaviour, instead of cook up a different method. > The difference between new and old CFQ is that now, when we decide to > service one no-idle request, we will then service subsequent ones from > the same workload type. > This allows processing them optimally on NCQ hardware. > Moreover, when no more no-idle requests are available, but the > timeslice for this workload did not expire yet, we will wait for more. > This guarantees fairness for no-idle workload. > >> In order to improve >> latency for sync no-idle workload, is it possible to take workload type into account, >> not only rely on rb_keys across service trees? > When loading a program into memory, your process will go through > various phases w.r.t. disk access pattern: some are seeky, some others > are sequential. > > If you just improve latency for one workload, penalizing the others, > you won't get an overall improvement of the system. > The new scheme improves overall system behaviour because grouping > no-idle requests together gives a better utilization of the disk, and > fairness allows also processes making seeky requests to progress. > Penalizing the idle service tree, instead, you will give you lower > overall throughput (forbidding progress to the processes that make > sequential requests), while penalizing writeback you will find > yourself waiting for freeing dirty pages more often, and maybe > incurring in OOM conditions. > > Regarding the rb_key computation, I have done various experiments, and > found that the actual formula doesn't matter much on rotational > hardware, where the slice length has most importance. > But I think it is essential on NCQ SSDs, to obtain fairness. > Unfortunately, I don't have an NCQ SSD, so I can't test my improvement ideas. > Corrado, thanks for the detailed explanation. 
Jens, I think Corrado is right: we still need consistency of rb_keys
across service trees to compute the next workload type. So would you
revert this patch please?

Thanks,
Gui
* Re: [PATCH] cfq: Make use of service count to estimate the rb_key offset 2009-11-30 3:02 ` Gui Jianfeng @ 2009-11-30 8:38 ` Jens Axboe 0 siblings, 0 replies; 11+ messages in thread From: Jens Axboe @ 2009-11-30 8:38 UTC (permalink / raw) To: Gui Jianfeng; +Cc: Corrado Zoccolo, Vivek Goyal, linux-kernel On Mon, Nov 30 2009, Gui Jianfeng wrote: > Corrado Zoccolo wrote: > > Hi Guy, > > On Fri, Nov 27, 2009 at 2:42 AM, Gui Jianfeng > > <guijianfeng@cn.fujitsu.com> wrote: > >> Corrado Zoccolo wrote: > >>> Hi Gui, Jens > >>> On Thu, Nov 26, 2009 at 7:20 AM, Gui Jianfeng > >>> <guijianfeng@cn.fujitsu.com> wrote: > >>>> Hi Jens, Czoccolo > >>>> > >>>> For the moment, different workload cfq queues are put into different > >>>> service trees. But CFQ still uses "busy_queues" to estimate rb_key > >>>> offset when inserting a cfq queue into a service tree. I think this > >>>> isn't appropriate, and it should make use of service tree count to do > >>>> this estimation. This patch is for for-2.6.33 branch. > >>> In cfq_choose_wl, we rely on consistency of rb_keys across service > >>> trees to compute the next workload to be serviced. > >>> for (i = 0; i < 3; ++i) { > >>> /* otherwise, select the one with lowest rb_key */ > >>> queue = cfq_rb_first(service_tree_for(prio, i, cfqd)); > >>> if (queue && > >>> (!key_valid || time_before(queue->rb_key, lowest_key))) { > >>> lowest_key = queue->rb_key; > >>> cur_best = i; > >>> key_valid = true; > >>> } > >>> } > >>> > >>> If you change how the rb_key is computed (so it is no longer > >>> consistent across service trees) without changing how it is used can > >>> introduce problems. > >> Ok, I think I was missing this part. This part still behaves like old CFQ regardless > >> of workload type. I'm wondering why you prefer starting from sync no-idle only when > >> priorities switched, after that, you do it like old CFQ behavior? > > > > When switching priorities (e.g. from RT to BE), we may come from a > > long stall. 
In this case, I think it is better to run no-idle first. > > During normal operation, instead, we want a fair, starvation free way > > to switch between workloads, and I thought it was simpler to mimic old > > CFQ behaviour, instead of cook up a different method. > > The difference between new and old CFQ is that now, when we decide to > > service one no-idle request, we will then service subsequent ones from > > the same workload type. > > This allows processing them optimally on NCQ hardware. > > Moreover, when no more no-idle requests are available, but the > > timeslice for this workload did not expire yet, we will wait for more. > > This guarantees fairness for no-idle workload. > > > >> In order to improve > >> latency for sync no-idle workload, is it possible to take workload type into account, > >> not only rely on rb_keys across service trees? > > When loading a program into memory, your process will go through > > various phases w.r.t. disk access pattern: some are seeky, some others > > are sequential. > > > > If you just improve latency for one workload, penalizing the others, > > you won't get an overall improvement of the system. > > The new scheme improves overall system behaviour because grouping > > no-idle requests together gives a better utilization of the disk, and > > fairness allows also processes making seeky requests to progress. > > Penalizing the idle service tree, instead, you will give you lower > > overall throughput (forbidding progress to the processes that make > > sequential requests), while penalizing writeback you will find > > yourself waiting for freeing dirty pages more often, and maybe > > incurring in OOM conditions. > > > > Regarding the rb_key computation, I have done various experiments, and > > found that the actual formula doesn't matter much on rotational > > hardware, where the slice length has most importance. > > But I think it is essential on NCQ SSDs, to obtain fairness. 
> > Unfortunately, I don't have an NCQ SSD, so I can't test my
> > improvement ideas.
>
> Corrado, thanks for the detailed explanation.
>
> Jens, I think Corrado is right, we still need consistency of rb_keys
> across service to compute next workload type. So would you revert this
> patch please?

Yes, I agree. I thought I had already reverted it, will do so now.

-- 
Jens Axboe
* Re: [PATCH] cfq: Make use of service count to estimate the rb_key offset 2009-11-26 9:08 ` Corrado Zoccolo 2009-11-27 1:42 ` Gui Jianfeng @ 2009-11-30 15:36 ` Vivek Goyal 2009-11-30 16:01 ` Corrado Zoccolo 1 sibling, 1 reply; 11+ messages in thread From: Vivek Goyal @ 2009-11-30 15:36 UTC (permalink / raw) To: Corrado Zoccolo; +Cc: Gui Jianfeng, Jens Axboe, linux-kernel On Thu, Nov 26, 2009 at 10:08:36AM +0100, Corrado Zoccolo wrote: > Hi Gui, Jens > On Thu, Nov 26, 2009 at 7:20 AM, Gui Jianfeng > <guijianfeng@cn.fujitsu.com> wrote: > > Hi Jens, Czoccolo > > > > For the moment, different workload cfq queues are put into different > > service trees. But CFQ still uses "busy_queues" to estimate rb_key > > offset when inserting a cfq queue into a service tree. I think this > > isn't appropriate, and it should make use of service tree count to do > > this estimation. This patch is for for-2.6.33 branch. > > In cfq_choose_wl, we rely on consistency of rb_keys across service > trees to compute the next workload to be serviced. > for (i = 0; i < 3; ++i) { > /* otherwise, select the one with lowest rb_key */ > queue = cfq_rb_first(service_tree_for(prio, i, cfqd)); > if (queue && > (!key_valid || time_before(queue->rb_key, lowest_key))) { > lowest_key = queue->rb_key; > cur_best = i; > key_valid = true; > } > } > > If you change how the rb_key is computed (so it is no longer > consistent across service trees) without changing how it is used can > introduce problems. > Hi Corrado, currently rb_key seems to be combination of two things. busy_queues and jiffies. In new scheme, where we decide the share of a workload and then switch to new workload, dependence on busy_queues does not seem to make much sense. Assume, a bunch of sequential readers get backlogged and then few random readers gets backlogged. Now random reader will get higher rb_key because there are 8 sequential reders on sync-idle tree. 
IIUC, with above logic, even if we expire the sync-idle workload duration once, we might not switch to sync-noidle workload and start running the sync-idle workload again. (Because minimum slice length restrictions or if low_latency is not set). So instead of relying on rb_keys to switch the workload type, why not do it in round robin manner across the workload types? So rb_key will be significant only with-in service tree and not across service tree? Thanks Vivek > Thanks, > Corrado > > > > > Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com> > > --- > > block/cfq-iosched.c | 8 ++++++-- > > 1 files changed, 6 insertions(+), 2 deletions(-) > > > > diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c > > index 1bcbd8c..467981e 100644 > > --- a/block/cfq-iosched.c > > +++ b/block/cfq-iosched.c > > @@ -600,11 +600,15 @@ cfq_find_next_rq(struct cfq_data *cfqd, struct cfq_queue *cfqq, > > static unsigned long cfq_slice_offset(struct cfq_data *cfqd, > > struct cfq_queue *cfqq) > > { > > + struct cfq_rb_root *service_tree; > > + > > + service_tree = service_tree_for(cfqq_prio(cfqq), cfqq_type(cfqq), cfqd); > > + > > /* > > * just an approximation, should be ok. > > */ > > - return (cfqd->busy_queues - 1) * (cfq_prio_slice(cfqd, 1, 0) - > > - cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio)); > > + return service_tree->count * (cfq_prio_slice(cfqd, 1, 0) - > > + cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio)); > > } > > > > /* > > -- > > 1.5.4.rc3 > > > > -- > > Regards > > Gui Jianfeng > > > > > > > > -- > __________________________________________________________________________ > > dott. Corrado Zoccolo mailto:czoccolo@gmail.com > PhD - Department of Computer Science - University of Pisa, Italy > -------------------------------------------------------------------------- > The self-confidence of a warrior is not the self-confidence of the average > man. The average man seeks certainty in the eyes of the onlooker and calls > that self-confidence. 
> The warrior seeks impeccability in his own eyes and
> calls that humbleness.
> 	Tales of Power - C. Castaneda
* Re: [PATCH] cfq: Make use of service count to estimate the rb_key offset 2009-11-30 15:36 ` Vivek Goyal @ 2009-11-30 16:01 ` Corrado Zoccolo 2009-11-30 16:46 ` Vivek Goyal 0 siblings, 1 reply; 11+ messages in thread From: Corrado Zoccolo @ 2009-11-30 16:01 UTC (permalink / raw) To: Vivek Goyal; +Cc: Gui Jianfeng, Jens Axboe, linux-kernel On Mon, Nov 30, 2009 at 4:36 PM, Vivek Goyal <vgoyal@redhat.com> wrote: > Hi Corrado, > > currently rb_key seems to be combination of two things. busy_queues and > jiffies. > > In new scheme, where we decide the share of a workload and then switch to > new workload, dependence on busy_queues does not seem to make much sense. > > Assume, a bunch of sequential readers get backlogged and then few random > readers gets backlogged. Now random reader will get higher rb_key because > there are 8 sequential reders on sync-idle tree. Even if we didn't have busy_queues, we would have the same situation, e.g. we have the 8 seq readers at time t (jiffies), and the seeky readers at time t+1. busy_queues really doesn't add anything when all queues have the same priority. > > IIUC, with above logic, even if we expire the sync-idle workload duration > once, we might not switch to sync-noidle workload and start running the > sync-idle workload again. (Because minimum slice length restrictions or > if low_latency is not set). Yes. > > So instead of relying on rb_keys to switch the workload type, why not do > it in round robin manner across the workload types? So rb_key will be > significant only with-in service tree and not across service tree? This is a good option. I have also tested it, and it works quite well (you can even have an async penalization like in deadline, so you do few rounds between seq and seeky, and then one of async). Besides a more complex code, I felt it was against the spirit of CFQ, since in that way, you are not providing fairness across workloads (especially if you don't want low_latency). 
BTW, my idea how to improve the rb_key computation is:
 * for NCQ SSD (or when sched_idle = 0):
	rb_key = jiffies - function(priority)
 * for others:
	rb_key = jiffies - sched_resid

Currently, sched_resid is meaningless for NCQ SSD, since we always
expire the queue immediately. Subtracting sched_resid would just give
an advantage to a queue that already dispatched over the ones that
didn't.

Priority, instead, should be used only for NCQ SSD. For the others,
priority already affects the time slice, so having it here would cause
over-prioritization.

Thanks,
Corrado

> Thanks
> Vivek
* Re: [PATCH] cfq: Make use of service count to estimate the rb_key offset 2009-11-30 16:01 ` Corrado Zoccolo @ 2009-11-30 16:46 ` Vivek Goyal 2009-11-30 21:56 ` Corrado Zoccolo 0 siblings, 1 reply; 11+ messages in thread From: Vivek Goyal @ 2009-11-30 16:46 UTC (permalink / raw) To: Corrado Zoccolo; +Cc: Gui Jianfeng, Jens Axboe, linux-kernel On Mon, Nov 30, 2009 at 05:01:28PM +0100, Corrado Zoccolo wrote: > On Mon, Nov 30, 2009 at 4:36 PM, Vivek Goyal <vgoyal@redhat.com> wrote: > > Hi Corrado, > > > > currently rb_key seems to be combination of two things. busy_queues and > > jiffies. > > > > In new scheme, where we decide the share of a workload and then switch to > > new workload, dependence on busy_queues does not seem to make much sense. > > > > Assume, a bunch of sequential readers get backlogged and then few random > > readers gets backlogged. Now random reader will get higher rb_key because > > there are 8 sequential reders on sync-idle tree. > Even if we didn't have busy_queues, we would have the same situation, > e.g. we have the 8 seq readers at time t (jiffies), and the seeky > readers at time t+1. > busy_queues really doesn't add anything when all queues have the same priority. > > > > IIUC, with above logic, even if we expire the sync-idle workload duration > > once, we might not switch to sync-noidle workload and start running the > > sync-idle workload again. (Because minimum slice length restrictions or > > if low_latency is not set). > Yes. > > > > So instead of relying on rb_keys to switch the workload type, why not do > > it in round robin manner across the workload types? So rb_key will be > > significant only with-in service tree and not across service tree? > This is a good option. I have also tested it, and it works quite well > (you can even have an async penalization like in deadline, so you do > few rounds between seq and seeky, and then one of async). Or to keep it even simpler just reduce the share of async workload per round. 
Currently you already reduce that share in the ratio of sync/async base slices. > Besides a > more complex code, I felt it was against the spirit of CFQ, since in > that way, you are not providing fairness across workloads (especially > if you don't want low_latency). I am not sure what is the spirit of CFQ when it comes to serving the various workloads currently. It seems sync queues get the maximum share and that led to starvation of random seeky readers. Your patches of idling on sync-noidle tree improved that situation by increasing the disk share for sync-noidle workload. When it comes to disk share for async workload, I don't think CFQ has any fixed formula for that. We do slice length calculation but to me it is of not much use most of the time as async queues are preempted by sync queues. Yes resid computation should help here a bit but still preempting queue gets to run first always (at least in old cfq). Now in new cfq, to me even if we preempt, we will be put at the front of the respective serviece tree but we might continue to dispatch from async workload as time slice for that workload has not expired and till then we will not choose a new workload. So I am not sure if old cfq had any notion of in what ratio async will get to use the disk. Only thing we ensured was that async queue should not increase the latency of sync queues and put various hooks like sync queue can preempt async queue, don't allow dispatch from async queue if sync requests are in flight or build up the async queue depth slowly etc. So the point is that as such old CFQ was not guranteeing anything about the share of type of workload. So by enforcing round robin between workload type we should not be breaking any gurantee. 
> > BTW, my idea how to improve the rb_key computation is: > * for NCQ SSD (or when sched_idle = 0): > rb_key = jiffies - function(priority) > * for others: > rb_key = jiffies - sched_resid > > Currently, sched_resid is meaningless for NCQ SSD, since we always > expire the queue immediately. Subtracting sched_resid would just give > an advantage to a queue that already dispatched over the ones that > didn't. In NCQ SSD, will resid be not zero most of the time? I thought resid is set to something only if queue is preemted. For usual expiry by cfq_select_queue(), timed_out=0 and resid=0. So on NCQ SSD resid should not be playing any role. > Priority, instead, should be used only for NCQ SSD. For the others, > priority already affects time slice, so having it here would cause > over-prioritization. What's the goal here? By doing this computation, what do we gain, Is it about getting better service differentation on NCQ SSDs for different prio processes? So providing lower rb_key for higher prio process should help on NCQ SSD. That's a different thing it might not be very deterministic. Thanks Vivek ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] cfq: Make use of service count to estimate the rb_key offset
  2009-11-30 16:46 ` Vivek Goyal
@ 2009-11-30 21:56 ` Corrado Zoccolo
  0 siblings, 0 replies; 11+ messages in thread

From: Corrado Zoccolo @ 2009-11-30 21:56 UTC (permalink / raw)
To: Vivek Goyal; +Cc: Gui Jianfeng, Jens Axboe, linux-kernel

Hi Vivek,

On Mon, Nov 30, 2009 at 5:46 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Mon, Nov 30, 2009 at 05:01:28PM +0100, Corrado Zoccolo wrote:
>> This is a good option. I have also tested it, and it works quite well
>> (you can even have an async penalization like in deadline, so you do
>> a few rounds between seq and seeky, and then one of async).
>
> Or to keep it even simpler just reduce the share of async workload per
> round. Currently you already reduce that share in the ratio of
> sync/async base slices.

Reducing the share per round can increase seekiness. Having a big
enough share every few rounds, instead, will have better throughput.

>> Besides a more complex code, I felt it was against the spirit of CFQ,
>> since in that way, you are not providing fairness across workloads
>> (especially if you don't want low_latency).
>
> I am not sure what the spirit of CFQ is when it comes to serving the
> various workloads currently.

CFQ's spirit is to be completely fair. 2.6.32 with low_latency
introduced fairness also for seeky queues.

> It seems sync queues get the maximum share and that led to starvation
> of random seeky readers. Your patches of idling on the sync-noidle
> tree improved that situation by increasing the disk share for the
> sync-noidle workload.

Yes. But we don't want now to penalize the sequential processes. We
want workloads serviced fairly, so if the 8 sequential queues all
requested service before the seeky ones, they should be serviced first,
unless we have preemption (e.g. one seeky queue wants to read
metadata). In that case, we will start servicing the metadata request,
and then also the other seeky ones, to fill the NCQ slots.
> When it comes to disk share for the async workload, I don't think CFQ
> has any fixed formula for that. We do slice length calculation, but to
> me it is of not much use most of the time, as async queues are
> preempted by sync queues. Yes, resid computation should help here a
> bit, but still the preempting queue always gets to run first (at least
> in old CFQ). Now in new CFQ, to me, even if we preempt, we will be put
> at the front of the respective service tree, but we might continue to
> dispatch from the async workload as the time slice for that workload
> has not expired, and till then we will not choose a new workload.

Right. Async is another story. But for sync, CFQ wants to provide
fairness in the long run, so penalizing sequential queues w.r.t. seeky
ones is not advisable. You should note that, when loading an
application, you will start being seeky, and then become sequential. If
you service seeky queues with higher priority, your applications will
complete the seeky phase quickly, and then still consume a lot of time
during the sequential part, so the perceived latency will not improve.

> So the point is that old CFQ as such was not guaranteeing anything
> about the share each type of workload gets. So by enforcing round
> robin between workload types we should not be breaking any guarantee.
>
>> BTW, my idea how to improve the rb_key computation is:
>> * for NCQ SSD (or when sched_idle = 0):
>>       rb_key = jiffies - function(priority)
>> * for others:
>>       rb_key = jiffies - sched_resid
>>
>> Currently, sched_resid is meaningless for NCQ SSD, since we always
>> expire the queue immediately. Subtracting sched_resid would just give
>> an advantage to a queue that already dispatched over the ones that
>> didn't.
>
> On an NCQ SSD, won't resid be zero most of the time? I thought resid
> is set to something only if the queue is preempted. For the usual
> expiry by cfq_select_queue(), timed_out=0 and resid=0. So on an NCQ
> SSD resid should not be playing any role.

Good. So I still don't have an explanation for the apparent starvation
of some queues when many sequential queues are requesting service. Your
test case with 4k direct reads showed it clearly, more than can be seen
when passing through the page cache and readahead. Do you have any idea
why this happened?

>> Priority, instead, should be used only for NCQ SSD. For the others,
>> priority already affects time slice, so having it here would cause
>> over-prioritization.
>
> What's the goal here? By doing this computation, what do we gain? Is
> it about getting better service differentiation on NCQ SSDs for
> processes of different priority? Then providing a lower rb_key for a
> higher priority process should help on an NCQ SSD. That's a different
> thing; it might not be very deterministic.

Yes. It may not be deterministic, but it will provide some
differentiation. Maybe it will work better when we have autotuning
automatically set slice_idle=0. However, my point is that in the
current formula priority is already present, but we should remove it
when the disk is rotational or non-NCQ, since it will give
over-prioritization.

Thanks,
Corrado

> Thanks
> Vivek

-- 
__________________________________________________________________________
dott. Corrado Zoccolo                          mailto:czoccolo@gmail.com
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------
The self-confidence of a warrior is not the self-confidence of the
average man. The average man seeks certainty in the eyes of the onlooker
and calls that self-confidence. The warrior seeks impeccability in his
own eyes and calls that humbleness.
                               Tales of Power - C. Castaneda

^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2009-11-30 21:56 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-11-26  6:20 [PATCH] cfq: Make use of service count to estimate the rb_key offset Gui Jianfeng
2009-11-26  8:14 ` Jens Axboe
2009-11-26  9:08 ` Corrado Zoccolo
2009-11-27  1:42 ` Gui Jianfeng
2009-11-27  8:16 ` Corrado Zoccolo
2009-11-30  3:02 ` Gui Jianfeng
2009-11-30  8:38 ` Jens Axboe
2009-11-30 15:36 ` Vivek Goyal
2009-11-30 16:01 ` Corrado Zoccolo
2009-11-30 16:46 ` Vivek Goyal
2009-11-30 21:56 ` Corrado Zoccolo