* [RFC PATCH] cfq-iosched: Implement group idle V2
@ 2010-07-19 17:20 Vivek Goyal
2010-07-19 17:20 ` [PATCH 1/3] cfq-iosched: Improve time slice charging logic Vivek Goyal
` (2 more replies)
0 siblings, 3 replies; 15+ messages in thread
From: Vivek Goyal @ 2010-07-19 17:20 UTC (permalink / raw)
To: linux-kernel, axboe; +Cc: nauman, dpshah, guijianfeng, jmoyer, czoccolo, vgoyal
[ Got Jens's mail id wrong in last post hence reposting. Sorry for cluttering
your mailboxes.]
Hi,
This is V2 of the group_idle implementation patchset. I have done some more
testing since V1 and fixed a couple of bugs.
What's the problem
------------------
On high end storage (I tested on an HP EVA storage array with 12 SATA disks in
RAID 5), CFQ's model of dispatching requests from a single queue at a
time (sequential readers, sync writers, etc.) becomes a bottleneck.
Often we don't drive enough request queue depth to keep all the disks busy,
and we suffer a lot in terms of overall throughput.
These problems primarily originate from two things: idling on a per-cfq-queue
basis, and the quantum (dispatching only a limited number of requests from a
single queue while not allowing dispatch from other queues until then). Once
you set slice_idle=0 and raise the quantum, most of CFQ's problems on
high-end storage disappear.
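For reference, the knobs mentioned above live in the scheduler's sysfs directory. A minimal sketch of the tuning described, run here against a throwaway directory standing in for /sys/block/&lt;dev&gt;/queue/iosched (assumption: the real path depends on the device and requires CFQ to be the active scheduler):

```shell
# Stand-in for /sys/block/<dev>/queue/iosched (the real tunables live
# there when CFQ is the active elevator).
IOSCHED=$(mktemp -d)

# Defaults on a rotational disk: idle up to ~8ms per queue, small quantum.
echo 8 > "$IOSCHED/slice_idle"
echo 8 > "$IOSCHED/quantum"

# High-end storage tuning from the text: stop per-queue idling and allow
# more requests from a single queue per dispatch round.
echo 0  > "$IOSCHED/slice_idle"
echo 16 > "$IOSCHED/quantum"

cat "$IOSCHED/slice_idle" "$IOSCHED/quantum"
```

On a real system the two `echo` lines would target the actual sysfs path; the temp directory is only there so the commands can be exercised safely.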
This problem also becomes visible in the IO controller, where one creates
multiple groups and gets the fairness, but overall throughput is lower. In
the following table, I am running an increasing number of sequential readers
(1, 2, 4, 8) in 8 groups of weight 100 to 800.
Kernel=2.6.35-rc5-gi-sl-accounting+
GROUPMODE=1 NRGRP=8
DIR=/mnt/iostestmnt/fio DEV=/dev/dm-4
Workload=bsr iosched=cfq Filesz=512M bs=4K
group_isolation=1 slice_idle=8 group_idle=8 quantum=8
=========================================================================
AVERAGE[bsr] [bw in KB/s]
-------
job Set NR test1 test2 test3 test4 test5 test6 test7 test8 total
--- --- -- ---------------------------------------------------------------
bsr 1 1 6245 12776 16591 23471 28746 36799 43031 49778 217437
bsr 1 2 5100 11063 17216 23136 23738 30391 35910 40874 187428
bsr 1 4 4623 9718 14746 18356 22875 30407 33215 38073 172013
bsr 1 8 4720 10143 13499 19115 22157 29126 31688 30784 161232
Notice that overall throughput is just around 160MB/s with 8 sequential readers
in each group.
With this patch set, I set slice_idle=0 and re-ran the same test.
Kernel=2.6.35-rc5-gi-sl-accounting+
GROUPMODE=1 NRGRP=8
DIR=/mnt/iostestmnt/fio DEV=/dev/dm-4
Workload=bsr iosched=cfq Filesz=512M bs=4K
group_isolation=1 slice_idle=0 group_idle=8 quantum=8
=========================================================================
AVERAGE[bsr] [bw in KB/s]
-------
job Set NR test1 test2 test3 test4 test5 test6 test7 test8 total
--- --- -- ---------------------------------------------------------------
bsr 1 1 6789 12764 17174 23111 28528 36807 42753 48826 216752
bsr 1 2 9845 20617 30521 39061 45514 51607 63683 63770 324618
bsr 1 4 14835 24241 42444 55207 45914 51318 54661 60318 348938
bsr 1 8 12022 24478 36732 48651 54333 60974 64856 72930 374976
Notice how overall throughput has shot up to 374MB/s while retaining the
ability to do IO control.
So this is not the default mode. The new tunable group_idle allows one to
set slice_idle=0 to disable some of CFQ's features and primarily use the
group service differentiation feature.
If you have thoughts on other ways of solving the problem, I am all ears.
Thanks
Vivek
^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 1/3] cfq-iosched: Improve time slice charging logic
  2010-07-19 17:20 [RFC PATCH] cfq-iosched: Implement group idle V2 Vivek Goyal
@ 2010-07-19 17:20 ` Vivek Goyal
  2010-07-19 18:47   ` Jeff Moyer
  2010-07-19 17:20 ` [PATCH 2/3] cfq-iosched: Implement a new tunable group_idle Vivek Goyal
  2010-07-19 17:20 ` [PATCH 3/3] cfq-iosched: Print per slice sectors dispatched in blktrace Vivek Goyal
  2 siblings, 1 reply; 15+ messages in thread
From: Vivek Goyal @ 2010-07-19 17:20 UTC (permalink / raw)
To: linux-kernel, axboe; +Cc: nauman, dpshah, guijianfeng, jmoyer, czoccolo, vgoyal

- Currently in CFQ there are many situations where we don't know how much
  time slice has been consumed by a queue. For example, all the random
  reader/writer queues where we don't idle on individual queues and expire
  the queue immediately after the request dispatch.

- In these cases the time consumed by a queue is just a memory copy
  operation. Time measurement is really possible only if we idle on a
  queue and allow dispatch from it for a significant amount of time.

- As of today, in such cases we calculate the time since dispatch from the
  queue started and charge all of it. Generally this rounds to 1 jiffy, but
  in some cases it can be more. For example, if we are driving a high
  request queue depth and the driver is too busy and does not ask for new
  requests for 8-10 jiffies, the active queue gets charged very unfairly.

- So fundamentally, the whole notion of charging for a time slice is valid
  only if we have been idling on the queue. Otherwise, on an NCQ device
  there might be other requests in the queue and we cannot do the time
  slice calculation.

- This patch tweaks the slice charging logic a bit so that in the cases
  where we can't know the amount of time, we start charging in terms of
  the number of requests dispatched (IOPS). This practically switches
  CFQ's fairness model to fairness in terms of IOPS with slice_idle=0.

- As of today this will primarily be useful only with the group_idle
  patches, so that we get fairness in terms of IOPS across groups. The
  idea is that on fast storage one can run CFQ with slice_idle=0 and still
  get the IO controller working without losing too much throughput.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/cfq-iosched.c |   24 +++++++++++++++++++++---
 1 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 7982b83..f44064c 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -896,16 +896,34 @@ static inline unsigned int cfq_cfqq_slice_usage(struct cfq_queue *cfqq)
		 * if there are mutiple queues in the group, each can dispatch
		 * a single request on seeky media and cause lots of seek time
		 * and group will never know it.
+		 *
+		 * If drive is NCQ and we are driving deep queue depths, then
+		 * it is not reasonable to charge the slice since dispatch
+		 * started because this time will include time taken by all
+		 * the other requests in the queue.
+		 *
+		 * Actually there is no reasonable way to know the disk time
+		 * here and we need to come up with some approximation. If
+		 * disk is non NCQ, we should be driving request queue depth
+		 * 1, then charge for time since dispatch start and this will
+		 * account for seek time properly on seeky media. If request
+		 * queue depth is high, then charge for number of requests
+		 * dispatched from the queue. This will sort of becoming
+		 * charging in terms of IOPS.
		 */
-		slice_used = max_t(unsigned, (jiffies - cfqq->dispatch_start),
-					1);
+		if (cfqq->cfqd->hw_tag == 0)
+			slice_used = max_t(unsigned,
+				(jiffies - cfqq->dispatch_start), 1);
+		else
+			slice_used = cfqq->slice_dispatch;
	} else {
		slice_used = jiffies - cfqq->slice_start;
		if (slice_used > cfqq->allocated_slice)
			slice_used = cfqq->allocated_slice;
	}

-	cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u", slice_used);
+	cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u, sl_disp=%u", slice_used,
+			cfqq->slice_dispatch);
	return slice_used;
}
--
1.7.1.1
* Re: [PATCH 1/3] cfq-iosched: Improve time slice charging logic
  2010-07-19 17:20 ` [PATCH 1/3] cfq-iosched: Improve time slice charging logic Vivek Goyal
@ 2010-07-19 18:47   ` Jeff Moyer
  2010-07-19 18:58     ` Vivek Goyal
  0 siblings, 1 reply; 15+ messages in thread
From: Jeff Moyer @ 2010-07-19 18:47 UTC (permalink / raw)
To: Vivek Goyal; +Cc: linux-kernel, axboe, nauman, dpshah, guijianfeng, czoccolo

Vivek Goyal <vgoyal@redhat.com> writes:

> - Currently in CFQ there are many situations where don't know how
>   much time slice has been consumed by a queue. For example, all
>   the random reader/writer queues where we don't idle on
>   individual queues and we expire the queue either immediately
>   after the request dispatch.
>
> - In this case time consumed by a queue is just a memory copy
>   operation. Actually time measurement is possible only if we
>   idle on a queue and allow dispatch from a queue for significant
>   amount of time.
>
> - As of today, in such cases we calculate the time since the
>   dispatch from the queue started and charge all that time.
>   Generally this rounds to 1 jiffy but in some cases it can
>   be more. For example, if we are driving high request queue
>   depth and driver is too busy and does not ask for new
>   requests for 8-10 jiffies. In such cases, the active queue
>   gets charged very unfairly.
>
> - So fundamentally, whole notion of charging for time slice
>   is valid only if we have been idling on the queue. Otherwise
>   in an NCQ queue, there might be other requests on the queue
>   and we can not do the time slice calculation.
>
> - This patch tweaks the slice charging logic a bit so that
>   in the cases where we can't know the amount of time, we
>   start charging in terms of number of requests dispatched
>   (IOPS). This practically switching CFQ fairness model to
>   fairness in terms of IOPS with slice_idle=0.
>
> - As of today this will primarily be useful only with
>   group_idle patches so that we get fairness in terms of
>   IOPS across groups. The idea is that on fast storage
>   one can run CFQ with slice_idle=0 and still get IO
>   controller working without losing too much of
>   throughput.

I'm not fluent in the cgroup code, my apologies for that. However, just
trying to make sense of this is giving me a headache. Now, in some
cases you are using IOPS *in place of* jiffies. How are we to know
which is which and in what cases?

It sounds like this is addressing an important problem, but I'm having a
hard time picking out what that problem is. Is this problem noticeable
for competing sync-noidle workloads (competing between groups, that is)?
If not, then what?

Thanks,
Jeff

> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> ---
>  block/cfq-iosched.c |   24 +++++++++++++++++++++---
>  1 files changed, 21 insertions(+), 3 deletions(-)
>
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index 7982b83..f44064c 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -896,16 +896,34 @@ static inline unsigned int cfq_cfqq_slice_usage(struct cfq_queue *cfqq)
>  	 * if there are mutiple queues in the group, each can dispatch
>  	 * a single request on seeky media and cause lots of seek time
>  	 * and group will never know it.
> +	 *
> +	 * If drive is NCQ and we are driving deep queue depths, then
> +	 * it is not reasonable to charge the slice since dispatch
> +	 * started because this time will include time taken by all
> +	 * the other requests in the queue.
> +	 *
> +	 * Actually there is no reasonable way to know the disk time
> +	 * here and we need to come up with some approximation. If
> +	 * disk is non NCQ, we should be driving request queue depth
> +	 * 1, then charge for time since dispatch start and this will
> +	 * account for seek time properly on seeky media. If request
> +	 * queue depth is high, then charge for number of requests
> +	 * dispatched from the queue. This will sort of becoming
> +	 * charging in terms of IOPS.
>  	 */
> -	slice_used = max_t(unsigned, (jiffies - cfqq->dispatch_start),
> -				1);
> +	if (cfqq->cfqd->hw_tag == 0)
> +		slice_used = max_t(unsigned,
> +			(jiffies - cfqq->dispatch_start), 1);
> +	else
> +		slice_used = cfqq->slice_dispatch;
>  	} else {
>  		slice_used = jiffies - cfqq->slice_start;
>  		if (slice_used > cfqq->allocated_slice)
>  			slice_used = cfqq->allocated_slice;
>  	}
>
> -	cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u", slice_used);
> +	cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u, sl_disp=%u", slice_used,
> +			cfqq->slice_dispatch);
>  	return slice_used;
>  }
* Re: [PATCH 1/3] cfq-iosched: Improve time slice charging logic
  2010-07-19 18:47   ` Jeff Moyer
@ 2010-07-19 18:58     ` Vivek Goyal
  2010-07-19 20:32       ` Divyesh Shah
  0 siblings, 1 reply; 15+ messages in thread
From: Vivek Goyal @ 2010-07-19 18:58 UTC (permalink / raw)
To: Jeff Moyer; +Cc: linux-kernel, axboe, nauman, dpshah, guijianfeng, czoccolo

On Mon, Jul 19, 2010 at 02:47:20PM -0400, Jeff Moyer wrote:
> Vivek Goyal <vgoyal@redhat.com> writes:
>
> > - Currently in CFQ there are many situations where don't know how
> >   much time slice has been consumed by a queue. For example, all
> >   the random reader/writer queues where we don't idle on
> >   individual queues and we expire the queue either immediately
> >   after the request dispatch.
> >
> > - In this case time consumed by a queue is just a memory copy
> >   operation. Actually time measurement is possible only if we
> >   idle on a queue and allow dispatch from a queue for significant
> >   amount of time.
> >
> > - As of today, in such cases we calculate the time since the
> >   dispatch from the queue started and charge all that time.
> >   Generally this rounds to 1 jiffy but in some cases it can
> >   be more. For example, if we are driving high request queue
> >   depth and driver is too busy and does not ask for new
> >   requests for 8-10 jiffies. In such cases, the active queue
> >   gets charged very unfairly.
> >
> > - So fundamentally, whole notion of charging for time slice
> >   is valid only if we have been idling on the queue. Otherwise
> >   in an NCQ queue, there might be other requests on the queue
> >   and we can not do the time slice calculation.
> >
> > - This patch tweaks the slice charging logic a bit so that
> >   in the cases where we can't know the amount of time, we
> >   start charging in terms of number of requests dispatched
> >   (IOPS). This practically switching CFQ fairness model to
> >   fairness in terms of IOPS with slice_idle=0.
> >
> > - As of today this will primarily be useful only with
> >   group_idle patches so that we get fairness in terms of
> >   IOPS across groups. The idea is that on fast storage
> >   one can run CFQ with slice_idle=0 and still get IO
> >   controller working without losing too much of
> >   throughput.
>
> I'm not fluent in the cgroup code, my apologies for that. However, just
> trying to make sense of this is giving me a headache. Now, in some
> cases you are using IOPS *in place of* jiffies. How are we to know
> which is which and in what cases?

Yes, it is mixed now for the default CFQ case. Wherever we don't have the
capability to determine slice_used, we charge IOPS.

For the slice_idle=0 case, we should charge IOPS almost all the time.
Though if there is a workload where a single cfqq can keep the request
queue saturated, then the current code will charge in terms of time.

I agree that this is a little confusing. Maybe in the case of slice_idle=0
we can always charge in terms of IOPS.

> It sounds like this is addressing an important problem, but I'm having a
> hard time picking out what that problem is. Is this problem noticable
> for competing sync-noidle workloads (competing between groups, that is)?
> If not, then what?

I noticed the problem with competing workloads in different groups. With
slice_idle=0, we will drive the full queue depth of 32. Sometimes when we
hit a high queue depth, say 32, for a few jiffies, the driver did not ask
for new requests. So for, say, 10-12 ms, requests only completed and new
requests did not get issued. In that case, all of those 10-12 ms get
charged to the active queue, when in fact the active queue did not even
get to dispatch more than 1 request. This queue was just unfortunate to be
there at that time. The higher weight queue often runs into this situation
because CFQ tries to keep it as the active queue more often.

So if you are driving full queue depth, where the NCQ request queue has
requests pending from multiple queues and groups, you have no way to
measure the time.

My impression is that on fast devices we can no longer stick to the model
of measuring time. If we switch to the IOPS model, then we can drive
deeper request queue depths and keep the device saturated, and at the same
time achieve group IO control.

Thanks
Vivek

> Thanks,
> Jeff
>
> > Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> > ---
> >  block/cfq-iosched.c |   24 +++++++++++++++++++++---
> >  1 files changed, 21 insertions(+), 3 deletions(-)
> >
> > diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> > index 7982b83..f44064c 100644
> > --- a/block/cfq-iosched.c
> > +++ b/block/cfq-iosched.c
> > @@ -896,16 +896,34 @@ static inline unsigned int cfq_cfqq_slice_usage(struct cfq_queue *cfqq)
> >  	 * if there are mutiple queues in the group, each can dispatch
> >  	 * a single request on seeky media and cause lots of seek time
> >  	 * and group will never know it.
> > +	 *
> > +	 * If drive is NCQ and we are driving deep queue depths, then
> > +	 * it is not reasonable to charge the slice since dispatch
> > +	 * started because this time will include time taken by all
> > +	 * the other requests in the queue.
> > +	 *
> > +	 * Actually there is no reasonable way to know the disk time
> > +	 * queue depth is high, then charge for number of requests
> > +	 * dispatched from the queue. This will sort of becoming
> > +	 * charging in terms of IOPS.
> >  	 */
> > -	slice_used = max_t(unsigned, (jiffies - cfqq->dispatch_start),
> > -				1);
> > +	if (cfqq->cfqd->hw_tag == 0)
> > +		slice_used = max_t(unsigned,
> > +			(jiffies - cfqq->dispatch_start), 1);
> > +	else
> > +		slice_used = cfqq->slice_dispatch;
> >  	} else {
> >  		slice_used = jiffies - cfqq->slice_start;
> >  		if (slice_used > cfqq->allocated_slice)
> >  			slice_used = cfqq->allocated_slice;
> >  	}
> >
> > -	cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u", slice_used);
> > +	cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u, sl_disp=%u", slice_used,
> > +			cfqq->slice_dispatch);
> >  	return slice_used;
> >  }
* Re: [PATCH 1/3] cfq-iosched: Improve time slice charging logic
  2010-07-19 18:58     ` Vivek Goyal
@ 2010-07-19 20:32       ` Divyesh Shah
  2010-07-19 20:44         ` Vivek Goyal
  0 siblings, 1 reply; 15+ messages in thread
From: Divyesh Shah @ 2010-07-19 20:32 UTC (permalink / raw)
To: Vivek Goyal
Cc: Jeff Moyer, linux-kernel, axboe, nauman, guijianfeng, czoccolo

On Mon, Jul 19, 2010 at 11:58 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> Yes it is mixed now for default CFQ case. Whereever we don't have the
> capability to determine the slice_used, we charge IOPS.
>
> For slice_idle=0 case, we should charge IOPS almost all the time. Though
> if there is a workload where single cfqq can keep the request queue
> saturated, then current code will charge in terms of time.
>
> I agree that this is little confusing. May be in case of slice_idle=0
> we can always charge in terms of IOPS.

I agree with Jeff that this is very confusing. Also, there are absolutely
no guarantees: one job may end up getting charged in IOPs for this
behavior while other jobs continue getting charged in time for their IOs.
Depending on the speed of the disk, this could be a huge advantage or
disadvantage for the cgroup being charged in IOPs.

It should be black or white, time or IOPs, and also very clearly called
out, not just in code comments but in the Documentation too.
* Re: [PATCH 1/3] cfq-iosched: Improve time slice charging logic
  2010-07-19 20:32       ` Divyesh Shah
@ 2010-07-19 20:44         ` Vivek Goyal
  2010-07-19 21:19           ` Corrado Zoccolo
  0 siblings, 1 reply; 15+ messages in thread
From: Vivek Goyal @ 2010-07-19 20:44 UTC (permalink / raw)
To: Divyesh Shah
Cc: Jeff Moyer, linux-kernel, axboe, nauman, guijianfeng, czoccolo

On Mon, Jul 19, 2010 at 01:32:24PM -0700, Divyesh Shah wrote:
> On Mon, Jul 19, 2010 at 11:58 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > Yes it is mixed now for default CFQ case. Whereever we don't have the
> > capability to determine the slice_used, we charge IOPS.
> >
> > For slice_idle=0 case, we should charge IOPS almost all the time. Though
> > if there is a workload where single cfqq can keep the request queue
> > saturated, then current code will charge in terms of time.
> >
> > I agree that this is little confusing. May be in case of slice_idle=0
> > we can always charge in terms of IOPS.
>
> I agree with Jeff that this is very confusing. Also there are
> absolutely no bets that one job may end up getting charged in IOPs for
> this behavior while other jobs continue getting charged in timefor
> their IOs. Depending on the speed of the disk, this could be a huge
> advantage or disadvantage for the cgroup being charged in IOPs.
>
> It should be black or white, time or IOPs and also very clearly called
> out not just in code comments but in the Documentation too.

Ok, how about always charging in IOPS when slice_idle=0?

So on fast devices, an admin/user space tool can set slice_idle=0, and CFQ
starts doing accounting in IOPS instead of time. On slow devices we
continue to run with slice_idle=8 and nothing changes.

Personally I feel that it is hard to sustain the time based logic on high
end devices and still get good throughput. We could make CFQ a dual mode
kind of scheduler which is capable of doing accounting both in terms of
time as well as IOPS. When slice_idle != 0, we do accounting in terms of
time and it is the same CFQ as today. When slice_idle=0, CFQ starts
accounting in terms of IOPS.

I think this change should bring us one step closer to our goal of one
IO scheduler for all devices.

Jens, what do you think?

Thanks
Vivek
* Re: [PATCH 1/3] cfq-iosched: Improve time slice charging logic
  2010-07-19 20:44         ` Vivek Goyal
@ 2010-07-19 21:19           ` Corrado Zoccolo
  2010-07-19 22:05             ` Vivek Goyal
  0 siblings, 1 reply; 15+ messages in thread
From: Corrado Zoccolo @ 2010-07-19 21:19 UTC (permalink / raw)
To: Vivek Goyal
Cc: Divyesh Shah, Jeff Moyer, linux-kernel, axboe, nauman, guijianfeng

On Mon, Jul 19, 2010 at 10:44 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Mon, Jul 19, 2010 at 01:32:24PM -0700, Divyesh Shah wrote:
>> On Mon, Jul 19, 2010 at 11:58 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
>> > Yes it is mixed now for default CFQ case. Whereever we don't have the
>> > capability to determine the slice_used, we charge IOPS.
>> >
>> > For slice_idle=0 case, we should charge IOPS almost all the time. Though
>> > if there is a workload where single cfqq can keep the request queue
>> > saturated, then current code will charge in terms of time.
>> >
>> > I agree that this is little confusing. May be in case of slice_idle=0
>> > we can always charge in terms of IOPS.
>>
>> I agree with Jeff that this is very confusing. Also there are
>> absolutely no bets that one job may end up getting charged in IOPs for
>> this behavior while other jobs continue getting charged in timefor
>> their IOs. Depending on the speed of the disk, this could be a huge
>> advantage or disadvantage for the cgroup being charged in IOPs.
>>
>> It should be black or white, time or IOPs and also very clearly called
>> out not just in code comments but in the Documentation too.
>
> Ok, how about always charging in IOPS when slice_idle=0?
>
> So on fast devices, admin/user space tool, can set slice_idle=0, and CFQ
> starts doing accounting in IOPS instead of time. On slow devices we
> continue to run with slice_idle=8 and nothing changes.
>
> Personally I feel that it is hard to sustain time based logic on high end
> devices and still get good throughput. We could make CFQ a dual mode kind
> of scheduler which is capable of doing accouting both in terms of time as
> well as IOPS. When slice_idle !=0, we do accounting in terms of time and
> it will be same CFQ as of today. When slice_idle=0, CFQ starts accounting
> in terms of IOPS.

There is another mode in which cfq can operate: for NCQ SSDs it basically
ignores slice_idle and operates as if it were 0. This mode should also be
handled as an IOPS counting mode. SSD mode, though, differs from
rotational mode in the definition of "seekiness", and we should think about
whether this mode is also appropriate for the other hardware where
slice_idle=0 is beneficial.

> I think this change should bring us one step closer to our goal of one
> IO sheduler for all devices.

I think this is an interesting instance of a more general problem: cfq
needs a cost function applicable to all requests on any hardware. The
current function is a concrete one (measured time), but unfortunately it
is not always applicable, because:
- for fast hardware the resolution is too coarse (this can be fixed using
  higher resolution timers)
- for hardware that allows parallel dispatching, we can't measure the cost
  of a single request (can we try something like average cost of the
  requests executed in parallel?).
IOPS, instead, is a synthetic cost measure. It is a simplified model, that
will approximate some devices (SSDs) better than others (multi-spindle
rotational disks). But if we want to go for the synthetic path, we can
have more complex measures, that also take into account other parameters,
such as sequentiality of the requests, their size and so on, all
parameters that may still have some impact on high-end devices.

Thanks,
Corrado

> Jens, what do you think?
>
> Thanks
> Vivek
* Re: [PATCH 1/3] cfq-iosched: Improve time slice charging logic
  2010-07-19 21:19           ` Corrado Zoccolo
@ 2010-07-19 22:05             ` Vivek Goyal
  0 siblings, 0 replies; 15+ messages in thread
From: Vivek Goyal @ 2010-07-19 22:05 UTC (permalink / raw)
To: Corrado Zoccolo
Cc: Divyesh Shah, Jeff Moyer, linux-kernel, axboe, nauman, guijianfeng

On Mon, Jul 19, 2010 at 11:19:21PM +0200, Corrado Zoccolo wrote:
> On Mon, Jul 19, 2010 at 10:44 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > Ok, how about always charging in IOPS when slice_idle=0?
> >
> > So on fast devices, admin/user space tool, can set slice_idle=0, and CFQ
> > starts doing accounting in IOPS instead of time. On slow devices we
> > continue to run with slice_idle=8 and nothing changes.
> >
> > Personally I feel that it is hard to sustain time based logic on high end
> > devices and still get good throughput. We could make CFQ a dual mode kind
> > of scheduler which is capable of doing accouting both in terms of time as
> > well as IOPS. When slice_idle !=0, we do accounting in terms of time and
> > it will be same CFQ as of today. When slice_idle=0, CFQ starts accounting
> > in terms of IOPS.
>
> There is an other mode in which cfq can operate: for ncq ssds, it
> basically ignores slice_idle, and operates as if it was 0.
> This mode should also be handled as an IOPS counting mode.
> SSD mode, though, differs from rotational mode for the definition of
> "seekyness", and we should think if this mode is appropriate also for
> the other hardware where slice_idle=0 is beneficial.

I have always wondered what, in practice, the difference is between
slice_idle=0 and rotational=0. I think the only difference is NCQ
detection: slice_idle=0 will never idle, irrespective of whether the
device is NCQ, while rotational=0 disables idling only if the device
supports NCQ.

If that's the case, then we can probably internally switch to slice_idle=0
once we have detected that an SSD supports NCQ, and get rid of this
confusion.

Well, looking more closely, there seems to be one more difference. With an
NCQ SSD we still idle on the sync-noidle tree. Seemingly, this provides us
protection from WRITES. I am not sure this matters for good SSDs; I am
assuming they give priority to reads and balance things out.

cfq_should_idle() is interesting, though, in that we disable idling for
the sync-idle tree. So we idle on the sync-noidle tree but do not provide
any protection to sequential readers. Anyway, that's a minor detail. In
fact we can switch to the IOPS model for NCQ SSDs also.

> > I think this change should bring us one step closer to our goal of one
> > IO sheduler for all devices.
>
> I think this is an interesting instance of a more general problem: cfq
> needs a cost function applicable to all requests on any hardware. The
> current function is a concrete one (measured time), but unfortunately
> it is not always applicable, because:
> - for fast hardware the resolution is too coarse (this can be fixed
> using higher resolution timers)

Yes, this is fixable.

> - for hardware that allows parallel dispatching, we can't measure the
> cost of a single request (can we try something like average cost of
> the requests executed in parallel?).

This is the biggest problem: how to get a right estimate of time when the
request queue can have requests from multiple processes at the same time.

> IOPS, instead, is a synthetic cost measure. It is a simplified model,
> that will approximate some devices (SSDs) better than others
> (multi-spindle rotational disks).

Agreed that IOPS is a simplified model.

> But if we want to go for the
> synthetic path, we can have more complex measures, that also take into
> account other parameters, as sequentiality of the requests,

Once we start dispatching requests from multiple cfq queues at a time, the
notion of sequentiality is lost (at least on the device).

> their size
> and so on, all parameters that may have still some impact on high-end
> devices.

Size is an interesting factor, though. Again, we can only come up with
some kind of approximation, as this cost will vary from device to device.

I think we can begin with something simple (IOPS) and, if it works fine,
then take into account additional factors (especially the size of the
request) and factor those into the cost. The only thing to keep in mind is
that group scheduling will benefit most from it. The notion of ioprio is
fairly weak currently in CFQ (especially on SSD and with slice_idle=0).

Thanks
Vivek

> Thanks,
> Corrado
>
> > Jens, what do you think?
> >
> > Thanks
> > Vivek
* [PATCH 2/3] cfq-iosched: Implement a new tunable group_idle
  2010-07-19 17:20 [RFC PATCH] cfq-iosched: Implement group idle V2 Vivek Goyal
  2010-07-19 17:20 ` [PATCH 1/3] cfq-iosched: Improve time slice charging logic Vivek Goyal
@ 2010-07-19 17:20 ` Vivek Goyal
  2010-07-19 18:58   ` Jeff Moyer
  2010-07-19 17:20 ` [PATCH 3/3] cfq-iosched: Print per slice sectors dispatched in blktrace Vivek Goyal
  2 siblings, 1 reply; 15+ messages in thread
From: Vivek Goyal @ 2010-07-19 17:20 UTC (permalink / raw)
  To: linux-kernel, axboe; +Cc: nauman, dpshah, guijianfeng, jmoyer, czoccolo, vgoyal

o Implement a new tunable group_idle, which allows idling on the group
  instead of a cfq queue. Hence one can set slice_idle = 0 and not idle
  on the individual queues but idle on the group. This way on fast storage
  we can get fairness between groups at the same time overall throughput
  improves.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/cfq-iosched.c |   60 +++++++++++++++++++++++++++++++++++++++++++++------
 1 files changed, 53 insertions(+), 7 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index f44064c..b23d7f4 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -30,6 +30,7 @@ static const int cfq_slice_sync = HZ / 10;
 static int cfq_slice_async = HZ / 25;
 static const int cfq_slice_async_rq = 2;
 static int cfq_slice_idle = HZ / 125;
+static int cfq_group_idle = HZ / 125;
 static const int cfq_target_latency = HZ * 3/10; /* 300 ms */
 static const int cfq_hist_divisor = 4;

@@ -198,6 +199,8 @@ struct cfq_group {
        struct hlist_node cfqd_node;
        atomic_t ref;
 #endif
+       /* number of requests that are on the dispatch list or inside driver */
+       int dispatched;
 };

 /*
@@ -271,6 +274,7 @@ struct cfq_data {
        unsigned int cfq_slice[2];
        unsigned int cfq_slice_async_rq;
        unsigned int cfq_slice_idle;
+       unsigned int cfq_group_idle;
        unsigned int cfq_latency;
        unsigned int cfq_group_isolation;

@@ -1856,6 +1860,9 @@ static bool cfq_should_idle(struct cfq_data *cfqd, struct cfq_queue *cfqq)
        BUG_ON(!service_tree);
        BUG_ON(!service_tree->count);

+       if (!cfqd->cfq_slice_idle)
+               return false;
+
        /* We never do for idle class queues. */
        if (prio == IDLE_WORKLOAD)
                return false;
@@ -1880,7 +1887,7 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
 {
        struct cfq_queue *cfqq = cfqd->active_queue;
        struct cfq_io_context *cic;
-       unsigned long sl;
+       unsigned long sl, group_idle = 0;

        /*
         * SSD device without seek penalty, disable idling. But only do so
@@ -1896,8 +1903,13 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
        /*
         * idle is disabled, either manually or by past process history
         */
-       if (!cfqd->cfq_slice_idle || !cfq_should_idle(cfqd, cfqq))
-               return;
+       if (!cfqd->cfq_slice_idle || !cfq_should_idle(cfqd, cfqq)) {
+               /* no queue idling. Check for group idling */
+               if (cfqd->cfq_group_idle)
+                       group_idle = cfqd->cfq_group_idle;
+               else
+                       return;
+       }

        /*
         * still active requests from this queue, don't idle
@@ -1924,13 +1936,21 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
                return;
        }

+       /* There are other queues in the group, don't do group idle */
+       if (group_idle && cfqq->cfqg->nr_cfqq > 1)
+               return;
+
        cfq_mark_cfqq_wait_request(cfqq);

-       sl = cfqd->cfq_slice_idle;
+       if (group_idle)
+               sl = cfqd->cfq_group_idle;
+       else
+               sl = cfqd->cfq_slice_idle;

        mod_timer(&cfqd->idle_slice_timer, jiffies + sl);
        cfq_blkiocg_update_set_idle_time_stats(&cfqq->cfqg->blkg);
-       cfq_log_cfqq(cfqd, cfqq, "arm_idle: %lu", sl);
+       cfq_log_cfqq(cfqd, cfqq, "arm_idle: %lu group_idle: %d", sl,
+                       group_idle ? 1 : 0);
 }

 /*
@@ -1946,6 +1966,7 @@ static void cfq_dispatch_insert(struct request_queue *q, struct request *rq)
        cfqq->next_rq = cfq_find_next_rq(cfqd, cfqq, rq);
        cfq_remove_request(rq);
        cfqq->dispatched++;
+       (RQ_CFQG(rq))->dispatched++;
        elv_dispatch_sort(q, rq);

        cfqd->rq_in_flight[cfq_cfqq_sync(cfqq)]++;
@@ -2215,7 +2236,7 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
                        cfqq = NULL;
                        goto keep_queue;
                } else
-                       goto expire;
+                       goto check_group_idle;
        }

        /*
@@ -2249,6 +2270,17 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
                goto keep_queue;
        }

+       /*
+        * If group idle is enabled and there are requests dispatched from
+        * this group, wait for requests to complete.
+        */
+check_group_idle:
+       if (cfqd->cfq_group_idle && cfqq->cfqg->nr_cfqq == 1
+           && cfqq->cfqg->dispatched) {
+               cfqq = NULL;
+               goto keep_queue;
+       }
+
 expire:
        cfq_slice_expired(cfqd, 0);
 new_queue:
@@ -3391,6 +3423,7 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
        WARN_ON(!cfqq->dispatched);
        cfqd->rq_in_driver--;
        cfqq->dispatched--;
+       (RQ_CFQG(rq))->dispatched--;
        cfq_blkiocg_update_completion_stats(&cfqq->cfqg->blkg,
                        rq_start_time_ns(rq), rq_io_start_time_ns(rq),
                        rq_data_dir(rq), rq_is_sync(rq));
@@ -3420,7 +3453,10 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
         * the queue.
         */
        if (cfq_should_wait_busy(cfqd, cfqq)) {
-               cfqq->slice_end = jiffies + cfqd->cfq_slice_idle;
+               unsigned long extend_sl = cfqd->cfq_slice_idle;
+               if (!cfqd->cfq_slice_idle)
+                       extend_sl = cfqd->cfq_group_idle;
+               cfqq->slice_end = jiffies + extend_sl;
                cfq_mark_cfqq_wait_busy(cfqq);
                cfq_log_cfqq(cfqd, cfqq, "will busy wait");
        }
@@ -3865,6 +3901,7 @@ static void *cfq_init_queue(struct request_queue *q)
        cfqd->cfq_slice[1] = cfq_slice_sync;
        cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
        cfqd->cfq_slice_idle = cfq_slice_idle;
+       cfqd->cfq_group_idle = cfq_group_idle;
        cfqd->cfq_latency = 1;
        cfqd->cfq_group_isolation = 0;
        cfqd->hw_tag = -1;
@@ -3937,6 +3974,7 @@ SHOW_FUNCTION(cfq_fifo_expire_async_show, cfqd->cfq_fifo_expire[0], 1);
 SHOW_FUNCTION(cfq_back_seek_max_show, cfqd->cfq_back_max, 0);
 SHOW_FUNCTION(cfq_back_seek_penalty_show, cfqd->cfq_back_penalty, 0);
 SHOW_FUNCTION(cfq_slice_idle_show, cfqd->cfq_slice_idle, 1);
+SHOW_FUNCTION(cfq_group_idle_show, cfqd->cfq_group_idle, 1);
 SHOW_FUNCTION(cfq_slice_sync_show, cfqd->cfq_slice[1], 1);
 SHOW_FUNCTION(cfq_slice_async_show, cfqd->cfq_slice[0], 1);
 SHOW_FUNCTION(cfq_slice_async_rq_show, cfqd->cfq_slice_async_rq, 0);
@@ -3969,6 +4007,7 @@ STORE_FUNCTION(cfq_back_seek_max_store, &cfqd->cfq_back_max, 0, UINT_MAX, 0);
 STORE_FUNCTION(cfq_back_seek_penalty_store, &cfqd->cfq_back_penalty, 1,
                UINT_MAX, 0);
 STORE_FUNCTION(cfq_slice_idle_store, &cfqd->cfq_slice_idle, 0, UINT_MAX, 1);
+STORE_FUNCTION(cfq_group_idle_store, &cfqd->cfq_group_idle, 0, UINT_MAX, 1);
 STORE_FUNCTION(cfq_slice_sync_store, &cfqd->cfq_slice[1], 1, UINT_MAX, 1);
 STORE_FUNCTION(cfq_slice_async_store, &cfqd->cfq_slice[0], 1, UINT_MAX, 1);
 STORE_FUNCTION(cfq_slice_async_rq_store, &cfqd->cfq_slice_async_rq, 1,
@@ -3990,6 +4029,7 @@ static struct elv_fs_entry cfq_attrs[] = {
        CFQ_ATTR(slice_async),
        CFQ_ATTR(slice_async_rq),
        CFQ_ATTR(slice_idle),
+       CFQ_ATTR(group_idle),
        CFQ_ATTR(low_latency),
        CFQ_ATTR(group_isolation),
        __ATTR_NULL
@@ -4043,6 +4083,12 @@ static int __init cfq_init(void)
        if (!cfq_slice_idle)
                cfq_slice_idle = 1;

+#ifdef CONFIG_CFQ_GROUP_IOSCHED
+       if (!cfq_group_idle)
+               cfq_group_idle = 1;
+#else
+       cfq_group_idle = 0;
+#endif
        if (cfq_slab_setup())
                return -ENOMEM;

-- 
1.7.1.1
* Re: [PATCH 2/3] cfq-iosched: Implement a new tunable group_idle
  2010-07-19 17:20 ` [PATCH 2/3] cfq-iosched: Implement a new tunable group_idle Vivek Goyal
@ 2010-07-19 18:58   ` Jeff Moyer
  2010-07-19 20:20     ` Vivek Goyal
  0 siblings, 1 reply; 15+ messages in thread
From: Jeff Moyer @ 2010-07-19 18:58 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: linux-kernel, axboe, nauman, dpshah, guijianfeng, czoccolo

Vivek Goyal <vgoyal@redhat.com> writes:

> o Implement a new tunable group_idle, which allows idling on the group
> instead of a cfq queue. Hence one can set slice_idle = 0 and not idle
> on the individual queues but idle on the group. This way on fast storage
> we can get fairness between groups at the same time overall throughput
> improves.
>
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> ---
>  block/cfq-iosched.c |   60 +++++++++++++++++++++++++++++++++++++++++++++------
>  1 files changed, 53 insertions(+), 7 deletions(-)
>
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index f44064c..b23d7f4 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -30,6 +30,7 @@ static const int cfq_slice_sync = HZ / 10;
>  static int cfq_slice_async = HZ / 25;
>  static const int cfq_slice_async_rq = 2;
>  static int cfq_slice_idle = HZ / 125;
> +static int cfq_group_idle = HZ / 125;
>  static const int cfq_target_latency = HZ * 3/10; /* 300 ms */
>  static const int cfq_hist_divisor = 4;
>
> @@ -198,6 +199,8 @@ struct cfq_group {
>  	struct hlist_node cfqd_node;
>  	atomic_t ref;
>  #endif
> +	/* number of requests that are on the dispatch list or inside driver */
> +	int dispatched;
>  };
>
>  /*
> @@ -271,6 +274,7 @@ struct cfq_data {
>  	unsigned int cfq_slice[2];
>  	unsigned int cfq_slice_async_rq;
>  	unsigned int cfq_slice_idle;
> +	unsigned int cfq_group_idle;
>  	unsigned int cfq_latency;
>  	unsigned int cfq_group_isolation;
>
> @@ -1856,6 +1860,9 @@ static bool cfq_should_idle(struct cfq_data *cfqd, struct cfq_queue *cfqq)
>  	BUG_ON(!service_tree);
>  	BUG_ON(!service_tree->count);
>
> +	if (!cfqd->cfq_slice_idle)
> +		return false;
> +
>  	/* We never do for idle class queues. */
>  	if (prio == IDLE_WORKLOAD)
>  		return false;
> @@ -1880,7 +1887,7 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
>  {
>  	struct cfq_queue *cfqq = cfqd->active_queue;
>  	struct cfq_io_context *cic;
> -	unsigned long sl;
> +	unsigned long sl, group_idle = 0;
>
>  	/*
>  	 * SSD device without seek penalty, disable idling. But only do so
> @@ -1896,8 +1903,13 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
>  	/*
>  	 * idle is disabled, either manually or by past process history
>  	 */
> -	if (!cfqd->cfq_slice_idle || !cfq_should_idle(cfqd, cfqq))
> -		return;
> +	if (!cfqd->cfq_slice_idle || !cfq_should_idle(cfqd, cfqq)) {

The check for cfqd->cfq_slice_idle is now redundant (as it's done in
cfq_should_idle).

> @@ -2215,7 +2236,7 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
>  			cfqq = NULL;
>  			goto keep_queue;
>  		} else
> -			goto expire;
> +			goto check_group_idle;
>  	}
>
>  	/*
> @@ -2249,6 +2270,17 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
>  		goto keep_queue;
>  	}
>
> +	/*
> +	 * If group idle is enabled and there are requests dispatched from
> +	 * this group, wait for requests to complete.
> +	 */
> +check_group_idle:
> +	if (cfqd->cfq_group_idle && cfqq->cfqg->nr_cfqq == 1
> +	    && cfqq->cfqg->dispatched) {
> +		cfqq = NULL;
> +		goto keep_queue;
> +	}

I really wish we could roll all of this logic into cfq_should_idle.

> @@ -3420,7 +3453,10 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
>  	 * the queue.
>  	 */
>  	if (cfq_should_wait_busy(cfqd, cfqq)) {
> -		cfqq->slice_end = jiffies + cfqd->cfq_slice_idle;
> +		unsigned long extend_sl = cfqd->cfq_slice_idle;
> +		if (!cfqd->cfq_slice_idle)
> +			extend_sl = cfqd->cfq_group_idle;
> +		cfqq->slice_end = jiffies + extend_sl;

So, if slice_idle and group_idle are both set, slice_idle trumps
group_idle?  Did you give that case any thought?  If it doesn't make
sense to configure both, then we should probably make sure they can't
both be set.

Cheers,
Jeff
* Re: [PATCH 2/3] cfq-iosched: Implement a new tunable group_idle
  2010-07-19 18:58 ` Jeff Moyer
@ 2010-07-19 20:20   ` Vivek Goyal
  0 siblings, 0 replies; 15+ messages in thread
From: Vivek Goyal @ 2010-07-19 20:20 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: linux-kernel, axboe, nauman, dpshah, guijianfeng, czoccolo

On Mon, Jul 19, 2010 at 02:58:41PM -0400, Jeff Moyer wrote:
> Vivek Goyal <vgoyal@redhat.com> writes:
>
> > o Implement a new tunable group_idle, which allows idling on the group
> > instead of a cfq queue. Hence one can set slice_idle = 0 and not idle
> > on the individual queues but idle on the group. This way on fast storage
> > we can get fairness between groups at the same time overall throughput
> > improves.
> >
> > Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> > ---
> >  block/cfq-iosched.c |   60 +++++++++++++++++++++++++++++++++++++++++++++------
> >  1 files changed, 53 insertions(+), 7 deletions(-)
> >
> > diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> > index f44064c..b23d7f4 100644
> > --- a/block/cfq-iosched.c
> > +++ b/block/cfq-iosched.c
> > @@ -30,6 +30,7 @@ static const int cfq_slice_sync = HZ / 10;
> >  static int cfq_slice_async = HZ / 25;
> >  static const int cfq_slice_async_rq = 2;
> >  static int cfq_slice_idle = HZ / 125;
> > +static int cfq_group_idle = HZ / 125;
> >  static const int cfq_target_latency = HZ * 3/10; /* 300 ms */
> >  static const int cfq_hist_divisor = 4;
> >
> > @@ -198,6 +199,8 @@ struct cfq_group {
> >  	struct hlist_node cfqd_node;
> >  	atomic_t ref;
> >  #endif
> > +	/* number of requests that are on the dispatch list or inside driver */
> > +	int dispatched;
> >  };
> >
> >  /*
> > @@ -271,6 +274,7 @@ struct cfq_data {
> >  	unsigned int cfq_slice[2];
> >  	unsigned int cfq_slice_async_rq;
> >  	unsigned int cfq_slice_idle;
> > +	unsigned int cfq_group_idle;
> >  	unsigned int cfq_latency;
> >  	unsigned int cfq_group_isolation;
> >
> > @@ -1856,6 +1860,9 @@ static bool cfq_should_idle(struct cfq_data *cfqd, struct cfq_queue *cfqq)
> >  	BUG_ON(!service_tree);
> >  	BUG_ON(!service_tree->count);
> >
> > +	if (!cfqd->cfq_slice_idle)
> > +		return false;
> > +
> >  	/* We never do for idle class queues. */
> >  	if (prio == IDLE_WORKLOAD)
> >  		return false;
> > @@ -1880,7 +1887,7 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
> >  {
> >  	struct cfq_queue *cfqq = cfqd->active_queue;
> >  	struct cfq_io_context *cic;
> > -	unsigned long sl;
> > +	unsigned long sl, group_idle = 0;
> >
> >  	/*
> >  	 * SSD device without seek penalty, disable idling. But only do so
> > @@ -1896,8 +1903,13 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
> >  	/*
> >  	 * idle is disabled, either manually or by past process history
> >  	 */
> > -	if (!cfqd->cfq_slice_idle || !cfq_should_idle(cfqd, cfqq))
> > -		return;
> > +	if (!cfqd->cfq_slice_idle || !cfq_should_idle(cfqd, cfqq)) {
>
> The check for cfqd->cfq_slice_idle is now redundant (as it's done in
> cfq_should_idle).

Yep, I will get rid of the extra check.

> > @@ -2215,7 +2236,7 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
> >  			cfqq = NULL;
> >  			goto keep_queue;
> >  		} else
> > -			goto expire;
> > +			goto check_group_idle;
> >  	}
> >
> >  	/*
> > @@ -2249,6 +2270,17 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
> >  		goto keep_queue;
> >  	}
> >
> > +	/*
> > +	 * If group idle is enabled and there are requests dispatched from
> > +	 * this group, wait for requests to complete.
> > +	 */
> > +check_group_idle:
> > +	if (cfqd->cfq_group_idle && cfqq->cfqg->nr_cfqq == 1
> > +	    && cfqq->cfqg->dispatched) {
> > +		cfqq = NULL;
> > +		goto keep_queue;
> > +	}
>
> I really wish we could roll all of this logic into cfq_should_idle.

Currently the whether-to-idle-on-a-queue logic is also part of
select_queue(), so for consistency it makes sense to keep the group logic
in select_queue() as well.

> > @@ -3420,7 +3453,10 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
> >  	 * the queue.
> >  	 */
> >  	if (cfq_should_wait_busy(cfqd, cfqq)) {
> > -		cfqq->slice_end = jiffies + cfqd->cfq_slice_idle;
> > +		unsigned long extend_sl = cfqd->cfq_slice_idle;
> > +		if (!cfqd->cfq_slice_idle)
> > +			extend_sl = cfqd->cfq_group_idle;
> > +		cfqq->slice_end = jiffies + extend_sl;
>
> So, if slice_idle and group_idle are both set, slice_idle trumps
> group_idle?  Did you give that case any thought?  If it doesn't make
> sense to configure both, then we should probably make sure they can't
> both be set.

Actually, the wait-busy logic makes sense in the slice_idle case, where we
give a queue an extended slice so that a queue/group does not get deleted
immediately after completing its slice and hence lose its share. With
slice_idle=0, a single queue in a group becomes the same case as a single
queue in a group with slice_idle=8.

So yes, slice_idle trumps group_idle; group_idle kicks in only when
slice_idle=0. I don't think having both set by default does any harm --
what good would slice_idle=8 with group_idle=0 do? Requiring group_idle to
be set separately would just mean more work for a user who plans to set
slice_idle=0.

Thanks
Vivek

> Cheers,
> Jeff
* [PATCH 3/3] cfq-iosched: Print per slice sectors dispatched in blktrace
  2010-07-19 17:20 [RFC PATCH] cfq-iosched: Implement group idle V2 Vivek Goyal
  2010-07-19 17:20 ` [PATCH 1/3] cfq-iosched: Improve time slice charging logic Vivek Goyal
  2010-07-19 17:20 ` [PATCH 2/3] cfq-iosched: Implement a new tunable group_idle Vivek Goyal
@ 2010-07-19 17:20 ` Vivek Goyal
  2010-07-19 18:59   ` Jeff Moyer
  2010-07-19 22:16   ` Divyesh Shah
  2 siblings, 2 replies; 15+ messages in thread
From: Vivek Goyal @ 2010-07-19 17:20 UTC (permalink / raw)
  To: linux-kernel, axboe; +Cc: nauman, dpshah, guijianfeng, jmoyer, czoccolo, vgoyal

- Divyesh had gotten rid of this code in the past. I want to re-introduce it
  back as it helps me a lot during debugging.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/cfq-iosched.c |    8 ++++++--
 1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index b23d7f4..1f0d4ea 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -148,6 +148,8 @@ struct cfq_queue {
        struct cfq_queue *new_cfqq;
        struct cfq_group *cfqg;
        struct cfq_group *orig_cfqg;
+       /* Number of sectors dispatched from queue in single dispatch round */
+       unsigned long nr_sectors;
 };

 /*
@@ -926,8 +928,8 @@ static inline unsigned int cfq_cfqq_slice_usage(struct cfq_queue *cfqq)
                slice_used = cfqq->allocated_slice;
        }

-       cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u, sl_disp=%u", slice_used,
-                       cfqq->slice_dispatch);
+       cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u, disp=%u sect=%u",
+                       slice_used, cfqq->slice_dispatch, cfqq->nr_sectors);
        return slice_used;
 }

@@ -1608,6 +1610,7 @@ static void __cfq_set_active_queue(struct cfq_data *cfqd,
        cfqq->allocated_slice = 0;
        cfqq->slice_end = 0;
        cfqq->slice_dispatch = 0;
+       cfqq->nr_sectors = 0;

        cfq_clear_cfqq_wait_request(cfqq);
        cfq_clear_cfqq_must_dispatch(cfqq);
@@ -1970,6 +1973,7 @@ static void cfq_dispatch_insert(struct request_queue *q, struct request *rq)
        elv_dispatch_sort(q, rq);

        cfqd->rq_in_flight[cfq_cfqq_sync(cfqq)]++;
+       cfqq->nr_sectors += blk_rq_sectors(rq);
        cfq_blkiocg_update_dispatch_stats(&cfqq->cfqg->blkg, blk_rq_bytes(rq),
                                        rq_data_dir(rq), rq_is_sync(rq));
 }
-- 
1.7.1.1
* Re: [PATCH 3/3] cfq-iosched: Print per slice sectors dispatched in blktrace
  2010-07-19 17:20 ` [PATCH 3/3] cfq-iosched: Print per slice sectors dispatched in blktrace Vivek Goyal
@ 2010-07-19 18:59   ` Jeff Moyer
  2010-07-19 22:16   ` Divyesh Shah
  1 sibling, 0 replies; 15+ messages in thread
From: Jeff Moyer @ 2010-07-19 18:59 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: linux-kernel, axboe, nauman, dpshah, guijianfeng, czoccolo

Vivek Goyal <vgoyal@redhat.com> writes:

> - Divyesh had gotten rid of this code in the past. I want to re-introduce it
>   back as it helps me a lot during debugging.

Fine with me.

Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
* Re: [PATCH 3/3] cfq-iosched: Print per slice sectors dispatched in blktrace
  2010-07-19 17:20 ` [PATCH 3/3] cfq-iosched: Print per slice sectors dispatched in blktrace Vivek Goyal
  2010-07-19 18:59   ` Jeff Moyer
@ 2010-07-19 22:16   ` Divyesh Shah
  1 sibling, 0 replies; 15+ messages in thread
From: Divyesh Shah @ 2010-07-19 22:16 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: linux-kernel, axboe, nauman, guijianfeng, jmoyer, czoccolo

Looks good to me.
Reviewed-by: Divyesh Shah <dpshah@google.com>

On Mon, Jul 19, 2010 at 10:20 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> - Divyesh had gotten rid of this code in the past. I want to re-introduce it
> back as it helps me a lot during debugging.
>
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> ---
>  block/cfq-iosched.c |    8 ++++++--
>  1 files changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index b23d7f4..1f0d4ea 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -148,6 +148,8 @@ struct cfq_queue {
>  	struct cfq_queue *new_cfqq;
>  	struct cfq_group *cfqg;
>  	struct cfq_group *orig_cfqg;
> +	/* Number of sectors dispatched from queue in single dispatch round */
> +	unsigned long nr_sectors;
>  };
>
>  /*
> @@ -926,8 +928,8 @@ static inline unsigned int cfq_cfqq_slice_usage(struct cfq_queue *cfqq)
>  		slice_used = cfqq->allocated_slice;
>  	}
>
> -	cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u, sl_disp=%u", slice_used,
> -			cfqq->slice_dispatch);
> +	cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u, disp=%u sect=%u",
> +			slice_used, cfqq->slice_dispatch, cfqq->nr_sectors);
>  	return slice_used;
>  }
>
> @@ -1608,6 +1610,7 @@ static void __cfq_set_active_queue(struct cfq_data *cfqd,
>  	cfqq->allocated_slice = 0;
>  	cfqq->slice_end = 0;
>  	cfqq->slice_dispatch = 0;
> +	cfqq->nr_sectors = 0;
>
>  	cfq_clear_cfqq_wait_request(cfqq);
>  	cfq_clear_cfqq_must_dispatch(cfqq);
> @@ -1970,6 +1973,7 @@ static void cfq_dispatch_insert(struct request_queue *q, struct request *rq)
>  	elv_dispatch_sort(q, rq);
>
>  	cfqd->rq_in_flight[cfq_cfqq_sync(cfqq)]++;
> +	cfqq->nr_sectors += blk_rq_sectors(rq);
>  	cfq_blkiocg_update_dispatch_stats(&cfqq->cfqg->blkg, blk_rq_bytes(rq),
>  					rq_data_dir(rq), rq_is_sync(rq));
>  }
> --
> 1.7.1.1
>
* [RFC PATCH] cfq-iosched: Implement group idle V2
@ 2010-07-19 17:14 Vivek Goyal
  2010-07-19 17:14 ` [PATCH 1/3] cfq-iosched: Improve time slice charging logic Vivek Goyal
  0 siblings, 1 reply; 15+ messages in thread
From: Vivek Goyal @ 2010-07-19 17:14 UTC (permalink / raw)
  To: linux-kernel, jens.axboe
  Cc: nauman, dpshah, guijianfeng, jmoyer, czoccolo, vgoyal

Hi,

This is V2 of the group_idle implementation patchset. I have done some more
testing since V1 and fixed a couple of bugs.

What's the problem
------------------
On high end storage (I tested on an HP EVA storage array with 12 SATA disks
in RAID 5), CFQ's model of dispatching requests from a single queue at a
time (sequential readers, sync writers, etc.) becomes a bottleneck. Often we
don't drive enough request queue depth to keep all the disks busy, and we
suffer a lot in terms of overall throughput.

These problems primarily originate from two things: idling on a per-cfq-queue
basis, and the quantum (dispatching a limited number of requests from a
single queue while not allowing dispatch from other queues). Once you set
slice_idle=0 and raise the quantum, most of CFQ's problems on higher end
storage disappear.

This problem also becomes visible in the IO controller, where one creates
multiple groups and gets the fairness, but overall throughput is lower. In
the following table, I am running an increasing number of sequential readers
(1, 2, 4, 8) in 8 groups of weight 100 to 800.

Kernel=2.6.35-rc5-gi-sl-accounting+
GROUPMODE=1          NRGRP=8
DIR=/mnt/iostestmnt/fio        DEV=/dev/dm-4
Workload=bsr       iosched=cfq     Filesz=512M bs=4K
group_isolation=1  slice_idle=8    group_idle=8    quantum=8
=========================================================================
AVERAGE[bsr]    [bw in KB/s]
-------
job     Set NR  test1  test2  test3  test4  test5  test6  test7  test8  total
---     --- --  ---------------------------------------------------------------
bsr     1   1   6245   12776  16591  23471  28746  36799  43031  49778  217437
bsr     1   2   5100   11063  17216  23136  23738  30391  35910  40874  187428
bsr     1   4   4623   9718   14746  18356  22875  30407  33215  38073  172013
bsr     1   8   4720   10143  13499  19115  22157  29126  31688  30784  161232

Notice that overall throughput is just around 160MB/s with 8 sequential
readers in each group.

With this patch set, I have set slice_idle=0 and re-run the same test.

Kernel=2.6.35-rc5-gi-sl-accounting+
GROUPMODE=1          NRGRP=8
DIR=/mnt/iostestmnt/fio        DEV=/dev/dm-4
Workload=bsr       iosched=cfq     Filesz=512M bs=4K
group_isolation=1  slice_idle=0    group_idle=8    quantum=8
=========================================================================
AVERAGE[bsr]    [bw in KB/s]
-------
job     Set NR  test1  test2  test3  test4  test5  test6  test7  test8  total
---     --- --  ---------------------------------------------------------------
bsr     1   1   6789   12764  17174  23111  28528  36807  42753  48826  216752
bsr     1   2   9845   20617  30521  39061  45514  51607  63683  63770  324618
bsr     1   4   14835  24241  42444  55207  45914  51318  54661  60318  348938
bsr     1   8   12022  24478  36732  48651  54333  60974  64856  72930  374976

Notice how overall throughput has shot up to 374MB/s while retaining the
ability to do IO control.

This is not the default mode. The new tunable, group_idle, allows one to set
slice_idle=0 to disable some of CFQ's features and primarily use the group
service differentiation feature.

If you have thoughts on other ways of solving the problem, I am all ears.

Thanks
Vivek
* [PATCH 1/3] cfq-iosched: Improve time slice charging logic
  2010-07-19 17:14 [RFC PATCH] cfq-iosched: Implement group idle V2 Vivek Goyal
@ 2010-07-19 17:14 ` Vivek Goyal
  0 siblings, 0 replies; 15+ messages in thread
From: Vivek Goyal @ 2010-07-19 17:14 UTC (permalink / raw)
  To: linux-kernel, jens.axboe
  Cc: nauman, dpshah, guijianfeng, jmoyer, czoccolo, vgoyal

- Currently in CFQ there are many situations where we don't know how much
  time slice has been consumed by a queue. For example, all the random
  reader/writer queues, where we don't idle on individual queues and expire
  the queue immediately after the request dispatch.

- In this case, the time consumed by a queue is just the time of a memory
  copy operation. Actual time measurement is possible only if we idle on a
  queue and allow dispatch from it for a significant amount of time.

- As of today, in such cases we calculate the time since the dispatch from
  the queue started and charge all that time. Generally this rounds to 1
  jiffy, but in some cases it can be more. For example, if we are driving a
  high request queue depth and the driver is too busy and does not ask for
  new requests for 8-10 jiffies, the active queue gets charged very
  unfairly.

- So fundamentally, the whole notion of charging for time slices is valid
  only if we have been idling on the queue. Otherwise, on an NCQ device,
  there might be other requests in the queue and we cannot do the time
  slice calculation.

- This patch tweaks the slice charging logic a bit so that in the cases
  where we can't know the amount of time, we start charging in terms of
  number of requests dispatched (IOPS). This practically switches the CFQ
  fairness model to fairness in terms of IOPS with slice_idle=0.

- As of today this will primarily be useful only with the group_idle
  patches, so that we get fairness in terms of IOPS across groups. The idea
  is that on fast storage one can run CFQ with slice_idle=0 and still get
  the IO controller working without losing too much throughput.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/cfq-iosched.c |   24 +++++++++++++++++++++---
 1 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 7982b83..f44064c 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -896,16 +896,34 @@ static inline unsigned int cfq_cfqq_slice_usage(struct cfq_queue *cfqq)
                 * if there are mutiple queues in the group, each can dispatch
                 * a single request on seeky media and cause lots of seek time
                 * and group will never know it.
+                *
+                * If drive is NCQ and we are driving deep queue depths, then
+                * it is not reasonable to charge the slice since dispatch
+                * started because this time will include time taken by all
+                * the other requests in the queue.
+                *
+                * Actually there is no reasonable way to know the disk time
+                * here and we need to come up with some approximation. If
+                * disk is non NCQ, we should be driving request queue depth
+                * 1, then charge for time since dispatch start and this will
+                * account for seek time properly on seeky media. If request
+                * queue depth is high, then charge for number of requests
+                * dispatched from the queue. This will sort of becoming
+                * charging in terms of IOPS.
                 */
-               slice_used = max_t(unsigned, (jiffies - cfqq->dispatch_start),
-                                       1);
+               if (cfqq->cfqd->hw_tag == 0)
+                       slice_used = max_t(unsigned,
+                               (jiffies - cfqq->dispatch_start), 1);
+               else
+                       slice_used = cfqq->slice_dispatch;
        } else {
                slice_used = jiffies - cfqq->slice_start;
                if (slice_used > cfqq->allocated_slice)
                        slice_used = cfqq->allocated_slice;
        }

-       cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u", slice_used);
+       cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u, sl_disp=%u", slice_used,
+                       cfqq->slice_dispatch);
        return slice_used;
 }
-- 
1.7.1.1