* [RFC PATCH] cfq-iosched: Implement group idle V2
@ 2010-07-19 17:20 Vivek Goyal
2010-07-19 17:20 ` [PATCH 1/3] cfq-iosched: Improve time slice charging logic Vivek Goyal
` (2 more replies)
0 siblings, 3 replies; 15+ messages in thread
From: Vivek Goyal @ 2010-07-19 17:20 UTC (permalink / raw)
To: linux-kernel, axboe; +Cc: nauman, dpshah, guijianfeng, jmoyer, czoccolo, vgoyal
[ Got Jens's mail id wrong in last post hence reposting. Sorry for cluttering
your mailboxes.]
Hi,
This is V2 of the group_idle implementation patchset. I have done some more
testing since V1 and fixed a couple of bugs.
What's the problem
------------------
On high end storage (I tested on an HP EVA storage array with 12 SATA disks in
RAID 5), CFQ's model of dispatching requests from a single queue at a
time (sequential readers, sync writers, etc.) becomes a bottleneck.
Often we don't drive enough request queue depth to keep all the disks busy,
and we suffer a lot in terms of overall throughput.
These problems primarily originate from two things: idling on a per-cfq-queue
basis, and the quantum (dispatching only a limited number of requests from a
single queue while not allowing dispatch from other queues until then). Once
you set slice_idle=0 and raise the quantum, most of CFQ's problems on
high-end storage disappear.
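For reference, the knobs mentioned above live in the scheduler's sysfs directory. A minimal sketch of the tuning described, run here against a throwaway directory standing in for /sys/block/&lt;dev&gt;/queue/iosched (assumption: the real path depends on the device and requires CFQ to be the active scheduler):

```shell
# Stand-in for /sys/block/<dev>/queue/iosched (the real tunables live
# there when CFQ is the active elevator).
IOSCHED=$(mktemp -d)

# Defaults on a rotational disk: idle up to ~8ms per queue, small quantum.
echo 8 > "$IOSCHED/slice_idle"
echo 8 > "$IOSCHED/quantum"

# High-end storage tuning from the text: stop per-queue idling and allow
# more requests from a single queue per dispatch round.
echo 0  > "$IOSCHED/slice_idle"
echo 16 > "$IOSCHED/quantum"

cat "$IOSCHED/slice_idle" "$IOSCHED/quantum"
```

On a real system the two `echo` lines would target the actual sysfs path; the temp directory is only there so the commands can be exercised safely.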
This problem also becomes visible in the IO controller, where one creates
multiple groups and gets the fairness, but overall throughput is lower. In
the following table, I am running an increasing number of sequential readers
(1, 2, 4, 8) in 8 groups of weight 100 to 800.
Kernel=2.6.35-rc5-gi-sl-accounting+
GROUPMODE=1 NRGRP=8
DIR=/mnt/iostestmnt/fio DEV=/dev/dm-4
Workload=bsr iosched=cfq Filesz=512M bs=4K
group_isolation=1 slice_idle=8 group_idle=8 quantum=8
=========================================================================
AVERAGE[bsr] [bw in KB/s]
-------
job Set NR test1 test2 test3 test4 test5 test6 test7 test8 total
--- --- -- ---------------------------------------------------------------
bsr 1 1 6245 12776 16591 23471 28746 36799 43031 49778 217437
bsr 1 2 5100 11063 17216 23136 23738 30391 35910 40874 187428
bsr 1 4 4623 9718 14746 18356 22875 30407 33215 38073 172013
bsr 1 8 4720 10143 13499 19115 22157 29126 31688 30784 161232
Notice that overall throughput is just around 160MB/s with 8 sequential readers
in each group.
With this patch set, I set slice_idle=0 and re-ran the same test.
Kernel=2.6.35-rc5-gi-sl-accounting+
GROUPMODE=1 NRGRP=8
DIR=/mnt/iostestmnt/fio DEV=/dev/dm-4
Workload=bsr iosched=cfq Filesz=512M bs=4K
group_isolation=1 slice_idle=0 group_idle=8 quantum=8
=========================================================================
AVERAGE[bsr] [bw in KB/s]
-------
job Set NR test1 test2 test3 test4 test5 test6 test7 test8 total
--- --- -- ---------------------------------------------------------------
bsr 1 1 6789 12764 17174 23111 28528 36807 42753 48826 216752
bsr 1 2 9845 20617 30521 39061 45514 51607 63683 63770 324618
bsr 1 4 14835 24241 42444 55207 45914 51318 54661 60318 348938
bsr 1 8 12022 24478 36732 48651 54333 60974 64856 72930 374976
Notice how overall throughput has shot up to 374MB/s while retaining the
ability to do IO control.
So this is not the default mode. The new tunable group_idle allows one to
set slice_idle=0 to disable some of CFQ's features and primarily use the
group service differentiation feature.
If you have thoughts on other ways of solving the problem, I am all ears.
Thanks
Vivek
^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 1/3] cfq-iosched: Improve time slice charging logic
  2010-07-19 17:20 [RFC PATCH] cfq-iosched: Implement group idle V2 Vivek Goyal
@ 2010-07-19 17:20 ` Vivek Goyal
  2010-07-19 18:47   ` Jeff Moyer
  2010-07-19 17:20 ` [PATCH 2/3] cfq-iosched: Implement a new tunable group_idle Vivek Goyal
  2010-07-19 17:20 ` [PATCH 3/3] cfq-iosched: Print per slice sectors dispatched in blktrace Vivek Goyal
  2 siblings, 1 reply; 15+ messages in thread
From: Vivek Goyal @ 2010-07-19 17:20 UTC (permalink / raw)
To: linux-kernel, axboe; +Cc: nauman, dpshah, guijianfeng, jmoyer, czoccolo, vgoyal

- Currently in CFQ there are many situations where we don't know how much
  time slice has been consumed by a queue. For example, all the random
  reader/writer queues where we don't idle on individual queues and expire
  the queue immediately after the request dispatch.

- In these cases the time consumed by a queue is just a memory copy
  operation. Time measurement is really possible only if we idle on a
  queue and allow dispatch from it for a significant amount of time.

- As of today, in such cases we calculate the time since dispatch from the
  queue started and charge all of it. Generally this rounds to 1 jiffy, but
  in some cases it can be more. For example, if we are driving a high
  request queue depth and the driver is too busy and does not ask for new
  requests for 8-10 jiffies, the active queue gets charged very unfairly.

- So fundamentally, the whole notion of charging for a time slice is valid
  only if we have been idling on the queue. Otherwise, on an NCQ device
  there might be other requests in the queue and we cannot do the time
  slice calculation.

- This patch tweaks the slice charging logic a bit so that in the cases
  where we can't know the amount of time, we start charging in terms of
  the number of requests dispatched (IOPS). This practically switches
  CFQ's fairness model to fairness in terms of IOPS with slice_idle=0.

- As of today this will primarily be useful only with the group_idle
  patches, so that we get fairness in terms of IOPS across groups. The
  idea is that on fast storage one can run CFQ with slice_idle=0 and still
  get the IO controller working without losing too much throughput.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/cfq-iosched.c |   24 +++++++++++++++++++++---
 1 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 7982b83..f44064c 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -896,16 +896,34 @@ static inline unsigned int cfq_cfqq_slice_usage(struct cfq_queue *cfqq)
		 * if there are mutiple queues in the group, each can dispatch
		 * a single request on seeky media and cause lots of seek time
		 * and group will never know it.
+		 *
+		 * If drive is NCQ and we are driving deep queue depths, then
+		 * it is not reasonable to charge the slice since dispatch
+		 * started because this time will include time taken by all
+		 * the other requests in the queue.
+		 *
+		 * Actually there is no reasonable way to know the disk time
+		 * here and we need to come up with some approximation. If
+		 * disk is non NCQ, we should be driving request queue depth
+		 * 1, then charge for time since dispatch start and this will
+		 * account for seek time properly on seeky media. If request
+		 * queue depth is high, then charge for number of requests
+		 * dispatched from the queue. This will sort of becoming
+		 * charging in terms of IOPS.
		 */
-		slice_used = max_t(unsigned, (jiffies - cfqq->dispatch_start),
-					1);
+		if (cfqq->cfqd->hw_tag == 0)
+			slice_used = max_t(unsigned,
+				(jiffies - cfqq->dispatch_start), 1);
+		else
+			slice_used = cfqq->slice_dispatch;
	} else {
		slice_used = jiffies - cfqq->slice_start;
		if (slice_used > cfqq->allocated_slice)
			slice_used = cfqq->allocated_slice;
	}

-	cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u", slice_used);
+	cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u, sl_disp=%u", slice_used,
+			cfqq->slice_dispatch);
	return slice_used;
}
--
1.7.1.1
* Re: [PATCH 1/3] cfq-iosched: Improve time slice charging logic
  2010-07-19 17:20 ` [PATCH 1/3] cfq-iosched: Improve time slice charging logic Vivek Goyal
@ 2010-07-19 18:47   ` Jeff Moyer
  2010-07-19 18:58     ` Vivek Goyal
  0 siblings, 1 reply; 15+ messages in thread
From: Jeff Moyer @ 2010-07-19 18:47 UTC (permalink / raw)
To: Vivek Goyal; +Cc: linux-kernel, axboe, nauman, dpshah, guijianfeng, czoccolo

Vivek Goyal <vgoyal@redhat.com> writes:

> - Currently in CFQ there are many situations where don't know how
>   much time slice has been consumed by a queue. For example, all
>   the random reader/writer queues where we don't idle on
>   individual queues and we expire the queue either immediately
>   after the request dispatch.
>
> - In this case time consumed by a queue is just a memory copy
>   operation. Actually time measurement is possible only if we
>   idle on a queue and allow dispatch from a queue for significant
>   amount of time.
>
> - As of today, in such cases we calculate the time since the
>   dispatch from the queue started and charge all that time.
>   Generally this rounds to 1 jiffy but in some cases it can
>   be more. For example, if we are driving high request queue
>   depth and driver is too busy and does not ask for new
>   requests for 8-10 jiffies. In such cases, the active queue
>   gets charged very unfairly.
>
> - So fundamentally, whole notion of charging for time slice
>   is valid only if we have been idling on the queue. Otherwise
>   in an NCQ queue, there might be other requests on the queue
>   and we can not do the time slice calculation.
>
> - This patch tweaks the slice charging logic a bit so that
>   in the cases where we can't know the amount of time, we
>   start charging in terms of number of requests dispatched
>   (IOPS). This practically switching CFQ fairness model to
>   fairness in terms of IOPS with slice_idle=0.
>
> - As of today this will primarily be useful only with
>   group_idle patches so that we get fairness in terms of
>   IOPS across groups. The idea is that on fast storage
>   one can run CFQ with slice_idle=0 and still get IO
>   controller working without losing too much of
>   throughput.

I'm not fluent in the cgroup code, my apologies for that. However, just
trying to make sense of this is giving me a headache. Now, in some
cases you are using IOPS *in place of* jiffies. How are we to know
which is which and in what cases?

It sounds like this is addressing an important problem, but I'm having a
hard time picking out what that problem is. Is this problem noticeable
for competing sync-noidle workloads (competing between groups, that is)?
If not, then what?

Thanks,
Jeff

> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> ---
>  block/cfq-iosched.c |   24 +++++++++++++++++++++---
>  1 files changed, 21 insertions(+), 3 deletions(-)
>
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index 7982b83..f44064c 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -896,16 +896,34 @@ static inline unsigned int cfq_cfqq_slice_usage(struct cfq_queue *cfqq)
>  	 * if there are mutiple queues in the group, each can dispatch
>  	 * a single request on seeky media and cause lots of seek time
>  	 * and group will never know it.
> +	 *
> +	 * If drive is NCQ and we are driving deep queue depths, then
> +	 * it is not reasonable to charge the slice since dispatch
> +	 * started because this time will include time taken by all
> +	 * the other requests in the queue.
> +	 *
> +	 * Actually there is no reasonable way to know the disk time
> +	 * here and we need to come up with some approximation. If
> +	 * disk is non NCQ, we should be driving request queue depth
> +	 * 1, then charge for time since dispatch start and this will
> +	 * account for seek time properly on seeky media. If request
> +	 * queue depth is high, then charge for number of requests
> +	 * dispatched from the queue. This will sort of becoming
> +	 * charging in terms of IOPS.
>  	 */
> -	slice_used = max_t(unsigned, (jiffies - cfqq->dispatch_start),
> -				1);
> +	if (cfqq->cfqd->hw_tag == 0)
> +		slice_used = max_t(unsigned,
> +			(jiffies - cfqq->dispatch_start), 1);
> +	else
> +		slice_used = cfqq->slice_dispatch;
>  	} else {
>  		slice_used = jiffies - cfqq->slice_start;
>  		if (slice_used > cfqq->allocated_slice)
>  			slice_used = cfqq->allocated_slice;
>  	}
>
> -	cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u", slice_used);
> +	cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u, sl_disp=%u", slice_used,
> +			cfqq->slice_dispatch);
>  	return slice_used;
>  }
* Re: [PATCH 1/3] cfq-iosched: Improve time slice charging logic
  2010-07-19 18:47   ` Jeff Moyer
@ 2010-07-19 18:58     ` Vivek Goyal
  2010-07-19 20:32       ` Divyesh Shah
  0 siblings, 1 reply; 15+ messages in thread
From: Vivek Goyal @ 2010-07-19 18:58 UTC (permalink / raw)
To: Jeff Moyer; +Cc: linux-kernel, axboe, nauman, dpshah, guijianfeng, czoccolo

On Mon, Jul 19, 2010 at 02:47:20PM -0400, Jeff Moyer wrote:
> Vivek Goyal <vgoyal@redhat.com> writes:
>
> > - Currently in CFQ there are many situations where don't know how
> >   much time slice has been consumed by a queue. For example, all
> >   the random reader/writer queues where we don't idle on
> >   individual queues and we expire the queue either immediately
> >   after the request dispatch.
> >
> > - In this case time consumed by a queue is just a memory copy
> >   operation. Actually time measurement is possible only if we
> >   idle on a queue and allow dispatch from a queue for significant
> >   amount of time.
> >
> > - As of today, in such cases we calculate the time since the
> >   dispatch from the queue started and charge all that time.
> >   Generally this rounds to 1 jiffy but in some cases it can
> >   be more. For example, if we are driving high request queue
> >   depth and driver is too busy and does not ask for new
> >   requests for 8-10 jiffies. In such cases, the active queue
> >   gets charged very unfairly.
> >
> > - So fundamentally, whole notion of charging for time slice
> >   is valid only if we have been idling on the queue. Otherwise
> >   in an NCQ queue, there might be other requests on the queue
> >   and we can not do the time slice calculation.
> >
> > - This patch tweaks the slice charging logic a bit so that
> >   in the cases where we can't know the amount of time, we
> >   start charging in terms of number of requests dispatched
> >   (IOPS). This practically switching CFQ fairness model to
> >   fairness in terms of IOPS with slice_idle=0.
> >
> > - As of today this will primarily be useful only with
> >   group_idle patches so that we get fairness in terms of
> >   IOPS across groups. The idea is that on fast storage
> >   one can run CFQ with slice_idle=0 and still get IO
> >   controller working without losing too much of
> >   throughput.
>
> I'm not fluent in the cgroup code, my apologies for that. However, just
> trying to make sense of this is giving me a headache. Now, in some
> cases you are using IOPS *in place of* jiffies. How are we to know
> which is which and in what cases?

Yes, it is mixed now for the default CFQ case. Wherever we don't have the
capability to determine slice_used, we charge IOPS.

For the slice_idle=0 case, we should charge IOPS almost all the time.
Though if there is a workload where a single cfqq can keep the request
queue saturated, then the current code will charge in terms of time.

I agree that this is a little confusing. Maybe in the case of slice_idle=0
we can always charge in terms of IOPS.

> It sounds like this is addressing an important problem, but I'm having a
> hard time picking out what that problem is. Is this problem noticable
> for competing sync-noidle workloads (competing between groups, that is)?
> If not, then what?

I noticed the problem with competing workloads in different groups. With
slice_idle=0, we will drive the full queue depth of 32. Sometimes when we
hit a high queue depth, say 32, for a few jiffies, the driver did not ask
for new requests. So for, say, 10-12 ms, requests only completed and new
requests did not get issued. In that case, all of those 10-12 ms get
charged to the active queue, when in fact the active queue did not even
get to dispatch more than 1 request. This queue was just unfortunate to be
there at that time. The higher weight queue often runs into this situation
because CFQ tries to keep it as the active queue more often.

So if you are driving full queue depth, where the NCQ request queue has
requests pending from multiple queues and groups, you have no way to
measure the time.

My impression is that on fast devices we can no longer stick to the model
of measuring time. If we switch to the IOPS model, then we can drive
deeper request queue depths and keep the device saturated, and at the same
time achieve group IO control.

Thanks
Vivek

> Thanks,
> Jeff
>
> > Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> > ---
> >  block/cfq-iosched.c |   24 +++++++++++++++++++++---
> >  1 files changed, 21 insertions(+), 3 deletions(-)
> >
> > diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> > index 7982b83..f44064c 100644
> > --- a/block/cfq-iosched.c
> > +++ b/block/cfq-iosched.c
> > @@ -896,16 +896,34 @@ static inline unsigned int cfq_cfqq_slice_usage(struct cfq_queue *cfqq)
> >  	 * if there are mutiple queues in the group, each can dispatch
> >  	 * a single request on seeky media and cause lots of seek time
> >  	 * and group will never know it.
> > +	 *
> > +	 * If drive is NCQ and we are driving deep queue depths, then
> > +	 * it is not reasonable to charge the slice since dispatch
> > +	 * started because this time will include time taken by all
> > +	 * the other requests in the queue.
> > +	 *
> > +	 * Actually there is no reasonable way to know the disk time
> > +	 * queue depth is high, then charge for number of requests
> > +	 * dispatched from the queue. This will sort of becoming
> > +	 * charging in terms of IOPS.
> >  	 */
> > -	slice_used = max_t(unsigned, (jiffies - cfqq->dispatch_start),
> > -				1);
> > +	if (cfqq->cfqd->hw_tag == 0)
> > +		slice_used = max_t(unsigned,
> > +			(jiffies - cfqq->dispatch_start), 1);
> > +	else
> > +		slice_used = cfqq->slice_dispatch;
> >  	} else {
> >  		slice_used = jiffies - cfqq->slice_start;
> >  		if (slice_used > cfqq->allocated_slice)
> >  			slice_used = cfqq->allocated_slice;
> >  	}
> >
> > -	cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u", slice_used);
> > +	cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u, sl_disp=%u", slice_used,
> > +			cfqq->slice_dispatch);
> >  	return slice_used;
> >  }
* Re: [PATCH 1/3] cfq-iosched: Improve time slice charging logic
  2010-07-19 18:58     ` Vivek Goyal
@ 2010-07-19 20:32       ` Divyesh Shah
  2010-07-19 20:44         ` Vivek Goyal
  0 siblings, 1 reply; 15+ messages in thread
From: Divyesh Shah @ 2010-07-19 20:32 UTC (permalink / raw)
To: Vivek Goyal
Cc: Jeff Moyer, linux-kernel, axboe, nauman, guijianfeng, czoccolo

On Mon, Jul 19, 2010 at 11:58 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> Yes it is mixed now for default CFQ case. Whereever we don't have the
> capability to determine the slice_used, we charge IOPS.
>
> For slice_idle=0 case, we should charge IOPS almost all the time. Though
> if there is a workload where single cfqq can keep the request queue
> saturated, then current code will charge in terms of time.
>
> I agree that this is little confusing. May be in case of slice_idle=0
> we can always charge in terms of IOPS.

I agree with Jeff that this is very confusing. Also, there are absolutely
no guarantees: one job may end up getting charged in IOPs for this
behavior while other jobs continue getting charged in time for their IOs.
Depending on the speed of the disk, this could be a huge advantage or
disadvantage for the cgroup being charged in IOPs.

It should be black or white, time or IOPs, and also very clearly called
out, not just in code comments but in the Documentation too.
* Re: [PATCH 1/3] cfq-iosched: Improve time slice charging logic
  2010-07-19 20:32       ` Divyesh Shah
@ 2010-07-19 20:44         ` Vivek Goyal
  2010-07-19 21:19           ` Corrado Zoccolo
  0 siblings, 1 reply; 15+ messages in thread
From: Vivek Goyal @ 2010-07-19 20:44 UTC (permalink / raw)
To: Divyesh Shah
Cc: Jeff Moyer, linux-kernel, axboe, nauman, guijianfeng, czoccolo

On Mon, Jul 19, 2010 at 01:32:24PM -0700, Divyesh Shah wrote:
> On Mon, Jul 19, 2010 at 11:58 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > Yes it is mixed now for default CFQ case. Whereever we don't have the
> > capability to determine the slice_used, we charge IOPS.
> >
> > For slice_idle=0 case, we should charge IOPS almost all the time. Though
> > if there is a workload where single cfqq can keep the request queue
> > saturated, then current code will charge in terms of time.
> >
> > I agree that this is little confusing. May be in case of slice_idle=0
> > we can always charge in terms of IOPS.
>
> I agree with Jeff that this is very confusing. Also there are
> absolutely no bets that one job may end up getting charged in IOPs for
> this behavior while other jobs continue getting charged in timefor
> their IOs. Depending on the speed of the disk, this could be a huge
> advantage or disadvantage for the cgroup being charged in IOPs.
>
> It should be black or white, time or IOPs and also very clearly called
> out not just in code comments but in the Documentation too.

Ok, how about always charging in IOPS when slice_idle=0?

So on fast devices, an admin/user space tool can set slice_idle=0, and CFQ
starts doing accounting in IOPS instead of time. On slow devices we
continue to run with slice_idle=8 and nothing changes.

Personally I feel that it is hard to sustain the time based logic on high
end devices and still get good throughput. We could make CFQ a dual mode
kind of scheduler which is capable of doing accounting both in terms of
time as well as IOPS. When slice_idle != 0, we do accounting in terms of
time and it is the same CFQ as today. When slice_idle=0, CFQ starts
accounting in terms of IOPS.

I think this change should bring us one step closer to our goal of one
IO scheduler for all devices.

Jens, what do you think?

Thanks
Vivek
* Re: [PATCH 1/3] cfq-iosched: Improve time slice charging logic
  2010-07-19 20:44         ` Vivek Goyal
@ 2010-07-19 21:19           ` Corrado Zoccolo
  2010-07-19 22:05             ` Vivek Goyal
  0 siblings, 1 reply; 15+ messages in thread
From: Corrado Zoccolo @ 2010-07-19 21:19 UTC (permalink / raw)
To: Vivek Goyal
Cc: Divyesh Shah, Jeff Moyer, linux-kernel, axboe, nauman, guijianfeng

On Mon, Jul 19, 2010 at 10:44 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Mon, Jul 19, 2010 at 01:32:24PM -0700, Divyesh Shah wrote:
>> On Mon, Jul 19, 2010 at 11:58 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
>> > Yes it is mixed now for default CFQ case. Whereever we don't have the
>> > capability to determine the slice_used, we charge IOPS.
>> >
>> > For slice_idle=0 case, we should charge IOPS almost all the time. Though
>> > if there is a workload where single cfqq can keep the request queue
>> > saturated, then current code will charge in terms of time.
>> >
>> > I agree that this is little confusing. May be in case of slice_idle=0
>> > we can always charge in terms of IOPS.
>>
>> I agree with Jeff that this is very confusing. Also there are
>> absolutely no bets that one job may end up getting charged in IOPs for
>> this behavior while other jobs continue getting charged in timefor
>> their IOs. Depending on the speed of the disk, this could be a huge
>> advantage or disadvantage for the cgroup being charged in IOPs.
>>
>> It should be black or white, time or IOPs and also very clearly called
>> out not just in code comments but in the Documentation too.
>
> Ok, how about always charging in IOPS when slice_idle=0?
>
> So on fast devices, admin/user space tool, can set slice_idle=0, and CFQ
> starts doing accounting in IOPS instead of time. On slow devices we
> continue to run with slice_idle=8 and nothing changes.
>
> Personally I feel that it is hard to sustain time based logic on high end
> devices and still get good throughput. We could make CFQ a dual mode kind
> of scheduler which is capable of doing accouting both in terms of time as
> well as IOPS. When slice_idle !=0, we do accounting in terms of time and
> it will be same CFQ as of today. When slice_idle=0, CFQ starts accounting
> in terms of IOPS.

There is another mode in which cfq can operate: for NCQ SSDs it basically
ignores slice_idle and operates as if it were 0. This mode should also be
handled as an IOPS counting mode. SSD mode, though, differs from
rotational mode in the definition of "seekiness", and we should think about
whether this mode is also appropriate for the other hardware where
slice_idle=0 is beneficial.

> I think this change should bring us one step closer to our goal of one
> IO sheduler for all devices.

I think this is an interesting instance of a more general problem: cfq
needs a cost function applicable to all requests on any hardware. The
current function is a concrete one (measured time), but unfortunately it
is not always applicable, because:
- for fast hardware the resolution is too coarse (this can be fixed using
  higher resolution timers)
- for hardware that allows parallel dispatching, we can't measure the cost
  of a single request (can we try something like average cost of the
  requests executed in parallel?).
IOPS, instead, is a synthetic cost measure. It is a simplified model, that
will approximate some devices (SSDs) better than others (multi-spindle
rotational disks). But if we want to go for the synthetic path, we can
have more complex measures, that also take into account other parameters,
such as sequentiality of the requests, their size and so on, all
parameters that may still have some impact on high-end devices.

Thanks,
Corrado

> Jens, what do you think?
>
> Thanks
> Vivek
* Re: [PATCH 1/3] cfq-iosched: Improve time slice charging logic
  2010-07-19 21:19           ` Corrado Zoccolo
@ 2010-07-19 22:05             ` Vivek Goyal
  0 siblings, 0 replies; 15+ messages in thread
From: Vivek Goyal @ 2010-07-19 22:05 UTC (permalink / raw)
To: Corrado Zoccolo
Cc: Divyesh Shah, Jeff Moyer, linux-kernel, axboe, nauman, guijianfeng

On Mon, Jul 19, 2010 at 11:19:21PM +0200, Corrado Zoccolo wrote:
> On Mon, Jul 19, 2010 at 10:44 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > Ok, how about always charging in IOPS when slice_idle=0?
> >
> > So on fast devices, admin/user space tool, can set slice_idle=0, and CFQ
> > starts doing accounting in IOPS instead of time. On slow devices we
> > continue to run with slice_idle=8 and nothing changes.
> >
> > Personally I feel that it is hard to sustain time based logic on high end
> > devices and still get good throughput. We could make CFQ a dual mode kind
> > of scheduler which is capable of doing accouting both in terms of time as
> > well as IOPS. When slice_idle !=0, we do accounting in terms of time and
> > it will be same CFQ as of today. When slice_idle=0, CFQ starts accounting
> > in terms of IOPS.
>
> There is an other mode in which cfq can operate: for ncq ssds, it
> basically ignores slice_idle, and operates as if it was 0.
> This mode should also be handled as an IOPS counting mode.
> SSD mode, though, differs from rotational mode for the definition of
> "seekyness", and we should think if this mode is appropriate also for
> the other hardware where slice_idle=0 is beneficial.

I have always wondered what, in practice, the difference is between
slice_idle=0 and rotational=0. I think the only difference is NCQ
detection: slice_idle=0 will never idle, irrespective of whether the
device is NCQ, while rotational=0 disables idling only if the device
supports NCQ.

If that's the case, then we can probably internally switch to slice_idle=0
once we have detected that an SSD supports NCQ, and get rid of this
confusion.

Well, looking more closely, there seems to be one more difference. With an
NCQ SSD we still idle on the sync-noidle tree. Seemingly, this provides us
protection from WRITES. I am not sure this matters for good SSDs; I am
assuming they give priority to reads and balance things out.

cfq_should_idle() is interesting, though, in that we disable idling for
the sync-idle tree. So we idle on the sync-noidle tree but do not provide
any protection to sequential readers. Anyway, that's a minor detail. In
fact we can switch to the IOPS model for NCQ SSDs also.

> > I think this change should bring us one step closer to our goal of one
> > IO sheduler for all devices.
>
> I think this is an interesting instance of a more general problem: cfq
> needs a cost function applicable to all requests on any hardware. The
> current function is a concrete one (measured time), but unfortunately
> it is not always applicable, because:
> - for fast hardware the resolution is too coarse (this can be fixed
> using higher resolution timers)

Yes, this is fixable.

> - for hardware that allows parallel dispatching, we can't measure the
> cost of a single request (can we try something like average cost of
> the requests executed in parallel?).

This is the biggest problem: how to get a right estimate of time when the
request queue can have requests from multiple processes at the same time.

> IOPS, instead, is a synthetic cost measure. It is a simplified model,
> that will approximate some devices (SSDs) better than others
> (multi-spindle rotational disks).

Agreed that IOPS is a simplified model.

> But if we want to go for the
> synthetic path, we can have more complex measures, that also take into
> account other parameters, as sequentiality of the requests,

Once we start dispatching requests from multiple cfq queues at a time, the
notion of sequentiality is lost (at least on the device).

> their size
> and so on, all parameters that may have still some impact on high-end
> devices.

Size is an interesting factor, though. Again, we can only come up with
some kind of approximation, as this cost will vary from device to device.

I think we can begin with something simple (IOPS) and, if it works fine,
then take into account additional factors (especially the size of the
request) and factor those into the cost. The only thing to keep in mind is
that group scheduling will benefit most from it. The notion of ioprio is
fairly weak currently in CFQ (especially on SSD and with slice_idle=0).

Thanks
Vivek

> Thanks,
> Corrado
>
> > Jens, what do you think?
> >
> > Thanks
> > Vivek
* [PATCH 2/3] cfq-iosched: Implement a new tunable group_idle
  2010-07-19 17:20 [RFC PATCH] cfq-iosched: Implement group idle V2 Vivek Goyal
  2010-07-19 17:20 ` [PATCH 1/3] cfq-iosched: Improve time slice charging logic Vivek Goyal
@ 2010-07-19 17:20 ` Vivek Goyal
  2010-07-19 18:58   ` Jeff Moyer
  2010-07-19 17:20 ` [PATCH 3/3] cfq-iosched: Print per slice sectors dispatched in blktrace Vivek Goyal
  2 siblings, 1 reply; 15+ messages in thread
From: Vivek Goyal @ 2010-07-19 17:20 UTC (permalink / raw)
  To: linux-kernel, axboe; +Cc: nauman, dpshah, guijianfeng, jmoyer, czoccolo, vgoyal

o Implement a new tunable group_idle, which allows idling on the group
  instead of a cfq queue. Hence one can set slice_idle = 0 and not idle
  on the individual queues but idle on the group. This way on fast storage
  we can get fairness between groups at the same time overall throughput
  improves.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/cfq-iosched.c |   60 +++++++++++++++++++++++++++++++++++++++++++++------
 1 files changed, 53 insertions(+), 7 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index f44064c..b23d7f4 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -30,6 +30,7 @@ static const int cfq_slice_sync = HZ / 10;
 static int cfq_slice_async = HZ / 25;
 static const int cfq_slice_async_rq = 2;
 static int cfq_slice_idle = HZ / 125;
+static int cfq_group_idle = HZ / 125;
 static const int cfq_target_latency = HZ * 3/10; /* 300 ms */
 static const int cfq_hist_divisor = 4;

@@ -198,6 +199,8 @@ struct cfq_group {
        struct hlist_node cfqd_node;
        atomic_t ref;
 #endif
+       /* number of requests that are on the dispatch list or inside driver */
+       int dispatched;
 };

 /*
@@ -271,6 +274,7 @@ struct cfq_data {
        unsigned int cfq_slice[2];
        unsigned int cfq_slice_async_rq;
        unsigned int cfq_slice_idle;
+       unsigned int cfq_group_idle;
        unsigned int cfq_latency;
        unsigned int cfq_group_isolation;

@@ -1856,6 +1860,9 @@ static bool cfq_should_idle(struct cfq_data *cfqd, struct cfq_queue *cfqq)
        BUG_ON(!service_tree);
        BUG_ON(!service_tree->count);

+       if (!cfqd->cfq_slice_idle)
+               return false;
+
        /* We never do for idle class queues. */
        if (prio == IDLE_WORKLOAD)
                return false;
@@ -1880,7 +1887,7 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
 {
        struct cfq_queue *cfqq = cfqd->active_queue;
        struct cfq_io_context *cic;
-       unsigned long sl;
+       unsigned long sl, group_idle = 0;

        /*
         * SSD device without seek penalty, disable idling. But only do so
@@ -1896,8 +1903,13 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
        /*
         * idle is disabled, either manually or by past process history
         */
-       if (!cfqd->cfq_slice_idle || !cfq_should_idle(cfqd, cfqq))
-               return;
+       if (!cfqd->cfq_slice_idle || !cfq_should_idle(cfqd, cfqq)) {
+               /* no queue idling. Check for group idling */
+               if (cfqd->cfq_group_idle)
+                       group_idle = cfqd->cfq_group_idle;
+               else
+                       return;
+       }

        /*
         * still active requests from this queue, don't idle
@@ -1924,13 +1936,21 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
                return;
        }

+       /* There are other queues in the group, don't do group idle */
+       if (group_idle && cfqq->cfqg->nr_cfqq > 1)
+               return;
+
        cfq_mark_cfqq_wait_request(cfqq);

-       sl = cfqd->cfq_slice_idle;
+       if (group_idle)
+               sl = cfqd->cfq_group_idle;
+       else
+               sl = cfqd->cfq_slice_idle;

        mod_timer(&cfqd->idle_slice_timer, jiffies + sl);
        cfq_blkiocg_update_set_idle_time_stats(&cfqq->cfqg->blkg);
-       cfq_log_cfqq(cfqd, cfqq, "arm_idle: %lu", sl);
+       cfq_log_cfqq(cfqd, cfqq, "arm_idle: %lu group_idle: %d", sl,
+                       group_idle ? 1 : 0);
 }

 /*
@@ -1946,6 +1966,7 @@ static void cfq_dispatch_insert(struct request_queue *q, struct request *rq)
        cfqq->next_rq = cfq_find_next_rq(cfqd, cfqq, rq);
        cfq_remove_request(rq);
        cfqq->dispatched++;
+       (RQ_CFQG(rq))->dispatched++;
        elv_dispatch_sort(q, rq);

        cfqd->rq_in_flight[cfq_cfqq_sync(cfqq)]++;
@@ -2215,7 +2236,7 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
                        cfqq = NULL;
                        goto keep_queue;
                } else
-                       goto expire;
+                       goto check_group_idle;
        }

        /*
@@ -2249,6 +2270,17 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
                goto keep_queue;
        }

+       /*
+        * If group idle is enabled and there are requests dispatched from
+        * this group, wait for requests to complete.
+        */
+check_group_idle:
+       if (cfqd->cfq_group_idle && cfqq->cfqg->nr_cfqq == 1
+           && cfqq->cfqg->dispatched) {
+               cfqq = NULL;
+               goto keep_queue;
+       }
+
 expire:
        cfq_slice_expired(cfqd, 0);
 new_queue:
@@ -3391,6 +3423,7 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
        WARN_ON(!cfqq->dispatched);
        cfqd->rq_in_driver--;
        cfqq->dispatched--;
+       (RQ_CFQG(rq))->dispatched--;
        cfq_blkiocg_update_completion_stats(&cfqq->cfqg->blkg,
                        rq_start_time_ns(rq), rq_io_start_time_ns(rq),
                        rq_data_dir(rq), rq_is_sync(rq));
@@ -3420,7 +3453,10 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
         * the queue.
         */
        if (cfq_should_wait_busy(cfqd, cfqq)) {
-               cfqq->slice_end = jiffies + cfqd->cfq_slice_idle;
+               unsigned long extend_sl = cfqd->cfq_slice_idle;
+               if (!cfqd->cfq_slice_idle)
+                       extend_sl = cfqd->cfq_group_idle;
+               cfqq->slice_end = jiffies + extend_sl;
                cfq_mark_cfqq_wait_busy(cfqq);
                cfq_log_cfqq(cfqd, cfqq, "will busy wait");
        }
@@ -3865,6 +3901,7 @@ static void *cfq_init_queue(struct request_queue *q)
        cfqd->cfq_slice[1] = cfq_slice_sync;
        cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
        cfqd->cfq_slice_idle = cfq_slice_idle;
+       cfqd->cfq_group_idle = cfq_group_idle;
        cfqd->cfq_latency = 1;
        cfqd->cfq_group_isolation = 0;
        cfqd->hw_tag = -1;
@@ -3937,6 +3974,7 @@ SHOW_FUNCTION(cfq_fifo_expire_async_show, cfqd->cfq_fifo_expire[0], 1);
 SHOW_FUNCTION(cfq_back_seek_max_show, cfqd->cfq_back_max, 0);
 SHOW_FUNCTION(cfq_back_seek_penalty_show, cfqd->cfq_back_penalty, 0);
 SHOW_FUNCTION(cfq_slice_idle_show, cfqd->cfq_slice_idle, 1);
+SHOW_FUNCTION(cfq_group_idle_show, cfqd->cfq_group_idle, 1);
 SHOW_FUNCTION(cfq_slice_sync_show, cfqd->cfq_slice[1], 1);
 SHOW_FUNCTION(cfq_slice_async_show, cfqd->cfq_slice[0], 1);
 SHOW_FUNCTION(cfq_slice_async_rq_show, cfqd->cfq_slice_async_rq, 0);
@@ -3969,6 +4007,7 @@ STORE_FUNCTION(cfq_back_seek_max_store, &cfqd->cfq_back_max, 0, UINT_MAX, 0);
 STORE_FUNCTION(cfq_back_seek_penalty_store, &cfqd->cfq_back_penalty, 1,
                UINT_MAX, 0);
 STORE_FUNCTION(cfq_slice_idle_store, &cfqd->cfq_slice_idle, 0, UINT_MAX, 1);
+STORE_FUNCTION(cfq_group_idle_store, &cfqd->cfq_group_idle, 0, UINT_MAX, 1);
 STORE_FUNCTION(cfq_slice_sync_store, &cfqd->cfq_slice[1], 1, UINT_MAX, 1);
 STORE_FUNCTION(cfq_slice_async_store, &cfqd->cfq_slice[0], 1, UINT_MAX, 1);
 STORE_FUNCTION(cfq_slice_async_rq_store, &cfqd->cfq_slice_async_rq, 1,
@@ -3990,6 +4029,7 @@ static struct elv_fs_entry cfq_attrs[] = {
        CFQ_ATTR(slice_async),
        CFQ_ATTR(slice_async_rq),
        CFQ_ATTR(slice_idle),
+       CFQ_ATTR(group_idle),
        CFQ_ATTR(low_latency),
        CFQ_ATTR(group_isolation),
        __ATTR_NULL
@@ -4043,6 +4083,12 @@ static int __init cfq_init(void)
        if (!cfq_slice_idle)
                cfq_slice_idle = 1;

+#ifdef CONFIG_CFQ_GROUP_IOSCHED
+       if (!cfq_group_idle)
+               cfq_group_idle = 1;
+#else
+       cfq_group_idle = 0;
+#endif
        if (cfq_slab_setup())
                return -ENOMEM;

-- 
1.7.1.1
* Re: [PATCH 2/3] cfq-iosched: Implement a new tunable group_idle
  2010-07-19 17:20 ` [PATCH 2/3] cfq-iosched: Implement a new tunable group_idle Vivek Goyal
@ 2010-07-19 18:58   ` Jeff Moyer
  2010-07-19 20:20     ` Vivek Goyal
  0 siblings, 1 reply; 15+ messages in thread
From: Jeff Moyer @ 2010-07-19 18:58 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: linux-kernel, axboe, nauman, dpshah, guijianfeng, czoccolo

Vivek Goyal <vgoyal@redhat.com> writes:

> o Implement a new tunable group_idle, which allows idling on the group
> instead of a cfq queue. Hence one can set slice_idle = 0 and not idle
> on the individual queues but idle on the group. This way on fast storage
> we can get fairness between groups at the same time overall throughput
> improves.
>
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> ---
>  block/cfq-iosched.c |   60 +++++++++++++++++++++++++++++++++++++++++++++------
>  1 files changed, 53 insertions(+), 7 deletions(-)
>
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index f44064c..b23d7f4 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -30,6 +30,7 @@ static const int cfq_slice_sync = HZ / 10;
>  static int cfq_slice_async = HZ / 25;
>  static const int cfq_slice_async_rq = 2;
>  static int cfq_slice_idle = HZ / 125;
> +static int cfq_group_idle = HZ / 125;
>  static const int cfq_target_latency = HZ * 3/10; /* 300 ms */
>  static const int cfq_hist_divisor = 4;
>
> @@ -198,6 +199,8 @@ struct cfq_group {
>  	struct hlist_node cfqd_node;
>  	atomic_t ref;
>  #endif
> +	/* number of requests that are on the dispatch list or inside driver */
> +	int dispatched;
>  };
>
>  /*
> @@ -271,6 +274,7 @@ struct cfq_data {
>  	unsigned int cfq_slice[2];
>  	unsigned int cfq_slice_async_rq;
>  	unsigned int cfq_slice_idle;
> +	unsigned int cfq_group_idle;
>  	unsigned int cfq_latency;
>  	unsigned int cfq_group_isolation;
>
> @@ -1856,6 +1860,9 @@ static bool cfq_should_idle(struct cfq_data *cfqd, struct cfq_queue *cfqq)
>  	BUG_ON(!service_tree);
>  	BUG_ON(!service_tree->count);
>
> +	if (!cfqd->cfq_slice_idle)
> +		return false;
> +
>  	/* We never do for idle class queues. */
>  	if (prio == IDLE_WORKLOAD)
>  		return false;
> @@ -1880,7 +1887,7 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
>  {
>  	struct cfq_queue *cfqq = cfqd->active_queue;
>  	struct cfq_io_context *cic;
> -	unsigned long sl;
> +	unsigned long sl, group_idle = 0;
>
>  	/*
>  	 * SSD device without seek penalty, disable idling. But only do so
> @@ -1896,8 +1903,13 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
>  	/*
>  	 * idle is disabled, either manually or by past process history
>  	 */
> -	if (!cfqd->cfq_slice_idle || !cfq_should_idle(cfqd, cfqq))
> -		return;
> +	if (!cfqd->cfq_slice_idle || !cfq_should_idle(cfqd, cfqq)) {

The check for cfqd->cfq_slice_idle is now redundant (as it's done in
cfq_should_idle).

> @@ -2215,7 +2236,7 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
>  			cfqq = NULL;
>  			goto keep_queue;
>  		} else
> -			goto expire;
> +			goto check_group_idle;
>  	}
>
>  	/*
> @@ -2249,6 +2270,17 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
>  		goto keep_queue;
>  	}
>
> +	/*
> +	 * If group idle is enabled and there are requests dispatched from
> +	 * this group, wait for requests to complete.
> +	 */
> +check_group_idle:
> +	if (cfqd->cfq_group_idle && cfqq->cfqg->nr_cfqq == 1
> +	    && cfqq->cfqg->dispatched) {
> +		cfqq = NULL;
> +		goto keep_queue;
> +	}

I really wish we could roll all of this logic into cfq_should_idle.

> @@ -3420,7 +3453,10 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
>  	 * the queue.
>  	 */
>  	if (cfq_should_wait_busy(cfqd, cfqq)) {
> -		cfqq->slice_end = jiffies + cfqd->cfq_slice_idle;
> +		unsigned long extend_sl = cfqd->cfq_slice_idle;
> +		if (!cfqd->cfq_slice_idle)
> +			extend_sl = cfqd->cfq_group_idle;
> +		cfqq->slice_end = jiffies + extend_sl;

So, if slice_idle and group_idle are both set, slice_idle trumps
group_idle?  Did you give that case any thought?  If it doesn't make
sense to configure both, then we should probably make sure they can't
both be set.

Cheers,
Jeff
* Re: [PATCH 2/3] cfq-iosched: Implement a new tunable group_idle
  2010-07-19 18:58 ` Jeff Moyer
@ 2010-07-19 20:20   ` Vivek Goyal
  0 siblings, 0 replies; 15+ messages in thread
From: Vivek Goyal @ 2010-07-19 20:20 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: linux-kernel, axboe, nauman, dpshah, guijianfeng, czoccolo

On Mon, Jul 19, 2010 at 02:58:41PM -0400, Jeff Moyer wrote:
> Vivek Goyal <vgoyal@redhat.com> writes:
>
> > o Implement a new tunable group_idle, which allows idling on the group
> > instead of a cfq queue. Hence one can set slice_idle = 0 and not idle
> > on the individual queues but idle on the group. This way on fast storage
> > we can get fairness between groups at the same time overall throughput
> > improves.
> >
> > Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> > ---
> >  block/cfq-iosched.c |   60 +++++++++++++++++++++++++++++++++++++++++++++------
> >  1 files changed, 53 insertions(+), 7 deletions(-)
> >
> > diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> > index f44064c..b23d7f4 100644
> > --- a/block/cfq-iosched.c
> > +++ b/block/cfq-iosched.c
> > @@ -30,6 +30,7 @@ static const int cfq_slice_sync = HZ / 10;
> >  static int cfq_slice_async = HZ / 25;
> >  static const int cfq_slice_async_rq = 2;
> >  static int cfq_slice_idle = HZ / 125;
> > +static int cfq_group_idle = HZ / 125;
> >  static const int cfq_target_latency = HZ * 3/10; /* 300 ms */
> >  static const int cfq_hist_divisor = 4;
> >
> > @@ -198,6 +199,8 @@ struct cfq_group {
> >  	struct hlist_node cfqd_node;
> >  	atomic_t ref;
> >  #endif
> > +	/* number of requests that are on the dispatch list or inside driver */
> > +	int dispatched;
> >  };
> >
> >  /*
> > @@ -271,6 +274,7 @@ struct cfq_data {
> >  	unsigned int cfq_slice[2];
> >  	unsigned int cfq_slice_async_rq;
> >  	unsigned int cfq_slice_idle;
> > +	unsigned int cfq_group_idle;
> >  	unsigned int cfq_latency;
> >  	unsigned int cfq_group_isolation;
> >
> > @@ -1856,6 +1860,9 @@ static bool cfq_should_idle(struct cfq_data *cfqd, struct cfq_queue *cfqq)
> >  	BUG_ON(!service_tree);
> >  	BUG_ON(!service_tree->count);
> >
> > +	if (!cfqd->cfq_slice_idle)
> > +		return false;
> > +
> >  	/* We never do for idle class queues. */
> >  	if (prio == IDLE_WORKLOAD)
> >  		return false;
> > @@ -1880,7 +1887,7 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
> >  {
> >  	struct cfq_queue *cfqq = cfqd->active_queue;
> >  	struct cfq_io_context *cic;
> > -	unsigned long sl;
> > +	unsigned long sl, group_idle = 0;
> >
> >  	/*
> >  	 * SSD device without seek penalty, disable idling. But only do so
> > @@ -1896,8 +1903,13 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
> >  	/*
> >  	 * idle is disabled, either manually or by past process history
> >  	 */
> > -	if (!cfqd->cfq_slice_idle || !cfq_should_idle(cfqd, cfqq))
> > -		return;
> > +	if (!cfqd->cfq_slice_idle || !cfq_should_idle(cfqd, cfqq)) {
>
> The check for cfqd->cfq_slice_idle is now redundant (as it's done in
> cfq_should_idle).

Yep, I will get rid of the extra check.

> > @@ -2215,7 +2236,7 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
> >  			cfqq = NULL;
> >  			goto keep_queue;
> >  		} else
> > -			goto expire;
> > +			goto check_group_idle;
> >  	}
> >
> >  	/*
> > @@ -2249,6 +2270,17 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
> >  		goto keep_queue;
> >  	}
> >
> > +	/*
> > +	 * If group idle is enabled and there are requests dispatched from
> > +	 * this group, wait for requests to complete.
> > +	 */
> > +check_group_idle:
> > +	if (cfqd->cfq_group_idle && cfqq->cfqg->nr_cfqq == 1
> > +	    && cfqq->cfqg->dispatched) {
> > +		cfqq = NULL;
> > +		goto keep_queue;
> > +	}
>
> I really wish we could roll all of this logic into cfq_should_idle.

Currently the whether-to-idle-on-a-queue logic is also part of
select_queue(), so for consistency it makes sense to keep the group logic
in select_queue() as well.

> > @@ -3420,7 +3453,10 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
> >  	 * the queue.
> >  	 */
> >  	if (cfq_should_wait_busy(cfqd, cfqq)) {
> > -		cfqq->slice_end = jiffies + cfqd->cfq_slice_idle;
> > +		unsigned long extend_sl = cfqd->cfq_slice_idle;
> > +		if (!cfqd->cfq_slice_idle)
> > +			extend_sl = cfqd->cfq_group_idle;
> > +		cfqq->slice_end = jiffies + extend_sl;
>
> So, if slice_idle and group_idle are both set, slice_idle trumps
> group_idle?  Did you give that case any thought?  If it doesn't make
> sense to configure both, then we should probably make sure they can't
> both be set.

Actually, the wait-busy logic makes sense in the slice_idle case, where we
give a queue an extended slice so that a queue/group does not get deleted
immediately after completing its slice and hence lose its share. With
slice_idle=0, a single queue in a group becomes the same case as a single
queue in a group with slice_idle=8.

So yes, slice_idle trumps group_idle; group_idle kicks in only when
slice_idle=0. I don't think having both set by default does any harm --
what good would slice_idle=8 with group_idle=0 do? Requiring group_idle to
be set separately would just mean more work for a user who plans to set
slice_idle=0.

Thanks
Vivek

> Cheers,
> Jeff
* [PATCH 3/3] cfq-iosched: Print per slice sectors dispatched in blktrace
  2010-07-19 17:20 [RFC PATCH] cfq-iosched: Implement group idle V2 Vivek Goyal
  2010-07-19 17:20 ` [PATCH 1/3] cfq-iosched: Improve time slice charging logic Vivek Goyal
  2010-07-19 17:20 ` [PATCH 2/3] cfq-iosched: Implement a new tunable group_idle Vivek Goyal
@ 2010-07-19 17:20 ` Vivek Goyal
  2010-07-19 18:59   ` Jeff Moyer
  2010-07-19 22:16   ` Divyesh Shah
  2 siblings, 2 replies; 15+ messages in thread
From: Vivek Goyal @ 2010-07-19 17:20 UTC (permalink / raw)
  To: linux-kernel, axboe; +Cc: nauman, dpshah, guijianfeng, jmoyer, czoccolo, vgoyal

- Divyesh had gotten rid of this code in the past. I want to re-introduce it
  back as it helps me a lot during debugging.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/cfq-iosched.c |    8 ++++++--
 1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index b23d7f4..1f0d4ea 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -148,6 +148,8 @@ struct cfq_queue {
        struct cfq_queue *new_cfqq;
        struct cfq_group *cfqg;
        struct cfq_group *orig_cfqg;
+       /* Number of sectors dispatched from queue in single dispatch round */
+       unsigned long nr_sectors;
 };

 /*
@@ -926,8 +928,8 @@ static inline unsigned int cfq_cfqq_slice_usage(struct cfq_queue *cfqq)
                slice_used = cfqq->allocated_slice;
        }

-       cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u, sl_disp=%u", slice_used,
-                       cfqq->slice_dispatch);
+       cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u, disp=%u sect=%u",
+                       slice_used, cfqq->slice_dispatch, cfqq->nr_sectors);
        return slice_used;
 }

@@ -1608,6 +1610,7 @@ static void __cfq_set_active_queue(struct cfq_data *cfqd,
        cfqq->allocated_slice = 0;
        cfqq->slice_end = 0;
        cfqq->slice_dispatch = 0;
+       cfqq->nr_sectors = 0;

        cfq_clear_cfqq_wait_request(cfqq);
        cfq_clear_cfqq_must_dispatch(cfqq);
@@ -1970,6 +1973,7 @@ static void cfq_dispatch_insert(struct request_queue *q, struct request *rq)
        elv_dispatch_sort(q, rq);

        cfqd->rq_in_flight[cfq_cfqq_sync(cfqq)]++;
+       cfqq->nr_sectors += blk_rq_sectors(rq);
        cfq_blkiocg_update_dispatch_stats(&cfqq->cfqg->blkg, blk_rq_bytes(rq),
                                        rq_data_dir(rq), rq_is_sync(rq));
 }
-- 
1.7.1.1
* Re: [PATCH 3/3] cfq-iosched: Print per slice sectors dispatched in blktrace
  2010-07-19 17:20 ` [PATCH 3/3] cfq-iosched: Print per slice sectors dispatched in blktrace Vivek Goyal
@ 2010-07-19 18:59   ` Jeff Moyer
  2010-07-19 22:16   ` Divyesh Shah
  1 sibling, 0 replies; 15+ messages in thread
From: Jeff Moyer @ 2010-07-19 18:59 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: linux-kernel, axboe, nauman, dpshah, guijianfeng, czoccolo

Vivek Goyal <vgoyal@redhat.com> writes:

> - Divyesh had gotten rid of this code in the past. I want to re-introduce it
>   back as it helps me a lot during debugging.

Fine with me.

Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
* Re: [PATCH 3/3] cfq-iosched: Print per slice sectors dispatched in blktrace
  2010-07-19 17:20 ` [PATCH 3/3] cfq-iosched: Print per slice sectors dispatched in blktrace Vivek Goyal
  2010-07-19 18:59   ` Jeff Moyer
@ 2010-07-19 22:16   ` Divyesh Shah
  1 sibling, 0 replies; 15+ messages in thread
From: Divyesh Shah @ 2010-07-19 22:16 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: linux-kernel, axboe, nauman, guijianfeng, jmoyer, czoccolo

Looks good to me.
Reviewed-by: Divyesh Shah <dpshah@google.com>

On Mon, Jul 19, 2010 at 10:20 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> - Divyesh had gotten rid of this code in the past. I want to re-introduce it
> back as it helps me a lot during debugging.
>
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> ---
>  block/cfq-iosched.c |    8 ++++++--
>  1 files changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index b23d7f4..1f0d4ea 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -148,6 +148,8 @@ struct cfq_queue {
>  	struct cfq_queue *new_cfqq;
>  	struct cfq_group *cfqg;
>  	struct cfq_group *orig_cfqg;
> +	/* Number of sectors dispatched from queue in single dispatch round */
> +	unsigned long nr_sectors;
>  };
>
>  /*
> @@ -926,8 +928,8 @@ static inline unsigned int cfq_cfqq_slice_usage(struct cfq_queue *cfqq)
>  		slice_used = cfqq->allocated_slice;
>  	}
>
> -	cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u, sl_disp=%u", slice_used,
> -			cfqq->slice_dispatch);
> +	cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u, disp=%u sect=%u",
> +			slice_used, cfqq->slice_dispatch, cfqq->nr_sectors);
>  	return slice_used;
>  }
>
> @@ -1608,6 +1610,7 @@ static void __cfq_set_active_queue(struct cfq_data *cfqd,
>  	cfqq->allocated_slice = 0;
>  	cfqq->slice_end = 0;
>  	cfqq->slice_dispatch = 0;
> +	cfqq->nr_sectors = 0;
>
>  	cfq_clear_cfqq_wait_request(cfqq);
>  	cfq_clear_cfqq_must_dispatch(cfqq);
> @@ -1970,6 +1973,7 @@ static void cfq_dispatch_insert(struct request_queue *q, struct request *rq)
>  	elv_dispatch_sort(q, rq);
>
>  	cfqd->rq_in_flight[cfq_cfqq_sync(cfqq)]++;
> +	cfqq->nr_sectors += blk_rq_sectors(rq);
>  	cfq_blkiocg_update_dispatch_stats(&cfqq->cfqg->blkg, blk_rq_bytes(rq),
>  					rq_data_dir(rq), rq_is_sync(rq));
>  }
> --
> 1.7.1.1
>
* [RFC PATCH] cfq-iosched: Implement group idle V2
@ 2010-07-19 17:14 Vivek Goyal
  2010-07-19 17:14 ` [PATCH 1/3] cfq-iosched: Improve time slice charging logic Vivek Goyal
  0 siblings, 1 reply; 15+ messages in thread
From: Vivek Goyal @ 2010-07-19 17:14 UTC (permalink / raw)
  To: linux-kernel, jens.axboe
  Cc: nauman, dpshah, guijianfeng, jmoyer, czoccolo, vgoyal

Hi,

This is V2 of the group_idle implementation patchset. I have done some more
testing since V1 and fixed a couple of bugs.

What's the problem
------------------
On high end storage (I tested on an HP EVA storage array with 12 SATA disks
in RAID 5), CFQ's model of dispatching requests from a single queue at a
time (sequential readers, sync writers, etc.) becomes a bottleneck. Often we
don't drive enough request queue depth to keep all the disks busy, and we
suffer a lot in terms of overall throughput.

These problems primarily originate from two things: idling on a per-cfq-queue
basis, and the quantum (dispatching a limited number of requests from a
single queue while not allowing dispatch from other queues). Once you set
slice_idle=0 and raise the quantum, most of CFQ's problems on higher end
storage disappear.

This problem also becomes visible in the IO controller, where one creates
multiple groups and gets the fairness, but overall throughput is lower. In
the following table, I am running an increasing number of sequential readers
(1, 2, 4, 8) in 8 groups of weight 100 to 800.

Kernel=2.6.35-rc5-gi-sl-accounting+
GROUPMODE=1          NRGRP=8
DIR=/mnt/iostestmnt/fio        DEV=/dev/dm-4
Workload=bsr       iosched=cfq     Filesz=512M bs=4K
group_isolation=1  slice_idle=8    group_idle=8    quantum=8
=========================================================================
AVERAGE[bsr]    [bw in KB/s]
-------
job     Set NR  test1  test2  test3  test4  test5  test6  test7  test8  total
---     --- --  ---------------------------------------------------------------
bsr     1   1   6245   12776  16591  23471  28746  36799  43031  49778  217437
bsr     1   2   5100   11063  17216  23136  23738  30391  35910  40874  187428
bsr     1   4   4623   9718   14746  18356  22875  30407  33215  38073  172013
bsr     1   8   4720   10143  13499  19115  22157  29126  31688  30784  161232

Notice that overall throughput is just around 160MB/s with 8 sequential
readers in each group.

With this patch set, I have set slice_idle=0 and re-run the same test.

Kernel=2.6.35-rc5-gi-sl-accounting+
GROUPMODE=1          NRGRP=8
DIR=/mnt/iostestmnt/fio        DEV=/dev/dm-4
Workload=bsr       iosched=cfq     Filesz=512M bs=4K
group_isolation=1  slice_idle=0    group_idle=8    quantum=8
=========================================================================
AVERAGE[bsr]    [bw in KB/s]
-------
job     Set NR  test1  test2  test3  test4  test5  test6  test7  test8  total
---     --- --  ---------------------------------------------------------------
bsr     1   1   6789   12764  17174  23111  28528  36807  42753  48826  216752
bsr     1   2   9845   20617  30521  39061  45514  51607  63683  63770  324618
bsr     1   4   14835  24241  42444  55207  45914  51318  54661  60318  348938
bsr     1   8   12022  24478  36732  48651  54333  60974  64856  72930  374976

Notice how overall throughput has shot up to 374MB/s while retaining the
ability to do IO control.

This is not the default mode. The new tunable, group_idle, allows one to set
slice_idle=0 to disable some of CFQ's features and primarily use the group
service differentiation feature.

If you have thoughts on other ways of solving the problem, I am all ears.

Thanks
Vivek
* [PATCH 1/3] cfq-iosched: Improve time slice charging logic
  2010-07-19 17:14 [RFC PATCH] cfq-iosched: Implement group idle V2 Vivek Goyal
@ 2010-07-19 17:14 ` Vivek Goyal
  0 siblings, 0 replies; 15+ messages in thread
From: Vivek Goyal @ 2010-07-19 17:14 UTC (permalink / raw)
  To: linux-kernel, jens.axboe
  Cc: nauman, dpshah, guijianfeng, jmoyer, czoccolo, vgoyal

- Currently in CFQ there are many situations where we don't know how much
  time slice has been consumed by a queue. For example, all the random
  reader/writer queues, where we don't idle on individual queues and expire
  the queue immediately after the request dispatch.

- In this case, the time consumed by a queue is just the time of a memory
  copy operation. Actual time measurement is possible only if we idle on a
  queue and allow dispatch from it for a significant amount of time.

- As of today, in such cases we calculate the time since the dispatch from
  the queue started and charge all that time. Generally this rounds to 1
  jiffy, but in some cases it can be more. For example, if we are driving a
  high request queue depth and the driver is too busy and does not ask for
  new requests for 8-10 jiffies, the active queue gets charged very
  unfairly.

- So fundamentally, the whole notion of charging for time slices is valid
  only if we have been idling on the queue. Otherwise, on an NCQ device,
  there might be other requests in the queue and we cannot do the time
  slice calculation.

- This patch tweaks the slice charging logic a bit so that in the cases
  where we can't know the amount of time, we start charging in terms of
  number of requests dispatched (IOPS). This practically switches the CFQ
  fairness model to fairness in terms of IOPS with slice_idle=0.

- As of today this will primarily be useful only with the group_idle
  patches, so that we get fairness in terms of IOPS across groups. The idea
  is that on fast storage one can run CFQ with slice_idle=0 and still get
  the IO controller working without losing too much throughput.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/cfq-iosched.c |   24 +++++++++++++++++++++---
 1 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 7982b83..f44064c 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -896,16 +896,34 @@ static inline unsigned int cfq_cfqq_slice_usage(struct cfq_queue *cfqq)
                 * if there are mutiple queues in the group, each can dispatch
                 * a single request on seeky media and cause lots of seek time
                 * and group will never know it.
+                *
+                * If drive is NCQ and we are driving deep queue depths, then
+                * it is not reasonable to charge the slice since dispatch
+                * started because this time will include time taken by all
+                * the other requests in the queue.
+                *
+                * Actually there is no reasonable way to know the disk time
+                * here and we need to come up with some approximation. If
+                * disk is non NCQ, we should be driving request queue depth
+                * 1, then charge for time since dispatch start and this will
+                * account for seek time properly on seeky media. If request
+                * queue depth is high, then charge for number of requests
+                * dispatched from the queue. This will sort of becoming
+                * charging in terms of IOPS.
                 */
-               slice_used = max_t(unsigned, (jiffies - cfqq->dispatch_start),
-                                       1);
+               if (cfqq->cfqd->hw_tag == 0)
+                       slice_used = max_t(unsigned,
+                               (jiffies - cfqq->dispatch_start), 1);
+               else
+                       slice_used = cfqq->slice_dispatch;
        } else {
                slice_used = jiffies - cfqq->slice_start;
                if (slice_used > cfqq->allocated_slice)
                        slice_used = cfqq->allocated_slice;
        }

-       cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u", slice_used);
+       cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u, sl_disp=%u", slice_used,
+                       cfqq->slice_dispatch);
        return slice_used;
 }
-- 
1.7.1.1