* Re: IO scheduler based IO controller V10 @ 2009-10-02 10:55 Corrado Zoccolo 2009-10-02 11:04 ` Jens Axboe 2009-10-02 12:49 ` Vivek Goyal 0 siblings, 2 replies; 25+ messages in thread From: Corrado Zoccolo @ 2009-10-02 10:55 UTC (permalink / raw) To: Jens Axboe Cc: Ingo Molnar, Mike Galbraith, Vivek Goyal, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds, riel Hi Jens, On Fri, Oct 2, 2009 at 11:28 AM, Jens Axboe <jens.axboe@oracle.com> wrote: > On Fri, Oct 02 2009, Ingo Molnar wrote: >> >> * Jens Axboe <jens.axboe@oracle.com> wrote: >> > > It's really not that simple, if we go and do easy latency bits, then > throughput drops 30% or more. You can't say it's black and white latency > vs throughput issue, that's just not how the real world works. The > server folks would be most unpleased. Could we be more selective when the latency optimization is introduced? The code that is currently touched by Vivek's patch is: if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle || (cfqd->hw_tag && CIC_SEEKY(cic))) enable_idle = 0; basically, when fairness=1, it becomes just: if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle) enable_idle = 0; Note that, even if we enable idling here, the cfq_arm_slice_timer will use a different idle window for seeky (2ms) than for normal I/O. I think that the 2ms idle window is good for a single rotational SATA disk scenario, even if it supports NCQ. Realistic access times for those disks are still around 8ms (but it is proportional to seek length), and waiting 2ms to see if we get a nearby request may pay off, not only in latency and fairness, but also in throughput. What we don't want to do is to enable idling for NCQ enabled SSDs (and this is already taken care of in cfq_arm_slice_timer) or for hardware RAIDs. If we agree that hardware RAIDs should be marked as non-rotational, then that code could become: if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle || (blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag && CIC_SEEKY(cic))) enable_idle = 0; else if (sample_valid(cic->ttime_samples)) { unsigned idle_time = CIC_SEEKY(cic) ? CFQ_MIN_TT : cfqd->cfq_slice_idle; if (cic->ttime_mean > idle_time) enable_idle = 0; else enable_idle = 1; } Thanks, Corrado > > -- > Jens Axboe > -- __________________________________________________________________________ dott. Corrado Zoccolo mailto:czoccolo@gmail.com PhD - Department of Computer Science - University of Pisa, Italy -------------------------------------------------------------------------- ^ permalink raw reply [flat|nested] 25+ messages in thread
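As a quick sanity check of the proposal above, here is a minimal user-space truth table contrasting the current seeky clause with the proposed one. It only models the hw_tag/CIC_SEEKY part of the condition (the nr_tasks, slice_idle and ttime checks are left out); the only combination that changes is a seeky queue on a rotational NCQ disk, which would now keep its short 2ms idle window:

#include <stdio.h>

/* returns 1 if idling is kept, 0 if it is disabled by the seeky clause */
static int current_keeps_idle(int hw_tag, int seeky)
{
	return !(hw_tag && seeky);
}

static int proposed_keeps_idle(int nonrot, int hw_tag, int seeky)
{
	return !(nonrot && hw_tag && seeky);
}

int main(void)
{
	int nonrot, hw_tag, seeky;

	printf("nonrot hw_tag seeky | current proposed\n");
	for (nonrot = 0; nonrot <= 1; nonrot++)
		for (hw_tag = 0; hw_tag <= 1; hw_tag++)
			for (seeky = 0; seeky <= 1; seeky++)
				printf("   %d      %d     %d   |    %d       %d\n",
				       nonrot, hw_tag, seeky,
				       current_keeps_idle(hw_tag, seeky),
				       proposed_keeps_idle(nonrot, hw_tag, seeky));
	return 0;
}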
* Re: IO scheduler based IO controller V10 2009-10-02 10:55 IO scheduler based IO controller V10 Corrado Zoccolo @ 2009-10-02 11:04 ` Jens Axboe 2009-10-02 12:49 ` Vivek Goyal 1 sibling, 0 replies; 25+ messages in thread From: Jens Axboe @ 2009-10-02 11:04 UTC (permalink / raw) To: Corrado Zoccolo Cc: Ingo Molnar, Mike Galbraith, Vivek Goyal, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds, riel On Fri, Oct 02 2009, Corrado Zoccolo wrote: > Hi Jens, > On Fri, Oct 2, 2009 at 11:28 AM, Jens Axboe <jens.axboe@oracle.com> wrote: > > On Fri, Oct 02 2009, Ingo Molnar wrote: > >> > >> * Jens Axboe <jens.axboe@oracle.com> wrote: > >> > > > > It's really not that simple, if we go and do easy latency bits, then > > throughput drops 30% or more. You can't say it's black and white latency > > vs throughput issue, that's just not how the real world works. The > > server folks would be most unpleased. > Could we be more selective when the latency optimization is introduced? > > The code that is currently touched by Vivek's patch is: > if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle || > (cfqd->hw_tag && CIC_SEEKY(cic))) > enable_idle = 0; > basically, when fairness=1, it becomes just: > if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle) > enable_idle = 0; > > Note that, even if we enable idling here, the cfq_arm_slice_timer will use > a different idle window for seeky (2ms) than for normal I/O. > > I think that the 2ms idle window is good for a single rotational SATA > disk scenario, even if it supports NCQ. Realistic access times for > those disks are still around 8ms (but it is proportional to seek > length), and waiting 2ms to see if we get a nearby request may pay > off, not only in latency and fairness, but also in throughput. I agree, that change looks good. > What we don't want to do is to enable idling for NCQ enabled SSDs > (and this is already taken care of in cfq_arm_slice_timer) or for hardware RAIDs. Right, it was part of the bigger SSD optimization stuff I did a few revisions back. > If we agree that hardware RAIDs should be marked as non-rotational, then that > code could become: > > if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle || > (blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag && CIC_SEEKY(cic))) > enable_idle = 0; > else if (sample_valid(cic->ttime_samples)) { > unsigned idle_time = CIC_SEEKY(cic) ? CFQ_MIN_TT : cfqd->cfq_slice_idle; > if (cic->ttime_mean > idle_time) > enable_idle = 0; > else > enable_idle = 1; > } Yes, agree on that too. We probably should make a different flag for hardware raids, telling the io scheduler that this device is really composed of several others. If it's composed only of SSDs (or has a frontend similar to that), then non-rotational applies. But yes, we should pass that information down. -- Jens Axboe ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: IO scheduler based IO controller V10 2009-10-02 10:55 IO scheduler based IO controller V10 Corrado Zoccolo 2009-10-02 11:04 ` Jens Axboe @ 2009-10-02 12:49 ` Vivek Goyal 2009-10-02 15:27 ` Corrado Zoccolo 1 sibling, 1 reply; 25+ messages in thread From: Vivek Goyal @ 2009-10-02 12:49 UTC (permalink / raw) To: Corrado Zoccolo Cc: Jens Axboe, Ingo Molnar, Mike Galbraith, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds, riel On Fri, Oct 02, 2009 at 12:55:25PM +0200, Corrado Zoccolo wrote: > Hi Jens, > On Fri, Oct 2, 2009 at 11:28 AM, Jens Axboe <jens.axboe@oracle.com> wrote: > > On Fri, Oct 02 2009, Ingo Molnar wrote: > >> > >> * Jens Axboe <jens.axboe@oracle.com> wrote: > >> > > > > It's really not that simple, if we go and do easy latency bits, then > > throughput drops 30% or more. You can't say it's black and white latency > > vs throughput issue, that's just not how the real world works. The > > server folks would be most unpleased. > Could we be more selective when the latency optimization is introduced? > > The code that is currently touched by Vivek's patch is: > if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle || > (cfqd->hw_tag && CIC_SEEKY(cic))) > enable_idle = 0; > basically, when fairness=1, it becomes just: > if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle) > enable_idle = 0; > Actually I am not touching this code. Looking at the V10, I have not changed anything here in the idling code. I think we are seeing latency improvements with fairness=1 because CFQ does pure round robin and once a seeky reader expires, it is put at the end of the queue. I retained the same behavior if fairness=0, but if fairness=1, then I don't put the seeky reader at the end of the queue; instead it gets a vdisktime based on the disk time it has used. So it should get placed ahead of sync readers. I think the following is the code snippet in "elevator-fq.c" which is making a difference. /* We don't want to charge more than allocated slice otherwise this queue can miss one dispatch round doubling max latencies. On the other hand we don't want to charge less than allocated slice as we stick to CFQ theme of queue losing its share if it does not use the slice and moves to the back of service tree (almost). */ if (!ioq->efqd->fairness) queue_charge = allocated_slice; So if a sync reader consumes 100ms and a seeky reader dispatches only one request, then in CFQ the seeky reader gets to dispatch its next request after another 100ms. With fairness=1, it should get a lower vdisktime when it comes with a new request because its last slice usage was less (like CFS sleepers, as Mike said). But this will make a difference only if there is more than one process in the system; otherwise a vtime jump will take place by the time the seeky reader gets backlogged. Anyway, once I started timestamping the queues and keeping a cache of expired queues, any queue which gets a new request almost immediately should get a lower vdisktime assigned if it did not use the full time slice in the previous dispatch round. Hence with fairness=1, seeky readers kind of get a larger (fair) share of the disk, because they are now placed ahead of streaming readers and hence get better latencies.
In short, most likely, better latencies are being experienced because seeky reader is getting lower time stamp (vdisktime), because it did not use its full time slice in previous dispatch round, and not because we kept the idling enabled on seeky reader. Thanks Vivek > Note that, even if we enable idling here, the cfq_arm_slice_timer will use > a different idle window for seeky (2ms) than for normal I/O. > > I think that the 2ms idle window is good for a single rotational SATA disk scenario, > even if it supports NCQ. Realistic access times for those disks are still around 8ms > (but it is proportional to seek lenght), and waiting 2ms to see if we get a nearby > request may pay off, not only in latency and fairness, but also in throughput. > > What we don't want to do is to enable idling for NCQ enabled SSDs > (and this is already taken care in cfq_arm_slice_timer) or for hardware RAIDs. > If we agree that hardware RAIDs should be marked as non-rotational, then that > code could become: > > if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle || > (blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag && CIC_SEEKY(cic))) > enable_idle = 0; > else if (sample_valid(cic->ttime_samples)) { > unsigned idle_time = CIC_SEEKY(cic) ? CFQ_MIN_TT : cfqd->cfq_slice_idle; > if (cic->ttime_mean > idle_time) > enable_idle = 0; > else > enable_idle = 1; > } > > Thanks, > Corrado > > > > > -- > > Jens Axboe > > > > -- > __________________________________________________________________________ > > dott. Corrado Zoccolo mailto:czoccolo@gmail.com > PhD - Department of Computer Science - University of Pisa, Italy > -------------------------------------------------------------------------- ^ permalink raw reply [flat|nested] 25+ messages in thread
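To make the charging argument above concrete, here is a minimal user-space sketch (not the actual elevator-fq.c code) that assumes a simplified vdisktime which just advances by the charged amount; weights, vtime jumps and idling are ignored. It shows why charging the full allocated slice versus only the used time changes which queue is served next:

#include <stdio.h>

struct queue {
	unsigned long vdisktime;	/* key used to sort the service tree */
	unsigned long used;		/* ms of disk time actually used last round */
};

/* fairness=0: charge the full allocated slice; fairness=1: charge only the usage */
static void charge(struct queue *q, unsigned long allocated_slice, int fairness)
{
	q->vdisktime += fairness ? q->used : allocated_slice;
}

int main(void)
{
	int fairness;

	for (fairness = 0; fairness <= 1; fairness++) {
		struct queue streaming = { 0, 100 };	/* used its full 100ms slice */
		struct queue seeky     = { 0, 1 };	/* dispatched ~1ms worth of requests */

		charge(&streaming, 100, fairness);
		charge(&seeky, 100, fairness);

		printf("fairness=%d: streaming vdisktime=%lu, seeky vdisktime=%lu -> %s\n",
		       fairness, streaming.vdisktime, seeky.vdisktime,
		       seeky.vdisktime < streaming.vdisktime ?
		       "seeky queue placed ahead" : "no advantage for the seeky queue");
	}
	return 0;
}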
* Re: IO scheduler based IO controller V10 2009-10-02 12:49 ` Vivek Goyal @ 2009-10-02 15:27 ` Corrado Zoccolo 2009-10-02 15:31 ` Vivek Goyal 2009-10-02 15:32 ` Mike Galbraith 0 siblings, 2 replies; 25+ messages in thread From: Corrado Zoccolo @ 2009-10-02 15:27 UTC (permalink / raw) To: Vivek Goyal, Mike Galbraith Cc: Jens Axboe, Ingo Molnar, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds, riel On Fri, Oct 2, 2009 at 2:49 PM, Vivek Goyal <vgoyal@redhat.com> wrote: > On Fri, Oct 02, 2009 at 12:55:25PM +0200, Corrado Zoccolo wrote: > > Actually I am not touching this code. Looking at the V10, I have not > changed anything here in idling code. I based my analysis on the original patch: http://lkml.indiana.edu/hypermail/linux/kernel/0907.1/01793.html Mike, can you confirm which version of the fairness patch you used in your tests? Corrado > Thanks > Vivek > ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: IO scheduler based IO controller V10 2009-10-02 15:27 ` Corrado Zoccolo @ 2009-10-02 15:31 ` Vivek Goyal 2009-10-02 15:32 ` Mike Galbraith 1 sibling, 0 replies; 25+ messages in thread From: Vivek Goyal @ 2009-10-02 15:31 UTC (permalink / raw) To: Corrado Zoccolo Cc: Mike Galbraith, Jens Axboe, Ingo Molnar, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds, riel On Fri, Oct 02, 2009 at 05:27:55PM +0200, Corrado Zoccolo wrote: > On Fri, Oct 2, 2009 at 2:49 PM, Vivek Goyal <vgoyal@redhat.com> wrote: > > On Fri, Oct 02, 2009 at 12:55:25PM +0200, Corrado Zoccolo wrote: > > > > Actually I am not touching this code. Looking at the V10, I have not > > changed anything here in idling code. > > I based my analisys on the original patch: > http://lkml.indiana.edu/hypermail/linux/kernel/0907.1/01793.html > Oh.., you are talking about fairness for seeky process patch. I thought you are talking about current IO controller patches. Actually they both have this notion of "fairness=1" parameter but do different things in patches, hence the confusion. Thanks Vivek > Mike, can you confirm which version of the fairness patch did you use > in your tests? > > Corrado > > > Thanks > > Vivek > > ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: IO scheduler based IO controller V10 2009-10-02 15:27 ` Corrado Zoccolo 2009-10-02 15:31 ` Vivek Goyal @ 2009-10-02 15:32 ` Mike Galbraith 2009-10-02 15:40 ` Vivek Goyal 1 sibling, 1 reply; 25+ messages in thread From: Mike Galbraith @ 2009-10-02 15:32 UTC (permalink / raw) To: Corrado Zoccolo Cc: Vivek Goyal, Jens Axboe, Ingo Molnar, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds, riel On Fri, 2009-10-02 at 17:27 +0200, Corrado Zoccolo wrote: > On Fri, Oct 2, 2009 at 2:49 PM, Vivek Goyal <vgoyal@redhat.com> wrote: > > On Fri, Oct 02, 2009 at 12:55:25PM +0200, Corrado Zoccolo wrote: > > > > Actually I am not touching this code. Looking at the V10, I have not > > changed anything here in idling code. > > I based my analisys on the original patch: > http://lkml.indiana.edu/hypermail/linux/kernel/0907.1/01793.html > > Mike, can you confirm which version of the fairness patch did you use > in your tests? That would be this one-liner. o CFQ provides fair access to disk in terms of disk time used to processes. Fairness is provided for the applications which have their think time with in slice_idle (8ms default) limit. o CFQ currently disables idling for seeky processes. So even if a process has think time with-in slice_idle limits, it will still not get fair share of disk. Disabling idling for a seeky process seems good from throughput perspective but not necessarily from fairness perspecitve. 0 Do not disable idling based on seek pattern of process if a user has set /sys/block/<disk>/queue/iosched/fairness = 1. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> --- block/cfq-iosched.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6/block/cfq-iosched.c =================================================================== --- linux-2.6.orig/block/cfq-iosched.c +++ linux-2.6/block/cfq-iosched.c @@ -1953,7 +1953,7 @@ cfq_update_idle_window(struct cfq_data * enable_idle = old_idle = cfq_cfqq_idle_window(cfqq); if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle || - (cfqd->hw_tag && CIC_SEEKY(cic))) + (!cfqd->cfq_fairness && cfqd->hw_tag && CIC_SEEKY(cic))) enable_idle = 0; else if (sample_valid(cic->ttime_samples)) { if (cic->ttime_mean > cfqd->cfq_slice_idle) ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: IO scheduler based IO controller V10 2009-10-02 15:32 ` Mike Galbraith @ 2009-10-02 15:40 ` Vivek Goyal 2009-10-02 16:03 ` Mike Galbraith 2009-10-02 16:50 ` Valdis.Kletnieks 0 siblings, 2 replies; 25+ messages in thread From: Vivek Goyal @ 2009-10-02 15:40 UTC (permalink / raw) To: Mike Galbraith Cc: Corrado Zoccolo, Jens Axboe, Ingo Molnar, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds, riel On Fri, Oct 02, 2009 at 05:32:00PM +0200, Mike Galbraith wrote: > On Fri, 2009-10-02 at 17:27 +0200, Corrado Zoccolo wrote: > > On Fri, Oct 2, 2009 at 2:49 PM, Vivek Goyal <vgoyal@redhat.com> wrote: > > > On Fri, Oct 02, 2009 at 12:55:25PM +0200, Corrado Zoccolo wrote: > > > > > > Actually I am not touching this code. Looking at the V10, I have not > > > changed anything here in idling code. > > > > I based my analisys on the original patch: > > http://lkml.indiana.edu/hypermail/linux/kernel/0907.1/01793.html > > > > Mike, can you confirm which version of the fairness patch did you use > > in your tests? > > That would be this one-liner. > Ok. Thanks. Sorry, I got confused and thought that you are using "io controller patches" with fairness=1. In that case, Corrado's suggestion of refining it further and disabling idling for seeky process only on non-rotational media (SSD and hardware RAID), makes sense to me. Thanks Vivek > o CFQ provides fair access to disk in terms of disk time used to processes. > Fairness is provided for the applications which have their think time with > in slice_idle (8ms default) limit. > > o CFQ currently disables idling for seeky processes. So even if a process > has think time with-in slice_idle limits, it will still not get fair share > of disk. Disabling idling for a seeky process seems good from throughput > perspective but not necessarily from fairness perspecitve. > > 0 Do not disable idling based on seek pattern of process if a user has set > /sys/block/<disk>/queue/iosched/fairness = 1. > > Signed-off-by: Vivek Goyal <vgoyal@redhat.com> > --- > block/cfq-iosched.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > Index: linux-2.6/block/cfq-iosched.c > =================================================================== > --- linux-2.6.orig/block/cfq-iosched.c > +++ linux-2.6/block/cfq-iosched.c > @@ -1953,7 +1953,7 @@ cfq_update_idle_window(struct cfq_data * > enable_idle = old_idle = cfq_cfqq_idle_window(cfqq); > > if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle || > - (cfqd->hw_tag && CIC_SEEKY(cic))) > + (!cfqd->cfq_fairness && cfqd->hw_tag && CIC_SEEKY(cic))) > enable_idle = 0; > else if (sample_valid(cic->ttime_samples)) { > if (cic->ttime_mean > cfqd->cfq_slice_idle) > ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: IO scheduler based IO controller V10 2009-10-02 15:40 ` Vivek Goyal @ 2009-10-02 16:03 ` Mike Galbraith 2009-10-02 16:50 ` Valdis.Kletnieks 1 sibling, 0 replies; 25+ messages in thread From: Mike Galbraith @ 2009-10-02 16:03 UTC (permalink / raw) To: Vivek Goyal Cc: Corrado Zoccolo, Jens Axboe, Ingo Molnar, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds, riel On Fri, 2009-10-02 at 11:40 -0400, Vivek Goyal wrote: > On Fri, Oct 02, 2009 at 05:32:00PM +0200, Mike Galbraith wrote: > > On Fri, 2009-10-02 at 17:27 +0200, Corrado Zoccolo wrote: > > > On Fri, Oct 2, 2009 at 2:49 PM, Vivek Goyal <vgoyal@redhat.com> wrote: > > > > On Fri, Oct 02, 2009 at 12:55:25PM +0200, Corrado Zoccolo wrote: > > > > > > > > Actually I am not touching this code. Looking at the V10, I have not > > > > changed anything here in idling code. > > > > > > I based my analisys on the original patch: > > > http://lkml.indiana.edu/hypermail/linux/kernel/0907.1/01793.html > > > > > > Mike, can you confirm which version of the fairness patch did you use > > > in your tests? > > > > That would be this one-liner. > > > > Ok. Thanks. Sorry, I got confused and thought that you are using "io > controller patches" with fairness=1. > > In that case, Corrado's suggestion of refining it further and disabling idling > for seeky process only on non-rotational media (SSD and hardware RAID), makes > sense to me. One thing that might help with that is to have new tasks start out life meeting the seeky criteria. If there's anything going on, they will be. -Mike ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: IO scheduler based IO controller V10 2009-10-02 15:40 ` Vivek Goyal 2009-10-02 16:03 ` Mike Galbraith @ 2009-10-02 16:50 ` Valdis.Kletnieks 2009-10-02 19:58 ` Vivek Goyal 1 sibling, 1 reply; 25+ messages in thread From: Valdis.Kletnieks @ 2009-10-02 16:50 UTC (permalink / raw) To: Vivek Goyal Cc: Mike Galbraith, Corrado Zoccolo, Jens Axboe, Ingo Molnar, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds, riel [-- Attachment #1: Type: text/plain, Size: 563 bytes --] On Fri, 02 Oct 2009 11:40:20 EDT, Vivek Goyal said: > In that case, Corrado's suggestion of refining it further and disabling idling > for seeky process only on non-rotational media (SSD and hardware RAID), makes > sense to me. Umm... I got petabytes of hardware RAID across the hall that very definitely *is* rotating. Did you mean "SSD and disk systems with big honking caches that cover up the rotation"? Because "RAID" and "big honking caches" are not *quite* the same thing, and I can just see that corner case coming out to bite somebody on the ass... [-- Attachment #2: Type: application/pgp-signature, Size: 227 bytes --] ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: IO scheduler based IO controller V10 2009-10-02 16:50 ` Valdis.Kletnieks @ 2009-10-02 19:58 ` Vivek Goyal 2009-10-02 22:14 ` Corrado Zoccolo 0 siblings, 1 reply; 25+ messages in thread From: Vivek Goyal @ 2009-10-02 19:58 UTC (permalink / raw) To: Valdis.Kletnieks Cc: Mike Galbraith, Corrado Zoccolo, Jens Axboe, Ingo Molnar, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds, riel On Fri, Oct 02, 2009 at 12:50:17PM -0400, Valdis.Kletnieks@vt.edu wrote: > On Fri, 02 Oct 2009 11:40:20 EDT, Vivek Goyal said: > > > In that case, Corrado's suggestion of refining it further and disabling idling > > for seeky process only on non-rotational media (SSD and hardware RAID), makes > > sense to me. > > Umm... I got petabytes of hardware RAID across the hall that very definitely > *is* rotating. Did you mean "SSD and disk systems with big honking caches > that cover up the rotation"? Because "RAID" and "big honking caches" are > not *quite* the same thing, and I can just see that corner case coming out > to bite somebody on the ass... > I guess both. For systems which have big caches that cover up for rotation, we probably need not idle for a seeky process. And in case of a big hardware RAID having multiple rotating disks, instead of idling and keeping the rest of the disks free, we are probably better off dispatching requests from the next queue (hoping it is going to a different disk altogether). Thanks Vivek ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: IO scheduler based IO controller V10 2009-10-02 19:58 ` Vivek Goyal @ 2009-10-02 22:14 ` Corrado Zoccolo 2009-10-02 22:27 ` Vivek Goyal 0 siblings, 1 reply; 25+ messages in thread From: Corrado Zoccolo @ 2009-10-02 22:14 UTC (permalink / raw) To: Vivek Goyal Cc: Valdis.Kletnieks, Mike Galbraith, Jens Axboe, Ingo Molnar, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds, riel On Fri, Oct 2, 2009 at 9:58 PM, Vivek Goyal <vgoyal@redhat.com> wrote: > On Fri, Oct 02, 2009 at 12:50:17PM -0400, Valdis.Kletnieks@vt.edu wrote: >> On Fri, 02 Oct 2009 11:40:20 EDT, Vivek Goyal said: >> >> Umm... I got petabytes of hardware RAID across the hall that very definitely >> *is* rotating. Did you mean "SSD and disk systems with big honking caches >> that cover up the rotation"? Because "RAID" and "big honking caches" are >> not *quite* the same thing, and I can just see that corner case coming out >> to bite somebody on the ass... >> > > I guess both. The systems which have big caches and cover up for rotation, > we probably need not idle for seeky process. An in case of big hardware > RAID, having multiple rotating disks, instead of idling and keeping rest > of the disks free, we probably are better off dispatching requests from > next queue (hoping it is going to a different disk altogether). In fact I think that the 'rotating' flag name is misleading. All the checks we are doing are actually checking if the device truly supports multiple parallel operations, and this feature is shared by hardware raids and NCQ enabled SSDs, but not by cheap SSDs or single NCQ-enabled SATA disk. If we really wanted a "seek is cheap" flag, we could measure seek time in the io-scheduler itself, but in the current code base we don't have it used in this meaning anywhere. Thanks, Corrado > > Thanks > Vivek > -- __________________________________________________________________________ dott. Corrado Zoccolo mailto:czoccolo@gmail.com PhD - Department of Computer Science - University of Pisa, Italy -------------------------------------------------------------------------- The self-confidence of a warrior is not the self-confidence of the average man. The average man seeks certainty in the eyes of the onlooker and calls that self-confidence. The warrior seeks impeccability in his own eyes and calls that humbleness. Tales of Power - C. Castaneda ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: IO scheduler based IO controller V10 2009-10-02 22:14 ` Corrado Zoccolo @ 2009-10-02 22:27 ` Vivek Goyal 2009-10-03 12:43 ` Corrado Zoccolo 0 siblings, 1 reply; 25+ messages in thread From: Vivek Goyal @ 2009-10-02 22:27 UTC (permalink / raw) To: Corrado Zoccolo Cc: Valdis.Kletnieks, Mike Galbraith, Jens Axboe, Ingo Molnar, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds, riel On Sat, Oct 03, 2009 at 12:14:28AM +0200, Corrado Zoccolo wrote: > On Fri, Oct 2, 2009 at 9:58 PM, Vivek Goyal <vgoyal@redhat.com> wrote: > > On Fri, Oct 02, 2009 at 12:50:17PM -0400, Valdis.Kletnieks@vt.edu wrote: > >> On Fri, 02 Oct 2009 11:40:20 EDT, Vivek Goyal said: > >> > >> Umm... I got petabytes of hardware RAID across the hall that very definitely > >> *is* rotating. Did you mean "SSD and disk systems with big honking caches > >> that cover up the rotation"? Because "RAID" and "big honking caches" are > >> not *quite* the same thing, and I can just see that corner case coming out > >> to bite somebody on the ass... > >> > > > > I guess both. The systems which have big caches and cover up for rotation, > > we probably need not idle for seeky process. And in case of big hardware > > RAID, having multiple rotating disks, instead of idling and keeping rest > > of the disks free, we probably are better off dispatching requests from > > next queue (hoping it is going to a different disk altogether). > > In fact I think that the 'rotating' flag name is misleading. > All the checks we are doing are actually checking if the device truly > supports multiple parallel operations, and this feature is shared by > hardware raids and NCQ enabled SSDs, but not by cheap SSDs or single > NCQ-enabled SATA disk. > While we are at it, what happens to the notion of priority of tasks on SSDs? Without idling there is no continuous time slice and there is no fairness. So ioprio is out of the window for SSDs? On SSDs, will it make more sense to provide fairness in terms of number of IOs or size of IO and not in terms of time slices? Thanks Vivek > If we really wanted a "seek is cheap" flag, we could measure seek time > in the io-scheduler itself, but in the current code base we don't have > it used in this meaning anywhere. > > Thanks, > Corrado > > > > > Thanks > > Vivek > > > > > > -- > __________________________________________________________________________ > > dott. Corrado Zoccolo mailto:czoccolo@gmail.com > PhD - Department of Computer Science - University of Pisa, Italy > -------------------------------------------------------------------------- > The self-confidence of a warrior is not the self-confidence of the average > man. The average man seeks certainty in the eyes of the onlooker and calls > that self-confidence. The warrior seeks impeccability in his own eyes and > calls that humbleness. > Tales of Power - C. Castaneda ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: IO scheduler based IO controller V10 2009-10-02 22:27 ` Vivek Goyal @ 2009-10-03 12:43 ` Corrado Zoccolo 2009-10-03 13:38 ` Do we support ioprio on SSDs with NCQ (Was: Re: IO scheduler based IO controller V10) Vivek Goyal 0 siblings, 1 reply; 25+ messages in thread From: Corrado Zoccolo @ 2009-10-03 12:43 UTC (permalink / raw) To: Vivek Goyal Cc: Valdis.Kletnieks, Mike Galbraith, Jens Axboe, Ingo Molnar, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds, riel On Sat, Oct 3, 2009 at 12:27 AM, Vivek Goyal <vgoyal@redhat.com> wrote: > On Sat, Oct 03, 2009 at 12:14:28AM +0200, Corrado Zoccolo wrote: >> In fact I think that the 'rotating' flag name is misleading. >> All the checks we are doing are actually checking if the device truly >> supports multiple parallel operations, and this feature is shared by >> hardware raids and NCQ enabled SSDs, but not by cheap SSDs or single >> NCQ-enabled SATA disk. >> > > While we are at it, what happens to notion of priority of tasks on SSDs? This is not changed by proposed patch w.r.t. current CFQ. > Without idling there is not continuous time slice and there is no > fairness. So ioprio is out of the window for SSDs? I haven't NCQ enabled SSDs here, so I can't test it, but it seems to me that the way in which queues are sorted in the rr tree may still provide some sort of fairness and service differentiation for priorities, in terms of number of IOs. Non-NCQ SSDs, instead, will still have the idle window enabled, so it is not an issue for them. > > On SSDs, will it make more sense to provide fairness in terms of number or > IO or size of IO and not in terms of time slices. Not on all SSDs. There are still ones that have a non-negligible penalty on non-sequential access pattern (hopefully the ones without NCQ, but if we find otherwise, then we will have to benchmark access time in I/O scheduler to select the best policy). For those, time based may still be needed. Thanks, Corrado > > Thanks > Vivek ^ permalink raw reply [flat|nested] 25+ messages in thread
* Do we support ioprio on SSDs with NCQ (Was: Re: IO scheduler based IO controller V10) 2009-10-03 12:43 ` Corrado Zoccolo @ 2009-10-03 13:38 ` Vivek Goyal 2009-10-04 9:15 ` Corrado Zoccolo 0 siblings, 1 reply; 25+ messages in thread From: Vivek Goyal @ 2009-10-03 13:38 UTC (permalink / raw) To: Corrado Zoccolo Cc: Valdis.Kletnieks, Mike Galbraith, Jens Axboe, Ingo Molnar, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds, riel On Sat, Oct 03, 2009 at 02:43:14PM +0200, Corrado Zoccolo wrote: > On Sat, Oct 3, 2009 at 12:27 AM, Vivek Goyal <vgoyal@redhat.com> wrote: > > On Sat, Oct 03, 2009 at 12:14:28AM +0200, Corrado Zoccolo wrote: > >> In fact I think that the 'rotating' flag name is misleading. > >> All the checks we are doing are actually checking if the device truly > >> supports multiple parallel operations, and this feature is shared by > >> hardware raids and NCQ enabled SSDs, but not by cheap SSDs or single > >> NCQ-enabled SATA disk. > >> > > > > While we are at it, what happens to notion of priority of tasks on SSDs? > This is not changed by proposed patch w.r.t. current CFQ. This is a general question irrespective of the current patch. I want to know what our statement is w.r.t ioprio and what it means for the user. When do we support it and when do we not? > > Without idling there is not continuous time slice and there is no > > fairness. So ioprio is out of the window for SSDs? > I haven't NCQ enabled SSDs here, so I can't test it, but it seems to > me that the way in which queues are sorted in the rr tree may still > provide some sort of fairness and service differentiation for > priorities, in terms of number of IOs. I have a NCQ enabled SSD. Sometimes I see the difference, sometimes I do not. I guess this happens because sometimes idling is enabled and sometimes not, because of the dynamic nature of hw_tag. I ran three fio reads for 10 seconds. First job is prio0, second prio4 and third prio7. (prio 0) read : io=978MiB, bw=100MiB/s, iops=25,023, runt= 10005msec (prio 4) read : io=953MiB, bw=99,950KiB/s, iops=24,401, runt= 10003msec (prio 7) read : io=74,228KiB, bw=7,594KiB/s, iops=1,854, runt= 10009msec Note there is almost no difference between the prio 0 and prio 4 jobs, and the prio7 job has been penalized heavily (it gets less than 10% of the BW of the prio 4 job). > Non-NCQ SSDs, instead, will still have the idle window enabled, so it > is not an issue for them. Agree. > > > > On SSDs, will it make more sense to provide fairness in terms of number of > > IOs or size of IO and not in terms of time slices. > Not on all SSDs. There are still ones that have a non-negligible > penalty on non-sequential access pattern (hopefully the ones without > NCQ, but if we find otherwise, then we will have to benchmark access > time in I/O scheduler to select the best policy). For those, time > based may still be needed. Ok. So on better SSDs out there with NCQ, we probably don't support the notion of ioprio? Or am I missing something? Thanks Vivek ^ permalink raw reply [flat|nested] 25+ messages in thread
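The fio job file for the test above is not included in the thread; a hypothetical reconstruction could look roughly like the following (device name, block size, access pattern and direct I/O flag are guesses -- only the three best-effort priority levels and the 10 second runtime come from the message):

; three concurrent readers at ioprio 0, 4 and 7, best-effort class
[global]
rw=randread
bs=4k
direct=1
runtime=10
time_based
filename=/dev/sdX   ; hypothetical device
prioclass=2

[prio0]
prio=0

[prio4]
prio=4

[prio7]
prio=7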
* Re: Do we support ioprio on SSDs with NCQ (Was: Re: IO scheduler based IO controller V10) 2009-10-03 13:38 ` Do we support ioprio on SSDs with NCQ (Was: Re: IO scheduler based IO controller V10) Vivek Goyal @ 2009-10-04 9:15 ` Corrado Zoccolo 2009-10-04 12:11 ` Vivek Goyal 0 siblings, 1 reply; 25+ messages in thread From: Corrado Zoccolo @ 2009-10-04 9:15 UTC (permalink / raw) To: Vivek Goyal Cc: Valdis.Kletnieks, Mike Galbraith, Jens Axboe, Ingo Molnar, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds, riel Hi Vivek, On Sat, Oct 3, 2009 at 3:38 PM, Vivek Goyal <vgoyal@redhat.com> wrote: > On Sat, Oct 03, 2009 at 02:43:14PM +0200, Corrado Zoccolo wrote: >> On Sat, Oct 3, 2009 at 12:27 AM, Vivek Goyal <vgoyal@redhat.com> wrote: >> > On Sat, Oct 03, 2009 at 12:14:28AM +0200, Corrado Zoccolo wrote: >> >> In fact I think that the 'rotating' flag name is misleading. >> >> All the checks we are doing are actually checking if the device truly >> >> supports multiple parallel operations, and this feature is shared by >> >> hardware raids and NCQ enabled SSDs, but not by cheap SSDs or single >> >> NCQ-enabled SATA disk. >> >> >> > >> > While we are at it, what happens to notion of priority of tasks on SSDs? >> This is not changed by proposed patch w.r.t. current CFQ. > > This is a general question irrespective of current patch. Want to know > what is our statement w.r.t ioprio and what it means for user? When do > we support it and when do we not. > >> > Without idling there is not continuous time slice and there is no >> > fairness. So ioprio is out of the window for SSDs? >> I haven't NCQ enabled SSDs here, so I can't test it, but it seems to >> me that the way in which queues are sorted in the rr tree may still >> provide some sort of fairness and service differentiation for >> priorities, in terms of number of IOs. > > I have a NCQ enabled SSD. Sometimes I see the difference sometimes I do > not. I guess this happens because sometimes idling is enabled and sometimes > not because of dynamic nature of hw_tag. > My guess is that the formula that is used to handle this case is not very stable. The culprit code is (in cfq_service_tree_add): } else if (!add_front) { rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies; rb_key += cfqq->slice_resid; cfqq->slice_resid = 0; } else cfq_slice_offset is defined as: static unsigned long cfq_slice_offset(struct cfq_data *cfqd, struct cfq_queue *cfqq) { /* * just an approximation, should be ok. */ return (cfqd->busy_queues - 1) * (cfq_prio_slice(cfqd, 1, 0) - cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio)); } Can you try changing the latter to the simpler (we already observed that busy_queues is unstable, and I think that here it is not needed at all): return -cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio); and remove the 'rb_key += cfqq->slice_resid; ' from the former. This should give larger-slice tasks a higher probability of being first in the tree, so it will work if we don't idle, but it needs some adjustment if we idle. > I ran three fio reads for 10 seconds. First job is prio0, second prio4 and > third prio7.
> > (prio 0) read : io=978MiB, bw=100MiB/s, iops=25,023, runt= 10005msec > (prio 4) read : io=953MiB, bw=99,950KiB/s, iops=24,401, runt= 10003msec > (prio 7) read : io=74,228KiB, bw=7,594KiB/s, iops=1,854, runt= 10009msec > > Note there is almost no difference between prio 0 and prio 4 job and prio7 > job has been penalized heavily (gets less than 10% BW of prio 4 job). > >> Non-NCQ SSDs, instead, will still have the idle window enabled, so it >> is not an issue for them. > > Agree. > >> > >> > On SSDs, will it make more sense to provide fairness in terms of number or >> > IO or size of IO and not in terms of time slices. >> Not on all SSDs. There are still ones that have a non-negligible >> penalty on non-sequential access pattern (hopefully the ones without >> NCQ, but if we find otherwise, then we will have to benchmark access >> time in I/O scheduler to select the best policy). For those, time >> based may still be needed. > > Ok. > > So on better SSDs out there with NCQ, we probably don't support the notion of > ioprio? Or, I am missing something. I think we try, but the current formula is simply not good enough. Thanks, Corrado > > Thanks > Vivek > -- __________________________________________________________________________ dott. Corrado Zoccolo mailto:czoccolo@gmail.com PhD - Department of Computer Science - University of Pisa, Italy -------------------------------------------------------------------------- The self-confidence of a warrior is not the self-confidence of the average man. The average man seeks certainty in the eyes of the onlooker and calls that self-confidence. The warrior seeks impeccability in his own eyes and calls that humbleness. Tales of Power - C. Castaneda ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Do we support ioprio on SSDs with NCQ (Was: Re: IO scheduler based IO controller V10) 2009-10-04 9:15 ` Corrado Zoccolo @ 2009-10-04 12:11 ` Vivek Goyal 2009-10-04 12:46 ` Corrado Zoccolo 0 siblings, 1 reply; 25+ messages in thread From: Vivek Goyal @ 2009-10-04 12:11 UTC (permalink / raw) To: Corrado Zoccolo Cc: Valdis.Kletnieks, Mike Galbraith, Jens Axboe, Ingo Molnar, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds, riel On Sun, Oct 04, 2009 at 11:15:24AM +0200, Corrado Zoccolo wrote: > Hi Vivek, > On Sat, Oct 3, 2009 at 3:38 PM, Vivek Goyal <vgoyal@redhat.com> wrote: > > On Sat, Oct 03, 2009 at 02:43:14PM +0200, Corrado Zoccolo wrote: > >> On Sat, Oct 3, 2009 at 12:27 AM, Vivek Goyal <vgoyal@redhat.com> wrote: > >> > On Sat, Oct 03, 2009 at 12:14:28AM +0200, Corrado Zoccolo wrote: > >> >> In fact I think that the 'rotating' flag name is misleading. > >> >> All the checks we are doing are actually checking if the device truly > >> >> supports multiple parallel operations, and this feature is shared by > >> >> hardware raids and NCQ enabled SSDs, but not by cheap SSDs or single > >> >> NCQ-enabled SATA disk. > >> >> > >> > > >> > While we are at it, what happens to notion of priority of tasks on SSDs? > >> This is not changed by proposed patch w.r.t. current CFQ. > > > > This is a general question irrespective of current patch. Want to know > > what is our statement w.r.t ioprio and what it means for user? When do > > we support it and when do we not. > > > >> > Without idling there is not continuous time slice and there is no > >> > fairness. So ioprio is out of the window for SSDs? > >> I haven't NCQ enabled SSDs here, so I can't test it, but it seems to > >> me that the way in which queues are sorted in the rr tree may still > >> provide some sort of fairness and service differentiation for > >> priorities, in terms of number of IOs. > > > > I have a NCQ enabled SSD. Sometimes I see the difference sometimes I do > > not. I guess this happens because sometimes idling is enabled and sometmes > > not because of dyanamic nature of hw_tag. > > > My guess is that the formula that is used to handle this case is not > very stable. In general I agree that formula to calculate the slice offset is very puzzling as busy_queues varies and that changes the position of the task sometimes. > The culprit code is (in cfq_service_tree_add): > } else if (!add_front) { > rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies; > rb_key += cfqq->slice_resid; > cfqq->slice_resid = 0; > } else > > cfq_slice_offset is defined as: > > static unsigned long cfq_slice_offset(struct cfq_data *cfqd, > struct cfq_queue *cfqq) > { > /* > * just an approximation, should be ok. > */ > return (cfqd->busy_queues - 1) * (cfq_prio_slice(cfqd, 1, 0) - > cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio)); > } > > Can you try changing the latter to a simpler (we already observed that > busy_queues is unstable, and I think that here it is not needed at > all): > return -cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio)); > and remove the 'rb_key += cfqq->slice_resid; ' from the former. > > This should give a higher probability of being first on the queue to > larger slice tasks, so it will work if we don't idle, but it needs > some adjustment if we idle. I am not sure what's the intent here by removing busy_queues stuff. I have got two questions though. 
- Why don't we keep it simple round robin, where a task is simply placed at the end of the service tree? - Secondly, CFQ provides full slice length only to queues which are idling (in case of a sequential reader). If we do not enable idling, as in the case of NCQ enabled SSDs, then CFQ will expire the queue almost immediately and put the queue at the end of the service tree (almost). So if we don't enable idling, at max we can provide fairness: we essentially just let every queue dispatch one request and put it at the end of the service tree. Hence no fairness.... Thanks Vivek > > I ran three fio reads for 10 seconds. First job is prio0, second prio4 and > > third prio7. > > > > (prio 0) read : io=978MiB, bw=100MiB/s, iops=25,023, runt= 10005msec > > (prio 4) read : io=953MiB, bw=99,950KiB/s, iops=24,401, runt= 10003msec > > (prio 7) read : io=74,228KiB, bw=7,594KiB/s, iops=1,854, runt= 10009msec > > > > Note there is almost no difference between prio 0 and prio 4 job and prio7 > > job has been penalized heavily (gets less than 10% BW of prio 4 job). > > > >> Non-NCQ SSDs, instead, will still have the idle window enabled, so it > >> is not an issue for them. > > > > Agree. > > > >> > > >> > On SSDs, will it make more sense to provide fairness in terms of number of > >> > IOs or size of IO and not in terms of time slices. > >> Not on all SSDs. There are still ones that have a non-negligible > >> penalty on non-sequential access pattern (hopefully the ones without > >> NCQ, but if we find otherwise, then we will have to benchmark access > >> time in I/O scheduler to select the best policy). For those, time > >> based may still be needed. > > > > Ok. > > > > So on better SSDs out there with NCQ, we probably don't support the notion of > > ioprio? Or, I am missing something. > > I think we try, but the current formula is simply not good enough. > > Thanks, > Corrado > > > > > Thanks > > Vivek > > > > > > -- > __________________________________________________________________________ > > dott. Corrado Zoccolo mailto:czoccolo@gmail.com > PhD - Department of Computer Science - University of Pisa, Italy > -------------------------------------------------------------------------- > The self-confidence of a warrior is not the self-confidence of the average > man. The average man seeks certainty in the eyes of the onlooker and calls > that self-confidence. The warrior seeks impeccability in his own eyes and > calls that humbleness. > Tales of Power - C. Castaneda ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Do we support ioprio on SSDs with NCQ (Was: Re: IO scheduler based IO controller V10) 2009-10-04 12:11 ` Vivek Goyal @ 2009-10-04 12:46 ` Corrado Zoccolo 2009-10-04 16:20 ` Fabio Checconi ` (2 more replies) 0 siblings, 3 replies; 25+ messages in thread From: Corrado Zoccolo @ 2009-10-04 12:46 UTC (permalink / raw) To: Vivek Goyal Cc: Valdis.Kletnieks, Mike Galbraith, Jens Axboe, Ingo Molnar, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds, riel Hi Vivek, On Sun, Oct 4, 2009 at 2:11 PM, Vivek Goyal <vgoyal@redhat.com> wrote: > On Sun, Oct 04, 2009 at 11:15:24AM +0200, Corrado Zoccolo wrote: >> Hi Vivek, >> My guess is that the formula that is used to handle this case is not >> very stable. > > In general I agree that formula to calculate the slice offset is very > puzzling as busy_queues varies and that changes the position of the task > sometimes. > > I am not sure what's the intent here by removing busy_queues stuff. I have > got two questions though. In the ideal steady-state case, busy_queues will be a constant. Since we are just comparing the values between themselves, we can just remove this constant completely. Whenever it is not constant, it seems to me that it can cause wrong behaviour, i.e. when the number of processes with ready I/O reduces, a later coming request can jump before older requests. So it seems it does more harm than good, hence I suggest to remove it. Moreover, I suggest removing also the slice_resid part, since its semantics doesn't seem consistent. When computed, it is not the residency, but the remaining time slice. Then it is used to postpone, instead of anticipate, the position of the queue in the RR, which seems counterintuitive (it would be intuitive, though, if it was actually a residency, not a remaining slice, i.e. you already got your full share, so you can wait longer to be serviced again). > > - Why don't we keep it simple round robin where a task is simply placed at > the end of service tree. This should work for the idling case, since we provide service differentiation by means of time slice. For the non-idling case, though, the appropriate placement of queues in the tree (as given by my formula) can still provide it. > > - Secondly, CFQ provides full slice length to queues only which are > idling (in case of sequential reader). If we do not enable idling, as > in case of NCQ enabled SSDs, then CFQ will expire the queue almost > immediately and put the queue at the end of service tree (almost). > > So if we don't enable idling, at max we can provide fairness, we > essentially just let every queue dispatch one request and put at the end > of the end of service tree. Hence no fairness.... We should distinguish the two terms fairness and service differentiation. Fairness is when every queue gets the same amount of service share. This is not what we want when priorities are different (we want the service differentiation, instead), but is what we get if we do just round robin without idling. To fix this, we can alter the placement in the tree, so that if we have Q1 with slice S1, and Q2 with slice S2, always ready to perform I/O, we get that Q1 is in front of the tree with probability S1/(S1+S2), and Q2 is in front with probability S2/(S1+S2). This is what my formula should achieve. Thanks, Corrado > > Thanks > Vivek > ^ permalink raw reply [flat|nested] 25+ messages in thread
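A small numeric illustration of the "later coming request can jump before older requests" point above, plugging numbers into the cfq_slice_offset() formula quoted earlier in the thread. The per-priority slice lengths (a 140 ms gap between a prio 0 and a prio 7 sync slice) are an assumption based on the default 100 ms sync slice, not something stated in the thread:

#include <stdio.h>

/* rb_key = (busy_queues - 1) * (slice(prio 0) - slice(this prio)) + jiffies */
static long rb_key(int busy_queues, long prio0_slice, long my_slice, long jiffies)
{
	return (long)(busy_queues - 1) * (prio0_slice - my_slice) + jiffies;
}

int main(void)
{
	const long prio0_slice = 180, prio7_slice = 40;	/* assumed slice lengths, ms */

	/* a prio 7 queue requeued at jiffies=1000 while 8 queues were busy */
	long old_key = rb_key(8, prio0_slice, prio7_slice, 1000);

	/* another prio 7 queue requeued 50 ms later, after busy_queues dropped to 2 */
	long new_key = rb_key(2, prio0_slice, prio7_slice, 1050);

	printf("old queue rb_key = %ld, new queue rb_key = %ld\n", old_key, new_key);
	printf("the later request is served first: %s\n", new_key < old_key ? "yes" : "no");
	return 0;
}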
* Re: Do we support ioprio on SSDs with NCQ (Was: Re: IO scheduler based IO controller V10) 2009-10-04 12:46 ` Corrado Zoccolo @ 2009-10-04 16:20 ` Fabio Checconi 2009-10-05 21:21 ` Corrado Zoccolo 2009-10-05 15:06 ` Jeff Moyer 2009-10-06 21:36 ` Vivek Goyal 2 siblings, 1 reply; 25+ messages in thread From: Fabio Checconi @ 2009-10-04 16:20 UTC (permalink / raw) To: Corrado Zoccolo Cc: Vivek Goyal, Valdis.Kletnieks, Mike Galbraith, Jens Axboe, Ingo Molnar, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds, riel > From: Corrado Zoccolo <czoccolo@gmail.com> > Date: Sun, Oct 04, 2009 02:46:44PM +0200 > > Hi Vivek, > On Sun, Oct 4, 2009 at 2:11 PM, Vivek Goyal <vgoyal@redhat.com> wrote: > > On Sun, Oct 04, 2009 at 11:15:24AM +0200, Corrado Zoccolo wrote: > >> Hi Vivek, > >> My guess is that the formula that is used to handle this case is not > >> very stable. > > > > In general I agree that formula to calculate the slice offset is very > > puzzling as busy_queues varies and that changes the position of the task > > sometimes. > > > > I am not sure what's the intent here by removing busy_queues stuff. I have > > got two questions though. > > In the ideal case steady state, busy_queues will be a constant. Since > we are just comparing the values between themselves, we can just > remove this constant completely. > > Whenever it is not constant, it seems to me that it can cause wrong > behaviour, i.e. when the number of processes with ready I/O reduces, a > later coming request can jump before older requests. > So it seems it does more harm than good, hence I suggest to remove it. > > Moreover, I suggest removing also the slice_resid part, since its > semantics doesn't seem consistent. > When computed, it is not the residency, but the remaining time slice. > Then it is used to postpone, instead of anticipate, the position of > the queue in the RR, that seems counterintuitive (it would be > intuitive, though, if it was actually a residency, not a remaining > slice, i.e. you already got your full share, so you can wait longer to > be serviced again). > > > > > - Why don't we keep it simple round robin where a task is simply placed at > > the end of service tree. > > This should work for the idling case, since we provide service > differentiation by means of time slice. > For non-idling case, though, the appropriate placement of queues in > the tree (as given by my formula) can still provide it. > > > > > - Secondly, CFQ provides full slice length to queues only which are > > idling (in case of sequenatial reader). If we do not enable idling, as > > in case of NCQ enabled SSDs, then CFQ will expire the queue almost > > immediately and put the queue at the end of service tree (almost). > > > > So if we don't enable idling, at max we can provide fairness, we > > esseitially just let every queue dispatch one request and put at the end > > of the end of service tree. Hence no fairness.... > > We should distinguish the two terms fairness and service > differentiation. Fairness is when every queue gets the same amount of > service share. This is not what we want when priorities are different > (we want the service differentiation, instead), but is what we get if > we do just round robin without idling. 
> > To fix this, we can alter the placement in the tree, so that if we > have Q1 with slice S1, and Q2 with slice S2, always ready to perform > I/O, we get that Q1 is in front of the three with probability > S1/(S1+S2), and Q2 is in front with probability S2/(S1+S2). > This is what my formula should achieve. > But if the ``always ready to perform I/O'' assumption held then even RR would have provided service differentiation, always seeing backlogged queues and serving them according to their weights. In this case the problem is what Vivek described some time ago as the interlocked service of sync queues, where the scheduler is trying to differentiate between the queues, but they are not always asking for service (as they are synchronous and they are backlogged only for short time intervals). ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Do we support ioprio on SSDs with NCQ (Was: Re: IO scheduler based IO controller V10) 2009-10-04 16:20 ` Fabio Checconi @ 2009-10-05 21:21 ` Corrado Zoccolo 0 siblings, 0 replies; 25+ messages in thread From: Corrado Zoccolo @ 2009-10-05 21:21 UTC (permalink / raw) To: Fabio Checconi Cc: Vivek Goyal, Valdis.Kletnieks, Mike Galbraith, Jens Axboe, Ingo Molnar, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds, riel On Sun, Oct 4, 2009 at 6:20 PM, Fabio Checconi <fchecconi@gmail.com> wrote: > But if the ``always ready to perform I/O'' assumption held then even RR > would have provided service differentiation, always seeing backlogged > queues and serving them according to their weights. Right, this property is too strong. But also a weaker "the two queues have think times less than the disk access time" will be enough to achieve the same goal by means of proper placement in the RR tree. If both think times are greater than access time, then each queue will get a service level equivalent to it being the only queue in the system, so in this case service differentiation will not apply (do we need to differentiate when everyone gets exactly what he needs?). If one think time is less, and the other is more than the access time, then we should decide what kind of fairness we want to have, especially if the one with larger think time has also higher priority. > In this case the problem is what Vivek described some time ago as the > interlocked service of sync queues, where the scheduler is trying to > differentiate between the queues, but they are not always asking for > service (as they are synchronous and they are backlogged only for short > time intervals). Corrado ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Do we support ioprio on SSDs with NCQ (Was: Re: IO scheduler based IO controller V10) 2009-10-04 12:46 ` Corrado Zoccolo 2009-10-04 16:20 ` Fabio Checconi @ 2009-10-05 15:06 ` Jeff Moyer 2009-10-05 21:09 ` Corrado Zoccolo 2009-10-06 21:36 ` Vivek Goyal 2 siblings, 1 reply; 25+ messages in thread From: Jeff Moyer @ 2009-10-05 15:06 UTC (permalink / raw) To: Corrado Zoccolo Cc: Vivek Goyal, Valdis.Kletnieks, Mike Galbraith, Jens Axboe, Ingo Molnar, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds, riel Corrado Zoccolo <czoccolo@gmail.com> writes: > Moreover, I suggest removing also the slice_resid part, since its > semantics doesn't seem consistent. > When computed, it is not the residency, but the remaining time slice. It stands for residual, not residency. Make more sense? Cheers, Jeff ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Do we support ioprio on SSDs with NCQ (Was: Re: IO scheduler based IO controller V10) 2009-10-05 15:06 ` Jeff Moyer @ 2009-10-05 21:09 ` Corrado Zoccolo 2009-10-06 8:41 ` Jens Axboe 0 siblings, 1 reply; 25+ messages in thread From: Corrado Zoccolo @ 2009-10-05 21:09 UTC (permalink / raw) To: Jeff Moyer Cc: Vivek Goyal, Valdis.Kletnieks, Mike Galbraith, Jens Axboe, Ingo Molnar, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds, riel On Mon, Oct 5, 2009 at 5:06 PM, Jeff Moyer <jmoyer@redhat.com> wrote: > Corrado Zoccolo <czoccolo@gmail.com> writes: > >> Moreover, I suggest removing also the slice_resid part, since its >> semantics doesn't seem consistent. >> When computed, it is not the residency, but the remaining time slice. > > It stands for residual, not residency. Make more sense? It makes sense when computed, but not when used in rb_key computation. Why should we postpone queues that where preempted, instead of giving them a boost? Thanks, Corrado > > Cheers, > Jeff > ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Do we support ioprio on SSDs with NCQ (Was: Re: IO scheduler based IO controller V10)
  2009-10-05 21:09 ` Corrado Zoccolo
@ 2009-10-06 8:41 ` Jens Axboe
  2009-10-06 9:00 ` Corrado Zoccolo
  0 siblings, 1 reply; 25+ messages in thread
From: Jens Axboe @ 2009-10-06 8:41 UTC (permalink / raw)
  To: Corrado Zoccolo
  Cc: Jeff Moyer, Vivek Goyal, Valdis.Kletnieks, Mike Galbraith, Ingo Molnar,
      Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf,
      mikew, fchecconi, paolo.valente, ryov, fernando, dhaval, balbir,
      righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds, riel

On Mon, Oct 05 2009, Corrado Zoccolo wrote:
> On Mon, Oct 5, 2009 at 5:06 PM, Jeff Moyer <jmoyer@redhat.com> wrote:
> > Corrado Zoccolo <czoccolo@gmail.com> writes:
> >
> >> Moreover, I suggest removing also the slice_resid part, since its
> >> semantics doesn't seem consistent.
> >> When computed, it is not the residency, but the remaining time slice.
> >
> > It stands for residual, not residency. Make more sense?
> It makes sense when computed, but not when used in rb_key computation.
> Why should we postpone queues that were preempted, instead of giving
> them a boost?

We should not. If it is/was working correctly, it should allow both for
increase/decrease of tree position (hence it's a long and can go
negative) to account for both over and under time.

--
Jens Axboe

^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Do we support ioprio on SSDs with NCQ (Was: Re: IO scheduler based IO controller V10)
  2009-10-06 8:41 ` Jens Axboe
@ 2009-10-06 9:00 ` Corrado Zoccolo
  2009-10-06 18:53 ` Jens Axboe
  0 siblings, 1 reply; 25+ messages in thread
From: Corrado Zoccolo @ 2009-10-06 9:00 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jeff Moyer, Vivek Goyal, Valdis.Kletnieks, Mike Galbraith, Ingo Molnar,
      Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf,
      mikew, fchecconi, paolo.valente, ryov, fernando, dhaval, balbir,
      righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds, riel

On Tue, Oct 6, 2009 at 10:41 AM, Jens Axboe <jens.axboe@oracle.com> wrote:
> On Mon, Oct 05 2009, Corrado Zoccolo wrote:
>> On Mon, Oct 5, 2009 at 5:06 PM, Jeff Moyer <jmoyer@redhat.com> wrote:
>> > It stands for residual, not residency. Make more sense?
>> It makes sense when computed, but not when used in rb_key computation.
>> Why should we postpone queues that were preempted, instead of giving
>> them a boost?
>
> We should not. If it is/was working correctly, it should allow both for
> increase/decrease of tree position (hence it's a long and can go
> negative) to account for both over and under time.

I'm doing some tests with and without it.
How it is working now is:
definition:
	if (timed_out && !cfq_cfqq_slice_new(cfqq)) {
		cfqq->slice_resid = cfqq->slice_end - jiffies;
		cfq_log_cfqq(cfqd, cfqq, "resid=%ld",
			     cfqq->slice_resid);
	}
* here resid is > 0 if there was residual time, and < 0 if the queue
overran its slice.
use:
	rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies;
	rb_key += cfqq->slice_resid;
	cfqq->slice_resid = 0;
* here if residual is > 0, we postpone, i.e. penalize. If residual is
< 0 (i.e. the queue overran), we anticipate it, i.e. we boost it.

So this is likely not what we want.
I did some tests with and without it, or changing the sign, and it
doesn't matter at all for pure sync workloads.
The only case in which it matters a little, from my experiments, is
the sync vs async workload. Here, since async queues are preempted,
the current form of the code penalizes them, so they get larger
delays, and we get more bandwidth for sync.
This is, btw, the only positive outcome (I can think of) from the
current form of the code, and I think we could obtain it more easily
by unconditionally adding a delay for async queues:
	rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies;
	if (!cfq_cfqq_sync(cfqq)) {
		rb_key += CFQ_ASYNC_DELAY;
	}

completely removing the resid stuff (or at least leaving us the
ability to use it with the proper sign).

Corrado

>
> --
> Jens Axboe
>
>

^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Do we support ioprio on SSDs with NCQ (Was: Re: IO scheduler based IO controller V10)
  2009-10-06 9:00 ` Corrado Zoccolo
@ 2009-10-06 18:53 ` Jens Axboe
  0 siblings, 0 replies; 25+ messages in thread
From: Jens Axboe @ 2009-10-06 18:53 UTC (permalink / raw)
  To: Corrado Zoccolo
  Cc: Jeff Moyer, Vivek Goyal, Valdis.Kletnieks, Mike Galbraith, Ingo Molnar,
      Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf,
      mikew, fchecconi, paolo.valente, ryov, fernando, dhaval, balbir,
      righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds, riel

On Tue, Oct 06 2009, Corrado Zoccolo wrote:
> On Tue, Oct 6, 2009 at 10:41 AM, Jens Axboe <jens.axboe@oracle.com> wrote:
> > On Mon, Oct 05 2009, Corrado Zoccolo wrote:
> > >> On Mon, Oct 5, 2009 at 5:06 PM, Jeff Moyer <jmoyer@redhat.com> wrote:
> > >> > It stands for residual, not residency. Make more sense?
> > >> It makes sense when computed, but not when used in rb_key computation.
> > >> Why should we postpone queues that were preempted, instead of giving
> > >> them a boost?
> >
> > We should not. If it is/was working correctly, it should allow both for
> > increase/decrease of tree position (hence it's a long and can go
> > negative) to account for both over and under time.
>
> I'm doing some tests with and without it.
> How it is working now is:
> definition:
> 	if (timed_out && !cfq_cfqq_slice_new(cfqq)) {
> 		cfqq->slice_resid = cfqq->slice_end - jiffies;
> 		cfq_log_cfqq(cfqd, cfqq, "resid=%ld",
> 			     cfqq->slice_resid);
> 	}
> * here resid is > 0 if there was residual time, and < 0 if the queue
> overran its slice.
> use:
> 	rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies;
> 	rb_key += cfqq->slice_resid;
> 	cfqq->slice_resid = 0;
> * here if residual is > 0, we postpone, i.e. penalize. If residual is
> < 0 (i.e. the queue overran), we anticipate it, i.e. we boost it.
>
> So this is likely not what we want.

Indeed, that should be -= cfqq->slice_resid.

> I did some tests with and without it, or changing the sign, and it
> doesn't matter at all for pure sync workloads.

For most cases it will not change things a lot, but it should be
technically correct.

> The only case in which it matters a little, from my experiments, is
> the sync vs async workload. Here, since async queues are preempted,
> the current form of the code penalizes them, so they get larger
> delays, and we get more bandwidth for sync.

Right

> This is, btw, the only positive outcome (I can think of) from the
> current form of the code, and I think we could obtain it more easily
> by unconditionally adding a delay for async queues:
> 	rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies;
> 	if (!cfq_cfqq_sync(cfqq)) {
> 		rb_key += CFQ_ASYNC_DELAY;
> 	}
>
> completely removing the resid stuff (or at least leaving us the
> ability to use it with the proper sign).

It's more likely for the async queue to overrun, but it can happen for
others as well. I'm keeping the residual count, but making the sign
change of course.

--
Jens Axboe

^ permalink raw reply [flat|nested] 25+ messages in thread
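For reference, a minimal sketch of the placement with the sign corrected as
Jens describes above (illustrative only; the actual patch that went into
cfq-iosched.c may differ in detail):

	/*
	 * Sketch of the corrected rb_key computation: a queue that left
	 * residual slice time (slice_resid > 0) is moved forward in the
	 * service tree and served earlier, while a queue that overran its
	 * slice (slice_resid < 0) is pushed back.  slice_resid is a signed
	 * long, so both directions are representable.
	 */
	rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies;
	rb_key -= cfqq->slice_resid;
	cfqq->slice_resid = 0;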
* Re: Do we support ioprio on SSDs with NCQ (Was: Re: IO scheduler based IO controller V10)
  2009-10-04 12:46 ` Corrado Zoccolo
  2009-10-04 16:20 ` Fabio Checconi
  2009-10-05 15:06 ` Jeff Moyer
@ 2009-10-06 21:36 ` Vivek Goyal
  2 siblings, 0 replies; 25+ messages in thread
From: Vivek Goyal @ 2009-10-06 21:36 UTC (permalink / raw)
  To: Corrado Zoccolo
  Cc: Valdis.Kletnieks, Mike Galbraith, Jens Axboe, Ingo Molnar, Ulrich Lukas,
      linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew,
      fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir,
      righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds, riel

On Sun, Oct 04, 2009 at 02:46:44PM +0200, Corrado Zoccolo wrote:
> Hi Vivek,
> On Sun, Oct 4, 2009 at 2:11 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > On Sun, Oct 04, 2009 at 11:15:24AM +0200, Corrado Zoccolo wrote:
> >> Hi Vivek,
> >> My guess is that the formula that is used to handle this case is not
> >> very stable.
> >
> > In general I agree that formula to calculate the slice offset is very
> > puzzling as busy_queues varies and that changes the position of the task
> > sometimes.
> >
> > I am not sure what's the intent here by removing busy_queues stuff. I have
> > got two questions though.
>
> In the ideal case steady state, busy_queues will be a constant. Since
> we are just comparing the values between themselves, we can just
> remove this constant completely.
>
> Whenever it is not constant, it seems to me that it can cause wrong
> behaviour, i.e. when the number of processes with ready I/O reduces, a
> later coming request can jump before older requests.
> So it seems it does more harm than good, hence I suggest to remove it.
>

I agree here. busy_queues can vary, especially given the fact that CFQ
removes a queue from the service tree immediately after dispatch if the
queue is empty, and then waits for request completion from that queue and
idles on it.

So consider the following scenario, where two thinking readers and one
writer are executing. The readers preempt the writer and the writer gets
back into the tree. When the writer gets backlogged, busy_queues=2 at that
point, and when a reader gets backlogged, busy_queues=1 (most of the time,
because a reader is idling), so the readers often get placed ahead of the
writer. This is so subtle that I am not sure it was designed that way.

So the dependence on busy_queues can change queue ordering in
unpredictable ways.

> Moreover, I suggest removing also the slice_resid part, since its
> semantics doesn't seem consistent.
> When computed, it is not the residency, but the remaining time slice.
> Then it is used to postpone, instead of anticipate, the position of
> the queue in the RR, that seems counterintuitive (it would be
> intuitive, though, if it was actually a residency, not a remaining
> slice, i.e. you already got your full share, so you can wait longer to
> be serviced again).
>
> >
> > - Why don't we keep it simple round robin where a task is simply placed at
> > the end of service tree.
>
> This should work for the idling case, since we provide service
> differentiation by means of time slice.
> For non-idling case, though, the appropriate placement of queues in
> the tree (as given by my formula) can still provide it.
>

So for the non-idling case, service differentiation is provided by the
number of times a queue gets scheduled to run, rather than by giving it a
bigger time slice? This will work only to an extent, and depends on the
size of the IO being dispatched from each queue.
If some queues issue bigger requests and some smaller ones (this can
easily be driven by changing the block size), then again you will not see
fairness in the numbers? In that case it might make sense to provide
fairness in terms of size of IO/number of IO.

So to me it boils down to the seek cost of the underlying media. If seek
cost is high, provide fairness in terms of time slices; if seek cost is
really low, one can afford faster switching of queues without losing too
much on the throughput side, and in that case fairness in terms of size of
IO should be good.

Now, if seek cost is low on good SSDs with NCQ, I am wondering whether it
would make sense to tweak CFQ to change mode dynamically and start
providing fairness in terms of size of IO/number of IO?

> >
> > - Secondly, CFQ provides full slice length to queues only which are
> > idling (in case of sequential reader). If we do not enable idling, as
> > in case of NCQ enabled SSDs, then CFQ will expire the queue almost
> > immediately and put the queue at the end of service tree (almost).
> >
> > So if we don't enable idling, at max we can provide fairness, we
> > essentially just let every queue dispatch one request and put it at the
> > end of the service tree. Hence no fairness....
>
> We should distinguish the two terms fairness and service
> differentiation. Fairness is when every queue gets the same amount of
> service share.

Will it not be "proportionate amount of service share" instead of "same
amount of service share"?

> This is not what we want when priorities are different
> (we want the service differentiation, instead), but is what we get if
> we do just round robin without idling.
>
> To fix this, we can alter the placement in the tree, so that if we
> have Q1 with slice S1, and Q2 with slice S2, always ready to perform
> I/O, we get that Q1 is in front of the tree with probability
> S1/(S1+S2), and Q2 is in front with probability S2/(S1+S2).
> This is what my formula should achieve.

I have yet to get into details, but as I said, this sounds like fairness
by frequency, or by the number of times a queue is scheduled to dispatch.
So it will help up to some extent on NCQ enabled SSDs, but will become
unfair if the size of IO each queue dispatches is very different.

Thanks
Vivek

^ permalink raw reply [flat|nested] 25+ messages in thread
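As a purely illustrative aside, the two notions of fairness discussed above
(time-based on seeky media, size-based on low-seek-cost devices such as NCQ
SSDs) can be expressed as two ways of charging a queue for the service it
received. The sketch below is plain C with hypothetical names; none of
these fields or functions exist in cfq-iosched.c:

	/*
	 * Hypothetical weighted charging.  On high-seek-cost media the
	 * queue is charged for the wall-clock time it occupied the disk;
	 * on low-seek-cost media it is charged for the amount of data it
	 * dispatched.  The queue with the smallest accumulated charge is
	 * served next, so a higher weight (derived from ioprio, assumed
	 * nonzero) translates into a larger share of service.
	 */
	struct sched_queue {
		unsigned long	vservice;	/* accumulated weighted charge */
		unsigned int	weight;		/* derived from ioprio, > 0 */
	};

	static void charge_queue(struct sched_queue *q,
				 unsigned long slice_used_ms,
				 unsigned long bytes_dispatched,
				 int low_seek_cost)
	{
		unsigned long charge;

		if (low_seek_cost)
			charge = bytes_dispatched;	/* size-based fairness */
		else
			charge = slice_used_ms;		/* time-based fairness */

		q->vservice += charge / q->weight;
	}

Charging by size rather than by frequency of scheduling is one way to keep
the service differentiation meaningful when the queues dispatch requests of
very different sizes, which is the concern raised in the message above.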