* performance "regression" in cfq compared to anticipatory, deadline and noop
@ 2008-05-10 19:18 Matthew
[not found] ` <20080510200053.GA78555@gandalf.sssup.it>
0 siblings, 1 reply; 35+ messages in thread
From: Matthew @ 2008-05-10 19:18 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Linux Kernel Mailing List
Hi Ingo, hi everybody,
I've encountered sort of a performance "regression" when using cfq (and
the cfq-derived bfq) in comparison with the other I/O schedulers:
1) interactivity under load is much better than with the others
(thanks a lot for that; it made me appreciate this scheduler) BUT
2) everything seems to take somewhat longer to load (big applications
like firefox, etc.)
3) hdparm shows the same behavior
Since I've only been using cfq for a short time (approx. 1-2 weeks
now) I didn't really notice it until I tested "performance" via
hdparm:
/dev/sdd:
Timing buffered disk reads: 308 MB in 3.01 seconds = 102.22 MB/sec
/dev/sdd:
Timing buffered disk reads: 306 MB in 3.01 seconds = 101.66 MB/sec
/dev/sdd:
Timing buffered disk reads: 304 MB in 3.02 seconds = 100.77 MB/sec
noop [anticipatory] deadline cfq
deadline & noop are similar; the noop test finishes pretty fast ...
/dev/sdd:
Timing buffered disk reads: 170 MB in 3.02 seconds = 56.27 MB/sec
/dev/sdd:
Timing buffered disk reads: 176 MB in 3.02 seconds = 58.21 MB/sec
/dev/sdd:
Timing buffered disk reads: 176 MB in 3.02 seconds = 58.22 MB/sec
noop anticipatory deadline [cfq]
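For reference, the comparison above can be reproduced by switching the elevator at runtime and re-running the benchmark. A hedged sketch (the queue path and device name are taken from the /dev/sdd runs above; adjust for your system):

```shell
# Sketch: benchmark each elevator on one disk (run as root).
bench_schedulers() {
    local queue=$1 dev=$2 sched
    for sched in noop anticipatory deadline cfq; do
        echo "$sched" > "$queue/scheduler"   # switch the elevator at runtime
        echo "=== $sched ==="
        hdparm -t "$dev"                     # buffered (uncached) read test
    done
}
# Usage: bench_schedulers /sys/block/sdd/queue /dev/sdd
```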
this behavior occurs on a JMicron SATA controller (JMB363/361) and
probably on the 4th port of the Intel ICH7R, but only with cfq
selected; the first and second ports of the Intel ICH7R are
fine performance-wise. I don't know why it's that picky;
with the other schedulers it's fine.
I've tested: 2.6.24-gentoo-r7 (+ 2.6.24.7), 2.6.24-gentoo-r3, 2.6.25,
2.6.25.2 (+ 2.6.25-zen1), 2.6.25-rc8, and the kernel of the Ubuntu 8.04
amd64 desktop live CD (cfq enabled).
All of them show this worse "performance" compared to the other schedulers.
All kernels are amd64 on Gentoo ~amd64, glibc-2.7.1, gcc-4.2.3 hardened.
hardware:
Asus P5W DH Deluxe
Unfortunately I can't test earlier kernel versions, due to the fact
that I'm using reiser4 for / (root) and the earlier kernels + reiser4
aren't that stable in terms of data safety.
Hopefully this is reproducible & you guys can explain whether this is
something to "worry" about (performance-wise), a real regression, or
just some kind of placebo effect.
Many thanks in advance & thanks a lot for this great scheduler (cfq;
I'm looking forward to bfq in mainline which seems to work even better
under load)
Regards
Mat
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: performance "regression" in cfq compared to anticipatory, deadline and noop
[not found] ` <20080510200053.GA78555@gandalf.sssup.it>
@ 2008-05-10 20:39 ` Matthew
2008-05-10 21:56 ` Fabio Checconi
2008-05-11 0:00 ` Aaron Carroll
0 siblings, 2 replies; 35+ messages in thread
From: Matthew @ 2008-05-10 20:39 UTC (permalink / raw)
To: Fabio Checconi; +Cc: Linux Kernel Mailing List, jens.axboe
> Hi, I'm experiencing some cfq/bfq performance issues too, but I'm
> still not able to track down the reasons, so before posting on this
> topic on the mailing list I'd ask you a couple of questions.
>
> 1) Are you running the hdparm performance test under some cpu load?
> (Even two hdparm instances run in parallel would do.)
>
> 2) Does using a bigger value of slice_idle increase the throughput?
>
Hi,
1) no, it was always (almost) completely idle
2) a bigger value even made it worse; setting it to "0" however
seemingly "fixed" it. I don't know what the overall effect/impact
is, though; this will need some more real-world testing ;)
cat /sys/block/sdd/queue/iosched/slice_idle
0
hdparm -t /dev/sdd
/dev/sdd:
Timing buffered disk reads: 314 MB in 3.01 seconds = 104.32 MB/sec
hdparm -t /dev/sdd
/dev/sdd:
Timing buffered disk reads: 312 MB in 3.00 seconds = 103.86 MB/sec
hdparm -t /dev/sdd
/dev/sdd:
Timing buffered disk reads: 314 MB in 3.01 seconds = 104.24 MB/sec
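A hedged sketch of the knob being toggled above; the optional base-path parameter exists only so the helper can be exercised outside a real system, the real tunable lives under /sys/block/&lt;dev&gt;/queue/iosched/:

```shell
# Sketch: set and read back slice_idle for a device (run as root).
set_slice_idle() {
    local dev=$1 value=$2 base=${3:-/sys/block}
    echo "$value" > "$base/$dev/queue/iosched/slice_idle"
    cat "$base/$dev/queue/iosched/slice_idle"    # confirm the new value
}
# Usage: set_slice_idle sdd 0   # disable idling, then re-run hdparm -t /dev/sdd
```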
one side note / question:
will this cause more wakeups on the CPU and/or decrease battery
runtime on, e.g., laptops?
> Thank you very much, I'll try hdparm on my test boxes and come back
> to the list if I find something on that.
>
> As a sidenote, Ingo is not the author/maintainer of cfq, maybe the
> next time CC: Jens Axboe for that.
>
oops, I didn't know that, thanks; I didn't want to give the wrong
person the "credits".
hi & kudos to Jens ;)
here's a nice site which explains all of the settings:
http://www.nextre.it/oracledocs/ioscheduler_03.html
Regards
Mat
* Re: performance "regression" in cfq compared to anticipatory, deadline and noop
2008-05-10 20:39 ` Matthew
@ 2008-05-10 21:56 ` Fabio Checconi
2008-05-11 0:00 ` Aaron Carroll
1 sibling, 0 replies; 35+ messages in thread
From: Fabio Checconi @ 2008-05-10 21:56 UTC (permalink / raw)
To: Matthew; +Cc: Linux Kernel Mailing List, jens.axboe
> From: Matthew <jackdachef@gmail.com>
> Date: Sat, May 10, 2008 10:39:50PM +0200
>
> > 2) Does using a bigger value of slice_idle increase the throughput?
> >
>
> Hi,
>
> 2) a bigger value even made it worse; setting it to "0" however
> seemingly "fixed" it. I don't know what the overall effect/impact
> is, though; this will need some more real-world testing ;)
>
Well, it's not a fix... the overall effect is that you should end
up with more seeks (and thus reduced throughput) on loads consisting
of more than one process where at least one of them is a
synchronous sequential reader.
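The workload described here can be approximated with two concurrent sequential readers; a hedged sketch (block counts and offsets are arbitrary). With slice_idle=0 the disk should seek between the two streams and aggregate throughput should drop, while with idling enabled each reader gets contiguous slices:

```shell
# Sketch: two synchronous sequential readers on one device.
# Compare the rates dd reports against a single-reader run.
parallel_read() {
    local dev=$1
    dd if="$dev" of=/dev/null bs=64k count=5000 skip=0      &
    dd if="$dev" of=/dev/null bs=64k count=5000 skip=100000 &
    wait   # both dd processes print their throughput on exit
}
# Usage: parallel_read /dev/sdd
```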
> one side note / question:
>
> will this cause more wakeups on the CPU and/or decrease battery
> runtime on, e.g., laptops?
>
I don't know the overall effect on battery life; however, with no idling
you have one less timer active in the system (one that, depending
on the load, does not fire frequently anyway) and more continuous disk
activity.
* Re: performance "regression" in cfq compared to anticipatory, deadline and noop
2008-05-10 20:39 ` Matthew
2008-05-10 21:56 ` Fabio Checconi
@ 2008-05-11 0:00 ` Aaron Carroll
1 sibling, 0 replies; 35+ messages in thread
From: Aaron Carroll @ 2008-05-11 0:00 UTC (permalink / raw)
To: Matthew; +Cc: Fabio Checconi, Linux Kernel Mailing List, jens.axboe
Matthew wrote:
>> 2) Does using a bigger value of slice_idle increase the throughput?
>
> [..]
>
> 2) a bigger value even made it worse; setting it to "0" however
> seemingly "fixed" it. I don't know what the overall effect/impact
> is, though; this will need some more real-world testing ;)
As Fabio said, you may lose throughput if you have multiple processes
with at least one synchronous sequential reader. However, for other
workloads, you should see a large global throughput improvement. This
is because CFQ tends to idle without much regard to think time or
seekiness, often wasting a few ms. The trade-off is that your slow
synchronous processes may suffer a little.
-- Aaron
* Re: performance "regression" in cfq compared to anticipatory, deadline and noop
@ 2008-05-11 13:14 Daniel J Blueman
2008-05-11 14:02 ` Kasper Sandberg
0 siblings, 1 reply; 35+ messages in thread
From: Daniel J Blueman @ 2008-05-11 13:14 UTC (permalink / raw)
To: axboe; +Cc: Linux Kernel, Matthew
I've been experiencing this for a while also; an almost 50% regression
is seen for single-process reads (ie sync) if slice_idle is 1ms or
more (eg default of 8) [1], which seems phenomenal.
Jens, is this the expected price to pay for optimal busy-spindle
scheduling, a design issue, bug or am I missing something totally?
Thanks,
Daniel
--- [1]
# cat /sys/block/sda/queue/iosched/slice_idle
8
# echo 1 >/proc/sys/vm/drop_caches
# dd if=/dev/sda of=/dev/null bs=64k count=5000
5000+0 records in
5000+0 records out
327680000 bytes (328 MB) copied, 4.92922 s, 66.5 MB/s
# echo 0 >/sys/block/sda/queue/iosched/slice_idle
# echo 1 >/proc/sys/vm/drop_caches
# dd if=/dev/sda of=/dev/null bs=64k count=5000
5000+0 records in
5000+0 records out
327680000 bytes (328 MB) copied, 2.74098 s, 120 MB/s
# hdparm -Tt /dev/sda
/dev/sda:
Timing cached reads: 15464 MB in 2.00 seconds = 7741.05 MB/sec
Timing buffered disk reads: 342 MB in 3.01 seconds = 113.70 MB/sec
[120MB/s is known platter-rate for this disc, so expected]
--
Daniel J Blueman
* Re: performance "regression" in cfq compared to anticipatory, deadline and noop
2008-05-11 13:14 performance "regression" in cfq compared to anticipatory, deadline and noop Daniel J Blueman
@ 2008-05-11 14:02 ` Kasper Sandberg
2008-05-13 12:20 ` Jens Axboe
0 siblings, 1 reply; 35+ messages in thread
From: Kasper Sandberg @ 2008-05-11 14:02 UTC (permalink / raw)
To: Daniel J Blueman; +Cc: axboe, Linux Kernel, Matthew
On Sun, 2008-05-11 at 14:14 +0100, Daniel J Blueman wrote:
> I've been experiencing this for a while also; an almost 50% regression
> is seen for single-process reads (ie sync) if slice_idle is 1ms or
> more (eg default of 8) [1], which seems phenomenal.
>
> Jens, is this the expected price to pay for optimal busy-spindle
> scheduling, a design issue, bug or am I missing something totally?
>
> Thanks,
> Daniel
>
> --- [1]
>
> # cat /sys/block/sda/queue/iosched/slice_idle
> 8
> # echo 1 >/proc/sys/vm/drop_caches
> # dd if=/dev/sda of=/dev/null bs=64k count=5000
> 5000+0 records in
> 5000+0 records out
> 327680000 bytes (328 MB) copied, 4.92922 s, 66.5 MB/s
>
> # echo 0 >/sys/block/sda/queue/iosched/slice_idle
> # echo 1 >/proc/sys/vm/drop_caches
> # dd if=/dev/sda of=/dev/null bs=64k count=5000
> 5000+0 records in
> 5000+0 records out
> 327680000 bytes (328 MB) copied, 2.74098 s, 120 MB/s
>
> # hdparm -Tt /dev/sda
>
> /dev/sda:
> Timing cached reads: 15464 MB in 2.00 seconds = 7741.05 MB/sec
> Timing buffered disk reads: 342 MB in 3.01 seconds = 113.70 MB/sec
>
> [120MB/s is known platter-rate for this disc, so expected]
This appears to be what I get as well...
root@quadstation # dd if=/dev/sda of=/dev/null bs=64k count=5000
5000+0 records in
5000+0 records out
327680000 bytes (328 MB) copied, 5.48209 s, 59.8 MB/s
root@quadstation # echo 0 >/sys/block/sda/queue/iosched/slice_idle
root@quadstation # echo 1 >/proc/sys/vm/drop_caches
root@quadstation # dd if=/dev/sda of=/dev/null bs=64k count=5000
5000+0 records in
5000+0 records out
327680000 bytes (328 MB) copied, 2.93932 s, 111 MB/s
root@quadstation # hdparm -Tt /dev/sda
Timing cached reads: 7264 MB in 2.00 seconds = 3633.82 MB/sec
Timing buffered disk reads: 322 MB in 3.01 seconds = 107.00 MB/sec
root@quadstation # echo 0 >/sys/block/sda/queue/iosched/slice_idle
root@quadstation # echo 1 >/proc/sys/vm/drop_caches
root@quadstation # hdparm -Tt /dev/sda
Timing cached reads: 15268 MB in 2.00 seconds = 7643.54 MB/sec
Timing buffered disk reads: 328 MB in 3.01 seconds = 108.85 MB/sec
To be sure, I did it all again:
noop:
root@quadstation # echo 1 >/proc/sys/vm/drop_caches
root@quadstation # dd if=/dev/sda of=/dev/null bs=64k count=5000
5000+0 records in
5000+0 records out
327680000 bytes (328 MB) copied, 2.85503 s, 115 MB/s
root@quadstation # echo 1 >/proc/sys/vm/drop_caches
root@quadstation # hdparm -tT /dev/sda
Timing cached reads: 14076 MB in 2.00 seconds = 7045.78 MB/sec
Timing buffered disk reads: 328 MB in 3.01 seconds = 109.12 MB/sec
anticipatory:
root@quadstation # echo 1 >/proc/sys/vm/drop_caches
root@quadstation # dd if=/dev/sda of=/dev/null bs=64k count=5000
5000+0 records in
5000+0 records out
327680000 bytes (328 MB) copied, 2.96948 s, 110 MB/s
root@quadstation # echo 1 >/proc/sys/vm/drop_caches
root@quadstation # hdparm -tT /dev/sda
Timing cached reads: 13424 MB in 2.00 seconds = 6719.29 MB/sec
Timing buffered disk reads: 328 MB in 3.01 seconds = 109.13 MB/sec
cfq:
root@quadstation # echo 1 >/proc/sys/vm/drop_caches
root@quadstation # dd if=/dev/sda of=/dev/null bs=64k count=5000
5000+0 records in
5000+0 records out
327680000 bytes (328 MB) copied, 5.25252 s, 62.4 MB/s
root@quadstation # echo 1 >/proc/sys/vm/drop_caches
root@quadstation # hdparm -tT /dev/sda
Timing cached reads: 13434 MB in 2.00 seconds = 6723.59 MB/sec
Timing buffered disk reads: 188 MB in 3.00 seconds = 62.57 MB/sec
This would appear to be quite a considerable performance difference.
* Re: performance "regression" in cfq compared to anticipatory, deadline and noop
2008-05-11 14:02 ` Kasper Sandberg
@ 2008-05-13 12:20 ` Jens Axboe
2008-05-13 12:58 ` Matthew
2008-05-13 13:51 ` Kasper Sandberg
0 siblings, 2 replies; 35+ messages in thread
From: Jens Axboe @ 2008-05-13 12:20 UTC (permalink / raw)
To: Kasper Sandberg; +Cc: Daniel J Blueman, Linux Kernel, Matthew
On Sun, May 11 2008, Kasper Sandberg wrote:
> On Sun, 2008-05-11 at 14:14 +0100, Daniel J Blueman wrote:
> > I've been experiencing this for a while also; an almost 50% regression
> > is seen for single-process reads (ie sync) if slice_idle is 1ms or
> > more (eg default of 8) [1], which seems phenomenal.
> >
> > Jens, is this the expected price to pay for optimal busy-spindle
> > scheduling, a design issue, bug or am I missing something totally?
> >
> > Thanks,
> > Daniel
> >
> > --- [1]
> >
> > # cat /sys/block/sda/queue/iosched/slice_idle
> > 8
> > # echo 1 >/proc/sys/vm/drop_caches
> > # dd if=/dev/sda of=/dev/null bs=64k count=5000
> > 5000+0 records in
> > 5000+0 records out
> > 327680000 bytes (328 MB) copied, 4.92922 s, 66.5 MB/s
> >
> > # echo 0 >/sys/block/sda/queue/iosched/slice_idle
> > # echo 1 >/proc/sys/vm/drop_caches
> > # dd if=/dev/sda of=/dev/null bs=64k count=5000
> > 5000+0 records in
> > 5000+0 records out
> > 327680000 bytes (328 MB) copied, 2.74098 s, 120 MB/s
> >
> > # hdparm -Tt /dev/sda
> >
> > /dev/sda:
> > Timing cached reads: 15464 MB in 2.00 seconds = 7741.05 MB/sec
> > Timing buffered disk reads: 342 MB in 3.01 seconds = 113.70 MB/sec
> >
> > [120MB/s is known platter-rate for this disc, so expected]
>
> This appears to be what I get as well...
>
> root@quadstation # dd if=/dev/sda of=/dev/null bs=64k count=5000
> 5000+0 records in
> 5000+0 records out
> 327680000 bytes (328 MB) copied, 5.48209 s, 59.8 MB/s
> root@quadstation # echo 0 >/sys/block/sda/queue/iosched/slice_idle
> root@quadstation # echo 1 >/proc/sys/vm/drop_caches
> root@quadstation # dd if=/dev/sda of=/dev/null bs=64k count=5000
> 5000+0 records in
> 5000+0 records out
> 327680000 bytes (328 MB) copied, 2.93932 s, 111 MB/s
> root@quadstation # hdparm -Tt /dev/sda
> Timing cached reads: 7264 MB in 2.00 seconds = 3633.82 MB/sec
> Timing buffered disk reads: 322 MB in 3.01 seconds = 107.00 MB/sec
> root@quadstation # echo 0 >/sys/block/sda/queue/iosched/slice_idle
> root@quadstation # echo 1 >/proc/sys/vm/drop_caches
> root@quadstation # hdparm -Tt /dev/sda
> Timing cached reads: 15268 MB in 2.00 seconds = 7643.54 MB/sec
> Timing buffered disk reads: 328 MB in 3.01 seconds = 108.85 MB/sec
>
>
> To be sure, i did it all again:
> noop:
> root@quadstation # echo 1 >/proc/sys/vm/drop_caches
> root@quadstation # dd if=/dev/sda of=/dev/null bs=64k count=5000
> 5000+0 records in
> 5000+0 records out
> 327680000 bytes (328 MB) copied, 2.85503 s, 115 MB/s
> root@quadstation # echo 1 >/proc/sys/vm/drop_caches
> root@quadstation # hdparm -tT /dev/sda
> Timing cached reads: 14076 MB in 2.00 seconds = 7045.78 MB/sec
> Timing buffered disk reads: 328 MB in 3.01 seconds = 109.12 MB/sec
>
> anticipatory:
> root@quadstation # echo 1 >/proc/sys/vm/drop_caches
> root@quadstation # dd if=/dev/sda of=/dev/null bs=64k count=5000
> 5000+0 records in
> 5000+0 records out
> 327680000 bytes (328 MB) copied, 2.96948 s, 110 MB/s
> root@quadstation # echo 1 >/proc/sys/vm/drop_caches
> root@quadstation # hdparm -tT /dev/sda
> Timing cached reads: 13424 MB in 2.00 seconds = 6719.29 MB/sec
> Timing buffered disk reads: 328 MB in 3.01 seconds = 109.13 MB/sec
>
> cfq:
> root@quadstation # echo 1 >/proc/sys/vm/drop_caches
> root@quadstation # dd if=/dev/sda of=/dev/null bs=64k count=5000
> 5000+0 records in
> 5000+0 records out
> 327680000 bytes (328 MB) copied, 5.25252 s, 62.4 MB/s
> root@quadstation # echo 1 >/proc/sys/vm/drop_caches
> root@quadstation # hdparm -tT /dev/sda
> Timing cached reads: 13434 MB in 2.00 seconds = 6723.59 MB/sec
> Timing buffered disk reads: 188 MB in 3.00 seconds = 62.57 MB/sec
>
> This would appear to be quite a considerable performance difference.
Indeed, that is of course a bug. The initial mail here mentions this as
a regression - which kernel was the last that worked ok?
If someone would send me a blktrace of such a slow run, that would be
nice. Basically just do a blktrace /dev/sda (or whatever device) while
doing the hdparm, preferably storing output files on a different
device. Then send the raw sda.blktrace.* files to me. Thanks!
--
Jens Axboe
* Re: performance "regression" in cfq compared to anticipatory, deadline and noop
2008-05-13 12:20 ` Jens Axboe
@ 2008-05-13 12:58 ` Matthew
2008-05-13 13:05 ` Jens Axboe
2008-05-13 13:51 ` Kasper Sandberg
1 sibling, 1 reply; 35+ messages in thread
From: Matthew @ 2008-05-13 12:58 UTC (permalink / raw)
To: Jens Axboe; +Cc: Kasper Sandberg, Daniel J Blueman, Linux Kernel
On Tue, May 13, 2008 at 2:20 PM, Jens Axboe <jens.axboe@oracle.com> wrote:
>
> On Sun, May 11 2008, Kasper Sandberg wrote:
> > On Sun, 2008-05-11 at 14:14 +0100, Daniel J Blueman wrote:
> > > I've been experiencing this for a while also; an almost 50% regression
> > > is seen for single-process reads (ie sync) if slice_idle is 1ms or
> > > more (eg default of 8) [1], which seems phenomenal.
> > >
> > > Jens, is this the expected price to pay for optimal busy-spindle
> > > scheduling, a design issue, bug or am I missing something totally?
> > >
> > > Thanks,
> > > Daniel
[snip]
...
[snip]
> >
> > This would appear to be quite a considerable performance difference.
>
> Indeed, that is of course a bug. The initial mail here mentions this as
> a regression - which kernel was the last that worked ok?
>
> If someone would send me a blktrace of such a slow run, that would be
> nice. Basically just do a blktrace /dev/sda (or whatever device) while
> doing the hdparm, preferably storing output files on a different
> device. Then send the raw sda.blktrace.* files to me. Thanks!
>
> --
> Jens Axboe
>
>
Hi Jens,
I called this a "regression" since I wasn't sure whether this is a real
bug or just something introduced recently; I've only just started to use
cfq as my main I/O scheduler, so I can't tell ...
testing 2.6.17 unfortunately is somewhat impossible for me (reiser4;
too new hardware, with problems on the jmicron controller)
google "says" that it seemingly already existed since at least 2.6.18
(Ubuntu DapperDrake) [see:
http://ubuntuforums.org/showpost.php?p=1484633&postcount=12]
well - back to topic:
for a blktrace one needs to enable CONFIG_BLK_DEV_IO_TRACE, right?
blktrace can be obtained from your git repo?
Thanks
Mat
* Re: performance "regression" in cfq compared to anticipatory, deadline and noop
2008-05-13 12:58 ` Matthew
@ 2008-05-13 13:05 ` Jens Axboe
[not found] ` <e85b9d30805130842p3a34305l4ab1e7926e4b0dba@mail.gmail.com>
0 siblings, 1 reply; 35+ messages in thread
From: Jens Axboe @ 2008-05-13 13:05 UTC (permalink / raw)
To: Matthew; +Cc: Kasper Sandberg, Daniel J Blueman, Linux Kernel
On Tue, May 13 2008, Matthew wrote:
> On Tue, May 13, 2008 at 2:20 PM, Jens Axboe <jens.axboe@oracle.com> wrote:
> >
> > On Sun, May 11 2008, Kasper Sandberg wrote:
> > > On Sun, 2008-05-11 at 14:14 +0100, Daniel J Blueman wrote:
> > > > I've been experiencing this for a while also; an almost 50% regression
> > > > is seen for single-process reads (ie sync) if slice_idle is 1ms or
> > > > more (eg default of 8) [1], which seems phenomenal.
> > > >
> > > > Jens, is this the expected price to pay for optimal busy-spindle
> > > > scheduling, a design issue, bug or am I missing something totally?
> > > >
> > > > Thanks,
> > > > Daniel
> [snip]
> ...
> [snip]
> > >
> > > This would appear to be quite a considerable performance difference.
> >
> > Indeed, that is of course a bug. The initial mail here mentions this as
> > a regression - which kernel was the last that worked ok?
> >
> > If someone would send me a blktrace of such a slow run, that would be
> > nice. Basically just do a blktrace /dev/sda (or whatever device) while
> > doing the hdparm, preferably storing output files on a different
> > device. Then send the raw sda.blktrace.* files to me. Thanks!
> >
> > --
> > Jens Axboe
> >
> >
>
> Hi Jens,
>
> I called this a "regression" since I wasn't sure whether this is a real
> bug or just something introduced recently; I've only just started to use
> cfq as my main I/O scheduler, so I can't tell ...
>
> testing 2.6.17 unfortunately is somewhat impossible for me (reiser4;
> too new hardware, with problems on the jmicron controller)
>
> google "says" that it seemingly already existed since at least 2.6.18
> (Ubuntu DapperDrake) [see:
> http://ubuntuforums.org/showpost.php?p=1484633&postcount=12]
Funky :/
> well - back to topic:
>
> for a blktrace one needs to enable CONFIG_BLK_DEV_IO_TRACE, right?
> blktrace can be obtained from your git repo?
Yes on both accounts, or just grab a blktrace snapshot from:
http://brick.kernel.dk/snaps/blktrace-git-latest.tar.gz
if you don't use git.
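A hedged sketch of the capture requested here; the device name, output directory, and the use of blktrace's timed mode (-w) are assumptions. The important part is keeping the trace files off the device being traced:

```shell
# Sketch: trace a device while hdparm runs (run as root,
# kernel built with CONFIG_BLK_DEV_IO_TRACE=y).
capture_trace() {
    local dev=$1 outdir=$2 secs=${3:-10}
    cd "$outdir" || return 1        # keep sdX.blktrace.* off the traced disk
    blktrace -w "$secs" "$dev" &    # timed trace, run in the background
    hdparm -t "$dev"                # the slow run to capture
    wait                            # blktrace exits after $secs
}
# Usage: capture_trace /dev/sdd /mnt/other-disk 10
```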
--
Jens Axboe
* Re: performance "regression" in cfq compared to anticipatory, deadline and noop
2008-05-13 12:20 ` Jens Axboe
2008-05-13 12:58 ` Matthew
@ 2008-05-13 13:51 ` Kasper Sandberg
2008-05-14 0:33 ` Kasper Sandberg
1 sibling, 1 reply; 35+ messages in thread
From: Kasper Sandberg @ 2008-05-13 13:51 UTC (permalink / raw)
To: Jens Axboe; +Cc: Daniel J Blueman, Linux Kernel, Matthew
On Tue, 2008-05-13 at 14:20 +0200, Jens Axboe wrote:
> On Sun, May 11 2008, Kasper Sandberg wrote:
> > On Sun, 2008-05-11 at 14:14 +0100, Daniel J Blueman wrote:
> > > I've been experiencing this for a while also; an almost 50% regression
> > > is seen for single-process reads (ie sync) if slice_idle is 1ms or
> > > more (eg default of 8) [1], which seems phenomenal.
<snip>
> >
> > This would appear to be quite a considerable performance difference.
>
> Indeed, that is of course a bug. The initial mail here mentions this as
> a regression - which kernel was the last that worked ok?
I am afraid I cannot tell you exactly...
But I do have some additional information for you.
I have a server running with a disk identical to mine, however with an
older Intel AHCI controller.
That one gets 80 MB/s with cfq, and 100 MB/s with
anticipatory/deadline/noop in hdparm.
This server is running Debian stable with a .18 kernel. I am sad to say,
however, that I will be unable to do any testing on this box, since it
is a production server and I cannot shut it down.
haltek:~/blktrace# ./blktrace /dev/sda
BLKTRACESETUP: Inappropriate ioctl for device
Failed to start trace on /dev/sda
However, on the box where you saw the previous numbers, I sure will be
able to provide you with the data you need.
I expect to get around to doing this this afternoon, or tonight at
~02:00 (I'm GMT+1).
>
> If someone would send me a blktrace of such a slow run, that would be
> nice. Basically just do a blktrace /dev/sda (or whatever device) while
> doing the hdparm, preferably storing output files on a different
> device. Then send the raw sda.blktrace.* files to me. Thanks!
>
* Re: performance "regression" in cfq compared to anticipatory, deadline and noop
[not found] ` <e85b9d30805130842p3a34305l4ab1e7926e4b0dba@mail.gmail.com>
@ 2008-05-13 18:03 ` Jens Axboe
2008-05-13 18:40 ` Jens Axboe
0 siblings, 1 reply; 35+ messages in thread
From: Jens Axboe @ 2008-05-13 18:03 UTC (permalink / raw)
To: Matthew; +Cc: Kasper Sandberg, Daniel J Blueman, Linux Kernel
On Tue, May 13 2008, Matthew wrote:
> On Tue, May 13, 2008 at 3:05 PM, Jens Axboe <jens.axboe@oracle.com> wrote:
> >
> > On Tue, May 13 2008, Matthew wrote:
> > > On Tue, May 13, 2008 at 2:20 PM, Jens Axboe <jens.axboe@oracle.com> wrote:
> > > >
> > > > On Sun, May 11 2008, Kasper Sandberg wrote:
> > > > > On Sun, 2008-05-11 at 14:14 +0100, Daniel J Blueman wrote:
> > > > > > I've been experiencing this for a while also; an almost 50% regression
> > > > > > is seen for single-process reads (ie sync) if slice_idle is 1ms or
> > > > > > more (eg default of 8) [1], which seems phenomenal.
> > > > > >
> > > > > > Jens, is this the expected price to pay for optimal busy-spindle
> > > > > > scheduling, a design issue, bug or am I missing something totally?
> > > > > >
> > > > > > Thanks,
> > > > > > Daniel
> > > [snip]
> > > ...
> > > [snip]
> > > > >
> [snip]
>
> ...
>
> [snip]
> > > well - back to topic:
> > >
> > > for a blktrace one needs to enable CONFIG_BLK_DEV_IO_TRACE, right?
> > > blktrace can be obtained from your git repo?
> >
> > Yes on both accounts, or just grab a blktrace snapshot from:
> >
> > http://brick.kernel.dk/snaps/blktrace-git-latest.tar.gz
> >
> > if you don't use git.
> >
> > --
> > Jens Axboe
> >
> >
>
> unfortunately that snapshot wouldn't compile for me because of an error,
> so I used the in-tree snapshot provided by portage: 0.0.20071210202527;
> I hope that's ok, too
That's fine, it doesn't really matter. But I'd appreciate if you sent me
the compile error in private, so that I can fix it :-)
> attached you'll find the btrace (2 files) as a tar.bz2 package
>
> from cfq
>
> here the corresponding hdparm-output:
> hdparm -t /dev/sdd
>
> /dev/sdd:
> Timing buffered disk reads: 152 MB in 3.02 seconds = 50.38 MB/sec
>
> blktrace /dev/sdd
> Device: /dev/sdd
> CPU 0: 0 events, 4136 KiB data
> CPU 1: 0 events, 11 KiB data
> Total: 0 events (dropped 0), 4146 KiB data
>
> and the corresponding output of anticipatory and attached the btrace of it:
>
> hdparm -t /dev/sdd
>
> /dev/sdd:
> Timing buffered disk reads: 310 MB in 3.02 seconds = 102.76 MB/sec
>
> blktrace /dev/sdd
> Device: /dev/sdd
> CPU 0: 0 events, 7831 KiB data
> CPU 1: 0 events, 132 KiB data
> Total: 0 events (dropped 0), 7962 KiB data
They seem to start out the same, but then CFQ gets interrupted by a
timer unplug (which is also odd) and after that the request size drops.
On most devices you don't notice, but some are fairly picky about
request sizes. The end result is that CFQ has an average dispatch
request size of 142kb, where AS is more than double that at 306kb. I'll
need to analyze the data and look at the code a bit more to see WHY this
happens.
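For anyone wanting to extract those dispatch-size numbers from their own trace, a hedged sketch using the blkparse/btt tools that ship with blktrace (the grepped section name is an assumption about btt's output format):

```shell
# Sketch: turn raw per-CPU trace files (e.g. sdd.blktrace.0, .1) into
# one binary stream and print btt's merge/size summary.
avg_dispatch_size() {
    local base=$1                      # e.g. "sdd", run in the trace directory
    blkparse -i "$base" -d "$base.bin" > /dev/null   # combine per-CPU files
    btt -i "$base.bin" | grep -A 4 "Device Merge Information"
}
# Usage: avg_dispatch_size sdd   # BLKavg column ~ mean blocks per dispatch
```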
--
Jens Axboe
* Re: performance "regression" in cfq compared to anticipatory, deadline and noop
2008-05-13 18:03 ` Jens Axboe
@ 2008-05-13 18:40 ` Jens Axboe
2008-05-13 19:23 ` Matthew
2008-05-14 8:05 ` Daniel J Blueman
0 siblings, 2 replies; 35+ messages in thread
From: Jens Axboe @ 2008-05-13 18:40 UTC (permalink / raw)
To: Matthew; +Cc: Kasper Sandberg, Daniel J Blueman, Linux Kernel
On Tue, May 13 2008, Jens Axboe wrote:
> On Tue, May 13 2008, Matthew wrote:
> > On Tue, May 13, 2008 at 3:05 PM, Jens Axboe <jens.axboe@oracle.com> wrote:
> > >
> > > On Tue, May 13 2008, Matthew wrote:
> > > > On Tue, May 13, 2008 at 2:20 PM, Jens Axboe <jens.axboe@oracle.com> wrote:
> > > > >
> > > > > On Sun, May 11 2008, Kasper Sandberg wrote:
> > > > > > On Sun, 2008-05-11 at 14:14 +0100, Daniel J Blueman wrote:
> > > > > > > I've been experiencing this for a while also; an almost 50% regression
> > > > > > > is seen for single-process reads (ie sync) if slice_idle is 1ms or
> > > > > > > more (eg default of 8) [1], which seems phenomenal.
> > > > > > >
> > > > > > > Jens, is this the expected price to pay for optimal busy-spindle
> > > > > > > scheduling, a design issue, bug or am I missing something totally?
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Daniel
> > > > [snip]
> > > > ...
> > > > [snip]
> > > > > >
> > [snip]
> >
> > ...
> >
> > [snip]
> > > > well - back to topic:
> > > >
> > > > for a blktrace one needs to enable CONFIG_BLK_DEV_IO_TRACE, right?
> > > > blktrace can be obtained from your git repo?
> > >
> > > Yes on both accounts, or just grab a blktrace snapshot from:
> > >
> > > http://brick.kernel.dk/snaps/blktrace-git-latest.tar.gz
> > >
> > > if you don't use git.
> > >
> > > --
> > > Jens Axboe
> > >
> > >
> >
> > unfortunately that snapshot wouldn't compile for me because of an error,
> > so I used the in-tree snapshot provided by portage: 0.0.20071210202527;
> > I hope that's ok, too
>
> That's fine, it doesn't really matter. But I'd appreciate if you sent me
> the compile error in private, so that I can fix it :-)
>
> > attached you'll find the btrace (2 files) as a tar.bz2 package
> >
> > from cfq
> >
> > here the corresponding hdparm-output:
> > hdparm -t /dev/sdd
> >
> > /dev/sdd:
> > Timing buffered disk reads: 152 MB in 3.02 seconds = 50.38 MB/sec
> >
> > blktrace /dev/sdd
> > Device: /dev/sdd
> > CPU 0: 0 events, 4136 KiB data
> > CPU 1: 0 events, 11 KiB data
> > Total: 0 events (dropped 0), 4146 KiB data
> >
> > and the corresponding output of anticipatory and attached the btrace of it:
> >
> > hdparm -t /dev/sdd
> >
> > /dev/sdd:
> > Timing buffered disk reads: 310 MB in 3.02 seconds = 102.76 MB/sec
> >
> > blktrace /dev/sdd
> > Device: /dev/sdd
> > CPU 0: 0 events, 7831 KiB data
> > CPU 1: 0 events, 132 KiB data
> > Total: 0 events (dropped 0), 7962 KiB data
>
> They seem to start out the same, but then CFQ gets interrupted by a
> timer unplug (which is also odd) and after that the request size drops.
> On most devices you don't notice, but some are fairly picky about
> request sizes. The end result is that CFQ has an average dispatch
> request size of 142kb, where AS is more than double that at 306kb. I'll
> need to analyze the data and look at the code a bit more to see WHY this
> happens.
Here's a test patch, I think we get into this situation due to CFQ being
a bit too eager to start queuing again. Not tested, I'll need to spend
some testing time on this. But I'd appreciate some feedback on whether
this changes the situation! The final patch will be a little more
involved.
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index b399c62..ebd8ce2 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1775,18 +1775,8 @@ cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
cic->last_request_pos = rq->sector + rq->nr_sectors;
- if (cfqq == cfqd->active_queue) {
- /*
- * if we are waiting for a request for this queue, let it rip
- * immediately and flag that we must not expire this queue
- * just now
- */
- if (cfq_cfqq_wait_request(cfqq)) {
- cfq_mark_cfqq_must_dispatch(cfqq);
- del_timer(&cfqd->idle_slice_timer);
- blk_start_queueing(cfqd->queue);
- }
- } else if (cfq_should_preempt(cfqd, cfqq, rq)) {
+ if ((cfqq != cfqd->active_queue) &&
+ cfq_should_preempt(cfqd, cfqq, rq)) {
/*
* not the active queue - expire current slice if it is
* idle and has expired it's mean thinktime or this new queue
--
Jens Axboe
* Re: performance "regression" in cfq compared to anticipatory, deadline and noop
2008-05-13 18:40 ` Jens Axboe
@ 2008-05-13 19:23 ` Matthew
2008-05-13 19:30 ` Jens Axboe
2008-05-14 8:05 ` Daniel J Blueman
1 sibling, 1 reply; 35+ messages in thread
From: Matthew @ 2008-05-13 19:23 UTC (permalink / raw)
To: Jens Axboe; +Cc: Kasper Sandberg, Daniel J Blueman, Linux Kernel
On Tue, May 13, 2008 at 8:40 PM, Jens Axboe <jens.axboe@oracle.com> wrote:
>
> On Tue, May 13 2008, Jens Axboe wrote:
> > On Tue, May 13 2008, Matthew wrote:
> > > On Tue, May 13, 2008 at 3:05 PM, Jens Axboe <jens.axboe@oracle.com> wrote:
> > > >
> > > > On Tue, May 13 2008, Matthew wrote:
> > > > > On Tue, May 13, 2008 at 2:20 PM, Jens Axboe <jens.axboe@oracle.com> wrote:
> > > > > >
> > > > > > On Sun, May 11 2008, Kasper Sandberg wrote:
> > > > > > > On Sun, 2008-05-11 at 14:14 +0100, Daniel J Blueman wrote:
> > > > > > > > I've been experiencing this for a while also; an almost 50% regression
> > > > > > > > is seen for single-process reads (ie sync) if slice_idle is 1ms or
> > > > > > > > more (eg default of 8) [1], which seems phenomenal.
> > > > > > > >
> > > > > > > > Jens, is this the expected price to pay for optimal busy-spindle
> > > > > > > > scheduling, a design issue, bug or am I missing something totally?
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Daniel
> > > > > [snip]
> > > > > ...
> > > > > [snip]
> > > > > > >
> > > [snip]
> > >
> > > ...
> > >
> > > [snip]
> > > > > well - back to topic:
> > > > >
> > > > > for a blktrace one need to enable CONFIG_BLK_DEV_IO_TRACE , right ?
> > > > > blktrace can be obtained from your git-repo ?
> > > >
> > > > Yes on both accounts, or just grab a blktrace snapshot from:
> > > >
> > > > http://brick.kernel.dk/snaps/blktrace-git-latest.tar.gz
> > > >
> > > > if you don't use git.
> > > >
> > > > --
> > > > Jens Axboe
> > > >
> > > >
> > >
[snip]
...
[snip]
> >
> > They seem to start out the same, but then CFQ gets interrupted by a
> > timer unplug (which is also odd) and after that the request size drops.
> > On most devices you don't notice, but some are fairly picky about
> > request sizes. The end result is that CFQ has an average dispatch
> > request size of 142kb, where AS is more than double that at 306kb. I'll
> > need to analyze the data and look at the code a bit more to see WHY this
> > happens.
>
> Here's a test patch, I think we get into this situation due to CFQ being
> a bit too eager to start queuing again. Not tested, I'll need to spend
> some testing time on this. But I'd appreciate some feedback on whether
> this changes the situation! The final patch will be a little more
> involved.
[snip]
...
[snip]
>
> --
> Jens Axboe
>
>
Unfortunately, that patch didn't help:
hdparm -t /dev/sde
/dev/sde:
Timing buffered disk reads: 178 MB in 3.03 seconds = 58.67 MB/sec
hdparm -t /dev/sdd
/dev/sdd:
Timing buffered disk reads: 164 MB in 3.00 seconds = 54.61 MB/sec
-> the first should be around 74 MB/sec, the second around 102 MB/sec
Thanks
Mat
* Re: performance "regression" in cfq compared to anticipatory, deadline and noop
2008-05-13 19:23 ` Matthew
@ 2008-05-13 19:30 ` Jens Axboe
0 siblings, 0 replies; 35+ messages in thread
From: Jens Axboe @ 2008-05-13 19:30 UTC (permalink / raw)
To: Matthew; +Cc: Kasper Sandberg, Daniel J Blueman, Linux Kernel
On Tue, May 13 2008, Matthew wrote:
> On Tue, May 13, 2008 at 8:40 PM, Jens Axboe <jens.axboe@oracle.com> wrote:
> >
> > On Tue, May 13 2008, Jens Axboe wrote:
> > > On Tue, May 13 2008, Matthew wrote:
> > > > On Tue, May 13, 2008 at 3:05 PM, Jens Axboe <jens.axboe@oracle.com> wrote:
> > > > >
> > > > > On Tue, May 13 2008, Matthew wrote:
> > > > > > On Tue, May 13, 2008 at 2:20 PM, Jens Axboe <jens.axboe@oracle.com> wrote:
> > > > > > >
> > > > > > > On Sun, May 11 2008, Kasper Sandberg wrote:
> > > > > > > > On Sun, 2008-05-11 at 14:14 +0100, Daniel J Blueman wrote:
> > > > > > > > > I've been experiencing this for a while also; an almost 50% regression
> > > > > > > > > is seen for single-process reads (ie sync) if slice_idle is 1ms or
> > > > > > > > > more (eg default of 8) [1], which seems phenomenal.
> > > > > > > > >
> > > > > > > > > Jens, is this the expected price to pay for optimal busy-spindle
> > > > > > > > > scheduling, a design issue, bug or am I missing something totally?
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Daniel
> > > > > > [snip]
> > > > > > ...
> > > > > > [snip]
> > > > > > > >
> > > > [snip]
> > > >
> > > > ...
> > > >
> > > > [snip]
> > > > > > well - back to topic:
> > > > > >
> > > > > > for a blktrace one need to enable CONFIG_BLK_DEV_IO_TRACE , right ?
> > > > > > blktrace can be obtained from your git-repo ?
> > > > >
> > > > > Yes on both accounts, or just grab a blktrace snapshot from:
> > > > >
> > > > > http://brick.kernel.dk/snaps/blktrace-git-latest.tar.gz
> > > > >
> > > > > if you don't use git.
> > > > >
> > > > > --
> > > > > Jens Axboe
> > > > >
> > > > >
> > > >
> [snip]
> ...
> [snip]
> > >
> > > They seem to start out the same, but then CFQ gets interrupted by a
> > > timer unplug (which is also odd) and after that the request size drops.
> > > On most devices you don't notice, but some are fairly picky about
> > > request sizes. The end result is that CFQ has an average dispatch
> > > request size of 142kb, where AS is more than double that at 306kb. I'll
> > > need to analyze the data and look at the code a bit more to see WHY this
> > > happens.
> >
> > Here's a test patch, I think we get into this situation due to CFQ being
> > a bit too eager to start queuing again. Not tested, I'll need to spend
> > some testing time on this. But I'd appreciate some feedback on whether
> > this changes the situation! The final patch will be a little more
> > involved.
> [snip]
> ...
> [snip]
> >
> > --
> > Jens Axboe
> >
> >
>
> unfortunately that patch didn't help:
>
> hdparm -t /dev/sde
>
> /dev/sde:
> Timing buffered disk reads: 178 MB in 3.03 seconds = 58.67 MB/sec
>
>
> hdparm -t /dev/sdd
>
> /dev/sdd:
> Timing buffered disk reads: 164 MB in 3.00 seconds = 54.61 MB/sec
>
> -> the first should be around 74 MB/sec, the second around 102 MB/sec
Can you capture blktrace for that run as well, please? Just to have
something to compare with.
--
Jens Axboe
* Re: performance "regression" in cfq compared to anticipatory, deadline and noop
2008-05-13 13:51 ` Kasper Sandberg
@ 2008-05-14 0:33 ` Kasper Sandberg
0 siblings, 0 replies; 35+ messages in thread
From: Kasper Sandberg @ 2008-05-14 0:33 UTC (permalink / raw)
To: Jens Axboe; +Cc: Daniel J Blueman, Linux Kernel, Matthew
On Tue, 2008-05-13 at 15:51 +0200, Kasper Sandberg wrote:
> On Tue, 2008-05-13 at 14:20 +0200, Jens Axboe wrote:
> > On Sun, May 11 2008, Kasper Sandberg wrote:
> > > On Sun, 2008-05-11 at 14:14 +0100, Daniel J Blueman wrote:
> > > > I've been experiencing this for a while also; an almost 50% regression
> > > > is seen for single-process reads (ie sync) if slice_idle is 1ms or
> > > > more (eg default of 8) [1], which seems phenomenal.
> <snip>
> > >
<snip>
>
> i expect to get around to doing this this afternoon, or tonight at
> ~02:00
> (im GMT+1).
Well :) not too far off (02:32 now).
http://62.242.235.92/~redeeman/blktrace.tar.bz2
It contains the blktraces with cfq, noop, anticipatory and deadline,
along with the output of blktrace and hdparm.
Hope it helps.
>
>
> >
> > If someone would send me a blktrace of such a slow run, that would be
> > nice. Basically just do a blktrace /dev/sda (or whatever device) while
> > doing the hdparm, preferably storing output files on a difference
> > device. Then send the raw sda.blktrace.* files to me. Thanks!
> >
>
> --
* Re: performance "regression" in cfq compared to anticipatory, deadline and noop
2008-05-13 18:40 ` Jens Axboe
2008-05-13 19:23 ` Matthew
@ 2008-05-14 8:05 ` Daniel J Blueman
2008-05-14 8:26 ` Jens Axboe
1 sibling, 1 reply; 35+ messages in thread
From: Daniel J Blueman @ 2008-05-14 8:05 UTC (permalink / raw)
To: Jens Axboe; +Cc: Matthew, Kasper Sandberg, Linux Kernel
> > > > > > > On Sun, 2008-05-11 at 14:14 +0100, Daniel J Blueman wrote:
> > > > > > > > I've been experiencing this for a while also; an almost 50% regression
> > > > > > > > is seen for single-process reads (ie sync) if slice_idle is 1ms or
> > > > > > > > more (eg default of 8) [1], which seems phenomenal.
> > > > > > > >
> > > > > > > > Jens, is this the expected price to pay for optimal busy-spindle
> > > > > > > > scheduling, a design issue, bug or am I missing something totally?
[snip]
> > They seem to start out the same, but then CFQ gets interrupted by a
> > timer unplug (which is also odd) and after that the request size drops.
> > On most devices you don't notice, but some are fairly picky about
> > request sizes. The end result is that CFQ has an average dispatch
> > request size of 142kb, where AS is more than double that at 306kb. I'll
> > need to analyze the data and look at the code a bit more to see WHY this
> > happens.
>
> Here's a test patch, I think we get into this situation due to CFQ being
> a bit too eager to start queuing again. Not tested, I'll need to spend
> some testing time on this. But I'd appreciate some feedback on whether
> this changes the situation! The final patch will be a little more
> involved.
>
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index b399c62..ebd8ce2 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -1775,18 +1775,8 @@ cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
>
> cic->last_request_pos = rq->sector + rq->nr_sectors;
>
> - if (cfqq == cfqd->active_queue) {
> - /*
> - * if we are waiting for a request for this queue, let it rip
> - * immediately and flag that we must not expire this queue
> - * just now
> - */
> - if (cfq_cfqq_wait_request(cfqq)) {
> - cfq_mark_cfqq_must_dispatch(cfqq);
> - del_timer(&cfqd->idle_slice_timer);
> - blk_start_queueing(cfqd->queue);
> - }
> - } else if (cfq_should_preempt(cfqd, cfqq, rq)) {
> + if ((cfqq != cfqd->active_queue) &&
> + cfq_should_preempt(cfqd, cfqq, rq)) {
> /*
> * not the active queue - expire current slice if it is
> * idle and has expired it's mean thinktime or this new queue
I find this does address the issue (both with 64KB-stride dd and
hdparm -t; presumably the requests are getting merged). Tested on
2.6.26-rc2 on Ubuntu HH 8.04 x86-64, with slice_idle at its default of
8 and AHCI on ICH9; the disk is a ST3320613AS.
Blktrace profiles from 'dd if=/dev/sda of=/dev/null bs=64k count=1000' are at:
http://quora.org/blktrace-profiles.tar.bz2
Let me know when you have an updated patch to test,
Daniel
--
Daniel J Blueman
* Re: performance "regression" in cfq compared to anticipatory, deadline and noop
2008-05-14 8:05 ` Daniel J Blueman
@ 2008-05-14 8:26 ` Jens Axboe
2008-05-14 20:52 ` Daniel J Blueman
[not found] ` <e85b9d30805140332r3311b2d6r6831d37421ced757@mail.gmail.com>
0 siblings, 2 replies; 35+ messages in thread
From: Jens Axboe @ 2008-05-14 8:26 UTC (permalink / raw)
To: Daniel J Blueman; +Cc: Matthew, Kasper Sandberg, Linux Kernel
On Wed, May 14 2008, Daniel J Blueman wrote:
> > > > > > > > On Sun, 2008-05-11 at 14:14 +0100, Daniel J Blueman wrote:
> > > > > > > > > I've been experiencing this for a while also; an almost 50% regression
> > > > > > > > > is seen for single-process reads (ie sync) if slice_idle is 1ms or
> > > > > > > > > more (eg default of 8) [1], which seems phenomenal.
> > > > > > > > >
> > > > > > > > > Jens, is this the expected price to pay for optimal busy-spindle
> > > > > > > > > scheduling, a design issue, bug or am I missing something totally?
> [snip]
> > > They seem to start out the same, but then CFQ gets interrupted by a
> > > timer unplug (which is also odd) and after that the request size drops.
> > > On most devices you don't notice, but some are fairly picky about
> > > request sizes. The end result is that CFQ has an average dispatch
> > > request size of 142kb, where AS is more than double that at 306kb. I'll
> > > need to analyze the data and look at the code a bit more to see WHY this
> > > happens.
> >
> > Here's a test patch, I think we get into this situation due to CFQ being
> > a bit too eager to start queuing again. Not tested, I'll need to spend
> > some testing time on this. But I'd appreciate some feedback on whether
> > this changes the situation! The final patch will be a little more
> > involved.
> >
> > diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> > index b399c62..ebd8ce2 100644
> > --- a/block/cfq-iosched.c
> > +++ b/block/cfq-iosched.c
> > @@ -1775,18 +1775,8 @@ cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
> >
> > cic->last_request_pos = rq->sector + rq->nr_sectors;
> >
> > - if (cfqq == cfqd->active_queue) {
> > - /*
> > - * if we are waiting for a request for this queue, let it rip
> > - * immediately and flag that we must not expire this queue
> > - * just now
> > - */
> > - if (cfq_cfqq_wait_request(cfqq)) {
> > - cfq_mark_cfqq_must_dispatch(cfqq);
> > - del_timer(&cfqd->idle_slice_timer);
> > - blk_start_queueing(cfqd->queue);
> > - }
> > - } else if (cfq_should_preempt(cfqd, cfqq, rq)) {
> > + if ((cfqq != cfqd->active_queue) &&
> > + cfq_should_preempt(cfqd, cfqq, rq)) {
> > /*
> > * not the active queue - expire current slice if it is
> > * idle and has expired it's mean thinktime or this new queue
>
> I find this does address the issue (both with 64KB stride dd and
> hdparm -t; presumably the requests getting merged). Tested on
> 2.6.26-rc2 on Ubuntu HH 804 x86-64, with slice_idle defaulting to 8
> and AHCI on ICH9; disk is ST3320613AS.
>
> Blktrace profiles from 'dd if=/dev/sda of=/dev/null bs=64k count=1000' are at:
>
> http://quora.org/blktrace-profiles.tar.bz2
Goodie! I think the below patch is better - we do want to schedule the
queue immediately, but we do not want to interrupt the queuer. So just
kick the workqueue handling of the queue instead of entering the
dispatcher directly. Can you test this one as well? Thanks!
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index f4e1006..e8c1941 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1107,7 +1107,6 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
cfq_clear_cfqq_must_dispatch(cfqq);
cfq_clear_cfqq_wait_request(cfqq);
- del_timer(&cfqd->idle_slice_timer);
dispatched += __cfq_dispatch_requests(cfqd, cfqq, max_dispatch);
}
@@ -1769,15 +1768,9 @@ cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
cic->last_request_pos = rq->sector + rq->nr_sectors;
if (cfqq == cfqd->active_queue) {
- /*
- * if we are waiting for a request for this queue, let it rip
- * immediately and flag that we must not expire this queue
- * just now
- */
if (cfq_cfqq_wait_request(cfqq)) {
- cfq_mark_cfqq_must_dispatch(cfqq);
del_timer(&cfqd->idle_slice_timer);
- blk_start_queueing(cfqd->queue);
+ kblockd_schedule_work(&cfqd->unplug_work);
}
} else if (cfq_should_preempt(cfqd, cfqq, rq)) {
/*
@@ -1787,7 +1780,7 @@ cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
*/
cfq_preempt_queue(cfqd, cfqq);
cfq_mark_cfqq_must_dispatch(cfqq);
- blk_start_queueing(cfqd->queue);
+ kblockd_schedule_work(&cfqd->unplug_work);
}
}
--
Jens Axboe
* Re: performance "regression" in cfq compared to anticipatory, deadline and noop
2008-05-14 8:26 ` Jens Axboe
@ 2008-05-14 20:52 ` Daniel J Blueman
2008-05-14 21:37 ` Matthew
[not found] ` <e85b9d30805140332r3311b2d6r6831d37421ced757@mail.gmail.com>
1 sibling, 1 reply; 35+ messages in thread
From: Daniel J Blueman @ 2008-05-14 20:52 UTC (permalink / raw)
To: Jens Axboe; +Cc: Matthew, Kasper Sandberg, Linux Kernel
On Wed, May 14, 2008 at 9:26 AM, Jens Axboe <jens.axboe@oracle.com> wrote:
> On Wed, May 14 2008, Daniel J Blueman wrote:
>> > > > > > > > On Sun, 2008-05-11 at 14:14 +0100, Daniel J Blueman wrote:
>> > > > > > > > > I've been experiencing this for a while also; an almost 50% regression
>> > > > > > > > > is seen for single-process reads (ie sync) if slice_idle is 1ms or
>> > > > > > > > > more (eg default of 8) [1], which seems phenomenal.
>> > > > > > > > >
>> > > > > > > > > Jens, is this the expected price to pay for optimal busy-spindle
>> > > > > > > > > scheduling, a design issue, bug or am I missing something totally?
>> [snip]
>> > > They seem to start out the same, but then CFQ gets interrupted by a
>> > > timer unplug (which is also odd) and after that the request size drops.
>> > > On most devices you don't notice, but some are fairly picky about
>> > > request sizes. The end result is that CFQ has an average dispatch
>> > > request size of 142kb, where AS is more than double that at 306kb. I'll
>> > > need to analyze the data and look at the code a bit more to see WHY this
>> > > happens.
>> >
>> > Here's a test patch, I think we get into this situation due to CFQ being
>> > a bit too eager to start queuing again. Not tested, I'll need to spend
>> > some testing time on this. But I'd appreciate some feedback on whether
>> > this changes the situation! The final patch will be a little more
>> > involved.
>> >
>> > diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
>> > index b399c62..ebd8ce2 100644
>> > --- a/block/cfq-iosched.c
>> > +++ b/block/cfq-iosched.c
>> > @@ -1775,18 +1775,8 @@ cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
>> >
>> > cic->last_request_pos = rq->sector + rq->nr_sectors;
>> >
>> > - if (cfqq == cfqd->active_queue) {
>> > - /*
>> > - * if we are waiting for a request for this queue, let it rip
>> > - * immediately and flag that we must not expire this queue
>> > - * just now
>> > - */
>> > - if (cfq_cfqq_wait_request(cfqq)) {
>> > - cfq_mark_cfqq_must_dispatch(cfqq);
>> > - del_timer(&cfqd->idle_slice_timer);
>> > - blk_start_queueing(cfqd->queue);
>> > - }
>> > - } else if (cfq_should_preempt(cfqd, cfqq, rq)) {
>> > + if ((cfqq != cfqd->active_queue) &&
>> > + cfq_should_preempt(cfqd, cfqq, rq)) {
>> > /*
>> > * not the active queue - expire current slice if it is
>> > * idle and has expired it's mean thinktime or this new queue
>>
>> I find this does address the issue (both with 64KB stride dd and
>> hdparm -t; presumably the requests getting merged). Tested on
>> 2.6.26-rc2 on Ubuntu HH 804 x86-64, with slice_idle defaulting to 8
>> and AHCI on ICH9; disk is ST3320613AS.
>>
>> Blktrace profiles from 'dd if=/dev/sda of=/dev/null bs=64k count=1000' are at:
>>
>> http://quora.org/blktrace-profiles.tar.bz2
>
> Goodie! I think the below patch is better - we do want to schedule the
> queue immediately, but we do not want to interrupt the queuer. So just
> kick the workqueue handling of the queue instead of entering the
> dispatcher directly. Can you test this one as well? Thanks!
>
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index f4e1006..e8c1941 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -1107,7 +1107,6 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
>
> cfq_clear_cfqq_must_dispatch(cfqq);
> cfq_clear_cfqq_wait_request(cfqq);
> - del_timer(&cfqd->idle_slice_timer);
>
> dispatched += __cfq_dispatch_requests(cfqd, cfqq, max_dispatch);
> }
> @@ -1769,15 +1768,9 @@ cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
> cic->last_request_pos = rq->sector + rq->nr_sectors;
>
> if (cfqq == cfqd->active_queue) {
> - /*
> - * if we are waiting for a request for this queue, let it rip
> - * immediately and flag that we must not expire this queue
> - * just now
> - */
> if (cfq_cfqq_wait_request(cfqq)) {
> - cfq_mark_cfqq_must_dispatch(cfqq);
> del_timer(&cfqd->idle_slice_timer);
> - blk_start_queueing(cfqd->queue);
> + kblockd_schedule_work(&cfqd->unplug_work);
> }
> } else if (cfq_should_preempt(cfqd, cfqq, rq)) {
> /*
> @@ -1787,7 +1780,7 @@ cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
> */
> cfq_preempt_queue(cfqd, cfqq);
> cfq_mark_cfqq_must_dispatch(cfqq);
> - blk_start_queueing(cfqd->queue);
> + kblockd_schedule_work(&cfqd->unplug_work);
> }
> }
Applied on top of 2.6.26-rc2, I get platter speed (118 MB/s) with 'dd
if=/dev/sda of=/dev/null bs=64k' and 'hdparm -t', so it looks good.
Identical testing without the patch (i.e. pure mainline) consistently
yields 65 MB/s.
Blktrace profile at:
http://quora.org/blktrace-profiles-2.tar.bz2
I'll check for performance regressions with postmark on XFS; anything
else worth running while I've got this in hand?
Daniel
--
Daniel J Blueman
* Re: performance "regression" in cfq compared to anticipatory, deadline and noop
2008-05-14 20:52 ` Daniel J Blueman
@ 2008-05-14 21:37 ` Matthew
2008-05-15 7:01 ` Jens Axboe
0 siblings, 1 reply; 35+ messages in thread
From: Matthew @ 2008-05-14 21:37 UTC (permalink / raw)
To: Daniel J Blueman; +Cc: Jens Axboe, Kasper Sandberg, Linux Kernel
On Wed, May 14, 2008 at 10:52 PM, Daniel J Blueman
<daniel.blueman@gmail.com> wrote:
> On Wed, May 14, 2008 at 9:26 AM, Jens Axboe <jens.axboe@oracle.com> wrote:
>> On Wed, May 14 2008, Daniel J Blueman wrote:
>>> > > > > > > > On Sun, 2008-05-11 at 14:14 +0100, Daniel J Blueman wrote:
>>> > > > > > > > > I've been experiencing this for a while also; an almost 50% regression
>>> > > > > > > > > is seen for single-process reads (ie sync) if slice_idle is 1ms or
>>> > > > > > > > > more (eg default of 8) [1], which seems phenomenal.
>>> > > > > > > > >
>>> > > > > > > > > Jens, is this the expected price to pay for optimal busy-spindle
>>> > > > > > > > > scheduling, a design issue, bug or am I missing something totally?
>>> [snip]
>>> > > They seem to start out the same, but then CFQ gets interrupted by a
>>> > > timer unplug (which is also odd) and after that the request size drops.
>>> > > On most devices you don't notice, but some are fairly picky about
>>> > > request sizes. The end result is that CFQ has an average dispatch
>>> > > request size of 142kb, where AS is more than double that at 306kb. I'll
>>> > > need to analyze the data and look at the code a bit more to see WHY this
>>> > > happens.
>>> >
>>> > Here's a test patch, I think we get into this situation due to CFQ being
>>> > a bit too eager to start queuing again. Not tested, I'll need to spend
>>> > some testing time on this. But I'd appreciate some feedback on whether
>>> > this changes the situation! The final patch will be a little more
>>> > involved.
>>> >
>>> > diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
>>> > index b399c62..ebd8ce2 100644
>>> > --- a/block/cfq-iosched.c
>>> > +++ b/block/cfq-iosched.c
>>> > @@ -1775,18 +1775,8 @@ cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
>>> >
>>> > cic->last_request_pos = rq->sector + rq->nr_sectors;
>>> >
>>> > - if (cfqq == cfqd->active_queue) {
>>> > - /*
>>> > - * if we are waiting for a request for this queue, let it rip
>>> > - * immediately and flag that we must not expire this queue
>>> > - * just now
>>> > - */
>>> > - if (cfq_cfqq_wait_request(cfqq)) {
>>> > - cfq_mark_cfqq_must_dispatch(cfqq);
>>> > - del_timer(&cfqd->idle_slice_timer);
>>> > - blk_start_queueing(cfqd->queue);
>>> > - }
>>> > - } else if (cfq_should_preempt(cfqd, cfqq, rq)) {
>>> > + if ((cfqq != cfqd->active_queue) &&
>>> > + cfq_should_preempt(cfqd, cfqq, rq)) {
>>> > /*
>>> > * not the active queue - expire current slice if it is
>>> > * idle and has expired it's mean thinktime or this new queue
>>>
>>> I find this does address the issue (both with 64KB stride dd and
>>> hdparm -t; presumably the requests getting merged). Tested on
>>> 2.6.26-rc2 on Ubuntu HH 804 x86-64, with slice_idle defaulting to 8
>>> and AHCI on ICH9; disk is ST3320613AS.
>>>
>>> Blktrace profiles from 'dd if=/dev/sda of=/dev/null bs=64k count=1000' are at:
>>>
>>> http://quora.org/blktrace-profiles.tar.bz2
>>
>> Goodie! I think the below patch is better - we do want to schedule the
>> queue immediately, but we do not want to interrupt the queuer. So just
>> kick the workqueue handling of the queue instead of entering the
>> dispatcher directly. Can you test this one as well? Thanks!
>>
>> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
>> index f4e1006..e8c1941 100644
>> --- a/block/cfq-iosched.c
>> +++ b/block/cfq-iosched.c
>> @@ -1107,7 +1107,6 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
>>
>> cfq_clear_cfqq_must_dispatch(cfqq);
>> cfq_clear_cfqq_wait_request(cfqq);
>> - del_timer(&cfqd->idle_slice_timer);
>>
>> dispatched += __cfq_dispatch_requests(cfqd, cfqq, max_dispatch);
>> }
>> @@ -1769,15 +1768,9 @@ cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
>> cic->last_request_pos = rq->sector + rq->nr_sectors;
>>
>> if (cfqq == cfqd->active_queue) {
>> - /*
>> - * if we are waiting for a request for this queue, let it rip
>> - * immediately and flag that we must not expire this queue
>> - * just now
>> - */
>> if (cfq_cfqq_wait_request(cfqq)) {
>> - cfq_mark_cfqq_must_dispatch(cfqq);
>> del_timer(&cfqd->idle_slice_timer);
>> - blk_start_queueing(cfqd->queue);
>> + kblockd_schedule_work(&cfqd->unplug_work);
>> }
>> } else if (cfq_should_preempt(cfqd, cfqq, rq)) {
>> /*
>> @@ -1787,7 +1780,7 @@ cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
>> */
>> cfq_preempt_queue(cfqd, cfqq);
>> cfq_mark_cfqq_must_dispatch(cfqq);
>> - blk_start_queueing(cfqd->queue);
>> + kblockd_schedule_work(&cfqd->unplug_work);
>> }
>> }
>
> Applied on top of 2.6.26-rc2, I get platter-speed (118MB/s) with 'dd
> if=/dev/sda of=/dev/null bs=64k' and 'hdparm -t', so looks good.
> Identical testing without the patch (ie pure mainline) consistently
> yields 65MB/s.
>
> Blktrace profile at:
>
> http://quora.org/blktrace-profiles-2.tar.bz2
>
> I'll check for performance regressions with postmark on XFS; anything
> else worth running while I've got this in hand?
>
> Daniel
> --
> Daniel J Blueman
>
So it seems something specific to >=2.6.26-rc2 plus the patch fixed it
for you?
Were there any notable changes from 2.6.25 -> 2.6.26 in cfq or the VFS
in general?
I'm curious whether this also works with 2.6.25; could you please test
that too?
I'll give .26-rc2 a test-ride later.
Thanks
Mat
* Re: performance "regression" in cfq compared to anticipatory, deadline and noop
2008-05-14 21:37 ` Matthew
@ 2008-05-15 7:01 ` Jens Axboe
2008-05-15 12:21 ` Fabio Checconi
0 siblings, 1 reply; 35+ messages in thread
From: Jens Axboe @ 2008-05-15 7:01 UTC (permalink / raw)
To: Matthew; +Cc: Daniel J Blueman, Kasper Sandberg, Linux Kernel
On Wed, May 14 2008, Matthew wrote:
> On Wed, May 14, 2008 at 10:52 PM, Daniel J Blueman
> <daniel.blueman@gmail.com> wrote:
> > On Wed, May 14, 2008 at 9:26 AM, Jens Axboe <jens.axboe@oracle.com> wrote:
> >> On Wed, May 14 2008, Daniel J Blueman wrote:
> >>> > > > > > > > On Sun, 2008-05-11 at 14:14 +0100, Daniel J Blueman wrote:
> >>> > > > > > > > > I've been experiencing this for a while also; an almost 50% regression
> >>> > > > > > > > > is seen for single-process reads (ie sync) if slice_idle is 1ms or
> >>> > > > > > > > > more (eg default of 8) [1], which seems phenomenal.
> >>> > > > > > > > >
> >>> > > > > > > > > Jens, is this the expected price to pay for optimal busy-spindle
> >>> > > > > > > > > scheduling, a design issue, bug or am I missing something totally?
> >>> [snip]
> >>> > > They seem to start out the same, but then CFQ gets interrupted by a
> >>> > > timer unplug (which is also odd) and after that the request size drops.
> >>> > > On most devices you don't notice, but some are fairly picky about
> >>> > > request sizes. The end result is that CFQ has an average dispatch
> >>> > > request size of 142kb, where AS is more than double that at 306kb. I'll
> >>> > > need to analyze the data and look at the code a bit more to see WHY this
> >>> > > happens.
> >>> >
> >>> > Here's a test patch, I think we get into this situation due to CFQ being
> >>> > a bit too eager to start queuing again. Not tested, I'll need to spend
> >>> > some testing time on this. But I'd appreciate some feedback on whether
> >>> > this changes the situation! The final patch will be a little more
> >>> > involved.
> >>> >
> >>> > diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> >>> > index b399c62..ebd8ce2 100644
> >>> > --- a/block/cfq-iosched.c
> >>> > +++ b/block/cfq-iosched.c
> >>> > @@ -1775,18 +1775,8 @@ cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
> >>> >
> >>> > cic->last_request_pos = rq->sector + rq->nr_sectors;
> >>> >
> >>> > - if (cfqq == cfqd->active_queue) {
> >>> > - /*
> >>> > - * if we are waiting for a request for this queue, let it rip
> >>> > - * immediately and flag that we must not expire this queue
> >>> > - * just now
> >>> > - */
> >>> > - if (cfq_cfqq_wait_request(cfqq)) {
> >>> > - cfq_mark_cfqq_must_dispatch(cfqq);
> >>> > - del_timer(&cfqd->idle_slice_timer);
> >>> > - blk_start_queueing(cfqd->queue);
> >>> > - }
> >>> > - } else if (cfq_should_preempt(cfqd, cfqq, rq)) {
> >>> > + if ((cfqq != cfqd->active_queue) &&
> >>> > + cfq_should_preempt(cfqd, cfqq, rq)) {
> >>> > /*
> >>> > * not the active queue - expire current slice if it is
> >>> > * idle and has expired it's mean thinktime or this new queue
> >>>
> >>> I find this does address the issue (both with 64KB stride dd and
> >>> hdparm -t; presumably the requests getting merged). Tested on
> >>> 2.6.26-rc2 on Ubuntu HH 804 x86-64, with slice_idle defaulting to 8
> >>> and AHCI on ICH9; disk is ST3320613AS.
> >>>
> >>> Blktrace profiles from 'dd if=/dev/sda of=/dev/null bs=64k count=1000' are at:
> >>>
> >>> http://quora.org/blktrace-profiles.tar.bz2
> >>
> >> Goodie! I think the below patch is better - we do want to schedule the
> >> queue immediately, but we do not want to interrupt the queuer. So just
> >> kick the workqueue handling of the queue instead of entering the
> >> dispatcher directly. Can you test this one as well? Thanks!
> >>
> >> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> >> index f4e1006..e8c1941 100644
> >> --- a/block/cfq-iosched.c
> >> +++ b/block/cfq-iosched.c
> >> @@ -1107,7 +1107,6 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
> >>
> >> cfq_clear_cfqq_must_dispatch(cfqq);
> >> cfq_clear_cfqq_wait_request(cfqq);
> >> - del_timer(&cfqd->idle_slice_timer);
> >>
> >> dispatched += __cfq_dispatch_requests(cfqd, cfqq, max_dispatch);
> >> }
> >> @@ -1769,15 +1768,9 @@ cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
> >> cic->last_request_pos = rq->sector + rq->nr_sectors;
> >>
> >> if (cfqq == cfqd->active_queue) {
> >> - /*
> >> - * if we are waiting for a request for this queue, let it rip
> >> - * immediately and flag that we must not expire this queue
> >> - * just now
> >> - */
> >> if (cfq_cfqq_wait_request(cfqq)) {
> >> - cfq_mark_cfqq_must_dispatch(cfqq);
> >> del_timer(&cfqd->idle_slice_timer);
> >> - blk_start_queueing(cfqd->queue);
> >> + kblockd_schedule_work(&cfqd->unplug_work);
> >> }
> >> } else if (cfq_should_preempt(cfqd, cfqq, rq)) {
> >> /*
> >> @@ -1787,7 +1780,7 @@ cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
> >> */
> >> cfq_preempt_queue(cfqd, cfqq);
> >> cfq_mark_cfqq_must_dispatch(cfqq);
> >> - blk_start_queueing(cfqd->queue);
> >> + kblockd_schedule_work(&cfqd->unplug_work);
> >> }
> >> }
> >
> > Applied on top of 2.6.26-rc2, I get platter-speed (118MB/s) with 'dd
> > if=/dev/sda of=/dev/null bs=64k' and 'hdparm -t', so looks good.
> > Identical testing without the patch (ie pure mainline) consistently
> > yields 65MB/s.
> >
> > Blktrace profile at:
> >
> > http://quora.org/blktrace-profiles-2.tar.bz2
> >
> > I'll check for performance regressions with postmark on XFS; anything
> > else worth running while I've got this in hand?
> >
> > Daniel
> > --
> > Daniel J Blueman
> >
>
> so it seems something specific to >=2.6.26-rc2 + the patch fixed it for you ?
>
> were there any notable changes from 2.6.25 -> 2.6.26 in cfq or the VFS
> in general ?
>
> I'm curious if this also works with 2.6.25, could you please test that too ?
>
> I'll give .26-rc2 a test-ride later
I don't think it's 2.6.25 vs 2.6.26-rc2; I can still reproduce some
request-size offsets with the patch. So I'm still fumbling around with
this; I'll send out another test patch when I'm confident it solves
the size issue.
--
Jens Axboe
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: performance "regression" in cfq compared to anticipatory, deadline and noop
2008-05-15 7:01 ` Jens Axboe
@ 2008-05-15 12:21 ` Fabio Checconi
2008-05-16 6:40 ` Jens Axboe
2008-08-24 20:24 ` Daniel J Blueman
0 siblings, 2 replies; 35+ messages in thread
From: Fabio Checconi @ 2008-05-15 12:21 UTC (permalink / raw)
To: Jens Axboe; +Cc: Matthew, Daniel J Blueman, Kasper Sandberg, Linux Kernel
> From: Jens Axboe <jens.axboe@oracle.com>
> Date: Thu, May 15, 2008 09:01:28AM +0200
>
> I don't think it's 2.6.25 vs 2.6.26-rc2, I can still reproduce some
> request size offsets with the patch. So still fumbling around with this,
> I'll be sending out another test patch when I'm confident it's solved
> the size issue.
>
IMO an interesting thing is how/why anticipatory doesn't show the
issue. The device is not put into ANTIC_WAIT_NEXT unless a dispatch
returned no requests while the queue was not empty. This seems to be
enough in the reported workloads.
I don't think this behavior is the correct one (it is still racy
WRT merges after breaking anticipation); anyway, it should make things
a little bit better. I fear that a complete solution would not
involve only the scheduler.
Introducing the very same behavior in cfq does not seem so easy
(i.e., start idling only if there was a dispatch round while the
last request was being served), but an approximated version can be
introduced quite easily. The patch below should do that, rescheduling
the dispatch only if necessary; it is not tested at all, just posted
for discussion.
---
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index b399c62..41f1e0e 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -169,6 +169,7 @@ enum cfqq_state_flags {
CFQ_CFQQ_FLAG_queue_new, /* queue never been serviced */
CFQ_CFQQ_FLAG_slice_new, /* no requests dispatched in slice */
CFQ_CFQQ_FLAG_sync, /* synchronous queue */
+ CFQ_CFQQ_FLAG_dispatched, /* empty dispatch while idling */
};
#define CFQ_CFQQ_FNS(name) \
@@ -196,6 +197,7 @@ CFQ_CFQQ_FNS(prio_changed);
CFQ_CFQQ_FNS(queue_new);
CFQ_CFQQ_FNS(slice_new);
CFQ_CFQQ_FNS(sync);
+CFQ_CFQQ_FNS(dispatched);
#undef CFQ_CFQQ_FNS
static void cfq_dispatch_insert(struct request_queue *, struct request *);
@@ -749,6 +751,7 @@ static void __cfq_set_active_queue(struct cfq_data *cfqd,
cfqq->slice_end = 0;
cfq_clear_cfqq_must_alloc_slice(cfqq);
cfq_clear_cfqq_fifo_expire(cfqq);
+ cfq_clear_cfqq_dispatched(cfqq);
cfq_mark_cfqq_slice_new(cfqq);
cfq_clear_cfqq_queue_new(cfqq);
}
@@ -978,6 +981,7 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
*/
if (timer_pending(&cfqd->idle_slice_timer) ||
(cfqq->dispatched && cfq_cfqq_idle_window(cfqq))) {
+ cfq_mark_cfqq_dispatched(cfqq);
cfqq = NULL;
goto keep_queue;
}
@@ -1784,7 +1788,10 @@ cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
if (cfq_cfqq_wait_request(cfqq)) {
cfq_mark_cfqq_must_dispatch(cfqq);
del_timer(&cfqd->idle_slice_timer);
- blk_start_queueing(cfqd->queue);
+ if (cfq_cfqq_dispatched(cfqq)) {
+ cfq_clear_cfqq_dispatched(cfqq);
+ cfq_schedule_dispatch(cfqd);
+ }
}
} else if (cfq_should_preempt(cfqd, cfqq, rq)) {
/*
* Re: performance "regression" in cfq compared to anticipatory, deadline and noop
2008-05-15 12:21 ` Fabio Checconi
@ 2008-05-16 6:40 ` Jens Axboe
2008-05-16 7:46 ` Fabio Checconi
2008-08-24 20:24 ` Daniel J Blueman
1 sibling, 1 reply; 35+ messages in thread
From: Jens Axboe @ 2008-05-16 6:40 UTC (permalink / raw)
To: Fabio Checconi; +Cc: Matthew, Daniel J Blueman, Kasper Sandberg, Linux Kernel
On Thu, May 15 2008, Fabio Checconi wrote:
> > From: Jens Axboe <jens.axboe@oracle.com>
> > Date: Thu, May 15, 2008 09:01:28AM +0200
> >
> > I don't think it's 2.6.25 vs 2.6.26-rc2, I can still reproduce some
> > request size offsets with the patch. So still fumbling around with this,
> > I'll be sending out another test patch when I'm confident it's solved
> > the size issue.
> >
>
> IMO an interesting thing is how/why anticipatory doesn't show the
> issue. The device is not put into ANTIC_WAIT_NEXT if there is no
> dispatch returning no requests while the queue is not empty. This
> seems to be enough in the reported workloads.
>
> I don't think this behavior is the correct one (it is still racy
> WRT merges after breaking anticipation) anyway it should make things
> a little bit better. I fear that a complete solution would not
> involve only the scheduler.
>
> Introducing the very same behavior in cfq seems to be not so easy
> (i.e., start idling only if there was a dispatch round while the
> last request was being served) but an approximated version can be
> introduced quite easily. The patch below should do that, rescheduling
> the dispatch only if necessary; it is not tested at all, just posted
> for discussion.
Daniel (and others in this thread), can you give this a shot as well? It
looks promising; it'll allow greater buildup of the request. From my
testing, instead of getting nicely aligned 128k or 256k requests, we'd
end up with a nasty 4k+124k stream. Delaying the first queue kick should
fix that, since we won't dispatch that first 4k request until it has been
merged.
I think we can improve this further without getting too involved. If a
2nd request is seen in cfq_rq_enqueued(), then DO schedule a dispatch,
since this likely means that we won't be doing more merges on the first
one.
--
Jens Axboe
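The 4k+124k pattern Jens describes can be spotted by tallying the size of each dispatch ('D') line in blkparse output. A minimal sketch of that analysis (the sample lines below are illustrative, not taken from the posted traces; blkparse reports request sizes in 512-byte sectors):

```python
import re

# Match blkparse-style dispatch lines:
#   dev   cpu  seq  timestamp    pid  D  R  sector + nsectors [process]
DISPATCH_RE = re.compile(
    r'^\s*\d+,\d+\s+\d+\s+\d+\s+[\d.]+\s+\d+\s+D\s+R\S*\s+\d+\s+\+\s+(\d+)')

def dispatch_sizes_kb(lines):
    """Size in KiB of each read dispatch (one sector = 512 bytes)."""
    return [int(m.group(1)) * 512 // 1024
            for m in map(DISPATCH_RE.match, lines) if m]

# Illustrative trace lines: a 4k dispatch followed by its unmerged 124k
# remainder, then a properly merged 128k request.
sample = [
    "  8,48   1  101     0.000120000   697  D   R 223490 + 8 [dd]",
    "  8,48   1  102     0.000450000   697  D   R 223498 + 248 [dd]",
    "  8,48   1  103     0.010900000   697  D   R 223746 + 256 [dd]",
]
print(dispatch_sizes_kb(sample))  # [4, 124, 128]
```

Running this over the 'D' lines of the slow and fast traces should make any unmerged 4k dispatches stand out.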
* Re: performance "regression" in cfq compared to anticipatory, deadline and noop
2008-05-16 6:40 ` Jens Axboe
@ 2008-05-16 7:46 ` Fabio Checconi
2008-05-16 7:49 ` Jens Axboe
0 siblings, 1 reply; 35+ messages in thread
From: Fabio Checconi @ 2008-05-16 7:46 UTC (permalink / raw)
To: Jens Axboe; +Cc: Matthew, Daniel J Blueman, Kasper Sandberg, Linux Kernel
> From: Jens Axboe <jens.axboe@oracle.com>
> Date: Fri, May 16, 2008 08:40:03AM +0200
>
...
> I think we can improve this further without getting too involved. If a
> 2nd request is seen in cfq_rq_enqueued(), then DO schedule a dispatch
> since this likely means that we wont be doing more merges on the first
> one.
>
But isn't there a risk that even the second request would be
dispatched while it could still have grown?
Moreover, I am still unsure about how to handle (and whether it's
worth handling) the case in which we restart queueing after an empty
dispatch round due to idling, as it would still have the same
problem.
(Also anticipatory doesn't handle this case too well.)
* Re: performance "regression" in cfq compared to anticipatory, deadline and noop
2008-05-16 7:46 ` Fabio Checconi
@ 2008-05-16 7:49 ` Jens Axboe
2008-05-16 7:57 ` Jens Axboe
0 siblings, 1 reply; 35+ messages in thread
From: Jens Axboe @ 2008-05-16 7:49 UTC (permalink / raw)
To: Fabio Checconi; +Cc: Matthew, Daniel J Blueman, Kasper Sandberg, Linux Kernel
On Fri, May 16 2008, Fabio Checconi wrote:
> > From: Jens Axboe <jens.axboe@oracle.com>
> > Date: Fri, May 16, 2008 08:40:03AM +0200
> >
> ...
> > I think we can improve this further without getting too involved. If a
> > 2nd request is seen in cfq_rq_enqueued(), then DO schedule a dispatch
> > since this likely means that we wont be doing more merges on the first
> > one.
> >
>
> But isn't there the risk that even the second request would be
> dispatched, while it still could have grown?
Certainly, you'd only want to dispatch the first request. Ideally we'd
just get rid of this 'did empty dispatch round' logic and only
dispatch requests once merging is done; making them visible to the io
scheduler so soon is basically the wrong thing to do. Of course, even
more ideally, we'd always get big requests submitted, but
unfortunately many producers aren't that nice.
The per-process plugging actually solves this nicely, since we do the
merging outside of the io scheduler. Perhaps just not dispatching on a
plugged queue would help a bit. I'm somewhat against this principle of
messing too much with dispatch logic in the schedulers; it'd be nicer
to solve this higher up.
--
Jens Axboe
* Re: performance "regression" in cfq compared to anticipatory, deadline and noop
2008-05-16 7:49 ` Jens Axboe
@ 2008-05-16 7:57 ` Jens Axboe
2008-05-16 8:53 ` Daniel J Blueman
0 siblings, 1 reply; 35+ messages in thread
From: Jens Axboe @ 2008-05-16 7:57 UTC (permalink / raw)
To: Fabio Checconi; +Cc: Matthew, Daniel J Blueman, Kasper Sandberg, Linux Kernel
On Fri, May 16 2008, Jens Axboe wrote:
> On Fri, May 16 2008, Fabio Checconi wrote:
> > > From: Jens Axboe <jens.axboe@oracle.com>
> > > Date: Fri, May 16, 2008 08:40:03AM +0200
> > >
> > ...
> > > I think we can improve this further without getting too involved. If a
> > > 2nd request is seen in cfq_rq_enqueued(), then DO schedule a dispatch
> > > since this likely means that we wont be doing more merges on the first
> > > one.
> > >
> >
> > But isn't there the risk that even the second request would be
> > dispatched, while it still could have grown?
>
> Certainly, you'd only want to dispatch the first request. Ideally we'd
> just get rid of this logic of 'did empty dispatch round' and only
> dispatch requests once merging is done, it's basically the wrong thing
> to do to make it visible to the io scheduler so soon. Well of course
> even more ideally we'd always get big requests submitted, but
> unfortunately many producers aren't that nice.
>
> The per-process plugging actually solves this nicely, since we do the
> merging outside of the io scheduler. Perhaps just not dispatch on a
> plugged queue would help a bit. I'm somewhat against this principle of
> messing too much with dispatch logic in the schedulers, it'd be nicer to
> solve this higher up.
Something like this...
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 5dfb7b9..5ab1a17 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1775,6 +1775,9 @@ cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
cic->last_request_pos = rq->sector + rq->nr_sectors;
+ if (blk_queue_plugged(cfqd->queue))
+ return;
+
if (cfqq == cfqd->active_queue) {
/*
* if we are waiting for a request for this queue, let it rip
@@ -1784,7 +1787,7 @@ cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
if (cfq_cfqq_wait_request(cfqq)) {
cfq_mark_cfqq_must_dispatch(cfqq);
del_timer(&cfqd->idle_slice_timer);
- blk_start_queueing(cfqd->queue);
+ cfq_schedule_dispatch(cfqd);
}
} else if (cfq_should_preempt(cfqd, cfqq, rq)) {
/*
@@ -1794,7 +1797,7 @@ cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
*/
cfq_preempt_queue(cfqd, cfqq);
cfq_mark_cfqq_must_dispatch(cfqq);
- blk_start_queueing(cfqd->queue);
+ cfq_schedule_dispatch(cfqd);
}
}
@@ -1997,11 +2000,10 @@ static void cfq_kick_queue(struct work_struct *work)
struct cfq_data *cfqd =
container_of(work, struct cfq_data, unplug_work);
struct request_queue *q = cfqd->queue;
- unsigned long flags;
- spin_lock_irqsave(q->queue_lock, flags);
+ spin_lock_irq(q->queue_lock);
blk_start_queueing(q);
- spin_unlock_irqrestore(q->queue_lock, flags);
+ spin_unlock_irq(q->queue_lock);
}
/*
--
Jens Axboe
* Re: performance "regression" in cfq compared to anticipatory, deadline and noop
2008-05-16 7:57 ` Jens Axboe
@ 2008-05-16 8:53 ` Daniel J Blueman
2008-05-16 8:57 ` Jens Axboe
0 siblings, 1 reply; 35+ messages in thread
From: Daniel J Blueman @ 2008-05-16 8:53 UTC (permalink / raw)
To: Jens Axboe; +Cc: Fabio Checconi, Matthew, Kasper Sandberg, Linux Kernel
On Fri, May 16, 2008 at 8:57 AM, Jens Axboe <jens.axboe@oracle.com> wrote:
> On Fri, May 16 2008, Jens Axboe wrote:
>> [quoted discussion and patch snipped]
Platter speed at 64KB stride, but 16% less performance (101MB/s) at
4KB stride - perhaps merging isn't quite right?
Both traces are at http://quora.org/blktrace-profiles-3.tar.bz2 ; let
me know if you'd still like me to test Fabio's patch.
Thanks,
Daniel
--
Daniel J Blueman
* Re: performance "regression" in cfq compared to anticipatory, deadline and noop
2008-05-16 8:53 ` Daniel J Blueman
@ 2008-05-16 8:57 ` Jens Axboe
2008-05-16 15:23 ` Matthew
0 siblings, 1 reply; 35+ messages in thread
From: Jens Axboe @ 2008-05-16 8:57 UTC (permalink / raw)
To: Daniel J Blueman; +Cc: Fabio Checconi, Matthew, Kasper Sandberg, Linux Kernel
On Fri, May 16 2008, Daniel J Blueman wrote:
> On Fri, May 16, 2008 at 8:57 AM, Jens Axboe <jens.axboe@oracle.com> wrote:
> > On Fri, May 16 2008, Jens Axboe wrote:
> >> [quoted discussion and patch snipped]
>
> Platter speed at 64KB stride, but 16% (101MB/s) less performance at
> 4KB stride - perhaps merging isn't quite right?
>
> Both traces at http://quora.org/blktrace-profiles-3.tar.bz2 ; let me
> know if you'd like me to test Fabio's patch still.
If you have time, please do test that one as well, thanks :-)
--
Jens Axboe
* Re: performance "regression" in cfq compared to anticipatory, deadline and noop
2008-05-16 8:57 ` Jens Axboe
@ 2008-05-16 15:23 ` Matthew
2008-05-16 18:39 ` Fabio Checconi
0 siblings, 1 reply; 35+ messages in thread
From: Matthew @ 2008-05-16 15:23 UTC (permalink / raw)
To: Jens Axboe; +Cc: Daniel J Blueman, Kasper Sandberg, Linux Kernel
On Fri, May 16, 2008 at 10:57 AM, Jens Axboe <jens.axboe@oracle.com> wrote:
> On Fri, May 16 2008, Daniel J Blueman wrote:
>> On Fri, May 16, 2008 at 8:57 AM, Jens Axboe <jens.axboe@oracle.com> wrote:
>> > On Fri, May 16 2008, Jens Axboe wrote:
>> >> [quoted discussion and patch snipped]
>>
>> Platter speed at 64KB stride, but 16% (101MB/s) less performance at
>> 4KB stride - perhaps merging isn't quite right?
>>
>> Both traces at http://quora.org/blktrace-profiles-3.tar.bz2 ; let me
>> know if you'd like me to test Fabio's patch still.
>
> If you have time, please do test that one as well, thanks :-)
>
> --
> Jens Axboe
>
>
thanks for the 2 patches, please keep them coming :)
a short report (due to the time shortage):
I tested both patches this morning and got for both (still) around
52-58 MB/s (/dev/sdd & /dev/sde)
thanks & have a nice weekend :)
Mat
* Re: performance "regression" in cfq compared to anticipatory, deadline and noop
2008-05-16 15:23 ` Matthew
@ 2008-05-16 18:39 ` Fabio Checconi
0 siblings, 0 replies; 35+ messages in thread
From: Fabio Checconi @ 2008-05-16 18:39 UTC (permalink / raw)
To: Matthew; +Cc: Jens Axboe, Daniel J Blueman, Kasper Sandberg, Linux Kernel
> From: Matthew <jackdachef@gmail.com>
> Date: Fri, May 16, 2008 05:23:12PM +0200
>
...
> thanks for the 2 patches, please keep them coming :)
>
> a short report (due to the time shortage):
>
> I tested both patches this morning and got for both (still) around
> 52-58 MB/s (/dev/sdd & /dev/sde)
>
> thanks & have a nice weekend :)
Maybe I've missed it, but I cannot find the blktrace output for your
original test or for the test with the first patch posted by Jens
in this thread (the one completely removing the blk_start_queueing()
call); could you point me to them?
From what I understood, that patch didn't solve your issue, so the
following ones, which adopt a similar approach, are unlikely to do
any better.
Thank you in advance.
* Re: performance "regression" in cfq compared to anticipatory, deadline and noop
[not found] ` <e85b9d30805161549o7c8f065do24b6567e2ade0afa@mail.gmail.com>
@ 2008-05-19 10:39 ` Matthew
0 siblings, 0 replies; 35+ messages in thread
From: Matthew @ 2008-05-19 10:39 UTC (permalink / raw)
To: Jens Axboe
Cc: fchecconi, Daniel J Blueman, Kasper Sandberg, Linux Kernel,
Aaron Carroll
Hi,
I just wrote a bug-report as a reminder for this problem:
http://bugzilla.kernel.org/show_bug.cgi?id=10746
I hope it's classified in the right subsystem & has enough information
(2 links to this discussion)
Jens, I just looked a little through lkml while searching for the
above-mentioned 2 links and found the following post:
http://lkml.org/lkml/2006/12/7/319. Is it related to this issue at
all?
Thanks
Mat
* Re: performance "regression" in cfq compared to anticipatory, deadline and noop
2008-05-15 12:21 ` Fabio Checconi
2008-05-16 6:40 ` Jens Axboe
@ 2008-08-24 20:24 ` Daniel J Blueman
2008-08-25 20:29 ` Fabio Checconi
1 sibling, 1 reply; 35+ messages in thread
From: Daniel J Blueman @ 2008-08-24 20:24 UTC (permalink / raw)
To: Fabio Checconi, Jens Axboe; +Cc: Matthew, Kasper Sandberg, Linux Kernel
Hi Fabio, Jens,
On Thu, May 15, 2008 at 1:21 PM, Fabio Checconi <fchecconi@gmail.com> wrote:
>> From: Jens Axboe <jens.axboe@oracle.com>
>> Date: Thu, May 15, 2008 09:01:28AM +0200
>>
>> I don't think it's 2.6.25 vs 2.6.26-rc2, I can still reproduce some
>> request size offsets with the patch. So still fumbling around with this,
>> I'll be sending out another test patch when I'm confident it's solved
>> the size issue.
>
> IMO an interesting thing is how/why anticipatory doesn't show the
> issue. The device is not put into ANTIC_WAIT_NEXT if there is no
> dispatch returning no requests while the queue is not empty. This
> seems to be enough in the reported workloads.
>
> I don't think this behavior is the correct one (it is still racy
> WRT merges after breaking anticipation) anyway it should make things
> a little bit better. I fear that a complete solution would not
> involve only the scheduler.
>
> Introducing the very same behavior in cfq seems to be not so easy
> (i.e., start idling only if there was a dispatch round while the
> last request was being served) but an approximated version can be
> introduced quite easily. The patch below should do that, rescheduling
> the dispatch only if necessary; it is not tested at all, just posted
> for discussion.
>
> [patch snipped]
This was the last test I hadn't got around to. Alas, it did help, but
didn't give the merging required for full performance:
# echo 1 >/proc/sys/vm/drop_caches; dd if=/dev/sda of=/dev/null
bs=128k count=2000
262144000 bytes (262 MB) copied, 2.47787 s, 106 MB/s
# echo 1 >/proc/sys/vm/drop_caches; hdparm -t /dev/sda
Timing buffered disk reads: 308 MB in 3.01 seconds = 102.46 MB/sec
It is an improvement over the baseline performance of 2.6.27-rc4:
# echo 1 >/proc/sys/vm/drop_caches; dd if=/dev/sda of=/dev/null
bs=128k count=2000
262144000 bytes (262 MB) copied, 2.56514 s, 102 MB/s
# echo 1 >/proc/sys/vm/drop_caches; hdparm -t /dev/sda
Timing buffered disk reads: 294 MB in 3.02 seconds = 97.33 MB/sec
Note that platter speed is around 125MB/s (which I get close to at
smaller read sizes).
I feel 128KB read requests are perhaps important, as this is a
commonly-used RAID stripe size, and it may explain the read-performance
drop we sometimes see in hardware vs software RAID benchmarks.
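To make the stripe-size point concrete, here is a hypothetical sketch (not code from the thread; the 4-member layout and chunk size are assumptions) of plain RAID-0 layout arithmetic: an aligned 128KiB request maps onto exactly one member disk, so issuing those bytes as an unmerged 4k+124k pair means two commands to that member where one would do.

```python
CHUNK = 128 * 1024  # assumed chunk size, matching the stripe size above

def raid0_member(offset, nmembers=4, chunk=CHUNK):
    """Map a byte offset to (member_disk, offset_on_member) for a plain
    RAID-0 layout: chunk-sized units rotate round-robin across members."""
    chunk_no, within = divmod(offset, chunk)
    stripe_no, member = divmod(chunk_no, nmembers)
    return member, stripe_no * chunk + within

# An aligned 128KiB request stays entirely on one member...
start, _ = raid0_member(0)
end, _ = raid0_member(CHUNK - 1)
print(start == end)  # True: one command to one disk if merged to 128KiB

# ...while the next chunk lives on the next member, so a request
# crossing the chunk boundary touches two disks.
print(raid0_member(CHUNK)[0])  # 1
```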
How can we generate some ideas or movement on fixing/improving this behaviour?
Thanks!
Daniel
--
Daniel J Blueman
* Re: performance "regression" in cfq compared to anticipatory, deadline and noop
2008-08-25 20:29 ` Fabio Checconi
@ 2008-08-25 15:39 ` Daniel J Blueman
2008-08-25 17:06 ` Fabio Checconi
0 siblings, 1 reply; 35+ messages in thread
From: Daniel J Blueman @ 2008-08-25 15:39 UTC (permalink / raw)
To: Fabio Checconi; +Cc: Jens Axboe, Matthew, Kasper Sandberg, Linux Kernel
On Mon, Aug 25, 2008 at 9:29 PM, Fabio Checconi <fchecconi@gmail.com> wrote:
> Hi,
>
>> From: Daniel J Blueman <daniel.blueman@gmail.com>
>> Date: Sun, Aug 24, 2008 09:24:37PM +0100
>>
>> Hi Fabio, Jens,
>>
> ...
>> [test results snipped]
>> How can we generate some ideas or movement on fixing/improving this behaviour?
>>
>
> Thank you for testing. The blktrace output for this run should be
> interesting, esp. to compare it with a blktrace obtained from anticipatory
> with the same workload - IIRC anticipatory didn't suffer from the problem,
> and anticipatory has a slightly different dispatching mechanism that
> this patch tried to bring into cfq.
>
> Even if a proper fix may not belong to the elevator itself, I think
> that this couple (this last test + anticipatory) of traces should help
> in better understanding what is still going wrong.
>
> Thank you in advance.
See http://quora.org/blktrace-n.tar.bz2
Where n is:
0 - 2.6.27-rc4 unpatched
1 - 2.6.27-rc4 with your CFQ patch, CFQ scheduler
2 - 2.6.27-rc4 with your CFQ patch, anticipatory scheduler
3 - 2.6.27-rc4 with your CFQ patch, deadline scheduler
I have found it's not always possible to reproduce this issue; e.g.
right now, with stock CFQ, I'm seeing a consistent 117-123 MB/s with
hdparm and dd (as above), whereas I was previously seeing a consistent
95-103 MB/s, so the blktraces may not show the slower-performance
pattern - even with precisely the same (controlled) environment.
Thanks,
Daniel
--
Daniel J Blueman
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: performance "regression" in cfq compared to anticipatory, deadline and noop
2008-08-25 15:39 ` Daniel J Blueman
@ 2008-08-25 17:06 ` Fabio Checconi
2008-12-09 15:14 ` Daniel J Blueman
0 siblings, 1 reply; 35+ messages in thread
From: Fabio Checconi @ 2008-08-25 17:06 UTC (permalink / raw)
To: Daniel J Blueman; +Cc: Jens Axboe, Matthew, Kasper Sandberg, Linux Kernel
> From: Daniel J Blueman <daniel.blueman@gmail.com>
> Date: Mon, Aug 25, 2008 04:39:01PM +0100
>
> On Mon, Aug 25, 2008 at 9:29 PM, Fabio Checconi <fchecconi@gmail.com> wrote:
> > Hi,
> >
> >> From: Daniel J Blueman <daniel.blueman@gmail.com>
> >> Date: Sun, Aug 24, 2008 09:24:37PM +0100
> >>
> >> Hi Fabio, Jens,
> >>
> > ...
> >> This was the last test I didn't get around to. Alas, it did help, but
> >> didn't give the merging required for full performance:
> >>
> >> # echo 1 >/proc/sys/vm/drop_caches; dd if=/dev/sda of=/dev/null
> >> bs=128k count=2000
> >> 262144000 bytes (262 MB) copied, 2.47787 s, 106 MB/s
> >>
> >> # echo 1 >/proc/sys/vm/drop_caches; hdparm -t /dev/sda
> >> Timing buffered disk reads: 308 MB in 3.01 seconds = 102.46 MB/sec
> >>
> >> It is an improvement over the baseline performance of 2.6.27-rc4:
> >>
> >> # echo 1 >/proc/sys/vm/drop_caches; dd if=/dev/sda of=/dev/null
> >> bs=128k count=2000
> >> 262144000 bytes (262 MB) copied, 2.56514 s, 102 MB/s
> >>
> >> # echo 1 >/proc/sys/vm/drop_caches; hdparm -t /dev/sda
> >> Timing buffered disk reads: 294 MB in 3.02 seconds = 97.33 MB/sec
> >>
> >> Note that platter speed is around 125MB/s (which I get near at smaller
> >> read sizes).
> >>
> >> I feel 128KB read requests are perhaps important, as this is a
> >> commonly-used RAID stripe size, and may explain the read-performance
> >> drop we sometimes see in hardware vs software RAID benchmarks.
> >>
> >> How can we generate some ideas or movement on fixing/improving this behaviour?
> >>
> >
> > Thank you for testing. The blktrace output for this run should be
> > interesting, especially when compared with a blktrace obtained from
> > anticipatory under the same workload - IIRC anticipatory didn't suffer
> > from the problem, and it has a slightly different dispatching
> > mechanism, which this patch tried to bring into cfq.
> >
> > Even if a proper fix may not belong in the elevator itself, I think
> > this pair of traces (this last test + anticipatory) should help in
> > better understanding what is still going wrong.
> >
> > Thank you in advance.
>
> See http://quora.org/blktrace-n.tar.bz2
>
> Where n is:
> 0 - 2.6.27-rc4 unpatched
> 1 - 2.6.27-rc4 with your CFQ patch, CFQ scheduler
> 2 - 2.6.27-rc4 with your CFQ patch, anticipatory scheduler
> 3 - 2.6.27-rc4 with your CFQ patch, deadline scheduler
>
> I have found it's not always possible to reproduce this issue; e.g.
> right now, with stock CFQ, I'm seeing a consistent 117-123 MB/s with
> hdparm and dd (as above), whereas I was previously seeing a consistent
> 95-103 MB/s, so the blktraces may not show the slower-performance
> pattern - even with precisely the same (controlled) environment.
>
If I read them correctly, all the traces show dispatches with
requests still growing; the elevator cannot know if a request
will grow or not once it has been queued, and the heuristics
we tried so far to postpone dispatches gave no results.
I don't see any elevator-only solution to the problem...
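One way to quantify "dispatches with requests still growing" from traces like the posted ones is to compare the mean request size at dispatch (D events) with the mean size at completion (C events) in blkparse output: if dispatched requests are consistently smaller than completed ones, back-merges were still arriving after dispatch. A rough sketch, not from the thread - the field positions assume blkparse's default output format:

```shell
# avg_io: mean request size (in sectors) for one blktrace action type.
# Reads blkparse standard output on stdin; $1 is the action (D, C, M, ...).
# In blkparse's default format, field 6 is the action character and
# fields 8-10 are "<sector> + <nr_sectors>".
avg_io() {
    awk -v act="$1" '
        $6 == act && $9 == "+" { n++; s += $10 }
        END { if (n) printf "%s: %d reqs, avg %.1f sectors\n", act, n, s / n }'
}

# Typical use on the posted tarballs:
#   blkparse sda.blktrace.* | avg_io D   # sizes as dispatched
#   blkparse sda.blktrace.* | avg_io C   # sizes as completed
```

A large gap between the D and C averages for cfq, but not for anticipatory, would support the theory above without needing the hardware in hand.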
* Re: performance "regression" in cfq compared to anticipatory, deadline and noop
2008-08-24 20:24 ` Daniel J Blueman
@ 2008-08-25 20:29 ` Fabio Checconi
2008-08-25 15:39 ` Daniel J Blueman
0 siblings, 1 reply; 35+ messages in thread
From: Fabio Checconi @ 2008-08-25 20:29 UTC (permalink / raw)
To: Daniel J Blueman; +Cc: Jens Axboe, Matthew, Kasper Sandberg, Linux Kernel
Hi,
> From: Daniel J Blueman <daniel.blueman@gmail.com>
> Date: Sun, Aug 24, 2008 09:24:37PM +0100
>
> Hi Fabio, Jens,
>
...
> This was the last test I didn't get around to. Alas, it did help, but
> didn't give the merging required for full performance:
>
> # echo 1 >/proc/sys/vm/drop_caches; dd if=/dev/sda of=/dev/null
> bs=128k count=2000
> 262144000 bytes (262 MB) copied, 2.47787 s, 106 MB/s
>
> # echo 1 >/proc/sys/vm/drop_caches; hdparm -t /dev/sda
> Timing buffered disk reads: 308 MB in 3.01 seconds = 102.46 MB/sec
>
> It is an improvement over the baseline performance of 2.6.27-rc4:
>
> # echo 1 >/proc/sys/vm/drop_caches; dd if=/dev/sda of=/dev/null
> bs=128k count=2000
> 262144000 bytes (262 MB) copied, 2.56514 s, 102 MB/s
>
> # echo 1 >/proc/sys/vm/drop_caches; hdparm -t /dev/sda
> Timing buffered disk reads: 294 MB in 3.02 seconds = 97.33 MB/sec
>
> Note that platter speed is around 125MB/s (which I get near at smaller
> read sizes).
>
> I feel 128KB read requests are perhaps important, as this is a
> commonly-used RAID stripe size, and may explain the read-performance
> drop we sometimes see in hardware vs software RAID benchmarks.
>
> How can we generate some ideas or movement on fixing/improving this behaviour?
>
Thank you for testing. The blktrace output for this run should be
interesting, especially when compared with a blktrace obtained from
anticipatory under the same workload - IIRC anticipatory didn't suffer
from the problem, and it has a slightly different dispatching
mechanism, which this patch tried to bring into cfq.
Even if a proper fix may not belong in the elevator itself, I think
this pair of traces (this last test + anticipatory) should help in
better understanding what is still going wrong.
Thank you in advance.
* Re: performance "regression" in cfq compared to anticipatory, deadline and noop
2008-08-25 17:06 ` Fabio Checconi
@ 2008-12-09 15:14 ` Daniel J Blueman
0 siblings, 0 replies; 35+ messages in thread
From: Daniel J Blueman @ 2008-12-09 15:14 UTC (permalink / raw)
To: Fabio Checconi, Jens Axboe; +Cc: Matthew, Kasper Sandberg, Linux Kernel
Hi Jens, Fabio,
On Mon, Aug 25, 2008 at 5:06 PM, Fabio Checconi <fchecconi@gmail.com> wrote:
>> From: Daniel J Blueman <daniel.blueman@gmail.com>
>> Date: Mon, Aug 25, 2008 04:39:01PM +0100
>>
>> On Mon, Aug 25, 2008 at 9:29 PM, Fabio Checconi <fchecconi@gmail.com> wrote:
>> > Hi,
>> >
>> >> From: Daniel J Blueman <daniel.blueman@gmail.com>
>> >> Date: Sun, Aug 24, 2008 09:24:37PM +0100
>> >>
>> >> Hi Fabio, Jens,
>> >>
>> > ...
>> >> This was the last test I didn't get around to. Alas, it did help, but
>> >> didn't give the merging required for full performance:
>> >>
>> >> # echo 1 >/proc/sys/vm/drop_caches; dd if=/dev/sda of=/dev/null
>> >> bs=128k count=2000
>> >> 262144000 bytes (262 MB) copied, 2.47787 s, 106 MB/s
>> >>
>> >> # echo 1 >/proc/sys/vm/drop_caches; hdparm -t /dev/sda
>> >> Timing buffered disk reads: 308 MB in 3.01 seconds = 102.46 MB/sec
>> >>
>> >> It is an improvement over the baseline performance of 2.6.27-rc4:
>> >>
>> >> # echo 1 >/proc/sys/vm/drop_caches; dd if=/dev/sda of=/dev/null
>> >> bs=128k count=2000
>> >> 262144000 bytes (262 MB) copied, 2.56514 s, 102 MB/s
>> >>
>> >> # echo 1 >/proc/sys/vm/drop_caches; hdparm -t /dev/sda
>> >> Timing buffered disk reads: 294 MB in 3.02 seconds = 97.33 MB/sec
>> >>
>> >> Note that platter speed is around 125MB/s (which I get near at smaller
>> >> read sizes).
>> >>
>> >> I feel 128KB read requests are perhaps important, as this is a
>> >> commonly-used RAID stripe size, and may explain the read-performance
>> >> drop we sometimes see in hardware vs software RAID benchmarks.
>> >>
>> >> How can we generate some ideas or movement on fixing/improving this behaviour?
>> >>
>> >
>> > Thank you for testing. The blktrace output for this run should be
>> > interesting, especially when compared with a blktrace obtained from
>> > anticipatory under the same workload - IIRC anticipatory didn't suffer
>> > from the problem, and it has a slightly different dispatching
>> > mechanism, which this patch tried to bring into cfq.
>> >
>> > Even if a proper fix may not belong in the elevator itself, I think
>> > this pair of traces (this last test + anticipatory) should help in
>> > better understanding what is still going wrong.
>> >
>> > Thank you in advance.
>>
>> See http://quora.org/blktrace-n.tar.bz2
>>
>> Where n is:
>> 0 - 2.6.27-rc4 unpatched
>> 1 - 2.6.27-rc4 with your CFQ patch, CFQ scheduler
>> 2 - 2.6.27-rc4 with your CFQ patch, anticipatory scheduler
>> 3 - 2.6.27-rc4 with your CFQ patch, deadline scheduler
>>
>> I have found it's not always possible to reproduce this issue; e.g.
>> right now, with stock CFQ, I'm seeing a consistent 117-123 MB/s with
>> hdparm and dd (as above), whereas I was previously seeing a consistent
>> 95-103 MB/s, so the blktraces may not show the slower-performance
>> pattern - even with precisely the same (controlled) environment.
>>
>
> If I read them correctly, all the traces show dispatches with
> requests still growing; the elevator cannot know if a request
> will grow or not once it has been queued, and the heuristics
> we tried so far to postpone dispatches gave no results.
>
> I don't see any elevator-only solution to the problem...
I ran into this performance issue again.
Everything is the same as before: 2.6.24, CFQ scheduler, Seagate 7200.11
320GB SATA (SD11 firmware) on a quiescent and well-powered system:
# sync; echo 3 >/proc/sys/vm/drop_caches; dd if=/dev/sda of=/dev/null
bs=128k count=1000
1000+0 records in
1000+0 records out
131072000 bytes (131 MB) copied, 2.24231 s, 58.5 MB/s
I found that tuning the AHCI SATA NCQ queue depth down to 2 provides
exactly the performance we expect:
# echo 2 >/sys/block/sda/device/queue_depth
# sync; echo 3 >/proc/sys/vm/drop_caches; dd if=/dev/sda of=/dev/null
bs=128k count=1000
1000+0 records in
1000+0 records out
131072000 bytes (131 MB) copied, 0.98503 s, 133 MB/s
depth 1: 132 MB/s
depth 2: 133 MB/s
depth 3: 69.1 MB/s
depth 4: 59.7 MB/s
depth 8: 54.9 MB/s
depth 16: 57.1 MB/s
depth 31: 58.0 MB/s
Very interesting interaction, and the figures are very stable. Could
this be a product of the maximum time the drive waits to coalesce
requests before acting on them? If so, how can we diagnose this, apart
from you guys getting one of these disks?
Thanks,
Daniel
--
Daniel J Blueman
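The depth sweep reported above can be scripted for repeatable runs. This is a rough sketch, not from the thread: SYSFS and SRC are placeholders to be pointed at the real sysfs knob and block device, and real measurements need root plus a dropped page cache.

```shell
# sweep_depths: for each NCQ queue depth, write it to $SYSFS, then
# measure a sequential read of $SRC with dd and report the throughput
# figure dd prints on stderr. SYSFS and SRC are assumptions; set them
# to the real paths (as root) for an actual measurement.
sweep_depths() {
    for depth in 1 2 3 4 8 16 31; do
        echo "$depth" > "$SYSFS"
        sync
        # drop the page cache so dd reads from the drive (root only)
        [ -w /proc/sys/vm/drop_caches ] && echo 3 > /proc/sys/vm/drop_caches
        rate=$(dd if="$SRC" of=/dev/null bs=128k count=1000 2>&1 |
               grep -o '[0-9.][0-9.]* [MG]B/s' | tail -n 1)
        echo "depth $depth: $rate"
    done
}

# Real use (as root):
#   SYSFS=/sys/block/sda/device/queue_depth SRC=/dev/sda sweep_depths
```

Running this a few times per kernel/scheduler combination would show whether the depth-1/2 versus depth-3+ cliff in the figures above is stable.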
end of thread, other threads:[~2008-12-09 15:14 UTC | newest]
Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-05-11 13:14 performance "regression" in cfq compared to anticipatory, deadline and noop Daniel J Blueman
2008-05-11 14:02 ` Kasper Sandberg
2008-05-13 12:20 ` Jens Axboe
2008-05-13 12:58 ` Matthew
2008-05-13 13:05 ` Jens Axboe
[not found] ` <e85b9d30805130842p3a34305l4ab1e7926e4b0dba@mail.gmail.com>
2008-05-13 18:03 ` Jens Axboe
2008-05-13 18:40 ` Jens Axboe
2008-05-13 19:23 ` Matthew
2008-05-13 19:30 ` Jens Axboe
2008-05-14 8:05 ` Daniel J Blueman
2008-05-14 8:26 ` Jens Axboe
2008-05-14 20:52 ` Daniel J Blueman
2008-05-14 21:37 ` Matthew
2008-05-15 7:01 ` Jens Axboe
2008-05-15 12:21 ` Fabio Checconi
2008-05-16 6:40 ` Jens Axboe
2008-05-16 7:46 ` Fabio Checconi
2008-05-16 7:49 ` Jens Axboe
2008-05-16 7:57 ` Jens Axboe
2008-05-16 8:53 ` Daniel J Blueman
2008-05-16 8:57 ` Jens Axboe
2008-05-16 15:23 ` Matthew
2008-05-16 18:39 ` Fabio Checconi
2008-08-24 20:24 ` Daniel J Blueman
2008-08-25 20:29 ` Fabio Checconi
2008-08-25 15:39 ` Daniel J Blueman
2008-08-25 17:06 ` Fabio Checconi
2008-12-09 15:14 ` Daniel J Blueman
[not found] ` <e85b9d30805140332r3311b2d6r6831d37421ced757@mail.gmail.com>
[not found] ` <e85b9d30805140334q69cb5eacued9a719414e73d53@mail.gmail.com>
[not found] ` <20080514103956.GD16217@kernel.dk>
[not found] ` <e85b9d30805141239g5df9abc6i666b1f621d632b44@mail.gmail.com>
[not found] ` <e85b9d30805161549o7c8f065do24b6567e2ade0afa@mail.gmail.com>
2008-05-19 10:39 ` Matthew
2008-05-13 13:51 ` Kasper Sandberg
2008-05-14 0:33 ` Kasper Sandberg
-- strict thread matches above, loose matches on Subject: below --
2008-05-10 19:18 Matthew
[not found] ` <20080510200053.GA78555@gandalf.sssup.it>
2008-05-10 20:39 ` Matthew
2008-05-10 21:56 ` Fabio Checconi
2008-05-11 0:00 ` Aaron Carroll