public inbox for linux-kernel@vger.kernel.org
* IO scheduler benchmarking
@ 2003-02-21  5:23 Andrew Morton
  2003-02-21  6:51 ` David Lang
  0 siblings, 1 reply; 17+ messages in thread
From: Andrew Morton @ 2003-02-21  5:23 UTC (permalink / raw)
  To: linux-kernel


Following this email are the results of a number of tests of various I/O
schedulers:

- Anticipatory Scheduler (AS) (from 2.5.61-mm1 approx)

- CFQ (as in 2.5.61-mm1)

- 2.5.61+hacks (Basically 2.5.61 plus everything before the anticipatory
  scheduler - tweaks which fix the writes-starve-reads problem via a
  scheduling storm)

- 2.4.21-pre4

All these tests are simple things from the command line.

I stayed away from the standard benchmarks because they do not really touch
on areas where the Linux I/O scheduler has traditionally been bad.  (If they
did, perhaps it wouldn't have been so bad..)

Plus, all the I/O schedulers perform similarly with the usual benchmarks,
with the exception of some tiobench phases, where AS does very well.

Executive summary: the anticipatory scheduler is wiping the others off the
map, and 2.4 is a disaster.

I really have not sought to make the AS look good - I mainly concentrated on
things which we have traditionally been bad at.  If anyone wants to suggest
other tests, please let me know.

The known regressions from the anticipatory scheduler are:

1) 15% (ish) slowdown in David Mansfield's database run.  This appeared to
   go away in later versions of the scheduler.

2) 5% dropoff in single-threaded qsbench swapstorms

3) 30% dropoff in write bandwidth when there is a streaming read (this is
   actually good).

The test machine is a fast P4-HT with 256MB of memory.  Testing was against a
single fast IDE disk, using ext2.




^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: IO scheduler benchmarking
  2003-02-21  5:23 Andrew Morton
@ 2003-02-21  6:51 ` David Lang
  2003-02-21  8:16   ` Andrew Morton
  0 siblings, 1 reply; 17+ messages in thread
From: David Lang @ 2003-02-21  6:51 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

One other useful test would be the time to copy a large (multi-gig) file.
Currently this takes forever and uses very little of the disk bandwidth; I
suspect that the AS would give more preference to reads and therefore would
go faster.

For a real-world example, mozilla downloads files to a temp directory and
then copies them to the permanent location. When I download a video from my
TiVo it takes ~20 min to download a 1G video, during which time the system
is perfectly responsive; then, after the download completes, when mozilla
copies it to the real destination (on a separate disk, so it is a copy, not
just a move) the system becomes completely unresponsive to anything
requiring disk IO for several minutes.

David Lang

On Thu, 20 Feb 2003, Andrew Morton wrote:

> Date: Thu, 20 Feb 2003 21:23:04 -0800
> From: Andrew Morton <akpm@digeo.com>
> To: linux-kernel@vger.kernel.org
> Subject: IO scheduler benchmarking
>
>
> Following this email are the results of a number of tests of various I/O
> schedulers:
>
> - Anticipatory Scheduler (AS) (from 2.5.61-mm1 approx)
>
> - CFQ (as in 2.5.61-mm1)
>
> - 2.5.61+hacks (Basically 2.5.61 plus everything before the anticipatory
>   scheduler - tweaks which fix the writes-starve-reads problem via a
>   scheduling storm)
>
> - 2.4.21-pre4
>
> All these tests are simple things from the command line.
>
> I stayed away from the standard benchmarks because they do not really touch
> on areas where the Linux I/O scheduler has traditionally been bad.  (If they
> did, perhaps it wouldn't have been so bad..)
>
> Plus, all the I/O schedulers perform similarly with the usual benchmarks,
> with the exception of some tiobench phases, where AS does very well.
>
> Executive summary: the anticipatory scheduler is wiping the others off the
> map, and 2.4 is a disaster.
>
> I really have not sought to make the AS look good - I mainly concentrated on
> things which we have traditionally been bad at.  If anyone wants to suggest
> other tests, please let me know.
>
> The known regressions from the anticipatory scheduler are:
>
> 1) 15% (ish) slowdown in David Mansfield's database run.  This appeared to
>    go away in later versions of the scheduler.
>
> 2) 5% dropoff in single-threaded qsbench swapstorms
>
> 3) 30% dropoff in write bandwidth when there is a streaming read (this is
>    actually good).
>
> The test machine is a fast P4-HT with 256MB of memory.  Testing was against a
> single fast IDE disk, using ext2.
>
>
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: IO scheduler benchmarking
  2003-02-21  6:51 ` David Lang
@ 2003-02-21  8:16   ` Andrew Morton
  2003-02-21 10:31     ` Andrea Arcangeli
  0 siblings, 1 reply; 17+ messages in thread
From: Andrew Morton @ 2003-02-21  8:16 UTC (permalink / raw)
  To: David Lang; +Cc: linux-kernel

David Lang <david.lang@digitalinsight.com> wrote:
>
> One other useful test would be the time to copy a large (multi-gig) file.
> Currently this takes forever and uses very little of the disk bandwidth; I
> suspect that the AS would give more preference to reads and therefore would
> go faster.

Yes, that's a test.

	time (cp 1-gig-file foo ; sync)

2.5.62-mm2,AS:		1:22.36
2.5.62-mm2,CFQ:		1:25.54
2.5.62-mm2,deadline:	1:11.03
2.4.21-pre4:		1:07.69

Well gee.
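
A parameterized sketch of the same test (the file name and the small
default size are illustrative, not from the original run):

```shell
#!/bin/bash
# Copy-and-sync timing test: a sketch of the one above.  FILE and SIZE_MB
# are illustrative defaults; use SIZE_MB=1024 to match the 1-gig run.
FILE=${1:-bigfile}
SIZE_MB=${2:-16}

# Create the source file if it is not already there, and flush it so the
# timed region measures only the copy.
[ -f "$FILE" ] || dd if=/dev/zero of="$FILE" bs=1M count="$SIZE_MB" 2>/dev/null
sync

# Time the copy plus the sync that forces dirty pages to disk; without
# the sync most of the write cost would hide in the page cache.
time ( cp "$FILE" "$FILE.copy" ; sync )
```

Running it once per scheduler and comparing the elapsed times reproduces
the comparison above.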


> For a real-world example, mozilla downloads files to a temp directory and
> then copies them to the permanent location. When I download a video from my
> TiVo it takes ~20 min to download a 1G video, during which time the system
> is perfectly responsive; then, after the download completes, when mozilla
> copies it to the real destination (on a separate disk, so it is a copy, not
> just a move) the system becomes completely unresponsive to anything
> requiring disk IO for several minutes.

Well, 2.4 is unresponsive, period.  That's due to problems in the VM:
processes which are trying to allocate memory get continually DoS'ed by
`cp' in page reclaim.

For the reads-starved-by-writes problem which you describe, you'll see that
quite a few of the tests did cover that.  contest does as well.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: IO scheduler benchmarking
  2003-02-21  8:16   ` Andrew Morton
@ 2003-02-21 10:31     ` Andrea Arcangeli
  2003-02-21 10:51       ` William Lee Irwin III
  0 siblings, 1 reply; 17+ messages in thread
From: Andrea Arcangeli @ 2003-02-21 10:31 UTC (permalink / raw)
  To: Andrew Morton; +Cc: David Lang, linux-kernel

On Fri, Feb 21, 2003 at 12:16:24AM -0800, Andrew Morton wrote:
> Yes, that's a test.
> 
> 	time (cp 1-gig-file foo ; sync)
> 
> 2.5.62-mm2,AS:		1:22.36
> 2.5.62-mm2,CFQ:		1:25.54
> 2.5.62-mm2,deadline:	1:11.03
> 2.4.21-pre4:		1:07.69
> 
> Well gee.

It's pointless to benchmark CFQ in a workload like that, IMHO. If you
read and write to the same hard disk you want lots of unfairness to go
faster.  Your latency is a mixture of reads and writes, and the writes
are likely issued by the kernel, so CFQ will likely generate more seeks
(it also depends on whether you have the magic for the current->mm ==
NULL case).

You should run something along these lines to measure the difference:

	dd if=/dev/zero of=readme bs=1M count=2000
	sync
	cp /dev/zero . & time cp readme /dev/null

And the best CFQ benchmark really is to run the tiobench read test with
a single thread during the `cp /dev/zero .`. That will measure the
worst-case latency that `read` saw during the benchmark, and it should
make the most difference, because worst-case `read` latency is definitely
the only thing one can care about if you need CFQ or SFQ. You don't care
that much about throughput if you enable CFQ, so it's not even correct to
benchmark as a function of real time; only the worst-case `read` latency
matters.
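
A sketch of that kind of measurement (sizes and file names here are
illustrative; the real run would use tiobench as the reader and an
endless writer):

```shell
#!/bin/bash
# Read latency under streaming write load.  Sizes are illustrative and
# kept small; the writer stands in for the endless `cp /dev/zero .`.
dd if=/dev/zero of=readme bs=1M count=16 2>/dev/null
sync

# Background streaming writer competing for the disk.
dd if=/dev/zero of=writehog bs=1M count=64 2>/dev/null &
writer=$!

# The interesting number: how long the read takes while the writer runs.
# A fair (CFQ/SFQ-style) scheduler should keep this close to the
# unloaded read time; an unfair one lets the writer starve it.
time cp readme /dev/null

kill "$writer" 2>/dev/null
wait "$writer" 2>/dev/null
rm -f writehog
```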

> > For a real-world example, mozilla downloads files to a temp directory and
> > then copies them to the permanent location. When I download a video from my
> > TiVo it takes ~20 min to download a 1G video, during which time the system
> > is perfectly responsive; then, after the download completes, when mozilla
> > copies it to the real destination (on a separate disk, so it is a copy, not
> > just a move) the system becomes completely unresponsive to anything
> > requiring disk IO for several minutes.
> 
> Well, 2.4 is unresponsive, period.  That's due to problems in the VM:
> processes which are trying to allocate memory get continually DoS'ed by
> `cp' in page reclaim.

This depends on the workload; you may not have that many allocations,
and an echo 1 >/proc/sys/vm/bdflush will fix it should your workload be
hurt by too much dirty cache.  Furthermore, elevator-lowlatency makes
the blkdev layer much more fair under load.

Andrea

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: IO scheduler benchmarking
  2003-02-21 10:31     ` Andrea Arcangeli
@ 2003-02-21 10:51       ` William Lee Irwin III
  2003-02-21 11:08         ` Andrea Arcangeli
  0 siblings, 1 reply; 17+ messages in thread
From: William Lee Irwin III @ 2003-02-21 10:51 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Andrew Morton, David Lang, linux-kernel

On Fri, Feb 21, 2003 at 12:16:24AM -0800, Andrew Morton wrote:
>> Well, 2.4 is unresponsive, period.  That's due to problems in the VM:
>> processes which are trying to allocate memory get continually DoS'ed
>> by `cp' in page reclaim.

On Fri, Feb 21, 2003 at 11:31:40AM +0100, Andrea Arcangeli wrote:
> This depends on the workload; you may not have that many allocations,
> and an echo 1 >/proc/sys/vm/bdflush will fix it should your workload be
> hurt by too much dirty cache.  Furthermore, elevator-lowlatency makes
> the blkdev layer much more fair under load.

Restricting io in flight doesn't actually repair the issues raised by
it, but rather avoids them by limiting functionality.

The issue raised here is streaming io competing with processes working
within bounded memory. It's unclear to me how 2.5.x mitigates this but
the effects are far less drastic there. The "fix" you're suggesting is
clamping off the entire machine's io just to contain the working set of
a single process that generates unbounded amounts of dirty data and
inadvertently penalizes other processes via page reclaim, where instead
it should be forced to fairly wait its turn for memory.

-- wli

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: IO scheduler benchmarking
  2003-02-21 10:51       ` William Lee Irwin III
@ 2003-02-21 11:08         ` Andrea Arcangeli
  2003-02-21 11:17           ` Nick Piggin
  2003-02-21 11:34           ` William Lee Irwin III
  0 siblings, 2 replies; 17+ messages in thread
From: Andrea Arcangeli @ 2003-02-21 11:08 UTC (permalink / raw)
  To: William Lee Irwin III, Andrew Morton, David Lang, linux-kernel

On Fri, Feb 21, 2003 at 02:51:46AM -0800, William Lee Irwin III wrote:
> On Fri, Feb 21, 2003 at 12:16:24AM -0800, Andrew Morton wrote:
> >> Well, 2.4 is unresponsive, period.  That's due to problems in the VM:
> >> processes which are trying to allocate memory get continually DoS'ed
> >> by `cp' in page reclaim.
> 
> On Fri, Feb 21, 2003 at 11:31:40AM +0100, Andrea Arcangeli wrote:
> > This depends on the workload; you may not have that many allocations,
> > and an echo 1 >/proc/sys/vm/bdflush will fix it should your workload be
> > hurt by too much dirty cache.  Furthermore, elevator-lowlatency makes
> > the blkdev layer much more fair under load.
> 
> Restricting io in flight doesn't actually repair the issues raised by

The amount of I/O that we allow in flight is purely arbitrary; there is
no point in allowing several dozen mbytes of I/O in flight on a 64M
machine. My patch fixes that and nothing more.

> it, but rather avoids them by limiting functionality.

If you can show a (throughput) benchmark where you see this limited
functionality I'd be very interested.

Alternatively I can also claim that 2.4 and 2.5 are limiting
functionality too by limiting the I/O in flight to some hundred
megabytes, right?

it's like the dma ring buffer size of a soundcard: if you want low latency
it has to be small, it's as simple as that. It's a tradeoff between
latency and performance, but the point here is that apparently you gain
nothing with such a huge amount of I/O in flight. This has nothing to
do with the number of requests; there have to be a lot of requests, or
seeks won't be reordered aggressively, but when everything merges, using
all the requests is pointless and only has the effect of locking
everything in ram, and this screws up the write throttling too, because
we do write throttling on the dirty stuff, not on the locked stuff, and
this is what elevator-lowlatency addresses.

You may argue about the in-flight I/O limit I chose, but really
the default in mainline looks overkill to me for generic hardware.

> The issue raised here is streaming io competing with processes working
> within bounded memory. It's unclear to me how 2.5.x mitigates this but
> the effects are far less drastic there. The "fix" you're suggesting is
> clamping off the entire machine's io just to contain the working set of

Show me this clamping off, please. Take 2.4.21pre4aa3 and trash it
compared to 2.4.21pre4 with the minimum 32M queue; I'd be very
interested. If I have a problem I must fix it ASAP, but all the benchmarks
are in the green so far and the behaviour was very bad before these fixes;
go ahead and show me red and you'll be doing me a big favour. Either that
or you're wrong that I'm clamping off anything.

Just to be clear, this whole thing has nothing to do with the elevator,
or CFQ or whatever; it is only related to the worthwhile amount of
in-flight I/O to keep the disk always running.

> a single process that generates unbounded amounts of dirty data and
> inadvertently penalizes other processes via page reclaim, where instead
> it should be forced to fairly wait its turn for memory.
> 
> -- wli


Andrea

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: IO scheduler benchmarking
  2003-02-21 11:08         ` Andrea Arcangeli
@ 2003-02-21 11:17           ` Nick Piggin
  2003-02-21 11:41             ` Andrea Arcangeli
  2003-02-21 11:34           ` William Lee Irwin III
  1 sibling, 1 reply; 17+ messages in thread
From: Nick Piggin @ 2003-02-21 11:17 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: William Lee Irwin III, Andrew Morton, David Lang, linux-kernel

Andrea Arcangeli wrote:

>it's like the dma ring buffer size of a soundcard: if you want low latency
>it has to be small, it's as simple as that. It's a tradeoff between
>
The dma buffer is strictly FIFO, though, so the situation isn't
quite so simple for disk IO.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: IO scheduler benchmarking
  2003-02-21 11:08         ` Andrea Arcangeli
  2003-02-21 11:17           ` Nick Piggin
@ 2003-02-21 11:34           ` William Lee Irwin III
  2003-02-21 12:38             ` Andrea Arcangeli
  1 sibling, 1 reply; 17+ messages in thread
From: William Lee Irwin III @ 2003-02-21 11:34 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Andrew Morton, David Lang, linux-kernel

On Fri, Feb 21, 2003 at 02:51:46AM -0800, William Lee Irwin III wrote:
>> Restricting io in flight doesn't actually repair the issues raised by

On Fri, Feb 21, 2003 at 12:08:07PM +0100, Andrea Arcangeli wrote:
> The amount of I/O that we allow in flight is purely arbitrary; there is
> no point in allowing several dozen mbytes of I/O in flight on a 64M
> machine. My patch fixes that and nothing more.

I was arguing against having any preset limit whatsoever.


On Fri, Feb 21, 2003 at 02:51:46AM -0800, William Lee Irwin III wrote:
>> it, but rather avoids them by limiting functionality.

On Fri, Feb 21, 2003 at 12:08:07PM +0100, Andrea Arcangeli wrote:
> If you can show a (throughput) benchmark where you see this limited
> functionality I'd be very interested.
> Alternatively I can also claim that 2.4 and 2.5 are limiting
> functionality too by limiting the I/O in flight to some hundred
> megabytes, right?

This has nothing to do with benchmarks.

Counterexample: suppose the process generating dirty data is the only
one running. The machine's effective RAM capacity is then limited to
the dirty data limit plus some small constant by this io in flight
limitation.

This functionality is not to be dismissed lightly: changing the /proc/
business is root-only, hence it may not be within the power of a victim
of a poor setting to adjust it.


On Fri, Feb 21, 2003 at 12:08:07PM +0100, Andrea Arcangeli wrote:
> it's like the dma ring buffer size of a soundcard: if you want low latency
> it has to be small, it's as simple as that. It's a tradeoff between
> latency and performance, but the point here is that apparently you gain
> nothing with such a huge amount of I/O in flight. This has nothing to
> do with the number of requests; there have to be a lot of requests, or
> seeks won't be reordered aggressively, but when everything merges, using
> all the requests is pointless and only has the effect of locking
> everything in ram, and this screws up the write throttling too, because
> we do write throttling on the dirty stuff, not on the locked stuff, and
> this is what elevator-lowlatency addresses.
> You may argue about the in-flight I/O limit I chose, but really
> the default in mainline looks overkill to me for generic hardware.

It's not a question of gain but rather immunity to reconfigurations.
Redoing it for all the hardware raises a tuning issue, and in truth
all I've ever wound up doing is turning it off because I've got so
much RAM that various benchmarks could literally be done in-core as a
first pass, then sorted, then sprayed out to disk in block-order. And
a bunch of open benchmarks are basically just in-core spinlock exercise.
(Ignore the fact there was a benchmark mentioned.)

Amortizing seeks and incrementally sorting and so on generally require
large buffers, and if you have the RAM, the kernel should use it.

But more seriously, global io in flight limits are truly worthless, if
anything it should be per-process, but even that's inadequate as it
requires retuning for varying io speeds. Limit enforcement needs to be
(1) localized
(2) self-tuned via block layer feedback

If I understand the code properly, 2.5.x has (2) but not (1).


On Fri, Feb 21, 2003 at 02:51:46AM -0800, William Lee Irwin III wrote:
>> The issue raised here is streaming io competing with processes working
>> within bounded memory. It's unclear to me how 2.5.x mitigates this but
>> the effects are far less drastic there. The "fix" you're suggesting is
>> clamping off the entire machine's io just to contain the working set of

On Fri, Feb 21, 2003 at 12:08:07PM +0100, Andrea Arcangeli wrote:
> Show me this clamping off, please. Take 2.4.21pre4aa3 and trash it
> compared to 2.4.21pre4 with the minimum 32M queue; I'd be very
> interested. If I have a problem I must fix it ASAP, but all the benchmarks
> are in the green so far and the behaviour was very bad before these fixes;
> go ahead and show me red and you'll be doing me a big favour. Either that
> or you're wrong that I'm clamping off anything.
> Just to be clear, this whole thing has nothing to do with the elevator,
> or CFQ or whatever; it is only related to the worthwhile amount of
> in-flight I/O to keep the disk always running.

You named the clamping off yourself. A dozen MB on a 64MB box, 32MB on
2.4.21pre4. Some limit that's a hard upper bound but resettable via a
sysctl or /proc/ or something. Testing 2.4.x-based trees might be a
little painful since I'd have to debug why 2.4.x stopped booting on my
boxen, which would take me a bit far afield from my current hacking.


On Fri, Feb 21, 2003 at 02:51:46AM -0800, William Lee Irwin III wrote:
>> a single process that generates unbounded amounts of dirty data and
>> inadvertently penalizes other processes via page reclaim, where instead
>> it should be forced to fairly wait its turn for memory.

I believe I said something important here. =)

The reason why this _should_ be the case is that processes stealing
from each other is the kind of mutual interference that leads to things
like Mozilla taking ages to swap in because other things were running
for a while and it wasn't and so on.


-- wli

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: IO scheduler benchmarking
  2003-02-21 11:17           ` Nick Piggin
@ 2003-02-21 11:41             ` Andrea Arcangeli
  2003-02-21 21:25               ` Andrew Morton
  0 siblings, 1 reply; 17+ messages in thread
From: Andrea Arcangeli @ 2003-02-21 11:41 UTC (permalink / raw)
  To: Nick Piggin
  Cc: William Lee Irwin III, Andrew Morton, David Lang, linux-kernel

On Fri, Feb 21, 2003 at 10:17:55PM +1100, Nick Piggin wrote:
> Andrea Arcangeli wrote:
> 
> >it's like the dma ring buffer size of a soundcard: if you want low latency
> >it has to be small, it's as simple as that. It's a tradeoff between
> >
> The dma buffer is strictly FIFO, though, so the situation isn't
> quite so simple for disk IO.

In general (w/o CFQ, or the other extreme of an unfair, starving
elevator where you're stuck regardless of the size of the queue), a
larger queue will mean higher latencies in the presence of a flood of
async load, just like in a dma buffer. This is obvious for the noop
elevator, for example.

I'm speaking about a stable, non-starving, fast, default elevator
(something like in 2.4 mainline, incidentally), and for that the
similarity with a dma buffer definitely applies: there will be a latency
effect coming from the size of the queue (even ignoring the other issues
that the load of locked buffers introduces).

The whole idea of CFQ is to make some workloads work low-latency
independently of the size of the async queue. But still (even with CFQ)
you have all the other problems about write throttling, the worthless
amount of locked ram, and even the time wasted on lots of full, merely
ordered requests in the elevator (yeah, I know if you use the noop
elevator it will waste almost no time, but again this is not what most
people will use). I don't
buy Andrew complaining about the write throttling when he still allows
several dozen mbytes of ram in flight and invisible to the VM; I mean,
before complaining about write throttling, the excessive, worthless
amount of locked buffers must be fixed, and so I did, and it works very
well from the feedback I had so far.

You can take 2.4.21pre4aa3 and benchmark it as you want if you think I'm
totally wrong; the elevator-lowlatency patch should be trivial to apply
and back out (benchmarking against pre4 would be unfair).

Andrea

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: IO scheduler benchmarking
  2003-02-21 11:34           ` William Lee Irwin III
@ 2003-02-21 12:38             ` Andrea Arcangeli
  0 siblings, 0 replies; 17+ messages in thread
From: Andrea Arcangeli @ 2003-02-21 12:38 UTC (permalink / raw)
  To: William Lee Irwin III, Andrew Morton, David Lang, linux-kernel

On Fri, Feb 21, 2003 at 03:34:36AM -0800, William Lee Irwin III wrote:
> On Fri, Feb 21, 2003 at 02:51:46AM -0800, William Lee Irwin III wrote:
> >> Restricting io in flight doesn't actually repair the issues raised by
> 
> On Fri, Feb 21, 2003 at 12:08:07PM +0100, Andrea Arcangeli wrote:
> > The amount of I/O that we allow in flight is purely arbitrary; there is
> > no point in allowing several dozen mbytes of I/O in flight on a 64M
> > machine. My patch fixes that and nothing more.
> 
> I was arguing against having any preset limit whatsoever.

The preset limit exists in every Linux kernel out there.  It should be
mandated by the low-level device driver; I don't allow that yet, but it
should be trivial to extend with just an additional per-queue int. It's
just an implementation matter.

> On Fri, Feb 21, 2003 at 02:51:46AM -0800, William Lee Irwin III wrote:
> >> it, but rather avoids them by limiting functionality.
> 
> On Fri, Feb 21, 2003 at 12:08:07PM +0100, Andrea Arcangeli wrote:
> > If you can show a (throughput) benchmark where you see this limited
> > functionality I'd be very interested.
> > Alternatively I can also claim that 2.4 and 2.5 are limiting
> > functionality too by limiting the I/O in flight to some hundred
> > megabytes, right?
> 
> This has nothing to do with benchmarks.

It has to. You claimed I limited functionality; if you can't measure it
in any way (or at least demonstrate it with math), it doesn't exist.

> Counterexample: suppose the process generating dirty data is the only
> one running. The machine's effective RAM capacity is then limited to
> the dirty data limit plus some small constant by this io in flight
> limitation.

Only the free memory and cache are accounted here; while this task
allocates ram with malloc, the amount of dirty ram will be reduced
accordingly, so what you said is far from reality. We aren't 100%
accurate in the cache-level accounting, true, but we're 100% accurate in
the anonymous memory accounting.

> This functionality is not to be dismissed lightly: changing the /proc/
> business is root-only, hence it may not be within the power of a victim
> of a poor setting to adjust it.
> 
> 
> On Fri, Feb 21, 2003 at 12:08:07PM +0100, Andrea Arcangeli wrote:
> > it's like the dma ring buffer size of a soundcard: if you want low latency
> > it has to be small, it's as simple as that. It's a tradeoff between
> > latency and performance, but the point here is that apparently you gain
> > nothing with such a huge amount of I/O in flight. This has nothing to
> > do with the number of requests; there have to be a lot of requests, or
> > seeks won't be reordered aggressively, but when everything merges, using
> > all the requests is pointless and only has the effect of locking
> > everything in ram, and this screws up the write throttling too, because
> > we do write throttling on the dirty stuff, not on the locked stuff, and
> > this is what elevator-lowlatency addresses.
> > You may argue about the in-flight I/O limit I chose, but really
> > the default in mainline looks overkill to me for generic hardware.
> 
> It's not a question of gain but rather immunity to reconfigurations.

You mean immunity to reconfiguration on machines with more than 4G of
ram, maybe, and you are OK with completely ignoring the latency effects
of the overkill queue size. Everything smaller can be affected by it,
and not only in terms of latency, especially if you have multiple
spindles that literally multiply the fixed maximum amount of in-flight
I/O.

> Redoing it for all the hardware raises a tuning issue, and in truth
> all I've ever wound up doing is turning it off because I've got so
> much RAM that various benchmarks could literally be done in-core as a
> first pass, then sorted, then sprayed out to disk in block-order. And
> a bunch of open benchmarks are basically just in-core spinlock exercise.
> (Ignore the fact there was a benchmark mentioned.)
> 
> Amortizing seeks and incrementally sorting and so on generally require
> large buffers, and if you have the RAM, the kernel should use it.
> 
> But more seriously, global io in flight limits are truly worthless, if
> anything it should be per-process, but even that's inadequate as it

This doesn't make any sense; the limit always exists, it has to. If you
drop it the machine will die, deadlocking in a few milliseconds: the
plugging and write-throttling logic that drives the whole I/O subsystem
totally depends on a limit on the in-flight I/O.

> requires retuning for varying io speeds. Limit enforcement needs to be
> (1) localized
> (2) self-tuned via block layer feedback
> 
> If I understand the code properly, 2.5.x has (2) but not (1).

2.5 has the unplugging logic, so it definitely has a high limit on
in-flight I/O too, no matter what elevator or whatever; w/o the fixed
limit, 2.5 will die too, like any other Linux kernel out there I have
ever seen.

> 
> On Fri, Feb 21, 2003 at 02:51:46AM -0800, William Lee Irwin III wrote:
> >> The issue raised here is streaming io competing with processes working
> >> within bounded memory. It's unclear to me how 2.5.x mitigates this but
> >> the effects are far less drastic there. The "fix" you're suggesting is
> >> clamping off the entire machine's io just to contain the working set of
> 
> On Fri, Feb 21, 2003 at 12:08:07PM +0100, Andrea Arcangeli wrote:
> > Show me this clamping off, please. Take 2.4.21pre4aa3 and trash it
> > compared to 2.4.21pre4 with the minimum 32M queue; I'd be very
> > interested. If I have a problem I must fix it ASAP, but all the benchmarks
> > are in the green so far and the behaviour was very bad before these fixes;
> > go ahead and show me red and you'll be doing me a big favour. Either that
> > or you're wrong that I'm clamping off anything.
> > Just to be clear, this whole thing has nothing to do with the elevator,
> > or CFQ or whatever; it is only related to the worthwhile amount of
> > in-flight I/O to keep the disk always running.
> 
> You named the clamping off yourself. A dozen MB on a 64MB box, 32MB on
> 2.4.21pre4. Some limit that's a hard upper bound but resettable via a
> sysctl or /proc/ or something. Testing 2.4.x-based trees might be a
> little painful since I'd have to debug why 2.4.x stopped booting on my
> boxen, which would take me a bit far afield from my current hacking.

2.4.21pre4aa3 has to boot on it.

> On Fri, Feb 21, 2003 at 02:51:46AM -0800, William Lee Irwin III wrote:
> >> a single process that generates unbounded amounts of dirty data and
> >> inadvertently penalizes other processes via page reclaim, where instead
> >> it should be forced to fairly wait its turn for memory.
> 
> I believe I said something important here. =)

You're arguing about the async flushing heuristic, which should be made
smarter instead of taking 50% of the freeable memory (not anonymous
memory). This isn't black-and-white stuff and you shouldn't mix issues;
it has nothing to do with the blkdev plugging logic driven by the limit
on in-flight I/O (in every l-k out there ever).

> The reason why this _should_ be the case is that processes stealing
> from each other is the kind of mutual interference that leads to things
> like Mozilla taking ages to swap in because other things were running
> for a while and it wasn't and so on.
> 
> 
> -- wli


Andrea

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: IO scheduler benchmarking
  2003-02-21 11:41             ` Andrea Arcangeli
@ 2003-02-21 21:25               ` Andrew Morton
  2003-02-23 15:09                 ` Andrea Arcangeli
  0 siblings, 1 reply; 17+ messages in thread
From: Andrew Morton @ 2003-02-21 21:25 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: piggin, wli, david.lang, linux-kernel

Andrea Arcangeli <andrea@suse.de> wrote:
>
> I don't
> buy Andrew complaining about the write throttling when he still allows
> several dozen mbytes of ram in flight and invisible to the VM,

The 2.5 VM accounts for these pages (/proc/meminfo:Writeback) and throttling
decisions are made upon the sum of dirty+writeback pages.

The 2.5 VFS limits the amount of dirty+writeback memory, not just the amount
of dirty memory.

Throttling in both write() and the page allocator is fully decoupled from the
queue size.  An 8192-slot (4 gigabyte) queue on a 32M machine has been
tested.

The only tasks which block in get_request_wait() are the ones which we want
to block there: heavy writers.

Page reclaim will never block page allocators in get_request_wait().  That
causes terrible latency if the writer is still active.

Page reclaim will never block a page-allocating process on I/O against a
particular disk block.  Allocators are instead throttled against _any_ write
I/O completion.  (This is broken in several ways, but it works well enough to
leave it alone I think).
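
The throttling rule described above can be sketched as a toy model (this is
illustrative user-space Python, not kernel code; the 40% ratio is an assumed
default for the example, not the actual kernel tunable): the decision depends
only on the sum of dirty and writeback pages, never on the queue size.

```python
# Toy model of 2.5-style write throttling: block writers on the
# sum of dirty + writeback pages (both visible to the VM), so the
# decision is fully decoupled from the request-queue size.
def should_throttle(nr_dirty, nr_writeback, total_pages, dirty_ratio=0.40):
    # dirty_ratio is an assumed illustrative limit, not the kernel default.
    return (nr_dirty + nr_writeback) > total_pages * dirty_ratio

# Even with a huge queue, in-flight writeback still counts against the limit:
print(should_throttle(nr_dirty=1000, nr_writeback=30000, total_pages=65536))
# -> True
```

This is why an 8192-slot queue does not defeat throttling: the writeback
pages sitting in the queue are still accounted in the sum.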


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: IO scheduler benchmarking
  2003-02-21 21:25               ` Andrew Morton
@ 2003-02-23 15:09                 ` Andrea Arcangeli
  0 siblings, 0 replies; 17+ messages in thread
From: Andrea Arcangeli @ 2003-02-23 15:09 UTC (permalink / raw)
  To: Andrew Morton; +Cc: piggin, wli, david.lang, linux-kernel

On Fri, Feb 21, 2003 at 01:25:49PM -0800, Andrew Morton wrote:
> Andrea Arcangeli <andrea@suse.de> wrote:
> >
> > I don't
> > buy Andrew complaining about the write throttling when he still allows
> > several dozen mbytes of ram in flight and invisible to the VM,
> 
> The 2.5 VM accounts for these pages (/proc/meminfo:Writeback) and throttling
> decisions are made upon the sum of dirty+writeback pages.
> 
> The 2.5 VFS limits the amount of dirty+writeback memory, not just the amount
> of dirty memory.
> 
> Throttling in both write() and the page allocator is fully decoupled from the
> queue size.  An 8192-slot (4 gigabyte) queue on a 32M machine has been
> tested.

the 32M case is probably fine with it: you moved the limit of in-flight
I/O into the writeback layer, and the write throttling will limit the
amount of ram in flight to 16M or so. It would be much more interesting
to see some latency benchmark on an 8G machine with 4G simultaneously
locked in the I/O queue. A 4G queue on an IDE disk can only waste lots of
cpu and memory resources, increasing the latency too, without providing
any benefit. Your 4G queue thing provides only disadvantages as far as I
can tell.

> 
> The only tasks which block in get_request_wait() are the ones which we want
> to block there: heavy writers.
> 
> Page reclaim will never block page allocators in get_request_wait().  That
> causes terrible latency if the writer is still active.
> 
> Page reclaim will never block a page-allocating process on I/O against a
> particular disk block.  Allocators are instead throttled against _any_ write
> I/O completion.  (This is broken in several ways, but it works well enough to
> leave it alone I think).

2.4 on desktop boxes could fill all ram with locked and dirty stuff
because of the excessive size of the queue, so any comparison with 2.4
in terms of page reclaim should be repeated on 2.4.21pre4aa3 IMHO, where
the VM has a chance not to find the machine in a collapsed state where the
only thing it can do is either wait or panic(); feel free to choose
which you prefer.

Andrea

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: IO scheduler benchmarking
@ 2003-02-25  5:35 rwhron
  2003-02-25  6:38 ` Andrew Morton
  0 siblings, 1 reply; 17+ messages in thread
From: rwhron @ 2003-02-25  5:35 UTC (permalink / raw)
  To: linux-kernel; +Cc: akpm

Executive question: Why does 2.5.62-mm2 have higher sequential
write latency than 2.5.61-mm1?

tiobench numbers on uniprocessor single disk IDE:
The cfq scheduler (2.5.62-mm2 and 2.5.61-cfq) has a big latency
regression.

2.5.61-mm1		(default scheduler (anticipatory?))
2.5.61-mm1-cfq		elevator=cfq
2.5.62-mm2-as		anticipatory scheduler
2.5.62-mm2-dline	elevator=deadline
2.5.62-mm2		elevator=cfq

                    Thr  MB/sec   CPU%     avg lat      max latency
2.5.61-mm1            8   15.68   54.42%     5.87 ms     2.7 seconds
2.5.61-mm1-cfq        8    9.60   15.07%     7.54      393.0
2.5.62-mm2-as         8   14.76   52.04%     6.14        4.5
2.5.62-mm2-dline      8    9.91   13.90%     9.41         .8
2.5.62-mm2            8    9.83   15.62%     7.38      408.9
2.4.21-pre3           8   10.34   27.66%     8.80        1.0
2.4.21-pre3-ac4       8   10.53   28.41%     8.83         .6
2.4.21-pre3aa1        8   18.55   71.95%     3.25       87.6


For most thread counts (8 - 128), the anticipatory scheduler has roughly 
45% higher ext2 sequential read throughput.  Latency was higher than 
deadline, but a lot lower than cfq.

For tiobench sequential writes, the max latency numbers for 2.4.21-pre3
are notably lower than 2.5.62-mm2 (but not as good as 2.5.61-mm1).  
This is with 16 threads.  

                    Thr  MB/sec   CPU%      avg lat     max latency
2.5.61-mm1           16   18.30   81.12%     9.159 ms     6.1 seconds
2.5.61-mm1-cfq       16   18.03   80.71%     9.086        6.1
2.5.62-mm2-as        16   18.84   84.25%     8.620       47.7
2.5.62-mm2-dline     16   18.53   84.10%     8.967       53.4
2.5.62-mm2           16   18.46   83.28%     8.521       40.8
2.4.21-pre3          16   16.20   65.13%     9.566        8.7
2.4.21-pre3-ac4      16   18.50   83.68%     8.774       11.6
2.4.21-pre3aa1       16   18.49   88.10%     8.455        7.5

Recent uniprocessor benchmarks:
http://home.earthlink.net/~rwhron/kernel/latest.html

More uniprocessor benchmarks:
http://home.earthlink.net/~rwhron/kernel/k6-2-475.html

-- 
Randy Hron
http://home.earthlink.net/~rwhron/kernel/bigbox.html
latest quad xeon benchmarks:
http://home.earthlink.net/~rwhron/kernel/blatest.html


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: IO scheduler benchmarking
  2003-02-25  5:35 IO scheduler benchmarking rwhron
@ 2003-02-25  6:38 ` Andrew Morton
  0 siblings, 0 replies; 17+ messages in thread
From: Andrew Morton @ 2003-02-25  6:38 UTC (permalink / raw)
  To: rwhron; +Cc: linux-kernel

rwhron@earthlink.net wrote:
>
> Executive question: Why does 2.5.62-mm2 have higher sequential
> write latency than 2.5.61-mm1?

Well bear in mind that we sometimes need to perform reads to be able to
perform writes.  So the way tiobench measures it, you could be seeing
read-vs-write latencies here.

And there are various odd interactions in, at least, ext3.  You did not
specify which filesystem was used.

>  ...
>                     Thr  MB/sec   CPU%     avg lat      max latency
> 2.5.62-mm2-as         8   14.76   52.04%     6.14        4.5
> 2.5.62-mm2-dline      8    9.91   13.90%     9.41         .8
> 2.5.62-mm2            8    9.83   15.62%     7.38      408.9

Fishiness.  2.5.62-mm2 _is_ 2.5.62-mm2-as.  Why the 100x difference?

That 408 seconds looks suspect.


I don't know what tiobench is doing in there, really.  I find it more useful
to test simple things, which I can understand.  If you want to test write
latency, do this:

	while true
	do
		write-and-fsync -m 200 -O -f foo
	done

Maybe run a few of these.  This command will cause a continuous streaming
file overwrite.
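
If write-and-fsync (Andrew's own test tool, so an assumption here) is not at
hand, a rough stand-in for the streaming overwrite load can be improvised with
dd, overwriting the same file in place so the same pagecache pages are reused:

```shell
#!/bin/sh
# Rough stand-in for: while true; do write-and-fsync -m 200 -O -f foo; done
# conv=notrunc overwrites foo in place (no truncate, same pagecache pages);
# conv=fsync forces the data to disk before each dd exits.
while true
do
	dd if=/dev/zero of=foo bs=1M count=200 conv=notrunc,fsync 2>/dev/null
done
```

One iteration of the dd is enough to verify the overwrite behaves as expected
before leaving it looping in the background.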


then do:

	time write-and-fsync -m1 -f foo

this will simply write a megabyte file, fsync it and exit.

You need to be careful with this - get it wrong and most of the runtime is
actually paging the executables back in.  That is why the above background
load is just reusing the same pagecache over and over.
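
The foreground measurement (write a megabyte, fsync, report the elapsed time)
can also be sketched in a few lines of Python; this is a minimal stand-in for
`write-and-fsync -m1 -f foo`, under the assumption that that tool does little
more than write, fsync and exit:

```python
import os
import time

def timed_write_fsync(path, megabytes=1):
    """Write `megabytes` MB to `path`, fsync it, return elapsed seconds."""
    chunk = b"\0" * (1024 * 1024)
    start = time.time()
    with open(path, "wb") as f:
        for _ in range(megabytes):
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())   # the latency of interest is mostly here
    return time.time() - start

if __name__ == "__main__":
    print("latency: %.3f s" % timed_write_fsync("foo"))
```

Run several of these under the background overwrite load to see how badly the
fsync is penalized by competing writeback.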

The latency which I see for the one megabyte write and fsync varies a lot,
from one second to ten.  That's with the deadline scheduler.

There is a place in VFS where one writing task could accidentally hammer a
different one.  I cannot trigger that, but I'll fix it up in next -mm.



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: IO scheduler benchmarking
@ 2003-02-25 12:59 rwhron
  2003-02-25 22:09 ` Andrew Morton
  0 siblings, 1 reply; 17+ messages in thread
From: rwhron @ 2003-02-25 12:59 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel

>> Why does 2.5.62-mm2 have higher sequential
>> write latency than 2.5.61-mm1?

> And there are various odd interactions in, at least, ext3.  You did not
> specify which filesystem was used.

ext2

>>                     Thr  MB/sec   CPU%     avg lat      max latency
>> 2.5.62-mm2-as         8   14.76   52.04%     6.14        4.5
>> 2.5.62-mm2-dline      8    9.91   13.90%     9.41         .8
>> 2.5.62-mm2            8    9.83   15.62%     7.38      408.9

> Fishiness.  2.5.62-mm2 _is_ 2.5.62-mm2-as.  Why the 100x difference?

Bad EXTRAVERSION naming on my part.  2.5.62-mm2 _was_ booted with 
elevator=cfq.

How it happened:
2.5.61-mm1 tested
2.5.61-mm1-cfq tested and elevator=cfq added to boot flags
2.5.62-mm2 tested (elevator=cfq still in lilo boot flags)
Then to test the other two schedulers I changed extraversion and boot
flags.

> That 408 seconds looks suspect.

AFAICT, that's the one request in over 500,000 that took the longest.
The numbers are fairly consistent.  How relevant they are is debatable.  

> If you want to test write latency, do this:

Your approach is more realistic than tiobench.  

> There is a place in VFS where one writing task could accidentally hammer a
> different one.  I cannot trigger that, but I'll fix it up in next -mm.

2.5.62-mm3 or 2.5.63-mm1?  (-mm3 is running now)

-- 
Randy Hron
http://home.earthlink.net/~rwhron/kernel/bigbox.html


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: IO scheduler benchmarking
@ 2003-02-25 21:57 rwhron
  0 siblings, 0 replies; 17+ messages in thread
From: rwhron @ 2003-02-25 21:57 UTC (permalink / raw)
  To: linux-kernel; +Cc: akpm

> Why does 2.5.62-mm2 have higher sequential
> write latency than 2.5.61-mm1?

Anticipatory scheduler tiobench profile on uniprocessor:

                              2.5.61-mm1   2.5.62-mm2
total                           1993387     1933241
default_idle                    1873179     1826650
system_call                       49838       43036
get_offset_tsc                    21905       20883
do_schedule                       13893       10344
do_gettimeofday                    8478        6044
sys_gettimeofday                   8077        5153
current_kernel_time                4904       12165
syscall_exit                       4047        1243
__wake_up                          1274        1000
io_schedule                        1166        1039
prepare_to_wait                    1093         792
schedule_timeout                    612         366
delay_tsc                           502         443
get_fpu_cwd                         473         376
syscall_call                        389         378
math_state_restore                  354         271
restore_fpu                         329         287
del_timer                           325         200
device_not_available                290         377
finish_wait                         257         181
add_timer                           218         137
io_schedule_timeout                 195          72
cpu_idle                            193         218
run_timer_softirq                   137          33
remove_wait_queue                   121         188
eligible_child                      106         154
sys_wait4                           105         162
work_resched                        104         110
ret_from_intr                        97          74
dup_task_struct                      75          48
add_wait_queue                       67         124
__cond_resched                       59          69
do_page_fault                        55           0
do_softirq                           53          12
pte_alloc_one                        51          67
release_task                         44          55
get_signal_to_deliver                38          43
get_wchan                            16          10
mod_timer                            15           0
old_mmap                             14          19
prepare_to_wait_exclusive            10          32
mm_release                            7           0
release_x86_irqs                      7           8
sys_getppid                           6           5
handle_IRQ_event                      4           0
schedule_tail                         4           0
kill_proc_info                        3           0
device_not_available_emulate          2           0
task_prio                             1           1
__down                                0          33
__down_failed_interruptible           0           3
init_fpu                              0          12
pgd_ctor                              0           3
process_timeout                       0           2
restore_all                           0           2
sys_exit                              0           2
-- 
Randy Hron
http://home.earthlink.net/~rwhron/kernel/bigbox.html


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: IO scheduler benchmarking
  2003-02-25 12:59 rwhron
@ 2003-02-25 22:09 ` Andrew Morton
  0 siblings, 0 replies; 17+ messages in thread
From: Andrew Morton @ 2003-02-25 22:09 UTC (permalink / raw)
  To: rwhron; +Cc: linux-kernel

rwhron@earthlink.net wrote:
>
> >> Why does 2.5.62-mm2 have higher sequential
> >> write latency than 2.5.61-mm1?
> 
> > And there are various odd interactions in, at least, ext3.  You did not
> > specify which filesystem was used.
> 
> ext2
> 
> >>                     Thr  MB/sec   CPU%     avg lat      max latency
> >> 2.5.62-mm2-as         8   14.76   52.04%     6.14        4.5
> >> 2.5.62-mm2-dline      8    9.91   13.90%     9.41         .8
> >> 2.5.62-mm2            8    9.83   15.62%     7.38      408.9
> 
> > Fishiness.  2.5.62-mm2 _is_ 2.5.62-mm2-as.  Why the 100x difference?
> 
> Bad EXTRAVERSION naming on my part.  2.5.62-mm2 _was_ booted with 
> elevator=cfq.
> 
> ...
> > That 408 seconds looks suspect.
> 
> AFAICT, that's the one request in over 500,000 that took the longest.
> The numbers are fairly consistent.  How relevant they are is debatable.  

OK.  When I was testing CFQ I saw some odd behaviour, such as a 100%
cessation of reads for periods of up to ten seconds.

So there is some sort of bug in there, and until that is understood we should
not conclude anything at all about CFQ from this testing.

> 2.5.62-mm3 or 2.5.63-mm1?  (-mm3 is running now)

Well I'm showing about seven more AS patches since 2.5.63-mm1 already, so
this is a bit of a moving target.  Sorry.



^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2003-02-25 23:46 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-02-25  5:35 IO scheduler benchmarking rwhron
2003-02-25  6:38 ` Andrew Morton
  -- strict thread matches above, loose matches on Subject: below --
2003-02-25 21:57 rwhron
2003-02-25 12:59 rwhron
2003-02-25 22:09 ` Andrew Morton
2003-02-21  5:23 Andrew Morton
2003-02-21  6:51 ` David Lang
2003-02-21  8:16   ` Andrew Morton
2003-02-21 10:31     ` Andrea Arcangeli
2003-02-21 10:51       ` William Lee Irwin III
2003-02-21 11:08         ` Andrea Arcangeli
2003-02-21 11:17           ` Nick Piggin
2003-02-21 11:41             ` Andrea Arcangeli
2003-02-21 21:25               ` Andrew Morton
2003-02-23 15:09                 ` Andrea Arcangeli
2003-02-21 11:34           ` William Lee Irwin III
2003-02-21 12:38             ` Andrea Arcangeli
