public inbox for linux-kernel@vger.kernel.org
* IO scheduler benchmarking
@ 2003-02-21  5:23 Andrew Morton
  2003-02-21  5:23 ` iosched: parallel streaming reads Andrew Morton
                   ` (9 more replies)
  0 siblings, 10 replies; 27+ messages in thread
From: Andrew Morton @ 2003-02-21  5:23 UTC (permalink / raw)
  To: linux-kernel


Following this email are the results of a number of tests of various I/O
schedulers:

- Anticipatory Scheduler (AS) (from 2.5.61-mm1 approx)

- CFQ (as in 2.5.61-mm1)

- 2.5.61+hacks (Basically 2.5.61 plus everything before the anticipatory
  scheduler - tweaks which fix the writes-starve-reads problem via a
  scheduling storm)

- 2.4.21-pre4

All these tests are simple things from the command line.

I stayed away from the standard benchmarks because they do not really touch
on areas where the Linux I/O scheduler has traditionally been bad.  (If they
did, perhaps it wouldn't have been so bad..)

Plus, all the I/O schedulers perform similarly with the usual benchmarks,
with the exception of some tiobench phases, where AS does very well.

Executive summary: the anticipatory scheduler is wiping the others off the
map, and 2.4 is a disaster.

I really have not sought to make the AS look good - I mainly concentrated on
things which we have traditionally been bad at.  If anyone wants to suggest
other tests, please let me know.

The known regressions from the anticipatory scheduler are:

1) 15% (ish) slowdown in David Mansfield's database run.  This appeared to
   go away in later versions of the scheduler.

2) 5% dropoff in single-threaded qsbench swapstorms

3) 30% dropoff in write bandwidth when there is a streaming read (this is
   actually good).

The test machine is a fast P4-HT with 256MB of memory.  Testing was against a
single fast IDE disk, using ext2.




^ permalink raw reply	[flat|nested] 27+ messages in thread

* iosched: parallel streaming reads
  2003-02-21  5:23 IO scheduler benchmarking Andrew Morton
@ 2003-02-21  5:23 ` Andrew Morton
  2003-02-21  5:24 ` iosched: effect of streaming write on interactivity Andrew Morton
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 27+ messages in thread
From: Andrew Morton @ 2003-02-21  5:23 UTC (permalink / raw)
  To: linux-kernel


Here we see how well the scheduler can cope with multiple processes reading
multiple large files.  We read ten well laid out 100 megabyte files in
parallel (ten readers):

	for i in $(seq 0 9)
	do
		time cat 100-meg-file-$i > /dev/null &
	done

2.4.21-pre4:

	0.00s user 0.18s system 2% cpu 6.115 total
	0.02s user 0.22s system 1% cpu 14.312 total
	0.01s user 0.19s system 1% cpu 14.812 total
	0.00s user 0.14s system 0% cpu 20.462 total
	0.02s user 0.19s system 0% cpu 23.887 total
	0.06s user 0.14s system 0% cpu 27.085 total
	0.01s user 0.26s system 0% cpu 32.367 total
	0.00s user 0.22s system 0% cpu 34.844 total
	0.01s user 0.21s system 0% cpu 35.233 total
	0.01s user 0.16s system 0% cpu 37.007 total

2.5.61+hacks:

	0.01s user 0.16s system 0% cpu 2:12.00 total
	0.01s user 0.15s system 0% cpu 2:12.12 total
	0.00s user 0.14s system 0% cpu 2:12.34 total
	0.01s user 0.15s system 0% cpu 2:12.68 total
	0.00s user 0.15s system 0% cpu 2:12.93 total
	0.01s user 0.17s system 0% cpu 2:13.06 total
	0.01s user 0.14s system 0% cpu 2:13.18 total
	0.01s user 0.17s system 0% cpu 2:13.31 total
	0.01s user 0.16s system 0% cpu 2:13.49 total
	0.01s user 0.19s system 0% cpu 2:13.51 total

2.5.61+CFQ:

	0.01s user 0.16s system 0% cpu 50.778 total
	0.01s user 0.16s system 0% cpu 51.067 total
	0.01s user 0.16s system 0% cpu 52.854 total
	0.01s user 0.17s system 0% cpu 53.303 total
	0.01s user 0.17s system 0% cpu 54.565 total
	0.01s user 0.18s system 0% cpu 1:07.39 total
	0.01s user 0.17s system 0% cpu 1:19.96 total
	0.00s user 0.17s system 0% cpu 1:28.74 total
	0.01s user 0.18s system 0% cpu 1:31.28 total
	0.01s user 0.18s system 0% cpu 1:32.34 total

2.5.61+AS

	0.01s user 0.17s system 0% cpu 27.995 total
	0.01s user 0.18s system 0% cpu 30.550 total
	0.00s user 0.17s system 0% cpu 31.413 total
	0.00s user 0.18s system 0% cpu 32.381 total
	0.01s user 0.17s system 0% cpu 33.273 total
	0.01s user 0.18s system 0% cpu 33.389 total
	0.01s user 0.15s system 0% cpu 34.534 total
	0.01s user 0.17s system 0% cpu 34.481 total
	0.00s user 0.17s system 0% cpu 34.694 total
	0.01s user 0.16s system 0% cpu 34.832 total


AS and 2.4 almost achieved full disk bandwidth.  2.4 does quite well here,
although it was unfair.

As an aside, I reran this test with the VM readahead wound down from the
usual 128k to just 8k:

2.5.61+CFQ:

	0.01s user 0.25s system 0% cpu 7:48.39 total
	0.01s user 0.23s system 0% cpu 7:48.72 total
	0.02s user 0.26s system 0% cpu 7:48.93 total
	0.02s user 0.25s system 0% cpu 7:48.93 total
	0.01s user 0.26s system 0% cpu 7:49.08 total
	0.02s user 0.25s system 0% cpu 7:49.22 total
	0.02s user 0.26s system 0% cpu 7:49.25 total
	0.02s user 0.25s system 0% cpu 7:50.35 total
	0.02s user 0.26s system 0% cpu 8:19.82 total
	0.02s user 0.28s system 0% cpu 8:19.83 total

2.5.61 base:

	0.01s user 0.25s system 0% cpu 8:10.53 total
	0.01s user 0.27s system 0% cpu 8:11.96 total
	0.02s user 0.26s system 0% cpu 8:14.95 total
	0.02s user 0.26s system 0% cpu 8:17.33 total
	0.02s user 0.25s system 0% cpu 8:18.05 total
	0.01s user 0.24s system 0% cpu 8:19.03 total
	0.02s user 0.27s system 0% cpu 8:19.66 total
	0.02s user 0.25s system 0% cpu 8:20.00 total
	0.02s user 0.26s system 0% cpu 8:20.10 total
	0.02s user 0.25s system 0% cpu 8:20.11 total

2.5.61+AS

	0.02s user 0.23s system 0% cpu 28.640 total
	0.01s user 0.23s system 0% cpu 28.066 total
	0.02s user 0.23s system 0% cpu 28.525 total
	0.01s user 0.20s system 0% cpu 28.925 total
	0.01s user 0.22s system 0% cpu 28.835 total
	0.02s user 0.21s system 0% cpu 29.014 total
	0.02s user 0.23s system 0% cpu 29.093 total
	0.01s user 0.20s system 0% cpu 29.175 total
	0.01s user 0.23s system 0% cpu 29.233 total
	0.01s user 0.21s system 0% cpu 29.285 total

We see here that the anticipatory scheduler is not dependent upon large
readahead to get good performance.




^ permalink raw reply	[flat|nested] 27+ messages in thread

* iosched: effect of streaming write on interactivity
  2003-02-21  5:23 IO scheduler benchmarking Andrew Morton
  2003-02-21  5:23 ` iosched: parallel streaming reads Andrew Morton
@ 2003-02-21  5:24 ` Andrew Morton
  2003-02-21  5:25 ` iosched: effect of streaming read " Andrew Morton
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 27+ messages in thread
From: Andrew Morton @ 2003-02-21  5:24 UTC (permalink / raw)
  To: linux-kernel


It peeves me that if a machine is writing heavily, it takes *ages* to get a
login prompt.

Here we start a large streaming write, wait for that to reach steady state
and then see how long it takes to pop up an xterm from the machine under
test with

	time ssh testbox xterm -e true

There is quite a lot of variability here.
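
The streaming write here was presumably of the same form as the one used in
the later tests in this series, something like:

	while true
	do
	        dd if=/dev/zero of=foo bs=1M count=512 conv=notrunc
	done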

2.4.21-pre4:	62 seconds
2.5.61+hacks:	14 seconds
2.5.61+CFQ:	11 seconds
2.5.61+AS:	12 seconds


^ permalink raw reply	[flat|nested] 27+ messages in thread

* iosched: effect of streaming read on interactivity
  2003-02-21  5:23 IO scheduler benchmarking Andrew Morton
  2003-02-21  5:23 ` iosched: parallel streaming reads Andrew Morton
  2003-02-21  5:24 ` iosched: effect of streaming write on interactivity Andrew Morton
@ 2003-02-21  5:25 ` Andrew Morton
  2003-02-21  5:25 ` iosched: time to copy many small files Andrew Morton
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 27+ messages in thread
From: Andrew Morton @ 2003-02-21  5:25 UTC (permalink / raw)
  To: linux-kernel


Similarly, start a large streaming read on the test box and see how long it
then takes to pop up an x client running on that box with

	time ssh testbox xterm -e true
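
The streaming read here was presumably of the same form as the one used in
the later tests, something like:

	while true
	do
	        cat 512M-file > /dev/null
	done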


2.4.21-pre4:	45 seconds
2.5.61+hacks:	5 seconds
2.5.61+CFQ:	8 seconds
2.5.61+AS:	9 seconds



^ permalink raw reply	[flat|nested] 27+ messages in thread

* iosched: time to copy many small files
  2003-02-21  5:23 IO scheduler benchmarking Andrew Morton
                   ` (2 preceding siblings ...)
  2003-02-21  5:25 ` iosched: effect of streaming read " Andrew Morton
@ 2003-02-21  5:25 ` Andrew Morton
  2003-02-21  5:26 ` iosched: concurrent reads of " Andrew Morton
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 27+ messages in thread
From: Andrew Morton @ 2003-02-21  5:25 UTC (permalink / raw)
  To: linux-kernel


This test simply measures how long it takes to copy a large number of files
within the same filesystem.  It creates a lot of small, competing read and
write I/Os.  Changes which were made to the VFS dirty memory handling early
in the 2.5 cycle tend to make 2.5 a bit slower at this.

Three copies of the 2.4.19 kernel tree were placed on an ext2 filesystem. 
Measure the time it takes to copy them all to the same filesystem, and to
then sync the system.  This is just

	cp -a ./dir-with-three-kernel-trees/ ./new-dir
	sync

The anticipatory scheduler doesn't help here.  It could, but we haven't got
there yet, and it may need VFS help.

2.4.21-pre4:	70 seconds
2.5.61+hacks:	72 seconds
2.5.61+CFQ:	69 seconds
2.5.61+AS:	66 seconds




^ permalink raw reply	[flat|nested] 27+ messages in thread

* iosched: concurrent reads of many small files
  2003-02-21  5:23 IO scheduler benchmarking Andrew Morton
                   ` (3 preceding siblings ...)
  2003-02-21  5:25 ` iosched: time to copy many small files Andrew Morton
@ 2003-02-21  5:26 ` Andrew Morton
  2003-02-21  5:27 ` iosched: impact of streaming write on streaming read Andrew Morton
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 27+ messages in thread
From: Andrew Morton @ 2003-02-21  5:26 UTC (permalink / raw)
  To: linux-kernel


This test is very approximately the "busy web server" workload.  We set up a
number of processes, each of which is reading many small files from different
parts of the disk.

Set up six separate copies of the 2.4.19 kernel tree, and then run, in
parallel, six processes which are reading them:

	for i in 1 2 3 4 5 6
	do
		time (find kernel-tree-$i -type f | xargs cat > /dev/null ) &
	done

With this test we have six read requests in the queue all the time.  It's
what the anticipatory scheduler was designed for.


2.4.21-pre4:
	6m57.537s
	6m57.620s
	6m57.741s
	6m57.891s
	6m57.909s
	6m57.916s

2.5.61+hacks:
	3m40.188s
	3m51.332s
	3m55.110s
	3m56.186s
	3m56.757s
	3m56.791s

2.5.61+CFQ:
	5m15.932s
	5m16.219s
	5m16.386s
	5m17.407s
	5m50.233s
	5m50.602s

2.5.61+AS:
	0m44.573s
	0m45.119s
	0m46.559s
	0m49.202s
	0m51.884s
	0m53.087s

This was a little unfair to 2.4 because three of the trees were laid out by
the pre-Orlov ext2.  So I reran the test with 2.4.21-pre4 when all six trees
were laid out by 2.5's Orlov allocator:

	6m12.767s
	6m12.974s
	6m13.001s
	6m13.045s
	6m13.062s
	6m13.085s

Not much difference there, although Orlov is worth a 4x speedup in this test
when there is only a single reader (or multiple readers + the anticipatory
scheduler).




^ permalink raw reply	[flat|nested] 27+ messages in thread

* iosched: impact of streaming write on streaming read
  2003-02-21  5:23 IO scheduler benchmarking Andrew Morton
                   ` (4 preceding siblings ...)
  2003-02-21  5:26 ` iosched: concurrent reads of " Andrew Morton
@ 2003-02-21  5:27 ` Andrew Morton
  2003-02-21  5:27 ` iosched: impact of streaming write on read-many-files Andrew Morton
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 27+ messages in thread
From: Andrew Morton @ 2003-02-21  5:27 UTC (permalink / raw)
  To: linux-kernel


Here we take a look at the impact which a streaming write has upon streaming
read bandwidth.

A single streaming write was set up with:

	while true
	do
	        dd if=/dev/zero of=foo bs=1M count=512 conv=notrunc
	done

and we measure how long it takes to read a 100 megabyte file from the same
filesystem with

	time cat 100m-file > /dev/null

I'll include `vmstat 1' snippets here as well.


2.4.21-pre4:	42 seconds
 1  3    276   4384   2144 222300    0    0    80 26480  520   743  0  6 94  0
 0  3    276   4344   2144 222240    0    0    76 25224  512   492  0  4 96  0
 0  3    276   4340   2148 222220    0    0   124 25584  520   536  0  3 97  0
 0  3    276   4404   2152 222132    0    0    44 26604  538   533  0  5 95  0
 0  4    276   4464   2160 221928    0    0    60 25040  516   559  0  4 96  0
 0  4    276   4460   2160 221900    0    0   612 27456  560   621  0  4 96  0
 0  4    276   4392   2156 221972    0    0   708 23872  488   566  0  4 95  0
 0  4    276   4420   2168 221852    0    0   688 26668  545   653  0  4 96  0
 0  4    276   4204   2164 221912    0    0   696 21588  492   884  0  5 95  0
 0  4    276   4448   2164 221668    0    0   396 21376  423   833  0  4 96  0
 0  4    276   4432   2160 221688    0    0   784 26368  544   705  0  4 96  0
 0  4    276   4400   2168 221608    0    0   560 27640  563   596  0  5 95  0
 4  1    276   4324   2188 221616    0    0 12476 12996  538   908  0  4 96  0
 0  4    276   3516   2196 222408    0    0 12320 16048  529   971  0  2 98  0
 0  4    276   3468   2212 222424    0    0 12704 14428  540  1039  0  4 96  0
 0  4    276   4112   2208 221700    0    0   552 20824  474   539  0  4 96  0
 3  2    276   3768   2208 222040    0    0   524 25428  503   612  0  3 97  0
 0  4    276   4452   2216 221344    0    0   536 19548  437  1241  0  3 97  0

2.5.61+hacks:	48 seconds
 0  5      0   2140   1296 227700    0    0     0 22236 1213   126  0  4  0 96
 0  5      0   2252   1296 227664    0    0     0 23340 1219   123  0  3  0 97
 0  6      0   4044   1288 225904    0    0  1844 13632 1183   236  0  2  0 98
 0  6      0   4100   1268 225788    0    0  1920 13780 1173   217  0  2  0 98
 0  6      0   4156   1248 225908    0    0  2184 14828 1184   236  0  3  0 97
 0  6      0   4100   1244 226012    0    0  2176 13720 1173   237  0  2  0 98
 0  6      0   4212   1240 225980    0    0  1924 13900 1175   236  0  2  0 98
 0  5      0   5444   1192 224824    0    0  2304 11820 1164   206  0  2  0 98
 0  6      0   2196   1180 228088    0    0  2308 14460 1180   269  0  3  0 97

2.5.61+CFQ:	27 seconds
 1  3      0   6196   2060 222852    0    0     0 23840 1247   220  0  4  4 92
 0  2      0   4404   1820 224880    0    0     0 22208 1237   271  0  3  8 89
 2  4      0   2884   1680 226588    0    0  1496 26944 1263   355  0  4  2 94
 0  4      0   4332   1312 225388    0    0  4592 14692 1244   414  0  3  0 97
 0  4      0   4268   1012 225764    0    0  1408 29540 1308   671  0  5  0 95
 0  4      0   3316   1016 226752    0    0  2820 27500 1306   668  0  5  0 95
 0  4      0   4212    992 225924    0    0  3076 22148 1255   508  0  3  0 97

2.5.61+AS:	3.8 seconds
 0  4      0   2236   1320 227548    0    0     0 36684 1335   136  0  5  0 95
 0  4      0   2236   1296 227636    0    0     0 37736 1334   134  0  5  0 95
 0  5      0   3348   1088 226604    0    0  1232 30040 1320   174  0  4  0 96
 0  5      0   2284   1056 227920    0    0 29088  5488 1536   855  0  4  0 96
 0  5      0   4916   1080 225672    0    0 26904  8452 1517   993  0  5  0 95
 0  5    120   2228   1108 228732    0  120 29472  6752 1545   940  0  3  1 96
 0  4    120   4196   1060 226984    0    0 16164 15740 1426   627  0  3  3 93
	


^ permalink raw reply	[flat|nested] 27+ messages in thread

* iosched: impact of streaming write on read-many-files
  2003-02-21  5:23 IO scheduler benchmarking Andrew Morton
                   ` (5 preceding siblings ...)
  2003-02-21  5:27 ` iosched: impact of streaming write on streaming read Andrew Morton
@ 2003-02-21  5:27 ` Andrew Morton
  2003-02-21  5:27 ` iosched: impact of streaming read " Andrew Morton
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 27+ messages in thread
From: Andrew Morton @ 2003-02-21  5:27 UTC (permalink / raw)
  To: linux-kernel


Here we look at what effect a large streaming write has upon an operation
which reads many small files from the same disk.


A single streaming write was set up with:

	while true
	do
	        dd if=/dev/zero of=foo bs=1M count=512 conv=notrunc
	done

and we measure how long it takes to read all the files from a 2.4.19 kernel
tree off the same disk with

	time (find kernel-tree -type f | xargs cat > /dev/null)

As a reference, the time to read the kernel tree with no competing I/O is 7.9
seconds.

2.4.21-pre4:

    Don't know.  I killed it after 15 minutes.  Judging from the vmstat
    output it would have taken many hours.

2.5.61+hacks:	7 minutes 27 seconds
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0  8      0   2188   1200 226692    0    0   852 17664 1204   253  0  3  0 97
 0  8      0   4148   1212 224804    0    0  1940 16208 1187   245  0  2  0 98
 0  7      0   4260   1128 224756    0    0   324 20228 1226   298  0  3  0 97
 0  8      0   4204   1048 224944    0    0   500 20856 1227   313  0  3  0 97
 1  7      0   2300   1040 226840    0    0   348 20272 1227   313  0  3  0 97
 0  8      0   4204   1044 224952    0    0   212 21564 1230   320  0  3  0 97

2.5.61+CFQ:	9 minutes 55 seconds
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 1  2      0   4308   1028 224660    0    0   180 38368 1250   357  0  3  6 91
 0  4      0   2180   1020 226852    0    0   324 25196 1266   408  0  4  1 95
 0  4      0   2236   1016 226744    0    0   252 26948 1276   449  0  4  2 93
 0  4      0   4196   1020 224816    0    0   380 23204 1250   454  0  3  4 93
 0  3      0   4356   1036 224632    0    0  2616 25824 1271   490  0  4  0 96
 0  4      0   4140    968 224996    0    0   496 29416 1304   609  0  4  0 96
 0  4      0   2180    948 226972    0    0   352 29364 1300   688  0  5  0 95
 0  3      0   4364    928 224796    0    0   344 22100 1281   656  0  4 22 74

(CFQ had a strange 20-second pause in which it performed no reads at all)
(And a later 4-second one)
(then 10 seconds..)


2.5.61+AS:	17 seconds
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0  6      0   2280   2716 226112    0    0     0 22388 1205   151  0  3  0 97
 0  6      0   4296   2596 224168    0    0     0 21968 1213   148  0  3  0 97
 1  6      0   3872   2516 224408    0    0   296 19552 1223   249  0  3  0 97
 0  9      0   2176   2584 225324    0    0  5112 14588 1573  1424  0  5  0 94
 0  8      0   3364   2668 223116    0    0 17512  8500 3059  6065  0  8  0 92
 1  8      0   4156   2708 221340    0    0 12812  9560 2695  4863  0  9  0 91
 0  8      0   3740   2956 221188    0    0 17216  7200 2406  4045  0  6  0 94
 0  9      0   3828   2668 221192    0    0  9712  8972 1615  1540  0  5  0 94
 1  6      0   2060   2924 222272    0    0  8428 17784 1713  1718  0  5  0 95



^ permalink raw reply	[flat|nested] 27+ messages in thread

* iosched: impact of streaming read on read-many-files
  2003-02-21  5:23 IO scheduler benchmarking Andrew Morton
                   ` (6 preceding siblings ...)
  2003-02-21  5:27 ` iosched: impact of streaming write on read-many-files Andrew Morton
@ 2003-02-21  5:27 ` Andrew Morton
  2003-02-21 10:40   ` Andrea Arcangeli
  2003-02-21  5:28 ` iosched: effect of streaming read on streaming write Andrew Morton
  2003-02-21  6:51 ` IO scheduler benchmarking David Lang
  9 siblings, 1 reply; 27+ messages in thread
From: Andrew Morton @ 2003-02-21  5:27 UTC (permalink / raw)
  To: linux-kernel


Here we look at what effect a large streaming read has upon an operation
which reads many small files from the same disk.

A single streaming read was set up with:

	while true
	do
	        cat 512M-file > /dev/null
	done

and we measure how long it takes to read all the files from a 2.4.19 kernel
tree off the same disk with

	time (find kernel-tree -type f | xargs cat > /dev/null)



2.4.21-pre4:	31 minutes 30 seconds

2.5.61+hacks:	3 minutes 39 seconds

2.5.61+CFQ:	5 minutes 7 seconds (*)

2.5.61+AS:	17 seconds





* CFQ performed very strangely here.  Tremendous amount of seeking and a
  big drop in aggregate bandwidth.  See the vmstat 1 output from when the
  kernel tree read started up:


 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0  1   1240 125260   1176 109488    0    0 40744     0 1672   725  0  3 49 47
 0  1   1240  85892   1220 148788    0    0 39344     0 1651   693  0  3 49 48
 0  1   1240  45124   1260 189492    0    0 40744     0 1663   683  0  3 49 47
 1  1   1240   4544   1300 230068    0    0 40616     0 1661   837  0  4 49 47
 0  2   1348   3468    944 231696    0  108 40488   148 1671   800  0  4  4 91
 0  2   1348   2180    936 232920    0    0 40612    64 1668   789  0  4  0 96
 0  3   1348   4220    996 230648    0    0 11348     0 1256   352  0  2  0 98
 0  3   1348   4052   1064 230472    0    0  9012     0 1207   305  0  1  0 98
 0  4   1348   3596   1148 230580    0    0  6756     0 1171   247  0  1  0 99
 0  4   1348   4044   1148 229888    0    0  6344     0 1165   237  0  1  0 99
 1  3   1348   3708   1160 230212    0    0  7800     0 1187   255  0  1 21 78



^ permalink raw reply	[flat|nested] 27+ messages in thread

* iosched: effect of streaming read on streaming write
  2003-02-21  5:23 IO scheduler benchmarking Andrew Morton
                   ` (7 preceding siblings ...)
  2003-02-21  5:27 ` iosched: impact of streaming read " Andrew Morton
@ 2003-02-21  5:28 ` Andrew Morton
  2003-02-21  6:51 ` IO scheduler benchmarking David Lang
  9 siblings, 0 replies; 27+ messages in thread
From: Andrew Morton @ 2003-02-21  5:28 UTC (permalink / raw)
  To: linux-kernel


Here we look at how much damage a streaming read can do to writeout
performance.  Start a streaming read with:

	while true
	do
	        cat 512M-file > /dev/null
	done

and measure how long it takes to write out and fsync a 100 megabyte file:

	time write-and-fsync -f -m 100 outfile
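
(write-and-fsync is a small test utility which is not included here; a rough
shell equivalent, assuming a dd which supports conv=fsync, would be

	time dd if=/dev/zero of=outfile bs=1M count=100 conv=fsync

which writes out 100 megabytes and flushes them to disk.)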

2.4.21-pre4:	6.4 seconds
2.5.61+hacks:	7.7 seconds
2.5.61+CFQ:	8.4 seconds
2.5.61+AS:	11.9 seconds

This is the one where the anticipatory scheduler could show its downside. 
It's actually not too bad - the read stream steals 2/3rds of the disk
bandwidth.  Dirty memory will reach the vm threshold and writers will
throttle.  This is usually what we want to happen.

Here is the vmstat 1 trace for the anticipatory scheduler:

 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 1  1   8728   2268   2620 233412    0    0 40360     0 1658   802  0  4  0 96
 0  2   8728   3780   2508 231924    0    0 40616     4 1668   874  0  5  0 95
 0  2   8728   3668   2276 232416    0    0 40740    20 1668   978  0  4  0 96
 0  3   8728   3660   2192 232668   40    0 35296    12 1603   904  0  4  0 95
 0  5   8728   3612   1964 231672    0    0 26220 18572 1497  1381  0 15  0 85
 0  5   8728   2100   1732 233584    0    0 25232  8696 1497   867  0  3 16 81
 0  5   8728   3664   1204 232424    0    0 27668  8696 1533   787  0  3  0 97
 1  4   8728   2432    792 234108    0    0 27160  8696 1527   965  0  3  0 97
 0  6   8728   2208    760 234436    0    0 25904  9584 1513   856  0  3  0 97
 2  6   8728   3776    760 233148    0    0 27776  8716 1537   880  0  3  0 97
 0  6   8728   2204    624 234968    0    0 27924  8812 1541   991  0  4  0 96
 0  4   8716   2508    600 234740    0    0 28188  8216 1537  1038  0  4  0 96
 0  4   8716   4072    532 233316    0   16 25624  9644 1515   896  0  3  0 97
 0  4   8716   3740    548 233624    0    0 27548  8696 1528   908  0  3  0 97




^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: IO scheduler benchmarking
  2003-02-21  5:23 IO scheduler benchmarking Andrew Morton
                   ` (8 preceding siblings ...)
  2003-02-21  5:28 ` iosched: effect of streaming read on streaming write Andrew Morton
@ 2003-02-21  6:51 ` David Lang
  2003-02-21  8:16   ` Andrew Morton
  9 siblings, 1 reply; 27+ messages in thread
From: David Lang @ 2003-02-21  6:51 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

One other useful test would be the time to copy a large (multi-gig) file.
Currently this takes forever and uses very little of the disk bandwidth; I
suspect that the AS would give more preference to reads and therefore would
go faster.

For a real-world example, mozilla downloads files to a temp directory and
then copies them to the permanent location. When I download a video from my
tivo it takes ~20 min to download a 1G video, during which time the system
is perfectly responsive; then, after the download completes, when mozilla
copies it to the real destination (on a separate disk, so it is a copy, not
just a move) the system becomes completely unresponsive to anything
requiring disk IO for several minutes.

David Lang


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: IO scheduler benchmarking
  2003-02-21  6:51 ` IO scheduler benchmarking David Lang
@ 2003-02-21  8:16   ` Andrew Morton
  2003-02-21 10:31     ` Andrea Arcangeli
  0 siblings, 1 reply; 27+ messages in thread
From: Andrew Morton @ 2003-02-21  8:16 UTC (permalink / raw)
  To: David Lang; +Cc: linux-kernel

David Lang <david.lang@digitalinsight.com> wrote:
>
> one other useful test would be the time to copy a large (multi-gig) file.
> currently this takes forever and uses very little fo the disk bandwidth, I
> suspect that the AS would give more preference to reads and therefor would
> go faster.

Yes, that's a test.

	time (cp 1-gig-file foo ; sync)

2.5.62-mm2,AS:		1:22.36
2.5.62-mm2,CFQ:		1:25.54
2.5.62-mm2,deadline:	1:11.03
2.4.21-pre4:		1:07.69

Well gee.


> for a real-world example, mozilla downloads files to a temp directory and
> then copies it to the premanent location. When I download a video from my
> tivo it takes ~20 min to download a 1G video, during which time the system
> is perfectly responsive, then after the download completes when mozilla
> copies it to the real destination (on a seperate disk so it is a copy, not
> just a move) the system becomes completely unresponsive to anything
> requireing disk IO for several min.

Well, 2.4 is unresponsive, period.  That's due to problems in the VM - processes
which are trying to allocate memory get continually DoS'ed by `cp' in page
reclaim.

For the reads-starved-by-writes problem which you describe, you'll see that
quite a few of the tests did cover that.  contest does as well.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: IO scheduler benchmarking
  2003-02-21  8:16   ` Andrew Morton
@ 2003-02-21 10:31     ` Andrea Arcangeli
  2003-02-21 10:51       ` William Lee Irwin III
  0 siblings, 1 reply; 27+ messages in thread
From: Andrea Arcangeli @ 2003-02-21 10:31 UTC (permalink / raw)
  To: Andrew Morton; +Cc: David Lang, linux-kernel

On Fri, Feb 21, 2003 at 12:16:24AM -0800, Andrew Morton wrote:
> Yes, that's a test.
> 
> 	time (cp 1-gig-file foo ; sync)
> 
> 2.5.62-mm2,AS:		1:22.36
> 2.5.62-mm2,CFQ:		1:25.54
> 2.5.62-mm2,deadline:	1:11.03
> 2.4.21-pre4:		1:07.69
> 
> Well gee.

It's pointless to benchmark CFQ in a workload like that IMHO.  If you
read and write to the same hard disk you want lots of unfairness in order
to go faster.  Your latency is a mixture of reads and writes, and the
writes are likely run by the kernel, so CFQ will likely generate more seeks
(it also depends on whether you have the magic for the current->mm == NULL case).

You should run something along these lines to measure the difference:

	dd if=/dev/zero of=readme bs=1M count=2000
	sync
	cp /dev/zero . & time cp readme /dev/null

And the best CFQ benchmark really is to run the tiobench read test with a
single thread during the `cp /dev/zero .`.  That will measure the worst-case
latency that `read` provided during the benchmark, and it should show the
biggest difference, because that is definitely the only thing one cares
about if one needs CFQ or SFQ.  You don't care that much about throughput
if you enable CFQ, so it's not even correct to benchmark in terms of
wall-clock time; only the worst-case `read` latency matters.

> > for a real-world example, mozilla downloads files to a temp directory and
> > then copies it to the premanent location. When I download a video from my
> > tivo it takes ~20 min to download a 1G video, during which time the system
> > is perfectly responsive, then after the download completes when mozilla
> > copies it to the real destination (on a seperate disk so it is a copy, not
> > just a move) the system becomes completely unresponsive to anything
> > requireing disk IO for several min.
> 
> Well 2.4 is unreponsive period.  That's due to problems in the VM - processes
> which are trying to allocate memory get continually DoS'ed by `cp' in page
> reclaim.

This depends on the workload; you may not have that many allocations, and
an echo 1 >/proc/sys/vm/bdflush will fix it should your workload be hurt
by too much dirty cache.  Furthermore, elevator-lowlatency makes
the blkdev layer much more fair under load.

Andrea

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: iosched: impact of streaming read on read-many-files
  2003-02-21  5:27 ` iosched: impact of streaming read " Andrew Morton
@ 2003-02-21 10:40   ` Andrea Arcangeli
  2003-02-21 10:55     ` Nick Piggin
  2003-02-21 21:11     ` Andrew Morton
  0 siblings, 2 replies; 27+ messages in thread
From: Andrea Arcangeli @ 2003-02-21 10:40 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

On Thu, Feb 20, 2003 at 09:27:58PM -0800, Andrew Morton wrote:
> 
> Here we look at what affect a large streaming read has upon an operation
> which reads many small files from the same disk.
> 
> A single streaming read was set up with:
> 
> 	while true
> 	do
> 	        cat 512M-file > /dev/null
> 	done
> 
> and we measure how long it takes to read all the files from a 2.4.19 kernel
> tree off the same disk with
> 
> 	time (find kernel-tree -type f | xargs cat > /dev/null)
> 
> 
> 
> 2.4.21-pre4:	31 minutes 30 seconds
> 
> 2.5.61+hacks:	3 minutes 39 seconds
> 
> 2.5.61+CFQ:	5 minutes 7 seconds (*)
> 
> 2.5.61+AS:	17 seconds
> 
> 
> 
> 
> 
> * CFQ performed very strangely here.  Tremendous amount of seeking and a

Strangely?  This is the *feature*.  Benchmarking CFQ in terms of wall-clock
time is pointless; apparently you don't understand the whole point of CFQ,
and you keep benchmarking as if CFQ were designed for a database workload.
The only thing you care about if you run CFQ is the worst-case latency of
read, never the throughput; 128k/sec is more than enough as long as you
never wait 2 seconds before you can get the next 128k.

Take tiobench with a single thread in read mode, keep it running in the
background and collect the worst-case latency; only *then* will you have a
chance to see a benefit.  CFQ is anything but a general-purpose elevator.
You must never use CFQ if your objective is throughput and you benchmark
the global workload rather than the worst-case latency of every single read
or write-sync syscall.

CFQ is made for multimedia desktop usage only: you want to be sure mplayer
or xmms will never skip frames, not for parallel cp's reading floods of data
at max speed like a database with a zillion threads.  For multimedia not to
skip frames, 1M/sec is more than enough bandwidth; it doesn't matter if the
huge database in the background runs much slower, as long as you never skip
a frame.

If you don't mind skipping frames you shouldn't use CFQ, and everything
will run faster, period.

Andrea

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: IO scheduler benchmarking
  2003-02-21 10:31     ` Andrea Arcangeli
@ 2003-02-21 10:51       ` William Lee Irwin III
  2003-02-21 11:08         ` Andrea Arcangeli
  0 siblings, 1 reply; 27+ messages in thread
From: William Lee Irwin III @ 2003-02-21 10:51 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Andrew Morton, David Lang, linux-kernel

On Fri, Feb 21, 2003 at 12:16:24AM -0800, Andrew Morton wrote:
>> Well 2.4 is unreponsive period.  That's due to problems in the VM -
>> processes which are trying to allocate memory get continually DoS'ed
>> by `cp' in page reclaim.

On Fri, Feb 21, 2003 at 11:31:40AM +0100, Andrea Arcangeli wrote:
> this depends on the workload, you may not have that many allocations,
> a echo 1 >/proc/sys/vm/bdflush will fix it shall your workload be hurted
> by too much dirty cache. Furthmore elevator-lowlatency makes
> the blkdev layer much more fair under load.

Restricting io in flight doesn't actually repair the issues raised by
it, but rather avoids them by limiting functionality.

The issue raised here is streaming io competing with processes working
within bounded memory. It's unclear to me how 2.5.x mitigates this but
the effects are far less drastic there. The "fix" you're suggesting is
clamping off the entire machine's io just to contain the working set of
a single process that generates unbounded amounts of dirty data and
inadvertently penalizes other processes via page reclaim, where instead
it should be forced to fairly wait its turn for memory.

-- wli

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: iosched: impact of streaming read on read-many-files
  2003-02-21 10:40   ` Andrea Arcangeli
@ 2003-02-21 10:55     ` Nick Piggin
  2003-02-21 11:23       ` Andrea Arcangeli
  2003-02-21 21:11     ` Andrew Morton
  1 sibling, 1 reply; 27+ messages in thread
From: Nick Piggin @ 2003-02-21 10:55 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Andrew Morton, linux-kernel

Andrea Arcangeli wrote:

>On Thu, Feb 20, 2003 at 09:27:58PM -0800, Andrew Morton wrote:
>
>>Here we look at what affect a large streaming read has upon an operation
>>which reads many small files from the same disk.
>>
>>A single streaming read was set up with:
>>
>>	while true
>>	do
>>	        cat 512M-file > /dev/null
>>	done
>>
>>and we measure how long it takes to read all the files from a 2.4.19 kernel
>>tree off the same disk with
>>
>>	time (find kernel-tree -type f | xargs cat > /dev/null)
>>
>>
>>
>>2.4.21-pre4:	31 minutes 30 seconds
>>
>>2.5.61+hacks:	3 minutes 39 seconds
>>
>>2.5.61+CFQ:	5 minutes 7 seconds (*)
>>
>>2.5.61+AS:	17 seconds
>>
>>
>>
>>
>>
>>* CFQ performed very strangely here.  Tremendous amount of seeking and a
>>
>
>strangely? this is the *feature*. Benchmarking CFQ in function of real
>time is pointless, apparently you don't understand the whole point about
>CFQ and you keep benchmarking like if CFQ was designed for a database
>workload. the only thing you care if you run CFQ is the worst case
>latency of read, never the throughput, 128k/sec is more than enough as
>far as you never wait 2 seconds before you can get the next 128k.
>
>take tiobench with 1 single thread in read mode and keep it running in
>background and collect the worst case latency, only *then* you will have
>a chance to see a benefit. CFQ is all but a generic purpose elevator.
>You must never use CFQ if your object is throughput and you benchmark
>the global workload and not the worst case latency of every single read
>or write-sync syscall.
>
>CFQ is made for multimedia desktop usage only, you want to be sure
>mplayer or xmms will never skip frames, not for parallel cp reading
>floods of data at max speed like a database with zillon of threads. For
>multimedia not to skip frames 1M/sec is  more than enough bandwidth,
>doesn't matter if the huge database in background runs much slower as
>far as you never skip a frame.
>
>If you don't mind to skip frames you shouldn't use CFQ and everything
>will run faster, period.
>
There is actually a point, when you have a number of other IO streams
going on, where your decreased throughput means *maximum* latency goes
up because the round robin doesn't go round fast enough. I guess desktop
loads won't often have a lot of different IO streams.

The anticipatory scheduler isn't so strict about fairness; however, it
will make as good an attempt as CFQ at keeping maximum read latency
below read_expire (actually read_expire*2 in the current implementation).


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: IO scheduler benchmarking
  2003-02-21 10:51       ` William Lee Irwin III
@ 2003-02-21 11:08         ` Andrea Arcangeli
  2003-02-21 11:17           ` Nick Piggin
  2003-02-21 11:34           ` William Lee Irwin III
  0 siblings, 2 replies; 27+ messages in thread
From: Andrea Arcangeli @ 2003-02-21 11:08 UTC (permalink / raw)
  To: William Lee Irwin III, Andrew Morton, David Lang, linux-kernel

On Fri, Feb 21, 2003 at 02:51:46AM -0800, William Lee Irwin III wrote:
> On Fri, Feb 21, 2003 at 12:16:24AM -0800, Andrew Morton wrote:
> >> Well 2.4 is unreponsive period.  That's due to problems in the VM -
> >> processes which are trying to allocate memory get continually DoS'ed
> >> by `cp' in page reclaim.
> 
> On Fri, Feb 21, 2003 at 11:31:40AM +0100, Andrea Arcangeli wrote:
> > this depends on the workload, you may not have that many allocations,
> > a echo 1 >/proc/sys/vm/bdflush will fix it shall your workload be hurted
> > by too much dirty cache. Furthmore elevator-lowlatency makes
> > the blkdev layer much more fair under load.
> 
> Restricting io in flight doesn't actually repair the issues raised by

The amount of I/O that we allow in flight is purely random; there is no
point in allowing several dozen mbytes of I/O in flight on a 64M machine.
My patch fixes that and nothing more.

> it, but rather avoids them by limiting functionality.

If you can show a (throughput) benchmark where you see this limited
functionality I'd be very interested.

Alternatively I can also claim that 2.4 and 2.5 are limiting
functionality too, by limiting the I/O in flight to some hundreds of
megabytes, right?

It's like the dma ring buffer size of a soundcard: if you want low latency
it has to be small, it's as simple as that.  It's a tradeoff between
latency and performance, but the point here is that apparently you gain
nothing with such a huge amount of I/O in flight.  This has nothing to
do with the number of requests; the requests have to be many, or seeks
won't be reordered aggressively, but when everything merges, using all
the requests is pointless and only has the effect of locking
everything in ram, and this screws up the write throttling too, because we
do write throttling on the dirty stuff, not on the locked stuff, and
this is what elevator-lowlatency addresses.

You may argue about the in-flight I/O limit I chose, but really
the default in mainline looks like overkill to me for generic hardware.

> The issue raised here is streaming io competing with processes working
> within bounded memory. It's unclear to me how 2.5.x mitigates this but
> the effects are far less drastic there. The "fix" you're suggesting is
> clamping off the entire machine's io just to contain the working set of

Show me this clamping off, please.  Take 2.4.21pre4aa3 and trash it
compared to 2.4.21pre4 with the minimum 32M queue; I'd be very
interested.  If I have a problem I must fix it ASAP, but all the benchmarks
are in green so far and the behaviour was very bad before these fixes, so
go ahead and show me red and you'll do me a big favour.  Either that or
you're wrong that I'm clamping off anything.

Just to be clear, this whole thing has nothing to do with the elevator,
or CFQ or whatever; it is only related to the worthwhile amount of
in-flight I/O needed to keep the disk always running.

> a single process that generates unbounded amounts of dirty data and
> inadvertently penalizes other processes via page reclaim, where instead
> it should be forced to fairly wait its turn for memory.
> 
> -- wli


Andrea

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: IO scheduler benchmarking
  2003-02-21 11:08         ` Andrea Arcangeli
@ 2003-02-21 11:17           ` Nick Piggin
  2003-02-21 11:41             ` Andrea Arcangeli
  2003-02-21 11:34           ` William Lee Irwin III
  1 sibling, 1 reply; 27+ messages in thread
From: Nick Piggin @ 2003-02-21 11:17 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: William Lee Irwin III, Andrew Morton, David Lang, linux-kernel

Andrea Arcangeli wrote:

>it's like a dma ring buffer size of a soundcard, if you want low latency
>it has to be small, it's as simple as that. It's a tradeoff between
>
Although the dma buffer is strictly FIFO, so the situation isn't
quite so simple for disk IO.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: iosched: impact of streaming read on read-many-files
  2003-02-21 10:55     ` Nick Piggin
@ 2003-02-21 11:23       ` Andrea Arcangeli
  0 siblings, 0 replies; 27+ messages in thread
From: Andrea Arcangeli @ 2003-02-21 11:23 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, linux-kernel

On Fri, Feb 21, 2003 at 09:55:00PM +1100, Nick Piggin wrote:
> There is actually a point when you have a number of other IO streams
> going on where your decreased throughput means *maximum* latency goes
> up because robin doesn't go round fast enough. I guess desktop loads

This is why it would be nice to set a prctl in the task structure that
marks the latency-sensitive tasks, so you could leave CFQ enabled all the
time and only xmms and mplayer would take advantage of it (unless you run
them with --skip-frame-is-ok).  CFQ as a function of pid is the simplest,
closest transparent approximation of that.

Andrea

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: IO scheduler benchmarking
  2003-02-21 11:08         ` Andrea Arcangeli
  2003-02-21 11:17           ` Nick Piggin
@ 2003-02-21 11:34           ` William Lee Irwin III
  2003-02-21 12:38             ` Andrea Arcangeli
  1 sibling, 1 reply; 27+ messages in thread
From: William Lee Irwin III @ 2003-02-21 11:34 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Andrew Morton, David Lang, linux-kernel

On Fri, Feb 21, 2003 at 02:51:46AM -0800, William Lee Irwin III wrote:
>> Restricting io in flight doesn't actually repair the issues raised by

On Fri, Feb 21, 2003 at 12:08:07PM +0100, Andrea Arcangeli wrote:
> the amount of I/O that we allow in flight is purerly random, there is no
> point to allow several dozen mbytes of I/O in flight on a 64M machine,
> my patch fixes that and nothing more.

I was arguing against having any preset limit whatsoever.


On Fri, Feb 21, 2003 at 02:51:46AM -0800, William Lee Irwin III wrote:
>> it, but rather avoids them by limiting functionality.

On Fri, Feb 21, 2003 at 12:08:07PM +0100, Andrea Arcangeli wrote:
> If you can show a (throughput) benchmark where you see this limited
> functionalty I'd be very interested.
> Alternatively I can also claim that 2.4 and 2.5 are limiting
> functionalty too by limiting the I/O in flight to some hundred megabytes
> right?

This has nothing to do with benchmarks.

Counterexample: suppose the process generating dirty data is the only
one running. The machine's effective RAM capacity is then limited to
the dirty data limit plus some small constant by this io in flight
limitation.

This functionality is not to be dismissed lightly: changing the /proc/
business is root-only, hence it may not be within the power of a victim
of a poor setting to adjust it.


On Fri, Feb 21, 2003 at 12:08:07PM +0100, Andrea Arcangeli wrote:
> it's like a dma ring buffer size of a soundcard, if you want low latency
> it has to be small, it's as simple as that. It's a tradeoff between
> latency and performance, but the point here is that apparently you gain
> nothing with such an huge amount of I/O in flight. This has nothing to
> do with the number of requests, the requests have to be a lot, or seeks
> won't be reordered aggressively, but when everything merges using all
> the requests is pointless and it only has the effect of locking
> everything in ram, and this screw the write throttling too, because we
> do write throttling on the dirty stuff, not on the locked stuff, and
> this is what elevator-lowlatency address.
> You may argue on the amount of in flight I/O limit I choosen, but really
> the default in mainlines looks overkill to me for generic hardware.

It's not a question of gain but rather immunity to reconfigurations.
Redoing it for all the hardware raises a tuning issue, and in truth
all I've ever wound up doing is turning it off because I've got so
much RAM that various benchmarks could literally be done in-core as a
first pass, then sorted, then sprayed out to disk in block-order. And
a bunch of open benchmarks are basically just in-core spinlock exercise.
(Ignore the fact there was a benchmark mentioned.)

Amortizing seeks and incrementally sorting and so on generally require
large buffers, and if you have the RAM, the kernel should use it.

But more seriously, global io in flight limits are truly worthless, if
anything it should be per-process, but even that's inadequate as it
requires retuning for varying io speeds. Limit enforcement needs to be
(1) localized
(2) self-tuned via block layer feedback

If I understand the code properly, 2.5.x has (2) but not (1).


On Fri, Feb 21, 2003 at 02:51:46AM -0800, William Lee Irwin III wrote:
>> The issue raised here is streaming io competing with processes working
>> within bounded memory. It's unclear to me how 2.5.x mitigates this but
>> the effects are far less drastic there. The "fix" you're suggesting is
>> clamping off the entire machine's io just to contain the working set of

On Fri, Feb 21, 2003 at 12:08:07PM +0100, Andrea Arcangeli wrote:
> show me this claimping off please. take 2.4.21pre4aa3 and trash it
> compared to 2.4.21pre4 with the minimum 32M queue, I'd be very
> interested, if I've a problem I must fix it ASAP, but all the benchmarks
> are in green so far and the behaviour was very bad before these fixes,
> go ahead and show me red and you'll make me a big favour. Either that or
> you're wrong that I'm claimping off anything.
> Just to be clear, this whole thing has nothing to do with the elevator,
> or the CFQ or whatever, it only is related to the worthwhile amount of
> in flight I/O to keep the disk always running.

You named the clamping off yourself. A dozen MB on a 64MB box, 32MB on
2.4.21pre4. Some limit that's a hard upper bound but resettable via a
sysctl or /proc/ or something. Testing 2.4.x-based trees might be a
little painful since I'd have to debug why 2.4.x stopped booting on my
boxen, which would take me a bit far afield from my current hacking.


On Fri, Feb 21, 2003 at 02:51:46AM -0800, William Lee Irwin III wrote:
>> a single process that generates unbounded amounts of dirty data and
>> inadvertently penalizes other processes via page reclaim, where instead
>> it should be forced to fairly wait its turn for memory.

I believe I said something important here. =)

The reason why this _should_ be the case is because processes stealing
from each other is the kind of mutual interference that leads to things
like Mozilla taking ages to swap in because other things were running
for a while and it wasn't and so on.


-- wli

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: IO scheduler benchmarking
  2003-02-21 11:17           ` Nick Piggin
@ 2003-02-21 11:41             ` Andrea Arcangeli
  2003-02-21 21:25               ` Andrew Morton
  0 siblings, 1 reply; 27+ messages in thread
From: Andrea Arcangeli @ 2003-02-21 11:41 UTC (permalink / raw)
  To: Nick Piggin
  Cc: William Lee Irwin III, Andrew Morton, David Lang, linux-kernel

On Fri, Feb 21, 2003 at 10:17:55PM +1100, Nick Piggin wrote:
> Andrea Arcangeli wrote:
> 
> >it's like a dma ring buffer size of a soundcard, if you want low latency
> >it has to be small, it's as simple as that. It's a tradeoff between
> >
> Although the dma buffer is strictly FIFO, so the situation isn't
> quite so simple for disk IO.

In general (without CFQ, or its opposite extreme, an unfair starving
elevator where you're stuck regardless of the size of the queue) a larger
queue will mean higher latencies in the presence of a flood of async load,
just as in a dma buffer.  This is obvious for the noop elevator, for
example.

I'm speaking about a stable, non-starving, fast, default elevator
(something like 2.4 mainline, incidentally) and for that the
similarity with a dma buffer definitely applies: there will be a latency
effect coming from the size of the queue (even ignoring the other issues
that the load of locked buffers introduces).

The whole idea of CFQ is to make some workloads low-latency
independent of the size of the async queue.  But still (even with CFQ)
you have all the other problems: write throttling, the worthless amount of
locked ram, and even the time wasted on lots of full, merely ordered
requests in the elevator (yeah, I know, if you use the noop elevator it
won't waste almost any time, but again that is not what most people will
use).  I don't buy Andrew complaining about the write throttling when he
still allows several dozen mbytes of ram in flight and invisible to the
VM; I mean, before complaining about write throttling the excessive,
worthless amount of locked buffers must be fixed, and so I did, and it
works very well judging from the feedback I have had so far.

You can take 2.4.21pre4aa3 and benchmark it as you want if you think I'm
totally wrong; elevator-lowlatency should be trivial to apply and
back out (benchmarking against pre4 would be unfair).

Andrea

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: IO scheduler benchmarking
  2003-02-21 11:34           ` William Lee Irwin III
@ 2003-02-21 12:38             ` Andrea Arcangeli
  0 siblings, 0 replies; 27+ messages in thread
From: Andrea Arcangeli @ 2003-02-21 12:38 UTC (permalink / raw)
  To: William Lee Irwin III, Andrew Morton, David Lang, linux-kernel

On Fri, Feb 21, 2003 at 03:34:36AM -0800, William Lee Irwin III wrote:
> On Fri, Feb 21, 2003 at 02:51:46AM -0800, William Lee Irwin III wrote:
> >> Restricting io in flight doesn't actually repair the issues raised by
> 
> On Fri, Feb 21, 2003 at 12:08:07PM +0100, Andrea Arcangeli wrote:
> > the amount of I/O that we allow in flight is purerly random, there is no
> > point to allow several dozen mbytes of I/O in flight on a 64M machine,
> > my patch fixes that and nothing more.
> 
> I was arguing against having any preset limit whatsoever.

The preset limit exists in every linux kernel out there.  It should be
mandated by the low-level device driver; I don't allow that yet, but it
should be trivial to extend with just an additional per-queue int, it's
just an implementation matter.

> On Fri, Feb 21, 2003 at 02:51:46AM -0800, William Lee Irwin III wrote:
> >> it, but rather avoids them by limiting functionality.
> 
> On Fri, Feb 21, 2003 at 12:08:07PM +0100, Andrea Arcangeli wrote:
> > If you can show a (throughput) benchmark where you see this limited
> > functionalty I'd be very interested.
> > Alternatively I can also claim that 2.4 and 2.5 are limiting
> > functionalty too by limiting the I/O in flight to some hundred megabytes
> > right?
> 
> This has nothing to do with benchmarks.

It has to.  You claimed I limited functionality; if you can't measure it
in any way (or at least demonstrate it with math), it doesn't exist.

> Counterexample: suppose the process generating dirty data is the only
> one running. The machine's effective RAM capacity is then limited to
> the dirty data limit plus some small constant by this io in flight
> limitation.

Only the free memory and cache are accounted here; while this task
allocates ram with malloc, the amount of dirty ram will be reduced
accordingly, so what you said is far from reality.  We aren't 100% accurate
in the cache-level accounting, true, but we're 100% accurate in the
anonymous memory accounting.

> This functionality is not to be dismissed lightly: changing the /proc/
> business is root-only, hence it may not be within the power of a victim
> of a poor setting to adjust it.
> 
> 
> On Fri, Feb 21, 2003 at 12:08:07PM +0100, Andrea Arcangeli wrote:
> > it's like a dma ring buffer size of a soundcard, if you want low latency
> > it has to be small, it's as simple as that. It's a tradeoff between
> > latency and performance, but the point here is that apparently you gain
> > nothing with such an huge amount of I/O in flight. This has nothing to
> > do with the number of requests, the requests have to be a lot, or seeks
> > won't be reordered aggressively, but when everything merges using all
> > the requests is pointless and it only has the effect of locking
> > everything in ram, and this screw the write throttling too, because we
> > do write throttling on the dirty stuff, not on the locked stuff, and
> > this is what elevator-lowlatency address.
> > You may argue on the amount of in flight I/O limit I choosen, but really
> > the default in mainlines looks overkill to me for generic hardware.
> 
> It's not a question of gain but rather immunity to reconfigurations.

You mean immunity to reconfigurations of machines with more than 4G of
ram, maybe, and you are OK with completely ignoring the latency effects of
the overkill queue size.  Everything smaller can be affected by it, and not
only in terms of latency, especially if you have multiple spindles that
literally multiply the fixed maximum amount of in-flight I/O.

> Redoing it for all the hardware raises a tuning issue, and in truth
> all I've ever wound up doing is turning it off because I've got so
> much RAM that various benchmarks could literally be done in-core as a
> first pass, then sorted, then sprayed out to disk in block-order. And
> a bunch of open benchmarks are basically just in-core spinlock exercise.
> (Ignore the fact there was a benchmark mentioned.)
> 
> Amortizing seeks and incrementally sorting and so on generally require
> large buffers, and if you have the RAM, the kernel should use it.
> 
> But more seriously, global io in flight limits are truly worthless, if
> anything it should be per-process, but even that's inadequate as it

This doesn't make any sense; the limit always exists, it has to.  If you
drop it the machine will die, deadlocking in a few milliseconds: the
whole plugging and write throttling logic that completely drives the
I/O subsystem totally depends on a limit on the in-flight I/O.

> requires retuning for varying io speeds. Limit enforcement needs to be
> (1) localized
> (2) self-tuned via block layer feedback
> 
> If I understand the code properly, 2.5.x has (2) but not (1).

2.5 has the unplugging logic, so it definitely has a high limit on in-flight
I/O too, no matter what elevator or whatever; without the fixed limit
2.5 would die too, like any other linux kernel out there I have ever seen.

> 
> On Fri, Feb 21, 2003 at 02:51:46AM -0800, William Lee Irwin III wrote:
> >> The issue raised here is streaming io competing with processes working
> >> within bounded memory. It's unclear to me how 2.5.x mitigates this but
> >> the effects are far less drastic there. The "fix" you're suggesting is
> >> clamping off the entire machine's io just to contain the working set of
> 
> On Fri, Feb 21, 2003 at 12:08:07PM +0100, Andrea Arcangeli wrote:
> > show me this claimping off please. take 2.4.21pre4aa3 and trash it
> > compared to 2.4.21pre4 with the minimum 32M queue, I'd be very
> > interested, if I've a problem I must fix it ASAP, but all the benchmarks
> > are in green so far and the behaviour was very bad before these fixes,
> > go ahead and show me red and you'll make me a big favour. Either that or
> > you're wrong that I'm claimping off anything.
> > Just to be clear, this whole thing has nothing to do with the elevator,
> > or the CFQ or whatever, it only is related to the worthwhile amount of
> > in flight I/O to keep the disk always running.
> 
> You named the clamping off yourself. A dozen MB on a 64MB box, 32MB on
> 2.4.21pre4. Some limit that's a hard upper bound but resettable via a
> sysctl or /proc/ or something. Testing 2.4.x-based trees might be a
> little painful since I'd have to debug why 2.4.x stopped booting on my
> boxen, which would take me a bit far afield from my current hacking.

2.4.21pre4aa3 has to boot on it.

> On Fri, Feb 21, 2003 at 02:51:46AM -0800, William Lee Irwin III wrote:
> >> a single process that generates unbounded amounts of dirty data and
> >> inadvertently penalizes other processes via page reclaim, where instead
> >> it should be forced to fairly wait its turn for memory.
> 
> I believe I said something important here. =)

You're arguing about the async flushing heuristic, which should be made
smarter instead of taking 50% of the freeable memory (not anonymous
memory). This isn't black-and-white stuff and you shouldn't mix issues;
it has nothing to do with the blkdev plugging logic driven by the limit
on in-flight I/O (in every linux kernel out there, ever).

> The reason why this _should_ be the case is because processes stealing
> from each other is the kind of mutual interference that leads to things
> like Mozilla taking ages to swap in because other things were running
> for a while and it wasn't and so on.
> 
> 
> -- wli


Andrea

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: iosched: impact of streaming read on read-many-files
  2003-02-21 10:40   ` Andrea Arcangeli
  2003-02-21 10:55     ` Nick Piggin
@ 2003-02-21 21:11     ` Andrew Morton
  2003-02-23 15:16       ` Andrea Arcangeli
  2003-02-25 12:02       ` Pavel Machek
  1 sibling, 2 replies; 27+ messages in thread
From: Andrew Morton @ 2003-02-21 21:11 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-kernel

Andrea Arcangeli <andrea@suse.de> wrote:
>
> CFQ is made for multimedia desktop usage only, you want to be sure
> mplayer or xmms will never skip frames, not for parallel cp reading
> floods of data at max speed like a database with zillon of threads. For
> multimedia not to skip frames 1M/sec is  more than enough bandwidth,
> doesn't matter if the huge database in background runs much slower as
> far as you never skip a frame.

These applications are broken.  The kernel shouldn't be bending over
backwards trying to fix them up.  Because this will never ever work as well
as fixing the applications.

The correct way to design such an application is to use an RT thread to
perform the display/audio device I/O and a non-RT thread to perform the disk
I/O.  The disk IO thread keeps the shared 8 megabyte buffer full.  The RT
thread mlocks that buffer.
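
A minimal userspace sketch of that split, assuming POSIX threads,
SCHED_FIFO for the playback side, and an illustrative
write_to_audio_device() stub (buffer and chunk sizes, file handling and
shutdown are placeholders, not any particular player's code):

	/* Sketch: a non-RT reader thread keeps an mlock()ed ring buffer full,
	 * an RT thread drains it to the audio device and never touches disk.
	 * All names, sizes and the audio stub are illustrative only. */
	#include <pthread.h>
	#include <sched.h>
	#include <sys/mman.h>
	#include <fcntl.h>
	#include <unistd.h>

	#define BUF_SIZE (8 * 1024 * 1024)	/* the shared 8MB buffer */
	#define CHUNK    (64 * 1024)		/* one audio-sized bite */

	static char buf[BUF_SIZE];
	static size_t fill, rd, wr;	/* bytes buffered, read/write offsets */
	static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
	static pthread_cond_t  wake = PTHREAD_COND_INITIALIZER;

	static void write_to_audio_device(const char *p, size_t n)
	{
		(void)p; (void)n;	/* stub for the real audio output */
	}

	static void *disk_reader(void *arg)	/* non-RT: fills the buffer */
	{
		int fd = open(arg, O_RDONLY);
		ssize_t n;

		for (;;) {
			pthread_mutex_lock(&lock);
			while (fill > BUF_SIZE - CHUNK)	/* no room for a chunk */
				pthread_cond_wait(&wake, &lock);
			pthread_mutex_unlock(&lock);

			n = read(fd, buf + wr, CHUNK);	/* may block on disk */
			if (n <= 0)
				break;		/* EOF; shutdown handling omitted */

			pthread_mutex_lock(&lock);
			wr = (wr + n) % BUF_SIZE;
			fill += n;
			pthread_cond_signal(&wake);
			pthread_mutex_unlock(&lock);
		}
		return NULL;
	}

	int main(int argc, char **argv)
	{
		pthread_t reader;
		struct sched_param sp;

		mlock(buf, BUF_SIZE);		/* RT side never faults on it */
		pthread_create(&reader, NULL, disk_reader, argv[1]);

		sp.sched_priority = 1;		/* playback thread goes RT */
		pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);

		for (;;) {
			pthread_mutex_lock(&lock);
			while (fill < CHUNK)	/* not enough buffered yet */
				pthread_cond_wait(&wake, &lock);
			pthread_mutex_unlock(&lock);

			write_to_audio_device(buf + rd, CHUNK);
			pthread_mutex_lock(&lock);
			rd = (rd + CHUNK) % BUF_SIZE;
			fill -= CHUNK;
			pthread_cond_signal(&wake);
			pthread_mutex_unlock(&lock);
		}
	}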

The deadline scheduler will handle that OK.  The anticipatory scheduler
(which is also deadline) will handle it better.



If an RT thread performs disk I/O it is bust, and we should not try to fix
it.  The only place where VFS/VM/block needs to care for RT tasks is in the
page allocator.  Because even well-designed RT tasks need to allocate pages.

The 2.4 page allocator has a tendency to cause 5-10 second stalls for a
single page allocation when the system is under writeout load.  That is fixed
in 2.5, but special-casing RT tasks in the allocator would make sense.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: IO scheduler benchmarking
  2003-02-21 11:41             ` Andrea Arcangeli
@ 2003-02-21 21:25               ` Andrew Morton
  2003-02-23 15:09                 ` Andrea Arcangeli
  0 siblings, 1 reply; 27+ messages in thread
From: Andrew Morton @ 2003-02-21 21:25 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: piggin, wli, david.lang, linux-kernel

Andrea Arcangeli <andrea@suse.de> wrote:
>
> I don't
> buy Andrew complaining about the write throttling when he still allows
> several dozen mbytes of ram in flight and invisible to the VM,

The 2.5 VM accounts for these pages (/proc/meminfo:Writeback) and throttling
decisions are made upon the sum of dirty+writeback pages.

The 2.5 VFS limits the amount of dirty+writeback memory, not just the amount
of dirty memory.

Throttling in both write() and the page allocator is fully decoupled from the
queue size.  An 8192-slot (4 gigabyte) queue on a 32M machine has been
tested.

The only tasks which block in get_request_wait() are the ones which we want
to block there: heavy writers.

Page reclaim will never block page allocators in get_request_wait().  That
causes terrible latency if the writer is still active.

Page reclaim will never block a page-allocating process on I/O against a
particular disk block.  Allocators are instead throttled against _any_ write
I/O completion.  (This is broken in several ways, but it works well enough to
leave it alone I think).
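
As a toy model of the rule described above (not the actual 2.5 code:
every helper name and the 40% ratio are made up for illustration), the
throttling decision looks roughly like this:

	/* A writer blocks on the sum of dirty + writeback pages, never on
	 * queue slots.  Names and numbers are illustrative only. */
	#include <unistd.h>

	static unsigned long nr_dirty, nr_writeback;
	static unsigned long total_pages = 64 * 1024;	/* 256MB of 4k pages */

	static void start_writeback(void) { /* stub: dirty -> writeback */ }
	static void wait_for_any_write_completion(void) { usleep(10000); /* stub */ }

	static void balance_dirty_sketch(void)
	{
		unsigned long limit = total_pages * 40 / 100;

		while (nr_dirty + nr_writeback > limit) {
			start_writeback();		   /* kick writeout */
			wait_for_any_write_completion();   /* any completion,
							      not one block */
		}
	}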


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: IO scheduler benchmarking
  2003-02-21 21:25               ` Andrew Morton
@ 2003-02-23 15:09                 ` Andrea Arcangeli
  0 siblings, 0 replies; 27+ messages in thread
From: Andrea Arcangeli @ 2003-02-23 15:09 UTC (permalink / raw)
  To: Andrew Morton; +Cc: piggin, wli, david.lang, linux-kernel

On Fri, Feb 21, 2003 at 01:25:49PM -0800, Andrew Morton wrote:
> Andrea Arcangeli <andrea@suse.de> wrote:
> >
> > I don't
> > buy Andrew complaining about the write throttling when he still allows
> > several dozen mbytes of ram in flight and invisible to the VM,
> 
> The 2.5 VM accounts for these pages (/proc/meminfo:Writeback) and throttling
> decisions are made upon the sum of dirty+writeback pages.
> 
> The 2.5 VFS limits the amount of dirty+writeback memory, not just the amount
> of dirty memory.
> 
> Throttling in both write() and the page allocator is fully decoupled from the
> queue size.  An 8192-slot (4 gigabyte) queue on a 32M machine has been
> tested.

The 32M case is probably fine with it: you moved the limit on in-flight
I/O into the writeback layer, and the write throttling will limit the
amount of RAM in flight to 16M or so. It would be much more interesting
to see some latency benchmark on an 8G machine with 4G simultaneously
locked in the I/O queue. A 4G queue on an IDE disk can only waste lots
of CPU and memory resources, increasing the latency too, without
providing any benefit. Your 4G queue thing provides only disadvantages
as far as I can tell.

> 
> The only tasks which block in get_request_wait() are the ones which we want
> to block there: heavy writers.
> 
> Page reclaim will never block page allocators in get_request_wait().  That
> causes terrible latency if the writer is still active.
> 
> Page reclaim will never block a page-allocating process on I/O against a
> particular disk block.  Allocators are instead throttled against _any_ write
> I/O completion.  (This is broken in several ways, but it works well enough to
> leave it alone I think).

2.4 on desktop boxes could fill all of RAM with locked and dirty stuff
because of the excessive size of the queue, so any comparison with 2.4
in terms of page reclaim should be repeated on 2.4.21pre4aa3 IMHO, where
the VM has a chance not to find the machine in a collapsed state in
which the only thing it can do is either wait or panic(); feel free to
choose which you prefer.

Andrea

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: iosched: impact of streaming read on read-many-files
  2003-02-21 21:11     ` Andrew Morton
@ 2003-02-23 15:16       ` Andrea Arcangeli
  2003-02-25 12:02       ` Pavel Machek
  1 sibling, 0 replies; 27+ messages in thread
From: Andrea Arcangeli @ 2003-02-23 15:16 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

On Fri, Feb 21, 2003 at 01:11:58PM -0800, Andrew Morton wrote:
> Andrea Arcangeli <andrea@suse.de> wrote:
> >
> > CFQ is made for multimedia desktop usage only, you want to be sure
> > mplayer or xmms will never skip frames, not for parallel cp reading
> > floods of data at max speed like a database with zillon of threads. For
> > multimedia not to skip frames 1M/sec is  more than enough bandwidth,
> > doesn't matter if the huge database in background runs much slower as
> > far as you never skip a frame.
> 
> These applications are broken.  The kernel shouldn't be bending over
> backwards trying to fix them up.  Because this will never ever work as well
> as fixing the applications.

Disagree: if the kernel doesn't provide a low-latency elevator of some
sort there's no way to work around it in userspace with just a
partial-memory buffer (unless you do [1]).

> The correct way to design such an application is to use an RT thread to
> perform the display/audio device I/O and a non-RT thread to perform the disk
> I/O.  The disk IO thread keeps the shared 8 megabyte buffer full.  The RT
> thread mlocks that buffer.

Having huge buffering introduces the 8M latency during startup, which
is very annoying if the machine is under high load (especially if you
want to apply realtime effects to the audio; ever tried the xmms
equalizer with an 8M buffer?), and it still doesn't guarantee that 8
megs are enough. Secondly, 8 mbytes mlocked is quite a lot for a 128M
desktop. Third, applications are already doing what you suggest and you
can still hear occasional skips during heavy I/O, i.e. buffering is not
enough if the elevator only cares about global throughput or if the
queue is huge (and incidentally you're not using SFQ/CFQ). It is also
possible you don't know what you want to read until the last minute.

[1] Along your lines you can also buy some gigabytes of RAM and copy
the whole of the multimedia data into ramfs before playback ;) I mean,
I agree it's a problem that can be solved by throwing money at the
hardware.

> The deadline scheduler will handle that OK.  The anticipatory scheduler
> (which is also deadline) will handle it better.
> 
> 
> 
> If an RT thread performs disk I/O it is bust, and we should not try to fix
> it.  The only place where VFS/VM/block needs to care for RT tasks is in the
> page allocator.  Because even well-designed RT tasks need to allocate pages.
> 
> The 2.4 page allocator has a tendency to cause 5-10 second stalls for a
> single page allocation when the system is under writeout load.  That is fixed
> in 2.5, but special-casing RT tasks in the allocator would make sense.

The main issue that matters here is not the VM but the blkdev layer,
and there you never know whether the I/O was submitted by an RT task or
not.

And btw, the right design for such an app is really to use async I/O,
not to fork off a worthless thread for the I/O.
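
For what it's worth, a minimal sketch of that with libaio (the file
name, sizes and the single outstanding request are illustrative, error
checking is omitted, and note that kernels of this era only do truly
asynchronous file reads for O_DIRECT):

	#define _GNU_SOURCE		/* for O_DIRECT */
	#include <libaio.h>
	#include <fcntl.h>
	#include <stdlib.h>

	int main(void)
	{
		io_context_t ctx = 0;
		struct iocb cb;
		struct iocb *cbs[1] = { &cb };
		struct io_event ev;
		void *buf;
		int fd = open("/some/media/file", O_RDONLY | O_DIRECT);

		posix_memalign(&buf, 4096, 64 * 1024);	/* O_DIRECT wants alignment */
		io_setup(8, &ctx);			/* room for 8 in-flight iocbs */
		io_prep_pread(&cb, fd, buf, 64 * 1024, 0);
		io_submit(ctx, 1, cbs);			/* queue it, don't block */

		/* ... decode/play already-buffered data here ... */

		io_getevents(ctx, 1, 1, &ev, NULL);	/* reap when convenient */
		io_destroy(ctx);
		return 0;
	}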

Andrea

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: iosched: impact of streaming read on read-many-files
  2003-02-21 21:11     ` Andrew Morton
  2003-02-23 15:16       ` Andrea Arcangeli
@ 2003-02-25 12:02       ` Pavel Machek
  1 sibling, 0 replies; 27+ messages in thread
From: Pavel Machek @ 2003-02-25 12:02 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andrea Arcangeli, linux-kernel

Hi!

> > mplayer or xmms will never skip frames, not for parallel cp reading
> > floods of data at max speed like a database with zillon of threads. For
> > multimedia not to skip frames 1M/sec is  more than enough bandwidth,
> > doesn't matter if the huge database in background runs much slower as
> > far as you never skip a frame.
> 
> These applications are broken.  The kernel shouldn't be bending over
> backwards trying to fix them up.  Because this will never ever work as well
> as fixing the applications.
> 
> The correct way to design such an application is to use an RT thread to
> perform the display/audio device I/O and a non-RT thread to perform the disk

I do not think this can be done easily. For the mplayer case you'd need
to mlock the X server...

And emacs/vi/all interactive tasks are in a similar situation (latency
matters); are you going to make them all realtime?

-- 
				Pavel
Written on sharp zaurus, because my Velo1 broke. If you have Velo you don't need...


^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2003-02-24 20:27 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-02-21  5:23 IO scheduler benchmarking Andrew Morton
2003-02-21  5:23 ` iosched: parallel streaming reads Andrew Morton
2003-02-21  5:24 ` iosched: effect of streaming write on interactivity Andrew Morton
2003-02-21  5:25 ` iosched: effect of streaming read " Andrew Morton
2003-02-21  5:25 ` iosched: time to copy many small files Andrew Morton
2003-02-21  5:26 ` iosched: concurrent reads of " Andrew Morton
2003-02-21  5:27 ` iosched: impact of streaming write on streaming read Andrew Morton
2003-02-21  5:27 ` iosched: impact of streaming write on read-many-files Andrew Morton
2003-02-21  5:27 ` iosched: impact of streaming read " Andrew Morton
2003-02-21 10:40   ` Andrea Arcangeli
2003-02-21 10:55     ` Nick Piggin
2003-02-21 11:23       ` Andrea Arcangeli
2003-02-21 21:11     ` Andrew Morton
2003-02-23 15:16       ` Andrea Arcangeli
2003-02-25 12:02       ` Pavel Machek
2003-02-21  5:28 ` iosched: effect of streaming read on streaming write Andrew Morton
2003-02-21  6:51 ` IO scheduler benchmarking David Lang
2003-02-21  8:16   ` Andrew Morton
2003-02-21 10:31     ` Andrea Arcangeli
2003-02-21 10:51       ` William Lee Irwin III
2003-02-21 11:08         ` Andrea Arcangeli
2003-02-21 11:17           ` Nick Piggin
2003-02-21 11:41             ` Andrea Arcangeli
2003-02-21 21:25               ` Andrew Morton
2003-02-23 15:09                 ` Andrea Arcangeli
2003-02-21 11:34           ` William Lee Irwin III
2003-02-21 12:38             ` Andrea Arcangeli

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox