* Re: IO scheduler benchmarking
@ 2003-02-25 12:59 rwhron
  2003-02-25 22:09 ` Andrew Morton
  0 siblings, 1 reply; 17+ messages in thread
From: rwhron @ 2003-02-25 12:59 UTC (permalink / raw)
To: akpm; +Cc: linux-kernel

>> Why does 2.5.62-mm2 have higher sequential
>> write latency than 2.5.61-mm1?

> And there are various odd interactions in, at least, ext3.  You did not
> specify which filesystem was used.

ext2

>>                   Thr  MB/sec  CPU%    avg lat  max latency
>> 2.5.62-mm2-as      8   14.76   52.04%   6.14      4.5
>> 2.5.62-mm2-dline   8    9.91   13.90%   9.41       .8
>> 2.5.62-mm2         8    9.83   15.62%   7.38     408.9

> Fishiness.  2.5.62-mm2 _is_ 2.5.62-mm2-as.  Why the 100x difference?

Bad EXTRAVERSION naming on my part.  2.5.62-mm2 _was_ booted with
elevator=cfq.  How it happened:

2.5.61-mm1      tested
2.5.61-mm1-cfq  tested, and elevator=cfq added to boot flags
2.5.62-mm1      tested (elevator=cfq still in lilo boot flags)

Then, to test the other two schedulers, I changed EXTRAVERSION and the
boot flags.

> That 408 seconds looks suspect.

AFAICT, that's the one request in over 500,000 that took the longest.
The numbers are fairly consistent.  How relevant they are is debatable.

> If you want to test write latency, do this:

Your approach is more realistic than tiobench.

> There is a place in VFS where one writing task could accidentally hammer a
> different one.  I cannot trigger that, but I'll fix it up in next -mm.

2.5.62-mm3 or 2.5.63-mm1?  (-mm3 is running now)

--
Randy Hron
http://home.earthlink.net/~rwhron/kernel/bigbox.html

^ permalink raw reply	[flat|nested] 17+ messages in thread
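[A quick illustration of the "one request in over 500,000" point: a single
pathological request dominates the max latency while barely moving the
average.  This is a sketch with made-up numbers approximating the figures
above, not tiobench output.]

```python
# Sketch: 500,000 requests at ~7 ms each, plus one pathological
# 408.9 s request.  The outlier sets the max but barely moves the avg.
latencies = [0.007] * 500_000   # seconds
latencies[0] = 408.9            # the single worst request

avg = sum(latencies) / len(latencies)
print(f"avg = {avg * 1000:.2f} ms, max = {max(latencies):.1f} s")
# avg stays near 7.8 ms while max is 408.9 s
```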
* Re: IO scheduler benchmarking
  2003-02-25 12:59 IO scheduler benchmarking rwhron
@ 2003-02-25 22:09 ` Andrew Morton
  0 siblings, 0 replies; 17+ messages in thread
From: Andrew Morton @ 2003-02-25 22:09 UTC (permalink / raw)
To: rwhron; +Cc: linux-kernel

rwhron@earthlink.net wrote:
>
> >> Why does 2.5.62-mm2 have higher sequential
> >> write latency than 2.5.61-mm1?
>
> > And there are various odd interactions in, at least, ext3.  You did not
> > specify which filesystem was used.
>
> ext2
>
> >>                   Thr  MB/sec  CPU%    avg lat  max latency
> >> 2.5.62-mm2-as      8   14.76   52.04%   6.14      4.5
> >> 2.5.62-mm2-dline   8    9.91   13.90%   9.41       .8
> >> 2.5.62-mm2         8    9.83   15.62%   7.38     408.9
>
> > Fishiness.  2.5.62-mm2 _is_ 2.5.62-mm2-as.  Why the 100x difference?
>
> Bad EXTRAVERSION naming on my part.  2.5.62-mm2 _was_ booted with
> elevator=cfq.
>
> ...
> > That 408 seconds looks suspect.
>
> AFAICT, that's the one request in over 500,000 that took the longest.
> The numbers are fairly consistent.  How relevant they are is debatable.

OK.  When I was testing CFQ I saw some odd behaviour, such as a 100%
cessation of reads for periods of up to ten seconds.  So there is some
sort of bug in there, and until that is understood we should not conclude
anything at all about CFQ from this testing.

> 2.5.62-mm3 or 2.5.63-mm1?  (-mm3 is running now)

Well I'm showing about seven more AS patches since 2.5.63-mm1 already, so
this is a bit of a moving target.  Sorry.

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: IO scheduler benchmarking
@ 2003-02-25 21:57 rwhron
0 siblings, 0 replies; 17+ messages in thread
From: rwhron @ 2003-02-25 21:57 UTC (permalink / raw)
To: linux-kernel; +Cc: akpm
> Why does 2.5.62-mm2 have higher sequential
> write latency than 2.5.61-mm1?
Anticipatory scheduler tiobench profile on uniprocessor:
2.5.61-mm1 2.5.62-mm2
total 1993387 1933241
default_idle 1873179 1826650
system_call 49838 43036
get_offset_tsc 21905 20883
do_schedule 13893 10344
do_gettimeofday 8478 6044
sys_gettimeofday 8077 5153
current_kernel_time 4904 12165
syscall_exit 4047 1243
__wake_up 1274 1000
io_schedule 1166 1039
prepare_to_wait 1093 792
schedule_timeout 612 366
delay_tsc 502 443
get_fpu_cwd 473 376
syscall_call 389 378
math_state_restore 354 271
restore_fpu 329 287
del_timer 325 200
device_not_available 290 377
finish_wait 257 181
add_timer 218 137
io_schedule_timeout 195 72
cpu_idle 193 218
run_timer_softirq 137 33
remove_wait_queue 121 188
eligible_child 106 154
sys_wait4 105 162
work_resched 104 110
ret_from_intr 97 74
dup_task_struct 75 48
add_wait_queue 67 124
__cond_resched 59 69
do_page_fault 55 0
do_softirq 53 12
pte_alloc_one 51 67
release_task 44 55
get_signal_to_deliver 38 43
get_wchan 16 10
mod_timer 15 0
old_mmap 14 19
prepare_to_wait_exclusive 10 32
mm_release 7 0
release_x86_irqs 7 8
sys_getppid 6 5
handle_IRQ_event 4 0
schedule_tail 4 0
kill_proc_info 3 0
device_not_available_emulate 2 0
task_prio 1 1
__down 0 33
__down_failed_interruptible 0 3
init_fpu 0 12
pgd_ctor 0 3
process_timeout 0 2
restore_all 0 2
sys_exit 0 2
--
Randy Hron
http://home.earthlink.net/~rwhron/kernel/bigbox.html
^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: IO scheduler benchmarking
@ 2003-02-25 5:35 rwhron
2003-02-25 6:38 ` Andrew Morton
0 siblings, 1 reply; 17+ messages in thread
From: rwhron @ 2003-02-25 5:35 UTC (permalink / raw)
To: linux-kernel; +Cc: akpm
Executive question: Why does 2.5.62-mm2 have higher sequential
write latency than 2.5.61-mm1?
tiobench numbers on uniprocessor single disk IDE:
The cfq scheduler (2.5.62-mm2 and 2.5.61-cfq) has a big latency
regression.
2.5.61-mm1 (default scheduler (anticipatory?))
2.5.61-mm1-cfq elevator=cfq
2.5.62-mm2-as anticipatory scheduler
2.5.62-mm2-dline elevator=deadline
2.5.62-mm2 elevator=cfq
Thr MB/sec CPU% avg lat max latency
2.5.61-mm1 8 15.68 54.42% 5.87 ms 2.7 seconds
2.5.61-mm1-cfq 8 9.60 15.07% 7.54 393.0
2.5.62-mm2-as 8 14.76 52.04% 6.14 4.5
2.5.62-mm2-dline 8 9.91 13.90% 9.41 .8
2.5.62-mm2 8 9.83 15.62% 7.38 408.9
2.4.21-pre3 8 10.34 27.66% 8.80 1.0
2.4.21-pre3-ac4 8 10.53 28.41% 8.83 .6
2.4.21-pre3aa1 8 18.55 71.95% 3.25 87.6
For most thread counts (8 - 128), the anticipatory scheduler has roughly
45% higher ext2 sequential read throughput. Latency was higher than
deadline, but a lot lower than cfq.
For tiobench sequential writes, the max latency numbers for 2.4.21-pre3
are notably lower than 2.5.62-mm2 (but not as good as 2.5.61-mm1).
This is with 16 threads.
Thr MB/sec CPU% avg lat max latency
2.5.61-mm1 16 18.30 81.12% 9.159 ms 6.1 seconds
2.5.61-mm1-cfq 16 18.03 80.71% 9.086 6.1
2.5.62-mm2-as 16 18.84 84.25% 8.620 47.7
2.5.62-mm2-dline 16 18.53 84.10% 8.967 53.4
2.5.62-mm2 16 18.46 83.28% 8.521 40.8
2.4.21-pre3 16 16.20 65.13% 9.566 8.7
2.4.21-pre3-ac4 16 18.50 83.68% 8.774 11.6
2.4.21-pre3aa1 16 18.49 88.10% 8.455 7.5
Recent uniprocessor benchmarks:
http://home.earthlink.net/~rwhron/kernel/latest.html
More uniprocessor benchmarks:
http://home.earthlink.net/~rwhron/kernel/k6-2-475.html
--
Randy Hron
http://home.earthlink.net/~rwhron/kernel/bigbox.html
latest quad xeon benchmarks:
http://home.earthlink.net/~rwhron/kernel/blatest.html
^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: IO scheduler benchmarking
  2003-02-25  5:35 rwhron
@ 2003-02-25  6:38 ` Andrew Morton
  0 siblings, 0 replies; 17+ messages in thread
From: Andrew Morton @ 2003-02-25 6:38 UTC (permalink / raw)
To: rwhron; +Cc: linux-kernel

rwhron@earthlink.net wrote:
>
> Executive question: Why does 2.5.62-mm2 have higher sequential
> write latency than 2.5.61-mm1?

Well bear in mind that we sometimes need to perform reads to be able to
perform writes.  So the way tiobench measures it, you could be seeing
read-vs-write latencies here.

And there are various odd interactions in, at least, ext3.  You did not
specify which filesystem was used.

> ...
>                   Thr  MB/sec  CPU%    avg lat  max latency
> 2.5.62-mm2-as      8   14.76   52.04%   6.14      4.5
> 2.5.62-mm2-dline   8    9.91   13.90%   9.41       .8
> 2.5.62-mm2         8    9.83   15.62%   7.38     408.9

Fishiness.  2.5.62-mm2 _is_ 2.5.62-mm2-as.  Why the 100x difference?

That 408 seconds looks suspect.

I don't know what tiobench is doing in there, really.  I find it more
useful to test simple things, which I can understand.

If you want to test write latency, do this:

	while true
	do
		write-and-fsync -m 200 -O -f foo
	done

Maybe run a few of these.  This command will cause a continuous streaming
file overwrite.  Then do:

	time write-and-fsync -m1 -f foo

This will simply write a megabyte file, fsync it and exit.

You need to be careful with this - get it wrong and most of the runtime is
actually paging the executables back in.  That is why the above background
load is just reusing the same pagecache over and over.

The latency which I see for the one megabyte write and fsync varies a lot.
From one second to ten.  That's with the deadline scheduler.

There is a place in VFS where one writing task could accidentally hammer a
different one.  I cannot trigger that, but I'll fix it up in next -mm.

^ permalink raw reply	[flat|nested] 17+ messages in thread
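[The write-and-fsync utility referred to above isn't shipped with every
tree; the foreground probe can be sketched in a few lines of Python as a
hypothetical stand-in, not the original tool.  The fsync makes the timing
include the wait for the queued write I/O to complete, which is the
latency being measured.]

```python
import os
import tempfile
import time

def timed_write_fsync(path, size_mb=1):
    """Write size_mb of zeroes to path, fsync, and return elapsed seconds.

    Rough equivalent of `time write-and-fsync -m1 -f foo`: the elapsed
    time is dominated by how long the queued write I/O takes to complete
    under whatever background load is running.
    """
    buf = b"\0" * (1024 * 1024)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    start = time.time()
    try:
        for _ in range(size_mb):
            os.write(fd, buf)
        os.fsync(fd)
    finally:
        os.close(fd)
    return time.time() - start

if __name__ == "__main__":
    path = os.path.join(tempfile.gettempdir(), "latency-probe")
    try:
        print("1 MB write+fsync: %.3f seconds" % timed_write_fsync(path))
    finally:
        os.unlink(path)
```

[As in the original recipe, run it while a looped streaming overwrite is
active in the background, and reuse the same file so the background load
stays in the same pagecache.]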
* IO scheduler benchmarking
@ 2003-02-21  5:23 Andrew Morton
  2003-02-21  6:51 ` David Lang
  0 siblings, 1 reply; 17+ messages in thread
From: Andrew Morton @ 2003-02-21 5:23 UTC (permalink / raw)
To: linux-kernel

Following this email are the results of a number of tests of various I/O
schedulers:

- Anticipatory Scheduler (AS) (from 2.5.61-mm1 approx)

- CFQ (as in 2.5.61-mm1)

- 2.5.61+hacks (Basically 2.5.61 plus everything before the anticipatory
  scheduler - tweaks which fix the writes-starve-reads problem via a
  scheduling storm)

- 2.4.21-pre4

All these tests are simple things from the command line.

I stayed away from the standard benchmarks because they do not really touch
on areas where the Linux I/O scheduler has traditionally been bad.  (If
they did, perhaps it wouldn't have been so bad..)

Plus all the I/O schedulers perform similarly with the usual benchmarks,
with the exception of some tiobench phases, where AS does very well.

Executive summary: the anticipatory scheduler is wiping the others off the
map, and 2.4 is a disaster.

I really have not sought to make the AS look good - I mainly concentrated
on things which we have traditionally been bad at.  If anyone wants to
suggest other tests, please let me know.

The known regressions from the anticipatory scheduler are:

1) 15% (ish) slowdown in David Mansfield's database run.  This appeared to
   go away in later versions of the scheduler.

2) 5% dropoff in single-threaded qsbench swapstorms

3) 30% dropoff in write bandwidth when there is a streaming read (this is
   actually good).

The test machine is a fast P4-HT with 256MB of memory.  Testing was against
a single fast IDE disk, using ext2.

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: IO scheduler benchmarking
  2003-02-21  5:23 Andrew Morton
@ 2003-02-21  6:51 ` David Lang
  2003-02-21  8:16   ` Andrew Morton
  0 siblings, 1 reply; 17+ messages in thread
From: David Lang @ 2003-02-21 6:51 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel

One other useful test would be the time to copy a large (multi-gig) file.
Currently this takes forever and uses very little of the disk bandwidth;
I suspect that the AS would give more preference to reads and therefore
would go faster.

For a real-world example, mozilla downloads files to a temp directory and
then copies them to the permanent location.  When I download a video from
my tivo it takes ~20 min to download a 1G video, during which time the
system is perfectly responsive.  Then, after the download completes, when
mozilla copies it to the real destination (on a separate disk, so it is a
copy, not just a move) the system becomes completely unresponsive to
anything requiring disk IO for several minutes.

David Lang

On Thu, 20 Feb 2003, Andrew Morton wrote:

> Date: Thu, 20 Feb 2003 21:23:04 -0800
> From: Andrew Morton <akpm@digeo.com>
> To: linux-kernel@vger.kernel.org
> Subject: IO scheduler benchmarking
>
> Following this email are the results of a number of tests of various I/O
> schedulers:
>
> - Anticipatory Scheduler (AS) (from 2.5.61-mm1 approx)
>
> - CFQ (as in 2.5.61-mm1)
>
> - 2.5.61+hacks (Basically 2.5.61 plus everything before the anticipatory
>   scheduler - tweaks which fix the writes-starve-reads problem via a
>   scheduling storm)
>
> - 2.4.21-pre4
>
> All these tests are simple things from the command line.
>
> I stayed away from the standard benchmarks because they do not really
> touch on areas where the Linux I/O scheduler has traditionally been bad.
> (If they did, perhaps it wouldn't have been so bad..)
>
> Plus all the I/O schedulers perform similarly with the usual benchmarks,
> with the exception of some tiobench phases, where AS does very well.
>
> Executive summary: the anticipatory scheduler is wiping the others off
> the map, and 2.4 is a disaster.
>
> I really have not sought to make the AS look good - I mainly concentrated
> on things which we have traditionally been bad at.  If anyone wants to
> suggest other tests, please let me know.
>
> The known regressions from the anticipatory scheduler are:
>
> 1) 15% (ish) slowdown in David Mansfield's database run.  This appeared
>    to go away in later versions of the scheduler.
>
> 2) 5% dropoff in single-threaded qsbench swapstorms
>
> 3) 30% dropoff in write bandwidth when there is a streaming read (this is
>    actually good).
>
> The test machine is a fast P4-HT with 256MB of memory.  Testing was
> against a single fast IDE disk, using ext2.

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: IO scheduler benchmarking
  2003-02-21  6:51 ` David Lang
@ 2003-02-21  8:16   ` Andrew Morton
  2003-02-21 10:31     ` Andrea Arcangeli
  0 siblings, 1 reply; 17+ messages in thread
From: Andrew Morton @ 2003-02-21 8:16 UTC (permalink / raw)
To: David Lang; +Cc: linux-kernel

David Lang <david.lang@digitalinsight.com> wrote:
>
> one other useful test would be the time to copy a large (multi-gig) file.
> currently this takes forever and uses very little of the disk bandwidth, I
> suspect that the AS would give more preference to reads and therefore
> would go faster.

Yes, that's a test.

	time (cp 1-gig-file foo ; sync)

	2.5.62-mm2,AS:        1:22.36
	2.5.62-mm2,CFQ:       1:25.54
	2.5.62-mm2,deadline:  1:11.03
	2.4.21-pre4:          1:07.69

Well gee.

> for a real-world example, mozilla downloads files to a temp directory and
> then copies it to the permanent location.  When I download a video from my
> tivo it takes ~20 min to download a 1G video, during which time the system
> is perfectly responsive, then after the download completes when mozilla
> copies it to the real destination (on a separate disk so it is a copy, not
> just a move) the system becomes completely unresponsive to anything
> requiring disk IO for several min.

Well 2.4 is unresponsive, period.  That's due to problems in the VM -
processes which are trying to allocate memory get continually DoS'ed by
`cp' in page reclaim.

For the reads-starved-by-writes problem which you describe, you'll see
that quite a few of the tests did cover that.  contest does as well.

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: IO scheduler benchmarking
  2003-02-21  8:16 ` Andrew Morton
@ 2003-02-21 10:31   ` Andrea Arcangeli
  2003-02-21 10:51     ` William Lee Irwin III
  0 siblings, 1 reply; 17+ messages in thread
From: Andrea Arcangeli @ 2003-02-21 10:31 UTC (permalink / raw)
To: Andrew Morton; +Cc: David Lang, linux-kernel

On Fri, Feb 21, 2003 at 12:16:24AM -0800, Andrew Morton wrote:
> Yes, that's a test.
>
> 	time (cp 1-gig-file foo ; sync)
>
> 	2.5.62-mm2,AS:        1:22.36
> 	2.5.62-mm2,CFQ:       1:25.54
> 	2.5.62-mm2,deadline:  1:11.03
> 	2.4.21-pre4:          1:07.69
>
> Well gee.

It's pointless to benchmark CFQ in a workload like that IMHO.  If you
read and write to the same hard disk you want lots of unfairness to go
faster.  Your latency is a mixture of reads and writes, and the writes
are likely run by the kernel, so CFQ will likely generate more seeks
(it also depends whether you have the magic for the current->mm == NULL
case).  You should run something along these lines to measure the
difference:

	dd if=/dev/zero of=readme bs=1M count=2000
	sync
	cp /dev/zero . &
	time cp readme /dev/null

And the best CFQ benchmark really is to run the tiobench read test with
1 single thread during the `cp /dev/zero .`.  That will measure the
worst-case latency that `read` provided during the benchmark, and it
should make the most difference, because that is definitely the only
thing one can care about if you need CFQ or SFQ.  You don't care that
much about throughput if you enable CFQ, so it's not even correct to
benchmark in terms of real time; only the worst-case `read` latency
matters.

> > for a real-world example, mozilla downloads files to a temp directory and
> > then copies it to the permanent location.  When I download a video from my
> > tivo it takes ~20 min to download a 1G video, during which time the system
> > is perfectly responsive, then after the download completes when mozilla
> > copies it to the real destination (on a separate disk so it is a copy, not
> > just a move) the system becomes completely unresponsive to anything
> > requiring disk IO for several min.
>
> Well 2.4 is unresponsive, period.  That's due to problems in the VM -
> processes which are trying to allocate memory get continually DoS'ed by
> `cp' in page reclaim.

This depends on the workload; you may not have that many allocations.
An `echo 1 >/proc/sys/vm/bdflush` will fix it should your workload be
hurt by too much dirty cache.  Furthermore, elevator-lowlatency makes
the blkdev layer much more fair under load.

Andrea

^ permalink raw reply	[flat|nested] 17+ messages in thread
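[The worst-case read latency Andrea says is the only number that matters
can be sketched as a per-chunk timing loop.  This is an illustrative
stand-in for the tiobench read phase, not its actual implementation; the
function name and chunk size are made up.]

```python
import os
import time

def read_with_latency(path, chunk_size=1024 * 1024):
    """Stream-read a file, timing each chunk; return (avg, worst) latency
    in seconds.  The worst-case figure is what matters when judging a
    fair-queueing scheduler under a competing streaming write.
    """
    latencies = []
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            start = time.time()
            data = os.read(fd, chunk_size)
            latencies.append(time.time() - start)
            if not data:
                break
    finally:
        os.close(fd)
    return sum(latencies) / len(latencies), max(latencies)
```

[Run it against a large pre-written file while `cp /dev/zero . &` is
active, and compare the worst-case figure across schedulers.]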
* Re: IO scheduler benchmarking
  2003-02-21 10:31 ` Andrea Arcangeli
@ 2003-02-21 10:51   ` William Lee Irwin III
  2003-02-21 11:08     ` Andrea Arcangeli
  0 siblings, 1 reply; 17+ messages in thread
From: William Lee Irwin III @ 2003-02-21 10:51 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Andrew Morton, David Lang, linux-kernel

On Fri, Feb 21, 2003 at 12:16:24AM -0800, Andrew Morton wrote:
>> Well 2.4 is unresponsive, period.  That's due to problems in the VM -
>> processes which are trying to allocate memory get continually DoS'ed
>> by `cp' in page reclaim.

On Fri, Feb 21, 2003 at 11:31:40AM +0100, Andrea Arcangeli wrote:
> this depends on the workload, you may not have that many allocations;
> an echo 1 >/proc/sys/vm/bdflush will fix it should your workload be hurt
> by too much dirty cache.  Furthermore, elevator-lowlatency makes
> the blkdev layer much more fair under load.

Restricting io in flight doesn't actually repair the issues raised by
it, but rather avoids them by limiting functionality.  The issue raised
here is streaming io competing with processes working within bounded
memory.  It's unclear to me how 2.5.x mitigates this but the effects
are far less drastic there.  The "fix" you're suggesting is clamping
off the entire machine's io just to contain the working set of a single
process that generates unbounded amounts of dirty data and
inadvertently penalizes other processes via page reclaim, where instead
it should be forced to fairly wait its turn for memory.

-- wli

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: IO scheduler benchmarking
  2003-02-21 10:51 ` William Lee Irwin III
@ 2003-02-21 11:08   ` Andrea Arcangeli
  2003-02-21 11:17     ` Nick Piggin
  2003-02-21 11:34     ` William Lee Irwin III
  0 siblings, 2 replies; 17+ messages in thread
From: Andrea Arcangeli @ 2003-02-21 11:08 UTC (permalink / raw)
To: William Lee Irwin III, Andrew Morton, David Lang, linux-kernel

On Fri, Feb 21, 2003 at 02:51:46AM -0800, William Lee Irwin III wrote:
> On Fri, Feb 21, 2003 at 12:16:24AM -0800, Andrew Morton wrote:
> >> Well 2.4 is unresponsive, period.  That's due to problems in the VM -
> >> processes which are trying to allocate memory get continually DoS'ed
> >> by `cp' in page reclaim.
>
> On Fri, Feb 21, 2003 at 11:31:40AM +0100, Andrea Arcangeli wrote:
> > this depends on the workload, you may not have that many allocations;
> > an echo 1 >/proc/sys/vm/bdflush will fix it should your workload be hurt
> > by too much dirty cache.  Furthermore, elevator-lowlatency makes
> > the blkdev layer much more fair under load.
>
> Restricting io in flight doesn't actually repair the issues raised by

The amount of I/O that we allow in flight is purely random; there is no
point in allowing several dozen mbytes of I/O in flight on a 64M
machine.  My patch fixes that and nothing more.

> it, but rather avoids them by limiting functionality.

If you can show a (throughput) benchmark where you see this limited
functionality I'd be very interested.

Alternatively, I can also claim that 2.4 and 2.5 are limiting
functionality too, by limiting the I/O in flight to some hundred
megabytes, right?

It's like the dma ring buffer size of a soundcard: if you want low
latency it has to be small, it's as simple as that.  It's a tradeoff
between latency and performance, but the point here is that apparently
you gain nothing with such a huge amount of I/O in flight.  This has
nothing to do with the number of requests; the requests have to be
numerous, or seeks won't be reordered aggressively.  But when everything
merges, using all the requests is pointless: it only has the effect of
locking everything in ram, and this screws the write throttling too,
because we do write throttling on the dirty stuff, not on the locked
stuff, and this is what elevator-lowlatency addresses.

You may argue about the amount of the in-flight I/O limit I chose, but
really the default in mainline looks overkill to me for generic
hardware.

> The issue raised here is streaming io competing with processes working
> within bounded memory.  It's unclear to me how 2.5.x mitigates this but
> the effects are far less drastic there.  The "fix" you're suggesting is
> clamping off the entire machine's io just to contain the working set of

Show me this clamping off, please.  Take 2.4.21pre4aa3 and trash it
compared to 2.4.21pre4 with the minimum 32M queue; I'd be very
interested.  If I've a problem I must fix it ASAP, but all the
benchmarks are in green so far, and the behaviour was very bad before
these fixes.  Go ahead and show me red and you'll do me a big favour.
Either that, or you're wrong that I'm clamping off anything.

Just to be clear, this whole thing has nothing to do with the elevator,
or CFQ, or whatever; it only concerns the worthwhile amount of in-flight
I/O to keep the disk always running.

> a single process that generates unbounded amounts of dirty data and
> inadvertently penalizes other processes via page reclaim, where instead
> it should be forced to fairly wait its turn for memory.
>
> -- wli

Andrea

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: IO scheduler benchmarking
  2003-02-21 11:08 ` Andrea Arcangeli
@ 2003-02-21 11:17   ` Nick Piggin
  2003-02-21 11:41     ` Andrea Arcangeli
  2003-02-21 11:34   ` William Lee Irwin III
  1 sibling, 1 reply; 17+ messages in thread
From: Nick Piggin @ 2003-02-21 11:17 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: William Lee Irwin III, Andrew Morton, David Lang, linux-kernel

Andrea Arcangeli wrote:

> it's like a dma ring buffer size of a soundcard, if you want low latency
> it has to be small, it's as simple as that.  It's a tradeoff between

Although the dma buffer is strictly FIFO, so the situation isn't
quite so simple for disk IO.

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: IO scheduler benchmarking
  2003-02-21 11:17 ` Nick Piggin
@ 2003-02-21 11:41   ` Andrea Arcangeli
  2003-02-21 21:25     ` Andrew Morton
  0 siblings, 1 reply; 17+ messages in thread
From: Andrea Arcangeli @ 2003-02-21 11:41 UTC (permalink / raw)
To: Nick Piggin
Cc: William Lee Irwin III, Andrew Morton, David Lang, linux-kernel

On Fri, Feb 21, 2003 at 10:17:55PM +1100, Nick Piggin wrote:
> Andrea Arcangeli wrote:
>
> >it's like a dma ring buffer size of a soundcard, if you want low latency
> >it has to be small, it's as simple as that.  It's a tradeoff between
> >
> Although the dma buffer is strictly FIFO, so the situation isn't
> quite so simple for disk IO.

In general (w/o CFQ, or its opposite extreme - an unfair, starving
elevator where you're stuck regardless of the size of the queue) a
larger queue will mean higher latencies in the presence of a flood of
async load, just as in a dma buffer.  This is obvious for the elevator
noop, for example.  I'm speaking about a stable, non-starving, fast,
default elevator (something like 2.4 mainline, incidentally), and for
that the similarity with the dma buffer definitely applies: there will
be a latency effect coming from the size of the queue (even ignoring
the other issues that the load of locked buffers introduces).

The whole idea of CFQ is to make some workloads run with low latency
independent of the size of the async queue.  But still (even with CFQ)
you have all the other problems: write throttling, a worthless amount
of locked ram, and even wasted time on lots of full, merely ordered
requests in the elevator (yeah, I know if you use elevator noop it
won't waste almost any time, but again, that is not what most people
will use).

I don't buy Andrew complaining about the write throttling when he still
allows several dozen mbytes of ram in flight and invisible to the VM; I
mean, before complaining about write throttling, the excessive,
worthless amount of locked buffers must be fixed, and so I did, and it
works very well from the feedback I had so far.

You can take 2.4.21pre4aa3 and benchmark it as you want if you think
I'm totally wrong; the elevator-lowlatency patch should be trivial to
apply and back out (benchmarking against pre4 would be unfair).

Andrea

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: IO scheduler benchmarking
  2003-02-21 11:41 ` Andrea Arcangeli
@ 2003-02-21 21:25   ` Andrew Morton
  2003-02-23 15:09     ` Andrea Arcangeli
  0 siblings, 1 reply; 17+ messages in thread
From: Andrew Morton @ 2003-02-21 21:25 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: piggin, wli, david.lang, linux-kernel

Andrea Arcangeli <andrea@suse.de> wrote:
>
> I don't
> buy Andrew complaining about the write throttling when he still allows
> several dozen mbytes of ram in flight and invisible to the VM,

The 2.5 VM accounts for these pages (/proc/meminfo:Writeback) and
throttling decisions are made upon the sum of dirty+writeback pages.

The 2.5 VFS limits the amount of dirty+writeback memory, not just the
amount of dirty memory.

Throttling in both write() and the page allocator is fully decoupled from
the queue size.  An 8192-slot (4 gigabyte) queue on a 32M machine has been
tested.

The only tasks which block in get_request_wait() are the ones which we
want to block there: heavy writers.

Page reclaim will never block page allocators in get_request_wait().  That
causes terrible latency if the writer is still active.

Page reclaim will never block a page-allocating process on I/O against a
particular disk block.  Allocators are instead throttled against _any_
write I/O completion.  (This is broken in several ways, but it works well
enough to leave it alone I think).

^ permalink raw reply	[flat|nested] 17+ messages in thread
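[The dirty+writeback throttling decision described above can be sketched
roughly as follows.  The function name and the 40% ratio are made up for
illustration; the real logic lives in the 2.5 VFS writeback code, not in
this form.]

```python
def should_throttle_writer(dirty_pages, writeback_pages, total_pages,
                           dirty_ratio=0.4):
    """Throttle a writer when dirty + writeback memory exceeds a fraction
    of RAM.  Note the sum: pages already under I/O (writeback) count too,
    so the decision is decoupled from the request-queue size - even a
    huge queue cannot let dirty memory grow unbounded.
    """
    return dirty_pages + writeback_pages > dirty_ratio * total_pages

# A 32M box has 8192 pages of 4 KB; the threshold is then ~3277 pages.
print(should_throttle_writer(2000, 1500, 8192))  # 3500 > 3276.8 -> True
print(should_throttle_writer(2000,  500, 8192))  # 2500 < 3276.8 -> False
```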
* Re: IO scheduler benchmarking
  2003-02-21 21:25 ` Andrew Morton
@ 2003-02-23 15:09   ` Andrea Arcangeli
  0 siblings, 0 replies; 17+ messages in thread
From: Andrea Arcangeli @ 2003-02-23 15:09 UTC (permalink / raw)
To: Andrew Morton; +Cc: piggin, wli, david.lang, linux-kernel

On Fri, Feb 21, 2003 at 01:25:49PM -0800, Andrew Morton wrote:
> Andrea Arcangeli <andrea@suse.de> wrote:
> >
> > I don't
> > buy Andrew complaining about the write throttling when he still allows
> > several dozen mbytes of ram in flight and invisible to the VM,
>
> The 2.5 VM accounts for these pages (/proc/meminfo:Writeback) and
> throttling decisions are made upon the sum of dirty+writeback pages.
>
> The 2.5 VFS limits the amount of dirty+writeback memory, not just the
> amount of dirty memory.
>
> Throttling in both write() and the page allocator is fully decoupled from
> the queue size.  An 8192-slot (4 gigabyte) queue on a 32M machine has been
> tested.

The 32M case is probably fine with it: you moved the limit of in-flight
I/O into the writeback layer, and the write throttling will limit the
amount of ram in flight to 16M or so.  It would be much more interesting
to see some latency benchmark on an 8G machine with 4G simultaneously
locked in the I/O queue.  A 4G queue on an IDE disk can only waste lots
of cpu and memory resources, increasing the latency too, without
providing any benefit.  Your 4G queue thing provides only disadvantages
as far as I can tell.

> The only tasks which block in get_request_wait() are the ones which we
> want to block there: heavy writers.
>
> Page reclaim will never block page allocators in get_request_wait().  That
> causes terrible latency if the writer is still active.
>
> Page reclaim will never block a page-allocating process on I/O against a
> particular disk block.  Allocators are instead throttled against _any_
> write I/O completion.  (This is broken in several ways, but it works well
> enough to leave it alone I think).

2.4 on desktop boxes could fill all ram with locked and dirty stuff
because of the excessive size of the queue, so any comparison with 2.4
in terms of page reclaim should be repeated on 2.4.21pre4aa3 IMHO, where
the VM has a chance not to find the machine in a collapsed state where
the only thing it can do is either wait or panic(); feel free to choose
what you prefer.

Andrea

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: IO scheduler benchmarking 2003-02-21 11:08 ` Andrea Arcangeli 2003-02-21 11:17 ` Nick Piggin @ 2003-02-21 11:34 ` William Lee Irwin III 2003-02-21 12:38 ` Andrea Arcangeli 1 sibling, 1 reply; 17+ messages in thread From: William Lee Irwin III @ 2003-02-21 11:34 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Andrew Morton, David Lang, linux-kernel On Fri, Feb 21, 2003 at 02:51:46AM -0800, William Lee Irwin III wrote: >> Restricting io in flight doesn't actually repair the issues raised by On Fri, Feb 21, 2003 at 12:08:07PM +0100, Andrea Arcangeli wrote: > the amount of I/O that we allow in flight is purerly random, there is no > point to allow several dozen mbytes of I/O in flight on a 64M machine, > my patch fixes that and nothing more. I was arguing against having any preset limit whatsoever. On Fri, Feb 21, 2003 at 02:51:46AM -0800, William Lee Irwin III wrote: >> it, but rather avoids them by limiting functionality. On Fri, Feb 21, 2003 at 12:08:07PM +0100, Andrea Arcangeli wrote: > If you can show a (throughput) benchmark where you see this limited > functionalty I'd be very interested. > Alternatively I can also claim that 2.4 and 2.5 are limiting > functionalty too by limiting the I/O in flight to some hundred megabytes > right? This has nothing to do with benchmarks. Counterexample: suppose the process generating dirty data is the only one running. The machine's effective RAM capacity is then limited to the dirty data limit plus some small constant by this io in flight limitation. This functionality is not to be dismissed lightly: changing the /proc/ business is root-only, hence it may not be within the power of a victim of a poor setting to adjust it. On Fri, Feb 21, 2003 at 12:08:07PM +0100, Andrea Arcangeli wrote: > it's like a dma ring buffer size of a soundcard, if you want low latency > it has to be small, it's as simple as that. 
It's a tradeoff between > latency and performance, but the point here is that apparently you gain > nothing with such an huge amount of I/O in flight. This has nothing to > do with the number of requests, the requests have to be a lot, or seeks > won't be reordered aggressively, but when everything merges using all > the requests is pointless and it only has the effect of locking > everything in ram, and this screw the write throttling too, because we > do write throttling on the dirty stuff, not on the locked stuff, and > this is what elevator-lowlatency address. > You may argue on the amount of in flight I/O limit I choosen, but really > the default in mainlines looks overkill to me for generic hardware. It's not a question of gain but rather immunity to reconfigurations. Redoing it for all the hardware raises a tuning issue, and in truth all I've ever wound up doing is turning it off because I've got so much RAM that various benchmarks could literally be done in-core as a first pass, then sorted, then sprayed out to disk in block-order. And a bunch of open benchmarks are basically just in-core spinlock exercise. (Ignore the fact there was a benchmark mentioned.) Amortizing seeks and incrementally sorting and so on generally require large buffers, and if you have the RAM, the kernel should use it. But more seriously, global io in flight limits are truly worthless, if anything it should be per-process, but even that's inadequate as it requires retuning for varying io speeds. Limit enforcement needs to be (1) localized (2) self-tuned via block layer feedback If I understand the code properly, 2.5.x has (2) but not (1). On Fri, Feb 21, 2003 at 02:51:46AM -0800, William Lee Irwin III wrote: >> The issue raised here is streaming io competing with processes working >> within bounded memory. It's unclear to me how 2.5.x mitigates this but >> the effects are far less drastic there. 
>> The "fix" you're suggesting is
>> clamping off the entire machine's io just to contain the working set of

On Fri, Feb 21, 2003 at 12:08:07PM +0100, Andrea Arcangeli wrote:
> show me this clamping off please. take 2.4.21pre4aa3 and trash it
> compared to 2.4.21pre4 with the minimum 32M queue, I'd be very
> interested, if I've a problem I must fix it ASAP, but all the benchmarks
> are in green so far and the behaviour was very bad before these fixes,
> go ahead and show me red and you'll do me a big favour. Either that or
> you're wrong that I'm clamping off anything.
> Just to be clear, this whole thing has nothing to do with the elevator,
> or CFQ or whatever, it is only related to the worthwhile amount of
> in flight I/O to keep the disk always running.

You named the clamping off yourself: a dozen MB on a 64MB box, 32MB on
2.4.21pre4. Some limit that's a hard upper bound but resettable via a
sysctl or /proc/ or something. Testing 2.4.x-based trees might be a
little painful since I'd have to debug why 2.4.x stopped booting on my
boxen, which would take me a bit far afield from my current hacking.

On Fri, Feb 21, 2003 at 02:51:46AM -0800, William Lee Irwin III wrote:
>> a single process that generates unbounded amounts of dirty data and
>> inadvertently penalizes other processes via page reclaim, where instead
>> it should be forced to fairly wait its turn for memory.

I believe I said something important here. =)

The reason why this _should_ be the case is that processes stealing
from each other is the kind of mutual interference that leads to things
like Mozilla taking ages to swap in because other things were running
for a while and it wasn't, and so on.

-- wli

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: IO scheduler benchmarking
  2003-02-21 11:34         ` William Lee Irwin III
@ 2003-02-21 12:38           ` Andrea Arcangeli
  0 siblings, 0 replies; 17+ messages in thread
From: Andrea Arcangeli @ 2003-02-21 12:38 UTC (permalink / raw)
  To: William Lee Irwin III, Andrew Morton, David Lang, linux-kernel

On Fri, Feb 21, 2003 at 03:34:36AM -0800, William Lee Irwin III wrote:
> On Fri, Feb 21, 2003 at 02:51:46AM -0800, William Lee Irwin III wrote:
> >> Restricting io in flight doesn't actually repair the issues raised by
>
> On Fri, Feb 21, 2003 at 12:08:07PM +0100, Andrea Arcangeli wrote:
> > the amount of I/O that we allow in flight is purely random, there is no
> > point to allow several dozen mbytes of I/O in flight on a 64M machine,
> > my patch fixes that and nothing more.
>
> I was arguing against having any preset limit whatsoever.

The preset limit exists in every Linux kernel out there. It should be
mandated by the low-level device driver; I don't allow that yet, but it
should be trivial to extend with just an additional per-queue int, it's
just an implementation matter.

> On Fri, Feb 21, 2003 at 02:51:46AM -0800, William Lee Irwin III wrote:
> >> it, but rather avoids them by limiting functionality.
>
> On Fri, Feb 21, 2003 at 12:08:07PM +0100, Andrea Arcangeli wrote:
> > If you can show a (throughput) benchmark where you see this limited
> > functionality I'd be very interested.
> > Alternatively I can also claim that 2.4 and 2.5 are limiting
> > functionality too by limiting the I/O in flight to some hundred megabytes,
> > right?
>
> This has nothing to do with benchmarks.

It has to: you claimed I limited functionality, and if you can't
measure it in any way (or at least demonstrate it with math), it
doesn't exist.

> Counterexample: suppose the process generating dirty data is the only
> one running. The machine's effective RAM capacity is then limited to
> the dirty data limit plus some small constant by this io in flight
> limitation.

Only the free memory and the cache are accounted here; while this task
allocates ram with malloc, the amount of dirty ram will be reduced
accordingly, so what you said is far from reality. We aren't 100%
accurate in the cache level accounting, true, but we're 100% accurate
in the anonymous memory accounting.

> This functionality is not to be dismissed lightly: changing the /proc/
> business is root-only, hence it may not be within the power of a victim
> of a poor setting to adjust it.
>
> On Fri, Feb 21, 2003 at 12:08:07PM +0100, Andrea Arcangeli wrote:
> > it's like a dma ring buffer size of a soundcard, if you want low latency
> > it has to be small, it's as simple as that. It's a tradeoff between
> > latency and performance, but the point here is that apparently you gain
> > nothing with such a huge amount of I/O in flight. This has nothing to
> > do with the number of requests, the requests have to be a lot, or seeks
> > won't be reordered aggressively, but when everything merges using all
> > the requests is pointless and it only has the effect of locking
> > everything in ram, and this screws the write throttling too, because we
> > do write throttling on the dirty stuff, not on the locked stuff, and
> > this is what elevator-lowlatency addresses.
> > You may argue on the amount of in flight I/O limit I chose, but really
> > the default in mainlines looks overkill to me for generic hardware.
>
> It's not a question of gain but rather immunity to reconfigurations.

You mean immunity to reconfigurations of machines with more than 4G of
ram, maybe, and you are OK with completely ignoring the latency effects
of the overkill queue size. Everything smaller can be affected by it,
and not only in terms of latency, especially if you have multiple
spindles that literally multiply the fixed max amount of in flight I/O.
> Redoing it for all the hardware raises a tuning issue, and in truth
> all I've ever wound up doing is turning it off because I've got so
> much RAM that various benchmarks could literally be done in-core as a
> first pass, then sorted, then sprayed out to disk in block-order. And
> a bunch of open benchmarks are basically just in-core spinlock exercise.
> (Ignore the fact there was a benchmark mentioned.)
>
> Amortizing seeks and incrementally sorting and so on generally require
> large buffers, and if you have the RAM, the kernel should use it.
>
> But more seriously, global io in flight limits are truly worthless, if
> anything it should be per-process, but even that's inadequate as it

This doesn't make any sense: the limit always exists, it has to. If you
drop it the machine will die deadlocking in a few milliseconds; the
whole plugging and write throttling logic that drives the whole I/O
subsystem totally depends on a limit on the in flight I/O.

> requires retuning for varying io speeds. Limit enforcement needs to be
> (1) localized
> (2) self-tuned via block layer feedback
>
> If I understand the code properly, 2.5.x has (2) but not (1).

2.5 has the unplugging logic so it definitely has a high limit of in
flight I/O too, no matter what elevator or whatever; w/o the fixed
limit 2.5 will die too, like any other Linux kernel out there I have
ever seen.

> On Fri, Feb 21, 2003 at 02:51:46AM -0800, William Lee Irwin III wrote:
> >> The issue raised here is streaming io competing with processes working
> >> within bounded memory. It's unclear to me how 2.5.x mitigates this but
> >> the effects are far less drastic there. The "fix" you're suggesting is
> >> clamping off the entire machine's io just to contain the working set of
>
> On Fri, Feb 21, 2003 at 12:08:07PM +0100, Andrea Arcangeli wrote:
> > show me this clamping off please.
> > take 2.4.21pre4aa3 and trash it
> > compared to 2.4.21pre4 with the minimum 32M queue, I'd be very
> > interested, if I've a problem I must fix it ASAP, but all the benchmarks
> > are in green so far and the behaviour was very bad before these fixes,
> > go ahead and show me red and you'll do me a big favour. Either that or
> > you're wrong that I'm clamping off anything.
> > Just to be clear, this whole thing has nothing to do with the elevator,
> > or CFQ or whatever, it is only related to the worthwhile amount of
> > in flight I/O to keep the disk always running.
>
> You named the clamping off yourself: a dozen MB on a 64MB box, 32MB on
> 2.4.21pre4. Some limit that's a hard upper bound but resettable via a
> sysctl or /proc/ or something. Testing 2.4.x-based trees might be a
> little painful since I'd have to debug why 2.4.x stopped booting on my
> boxen, which would take me a bit far afield from my current hacking.

2.4.21pre4aa3 has to boot on it.

> On Fri, Feb 21, 2003 at 02:51:46AM -0800, William Lee Irwin III wrote:
> >> a single process that generates unbounded amounts of dirty data and
> >> inadvertently penalizes other processes via page reclaim, where instead
> >> it should be forced to fairly wait its turn for memory.
>
> I believe I said something important here. =)

You're arguing about the async flushing heuristic, which should be made
smarter instead of taking 50% of the freeable memory (not anonymous
memory). This isn't black and white stuff and you shouldn't mix issues;
it has nothing to do with the blkdev plugging logic driven by the limit
of in flight I/O (in every l-k out there ever).

> The reason why this _should_ be the case is that processes stealing
> from each other is the kind of mutual interference that leads to things
> like Mozilla taking ages to swap in because other things were running
> for a while and it wasn't, and so on.
>
> -- wli

Andrea

^ permalink raw reply	[flat|nested] 17+ messages in thread
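[Editor's note: the "(1) localized, (2) self-tuned via block layer feedback" limit wli argues for can be sketched in a few lines of C. This is a hypothetical userspace illustration, not code from any kernel tree discussed here; every name in it is invented. It shows one way a per-queue cap could adjust itself from completion latency instead of being a fixed global byte count, while still keeping the hard floor and ceiling Andrea insists must exist.]

```c
/* Hypothetical sketch of a per-queue, feedback-tuned in-flight limit.
 * AIMD (additive increase, multiplicative decrease), as in TCP:
 * grow the cap while requests complete under the latency target,
 * halve it when one waits too long.  Invented names throughout. */

struct inflight_limit {
    long cap_bytes;        /* current cap on in-flight bytes     */
    long min_bytes;        /* floor: never let the disk go idle  */
    long max_bytes;        /* ceiling: bound worst-case latency  */
    double target_lat_ms;  /* latency goal for a single request  */
};

/* Called on each request completion with its observed latency. */
static void inflight_feedback(struct inflight_limit *l, double observed_lat_ms)
{
    if (observed_lat_ms < l->target_lat_ms)
        l->cap_bytes += 64 * 1024;   /* disk keeping up: admit more */
    else
        l->cap_bytes /= 2;           /* a request waited: back off  */

    if (l->cap_bytes < l->min_bytes)
        l->cap_bytes = l->min_bytes;
    if (l->cap_bytes > l->max_bytes)
        l->cap_bytes = l->max_bytes;
}
```

The per-queue state is wli's point (1), locality; the completion-driven adjustment is point (2). Andrea's objection survives intact: `min_bytes` and `max_bytes` still exist, so the limit never disappears — it only moves with the hardware instead of being one preset number for a 64M box and a 4G box alike.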
end of thread, other threads:[~2003-02-25 23:46 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-02-25 12:59 IO scheduler benchmarking rwhron
2003-02-25 22:09 ` Andrew Morton
  -- strict thread matches above, loose matches on Subject: below --
2003-02-25 21:57 rwhron
2003-02-25  5:35 rwhron
2003-02-25  6:38 ` Andrew Morton
2003-02-21  5:23 Andrew Morton
2003-02-21  6:51 ` David Lang
2003-02-21  8:16   ` Andrew Morton
2003-02-21 10:31     ` Andrea Arcangeli
2003-02-21 10:51       ` William Lee Irwin III
2003-02-21 11:08         ` Andrea Arcangeli
2003-02-21 11:17           ` Nick Piggin
2003-02-21 11:41             ` Andrea Arcangeli
2003-02-21 21:25               ` Andrew Morton
2003-02-23 15:09                 ` Andrea Arcangeli
2003-02-21 11:34           ` William Lee Irwin III
2003-02-21 12:38             ` Andrea Arcangeli