* io-scheduler tuning for better read/write ratio
@ 2009-06-16 15:43 Ralf Gross
  2009-06-16 16:41 ` David Newall
  0 siblings, 1 reply; 20+ messages in thread
From: Ralf Gross @ 2009-06-16 15:43 UTC (permalink / raw)
  To: linux-kernel

Hi,

I'm trying to tune the kernel/io-scheduler for a better read/write ratio
on an Areca RAID0 device (4 disks, kernel 2.6.26, xfs fs). I can get
200 MB/s sequential writes and about the same for sequential reads.

My problem is that if there are reads _and_ writes on this device, the
write throughput is much higher than the read throughput (40 MB/s read,
90 MB/s write).

The deadline scheduler sounded like the way to go for getting better
read results, but regardless of which parameter I change, the ratio
stays the same. cfq, noop, different parameter settings: always the
same result.

In short: is there a way to tune the kernel/scheduler settings to get a
higher read throughput? Writes are not that important; basically there
are only two 30 GB files on the device/filesystem that are used to
spool data for two LTO-4 tape drives. So I need a certain read speed to
keep both drives streaming. While data gets written to one file, I need
at least 50 MB/s for reading from the other file and sending it to the
tape drive.

Thanks, Ralf
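For readers following along: the scheduler and its per-device tunables
that Ralf asks about live under sysfs. A minimal sketch of inspecting
and switching them (sdc is used here only as an example; it is one of
the member disks named later in the thread):

    # show the active scheduler (the bracketed entry) and switch it
    cat /sys/block/sdc/queue/scheduler
    echo deadline > /sys/block/sdc/queue/scheduler

    # the per-scheduler tunables discussed below live one level deeper
    ls /sys/block/sdc/queue/iosched/
    cat /sys/block/sdc/queue/iosched/writes_starved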
* Re: io-scheduler tuning for better read/write ratio
  2009-06-16 15:43 io-scheduler tuning for better read/write ratio Ralf Gross
@ 2009-06-16 16:41 ` David Newall
  2009-06-16 18:40   ` Ralf Gross
  0 siblings, 1 reply; 20+ messages in thread
From: David Newall @ 2009-06-16 16:41 UTC (permalink / raw)
  To: Ralf Gross; +Cc: linux-kernel

Ralf Gross wrote:
> write throughput is much higher than the read throughput (40 MB/s
> read, 90 MB/s write).

Perhaps I've misunderstood, but isn't that common? Reads have to come
from disk, whereas writes get cached by the drive.
* Re: io-scheduler tuning for better read/write ratio
  2009-06-16 16:41 ` David Newall
@ 2009-06-16 18:40   ` Ralf Gross
  2009-06-16 18:43     ` Casey Dahlin
  0 siblings, 1 reply; 20+ messages in thread
From: Ralf Gross @ 2009-06-16 18:40 UTC (permalink / raw)
  To: linux-kernel

David Newall schrieb:
> Ralf Gross wrote:
> > write throughput is much higher than the read throughput (40 MB/s
> > read, 90 MB/s write).

Hm, but I get higher read throughput (160-200 MB/s) if I don't write to
the device at the same time.

Ralf
* Re: io-scheduler tuning for better read/write ratio
  2009-06-16 18:40 ` Ralf Gross
@ 2009-06-16 18:43   ` Casey Dahlin
  2009-06-16 18:56     ` Ralf Gross
  0 siblings, 1 reply; 20+ messages in thread
From: Casey Dahlin @ 2009-06-16 18:43 UTC (permalink / raw)
  To: Ralf Gross; +Cc: linux-kernel

On 06/16/2009 02:40 PM, Ralf Gross wrote:
> David Newall schrieb:
>> Ralf Gross wrote:
>>> write throughput is much higher than the read throughput (40 MB/s
>>> read, 90 MB/s write).
>
> Hm, but I get higher read throughput (160-200 MB/s) if I don't write
> to the device at the same time.
>
> Ralf

How specifically are you testing? It could depend a lot on the
particular access patterns you're using to test.

--CJD
* Re: io-scheduler tuning for better read/write ratio
  2009-06-16 18:43 ` Casey Dahlin
@ 2009-06-16 18:56   ` Ralf Gross
  2009-06-16 20:16     ` Jeff Moyer
  0 siblings, 1 reply; 20+ messages in thread
From: Ralf Gross @ 2009-06-16 18:56 UTC (permalink / raw)
  To: linux-kernel

Casey Dahlin schrieb:
> On 06/16/2009 02:40 PM, Ralf Gross wrote:
> > David Newall schrieb:
> >> Ralf Gross wrote:
> >>> write throughput is much higher than the read throughput (40 MB/s
> >>> read, 90 MB/s write).
> >
> > Hm, but I get higher read throughput (160-200 MB/s) if I don't write
> > to the device at the same time.
> >
> > Ralf
>
> How specifically are you testing? It could depend a lot on the
> particular access patterns you're using to test.

I did the basic tests with tiobench. The real test is a test backup
(bacula) with 2 jobs that create 2 30 GB spool files on that device.
The jobs partially write to the device in parallel. Depending on which
spool file reaches the 30 GB first, one starts reading from that file
and writing to tape, while the other is still spooling.

Ralf
* Re: io-scheduler tuning for better read/write ratio
  2009-06-16 18:56 ` Ralf Gross
@ 2009-06-16 20:16   ` Jeff Moyer
  2009-06-22 14:43     ` Jeff Moyer
  0 siblings, 1 reply; 20+ messages in thread
From: Jeff Moyer @ 2009-06-16 20:16 UTC (permalink / raw)
  To: Ralf Gross; +Cc: linux-kernel

Ralf Gross <rg@stz-softwaretechnik.com> writes:

> Casey Dahlin schrieb:
> > How specifically are you testing? It could depend a lot on the
> > particular access patterns you're using to test.
>
> I did the basic tests with tiobench. The real test is a test backup
> (bacula) with 2 jobs that create 2 30 GB spool files on that device.
> The jobs partially write to the device in parallel. Depending on which
> spool file reaches the 30 GB first, one starts reading from that file
> and writing to tape, while the other is still spooling.

We are missing a lot of details, here. I guess the first thing I'd try
would be bumping up the max_readahead_kb parameter, since I'm guessing
that your backup application isn't driving very deep queue depths. If
that doesn't work, then please provide exact invocations of tiobench
that reproduce the problem or some blktrace output for your real test.

Cheers,
Jeff
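The readahead value Jeff refers to is exposed per device in sysfs as
read_ahead_kb (the file name used later in this thread). A small sketch
of trying his 1-2 MB suggestion, assuming sdc as an example device:

    # read_ahead_kb is the per-queue readahead size in KB (default 128)
    cat /sys/block/sdc/queue/read_ahead_kb
    echo 2048 > /sys/block/sdc/queue/read_ahead_kb   # try ~2 MB
    blockdev --setra 4096 /dev/sdc                   # equivalent, in 512-byte sectors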
* Re: io-scheduler tuning for better read/write ratio
  2009-06-16 20:16 ` Jeff Moyer
@ 2009-06-22 14:43   ` Jeff Moyer
  2009-06-22 16:31     ` Ralf Gross
  0 siblings, 1 reply; 20+ messages in thread
From: Jeff Moyer @ 2009-06-22 14:43 UTC (permalink / raw)
  To: Ralf Gross; +Cc: linux-kernel

Jeff Moyer <jmoyer@redhat.com> writes:

[...]

> We are missing a lot of details, here. I guess the first thing I'd try
> would be bumping up the max_readahead_kb parameter, since I'm guessing
> that your backup application isn't driving very deep queue depths. If
> that doesn't work, then please provide exact invocations of tiobench
> that reproduce the problem or some blktrace output for your real test.

Any news, Ralf?

Cheers,
Jeff
* Re: io-scheduler tuning for better read/write ratio
  2009-06-22 14:43 ` Jeff Moyer
@ 2009-06-22 16:31   ` Ralf Gross
  2009-06-22 19:42     ` Jeff Moyer
  0 siblings, 1 reply; 20+ messages in thread
From: Ralf Gross @ 2009-06-22 16:31 UTC (permalink / raw)
  To: linux-kernel

Jeff Moyer schrieb:
> Jeff Moyer <jmoyer@redhat.com> writes:
[...]
> > We are missing a lot of details, here. I guess the first thing I'd try
> > would be bumping up the max_readahead_kb parameter, since I'm guessing
> > that your backup application isn't driving very deep queue depths. If
> > that doesn't work, then please provide exact invocations of tiobench
> > that reproduce the problem or some blktrace output for your real test.
>
> Any news, Ralf?

Sorry for the delay. At the moment there are large backups running and
using the raid device for spooling, so I can't do any tests.

Re. read ahead: I tested different settings from 8Kb to 65Kb, this
didn't help.

I'll do some more tests when the backups are done (3-4 more days).

Thanks, Ralf
* Re: io-scheduler tuning for better read/write ratio
  2009-06-22 16:31 ` Ralf Gross
@ 2009-06-22 19:42   ` Jeff Moyer
  2009-06-23  7:24     ` Ralf Gross
  2009-06-26  2:19     ` Wu Fengguang
  0 siblings, 2 replies; 20+ messages in thread
From: Jeff Moyer @ 2009-06-22 19:42 UTC (permalink / raw)
  To: Ralf Gross; +Cc: linux-kernel, fengguang.wu

Ralf Gross <rg@STZ-Softwaretechnik.com> writes:

[...]

> Re. read ahead: I tested different settings from 8Kb to 65Kb, this
> didn't help.
>
> I'll do some more tests when the backups are done (3-4 more days).

The default is 128KB, I believe, so it's strange that you would test
smaller values. ;)  I would try something along the lines of 1 or 2 MB.

I'm CCing Fengguang in case he has any suggestions.

Cheers,
Jeff

p.s. Fengguang, the thread starts here:
     http://lkml.org/lkml/2009/6/16/390
* Re: io-scheduler tuning for better read/write ratio
  2009-06-22 19:42 ` Jeff Moyer
@ 2009-06-23  7:24   ` Ralf Gross
  2009-06-23 13:53     ` Jeff Moyer
  0 siblings, 1 reply; 20+ messages in thread
From: Ralf Gross @ 2009-06-23  7:24 UTC (permalink / raw)
  To: linux-kernel; +Cc: fengguang.wu

Jeff Moyer schrieb:
[...]
> > Re. read ahead: I tested different settings from 8Kb to 65Kb, this
> > didn't help.
> >
> > I'll do some more tests when the backups are done (3-4 more days).
>
> The default is 128KB, I believe, so it's strange that you would test
> smaller values. ;)  I would try something along the lines of 1 or 2 MB.

Err, yes, this should have been MB not KB.

$cat /sys/block/sdc/queue/read_ahead_kb
16384
$cat /sys/block/sdd/queue/read_ahead_kb
16384

I also tried different values for max_sectors_kb and nr_requests. But
the trend that writes were much faster than reads while there was read
and write load on the device didn't change.

Changing the deadline parameters writes_starved, write_expire,
read_expire, front_merges or fifo_batch didn't change this behavior.

Ralf
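A small bash loop like the following collects all of the queue settings
Ralf lists further down in one pass (a sketch; it assumes the md members
are sdc and sdd as in this thread and that the deadline scheduler is
active, so the iosched directory holds the deadline tunables):

    for dev in sdc sdd; do
        echo "== $dev =="
        grep . /sys/block/$dev/queue/{scheduler,read_ahead_kb,max_sectors_kb,nr_requests} \
               /sys/block/$dev/queue/iosched/*
    done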
* Re: io-scheduler tuning for better read/write ratio
  2009-06-23  7:24 ` Ralf Gross
@ 2009-06-23 13:53   ` Jeff Moyer
  2009-06-24  7:25     ` Ralf Gross
  0 siblings, 1 reply; 20+ messages in thread
From: Jeff Moyer @ 2009-06-23 13:53 UTC (permalink / raw)
  To: Ralf Gross; +Cc: linux-kernel, fengguang.wu

Ralf Gross <Ralf-Lists@ralfgross.de> writes:

[...]

> $cat /sys/block/sdc/queue/read_ahead_kb
> 16384
> $cat /sys/block/sdd/queue/read_ahead_kb
> 16384
>
> I also tried different values for max_sectors_kb and nr_requests. But
> the trend that writes were much faster than reads while there was read
> and write load on the device didn't change.
>
> Changing the deadline parameters writes_starved, write_expire,
> read_expire, front_merges or fifo_batch didn't change this behavior.

OK, bumping up readahead and changing the deadline parameters listed
should have given some better results, I would think. Can you give the
invocation of tiobench you used so I can try to reproduce this?

Thanks!
Jeff
* Re: io-scheduler tuning for better read/write ratio
  2009-06-23 13:53 ` Jeff Moyer
@ 2009-06-24  7:25   ` Ralf Gross
  2009-06-24  7:57     ` Al Boldi
  0 siblings, 1 reply; 20+ messages in thread
From: Ralf Gross @ 2009-06-24  7:25 UTC (permalink / raw)
  To: linux-kernel, fengguang.wu

Jeff Moyer schrieb:
[...]
> OK, bumping up readahead and changing the deadline parameters listed
> should have given some better results, I would think. Can you give the
> invocation of tiobench you used so I can try to reproduce this?

The main problem is with bacula. It reads/writes from/to two spool
files on the same device.

I get the same behavior with 2 dd processes, one reading from disk, one
writing to it.

Here's the output from dstat (5 sec interval).

--dsk/md1--
_read _writ
  26M   95M
  31M   96M
  20M   85M
  31M  108M
  28M   89M
  24M   95M
  26M   79M
  32M  115M
  50M   74M
 129M   15k
 147M 1638B
 147M     0
 147M     0
 113M     0

At the end I stopped the dd process that is writing to the device, so
you can see that the md device is capable of reading with >120 MB/s.

I did this with these two commands:

dd if=/dev/zero of=test bs=1MB
dd if=/dev/md1 of=/dev/null bs=1M

Maybe this is too simple, but with a real-world application I see the
same behavior.

md1 is a md raid 0 device with 2 disks.

md1 : active raid0 sdc[0] sdd[1]
      781422592 blocks 64k chunks

sdc:
/sys/block/sdc/queue/hw_sector_size          512
/sys/block/sdc/queue/max_hw_sectors_kb       32767
/sys/block/sdc/queue/max_sectors_kb          512
/sys/block/sdc/queue/nomerges                0
/sys/block/sdc/queue/nr_requests             128
/sys/block/sdc/queue/read_ahead_kb           16384
/sys/block/sdc/queue/scheduler               noop anticipatory [deadline] cfq
/sys/block/sdc/queue/iosched/fifo_batch      16
/sys/block/sdc/queue/iosched/front_merges    1
/sys/block/sdc/queue/iosched/read_expire     500
/sys/block/sdc/queue/iosched/write_expire    5000
/sys/block/sdc/queue/iosched/writes_starved  2

sdd:
/sys/block/sdd/queue/hw_sector_size          512
/sys/block/sdd/queue/max_hw_sectors_kb       32767
/sys/block/sdd/queue/max_sectors_kb          512
/sys/block/sdd/queue/nomerges                0
/sys/block/sdd/queue/nr_requests             128
/sys/block/sdd/queue/read_ahead_kb           16384
/sys/block/sdd/queue/scheduler               noop anticipatory [deadline] cfq
/sys/block/sdd/queue/iosched/fifo_batch      16
/sys/block/sdd/queue/iosched/front_merges    1
/sys/block/sdd/queue/iosched/read_expire     500
/sys/block/sdd/queue/iosched/write_expire    5000
/sys/block/sdd/queue/iosched/writes_starved  2

The deadline parameters are the default ones. Setting writes_starved
much higher, I expected a change in the read/write ratio, but didn't
see any change.

Ralf
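A minimal way to reproduce this competing-stream measurement (a sketch,
assuming the array is /dev/md1 and the filesystem on it is mounted in
the current directory, as in Ralf's setup; stop the commands by hand
when you have enough samples):

    dstat -D md1 -d 5 &                        # report md1 throughput every 5 seconds
    DSTAT=$!
    dd if=/dev/md1 of=/dev/null bs=1M &        # sequential reader
    dd if=/dev/zero of=test bs=1M count=30000  # sequential writer, ~30 GB like a spool file
    kill $DSTAT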
* Re: io-scheduler tuning for better read/write ratio
  2009-06-24  7:25 ` Ralf Gross
@ 2009-06-24  7:57   ` Al Boldi
  2009-06-25  7:26     ` Ralf Gross
  2009-06-25  7:27     ` Ralf Gross
  0 siblings, 2 replies; 20+ messages in thread
From: Al Boldi @ 2009-06-24  7:57 UTC (permalink / raw)
  To: Ralf Gross; +Cc: linux-kernel, fengguang.wu

Ralf Gross wrote:
> The main problem is with bacula. It reads/writes from/to two spool
> files on the same device.
>
> I get the same behavior with 2 dd processes, one reading from disk,
> one writing to it.
>
[...]
>
> I did this with these two commands:
>
> dd if=/dev/zero of=test bs=1MB
> dd if=/dev/md1 of=/dev/null bs=1M

Try changing /proc/sys/vm/dirty_ratio = 1

Thanks!

--
Al
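vm.dirty_ratio caps how much of system memory may hold dirty (not yet
written back) page cache before writers are throttled; lowering it keeps
the write stream from building a huge backlog ahead of the reader. A
quick sketch of Al's suggestion (the values are illustrative):

    cat /proc/sys/vm/dirty_ratio          # default is commonly 10 or 40, depending on kernel
    echo 1 > /proc/sys/vm/dirty_ratio     # throttle writers very early
    sysctl -w vm.dirty_ratio=1            # equivalent via sysctl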
* Re: io-scheduler tuning for better read/write ratio
  2009-06-24  7:57 ` Al Boldi
@ 2009-06-25  7:26   ` Ralf Gross
  2009-06-25 13:45     ` Al Boldi
  0 siblings, 1 reply; 20+ messages in thread
From: Ralf Gross @ 2009-06-25  7:26 UTC (permalink / raw)
  To: Al Boldi; +Cc: linux-kernel, fengguang.wu

Al Boldi schrieb:
> Ralf Gross wrote:
[...]
> Try changing /proc/sys/vm/dirty_ratio = 1

$cat /proc/sys/vm/dirty_ratio
1

$dstat -D md1 -d 5
--dsk/md1--
_read _writ
  18M   18M
    0     0
 820k  101M
  18M  113M
  26M   73M
  26M  110M
  32M  100M
  19M  111M
  13M  117M
  13M  142M
  32M   88M
  26M   99M
  38M   58M

No change. Even setting dirty_ratio to 100 didn't show any difference.

With the cfq scheduler and slice_idle = 24 (trial and error) I get
better results. I tried this before, but the overall throughput was a
bit lower than with deadline. It seems that I cannot tune deadline to
get the same behavior.

--dsk/md1--
_read _writ
  18M   18M
  25M   77M
  51M   65M
  51M   47M
  62M   45M
  53M   28M
  45M   43M
  46M   47M
  47M   42M
  51M   41M
  38M   51M
  51M   40M
  45M   40M
  58M   42M
  69M   41M
  72M   42M
 122M     0
 141M  340k

--dsk/md1--
_read _writ
 139M  562k
 136M     0
 141M   13k
  64M     0
1638B  104M
    0  110M
    0  122M
    0  104M
    0  108M

The last numbers are for reading/writing only.

Ralf
* Re: io-scheduler tuning for better read/write ratio
  2009-06-25  7:26 ` Ralf Gross
@ 2009-06-25 13:45   ` Al Boldi
  0 siblings, 0 replies; 20+ messages in thread
From: Al Boldi @ 2009-06-25 13:45 UTC (permalink / raw)
  To: Ralf Gross; +Cc: linux-kernel, fengguang.wu

Ralf Gross wrote:
> Al Boldi schrieb:
> > Try changing /proc/sys/vm/dirty_ratio = 1
>
[...]
>
> No change. Even setting dirty_ratio to 100 didn't show any difference.

What's your readahead? Do a blockdev --getra /dev/sdX and /dev/mdX.

Try increasing it, while keeping dirty_ratio low.

Thanks!

--
Al
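blockdev reports and sets readahead in 512-byte sectors, and the md
device carries its own readahead setting separate from the member
disks, which is why Al asks for both. A sketch of his combination
(device names taken from this thread, values illustrative):

    blockdev --getra /dev/sdc /dev/sdd /dev/md1   # current readahead, in 512-byte sectors
    blockdev --setra 32768 /dev/md1               # 16 MB readahead on the array
    echo 1 > /proc/sys/vm/dirty_ratio             # keep the dirty page cache small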
* Re: io-scheduler tuning for better read/write ratio
  2009-06-22 19:42 ` Jeff Moyer
  2009-06-23  7:24   ` Ralf Gross
@ 2009-06-26  2:19   ` Wu Fengguang
  2009-06-26 10:44     ` Jens Axboe
  0 siblings, 1 reply; 20+ messages in thread
From: Wu Fengguang @ 2009-06-26  2:19 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Ralf Gross, linux-kernel@vger.kernel.org, linux-fsdevel, Jens Axboe

On Tue, Jun 23, 2009 at 03:42:46AM +0800, Jeff Moyer wrote:
> Ralf Gross <rg@STZ-Softwaretechnik.com> writes:
[...]
> The default is 128KB, I believe, so it's strange that you would test
> smaller values. ;)  I would try something along the lines of 1 or 2 MB.
>
> I'm CCing Fengguang in case he has any suggestions.

Jeff, thank you for the forwarding (and sorry for the long delay)!

The read:write (or rather sync:async) ratio control is an IO scheduler
feature. CFQ has parameters slice_sync and slice_async for that.
What's more, CFQ will let async IO wait if there is any in-flight sync
IO. This is good, but not quite enough. Normally sync IOs come one by
one, with some small idle time window in between. If we only start
dispatching async IOs after the last sync IO has completed for, e.g.,
1 ms, then we may stop the async background write IOs when there are
active sync foreground read IO streams.

This simple patch aims to address the writes-push-aside-reads problem.
Ralf, you can try applying this patch and run your workload with this
(huge) CFQ parameter:

	echo 1000 > /sys/block/sda/queue/iosched/slice_sync

The patch is based on 2.6.30, but can be trivially backported if you
want to use some old kernel.

It may impact overall (sync+async) IO throughput when there are one or
more ongoing sync IO streams, so it requires considerable benchmarks
and adjustments.

Thanks,
Fengguang
---

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index a55a9bd..14011b7 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1064,7 +1064,6 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
 	if (blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag)
 		return;
 
-	WARN_ON(!RB_EMPTY_ROOT(&cfqq->sort_list));
 	WARN_ON(cfq_cfqq_slice_new(cfqq));
 
 	/*
@@ -2175,8 +2174,6 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
 	 * or if we want to idle in case it has no pending requests.
 	 */
 	if (cfqd->active_queue == cfqq) {
-		const bool cfqq_empty = RB_EMPTY_ROOT(&cfqq->sort_list);
-
 		if (cfq_cfqq_slice_new(cfqq)) {
 			cfq_set_prio_slice(cfqd, cfqq);
 			cfq_clear_cfqq_slice_new(cfqq);
@@ -2190,8 +2187,8 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
 		 */
 		if (cfq_slice_used(cfqq) || cfq_class_idle(cfqq))
 			cfq_slice_expired(cfqd, 1);
-		else if (cfqq_empty && !cfq_close_cooperator(cfqd, cfqq, 1) &&
-			 sync && !rq_noidle(rq))
+		else if (sync && !rq_noidle(rq) &&
+			 !cfq_close_cooperator(cfqd, cfqq, 1))
 			cfq_arm_slice_timer(cfqd);
 	}
* Re: io-scheduler tuning for better read/write ratio
  2009-06-26  2:19 ` Wu Fengguang
@ 2009-06-26 10:44   ` Jens Axboe
  2009-06-27  3:46     ` Wu Fengguang
  0 siblings, 1 reply; 20+ messages in thread
From: Jens Axboe @ 2009-06-26 10:44 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jeff Moyer, Ralf Gross, linux-kernel@vger.kernel.org, linux-fsdevel

On Fri, Jun 26 2009, Wu Fengguang wrote:
[...]
> This simple patch aims to address the writes-push-aside-reads problem.
> Ralf, you can try applying this patch and run your workload with this
> (huge) CFQ parameter:
>
> 	echo 1000 > /sys/block/sda/queue/iosched/slice_sync
>
[...]
> @@ -2190,8 +2187,8 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
>  		 */
>  		if (cfq_slice_used(cfqq) || cfq_class_idle(cfqq))
>  			cfq_slice_expired(cfqd, 1);
> -		else if (cfqq_empty && !cfq_close_cooperator(cfqd, cfqq, 1) &&
> -			 sync && !rq_noidle(rq))
> +		else if (sync && !rq_noidle(rq) &&
> +			 !cfq_close_cooperator(cfqd, cfqq, 1))
>  			cfq_arm_slice_timer(cfqd);
>  	}

What's the purpose of this patch? If you have requests pending you don't
want to arm the idle timer and wait, you want to dispatch those.

--
Jens Axboe
* Re: io-scheduler tuning for better read/write ratio
  2009-06-26 10:44 ` Jens Axboe
@ 2009-06-27  3:46   ` Wu Fengguang
  2009-06-29  9:47     ` Ralf Gross
  0 siblings, 1 reply; 20+ messages in thread
From: Wu Fengguang @ 2009-06-27  3:46 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jeff Moyer, Ralf Gross, linux-kernel@vger.kernel.org,
      linux-fsdevel@vger.kernel.org

On Fri, Jun 26, 2009 at 06:44:06PM +0800, Jens Axboe wrote:
> On Fri, Jun 26 2009, Wu Fengguang wrote:
[...]
> What's the purpose of this patch? If you have requests pending you don't
> want to arm the idle timer and wait, you want to dispatch those.

You are right, please ignore this mindless hacking patch.

Ralf, you can control the read/write ratio in the CFQ scheduler by
tuning the slice_sync/slice_async parameters.

For example,

	echo 10  > /sys/block/sda/queue/iosched/slice_async
	echo 100 > /sys/block/sda/queue/iosched/slice_sync

gives

-dsk/total-
 read  writ
  66M   25M
  65M   20M
  49M   32M
  84M   19M
  46M   28M
  61M   23M
  55M   25M
  67M   23M
  76M   18M
  46M   31M
  56M   29M
  54M   23M
  76M   20M

while

	echo 10  > /sys/block/sda/queue/iosched/slice_async
	echo 300 > /sys/block/sda/queue/iosched/slice_sync

gives

-dsk/total-
 read  writ
 102M   11M
  82M   10M
 100M   12M
  86M   10M
  95M   11M
 102M 3168k
  96M   11M
  88M   10M
  96M   12M

However, too large a slice_sync may not be desirable.

Thanks,
Fengguang
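Since Ralf's spool array is md1 built from sdc and sdd, the same tuning
would be applied to each member disk; slice_sync and slice_async only
exist while cfq is the active scheduler. A small sketch (paths and
values as used above, the loop itself is illustrative):

    for dev in sdc sdd; do
        echo cfq > /sys/block/$dev/queue/scheduler
        echo 10  > /sys/block/$dev/queue/iosched/slice_async
        echo 300 > /sys/block/$dev/queue/iosched/slice_sync
    done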
* Re: io-scheduler tuning for better read/write ratio
  2009-06-27  3:46 ` Wu Fengguang
@ 2009-06-29  9:47   ` Ralf Gross
  0 siblings, 0 replies; 20+ messages in thread
From: Ralf Gross @ 2009-06-29  9:47 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jens Axboe, Jeff Moyer, Ralf Gross, linux-kernel@vger.kernel.org,
      linux-fsdevel@vger.kernel.org

Wu Fengguang schrieb:
[...]
> Ralf, you can control the read/write ratio in the CFQ scheduler by
> tuning the slice_sync/slice_async parameters.
>
> For example,
>
> 	echo 10  > /sys/block/sda/queue/iosched/slice_async
> 	echo 100 > /sys/block/sda/queue/iosched/slice_sync
>
> gives
[...]

writing:
--dsk/md1--
_read _writ
    0  150M
    0  142M
    0  143M
    0  112M
    0  141M
    0  152M
    0  132M
    0  123M
    0  149M

reading:
--dsk/md1--
_read _writ
 143M     0
 145M     0
 160M     0
 128M     0
 148M     0
 140M     0
 158M     0
 130M     0
 122M     0

reading + writing:
--dsk/md1--
_read _writ
  55M   76M
  41M   83M
  64M   81M
  64M   83M
  63M   68M
  56M  117M
  41M   61M
  64M   87M
  64M   69M
  61M   87M
  67M   81M
  64M   33M
  63M   68M
  56M   76M

> while
>
> 	echo 10  > /sys/block/sda/queue/iosched/slice_async
> 	echo 300 > /sys/block/sda/queue/iosched/slice_sync
>
> gives
[...]
> However, too large a slice_sync may not be desirable.

writing:
--dsk/md1--
_read _writ
    0  131M
    0  136M
    0  145M
    0  136M
    0  128M
    0  150M
    0  127M
    0  149M
    0  127M
    0  156M
    0  125M
    0  142M

reading:
--dsk/md1--
_read _writ
 128M     0
 160M     0
 128M     0
 128M     0
 160M     0
 128M     0
 109M     0
 128M     0
 128M     0
 160M     0
 128M     0

writing:
--dsk/md1--
_read _writ
    0  183M
    0  142M
    0  137M
    0  147M
    0  135M
    0  147M
    0  117M
    0  135M
    0  156M
    0  120M
    0  147M
    0  135M

reading + writing:
--dsk/md1--
_read _writ
  96M   40M
  64M   38M
  96M   29M
  96M   24M
  96M   31M
  95M   35M
  97M   26M
  96M   23M
  96M   33M
  95M   73M
  91M   25M

Thanks, this seems to be what I was looking for. I'll change the
scheduler parameters for all spool devices and run a test backup with
two concurrent jobs. This will show me whether bacula behaves the same
as the simple dd test does.

Ralf
end of thread, other threads:[~2009-06-29  9:49 UTC | newest]

Thread overview: 20+ messages
2009-06-16 15:43 io-scheduler tuning for better read/write ratio Ralf Gross
2009-06-16 16:41 ` David Newall
2009-06-16 18:40 ` Ralf Gross
2009-06-16 18:43 ` Casey Dahlin
2009-06-16 18:56 ` Ralf Gross
2009-06-16 20:16 ` Jeff Moyer
2009-06-22 14:43 ` Jeff Moyer
2009-06-22 16:31 ` Ralf Gross
2009-06-22 19:42 ` Jeff Moyer
2009-06-23  7:24 ` Ralf Gross
2009-06-23 13:53 ` Jeff Moyer
2009-06-24  7:25 ` Ralf Gross
2009-06-24  7:57 ` Al Boldi
2009-06-25  7:26 ` Ralf Gross
2009-06-25 13:45 ` Al Boldi
2009-06-25  7:27 ` Ralf Gross
2009-06-26  2:19 ` Wu Fengguang
2009-06-26 10:44 ` Jens Axboe
2009-06-27  3:46 ` Wu Fengguang
2009-06-29  9:47 ` Ralf Gross