* [linux-lvm] IO scheduler, queue depth, nr_requests
@ 2004-02-16 9:11 Miquel van Smoorenburg
2004-02-16 8:30 ` [linux-lvm] " Jens Axboe
0 siblings, 1 reply; 59+ messages in thread
From: Miquel van Smoorenburg @ 2004-02-16 9:11 UTC (permalink / raw)
To: Jens Axboe; +Cc: linux-lvm
Hello,
as you might have seen from the linux-kernel mailinglist
I have been testing for months now with a fileserver set up to
use XFS over LVM2 on a 3ware RAID5 controller.
I asked for help several times on the list, but nobody really
replied, so now I'm taking a shot at mailing you directly, since
you appear to be the I/O request queueing guru of the kernel ;)
Cc: sent to linux-lvm@sistina.com. Any hint appreciated.
For some reason, when using LVM, write requests get queued out
of order to the 3ware controller, which results in quite a bit
of seeking and thus performance loss.
The default queue depth of the 3ware controller is 254. I found
out that lowering it to 64 in the driver fixed my problems, and
I advised 3ware support about this. They weren't really convinced..
By fiddling about today I just found that changing
/sys/block/sda/queue/nr_requests from 128 to something above
the queue depth of the 3ware controller (256 doesn't work,
384 and up do) also fixes the problem.
Does that actually make sense ?
Ah yes, I'm currently using 2.6.2 with a 3ware 8506-8 in
hardware raid5 mode, deadline scheduler, PIV 3.0 Ghz, 2 GB RAM.
Debug output ("mydd" works just like "dd", but has an fsync option):
- /mnt is an XFS filesystem on a LVM2 volume on the 3ware
- /mnt2 is an XFS filesystem directly on /dev/sda1 of the 3ware
- First on /mnt, the LVM partition. Note that a small "dd" runs
fast, a larger one runs slower:
# cd /mnt
# cat /sys/block/sda/device/queue_depth
254
# cat /sys/block/sda/queue/nr_requests
128
# ~/mydd --if /dev/zero --of file --bs 4096 --count 50000 --fsync
204800000 bytes transferred in 2.679812 seconds (76423271 bytes/sec)
# ~/mydd --if /dev/zero --of file --bs 4096 --count 100000 --fsync
409600000 bytes transferred in 9.501549 seconds (43108760 bytes/sec)
- Now I set the nr_requests to 512:
# echo 512 > /sys/block/sda/queue/nr_requests
# ~/mydd --if /dev/zero --of file --bs 4096 --count 100000 --fsync
409600000 bytes transferred in 5.374437 seconds (76212634 bytes/sec)
See that ? Weird thing is, it's only on LVM, directly on /dev/sda1
no problem at all:
# cat /sys/block/sda/device/queue_depth
254
# cat /sys/block/sda/queue/nr_requests
128
# ~/mydd --if /dev/zero --of file --bs 4096 --count 100000 --fsync
409600000 bytes transferred in 5.135642 seconds (79756338 bytes/sec)
Somehow, LVM is causing the requests to the underlying 3ware
device to get out of order, and increasing nr_requests to be
larger than the queue_depth of the device fixes this.
I tried the latest dm-patches in -mm (applied those to vanilla
2.6.2), which include a patch called dm-04-maintain-bio-ordering.patch
but that doesn't really help (at first I though otherwise, but the
tests scripts I used lowered the queue_depth of the 3ware to 64
by accident) - if anything, it makes things worse.
# ~/mydd --if /dev/zero --of file --bs 4096 --count 100000 --fsync
409600000 bytes transferred in 13.138224 seconds (31176208 bytes/sec)
Setting nr_requests to 512 fixes things up again.
Mike.
^ permalink raw reply [flat|nested] 59+ messages in thread* [linux-lvm] Re: IO scheduler, queue depth, nr_requests 2004-02-16 9:11 [linux-lvm] IO scheduler, queue depth, nr_requests Miquel van Smoorenburg @ 2004-02-16 8:30 ` Jens Axboe 2004-02-16 9:01 ` Joe Thornber ` (3 more replies) 0 siblings, 4 replies; 59+ messages in thread From: Jens Axboe @ 2004-02-16 8:30 UTC (permalink / raw) To: Miquel van Smoorenburg; +Cc: linux-lvm On Mon, Feb 16 2004, Miquel van Smoorenburg wrote: > Hello, > > as you might have seen from the linux-kernel mailinglist > I have been testing for months now with a fileserver set up to > use XFS over LVM2 on a 3ware RAID5 controller. > > I asked for help several times on the list, but nobody really > replied, so now I'm taking a shot at mailing you directly, since > you appear to be the I/O request queueing guru of the kernel ;) > Cc: sent to linux-lvm@sistina.com. Any hint appreciated. > > For some reason, when using LVM, write requests get queued out > of order to the 3ware controller, which results in quite a bit > of seeking and thus performance loss. > > The default queue depth of the 3ware controller is 254. I found > out that lowering it to 64 in the driver fixed my problems, and > I advised 3ware support about this. They weren't really convinced.. > > By fiddling about today I just found that changing > /sys/block/sda/queue/nr_requests from 128 to something above > the queue depth of the 3ware controller (256 doesn't work, > 384 and up do) also fixes the problem. > > Does that actually make sense ? Yes, it makes perfect sense, I've been aware of this problem for quite some time. If you look init_tag_map() in ll_rw_blk.c: if (depth > q->nr_requests / 2) { q->nr_requests = depth * 2; printk(KERN_INFO "%s: large TCQ depth: adjusted nr_requests " "to %lu\n", __FUNCTION__, q->nr_requests); } it pretty much matches the problem you outlined. Unfortunately, the tagging depth of SCSI drivers cannot be controlled unless they use the generic block tagging helpers, and to my knowledge only a single driver does... > Ah yes, I'm currently using 2.6.2 with a 3ware 8506-8 in > hardware raid5 mode, deadline scheduler, PIV 3.0 Ghz, 2 GB RAM. > > Debug output ("mydd" works just like "dd", but has an fsync option): > > - /mnt is an XFS filesystem on a LVM2 volume on the 3ware > - /mnt2 is an XFS filesystem directly on /dev/sda1 of the 3ware > > - First on /mnt, the LVM partition. Note that a small "dd" runs > fast, a larger one runs slower: > > # cd /mnt > # cat /sys/block/sda/device/queue_depth > 254 > # cat /sys/block/sda/queue/nr_requests > 128 > # ~/mydd --if /dev/zero --of file --bs 4096 --count 50000 --fsync > 204800000 bytes transferred in 2.679812 seconds (76423271 bytes/sec) > # ~/mydd --if /dev/zero --of file --bs 4096 --count 100000 --fsync > 409600000 bytes transferred in 9.501549 seconds (43108760 bytes/sec) > > - Now I set the nr_requests to 512: > # echo 512 > /sys/block/sda/queue/nr_requests > # ~/mydd --if /dev/zero --of file --bs 4096 --count 100000 --fsync > 409600000 bytes transferred in 5.374437 seconds (76212634 bytes/sec) > > See that ? Weird thing is, it's only on LVM, directly on /dev/sda1 > no problem at all: > > # cat /sys/block/sda/device/queue_depth > 254 > # cat /sys/block/sda/queue/nr_requests > 128 > # ~/mydd --if /dev/zero --of file --bs 4096 --count 100000 --fsync > 409600000 bytes transferred in 5.135642 seconds (79756338 bytes/sec) > > Somehow, LVM is causing the requests to the underlying 3ware > device to get out of order, and increasing nr_requests to be > larger than the queue_depth of the device fixes this. > > I tried the latest dm-patches in -mm (applied those to vanilla > 2.6.2), which include a patch called dm-04-maintain-bio-ordering.patch > but that doesn't really help (at first I though otherwise, but the > tests scripts I used lowered the queue_depth of the 3ware to 64 > by accident) - if anything, it makes things worse. > > # ~/mydd --if /dev/zero --of file --bs 4096 --count 100000 --fsync > 409600000 bytes transferred in 13.138224 seconds (31176208 bytes/sec) > > Setting nr_requests to 512 fixes things up again. Seems there's an extra problem here, the nr_requests vs depth problem should not be too problematic unless you have heavy random io. Doesn't look like dm is reordering (bio_list_add() adds to tail, flush_deferred_io() processes from head. direct queueing doesn't look like it's reordering). Can the dm folks verify this? Or, you are just being hit by the problem first listed - requests get no hold time in the io scheduler for merging, because the driver drains them too quickly because of this artificially huge queue depth. If you did some stats on average request size and io/sec rate that should tell you for sure. I don't know what you have behind the 3ware, but it's generally not advised to use more than 4 tags per spindle. -- Jens Axboe ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [linux-lvm] Re: IO scheduler, queue depth, nr_requests 2004-02-16 8:30 ` [linux-lvm] " Jens Axboe @ 2004-02-16 9:01 ` Joe Thornber 2004-02-16 19:32 ` Kevin P. Fleming ` (2 subsequent siblings) 3 siblings, 0 replies; 59+ messages in thread From: Joe Thornber @ 2004-02-16 9:01 UTC (permalink / raw) To: linux-lvm, Jens Axboe; +Cc: Miquel van Smoorenburg, linux-lvm On Mon, Feb 16, 2004 at 02:30:47PM +0100, Jens Axboe wrote: > Seems there's an extra problem here, the nr_requests vs depth problem > should not be too problematic unless you have heavy random io. Doesn't > look like dm is reordering (bio_list_add() adds to tail, > flush_deferred_io() processes from head. direct queueing doesn't look > like it's reordering). Can the dm folks verify this? Ordering is certainly maintained for the simple targets (linear, striped). Snapshots and mirroring still need some work in this regard, but I don't think these are begin used for the tests. The place that I am expecting dm to cause poor performance is due to this bit of stupidity in dm-table.c. It's on my TODO list, but is not getting to the op since no-one seems to be complaining about it yet. /* FIXME: Device-Mapper on top of RAID-0 breaks because DM * currently doesn't honor MD's merge_bvec_fn routine. * In this case, we'll force DM to use PAGE_SIZE or * smaller I/O, just to be safe. A better fix is in the * works, but add this for the time being so it will at * least operate correctly. */ if (q->merge_bvec_fn) rs->max_sectors = min_not_zero(rs->max_sectors, (unsigned short)(PAGE_SIZE >> 9)); - Joe ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [linux-lvm] Re: IO scheduler, queue depth, nr_requests 2004-02-16 8:30 ` [linux-lvm] " Jens Axboe 2004-02-16 9:01 ` Joe Thornber @ 2004-02-16 19:32 ` Kevin P. Fleming 2004-02-17 1:46 ` Kevin P. Fleming 2004-02-18 10:29 ` Miquel van Smoorenburg 3 siblings, 0 replies; 59+ messages in thread From: Kevin P. Fleming @ 2004-02-16 19:32 UTC (permalink / raw) To: linux-lvm; +Cc: Miquel van Smoorenburg Jens Axboe wrote: > On Mon, Feb 16 2004, Miquel van Smoorenburg wrote: > >>Hello, >> >> as you might have seen from the linux-kernel mailinglist >>I have been testing for months now with a fileserver set up to >>use XFS over LVM2 on a 3ware RAID5 controller. >> I too have been fighting lower-than-expected performance with this identical combination, other than using a P4 2.8GHz instead of 3.0GHz, and 1GB RAM instead of 2GB. My 3ware 8506-8 has six Seagate Barracuda SATA drives behind it, which are very fast on their own :-) Try as I might, with the 2.6 kernel I have not been able to generate a bonnie++ or iozone run that provided performance better than a single disk, even with the 3ware card in JBOD, RAID-0 or RAID-1 mode. I had been conversing with 3ware about it, but to no real effect given than they don't support 2.6 yet. >>By fiddling about today I just found that changing >>/sys/block/sda/queue/nr_requests from 128 to something above >>the queue depth of the 3ware controller (256 doesn't work, >>384 and up do) also fixes the problem. I will try this as well this evening and report my results. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [linux-lvm] Re: IO scheduler, queue depth, nr_requests 2004-02-16 8:30 ` [linux-lvm] " Jens Axboe 2004-02-16 9:01 ` Joe Thornber 2004-02-16 19:32 ` Kevin P. Fleming @ 2004-02-17 1:46 ` Kevin P. Fleming 2004-02-18 10:29 ` Miquel van Smoorenburg 3 siblings, 0 replies; 59+ messages in thread From: Kevin P. Fleming @ 2004-02-17 1:46 UTC (permalink / raw) To: linux-lvm; +Cc: Miquel van Smoorenburg Jens Axboe wrote: >>By fiddling about today I just found that changing >>/sys/block/sda/queue/nr_requests from 128 to something above >>the queue depth of the 3ware controller (256 doesn't work, >>384 and up do) also fixes the problem. OK, I have run three sets of bonnie++ tests on my system. This was using the 2.6-bk kernel as of about midnight GMT of 2004-02-17. System is single P4HT 2.8GHz (SMP kernel), 1GB RAM (4GB highmem enabled), Intel 865G chipset, 3ware 8506-8 with six Seagate 160GB Barracuda SATA disks in a single RAID-5. bonnie++ tests run on an XFS filesystem (made with default mkfs.xfs parameters using a recent xfstools package) over a 200GB LVM2 volume (dm-linear). bonnie++ -r 512 -s 40960 -f -b Anticipatory scheduler, nr_requests=128 (default) Seq. Write 53772 Rand. Write 15557 Seq. Read 47730 Rand. Seek 146.5 Anticipatory scheduler, nr_requests=384 Seq. Write 53843 Rand. Write 17595 Seq. Read 44663 Rand. Seek 143.1 Deadline scheduler, nr_requests=384 Seq. Write 54973 Rand. Write 18897 Seq. Read 41476 Rand. Seek 227.3 (!) Miquel's suggestion has a definite positive effect, except on sequential reads. This system still doesn't produce anywhere near the throughput I'd expect though, and still doesn't come close to 2.4 numbers either (but those are somewhat bogus, as they rely on setting extremely large readahead settings). There may be some sort of hardware problem that is limiting me to ~60MB/s, although with 2.4 I was able to get sequential read numbers around 95MB/s, so I think the hardware is OK. ^ permalink raw reply [flat|nested] 59+ messages in thread
* [linux-lvm] Re: IO scheduler, queue depth, nr_requests 2004-02-16 8:30 ` [linux-lvm] " Jens Axboe ` (2 preceding siblings ...) 2004-02-17 1:46 ` Kevin P. Fleming @ 2004-02-18 10:29 ` Miquel van Smoorenburg 2004-02-19 8:51 ` [linux-lvm] " Miquel van Smoorenburg 3 siblings, 1 reply; 59+ messages in thread From: Miquel van Smoorenburg @ 2004-02-18 10:29 UTC (permalink / raw) To: Jens Axboe; +Cc: Miquel van Smoorenburg, linux-lvm, Joe Thornber On 2004.02.16 14:30, Jens Axboe wrote: > On Mon, Feb 16 2004, Miquel van Smoorenburg wrote: > > > By fiddling about today I just found that changing > > /sys/block/sda/queue/nr_requests from 128 to something above > > the queue depth of the 3ware controller (256 doesn't work, > > 384 and up do) also fixes the problem. > > [....] > > See that ? Weird thing is, it's only on LVM, directly on /dev/sda1 > > no problem at all: > > > > # cat /sys/block/sda/device/queue_depth > > 254 > > # cat /sys/block/sda/queue/nr_requests > > 128 > > # ~/mydd --if /dev/zero --of file --bs 4096 --count 100000 --fsync > > 409600000 bytes transferred in 5.135642 seconds (79756338 bytes/sec) > > > > Somehow, LVM is causing the requests to the underlying 3ware > > device to get out of order, and increasing nr_requests to be > > larger than the queue_depth of the device fixes this. > > Seems there's an extra problem here, the nr_requests vs depth problem > should not be too problematic unless you have heavy random io. Doesn't > look like dm is reordering (bio_list_add() adds to tail, > flush_deferred_io() processes from head. direct queueing doesn't look > like it's reordering). Can the dm folks verify this? > > Or, you are just being hit by the problem first listed - requests get no > hold time in the io scheduler for merging, because the driver drains > them too quickly because of this artificially huge queue depth. If you > did some stats on average request size and io/sec rate that should tell > you for sure. I don't know what you have behind the 3ware, but it's > generally not advised to use more than 4 tags per spindle. Okay I repeated some earlier tests, and I added some debug code in several places. I added logging to tw_scsi_queue() in the 3ware driver to log the start sector and length of each request. It logs something like: 3wdbg: id 119, lba = 0x2330bc33, num_sectors = 256 With a perl script, I can check if the requests are sent to the host in order. That outputs something like this: Consecutive: start 1180906348, length 7936 sec (3968 KB), requests: 31 Consecutive: start 1180906340, length 8 sec (4 KB), requests: 1 Consecutive: start 1180914292, length 7936 sec (3968 KB), requests: 31 Consecutive: start 1180914284, length 8 sec (4 KB), requests: 1 Consecutive: start 1180922236, length 7936 sec (3968 KB), requests: 31 Consecutive: start 1180922228, length 8 sec (4 KB), requests: 1 Consecutive: start 1180930180, length 7936 sec (3968 KB), requests: 31 See, 31 requests in order, then one request "backwards", then 31 in order, etc. I added some queue debug code as well, both the LVM2 queue and 3ware queue have the following settings: max_sectors: 256 max_phys_segments: 128 max_hw_segments: 62 hardsect_size: 512 max_segment_size: 65536 seg_boundary_mask: ffffffff Now 31 * 2 == 62 == max_hw_segments .. coincidence ? Weird thing is, still still only happens with LVM over 3ware raid5, not on /dev/sda1 of the 3ware directly. I added some printk's to scsi_request_fn() in scsi_lib.c to see if a requests was getting requeued - but no. Upping nr_requests to 2 * queue_depth does still fix things, but as you said that should not be necessary. This bugs me, I want to find out why this only happens with LVM and not without ... Mike. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: IO scheduler, queue depth, nr_requests 2004-02-18 10:29 ` Miquel van Smoorenburg @ 2004-02-19 8:51 ` Miquel van Smoorenburg 0 siblings, 0 replies; 59+ messages in thread From: Miquel van Smoorenburg @ 2004-02-18 23:52 UTC (permalink / raw) To: Jens Axboe; +Cc: Miquel van Smoorenburg, linux-lvm, linux-kernel, Joe Thornber On Tue, 17 Feb 2004 15:57:16, Miquel van Smoorenburg wrote: > For some reason, when using LVM, write requests get queued out > of order to the 3ware controller, which results in quite a bit > of seeking and thus performance loss. [..] > Okay I repeated some earlier tests, and I added some debug code in > several places. > > I added logging to tw_scsi_queue() in the 3ware driver to log the > start sector and length of each request. It logs something like: > 3wdbg: id 119, lba = 0x2330bc33, num_sectors = 256 > > With a perl script, I can check if the requests are sent to the > host in order. That outputs something like this: > > Consecutive: start 1180906348, length 7936 sec (3968 KB), requests: 31 > Consecutive: start 1180906340, length 8 sec (4 KB), requests: 1 > Consecutive: start 1180914292, length 7936 sec (3968 KB), requests: 31 > Consecutive: start 1180914284, length 8 sec (4 KB), requests: 1 > Consecutive: start 1180922236, length 7936 sec (3968 KB), requests: 31 > Consecutive: start 1180922228, length 8 sec (4 KB), requests: 1 > Consecutive: start 1180930180, length 7936 sec (3968 KB), requests: 31 > > See, 31 requests in order, then one request "backwards", then 31 in order, etc. I found out what causes this. It's get_request_wait(). When the request queue is full, and a new request needs to be created, __make_request() blocks in get_request_wait(). Another process wakes up first (pdflush / process submitting I/O itself / xfsdatad / etc) and sends the next bio's to __make_request(). In the mean time some free requests have become available, and the bios are merged into a new request. Those requests are submitted to the device. Then, get_request_wait() returns but the bio is not mergeable anymore - and that results in a backwards seek, severely limiting the I/O rate. Wouldn't it be better to allow the request allocation and queue the request, and /then/ put the process to sleep ? The queue will grow larger than nr_requests, but it does that anyway. Mike. ^ permalink raw reply [flat|nested] 59+ messages in thread
* [linux-lvm] Re: IO scheduler, queue depth, nr_requests @ 2004-02-19 8:51 ` Miquel van Smoorenburg 0 siblings, 0 replies; 59+ messages in thread From: Miquel van Smoorenburg @ 2004-02-19 8:51 UTC (permalink / raw) To: Jens Axboe; +Cc: Miquel van Smoorenburg, linux-lvm, linux-kernel, Joe Thornber On Tue, 17 Feb 2004 15:57:16, Miquel van Smoorenburg wrote: > For some reason, when using LVM, write requests get queued out > of order to the 3ware controller, which results in quite a bit > of seeking and thus performance loss. [..] > Okay I repeated some earlier tests, and I added some debug code in > several places. > > I added logging to tw_scsi_queue() in the 3ware driver to log the > start sector and length of each request. It logs something like: > 3wdbg: id 119, lba = 0x2330bc33, num_sectors = 256 > > With a perl script, I can check if the requests are sent to the > host in order. That outputs something like this: > > Consecutive: start 1180906348, length 7936 sec (3968 KB), requests: 31 > Consecutive: start 1180906340, length 8 sec (4 KB), requests: 1 > Consecutive: start 1180914292, length 7936 sec (3968 KB), requests: 31 > Consecutive: start 1180914284, length 8 sec (4 KB), requests: 1 > Consecutive: start 1180922236, length 7936 sec (3968 KB), requests: 31 > Consecutive: start 1180922228, length 8 sec (4 KB), requests: 1 > Consecutive: start 1180930180, length 7936 sec (3968 KB), requests: 31 > > See, 31 requests in order, then one request "backwards", then 31 in order, etc. I found out what causes this. It's get_request_wait(). When the request queue is full, and a new request needs to be created, __make_request() blocks in get_request_wait(). Another process wakes up first (pdflush / process submitting I/O itself / xfsdatad / etc) and sends the next bio's to __make_request(). In the mean time some free requests have become available, and the bios are merged into a new request. Those requests are submitted to the device. Then, get_request_wait() returns but the bio is not mergeable anymore - and that results in a backwards seek, severely limiting the I/O rate. Wouldn't it be better to allow the request allocation and queue the request, and /then/ put the process to sleep ? The queue will grow larger than nr_requests, but it does that anyway. Mike. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: IO scheduler, queue depth, nr_requests 2004-02-19 8:51 ` [linux-lvm] " Miquel van Smoorenburg @ 2004-02-19 8:51 ` Nick Piggin -1 siblings, 0 replies; 59+ messages in thread From: Nick Piggin @ 2004-02-19 1:24 UTC (permalink / raw) To: Miquel van Smoorenburg; +Cc: Jens Axboe, linux-lvm, linux-kernel, Joe Thornber Miquel van Smoorenburg wrote: >On Tue, 17 Feb 2004 15:57:16, Miquel van Smoorenburg wrote: > >>For some reason, when using LVM, write requests get queued out >>of order to the 3ware controller, which results in quite a bit >>of seeking and thus performance loss. >> >[..] > >>Okay I repeated some earlier tests, and I added some debug code in >>several places. >> >>I added logging to tw_scsi_queue() in the 3ware driver to log the >>start sector and length of each request. It logs something like: >>3wdbg: id 119, lba = 0x2330bc33, num_sectors = 256 >> >>With a perl script, I can check if the requests are sent to the >>host in order. That outputs something like this: >> >>Consecutive: start 1180906348, length 7936 sec (3968 KB), requests: 31 >>Consecutive: start 1180906340, length 8 sec (4 KB), requests: 1 >>Consecutive: start 1180914292, length 7936 sec (3968 KB), requests: 31 >>Consecutive: start 1180914284, length 8 sec (4 KB), requests: 1 >>Consecutive: start 1180922236, length 7936 sec (3968 KB), requests: 31 >>Consecutive: start 1180922228, length 8 sec (4 KB), requests: 1 >>Consecutive: start 1180930180, length 7936 sec (3968 KB), requests: 31 >> >>See, 31 requests in order, then one request "backwards", then 31 in order, etc. >> > >I found out what causes this. It's get_request_wait(). > >When the request queue is full, and a new request needs to be created, >__make_request() blocks in get_request_wait(). > >Another process wakes up first (pdflush / process submitting I/O itself / >xfsdatad / etc) and sends the next bio's to __make_request(). >In the mean time some free requests have become available, and the bios >are merged into a new request. Those requests are submitted to the device. > >Then, get_request_wait() returns but the bio is not mergeable anymore - >and that results in a backwards seek, severely limiting the I/O rate. > >Wouldn't it be better to allow the request allocation and queue the >request, and /then/ put the process to sleep ? The queue will grow larger >than nr_requests, but it does that anyway. > > The "batching" logic there should allow a process to submit a number of requests even above the nr_requests limit to prevent this interleave and context switching. Are you using tagged command queueing? What depth? ^ permalink raw reply [flat|nested] 59+ messages in thread
* [linux-lvm] Re: IO scheduler, queue depth, nr_requests @ 2004-02-19 8:51 ` Nick Piggin 0 siblings, 0 replies; 59+ messages in thread From: Nick Piggin @ 2004-02-19 8:51 UTC (permalink / raw) To: Miquel van Smoorenburg; +Cc: Jens Axboe, linux-lvm, linux-kernel, Joe Thornber Miquel van Smoorenburg wrote: >On Tue, 17 Feb 2004 15:57:16, Miquel van Smoorenburg wrote: > >>For some reason, when using LVM, write requests get queued out >>of order to the 3ware controller, which results in quite a bit >>of seeking and thus performance loss. >> >[..] > >>Okay I repeated some earlier tests, and I added some debug code in >>several places. >> >>I added logging to tw_scsi_queue() in the 3ware driver to log the >>start sector and length of each request. It logs something like: >>3wdbg: id 119, lba = 0x2330bc33, num_sectors = 256 >> >>With a perl script, I can check if the requests are sent to the >>host in order. That outputs something like this: >> >>Consecutive: start 1180906348, length 7936 sec (3968 KB), requests: 31 >>Consecutive: start 1180906340, length 8 sec (4 KB), requests: 1 >>Consecutive: start 1180914292, length 7936 sec (3968 KB), requests: 31 >>Consecutive: start 1180914284, length 8 sec (4 KB), requests: 1 >>Consecutive: start 1180922236, length 7936 sec (3968 KB), requests: 31 >>Consecutive: start 1180922228, length 8 sec (4 KB), requests: 1 >>Consecutive: start 1180930180, length 7936 sec (3968 KB), requests: 31 >> >>See, 31 requests in order, then one request "backwards", then 31 in order, etc. >> > >I found out what causes this. It's get_request_wait(). > >When the request queue is full, and a new request needs to be created, >__make_request() blocks in get_request_wait(). > >Another process wakes up first (pdflush / process submitting I/O itself / >xfsdatad / etc) and sends the next bio's to __make_request(). >In the mean time some free requests have become available, and the bios >are merged into a new request. Those requests are submitted to the device. > >Then, get_request_wait() returns but the bio is not mergeable anymore - >and that results in a backwards seek, severely limiting the I/O rate. > >Wouldn't it be better to allow the request allocation and queue the >request, and /then/ put the process to sleep ? The queue will grow larger >than nr_requests, but it does that anyway. > > The "batching" logic there should allow a process to submit a number of requests even above the nr_requests limit to prevent this interleave and context switching. Are you using tagged command queueing? What depth? ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: IO scheduler, queue depth, nr_requests 2004-02-19 8:51 ` [linux-lvm] " Nick Piggin @ 2004-02-19 9:00 ` Miquel van Smoorenburg -1 siblings, 0 replies; 59+ messages in thread From: Miquel van Smoorenburg @ 2004-02-19 1:52 UTC (permalink / raw) To: Nick Piggin Cc: Miquel van Smoorenburg, Jens Axboe, linux-lvm, linux-kernel, Joe Thornber On Thu, 19 Feb 2004 02:24:31, Nick Piggin wrote: > Miquel van Smoorenburg wrote: > > >I found out what causes this. It's get_request_wait(). > > > >When the request queue is full, and a new request needs to be created, > >__make_request() blocks in get_request_wait(). > > > >Another process wakes up first (pdflush / process submitting I/O itself / > >xfsdatad / etc) and sends the next bio's to __make_request(). > >In the mean time some free requests have become available, and the bios > >are merged into a new request. Those requests are submitted to the device. > > > >Then, get_request_wait() returns but the bio is not mergeable anymore - > >and that results in a backwards seek, severely limiting the I/O rate. > > > > The "batching" logic there should allow a process to submit > a number of requests even above the nr_requests limit to > prevent this interleave and context switching. > > Are you using tagged command queueing? What depth? No, I'm not using tagged command queueing. The 3ware controller is not a real scsi controller, the driver just emulates one. It's a raid5 controller that drives SATA disks. It has an internal request queue ("can_queu") of 254 outstanding commands. Because that is way bigger than nr_requests this happens - if I set nr_requests to 512, the problem goes away. But that shouldn't happen ;) I'm preparing a proof-of-concept patch now, if it works and I don't wedge the remote machine I'm testing this on I'll post it in a few minutes. Mike. > > > ^ permalink raw reply [flat|nested] 59+ messages in thread
* [linux-lvm] Re: IO scheduler, queue depth, nr_requests @ 2004-02-19 9:00 ` Miquel van Smoorenburg 0 siblings, 0 replies; 59+ messages in thread From: Miquel van Smoorenburg @ 2004-02-19 9:00 UTC (permalink / raw) To: Nick Piggin Cc: Miquel van Smoorenburg, Jens Axboe, linux-lvm, linux-kernel, Joe Thornber On Thu, 19 Feb 2004 02:24:31, Nick Piggin wrote: > Miquel van Smoorenburg wrote: > > >I found out what causes this. It's get_request_wait(). > > > >When the request queue is full, and a new request needs to be created, > >__make_request() blocks in get_request_wait(). > > > >Another process wakes up first (pdflush / process submitting I/O itself / > >xfsdatad / etc) and sends the next bio's to __make_request(). > >In the mean time some free requests have become available, and the bios > >are merged into a new request. Those requests are submitted to the device. > > > >Then, get_request_wait() returns but the bio is not mergeable anymore - > >and that results in a backwards seek, severely limiting the I/O rate. > > > > The "batching" logic there should allow a process to submit > a number of requests even above the nr_requests limit to > prevent this interleave and context switching. > > Are you using tagged command queueing? What depth? No, I'm not using tagged command queueing. The 3ware controller is not a real scsi controller, the driver just emulates one. It's a raid5 controller that drives SATA disks. It has an internal request queue ("can_queu") of 254 outstanding commands. Because that is way bigger than nr_requests this happens - if I set nr_requests to 512, the problem goes away. But that shouldn't happen ;) I'm preparing a proof-of-concept patch now, if it works and I don't wedge the remote machine I'm testing this on I'll post it in a few minutes. Mike. > > > ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: IO scheduler, queue depth, nr_requests 2004-02-19 9:00 ` [linux-lvm] " Miquel van Smoorenburg @ 2004-02-19 9:00 ` Nick Piggin -1 siblings, 0 replies; 59+ messages in thread From: Nick Piggin @ 2004-02-19 2:01 UTC (permalink / raw) To: Miquel van Smoorenburg; +Cc: Jens Axboe, linux-lvm, linux-kernel, Joe Thornber Miquel van Smoorenburg wrote: >On Thu, 19 Feb 2004 02:24:31, Nick Piggin wrote: > >>Miquel van Smoorenburg wrote: >> >> >>>I found out what causes this. It's get_request_wait(). >>> >>>When the request queue is full, and a new request needs to be created, >>>__make_request() blocks in get_request_wait(). >>> >>>Another process wakes up first (pdflush / process submitting I/O itself / >>>xfsdatad / etc) and sends the next bio's to __make_request(). >>>In the mean time some free requests have become available, and the bios >>>are merged into a new request. Those requests are submitted to the device. >>> >>>Then, get_request_wait() returns but the bio is not mergeable anymore - >>>and that results in a backwards seek, severely limiting the I/O rate. >>> >>> >>The "batching" logic there should allow a process to submit >>a number of requests even above the nr_requests limit to >>prevent this interleave and context switching. >> >>Are you using tagged command queueing? What depth? >> > >No, I'm not using tagged command queueing. The 3ware controller is not a >real scsi controller, the driver just emulates one. It's a raid5 controller >that drives SATA disks. It has an internal request queue ("can_queu") >of 254 outstanding commands. > This is what I mean by tagged command queueing. > Because that is way bigger than nr_requests >this happens - if I set nr_requests to 512, the problem goes away. But >that shouldn't happen ;) > > What shouldn't happen? >I'm preparing a proof-of-concept patch now, if it works and I don't wedge >the remote machine I'm testing this on I'll post it in a few minutes. > > I'm not very happy with forcing a process to sleep _after_ it has submitted a request... but I'd be interested to see exactly what your patch does. By far the best option is to use appropriately sized queues. The below patch is a start, but it unfortunately doesn't help drivers which use private queueing implementations. http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.3/2.6.3-mm1/broken-out/scale-nr_requests.patch ^ permalink raw reply [flat|nested] 59+ messages in thread
* [linux-lvm] Re: IO scheduler, queue depth, nr_requests @ 2004-02-19 9:00 ` Nick Piggin 0 siblings, 0 replies; 59+ messages in thread From: Nick Piggin @ 2004-02-19 9:00 UTC (permalink / raw) To: Miquel van Smoorenburg; +Cc: Jens Axboe, linux-lvm, linux-kernel, Joe Thornber Miquel van Smoorenburg wrote: >On Thu, 19 Feb 2004 02:24:31, Nick Piggin wrote: > >>Miquel van Smoorenburg wrote: >> >> >>>I found out what causes this. It's get_request_wait(). >>> >>>When the request queue is full, and a new request needs to be created, >>>__make_request() blocks in get_request_wait(). >>> >>>Another process wakes up first (pdflush / process submitting I/O itself / >>>xfsdatad / etc) and sends the next bio's to __make_request(). >>>In the mean time some free requests have become available, and the bios >>>are merged into a new request. Those requests are submitted to the device. >>> >>>Then, get_request_wait() returns but the bio is not mergeable anymore - >>>and that results in a backwards seek, severely limiting the I/O rate. >>> >>> >>The "batching" logic there should allow a process to submit >>a number of requests even above the nr_requests limit to >>prevent this interleave and context switching. >> >>Are you using tagged command queueing? What depth? >> > >No, I'm not using tagged command queueing. The 3ware controller is not a >real scsi controller, the driver just emulates one. It's a raid5 controller >that drives SATA disks. It has an internal request queue ("can_queu") >of 254 outstanding commands. > This is what I mean by tagged command queueing. > Because that is way bigger than nr_requests >this happens - if I set nr_requests to 512, the problem goes away. But >that shouldn't happen ;) > > What shouldn't happen? >I'm preparing a proof-of-concept patch now, if it works and I don't wedge >the remote machine I'm testing this on I'll post it in a few minutes. > > I'm not very happy with forcing a process to sleep _after_ it has submitted a request... but I'd be interested to see exactly what your patch does. By far the best option is to use appropriately sized queues. The below patch is a start, but it unfortunately doesn't help drivers which use private queueing implementations. http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.3/2.6.3-mm1/broken-out/scale-nr_requests.patch ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: IO scheduler, queue depth, nr_requests 2004-02-19 8:51 ` [linux-lvm] " Miquel van Smoorenburg @ 2004-02-19 8:50 ` Andrew Morton -1 siblings, 0 replies; 59+ messages in thread From: Andrew Morton @ 2004-02-19 1:26 UTC (permalink / raw) To: Miquel van Smoorenburg; +Cc: axboe, miquels, linux-lvm, linux-kernel, thornber Miquel van Smoorenburg <miquels@cistron.nl> wrote: > > On Tue, 17 Feb 2004 15:57:16, Miquel van Smoorenburg wrote: > > For some reason, when using LVM, write requests get queued out > > of order to the 3ware controller, which results in quite a bit > > of seeking and thus performance loss. > [..] > > Okay I repeated some earlier tests, and I added some debug code in > > several places. > > > > I added logging to tw_scsi_queue() in the 3ware driver to log the > > start sector and length of each request. It logs something like: > > 3wdbg: id 119, lba = 0x2330bc33, num_sectors = 256 > > > > With a perl script, I can check if the requests are sent to the > > host in order. That outputs something like this: > > > > Consecutive: start 1180906348, length 7936 sec (3968 KB), requests: 31 > > Consecutive: start 1180906340, length 8 sec (4 KB), requests: 1 > > Consecutive: start 1180914292, length 7936 sec (3968 KB), requests: 31 > > Consecutive: start 1180914284, length 8 sec (4 KB), requests: 1 > > Consecutive: start 1180922236, length 7936 sec (3968 KB), requests: 31 > > Consecutive: start 1180922228, length 8 sec (4 KB), requests: 1 > > Consecutive: start 1180930180, length 7936 sec (3968 KB), requests: 31 3968 / 31 = 128 exactly. I think when you say "requests: 31" you are actually referring to a single request which has 31 128k BIOs in it, yes? > > > > See, 31 requests in order, then one request "backwards", then 31 in order, etc. > > I found out what causes this. It's get_request_wait(). > > When the request queue is full, and a new request needs to be created, > __make_request() blocks in get_request_wait(). > > Another process wakes up first (pdflush / process submitting I/O itself / > xfsdatad / etc) and sends the next bio's to __make_request(). > In the mean time some free requests have become available, and the bios > are merged into a new request. Those requests are submitted to the device. > > Then, get_request_wait() returns but the bio is not mergeable anymore - > and that results in a backwards seek, severely limiting the I/O rate. > > Wouldn't it be better to allow the request allocation and queue the > request, and /then/ put the process to sleep ? The queue will grow larger > than nr_requests, but it does that anyway. That would help, but is a bit kludgy. What _should_ have happened was: a) The queue gets plugged b) The maximal-sized 31-bio request is queued c) The 4k request gets inserted *before* the 3968k request d) The queue gets unplugged, and both requests stream smoothly. (And this assumes that the queue was initially empty. ALmost certainly that was not the case). So the question is, why did the little, later 4k request not get itself inserted in front of the earlier, large 3968k request? Are you using md as well? If so, try removing the blk_run_queues() statements from md_thread() and md_do_sync(). And generally hunt around and try nuking blk_run_queues() statements from the device driver/md/dm layer - they may well be wrong. Is this problem specific to md? Can it be reproduced on a boring-old-disk? Does elevator=deadline change anything? ^ permalink raw reply [flat|nested] 59+ messages in thread
* [linux-lvm] Re: IO scheduler, queue depth, nr_requests @ 2004-02-19 8:50 ` Andrew Morton 0 siblings, 0 replies; 59+ messages in thread From: Andrew Morton @ 2004-02-19 8:50 UTC (permalink / raw) To: Miquel van Smoorenburg; +Cc: axboe, linux-lvm, linux-kernel, thornber Miquel van Smoorenburg <miquels@cistron.nl> wrote: > > On Tue, 17 Feb 2004 15:57:16, Miquel van Smoorenburg wrote: > > For some reason, when using LVM, write requests get queued out > > of order to the 3ware controller, which results in quite a bit > > of seeking and thus performance loss. > [..] > > Okay I repeated some earlier tests, and I added some debug code in > > several places. > > > > I added logging to tw_scsi_queue() in the 3ware driver to log the > > start sector and length of each request. It logs something like: > > 3wdbg: id 119, lba = 0x2330bc33, num_sectors = 256 > > > > With a perl script, I can check if the requests are sent to the > > host in order. That outputs something like this: > > > > Consecutive: start 1180906348, length 7936 sec (3968 KB), requests: 31 > > Consecutive: start 1180906340, length 8 sec (4 KB), requests: 1 > > Consecutive: start 1180914292, length 7936 sec (3968 KB), requests: 31 > > Consecutive: start 1180914284, length 8 sec (4 KB), requests: 1 > > Consecutive: start 1180922236, length 7936 sec (3968 KB), requests: 31 > > Consecutive: start 1180922228, length 8 sec (4 KB), requests: 1 > > Consecutive: start 1180930180, length 7936 sec (3968 KB), requests: 31 3968 / 31 = 128 exactly. I think when you say "requests: 31" you are actually referring to a single request which has 31 128k BIOs in it, yes? > > > > See, 31 requests in order, then one request "backwards", then 31 in order, etc. > > I found out what causes this. It's get_request_wait(). > > When the request queue is full, and a new request needs to be created, > __make_request() blocks in get_request_wait(). > > Another process wakes up first (pdflush / process submitting I/O itself / > xfsdatad / etc) and sends the next bio's to __make_request(). > In the mean time some free requests have become available, and the bios > are merged into a new request. Those requests are submitted to the device. > > Then, get_request_wait() returns but the bio is not mergeable anymore - > and that results in a backwards seek, severely limiting the I/O rate. > > Wouldn't it be better to allow the request allocation and queue the > request, and /then/ put the process to sleep ? The queue will grow larger > than nr_requests, but it does that anyway. That would help, but is a bit kludgy. What _should_ have happened was: a) The queue gets plugged b) The maximal-sized 31-bio request is queued c) The 4k request gets inserted *before* the 3968k request d) The queue gets unplugged, and both requests stream smoothly. (And this assumes that the queue was initially empty. ALmost certainly that was not the case). So the question is, why did the little, later 4k request not get itself inserted in front of the earlier, large 3968k request? Are you using md as well? If so, try removing the blk_run_queues() statements from md_thread() and md_do_sync(). And generally hunt around and try nuking blk_run_queues() statements from the device driver/md/dm layer - they may well be wrong. Is this problem specific to md? Can it be reproduced on a boring-old-disk? Does elevator=deadline change anything? ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: IO scheduler, queue depth, nr_requests 2004-02-19 8:50 ` [linux-lvm] " Andrew Morton @ 2004-02-19 9:08 ` Miquel van Smoorenburg -1 siblings, 0 replies; 59+ messages in thread From: Miquel van Smoorenburg @ 2004-02-19 2:11 UTC (permalink / raw) To: Andrew Morton Cc: Miquel van Smoorenburg, axboe, miquels, linux-lvm, linux-kernel, thornber On Thu, 19 Feb 2004 02:26:22, Andrew Morton wrote: > Miquel van Smoorenburg <miquels@cistron.nl> wrote: > > > > On Tue, 17 Feb 2004 15:57:16, Miquel van Smoorenburg wrote: > > > For some reason, when using LVM, write requests get queued out > > > of order to the 3ware controller, which results in quite a bit > > > of seeking and thus performance loss. > > [..] > > > Okay I repeated some earlier tests, and I added some debug code in > > > several places. > > > > > > I added logging to tw_scsi_queue() in the 3ware driver to log the > > > start sector and length of each request. It logs something like: > > > 3wdbg: id 119, lba = 0x2330bc33, num_sectors = 256 > > > > > > With a perl script, I can check if the requests are sent to the > > > host in order. That outputs something like this: > > > > > > Consecutive: start 1180906348, length 7936 sec (3968 KB), requests: 31 > > > Consecutive: start 1180906340, length 8 sec (4 KB), requests: 1 > > > Consecutive: start 1180914292, length 7936 sec (3968 KB), requests: 31 > > > Consecutive: start 1180914284, length 8 sec (4 KB), requests: 1 > > > Consecutive: start 1180922236, length 7936 sec (3968 KB), requests: 31 > > > Consecutive: start 1180922228, length 8 sec (4 KB), requests: 1 > > > Consecutive: start 1180930180, length 7936 sec (3968 KB), requests: 31 > > 3968 / 31 = 128 exactly. I think when you say "requests: 31" you are > actually referring to a single request which has 31 128k BIOs in it, yes? No, I'm actually referring to a struct request. I'm logging this in the SCSI layer, in scsi_request_fn(), just after elv_next_request(). I have in fact logged all the bio's submitted to __make_request, and the output of the elevator from elv_next_request(). The bio's are submitted sequentially, the resulting requests aren't. But this is because nr_requests is 128, while the 3ware device has a queue of 254 entries (no tagging though). Upping nr_requests to 512 makes this go away .. That shouldn't be necessary though. I only see this with LVM over 3ware-raid5, not on the 3ware-raid5 array directly (/dev/sda1). And it gets less troublesome with a lot of debugging (unless I set nr_requests lower again), which points to a timing issue. > > > See, 31 requests in order, then one request "backwards", then 31 in order, etc. > > > > I found out what causes this. It's get_request_wait(). > > > > When the request queue is full, and a new request needs to be created, > > __make_request() blocks in get_request_wait(). > > > > Another process wakes up first (pdflush / process submitting I/O itself / > > xfsdatad / etc) and sends the next bio's to __make_request(). > > In the mean time some free requests have become available, and the bios > > are merged into a new request. Those requests are submitted to the device. > > > > Then, get_request_wait() returns but the bio is not mergeable anymore - > > and that results in a backwards seek, severely limiting the I/O rate. > > > > Wouldn't it be better to allow the request allocation and queue the > > request, and /then/ put the process to sleep ? The queue will grow larger > > than nr_requests, but it does that anyway. > > That would help, but is a bit kludgy. Patch that does exactly this (it's probably butt-ugly, but it proves the point) below. > What _should_ have happened was: > > a) The queue gets plugged > > b) The maximal-sized 31-bio request is queued > > c) The 4k request gets inserted *before* the 3968k request > > d) The queue gets unplugged, and both requests stream smoothly. > > (And this assumes that the queue was initially empty. ALmost certainly > that was not the case). The thing is, the bio's are submitted perfectly sequentially. It's just that every so often a request (with just its initial bio) gets stuck for a while. Because the bio's after it are merged and sent to the device, it's not possible to merge that stuck request later on when it gets unstuck, because the other bio's have already left the building so to speak. I'm not using MD at all, and I tried all elevators. All the same results, so I stayed with elevator=noop because it's so much easier to understand ;) This proof-of-concept patch fixes things: ll_rw_blk.c.patch (Hmm, 03:11, time to go to bed. I'm happy I finally found this though!) --- linux-2.6.3/drivers/block/ll_rw_blk.c.ORIG 2004-02-04 04:43:10.000000000 +0100 +++ linux-2.6.3/drivers/block/ll_rw_blk.c 2004-02-19 02:51:06.000000000 +0100 @@ -1545,7 +1545,8 @@ /* * Get a free request, queue_lock must not be held */ -static struct request *get_request(request_queue_t *q, int rw, int gfp_mask) +static struct request *get_request(request_queue_t *q, int rw, + int gfp_mask, int *qfull) { struct request *rq = NULL; struct request_list *rl = &q->rq; @@ -1569,10 +1570,15 @@ && !ioc_batching(ioc) && !elv_may_queue(q, rw)) { /* * The queue is full and the allocating process is not a - * "batcher", and not exempted by the IO scheduler + * "batcher", and not exempted by the IO scheduler. + * If "qfull" is a valid pointer, set it to 1 to return + * this info to the caller so that it can sleep. */ - spin_unlock_irq(q->queue_lock); - goto out; + if (qfull == NULL) { + spin_unlock_irq(q->queue_lock); + goto out; + } else + *qfull = 1; } rl->count[rw]++; @@ -1627,7 +1633,7 @@ * No available requests for this queue, unplug the device and wait for some * requests to become available. */ -static struct request *get_request_wait(request_queue_t *q, int rw) +static struct request *get_request_wait(request_queue_t *q, int rw, int wantreq) { DEFINE_WAIT(wait); struct request *rq; @@ -1639,7 +1645,7 @@ prepare_to_wait_exclusive(&rl->wait[rw], &wait, TASK_UNINTERRUPTIBLE); - rq = get_request(q, rw, GFP_NOIO); + rq = wantreq ? get_request(q, rw, GFP_NOIO, NULL) : NULL; if (!rq) { struct io_context *ioc; @@ -1657,7 +1663,7 @@ put_io_context(ioc); } finish_wait(&rl->wait[rw], &wait); - } while (!rq); + } while (!rq && wantreq); return rq; } @@ -1669,9 +1675,9 @@ BUG_ON(rw != READ && rw != WRITE); if (gfp_mask & __GFP_WAIT) - rq = get_request_wait(q, rw); + rq = get_request_wait(q, rw, 1); else - rq = get_request(q, rw, gfp_mask); + rq = get_request(q, rw, gfp_mask, NULL); return rq; } @@ -2002,6 +2008,7 @@ struct request *req, *freereq = NULL; int el_ret, rw, nr_sectors, cur_nr_sectors, barrier, ra; sector_t sector; + int qfull; sector = bio->bi_sector; nr_sectors = bio_sectors(bio); @@ -2023,6 +2030,7 @@ ra = bio->bi_rw & (1 << BIO_RW_AHEAD); again: + qfull = 0; spin_lock_irq(q->queue_lock); if (elv_queue_empty(q)) { @@ -2096,14 +2104,15 @@ freereq = NULL; } else { spin_unlock_irq(q->queue_lock); - if ((freereq = get_request(q, rw, GFP_ATOMIC)) == NULL) { + freereq = get_request(q, rw, GFP_ATOMIC, ra ? NULL : &qfull); + if (freereq == NULL) { /* * READA bit set */ if (ra) goto end_io; - freereq = get_request_wait(q, rw); + freereq = get_request_wait(q, rw, 1); } goto again; } @@ -2141,6 +2150,13 @@ req->start_time = jiffies; add_request(q, req); + + if (qfull) { + spin_unlock_irq(q->queue_lock); + get_request_wait(q, rw, 0); + spin_lock_irq(q->queue_lock); + } + out: if (freereq) __blk_put_request(q, freereq); ^ permalink raw reply [flat|nested] 59+ messages in thread
* [linux-lvm] Re: IO scheduler, queue depth, nr_requests @ 2004-02-19 9:08 ` Miquel van Smoorenburg 0 siblings, 0 replies; 59+ messages in thread From: Miquel van Smoorenburg @ 2004-02-19 9:08 UTC (permalink / raw) To: Andrew Morton; +Cc: Miquel van Smoorenburg, axboe On Thu, 19 Feb 2004 02:26:22, Andrew Morton wrote: > Miquel van Smoorenburg <miquels@cistron.nl> wrote: > > > > On Tue, 17 Feb 2004 15:57:16, Miquel van Smoorenburg wrote: > > > For some reason, when using LVM, write requests get queued out > > > of order to the 3ware controller, which results in quite a bit > > > of seeking and thus performance loss. > > [..] > > > Okay I repeated some earlier tests, and I added some debug code in > > > several places. > > > > > > I added logging to tw_scsi_queue() in the 3ware driver to log the > > > start sector and length of each request. It logs something like: > > > 3wdbg: id 119, lba = 0x2330bc33, num_sectors = 256 > > > > > > With a perl script, I can check if the requests are sent to the > > > host in order. That outputs something like this: > > > > > > Consecutive: start 1180906348, length 7936 sec (3968 KB), requests: 31 > > > Consecutive: start 1180906340, length 8 sec (4 KB), requests: 1 > > > Consecutive: start 1180914292, length 7936 sec (3968 KB), requests: 31 > > > Consecutive: start 1180914284, length 8 sec (4 KB), requests: 1 > > > Consecutive: start 1180922236, length 7936 sec (3968 KB), requests: 31 > > > Consecutive: start 1180922228, length 8 sec (4 KB), requests: 1 > > > Consecutive: start 1180930180, length 7936 sec (3968 KB), requests: 31 > > 3968 / 31 = 128 exactly. I think when you say "requests: 31" you are > actually referring to a single request which has 31 128k BIOs in it, yes? No, I'm actually referring to a struct request. I'm logging this in the SCSI layer, in scsi_request_fn(), just after elv_next_request(). I have in fact logged all the bio's submitted to __make_request, and the output of the elevator from elv_next_request(). The bio's are submitted sequentially, the resulting requests aren't. But this is because nr_requests is 128, while the 3ware device has a queue of 254 entries (no tagging though). Upping nr_requests to 512 makes this go away .. That shouldn't be necessary though. I only see this with LVM over 3ware-raid5, not on the 3ware-raid5 array directly (/dev/sda1). And it gets less troublesome with a lot of debugging (unless I set nr_requests lower again), which points to a timing issue. > > > See, 31 requests in order, then one request "backwards", then 31 in order, etc. > > > > I found out what causes this. It's get_request_wait(). > > > > When the request queue is full, and a new request needs to be created, > > __make_request() blocks in get_request_wait(). > > > > Another process wakes up first (pdflush / process submitting I/O itself / > > xfsdatad / etc) and sends the next bio's to __make_request(). > > In the mean time some free requests have become available, and the bios > > are merged into a new request. Those requests are submitted to the device. > > > > Then, get_request_wait() returns but the bio is not mergeable anymore - > > and that results in a backwards seek, severely limiting the I/O rate. > > > > Wouldn't it be better to allow the request allocation and queue the > > request, and /then/ put the process to sleep ? The queue will grow larger > > than nr_requests, but it does that anyway. > > That would help, but is a bit kludgy. Patch that does exactly this (it's probably butt-ugly, but it proves the point) below. > What _should_ have happened was: > > a) The queue gets plugged > > b) The maximal-sized 31-bio request is queued > > c) The 4k request gets inserted *before* the 3968k request > > d) The queue gets unplugged, and both requests stream smoothly. > > (And this assumes that the queue was initially empty. ALmost certainly > that was not the case). The thing is, the bio's are submitted perfectly sequentially. It's just that every so often a request (with just its initial bio) gets stuck for a while. Because the bio's after it are merged and sent to the device, it's not possible to merge that stuck request later on when it gets unstuck, because the other bio's have already left the building so to speak. I'm not using MD at all, and I tried all elevators. All the same results, so I stayed with elevator=noop because it's so much easier to understand ;) This proof-of-concept patch fixes things: ll_rw_blk.c.patch (Hmm, 03:11, time to go to bed. I'm happy I finally found this though!) --- linux-2.6.3/drivers/block/ll_rw_blk.c.ORIG 2004-02-04 04:43:10.000000000 +0100 +++ linux-2.6.3/drivers/block/ll_rw_blk.c 2004-02-19 02:51:06.000000000 +0100 @@ -1545,7 +1545,8 @@ /* * Get a free request, queue_lock must not be held */ -static struct request *get_request(request_queue_t *q, int rw, int gfp_mask) +static struct request *get_request(request_queue_t *q, int rw, + int gfp_mask, int *qfull) { struct request *rq = NULL; struct request_list *rl = &q->rq; @@ -1569,10 +1570,15 @@ && !ioc_batching(ioc) && !elv_may_queue(q, rw)) { /* * The queue is full and the allocating process is not a - * "batcher", and not exempted by the IO scheduler + * "batcher", and not exempted by the IO scheduler. + * If "qfull" is a valid pointer, set it to 1 to return + * this info to the caller so that it can sleep. */ - spin_unlock_irq(q->queue_lock); - goto out; + if (qfull == NULL) { + spin_unlock_irq(q->queue_lock); + goto out; + } else + *qfull = 1; } rl->count[rw]++; @@ -1627,7 +1633,7 @@ * No available requests for this queue, unplug the device and wait for some * requests to become available. */ -static struct request *get_request_wait(request_queue_t *q, int rw) +static struct request *get_request_wait(request_queue_t *q, int rw, int wantreq) { DEFINE_WAIT(wait); struct request *rq; @@ -1639,7 +1645,7 @@ prepare_to_wait_exclusive(&rl->wait[rw], &wait, TASK_UNINTERRUPTIBLE); - rq = get_request(q, rw, GFP_NOIO); + rq = wantreq ? get_request(q, rw, GFP_NOIO, NULL) : NULL; if (!rq) { struct io_context *ioc; @@ -1657,7 +1663,7 @@ put_io_context(ioc); } finish_wait(&rl->wait[rw], &wait); - } while (!rq); + } while (!rq && wantreq); return rq; } @@ -1669,9 +1675,9 @@ BUG_ON(rw != READ && rw != WRITE); if (gfp_mask & __GFP_WAIT) - rq = get_request_wait(q, rw); + rq = get_request_wait(q, rw, 1); else - rq = get_request(q, rw, gfp_mask); + rq = get_request(q, rw, gfp_mask, NULL); return rq; } @@ -2002,6 +2008,7 @@ struct request *req, *freereq = NULL; int el_ret, rw, nr_sectors, cur_nr_sectors, barrier, ra; sector_t sector; + int qfull; sector = bio->bi_sector; nr_sectors = bio_sectors(bio); @@ -2023,6 +2030,7 @@ ra = bio->bi_rw & (1 << BIO_RW_AHEAD); again: + qfull = 0; spin_lock_irq(q->queue_lock); if (elv_queue_empty(q)) { @@ -2096,14 +2104,15 @@ freereq = NULL; } else { spin_unlock_irq(q->queue_lock); - if ((freereq = get_request(q, rw, GFP_ATOMIC)) == NULL) { + freereq = get_request(q, rw, GFP_ATOMIC, ra ? NULL : &qfull); + if (freereq == NULL) { /* * READA bit set */ if (ra) goto end_io; - freereq = get_request_wait(q, rw); + freereq = get_request_wait(q, rw, 1); } goto again; } @@ -2141,6 +2150,13 @@ req->start_time = jiffies; add_request(q, req); + + if (qfull) { + spin_unlock_irq(q->queue_lock); + get_request_wait(q, rw, 0); + spin_lock_irq(q->queue_lock); + } + out: if (freereq) __blk_put_request(q, freereq); ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: IO scheduler, queue depth, nr_requests 2004-02-19 9:08 ` [linux-lvm] " Miquel van Smoorenburg @ 2004-02-19 9:08 ` Andrew Morton -1 siblings, 0 replies; 59+ messages in thread From: Andrew Morton @ 2004-02-19 2:26 UTC (permalink / raw) To: Miquel van Smoorenburg; +Cc: miquels, axboe, linux-lvm, linux-kernel, thornber Miquel van Smoorenburg <miquels@cistron.nl> wrote: > > The thing is, the bio's are submitted perfectly sequentially. It's just that > every so often a request (with just its initial bio) gets stuck for a while. > Because the bio's after it are merged and sent to the device, it's not > possible to merge that stuck request later on when it gets unstuck, because > the other bio's have already left the building so to speak. Oh. So the raid controller's queue depth is larger than the kernel's. So everything gets immediately shovelled into the device and the kernel is left with nothing to merge the little request against. Shouldn't the controller itself be performing the insertion? ^ permalink raw reply [flat|nested] 59+ messages in thread
* [linux-lvm] Re: IO scheduler, queue depth, nr_requests @ 2004-02-19 9:08 ` Andrew Morton 0 siblings, 0 replies; 59+ messages in thread From: Andrew Morton @ 2004-02-19 9:08 UTC (permalink / raw) To: Miquel van Smoorenburg; +Cc: axboe, linux-lvm, linux-kernel, thornber Miquel van Smoorenburg <miquels@cistron.nl> wrote: > > The thing is, the bio's are submitted perfectly sequentially. It's just that > every so often a request (with just its initial bio) gets stuck for a while. > Because the bio's after it are merged and sent to the device, it's not > possible to merge that stuck request later on when it gets unstuck, because > the other bio's have already left the building so to speak. Oh. So the raid controller's queue depth is larger than the kernel's. So everything gets immediately shovelled into the device and the kernel is left with nothing to merge the little request against. Shouldn't the controller itself be performing the insertion? ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: IO scheduler, queue depth, nr_requests 2004-02-19 9:08 ` [linux-lvm] " Andrew Morton @ 2004-02-19 10:23 ` Miquel van Smoorenburg -1 siblings, 0 replies; 59+ messages in thread From: Miquel van Smoorenburg @ 2004-02-19 10:15 UTC (permalink / raw) To: Andrew Morton Cc: Miquel van Smoorenburg, miquels, axboe, linux-lvm, linux-kernel, thornber On Thu, 19 Feb 2004 03:26:28, Andrew Morton wrote: > Miquel van Smoorenburg <miquels@cistron.nl> wrote: > > > > The thing is, the bio's are submitted perfectly sequentially. It's just that > > every so often a request (with just its initial bio) gets stuck for a while. > > Because the bio's after it are merged and sent to the device, it's not > > possible to merge that stuck request later on when it gets unstuck, because > > the other bio's have already left the building so to speak. > > Oh. So the raid controller's queue depth is larger than the kernel's. So > everything gets immediately shovelled into the device and the kernel is > left with nothing to merge the little request against. Well, the request queue of the kernel is max'ed out too, otherwise get_request_wait() wouldn't be called. It's just an unfortunate timing issue. > Shouldn't the controller itself be performing the insertion? Well, you would indeed expect the 3ware hardware to be smarter than that, but in its defence, the driver doesn't set sdev->simple_tags or sdev->ordered_tags at all. It just has a large queue on the host, in hardware. Perhaps this info should be exported into the request queue of the device, so that ll_rw_blk knows about this and can do something similar to the hack I posted ? Note that AFAICS nothing in drivers/scsi uses the tagging stuff in ll_rw_blk.c. blk_queue_init_tags() is only called by scsi_activate_tcq(), and nothing ever calls that (except the 53c700.c driver). So you can't just check for QUEUE_FLAG_QUEUED. Hmm, nothing in drivers/block calls it either. It's not being used at all yet ? Or am I being dense ? Mike. ^ permalink raw reply [flat|nested] 59+ messages in thread
* [linux-lvm] Re: IO scheduler, queue depth, nr_requests @ 2004-02-19 10:23 ` Miquel van Smoorenburg 0 siblings, 0 replies; 59+ messages in thread From: Miquel van Smoorenburg @ 2004-02-19 10:23 UTC (permalink / raw) To: Andrew Morton; +Cc: Miquel van Smoorenburg On Thu, 19 Feb 2004 03:26:28, Andrew Morton wrote: > Miquel van Smoorenburg <miquels@cistron.nl> wrote: > > > > The thing is, the bio's are submitted perfectly sequentially. It's just that > > every so often a request (with just its initial bio) gets stuck for a while. > > Because the bio's after it are merged and sent to the device, it's not > > possible to merge that stuck request later on when it gets unstuck, because > > the other bio's have already left the building so to speak. > > Oh. So the raid controller's queue depth is larger than the kernel's. So > everything gets immediately shovelled into the device and the kernel is > left with nothing to merge the little request against. Well, the request queue of the kernel is max'ed out too, otherwise get_request_wait() wouldn't be called. It's just an unfortunate timing issue. > Shouldn't the controller itself be performing the insertion? Well, you would indeed expect the 3ware hardware to be smarter than that, but in its defence, the driver doesn't set sdev->simple_tags or sdev->ordered_tags at all. It just has a large queue on the host, in hardware. Perhaps this info should be exported into the request queue of the device, so that ll_rw_blk knows about this and can do something similar to the hack I posted ? Note that AFAICS nothing in drivers/scsi uses the tagging stuff in ll_rw_blk.c. blk_queue_init_tags() is only called by scsi_activate_tcq(), and nothing ever calls that (except the 53c700.c driver). So you can't just check for QUEUE_FLAG_QUEUED. Hmm, nothing in drivers/block calls it either. It's not being used at all yet ? Or am I being dense ? Mike. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: IO scheduler, queue depth, nr_requests 2004-02-19 10:23 ` [linux-lvm] " Miquel van Smoorenburg @ 2004-02-19 10:26 ` Jens Axboe -1 siblings, 0 replies; 59+ messages in thread From: Jens Axboe @ 2004-02-19 10:19 UTC (permalink / raw) To: Miquel van Smoorenburg; +Cc: Andrew Morton, linux-lvm, linux-kernel, thornber On Thu, Feb 19 2004, Miquel van Smoorenburg wrote: > On Thu, 19 Feb 2004 03:26:28, Andrew Morton wrote: > > Miquel van Smoorenburg <miquels@cistron.nl> wrote: > > > > > > The thing is, the bio's are submitted perfectly sequentially. It's just that > > > every so often a request (with just its initial bio) gets stuck for a while. > > > Because the bio's after it are merged and sent to the device, it's not > > > possible to merge that stuck request later on when it gets unstuck, because > > > the other bio's have already left the building so to speak. > > > > Oh. So the raid controller's queue depth is larger than the kernel's. So > > everything gets immediately shovelled into the device and the kernel is > > left with nothing to merge the little request against. > > Well, the request queue of the kernel is max'ed out too, otherwise > get_request_wait() wouldn't be called. It's just an unfortunate timing > issue. Indeed > > Shouldn't the controller itself be performing the insertion? > > Well, you would indeed expect the 3ware hardware to be smarter than > that, but in its defence, the driver doesn't set sdev->simple_tags or > sdev->ordered_tags at all. It just has a large queue on the host, in > hardware. A too large queue. IMHO the simple and correct solution to your problem is to diminish the host queue (sane solution), or bump the block layer queue size (dumb solution). > Perhaps this info should be exported into the request queue of the device, > so that ll_rw_blk knows about this and can do something similar to the > hack I posted ? > > Note that AFAICS nothing in drivers/scsi uses the tagging stuff in > ll_rw_blk.c. blk_queue_init_tags() is only called by > scsi_activate_tcq(), and nothing ever calls that (except the 53c700.c > driver). So you can't just check for QUEUE_FLAG_QUEUED. Hmm, nothing > in drivers/block calls it either. It's not being used at all yet ? Or > am I being dense ? No you are correct, I already outlined that to you explicitly in the very first mail in this thread. Hopefully this will change with 2.7 so we have some block layer control over tagging in general. -- Jens Axboe ^ permalink raw reply [flat|nested] 59+ messages in thread
* [linux-lvm] Re: IO scheduler, queue depth, nr_requests @ 2004-02-19 10:26 ` Jens Axboe 0 siblings, 0 replies; 59+ messages in thread From: Jens Axboe @ 2004-02-19 10:26 UTC (permalink / raw) To: Miquel van Smoorenburg; +Cc: Andrew Morton, linux-lvm, linux-kernel, thornber On Thu, Feb 19 2004, Miquel van Smoorenburg wrote: > On Thu, 19 Feb 2004 03:26:28, Andrew Morton wrote: > > Miquel van Smoorenburg <miquels@cistron.nl> wrote: > > > > > > The thing is, the bio's are submitted perfectly sequentially. It's just that > > > every so often a request (with just its initial bio) gets stuck for a while. > > > Because the bio's after it are merged and sent to the device, it's not > > > possible to merge that stuck request later on when it gets unstuck, because > > > the other bio's have already left the building so to speak. > > > > Oh. So the raid controller's queue depth is larger than the kernel's. So > > everything gets immediately shovelled into the device and the kernel is > > left with nothing to merge the little request against. > > Well, the request queue of the kernel is max'ed out too, otherwise > get_request_wait() wouldn't be called. It's just an unfortunate timing > issue. Indeed > > Shouldn't the controller itself be performing the insertion? > > Well, you would indeed expect the 3ware hardware to be smarter than > that, but in its defence, the driver doesn't set sdev->simple_tags or > sdev->ordered_tags at all. It just has a large queue on the host, in > hardware. A too large queue. IMHO the simple and correct solution to your problem is to diminish the host queue (sane solution), or bump the block layer queue size (dumb solution). > Perhaps this info should be exported into the request queue of the device, > so that ll_rw_blk knows about this and can do something similar to the > hack I posted ? > > Note that AFAICS nothing in drivers/scsi uses the tagging stuff in > ll_rw_blk.c. blk_queue_init_tags() is only called by > scsi_activate_tcq(), and nothing ever calls that (except the 53c700.c > driver). So you can't just check for QUEUE_FLAG_QUEUED. Hmm, nothing > in drivers/block calls it either. It's not being used at all yet ? Or > am I being dense ? No you are correct, I already outlined that to you explicitly in the very first mail in this thread. Hopefully this will change with 2.7 so we have some block layer control over tagging in general. -- Jens Axboe ^ permalink raw reply [flat|nested] 59+ messages in thread
* [linux-lvm] Re: IO scheduler, queue depth, nr_requests 2004-02-19 10:26 ` [linux-lvm] " Jens Axboe @ 2004-02-19 20:59 ` Miquel van Smoorenburg -1 siblings, 0 replies; 59+ messages in thread From: Miquel van Smoorenburg @ 2004-02-19 15:58 UTC (permalink / raw) To: Jens Axboe Cc: Miquel van Smoorenburg, Andrew Morton, linux-lvm, linux-kernel, thornber On Thu, 19 Feb 2004 11:19:15, Jens Axboe wrote: > On Thu, Feb 19 2004, Miquel van Smoorenburg wrote: > > > > Shouldn't the controller itself be performing the insertion? > > > > Well, you would indeed expect the 3ware hardware to be smarter than > > that, but in its defence, the driver doesn't set sdev->simple_tags or > > sdev->ordered_tags at all. It just has a large queue on the host, in > > hardware. > > A too large queue. IMHO the simple and correct solution to your problem > is to diminish the host queue (sane solution), or bump the block layer > queue size (dumb solution). Well, I did that. Lowering the queue size of the 3ware controller to 64 does help a bit, but performance is still not optimal - leaving it at 254 and increasing the nr_requests of the queue to 512 helps the most. But the patch I posted does just as well, without any tuning. I changed it a little though - it only has the "new" behaviour (instead of blocking on allocating a request, allocate it, queue it, _then_ block) for WRITEs. That results in the best performance I've seen, by far. Now the style of my patch might be ugly, but what is conceptually wrong with allocating the request and queueing it, then block if the queue is full, versus blocking on allocating the request and keeping a bio "stuck" for quite some time, resulting in out-of-order requests to the hardware ? Note that this is not an issue of '2 processes writing to 1 file', really. It's one process and pdflush writing the same dirty pages of the same file. Mike. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: IO scheduler, queue depth, nr_requests @ 2004-02-19 20:59 ` Miquel van Smoorenburg 0 siblings, 0 replies; 59+ messages in thread From: Miquel van Smoorenburg @ 2004-02-19 20:59 UTC (permalink / raw) To: Jens Axboe Cc: Miquel van Smoorenburg, Andrew Morton, linux-lvm, linux-kernel, thornber On Thu, 19 Feb 2004 11:19:15, Jens Axboe wrote: > On Thu, Feb 19 2004, Miquel van Smoorenburg wrote: > > > > Shouldn't the controller itself be performing the insertion? > > > > Well, you would indeed expect the 3ware hardware to be smarter than > > that, but in its defence, the driver doesn't set sdev->simple_tags or > > sdev->ordered_tags at all. It just has a large queue on the host, in > > hardware. > > A too large queue. IMHO the simple and correct solution to your problem > is to diminish the host queue (sane solution), or bump the block layer > queue size (dumb solution). Well, I did that. Lowering the queue size of the 3ware controller to 64 does help a bit, but performance is still not optimal - leaving it at 254 and increasing the nr_requests of the queue to 512 helps the most. But the patch I posted does just as well, without any tuning. I changed it a little though - it only has the "new" behaviour (instead of blocking on allocating a request, allocate it, queue it, _then_ block) for WRITEs. That results in the best performance I've seen, by far. Now the style of my patch might be ugly, but what is conceptually wrong with allocating the request and queueing it, then block if the queue is full, versus blocking on allocating the request and keeping a bio "stuck" for quite some time, resulting in out-of-order requests to the hardware ? Note that this is not an issue of '2 processes writing to 1 file', really. It's one process and pdflush writing the same dirty pages of the same file. Mike. ^ permalink raw reply [flat|nested] 59+ messages in thread
* [linux-lvm] Re: IO scheduler, queue depth, nr_requests 2004-02-19 20:59 ` Miquel van Smoorenburg @ 2004-02-19 22:52 ` Nick Piggin -1 siblings, 0 replies; 59+ messages in thread From: Nick Piggin @ 2004-02-19 17:52 UTC (permalink / raw) To: Miquel van Smoorenburg Cc: Jens Axboe, Andrew Morton, linux-lvm, linux-kernel, thornber Miquel van Smoorenburg wrote: >On Thu, 19 Feb 2004 11:19:15, Jens Axboe wrote: > >>On Thu, Feb 19 2004, Miquel van Smoorenburg wrote: >> >> >>>>Shouldn't the controller itself be performing the insertion? >>>> >>>Well, you would indeed expect the 3ware hardware to be smarter than >>>that, but in its defence, the driver doesn't set sdev->simple_tags or >>>sdev->ordered_tags at all. It just has a large queue on the host, in >>>hardware. >>> >>A too large queue. IMHO the simple and correct solution to your problem >>is to diminish the host queue (sane solution), or bump the block layer >>queue size (dumb solution). >> > >Well, I did that. Lowering the queue size of the 3ware controller to 64 >does help a bit, but performance is still not optimal - leaving it at 254 >and increasing the nr_requests of the queue to 512 helps the most. > >But the patch I posted does just as well, without any tuning. I changed >it a little though - it only has the "new" behaviour (instead of blocking >on allocating a request, allocate it, queue it, _then_ block) for WRITEs. >That results in the best performance I've seen, by far. > > That's because you are half introducing per-process limits. >Now the style of my patch might be ugly, but what is conceptually wrong >with allocating the request and queueing it, then block if the queue is >full, versus blocking on allocating the request and keeping a bio >"stuck" for quite some time, resulting in out-of-order requests to the >hardware ? > > Conceptually? The concept that you have everything you need to continue and yet you block anyway is wrong. >Note that this is not an issue of '2 processes writing to 1 file', really. >It's one process and pdflush writing the same dirty pages of the same file. > > pdflush is a process though, that's all that matters. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: IO scheduler, queue depth, nr_requests @ 2004-02-19 22:52 ` Nick Piggin 0 siblings, 0 replies; 59+ messages in thread From: Nick Piggin @ 2004-02-19 22:52 UTC (permalink / raw) To: Miquel van Smoorenburg Cc: Jens Axboe, Andrew Morton, linux-lvm, linux-kernel, thornber Miquel van Smoorenburg wrote: >On Thu, 19 Feb 2004 11:19:15, Jens Axboe wrote: > >>On Thu, Feb 19 2004, Miquel van Smoorenburg wrote: >> >> >>>>Shouldn't the controller itself be performing the insertion? >>>> >>>Well, you would indeed expect the 3ware hardware to be smarter than >>>that, but in its defence, the driver doesn't set sdev->simple_tags or >>>sdev->ordered_tags at all. It just has a large queue on the host, in >>>hardware. >>> >>A too large queue. IMHO the simple and correct solution to your problem >>is to diminish the host queue (sane solution), or bump the block layer >>queue size (dumb solution). >> > >Well, I did that. Lowering the queue size of the 3ware controller to 64 >does help a bit, but performance is still not optimal - leaving it at 254 >and increasing the nr_requests of the queue to 512 helps the most. > >But the patch I posted does just as well, without any tuning. I changed >it a little though - it only has the "new" behaviour (instead of blocking >on allocating a request, allocate it, queue it, _then_ block) for WRITEs. >That results in the best performance I've seen, by far. > > That's because you are half introducing per-process limits. >Now the style of my patch might be ugly, but what is conceptually wrong >with allocating the request and queueing it, then block if the queue is >full, versus blocking on allocating the request and keeping a bio >"stuck" for quite some time, resulting in out-of-order requests to the >hardware ? > > Conceptually? The concept that you have everything you need to continue and yet you block anyway is wrong. >Note that this is not an issue of '2 processes writing to 1 file', really. >It's one process and pdflush writing the same dirty pages of the same file. > > pdflush is a process though, that's all that matters. ^ permalink raw reply [flat|nested] 59+ messages in thread
* [linux-lvm] Re: IO scheduler, queue depth, nr_requests 2004-02-19 22:52 ` Nick Piggin @ 2004-02-19 23:53 ` Miquel van Smoorenburg -1 siblings, 0 replies; 59+ messages in thread From: Miquel van Smoorenburg @ 2004-02-19 18:52 UTC (permalink / raw) To: Nick Piggin Cc: Miquel van Smoorenburg, Jens Axboe, Andrew Morton, linux-lvm, linux-kernel, thornber On Thu, 19 Feb 2004 23:52:32, Nick Piggin wrote: > > > Miquel van Smoorenburg wrote: > > >On Thu, 19 Feb 2004 11:19:15, Jens Axboe wrote: > > > >>On Thu, Feb 19 2004, Miquel van Smoorenburg wrote: > >> > >> > >>>>Shouldn't the controller itself be performing the insertion? > >>>> > >>>Well, you would indeed expect the 3ware hardware to be smarter than > >>>that, but in its defence, the driver doesn't set sdev->simple_tags or > >>>sdev->ordered_tags at all. It just has a large queue on the host, in > >>>hardware. > >>> > >>A too large queue. IMHO the simple and correct solution to your problem > >>is to diminish the host queue (sane solution), or bump the block layer > >>queue size (dumb solution). > >> > > > >Well, I did that. Lowering the queue size of the 3ware controller to 64 > >does help a bit, but performance is still not optimal - leaving it at 254 > >and increasing the nr_requests of the queue to 512 helps the most. > > > >But the patch I posted does just as well, without any tuning. I changed > >it a little though - it only has the "new" behaviour (instead of blocking > >on allocating a request, allocate it, queue it, _then_ block) for WRITEs. > >That results in the best performance I've seen, by far. > > > > > > That's because you are half introducing per-process limits. > > >Now the style of my patch might be ugly, but what is conceptually wrong > >with allocating the request and queueing it, then block if the queue is > >full, versus blocking on allocating the request and keeping a bio > >"stuck" for quite some time, resulting in out-of-order requests to the > >hardware ? > > > > > > Conceptually? The concept that you have everything you need to > continue and yet you block anyway is wrong. For reading, I agree. For writing .. ah well, English is not my first language, let's not argue about language semantics. > >Note that this is not an issue of '2 processes writing to 1 file', really. > >It's one process and pdflush writing the same dirty pages of the same file. > > pdflush is a process though, that's all that matters. I understand that when the two processes are unrelated, the patch as I sent it will do the wrong thing. But the thing is, you get this: - "dd" process writes requests - pdflush triggers to write dirty pages - too many pages are dirty so "dd" blocks as well to write synchronously - "dd" process triggers "queue full" but gets marked as "batching" so can continue (get_request) - pdflush tries to submit one bio and gets blocked (get_request_wait) - "dd" continues, but that one bio from pdflush remains stuck for a while That's stupid, that one bio from pdflush should really be allowed on the queue, since "dd" is adding requests from the same source to it anyway. Perhaps writes from pdflush should be handled differently to prevent this specific case ? Say, if pdflush adds request #128, don't mark it as batching, but let it block. The next process will be the one marked as batching and can continue. If pdflush tries to add a request > 128, allow it, but _then_ block it. Would something like that work ? Would it be a good idea to never mark a pdflush process as batching, or would that have a negative impact for some things ? Mike. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: IO scheduler, queue depth, nr_requests @ 2004-02-19 23:53 ` Miquel van Smoorenburg 0 siblings, 0 replies; 59+ messages in thread From: Miquel van Smoorenburg @ 2004-02-19 23:53 UTC (permalink / raw) To: Nick Piggin Cc: Miquel van Smoorenburg, Jens Axboe, Andrew Morton, linux-lvm, linux-kernel, thornber On Thu, 19 Feb 2004 23:52:32, Nick Piggin wrote: > > > Miquel van Smoorenburg wrote: > > >On Thu, 19 Feb 2004 11:19:15, Jens Axboe wrote: > > > >>On Thu, Feb 19 2004, Miquel van Smoorenburg wrote: > >> > >> > >>>>Shouldn't the controller itself be performing the insertion? > >>>> > >>>Well, you would indeed expect the 3ware hardware to be smarter than > >>>that, but in its defence, the driver doesn't set sdev->simple_tags or > >>>sdev->ordered_tags at all. It just has a large queue on the host, in > >>>hardware. > >>> > >>A too large queue. IMHO the simple and correct solution to your problem > >>is to diminish the host queue (sane solution), or bump the block layer > >>queue size (dumb solution). > >> > > > >Well, I did that. Lowering the queue size of the 3ware controller to 64 > >does help a bit, but performance is still not optimal - leaving it at 254 > >and increasing the nr_requests of the queue to 512 helps the most. > > > >But the patch I posted does just as well, without any tuning. I changed > >it a little though - it only has the "new" behaviour (instead of blocking > >on allocating a request, allocate it, queue it, _then_ block) for WRITEs. > >That results in the best performance I've seen, by far. > > > > > > That's because you are half introducing per-process limits. > > >Now the style of my patch might be ugly, but what is conceptually wrong > >with allocating the request and queueing it, then block if the queue is > >full, versus blocking on allocating the request and keeping a bio > >"stuck" for quite some time, resulting in out-of-order requests to the > >hardware ? > > > > > > Conceptually? The concept that you have everything you need to > continue and yet you block anyway is wrong. For reading, I agree. For writing .. ah well, English is not my first language, let's not argue about language semantics. > >Note that this is not an issue of '2 processes writing to 1 file', really. > >It's one process and pdflush writing the same dirty pages of the same file. > > pdflush is a process though, that's all that matters. I understand that when the two processes are unrelated, the patch as I sent it will do the wrong thing. But the thing is, you get this: - "dd" process writes requests - pdflush triggers to write dirty pages - too many pages are dirty so "dd" blocks as well to write synchronously - "dd" process triggers "queue full" but gets marked as "batching" so can continue (get_request) - pdflush tries to submit one bio and gets blocked (get_request_wait) - "dd" continues, but that one bio from pdflush remains stuck for a while That's stupid, that one bio from pdflush should really be allowed on the queue, since "dd" is adding requests from the same source to it anyway. Perhaps writes from pdflush should be handled differently to prevent this specific case ? Say, if pdflush adds request #128, don't mark it as batching, but let it block. The next process will be the one marked as batching and can continue. If pdflush tries to add a request > 128, allow it, but _then_ block it. Would something like that work ? Would it be a good idea to never mark a pdflush process as batching, or would that have a negative impact for some things ? Mike. ^ permalink raw reply [flat|nested] 59+ messages in thread
* [linux-lvm] Re: IO scheduler, queue depth, nr_requests 2004-02-19 23:53 ` Miquel van Smoorenburg @ 2004-02-20 0:15 ` Nick Piggin -1 siblings, 0 replies; 59+ messages in thread From: Nick Piggin @ 2004-02-19 19:15 UTC (permalink / raw) To: Miquel van Smoorenburg Cc: Jens Axboe, Andrew Morton, linux-lvm, linux-kernel, thornber Miquel van Smoorenburg wrote: >On Thu, 19 Feb 2004 23:52:32, Nick Piggin wrote: > >> >>Miquel van Smoorenburg wrote: >> > ... >>>Note that this is not an issue of '2 processes writing to 1 file', really. >>>It's one process and pdflush writing the same dirty pages of the same file. >>> >>pdflush is a process though, that's all that matters. >> > >I understand that when the two processes are unrelated, the patch as I >sent it will do the wrong thing. > >But the thing is, you get this: > >- "dd" process writes requests >- pdflush triggers to write dirty pages >- too many pages are dirty so "dd" blocks as well to write synchronously >- "dd" process triggers "queue full" but gets marked as "batching" so > can continue (get_request) >- pdflush tries to submit one bio and gets blocked (get_request_wait) >- "dd" continues, but that one bio from pdflush remains stuck for a while > > The batching logic can probably all be ripped out with per process limits. It's too complex anyway really. >That's stupid, that one bio from pdflush should really be allowed on >the queue, since "dd" is adding requests from the same source to it >anyway. > > But the whole reason it is getting blocked in the first place is because your controller is sucking up all your requests. The whole problem is not a problem if you use properly sized queues. I'm a bit surprised that it wasn't working well with a controller queue depth of 64 and 128 nr_requests. I'll give you a per process request limit patch to try in a minute. >Perhaps writes from pdflush should be handled differently to prevent >this specific case ? > >Say, if pdflush adds request #128, don't mark it as batching, but >let it block. The next process will be the one marked as batching >and can continue. If pdflush tries to add a request > 128, allow it, >but _then_ block it. > >Would something like that work ? Would it be a good idea to never mark >a pdflush process as batching, or would that have a negative impact >for some things ? > > It's hard to know. Maybe a better solution would be to allow pdflush to be exempt from the limits entirely as long as it tries not to write to congested queues (which is what it does)... ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: IO scheduler, queue depth, nr_requests @ 2004-02-20 0:15 ` Nick Piggin 0 siblings, 0 replies; 59+ messages in thread From: Nick Piggin @ 2004-02-20 0:15 UTC (permalink / raw) To: Miquel van Smoorenburg Cc: Jens Axboe, Andrew Morton, linux-lvm, linux-kernel, thornber Miquel van Smoorenburg wrote: >On Thu, 19 Feb 2004 23:52:32, Nick Piggin wrote: > >> >>Miquel van Smoorenburg wrote: >> > ... >>>Note that this is not an issue of '2 processes writing to 1 file', really. >>>It's one process and pdflush writing the same dirty pages of the same file. >>> >>pdflush is a process though, that's all that matters. >> > >I understand that when the two processes are unrelated, the patch as I >sent it will do the wrong thing. > >But the thing is, you get this: > >- "dd" process writes requests >- pdflush triggers to write dirty pages >- too many pages are dirty so "dd" blocks as well to write synchronously >- "dd" process triggers "queue full" but gets marked as "batching" so > can continue (get_request) >- pdflush tries to submit one bio and gets blocked (get_request_wait) >- "dd" continues, but that one bio from pdflush remains stuck for a while > > The batching logic can probably all be ripped out with per process limits. It's too complex anyway really. >That's stupid, that one bio from pdflush should really be allowed on >the queue, since "dd" is adding requests from the same source to it >anyway. > > But the whole reason it is getting blocked in the first place is because your controller is sucking up all your requests. The whole problem is not a problem if you use properly sized queues. I'm a bit surprised that it wasn't working well with a controller queue depth of 64 and 128 nr_requests. I'll give you a per process request limit patch to try in a minute. >Perhaps writes from pdflush should be handled differently to prevent >this specific case ? > >Say, if pdflush adds request #128, don't mark it as batching, but >let it block. The next process will be the one marked as batching >and can continue. If pdflush tries to add a request > 128, allow it, >but _then_ block it. > >Would something like that work ? Would it be a good idea to never mark >a pdflush process as batching, or would that have a negative impact >for some things ? > > It's hard to know. Maybe a better solution would be to allow pdflush to be exempt from the limits entirely as long as it tries not to write to congested queues (which is what it does)... ^ permalink raw reply [flat|nested] 59+ messages in thread
* [linux-lvm] [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests) 2004-02-19 23:53 ` Miquel van Smoorenburg @ 2004-02-20 1:12 ` Nick Piggin -1 siblings, 0 replies; 59+ messages in thread From: Nick Piggin @ 2004-02-19 20:16 UTC (permalink / raw) To: Miquel van Smoorenburg Cc: Jens Axboe, Andrew Morton, linux-lvm, linux-kernel, thornber [-- Attachment #1: Type: text/plain, Size: 152 bytes --] Hi Miquel, Can you see if this patch helps you? Even with this patch, it might still be a good idea to allow pdflush to disregard the limits... Nick [-- Attachment #2: blk-per-process-rqlim.patch --] [-- Type: text/plain, Size: 7564 bytes --] linux-2.6-npiggin/drivers/block/ll_rw_blk.c | 113 ++++++++-------------------- linux-2.6-npiggin/include/linux/blkdev.h | 8 - 2 files changed, 36 insertions(+), 85 deletions(-) diff -puN include/linux/blkdev.h~blk-per-process-rqlim include/linux/blkdev.h --- linux-2.6/include/linux/blkdev.h~blk-per-process-rqlim 2004-02-20 09:53:22.000000000 +1100 +++ linux-2.6-npiggin/include/linux/blkdev.h 2004-02-20 10:12:04.000000000 +1100 @@ -25,6 +25,7 @@ struct request_pm_state; #define BLKDEV_MIN_RQ 4 #define BLKDEV_MAX_RQ 128 /* Default maximum */ +#define BLKDEV_MIN_PROC_RQ 64 /* * This is the per-process anticipatory I/O scheduler state. @@ -61,11 +62,7 @@ struct io_context { atomic_t refcount; pid_t pid; - /* - * For request batching - */ - unsigned long last_waited; /* Time last woken after wait for request */ - int nr_batch_requests; /* Number of requests left in the batch */ + unsigned int nr_requests; /* Outstanding requests */ struct as_io_context *aic; }; @@ -141,6 +138,7 @@ struct request { int ref_count; request_queue_t *q; struct request_list *rl; + struct io_context *ioc; struct completion *waiting; void *special; diff -puN drivers/block/ll_rw_blk.c~blk-per-process-rqlim drivers/block/ll_rw_blk.c --- linux-2.6/drivers/block/ll_rw_blk.c~blk-per-process-rqlim 2004-02-20 09:53:25.000000000 +1100 +++ linux-2.6-npiggin/drivers/block/ll_rw_blk.c 2004-02-20 10:40:43.000000000 +1100 @@ -55,12 +55,6 @@ unsigned long blk_max_low_pfn, blk_max_p EXPORT_SYMBOL(blk_max_low_pfn); EXPORT_SYMBOL(blk_max_pfn); -/* Amount of time in which a process may batch requests */ -#define BLK_BATCH_TIME (HZ/50UL) - -/* Number of requests a "batching" process may submit */ -#define BLK_BATCH_REQ 32 - /* * Return the threshold (number of used requests) at which the queue is * considered to be congested. It include a little hysteresis to keep the @@ -1495,57 +1489,27 @@ static inline struct request *blk_alloc_ } /* - * ioc_batching returns true if the ioc is a valid batching request and - * should be given priority access to a request. - */ -static inline int ioc_batching(struct io_context *ioc) -{ - if (!ioc) - return 0; - - /* - * Make sure the process is able to allocate at least 1 request - * even if the batch times out, otherwise we could theoretically - * lose wakeups. - */ - return ioc->nr_batch_requests == BLK_BATCH_REQ || - (ioc->nr_batch_requests > 0 - && time_before(jiffies, ioc->last_waited + BLK_BATCH_TIME)); -} - -/* - * ioc_set_batching sets ioc to be a new "batcher" if it is not one. This - * will cause the process to be a "batcher" on all queues in the system. This - * is the behaviour we want though - once it gets a wakeup it should be given - * a nice run. - */ -void ioc_set_batching(struct io_context *ioc) -{ - if (!ioc || ioc_batching(ioc)) - return; - - ioc->nr_batch_requests = BLK_BATCH_REQ; - ioc->last_waited = jiffies; -} - -/* * A request has just been released. Account for it, update the full and * congestion status, wake up any waiters. Called under q->queue_lock. */ -static void freed_request(request_queue_t *q, int rw) +static void freed_request(request_queue_t *q, struct io_context *ioc, int rw) { struct request_list *rl = &q->rq; + ioc->nr_requests--; rl->count[rw]--; if (rl->count[rw] < queue_congestion_off_threshold(q)) clear_queue_congested(q, rw); if (rl->count[rw]+1 <= q->nr_requests) { - smp_mb(); - if (waitqueue_active(&rl->wait[rw])) - wake_up(&rl->wait[rw]); if (!waitqueue_active(&rl->wait[rw])) blk_clear_queue_full(q, rw); } + + if (rl->count[rw]-31 <= q->nr_requests + || ioc->nr_requests <= BLKDEV_MIN_PROC_RQ - 32) { + if (waitqueue_active(&rl->wait[rw])) + wake_up(&rl->wait[rw]); + } } #define blkdev_free_rq(list) list_entry((list)->next, struct request, queuelist) @@ -1556,32 +1520,34 @@ static struct request *get_request(reque { struct request *rq = NULL; struct request_list *rl = &q->rq; - struct io_context *ioc = get_io_context(gfp_mask); + struct io_context *ioc; spin_lock_irq(q->queue_lock); if (rl->count[rw]+1 >= q->nr_requests) { /* * The queue will fill after this allocation, so set it as - * full, and mark this process as "batching". This process - * will be allowed to complete a batch of requests, others - * will be blocked. + * full. */ - if (!blk_queue_full(q, rw)) { - ioc_set_batching(ioc); + if (!blk_queue_full(q, rw)) blk_set_queue_full(q, rw); - } } + ioc = get_io_context(gfp_mask); + if (unlikely(ioc == NULL)) + goto out; + if (blk_queue_full(q, rw) - && !ioc_batching(ioc) && !elv_may_queue(q, rw)) { + && ioc->nr_requests >= BLKDEV_MIN_PROC_RQ + && !elv_may_queue(q, rw)) { /* - * The queue is full and the allocating process is not a - * "batcher", and not exempted by the IO scheduler + * The queue is full and the allocating process is over + * it's limit, and not exempted by the IO scheduler */ spin_unlock_irq(q->queue_lock); - goto out; + goto out_ioc; } + ioc->nr_requests++; rl->count[rw]++; if (rl->count[rw] >= queue_congestion_on_threshold(q)) set_queue_congested(q, rw); @@ -1597,14 +1563,11 @@ static struct request *get_request(reque * wait queue, but this is pretty rare. */ spin_lock_irq(q->queue_lock); - freed_request(q, rw); + freed_request(q, ioc, rw); spin_unlock_irq(q->queue_lock); - goto out; + goto out_ioc; } - if (ioc_batching(ioc)) - ioc->nr_batch_requests--; - INIT_LIST_HEAD(&rq->queuelist); /* @@ -1620,13 +1583,15 @@ static struct request *get_request(reque rq->ref_count = 1; rq->q = q; rq->rl = rl; + rq->ioc = ioc; rq->waiting = NULL; rq->special = NULL; rq->data = NULL; rq->sense = NULL; -out: +out_ioc: put_io_context(ioc); +out: return rq; } @@ -1644,25 +1609,12 @@ static struct request *get_request_wait( struct request_list *rl = &q->rq; prepare_to_wait_exclusive(&rl->wait[rw], &wait, - TASK_UNINTERRUPTIBLE); + TASK_UNINTERRUPTIBLE); rq = get_request(q, rw, GFP_NOIO); - - if (!rq) { - struct io_context *ioc; - + if (!rq) io_schedule(); - /* - * After sleeping, we become a "batching" process and - * will be able to allocate at least one request, and - * up to a big batch of them for a small period time. - * See ioc_batching, ioc_set_batching - */ - ioc = get_io_context(GFP_NOIO); - ioc_set_batching(ioc); - put_io_context(ioc); - } finish_wait(&rl->wait[rw], &wait); } while (!rq); @@ -1854,6 +1806,7 @@ void __blk_put_request(request_queue_t * * it didn't come out of our reserved rq pools */ if (rl) { + struct io_context *ioc = req->ioc; int rw = rq_data_dir(req); elv_completed_request(q, req); @@ -1861,7 +1814,8 @@ void __blk_put_request(request_queue_t * BUG_ON(!list_empty(&req->queuelist)); blk_free_request(q, req); - freed_request(q, rw); + freed_request(q, ioc, rw); + put_io_context(ioc); } } @@ -2721,7 +2675,7 @@ int __init blk_dev_init(void) */ void put_io_context(struct io_context *ioc) { - if (ioc == NULL) + if (unlikely(ioc == NULL)) return; BUG_ON(atomic_read(&ioc->refcount) == 0); @@ -2772,8 +2726,7 @@ struct io_context *get_io_context(int gf if (ret) { atomic_set(&ret->refcount, 1); ret->pid = tsk->pid; - ret->last_waited = jiffies; /* doesn't matter... */ - ret->nr_batch_requests = 0; /* because this is 0 */ + ret->nr_requests = 0; ret->aic = NULL; tsk->io_context = ret; } _ ^ permalink raw reply [flat|nested] 59+ messages in thread
* [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests) @ 2004-02-20 1:12 ` Nick Piggin 0 siblings, 0 replies; 59+ messages in thread From: Nick Piggin @ 2004-02-20 1:12 UTC (permalink / raw) To: Miquel van Smoorenburg Cc: Jens Axboe, Andrew Morton, linux-lvm, linux-kernel, thornber [-- Attachment #1: Type: text/plain, Size: 152 bytes --] Hi Miquel, Can you see if this patch helps you? Even with this patch, it might still be a good idea to allow pdflush to disregard the limits... Nick [-- Attachment #2: blk-per-process-rqlim.patch --] [-- Type: text/plain, Size: 7564 bytes --] linux-2.6-npiggin/drivers/block/ll_rw_blk.c | 113 ++++++++-------------------- linux-2.6-npiggin/include/linux/blkdev.h | 8 - 2 files changed, 36 insertions(+), 85 deletions(-) diff -puN include/linux/blkdev.h~blk-per-process-rqlim include/linux/blkdev.h --- linux-2.6/include/linux/blkdev.h~blk-per-process-rqlim 2004-02-20 09:53:22.000000000 +1100 +++ linux-2.6-npiggin/include/linux/blkdev.h 2004-02-20 10:12:04.000000000 +1100 @@ -25,6 +25,7 @@ struct request_pm_state; #define BLKDEV_MIN_RQ 4 #define BLKDEV_MAX_RQ 128 /* Default maximum */ +#define BLKDEV_MIN_PROC_RQ 64 /* * This is the per-process anticipatory I/O scheduler state. @@ -61,11 +62,7 @@ struct io_context { atomic_t refcount; pid_t pid; - /* - * For request batching - */ - unsigned long last_waited; /* Time last woken after wait for request */ - int nr_batch_requests; /* Number of requests left in the batch */ + unsigned int nr_requests; /* Outstanding requests */ struct as_io_context *aic; }; @@ -141,6 +138,7 @@ struct request { int ref_count; request_queue_t *q; struct request_list *rl; + struct io_context *ioc; struct completion *waiting; void *special; diff -puN drivers/block/ll_rw_blk.c~blk-per-process-rqlim drivers/block/ll_rw_blk.c --- linux-2.6/drivers/block/ll_rw_blk.c~blk-per-process-rqlim 2004-02-20 09:53:25.000000000 +1100 +++ linux-2.6-npiggin/drivers/block/ll_rw_blk.c 2004-02-20 10:40:43.000000000 +1100 @@ -55,12 +55,6 @@ unsigned long blk_max_low_pfn, blk_max_p EXPORT_SYMBOL(blk_max_low_pfn); EXPORT_SYMBOL(blk_max_pfn); -/* Amount of time in which a process may batch requests */ -#define BLK_BATCH_TIME (HZ/50UL) - -/* Number of requests a "batching" process may submit */ -#define BLK_BATCH_REQ 32 - /* * Return the threshold (number of used requests) at which the queue is * considered to be congested. It include a little hysteresis to keep the @@ -1495,57 +1489,27 @@ static inline struct request *blk_alloc_ } /* - * ioc_batching returns true if the ioc is a valid batching request and - * should be given priority access to a request. - */ -static inline int ioc_batching(struct io_context *ioc) -{ - if (!ioc) - return 0; - - /* - * Make sure the process is able to allocate at least 1 request - * even if the batch times out, otherwise we could theoretically - * lose wakeups. - */ - return ioc->nr_batch_requests == BLK_BATCH_REQ || - (ioc->nr_batch_requests > 0 - && time_before(jiffies, ioc->last_waited + BLK_BATCH_TIME)); -} - -/* - * ioc_set_batching sets ioc to be a new "batcher" if it is not one. This - * will cause the process to be a "batcher" on all queues in the system. This - * is the behaviour we want though - once it gets a wakeup it should be given - * a nice run. - */ -void ioc_set_batching(struct io_context *ioc) -{ - if (!ioc || ioc_batching(ioc)) - return; - - ioc->nr_batch_requests = BLK_BATCH_REQ; - ioc->last_waited = jiffies; -} - -/* * A request has just been released. Account for it, update the full and * congestion status, wake up any waiters. Called under q->queue_lock. */ -static void freed_request(request_queue_t *q, int rw) +static void freed_request(request_queue_t *q, struct io_context *ioc, int rw) { struct request_list *rl = &q->rq; + ioc->nr_requests--; rl->count[rw]--; if (rl->count[rw] < queue_congestion_off_threshold(q)) clear_queue_congested(q, rw); if (rl->count[rw]+1 <= q->nr_requests) { - smp_mb(); - if (waitqueue_active(&rl->wait[rw])) - wake_up(&rl->wait[rw]); if (!waitqueue_active(&rl->wait[rw])) blk_clear_queue_full(q, rw); } + + if (rl->count[rw]-31 <= q->nr_requests + || ioc->nr_requests <= BLKDEV_MIN_PROC_RQ - 32) { + if (waitqueue_active(&rl->wait[rw])) + wake_up(&rl->wait[rw]); + } } #define blkdev_free_rq(list) list_entry((list)->next, struct request, queuelist) @@ -1556,32 +1520,34 @@ static struct request *get_request(reque { struct request *rq = NULL; struct request_list *rl = &q->rq; - struct io_context *ioc = get_io_context(gfp_mask); + struct io_context *ioc; spin_lock_irq(q->queue_lock); if (rl->count[rw]+1 >= q->nr_requests) { /* * The queue will fill after this allocation, so set it as - * full, and mark this process as "batching". This process - * will be allowed to complete a batch of requests, others - * will be blocked. + * full. */ - if (!blk_queue_full(q, rw)) { - ioc_set_batching(ioc); + if (!blk_queue_full(q, rw)) blk_set_queue_full(q, rw); - } } + ioc = get_io_context(gfp_mask); + if (unlikely(ioc == NULL)) + goto out; + if (blk_queue_full(q, rw) - && !ioc_batching(ioc) && !elv_may_queue(q, rw)) { + && ioc->nr_requests >= BLKDEV_MIN_PROC_RQ + && !elv_may_queue(q, rw)) { /* - * The queue is full and the allocating process is not a - * "batcher", and not exempted by the IO scheduler + * The queue is full and the allocating process is over + * it's limit, and not exempted by the IO scheduler */ spin_unlock_irq(q->queue_lock); - goto out; + goto out_ioc; } + ioc->nr_requests++; rl->count[rw]++; if (rl->count[rw] >= queue_congestion_on_threshold(q)) set_queue_congested(q, rw); @@ -1597,14 +1563,11 @@ static struct request *get_request(reque * wait queue, but this is pretty rare. */ spin_lock_irq(q->queue_lock); - freed_request(q, rw); + freed_request(q, ioc, rw); spin_unlock_irq(q->queue_lock); - goto out; + goto out_ioc; } - if (ioc_batching(ioc)) - ioc->nr_batch_requests--; - INIT_LIST_HEAD(&rq->queuelist); /* @@ -1620,13 +1583,15 @@ static struct request *get_request(reque rq->ref_count = 1; rq->q = q; rq->rl = rl; + rq->ioc = ioc; rq->waiting = NULL; rq->special = NULL; rq->data = NULL; rq->sense = NULL; -out: +out_ioc: put_io_context(ioc); +out: return rq; } @@ -1644,25 +1609,12 @@ static struct request *get_request_wait( struct request_list *rl = &q->rq; prepare_to_wait_exclusive(&rl->wait[rw], &wait, - TASK_UNINTERRUPTIBLE); + TASK_UNINTERRUPTIBLE); rq = get_request(q, rw, GFP_NOIO); - - if (!rq) { - struct io_context *ioc; - + if (!rq) io_schedule(); - /* - * After sleeping, we become a "batching" process and - * will be able to allocate at least one request, and - * up to a big batch of them for a small period time. - * See ioc_batching, ioc_set_batching - */ - ioc = get_io_context(GFP_NOIO); - ioc_set_batching(ioc); - put_io_context(ioc); - } finish_wait(&rl->wait[rw], &wait); } while (!rq); @@ -1854,6 +1806,7 @@ void __blk_put_request(request_queue_t * * it didn't come out of our reserved rq pools */ if (rl) { + struct io_context *ioc = req->ioc; int rw = rq_data_dir(req); elv_completed_request(q, req); @@ -1861,7 +1814,8 @@ void __blk_put_request(request_queue_t * BUG_ON(!list_empty(&req->queuelist)); blk_free_request(q, req); - freed_request(q, rw); + freed_request(q, ioc, rw); + put_io_context(ioc); } } @@ -2721,7 +2675,7 @@ int __init blk_dev_init(void) */ void put_io_context(struct io_context *ioc) { - if (ioc == NULL) + if (unlikely(ioc == NULL)) return; BUG_ON(atomic_read(&ioc->refcount) == 0); @@ -2772,8 +2726,7 @@ struct io_context *get_io_context(int gf if (ret) { atomic_set(&ret->refcount, 1); ret->pid = tsk->pid; - ret->last_waited = jiffies; /* doesn't matter... */ - ret->nr_batch_requests = 0; /* because this is 0 */ + ret->nr_requests = 0; ret->aic = NULL; tsk->io_context = ret; } _ ^ permalink raw reply [flat|nested] 59+ messages in thread
* [linux-lvm] Re: [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests) 2004-02-20 1:12 ` Nick Piggin @ 2004-02-20 1:26 ` Andrew Morton -1 siblings, 0 replies; 59+ messages in thread From: Andrew Morton @ 2004-02-19 20:25 UTC (permalink / raw) To: Nick Piggin; +Cc: miquels, axboe, linux-lvm, linux-kernel, thornber Nick Piggin <piggin@cyberone.com.au> wrote: > > Even with this patch, it might still be a good idea to allow > pdflush to disregard the limits... Has it been confirmed that pdflush is blocking in get_request_wait()? I guess that can happen very occasionally because we don't bother with any locking around there but if it's happening a lot then something is bust. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests) @ 2004-02-20 1:26 ` Andrew Morton 0 siblings, 0 replies; 59+ messages in thread From: Andrew Morton @ 2004-02-20 1:26 UTC (permalink / raw) To: Nick Piggin; +Cc: miquels, axboe, linux-lvm, linux-kernel, thornber Nick Piggin <piggin@cyberone.com.au> wrote: > > Even with this patch, it might still be a good idea to allow > pdflush to disregard the limits... Has it been confirmed that pdflush is blocking in get_request_wait()? I guess that can happen very occasionally because we don't bother with any locking around there but if it's happening a lot then something is bust. ^ permalink raw reply [flat|nested] 59+ messages in thread
* [linux-lvm] Re: [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests) 2004-02-20 1:26 ` Andrew Morton @ 2004-02-20 1:40 ` Nick Piggin -1 siblings, 0 replies; 59+ messages in thread From: Nick Piggin @ 2004-02-19 20:40 UTC (permalink / raw) To: Andrew Morton; +Cc: miquels, axboe, linux-lvm, linux-kernel, thornber Andrew Morton wrote: >Nick Piggin <piggin@cyberone.com.au> wrote: > >>Even with this patch, it might still be a good idea to allow >>pdflush to disregard the limits... >> > >Has it been confirmed that pdflush is blocking in get_request_wait()? I >guess that can happen very occasionally because we don't bother with any >locking around there but if it's happening a lot then something is bust. > > Miquel's analysis is pretty plausible, but I'm not sure if he's confirmed it or not, Miquel? Even if it isn't happening a lot, and something isn't bust it might be a good idea to do this. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests) @ 2004-02-20 1:40 ` Nick Piggin 0 siblings, 0 replies; 59+ messages in thread From: Nick Piggin @ 2004-02-20 1:40 UTC (permalink / raw) To: Andrew Morton; +Cc: miquels, axboe, linux-lvm, linux-kernel, thornber Andrew Morton wrote: >Nick Piggin <piggin@cyberone.com.au> wrote: > >>Even with this patch, it might still be a good idea to allow >>pdflush to disregard the limits... >> > >Has it been confirmed that pdflush is blocking in get_request_wait()? I >guess that can happen very occasionally because we don't bother with any >locking around there but if it's happening a lot then something is bust. > > Miquel's analysis is pretty plausible, but I'm not sure if he's confirmed it or not, Miquel? Even if it isn't happening a lot, and something isn't bust it might be a good idea to do this. ^ permalink raw reply [flat|nested] 59+ messages in thread
* [linux-lvm] Re: [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests) 2004-02-20 1:40 ` Nick Piggin @ 2004-02-20 2:32 ` Andrew Morton -1 siblings, 0 replies; 59+ messages in thread From: Andrew Morton @ 2004-02-19 21:32 UTC (permalink / raw) To: Nick Piggin; +Cc: miquels, axboe, linux-lvm, linux-kernel, thornber Nick Piggin <piggin@cyberone.com.au> wrote: > > > > Andrew Morton wrote: > > >Nick Piggin <piggin@cyberone.com.au> wrote: > > > >>Even with this patch, it might still be a good idea to allow > >>pdflush to disregard the limits... > >> > > > >Has it been confirmed that pdflush is blocking in get_request_wait()? I > >guess that can happen very occasionally because we don't bother with any > >locking around there but if it's happening a lot then something is bust. > > > > > > Miquel's analysis is pretty plausible, but I'm not sure if > he's confirmed it or not, Miquel? Even if it isn't happening > a lot, and something isn't bust it might be a good idea to > do this. Seems OK from a quick check. pdflush will block in get_request_wait() occasionally, but not at all often. Perhaps we could move the write_congested test into the mpage_writepages() inner loop but it hardly seems worth the risk. Maybe things are different on Miquel's clockwork controller. drivers/block/ll_rw_blk.c | 2 ++ fs/fs-writeback.c | 2 ++ 2 files changed, 4 insertions(+) diff -puN drivers/block/ll_rw_blk.c~pdflush-blockage-check drivers/block/ll_rw_blk.c --- 25/drivers/block/ll_rw_blk.c~pdflush-blockage-check 2004-02-19 18:16:33.000000000 -0800 +++ 25-akpm/drivers/block/ll_rw_blk.c 2004-02-19 18:16:33.000000000 -0800 @@ -1651,6 +1651,8 @@ static struct request *get_request_wait( if (!rq) { struct io_context *ioc; + WARN_ON(current_is_pdflush()); + io_schedule(); /* diff -puN fs/fs-writeback.c~pdflush-blockage-check fs/fs-writeback.c --- 25/fs/fs-writeback.c~pdflush-blockage-check 2004-02-19 18:22:25.000000000 -0800 +++ 25-akpm/fs/fs-writeback.c 2004-02-19 18:22:43.000000000 -0800 @@ -279,6 +279,8 @@ sync_sb_inodes(struct super_block *sb, s } if (wbc->nonblocking && bdi_write_congested(bdi)) { + if (current_is_pdflush()) + printk("saved pdflush\n"); wbc->encountered_congestion = 1; if (sb != blockdev_superblock) break; /* Skip a congested fs */ _ ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests) @ 2004-02-20 2:32 ` Andrew Morton 0 siblings, 0 replies; 59+ messages in thread From: Andrew Morton @ 2004-02-20 2:32 UTC (permalink / raw) To: Nick Piggin; +Cc: miquels, axboe, linux-lvm, linux-kernel, thornber Nick Piggin <piggin@cyberone.com.au> wrote: > > > > Andrew Morton wrote: > > >Nick Piggin <piggin@cyberone.com.au> wrote: > > > >>Even with this patch, it might still be a good idea to allow > >>pdflush to disregard the limits... > >> > > > >Has it been confirmed that pdflush is blocking in get_request_wait()? I > >guess that can happen very occasionally because we don't bother with any > >locking around there but if it's happening a lot then something is bust. > > > > > > Miquel's analysis is pretty plausible, but I'm not sure if > he's confirmed it or not, Miquel? Even if it isn't happening > a lot, and something isn't bust it might be a good idea to > do this. Seems OK from a quick check. pdflush will block in get_request_wait() occasionally, but not at all often. Perhaps we could move the write_congested test into the mpage_writepages() inner loop but it hardly seems worth the risk. Maybe things are different on Miquel's clockwork controller. drivers/block/ll_rw_blk.c | 2 ++ fs/fs-writeback.c | 2 ++ 2 files changed, 4 insertions(+) diff -puN drivers/block/ll_rw_blk.c~pdflush-blockage-check drivers/block/ll_rw_blk.c --- 25/drivers/block/ll_rw_blk.c~pdflush-blockage-check 2004-02-19 18:16:33.000000000 -0800 +++ 25-akpm/drivers/block/ll_rw_blk.c 2004-02-19 18:16:33.000000000 -0800 @@ -1651,6 +1651,8 @@ static struct request *get_request_wait( if (!rq) { struct io_context *ioc; + WARN_ON(current_is_pdflush()); + io_schedule(); /* diff -puN fs/fs-writeback.c~pdflush-blockage-check fs/fs-writeback.c --- 25/fs/fs-writeback.c~pdflush-blockage-check 2004-02-19 18:22:25.000000000 -0800 +++ 25-akpm/fs/fs-writeback.c 2004-02-19 18:22:43.000000000 -0800 @@ -279,6 +279,8 @@ sync_sb_inodes(struct super_block *sb, s } if (wbc->nonblocking && bdi_write_congested(bdi)) { + if (current_is_pdflush()) + printk("saved pdflush\n"); wbc->encountered_congestion = 1; if (sb != blockdev_superblock) break; /* Skip a congested fs */ _ ^ permalink raw reply [flat|nested] 59+ messages in thread
* [PATCH] bdi_congestion_funp (was: Re: [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests)) 2004-02-20 2:32 ` Andrew Morton @ 2004-02-23 8:41 ` Miquel van Smoorenburg -1 siblings, 0 replies; 59+ messages in thread From: Miquel van Smoorenburg @ 2004-02-20 14:40 UTC (permalink / raw) To: Andrew Morton Cc: Nick Piggin, miquels, axboe, linux-lvm, linux-kernel, thornber On 2004.02.20 03:32, Andrew Morton wrote: > Nick Piggin <piggin@cyberone.com.au> wrote: > > > > > > > > Andrew Morton wrote: > > > > >Nick Piggin <piggin@cyberone.com.au> wrote: > > > > > >>Even with this patch, it might still be a good idea to allow > > >>pdflush to disregard the limits... > > >> > > > > > >Has it been confirmed that pdflush is blocking in get_request_wait()? I > > >guess that can happen very occasionally because we don't bother with any > > >locking around there but if it's happening a lot then something is bust. > > > > > > > > > > Miquel's analysis is pretty plausible, but I'm not sure if > > he's confirmed it or not, Miquel? Yes, I've added current->comm to my debug printk's, and it's dd/pdflush that both block in get_request_wait (alternating between the two). > > Even if it isn't happening > > a lot, and something isn't bust it might be a good idea to > > do this. > > Seems OK from a quick check. pdflush will block in get_request_wait() > occasionally, but not at all often. Perhaps we could move the > write_congested test into the mpage_writepages() inner loop but it hardly > seems worth the risk. > > Maybe things are different on Miquel's clockwork controller. I haven't tested it yet because of the "This patch isn't actually so good" comment, but I found another explanation. > drivers/block/ll_rw_blk.c | 2 ++ > fs/fs-writeback.c | 2 ++ > 2 files changed, 4 insertions(+) *Lightbulb on* .. I just read fs-writeback.c. As I said, this happens with an LVM device. Could it be that because LVM and the actual device have different struct request_queue's things go awry ? In fs-writeback.c, your're looking at the LVM device (and its request_queue, and its backing_dev_info). In__make_request, you're looking at the SCSI device. So that's why I saw "dd" and pdflush fighting over get_request. pdflush isn't supposed to run when the blockdev is congested, but ofcourse the LVM device is never marked as congested. The same can happen with MD devices, I think. Here's another proof-of-concept patch that fixes things for LVM. With this patch applied, I never see pdflush running when the queue is congested. You'll probably mark it as "horrible", but hey, it shows the problem clear enough.. This patch adds a function pointer to the backing_dev_info struct that can be called to check on the congestion if it's anything other than a simple device. Dm.c now sets this pointer to a function defined in dm-table.c that checks the congestion bits on all devices part of the logical volume. I'm not sure if it should be the other way around (ll_rw_blk setting the congested bit in the LVM device instead). Also if any queue of any device under LVM is congested, pdflush won't run on the whole device - but I think that harms less than running on a device with a congested queue. bdi-congestion-funp.patch --- linux-2.6.3/include/linux/backing-dev.h.ORIG 2004-02-04 04:43:38.000000000 +0100 +++ linux-2.6.3/include/linux/backing-dev.h 2004-02-20 15:17:52.000000000 +0100 @@ -24,6 +24,8 @@ unsigned long ra_pages; /* max readahead in PAGE_CACHE_SIZE units */ unsigned long state; /* Always use atomic bitops on this */ int memory_backed; /* Cannot clean pages with writepage */ + int (*congested)(int, void *); /* Function pointer if device is md/dm */ + void *aux; /* Pointer to aux data for congested func */ }; extern struct backing_dev_info default_backing_dev_info; @@ -34,11 +36,15 @@ static inline int bdi_read_congested(struct backing_dev_info *bdi) { + if (bdi->congested) + return bdi->congested(BDI_read_congested, bdi->aux); return test_bit(BDI_read_congested, &bdi->state); } static inline int bdi_write_congested(struct backing_dev_info *bdi) { + if (bdi->congested) + return bdi->congested(BDI_write_congested, bdi->aux); return test_bit(BDI_write_congested, &bdi->state); } --- linux-2.6.3/drivers/md/dm.h.ORIG 2004-02-20 15:15:16.000000000 +0100 +++ linux-2.6.3/drivers/md/dm.h 2004-02-20 15:12:30.000000000 +0100 @@ -115,6 +115,7 @@ int dm_table_get_mode(struct dm_table *t); void dm_table_suspend_targets(struct dm_table *t); void dm_table_resume_targets(struct dm_table *t); +int dm_table_any_congested(int bdi_state, void *aux); /*----------------------------------------------------------------- * A registry of target types. --- linux-2.6.3/drivers/md/dm-table.c.ORIG 2004-02-04 04:44:59.000000000 +0100 +++ linux-2.6.3/drivers/md/dm-table.c 2004-02-20 15:14:35.000000000 +0100 @@ -857,6 +857,30 @@ } } +/* + * See if any device in the logical volume has a queue + * that is congested - in that case, pdflush skips it. + */ +int dm_table_any_congested(int bdi_state, void *aux) +{ + struct mapped_device *md = aux; + struct dm_table *t; + struct list_head *d, *devices; + int r = 0; + + if ((t = dm_get_table(md)) == NULL) + return 0; + + devices = dm_table_get_devices(t); + for (d = devices->next; d != devices; d = d->next) { + struct dm_dev *dd = list_entry(d, struct dm_dev, list); + request_queue_t *q = bdev_get_queue(dd->bdev); + r |= test_bit(bdi_state, &(q->backing_dev_info.state)); + } + dm_table_put(t); + + return r; +} EXPORT_SYMBOL(dm_get_device); EXPORT_SYMBOL(dm_put_device); --- linux-2.6.3/drivers/md/dm.c.ORIG 2004-02-20 15:15:02.000000000 +0100 +++ linux-2.6.3/drivers/md/dm.c 2004-02-20 15:14:51.000000000 +0100 @@ -608,6 +608,8 @@ } md->queue->queuedata = md; + md->queue->backing_dev_info.congested = dm_table_any_congested; + md->queue->backing_dev_info.aux = md; blk_queue_make_request(md->queue, dm_request); md->io_pool = mempool_create(MIN_IOS, mempool_alloc_slab, ^ permalink raw reply [flat|nested] 59+ messages in thread
* [linux-lvm] [PATCH] bdi_congestion_funp (was: Re: [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests)) @ 2004-02-23 8:41 ` Miquel van Smoorenburg 0 siblings, 0 replies; 59+ messages in thread From: Miquel van Smoorenburg @ 2004-02-23 8:41 UTC (permalink / raw) To: Andrew Morton Cc: Nick Piggin, miquels, axboe, linux-lvm, linux-kernel, thornber On 2004.02.20 03:32, Andrew Morton wrote: > Nick Piggin <piggin@cyberone.com.au> wrote: > > > > > > > > Andrew Morton wrote: > > > > >Nick Piggin <piggin@cyberone.com.au> wrote: > > > > > >>Even with this patch, it might still be a good idea to allow > > >>pdflush to disregard the limits... > > >> > > > > > >Has it been confirmed that pdflush is blocking in get_request_wait()? I > > >guess that can happen very occasionally because we don't bother with any > > >locking around there but if it's happening a lot then something is bust. > > > > > > > > > > Miquel's analysis is pretty plausible, but I'm not sure if > > he's confirmed it or not, Miquel? Yes, I've added current->comm to my debug printk's, and it's dd/pdflush that both block in get_request_wait (alternating between the two). > > Even if it isn't happening > > a lot, and something isn't bust it might be a good idea to > > do this. > > Seems OK from a quick check. pdflush will block in get_request_wait() > occasionally, but not at all often. Perhaps we could move the > write_congested test into the mpage_writepages() inner loop but it hardly > seems worth the risk. > > Maybe things are different on Miquel's clockwork controller. I haven't tested it yet because of the "This patch isn't actually so good" comment, but I found another explanation. > drivers/block/ll_rw_blk.c | 2 ++ > fs/fs-writeback.c | 2 ++ > 2 files changed, 4 insertions(+) *Lightbulb on* .. I just read fs-writeback.c. As I said, this happens with an LVM device. Could it be that because LVM and the actual device have different struct request_queue's things go awry ? In fs-writeback.c, your're looking at the LVM device (and its request_queue, and its backing_dev_info). In__make_request, you're looking at the SCSI device. So that's why I saw "dd" and pdflush fighting over get_request. pdflush isn't supposed to run when the blockdev is congested, but ofcourse the LVM device is never marked as congested. The same can happen with MD devices, I think. Here's another proof-of-concept patch that fixes things for LVM. With this patch applied, I never see pdflush running when the queue is congested. You'll probably mark it as "horrible", but hey, it shows the problem clear enough.. This patch adds a function pointer to the backing_dev_info struct that can be called to check on the congestion if it's anything other than a simple device. Dm.c now sets this pointer to a function defined in dm-table.c that checks the congestion bits on all devices part of the logical volume. I'm not sure if it should be the other way around (ll_rw_blk setting the congested bit in the LVM device instead). Also if any queue of any device under LVM is congested, pdflush won't run on the whole device - but I think that harms less than running on a device with a congested queue. bdi-congestion-funp.patch --- linux-2.6.3/include/linux/backing-dev.h.ORIG 2004-02-04 04:43:38.000000000 +0100 +++ linux-2.6.3/include/linux/backing-dev.h 2004-02-20 15:17:52.000000000 +0100 @@ -24,6 +24,8 @@ unsigned long ra_pages; /* max readahead in PAGE_CACHE_SIZE units */ unsigned long state; /* Always use atomic bitops on this */ int memory_backed; /* Cannot clean pages with writepage */ + int (*congested)(int, void *); /* Function pointer if device is md/dm */ + void *aux; /* Pointer to aux data for congested func */ }; extern struct backing_dev_info default_backing_dev_info; @@ -34,11 +36,15 @@ static inline int bdi_read_congested(struct backing_dev_info *bdi) { + if (bdi->congested) + return bdi->congested(BDI_read_congested, bdi->aux); return test_bit(BDI_read_congested, &bdi->state); } static inline int bdi_write_congested(struct backing_dev_info *bdi) { + if (bdi->congested) + return bdi->congested(BDI_write_congested, bdi->aux); return test_bit(BDI_write_congested, &bdi->state); } --- linux-2.6.3/drivers/md/dm.h.ORIG 2004-02-20 15:15:16.000000000 +0100 +++ linux-2.6.3/drivers/md/dm.h 2004-02-20 15:12:30.000000000 +0100 @@ -115,6 +115,7 @@ int dm_table_get_mode(struct dm_table *t); void dm_table_suspend_targets(struct dm_table *t); void dm_table_resume_targets(struct dm_table *t); +int dm_table_any_congested(int bdi_state, void *aux); /*----------------------------------------------------------------- * A registry of target types. --- linux-2.6.3/drivers/md/dm-table.c.ORIG 2004-02-04 04:44:59.000000000 +0100 +++ linux-2.6.3/drivers/md/dm-table.c 2004-02-20 15:14:35.000000000 +0100 @@ -857,6 +857,30 @@ } } +/* + * See if any device in the logical volume has a queue + * that is congested - in that case, pdflush skips it. + */ +int dm_table_any_congested(int bdi_state, void *aux) +{ + struct mapped_device *md = aux; + struct dm_table *t; + struct list_head *d, *devices; + int r = 0; + + if ((t = dm_get_table(md)) == NULL) + return 0; + + devices = dm_table_get_devices(t); + for (d = devices->next; d != devices; d = d->next) { + struct dm_dev *dd = list_entry(d, struct dm_dev, list); + request_queue_t *q = bdev_get_queue(dd->bdev); + r |= test_bit(bdi_state, &(q->backing_dev_info.state)); + } + dm_table_put(t); + + return r; +} EXPORT_SYMBOL(dm_get_device); EXPORT_SYMBOL(dm_put_device); --- linux-2.6.3/drivers/md/dm.c.ORIG 2004-02-20 15:15:02.000000000 +0100 +++ linux-2.6.3/drivers/md/dm.c 2004-02-20 15:14:51.000000000 +0100 @@ -608,6 +608,8 @@ } md->queue->queuedata = md; + md->queue->backing_dev_info.congested = dm_table_any_congested; + md->queue->backing_dev_info.aux = md; blk_queue_make_request(md->queue, dm_request); md->io_pool = mempool_create(MIN_IOS, mempool_alloc_slab, ^ permalink raw reply [flat|nested] 59+ messages in thread
* [linux-lvm] Re: [PATCH] bdi_congestion_funp (was: Re: [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests)) 2004-02-23 8:41 ` [linux-lvm] " Miquel van Smoorenburg @ 2004-02-20 14:59 ` Joe Thornber -1 siblings, 0 replies; 59+ messages in thread From: Joe Thornber @ 2004-02-20 9:56 UTC (permalink / raw) To: Miquel van Smoorenburg Cc: Andrew Morton, Nick Piggin, miquels, axboe, linux-lvm, linux-kernel, thornber On Fri, Feb 20, 2004 at 03:40:42PM +0100, Miquel van Smoorenburg wrote: > --- linux-2.6.3/drivers/md/dm-table.c.ORIG 2004-02-04 04:44:59.000000000 +0100 > +++ linux-2.6.3/drivers/md/dm-table.c 2004-02-20 15:14:35.000000000 +0100 <snip> > + if ((t = dm_get_table(md)) == NULL) > + return 0; struct mapped_device has no business in this file. You should move this function to dm.c, and provide accessor fns in dm-table.c. > + devices = dm_table_get_devices(t); > + for (d = devices->next; d != devices; d = d->next) { > + struct dm_dev *dd = list_entry(d, struct dm_dev, list); > + request_queue_t *q = bdev_get_queue(dd->bdev); > + r |= test_bit(bdi_state, &(q->backing_dev_info.state)); Shouldn't this be calling your bdi_*_congested function rather than assuming it is a real device under dm ? (often not true). I'm also very slightly worried that or'ing together the congestion results for all the seperate devices isn't always the right thing. These devices include anything that the targets are using, exception stores for snapshots, logs for mirror, all paths for multipath (or'ing is most likely to be wrong for multipath). - Joe ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH] bdi_congestion_funp (was: Re: [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests)) @ 2004-02-20 14:59 ` Joe Thornber 0 siblings, 0 replies; 59+ messages in thread From: Joe Thornber @ 2004-02-20 14:59 UTC (permalink / raw) To: Miquel van Smoorenburg Cc: Andrew Morton, Nick Piggin, miquels, axboe, linux-lvm, linux-kernel, thornber On Fri, Feb 20, 2004 at 03:40:42PM +0100, Miquel van Smoorenburg wrote: > --- linux-2.6.3/drivers/md/dm-table.c.ORIG 2004-02-04 04:44:59.000000000 +0100 > +++ linux-2.6.3/drivers/md/dm-table.c 2004-02-20 15:14:35.000000000 +0100 <snip> > + if ((t = dm_get_table(md)) == NULL) > + return 0; struct mapped_device has no business in this file. You should move this function to dm.c, and provide accessor fns in dm-table.c. > + devices = dm_table_get_devices(t); > + for (d = devices->next; d != devices; d = d->next) { > + struct dm_dev *dd = list_entry(d, struct dm_dev, list); > + request_queue_t *q = bdev_get_queue(dd->bdev); > + r |= test_bit(bdi_state, &(q->backing_dev_info.state)); Shouldn't this be calling your bdi_*_congested function rather than assuming it is a real device under dm ? (often not true). I'm also very slightly worried that or'ing together the congestion results for all the seperate devices isn't always the right thing. These devices include anything that the targets are using, exception stores for snapshots, logs for mirror, all paths for multipath (or'ing is most likely to be wrong for multipath). - Joe ^ permalink raw reply [flat|nested] 59+ messages in thread
* [linux-lvm] Re: [PATCH] bdi_congestion_funp (was: Re: [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests)) 2004-02-20 14:59 ` Joe Thornber @ 2004-02-20 15:00 ` Jens Axboe -1 siblings, 0 replies; 59+ messages in thread From: Jens Axboe @ 2004-02-20 9:59 UTC (permalink / raw) To: Joe Thornber Cc: Miquel van Smoorenburg, Andrew Morton, Nick Piggin, miquels, linux-lvm, linux-kernel On Fri, Feb 20 2004, Joe Thornber wrote: > > + devices = dm_table_get_devices(t); > > + for (d = devices->next; d != devices; d = d->next) { > > + struct dm_dev *dd = list_entry(d, struct dm_dev, list); > > + request_queue_t *q = bdev_get_queue(dd->bdev); > > + r |= test_bit(bdi_state, &(q->backing_dev_info.state)); > > Shouldn't this be calling your bdi_*_congested function rather than > assuming it is a real device under dm ? (often not true). > > I'm also very slightly worried that or'ing together the congestion > results for all the seperate devices isn't always the right thing. > These devices include anything that the targets are using, exception > stores for snapshots, logs for mirror, all paths for multipath (or'ing > is most likely to be wrong for multipath). Yeah the patch is pretty much crap in that area, I don't think Miquel was aiming for inclusion :) I'd suggest making queue functions for congestion state as well so it stacks properly. -- Jens Axboe ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH] bdi_congestion_funp (was: Re: [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests)) @ 2004-02-20 15:00 ` Jens Axboe 0 siblings, 0 replies; 59+ messages in thread From: Jens Axboe @ 2004-02-20 15:00 UTC (permalink / raw) To: Joe Thornber Cc: Miquel van Smoorenburg, Andrew Morton, Nick Piggin, miquels, linux-lvm, linux-kernel On Fri, Feb 20 2004, Joe Thornber wrote: > > + devices = dm_table_get_devices(t); > > + for (d = devices->next; d != devices; d = d->next) { > > + struct dm_dev *dd = list_entry(d, struct dm_dev, list); > > + request_queue_t *q = bdev_get_queue(dd->bdev); > > + r |= test_bit(bdi_state, &(q->backing_dev_info.state)); > > Shouldn't this be calling your bdi_*_congested function rather than > assuming it is a real device under dm ? (often not true). > > I'm also very slightly worried that or'ing together the congestion > results for all the seperate devices isn't always the right thing. > These devices include anything that the targets are using, exception > stores for snapshots, logs for mirror, all paths for multipath (or'ing > is most likely to be wrong for multipath). Yeah the patch is pretty much crap in that area, I don't think Miquel was aiming for inclusion :) I'd suggest making queue functions for congestion state as well so it stacks properly. -- Jens Axboe ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH] bdi_congestion_funp (was: Re: [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests)) 2004-02-20 15:00 ` Jens Axboe @ 2004-02-23 8:45 ` Miquel van Smoorenburg -1 siblings, 0 replies; 59+ messages in thread From: Miquel van Smoorenburg @ 2004-02-22 14:02 UTC (permalink / raw) To: Jens Axboe Cc: Joe Thornber, Miquel van Smoorenburg, Andrew Morton, Nick Piggin, miquels, linux-lvm, linux-kernel On Fri, 20 Feb 2004 16:00:13, Jens Axboe wrote: > On Fri, Feb 20 2004, Joe Thornber wrote: > > > + devices = dm_table_get_devices(t); > > > + for (d = devices->next; d != devices; d = d->next) { > > > + struct dm_dev *dd = list_entry(d, struct dm_dev, list); > > > + request_queue_t *q = bdev_get_queue(dd->bdev); > > > + r |= test_bit(bdi_state, &(q->backing_dev_info.state)); > > > > Shouldn't this be calling your bdi_*_congested function rather than > > assuming it is a real device under dm ? (often not true). > > > > I'm also very slightly worried that or'ing together the congestion > > results for all the seperate devices isn't always the right thing. > > These devices include anything that the targets are using, exception > > stores for snapshots, logs for mirror, all paths for multipath (or'ing > > is most likely to be wrong for multipath). > > Yeah the patch is pretty much crap in that area, I don't think Miquel > was aiming for inclusion :) > > I'd suggest making queue functions for congestion state as well so it > stacks properly. I've been looking at this. If you want to keep the struct backing_dev_info reasonably stand-alone (no function pointer and aux data like I did) you probably need to add a list of parent_queues to every request_list. If the queue get congested, you mark the parent queue(s) congested as well. Congested state would need to be changed into a counter instead of a bitfield, I think. The pdflush-is-running-on-this-queue bit can probably remain as-is. It's mostly meant to prevent 2 pdflush daemons from running the same queue. I don't see much harm in pdflush #1 running on /dev/md0 which consists of /dev/sda1 and /dev/sdb1 and pdflush #2 running on /dev/sdb2, right ? Would that be the way to go? If so, I'll take a stab at it. Mike. ^ permalink raw reply [flat|nested] 59+ messages in thread
* [linux-lvm] Re: [PATCH] bdi_congestion_funp (was: Re: [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests)) @ 2004-02-23 8:45 ` Miquel van Smoorenburg 0 siblings, 0 replies; 59+ messages in thread From: Miquel van Smoorenburg @ 2004-02-23 8:45 UTC (permalink / raw) To: Jens Axboe Cc: Joe Thornber, Miquel van Smoorenburg, Andrew Morton, Nick Piggin, miquels, linux-lvm, linux-kernel On Fri, 20 Feb 2004 16:00:13, Jens Axboe wrote: > On Fri, Feb 20 2004, Joe Thornber wrote: > > > + devices = dm_table_get_devices(t); > > > + for (d = devices->next; d != devices; d = d->next) { > > > + struct dm_dev *dd = list_entry(d, struct dm_dev, list); > > > + request_queue_t *q = bdev_get_queue(dd->bdev); > > > + r |= test_bit(bdi_state, &(q->backing_dev_info.state)); > > > > Shouldn't this be calling your bdi_*_congested function rather than > > assuming it is a real device under dm ? (often not true). > > > > I'm also very slightly worried that or'ing together the congestion > > results for all the seperate devices isn't always the right thing. > > These devices include anything that the targets are using, exception > > stores for snapshots, logs for mirror, all paths for multipath (or'ing > > is most likely to be wrong for multipath). > > Yeah the patch is pretty much crap in that area, I don't think Miquel > was aiming for inclusion :) > > I'd suggest making queue functions for congestion state as well so it > stacks properly. I've been looking at this. If you want to keep the struct backing_dev_info reasonably stand-alone (no function pointer and aux data like I did) you probably need to add a list of parent_queues to every request_list. If the queue get congested, you mark the parent queue(s) congested as well. Congested state would need to be changed into a counter instead of a bitfield, I think. The pdflush-is-running-on-this-queue bit can probably remain as-is. It's mostly meant to prevent 2 pdflush daemons from running the same queue. I don't see much harm in pdflush #1 running on /dev/md0 which consists of /dev/sda1 and /dev/sdb1 and pdflush #2 running on /dev/sdb2, right ? Would that be the way to go? If so, I'll take a stab at it. Mike. ^ permalink raw reply [flat|nested] 59+ messages in thread
* [linux-lvm] Re: [PATCH] bdi_congestion_funp (was: Re: [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests)) 2004-02-23 8:45 ` [linux-lvm] " Miquel van Smoorenburg @ 2004-02-22 19:55 ` Andrew Morton -1 siblings, 0 replies; 59+ messages in thread From: Andrew Morton @ 2004-02-22 14:55 UTC (permalink / raw) To: Miquel van Smoorenburg Cc: axboe, thornber, piggin, miquels, linux-lvm, linux-kernel Miquel van Smoorenburg <miquels@cistron.net> wrote: > > The pdflush-is-running-on-this-queue bit can probably remain as-is. It's > mostly meant to prevent 2 pdflush daemons from running the same queue. > I don't see much harm in pdflush #1 running on /dev/md0 which consists > of /dev/sda1 and /dev/sdb1 and pdflush #2 running on /dev/sdb2, right ? Yes, having two pdflushes working the same spindle is a bit pointless but is presumably fairly harmless. The most important thing here is to prevent pdflush from blocking on request exhaustion. Because while pdflush sleeps on a particular disk, all the others remain unserviced. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH] bdi_congestion_funp (was: Re: [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests)) @ 2004-02-22 19:55 ` Andrew Morton 0 siblings, 0 replies; 59+ messages in thread From: Andrew Morton @ 2004-02-22 19:55 UTC (permalink / raw) To: Miquel van Smoorenburg Cc: axboe, thornber, miquels, piggin, miquels, linux-lvm, linux-kernel Miquel van Smoorenburg <miquels@cistron.net> wrote: > > The pdflush-is-running-on-this-queue bit can probably remain as-is. It's > mostly meant to prevent 2 pdflush daemons from running the same queue. > I don't see much harm in pdflush #1 running on /dev/md0 which consists > of /dev/sda1 and /dev/sdb1 and pdflush #2 running on /dev/sdb2, right ? Yes, having two pdflushes working the same spindle is a bit pointless but is presumably fairly harmless. The most important thing here is to prevent pdflush from blocking on request exhaustion. Because while pdflush sleeps on a particular disk, all the others remain unserviced. ^ permalink raw reply [flat|nested] 59+ messages in thread
* [linux-lvm] Re: [PATCH] bdi_congestion_funp (was: Re: [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests)) 2004-02-23 8:41 ` [linux-lvm] " Miquel van Smoorenburg @ 2004-02-20 14:57 ` Jens Axboe -1 siblings, 0 replies; 59+ messages in thread From: Jens Axboe @ 2004-02-20 9:57 UTC (permalink / raw) To: Miquel van Smoorenburg Cc: Andrew Morton, Nick Piggin, miquels, linux-lvm, linux-kernel, thornber On Fri, Feb 20 2004, Miquel van Smoorenburg wrote: > > > Even if it isn't happening > > > a lot, and something isn't bust it might be a good idea to > > > do this. > > > > Seems OK from a quick check. pdflush will block in get_request_wait() > > occasionally, but not at all often. Perhaps we could move the > > write_congested test into the mpage_writepages() inner loop but it hardly > > seems worth the risk. > > > > Maybe things are different on Miquel's clockwork controller. > > I haven't tested it yet because of the "This patch isn't actually so good" > comment, but I found another explanation. > > > drivers/block/ll_rw_blk.c | 2 ++ > > fs/fs-writeback.c | 2 ++ > > 2 files changed, 4 insertions(+) > > *Lightbulb on* .. I just read fs-writeback.c. As I said, this happens > with an LVM device. Could it be that because LVM and the actual device > have different struct request_queue's things go awry ? > > In fs-writeback.c, your're looking at the LVM device (and its > request_queue, and its backing_dev_info). In__make_request, you're > looking at the SCSI device. In principle, the lvm/md queues themselves will never be congested. But the underlying queues can be, of course. Now this approach is _much_ better, imo. I don't particularly care very much for how you solved it, though, I'd much rather just see both setting and testing passed down (and kill the ->aux as well). Regardless of the initial hw depth vs block depth (which is also a generic device problem, not just dm related), this would be a good addition to the congestion logic. -- Jens Axboe ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH] bdi_congestion_funp (was: Re: [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests)) @ 2004-02-20 14:57 ` Jens Axboe 0 siblings, 0 replies; 59+ messages in thread From: Jens Axboe @ 2004-02-20 14:57 UTC (permalink / raw) To: Miquel van Smoorenburg Cc: Andrew Morton, Nick Piggin, miquels, linux-lvm, linux-kernel, thornber On Fri, Feb 20 2004, Miquel van Smoorenburg wrote: > > > Even if it isn't happening > > > a lot, and something isn't bust it might be a good idea to > > > do this. > > > > Seems OK from a quick check. pdflush will block in get_request_wait() > > occasionally, but not at all often. Perhaps we could move the > > write_congested test into the mpage_writepages() inner loop but it hardly > > seems worth the risk. > > > > Maybe things are different on Miquel's clockwork controller. > > I haven't tested it yet because of the "This patch isn't actually so good" > comment, but I found another explanation. > > > drivers/block/ll_rw_blk.c | 2 ++ > > fs/fs-writeback.c | 2 ++ > > 2 files changed, 4 insertions(+) > > *Lightbulb on* .. I just read fs-writeback.c. As I said, this happens > with an LVM device. Could it be that because LVM and the actual device > have different struct request_queue's things go awry ? > > In fs-writeback.c, your're looking at the LVM device (and its > request_queue, and its backing_dev_info). In__make_request, you're > looking at the SCSI device. In principle, the lvm/md queues themselves will never be congested. But the underlying queues can be, of course. Now this approach is _much_ better, imo. I don't particularly care very much for how you solved it, though, I'd much rather just see both setting and testing passed down (and kill the ->aux as well). Regardless of the initial hw depth vs block depth (which is also a generic device problem, not just dm related), this would be a good addition to the congestion logic. -- Jens Axboe ^ permalink raw reply [flat|nested] 59+ messages in thread
* [linux-lvm] Queue congestion: passing down vs passing up [PATCH] 2004-02-20 14:57 ` Jens Axboe (?) @ 2004-02-24 12:54 ` Miquel van Smoorenburg -1 siblings, 0 replies; 59+ messages in thread From: Miquel van Smoorenburg @ 2004-02-24 12:54 UTC (permalink / raw) To: Jens Axboe; +Cc: linux-lvm [weeded out the enormous Cc: list] On 2004.02.20 15:57, Jens Axboe wrote: > On Fri, Feb 20 2004, Miquel van Smoorenburg wrote: > > > > Even if it isn't happening > > > > a lot, and something isn't bust it might be a good idea to > > > > do this. > > > > > > Seems OK from a quick check. pdflush will block in get_request_wait() > > > occasionally, but not at all often. Perhaps we could move the > > > write_congested test into the mpage_writepages() inner loop but it hardly > > > seems worth the risk. > > > > > > Maybe things are different on Miquel's clockwork controller. > > > > I haven't tested it yet because of the "This patch isn't actually so good" > > comment, but I found another explanation. > > > > > drivers/block/ll_rw_blk.c | 2 ++ > > > fs/fs-writeback.c | 2 ++ > > > 2 files changed, 4 insertions(+) > > > > *Lightbulb on* .. I just read fs-writeback.c. As I said, this happens > > with an LVM device. Could it be that because LVM and the actual device > > have different struct request_queue's things go awry ? > > > > In fs-writeback.c, your're looking at the LVM device (and its > > request_queue, and its backing_dev_info). In__make_request, you're > > looking at the SCSI device. > > In principle, the lvm/md queues themselves will never be congested. But > the underlying queues can be, of course. > > Now this approach is _much_ better, imo. I don't particularly care very > much for how you solved it, though, I'd much rather just see both > setting and testing passed down (and kill the ->aux as well). I just sent a patch to linux-lvm / Joe Thornber that does this. It adds a void *request_queue to the struct backing_dev_info, and a congestion_fn to the request_queue_t struct. bdi_{read,write}_congested got moved to ll_rw_blk.c. dm.c sets the congestion_fn to an internal function that checks the congestion of all the underlying block devices. See https://www.redhat.com/archives/linux-lvm/2004-February/msg00203.html for the actual patch. So that's testing being passed down. The setting you cannot pass down - you need to pass it "up". I implemented that too, added a list of "parent queues" to the request_queue, and a function so that for example dm can register its queue as "parent" of an underlying device. It needs to be a list, since different partitions on one device can be members of different meta-devices. When the congestion functions in ll_rw_blk.c are called, if a queue is congested, that info is also passed "up" to all of the "parent" queues. backing_dev_info uses a counter instead of a bit to mark read/write as congested. Perhaps it would be better to call a function (congestion_fn). I think that we only need one of the two - either testing being passed down or setting being passed up. The first one is the easiest. The second one keeps the struct backing_dev_info clean (no request_queue pointer). What would you prefer ? For reference, a first cut at the "passing up queue congestion" is below. Mike. queue-parent.patch --- linux-2.6.3.orig/include/linux/backing-dev.h 2004-02-04 04:43:38.000000000 +0100 +++ linux-2.6.3-queue_parent/include/linux/backing-dev.h 2004-02-23 16:17:53.000000000 +0100 @@ -24,6 +24,8 @@ unsigned long ra_pages; /* max readahead in PAGE_CACHE_SIZE units */ unsigned long state; /* Always use atomic bitops on this */ int memory_backed; /* Cannot clean pages with writepage */ + int read_congested; /* Write queue or child write queue congested */ + int write_congested; /* Read queue or child read queue congested */ }; extern struct backing_dev_info default_backing_dev_info; @@ -34,12 +36,12 @@ static inline int bdi_read_congested(struct backing_dev_info *bdi) { - return test_bit(BDI_read_congested, &bdi->state); + return bdi->read_congested; } static inline int bdi_write_congested(struct backing_dev_info *bdi) { - return test_bit(BDI_write_congested, &bdi->state); + return bdi->write_congested; } #endif /* _LINUX_BACKING_DEV_H */ --- linux-2.6.3.orig/include/linux/blkdev.h 2004-02-22 13:52:17.000000000 +0100 +++ linux-2.6.3-queue_parent/include/linux/blkdev.h 2004-02-24 15:02:49.000000000 +0100 @@ -267,6 +267,11 @@ atomic_t refcnt; /* map can be shared */ }; +struct rqueue_list { + struct list_head rqueue_list_head; + struct request_queue *rqueue; +}; + struct request_queue { /* @@ -301,6 +306,8 @@ struct backing_dev_info backing_dev_info; + struct list_head parents; + /* * The queue owner gets to use this for whatever they like. * ll_rw_blk doesn't touch it. @@ -514,6 +521,8 @@ extern void __blk_stop_queue(request_queue_t *q); extern void blk_run_queue(request_queue_t *q); extern void blk_queue_activity_fn(request_queue_t *, activity_fn *, void *); +extern int blk_queue_add_parent(request_queue_t *q, request_queue_t *parent); +extern void blk_queue_remove_parent(request_queue_t *q, request_queue_t *parent); static inline request_queue_t *bdev_get_queue(struct block_device *bdev) { --- linux-2.6.3.orig/drivers/block/ll_rw_blk.c 2004-02-04 04:43:10.000000000 +0100 +++ linux-2.6.3-queue_parent/drivers/block/ll_rw_blk.c 2004-02-24 16:50:20.000000000 +0100 @@ -93,9 +93,31 @@ } /* - * A queue has just exitted congestion. Note this in the global counter of - * congested queues, and wake up anyone who was waiting for requests to be - * put back. + * Increment/decrement the congestion counter of this queue + * and recurse over the parent queues. + */ +static void set_queues_congested(request_queue_t *q, int rw, int how) +{ + struct rqueue_list *p; + struct list_head *l; + + if (rw == WRITE) + q->backing_dev_info.write_congested += how; + else + q->backing_dev_info.read_congested += how; + + __list_for_each(l, &q->parents) { + p = list_entry(l, struct rqueue_list, rqueue_list_head); + spin_lock(p->rqueue->queue_lock); + set_queues_congested(p->rqueue, rw, how); + spin_unlock(p->rqueue->queue_lock); + } +} + +/* + * A queue has just exitted congestion. Flag that in the queue's + * VM-visible state flags, and wake up anyone who was waiting + * for requests to be put back. */ static void clear_queue_congested(request_queue_t *q, int rw) { @@ -103,23 +125,91 @@ wait_queue_head_t *wqh = &congestion_wqh[rw]; bit = (rw == WRITE) ? BDI_write_congested : BDI_read_congested; - clear_bit(bit, &q->backing_dev_info.state); + if (!test_and_clear_bit(bit, &q->backing_dev_info.state)) + return; + set_queues_congested(q, rw, -1); if (waitqueue_active(wqh)) wake_up(wqh); } /* - * A queue has just entered congestion. Flag that in the queue's VM-visible - * state flags and increment the global gounter of congested queues. + * A queue has just entered congestion. Flag that in the queue's + * VM-visible state flags. */ static void set_queue_congested(request_queue_t *q, int rw) { enum bdi_state bit; bit = (rw == WRITE) ? BDI_write_congested : BDI_read_congested; - set_bit(bit, &q->backing_dev_info.state); + if (test_and_set_bit(bit, &q->backing_dev_info.state)) + return; + set_queues_congested(q, rw, 1); +} + +/** + * blk_queue_add_parent - add a queue to the list of parents. + * @q: this queue + * @parent: parent queue + * + * Adds a queue to the list of parents. Locks both queues. + */ +int blk_queue_add_parent(request_queue_t *q, request_queue_t *parent) +{ + struct rqueue_list *r; + + r = kmalloc(sizeof(*r), GFP_KERNEL); + if (!r) + return -ENOMEM; + INIT_LIST_HEAD(&r->rqueue_list_head); + r->rqueue = parent; + + spin_lock(q->queue_lock); + spin_lock(parent->queue_lock); + list_add_tail(&r->rqueue_list_head, &q->parents); + parent->backing_dev_info.read_congested += + q->backing_dev_info.read_congested; + parent->backing_dev_info.write_congested += + q->backing_dev_info.write_congested; + spin_unlock(parent->queue_lock); + spin_unlock(q->queue_lock); + + return 0; } +EXPORT_SYMBOL(blk_queue_add_parent); + +/** + * blk_queue_remove_parent - remove a queue from the list of parents. + * @q: this queue + * @parent: parent queue + * + * Removes a queue from the list of parents. Locks both queues. + */ +void blk_queue_remove_parent(request_queue_t *q, request_queue_t *parent) +{ + struct list_head *l; + struct rqueue_list *r; + + spin_lock(q->queue_lock); + __list_for_each(l, &q->parents) { + r = list_entry(l, struct rqueue_list, rqueue_list_head); + if (r->rqueue == parent) { + spin_lock(parent->queue_lock); + parent->backing_dev_info.read_congested -= + q->backing_dev_info.read_congested; + parent->backing_dev_info.write_congested -= + q->backing_dev_info.write_congested; + spin_unlock(parent->queue_lock); + list_del(l); + kfree(r); + break; + } + } + spin_unlock(q->queue_lock); +} + +EXPORT_SYMBOL(blk_queue_remove_parent); + /** * blk_get_backing_dev_info - get the address of a queue's backing_dev_info * @bdev: device @@ -1372,6 +1462,7 @@ memset(q, 0, sizeof(*q)); init_timer(&q->unplug_timer); atomic_set(&q->refcnt, 1); + INIT_LIST_HEAD(&q->parents); return q; } @@ -2813,8 +2904,11 @@ queue_requests_store(struct request_queue *q, const char *page, size_t count) { struct request_list *rl = &q->rq; + int ret; + + spin_lock(q->queue_lock); - int ret = queue_var_store(&q->nr_requests, page, count); + ret = queue_var_store(&q->nr_requests, page, count); if (q->nr_requests < BLKDEV_MIN_RQ) q->nr_requests = BLKDEV_MIN_RQ; @@ -2841,6 +2935,9 @@ blk_clear_queue_full(q, WRITE); wake_up(&rl->wait[WRITE]); } + + spin_unlock(q->queue_lock); + return ret; } --- linux-2.6.3.orig/drivers/md/dm.h 2004-02-04 04:43:45.000000000 +0100 +++ linux-2.6.3-queue_parent/drivers/md/dm.h 2004-02-23 21:32:40.000000000 +0100 @@ -110,6 +110,8 @@ struct dm_target *dm_table_get_target(struct dm_table *t, unsigned int index); struct dm_target *dm_table_find_target(struct dm_table *t, sector_t sector); void dm_table_set_restrictions(struct dm_table *t, struct request_queue *q); +void dm_table_add_parent(struct dm_table *t, struct request_queue *q); +void dm_table_remove_parent(struct dm_table *t, struct request_queue *q); unsigned int dm_table_get_num_targets(struct dm_table *t); struct list_head *dm_table_get_devices(struct dm_table *t); int dm_table_get_mode(struct dm_table *t); --- linux-2.6.3.orig/drivers/md/dm-table.c 2004-02-04 04:44:59.000000000 +0100 +++ linux-2.6.3-queue_parent/drivers/md/dm-table.c 2004-02-24 15:17:06.000000000 +0100 @@ -818,6 +818,30 @@ q->seg_boundary_mask = t->limits.seg_boundary_mask; } +void dm_table_add_parent(struct dm_table *t, struct request_queue *q) +{ + struct list_head *d, *devices; + + devices = dm_table_get_devices(t); + for (d = devices->next; d != devices; d = d->next) { + struct dm_dev *dd = list_entry(d, struct dm_dev, list); + request_queue_t *bq = bdev_get_queue(dd->bdev); + blk_queue_add_parent(bq, q); + } +} + +void dm_table_remove_parent(struct dm_table *t, struct request_queue *q) +{ + struct list_head *d, *devices; + + devices = dm_table_get_devices(t); + for (d = devices->next; d != devices; d = d->next) { + struct dm_dev *dd = list_entry(d, struct dm_dev, list); + request_queue_t *bq = bdev_get_queue(dd->bdev); + blk_queue_remove_parent(bq, q); + } +} + unsigned int dm_table_get_num_targets(struct dm_table *t) { return t->num_targets; --- linux-2.6.3.orig/drivers/md/dm.c 2004-02-22 13:52:15.000000000 +0100 +++ linux-2.6.3-queue_parent/drivers/md/dm.c 2004-02-24 16:35:32.000000000 +0100 @@ -41,6 +41,7 @@ struct mapped_device { struct rw_semaphore lock; atomic_t holders; + spinlock_t queue_lock; unsigned long flags; @@ -600,7 +601,7 @@ memset(md, 0, sizeof(*md)); init_rwsem(&md->lock); atomic_set(&md->holders, 1); - + md->queue_lock = SPIN_LOCK_UNLOCKED; md->queue = blk_alloc_queue(GFP_KERNEL); if (!md->queue) { kfree(md); @@ -608,6 +609,7 @@ } md->queue->queuedata = md; + md->queue->queue_lock = &md->queue_lock; blk_queue_make_request(md->queue, dm_request); md->io_pool = mempool_create(MIN_IOS, mempool_alloc_slab, @@ -695,6 +697,7 @@ dm_table_get(t); dm_table_set_restrictions(t, q); + dm_table_add_parent(t, q); return 0; } @@ -704,6 +707,7 @@ return; dm_table_event_callback(md->map, NULL, NULL); + dm_table_remove_parent(md->map, md->queue); dm_table_put(md->map); md->map = NULL; } ^ permalink raw reply [flat|nested] 59+ messages in thread
* [linux-lvm] Re: [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests) 2004-02-20 1:12 ` Nick Piggin @ 2004-02-20 1:45 ` Nick Piggin -1 siblings, 0 replies; 59+ messages in thread From: Nick Piggin @ 2004-02-19 20:52 UTC (permalink / raw) To: Nick Piggin Cc: Miquel van Smoorenburg, Jens Axboe, Andrew Morton, linux-lvm, linux-kernel, thornber Nick Piggin wrote: > Hi Miquel, > Can you see if this patch helps you? > This patch isn't actually so good because it doesn't wake a specific process when it gets below it's limit... ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests) @ 2004-02-20 1:45 ` Nick Piggin 0 siblings, 0 replies; 59+ messages in thread From: Nick Piggin @ 2004-02-20 1:45 UTC (permalink / raw) To: Nick Piggin Cc: Miquel van Smoorenburg, Jens Axboe, Andrew Morton, linux-lvm, linux-kernel, thornber Nick Piggin wrote: > Hi Miquel, > Can you see if this patch helps you? > This patch isn't actually so good because it doesn't wake a specific process when it gets below it's limit... ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: IO scheduler, queue depth, nr_requests 2004-02-19 9:08 ` [linux-lvm] " Miquel van Smoorenburg @ 2004-02-19 9:11 ` Nick Piggin -1 siblings, 0 replies; 59+ messages in thread From: Nick Piggin @ 2004-02-19 2:51 UTC (permalink / raw) To: Miquel van Smoorenburg Cc: Andrew Morton, axboe, linux-lvm, linux-kernel, thornber Miquel van Smoorenburg wrote: > >No, I'm actually referring to a struct request. I'm logging this in the >SCSI layer, in scsi_request_fn(), just after elv_next_request(). I have >in fact logged all the bio's submitted to __make_request, and the output >of the elevator from elv_next_request(). The bio's are submitted sequentially, >the resulting requests aren't. But this is because nr_requests is 128, while >the 3ware device has a queue of 254 entries (no tagging though). Upping >nr_requests to 512 makes this go away .. > >That shouldn't be necessary though. I only see this with LVM over 3ware-raid5, >not on the 3ware-raid5 array directly (/dev/sda1). And it gets less troublesome >with a lot of debugging (unless I set nr_requests lower again), which points >to a timing issue. > > So the problem you are seeing is due to "unlucky" timing between two processes submitting IO. And the very efficient mechanisms (merging, sorting) we have to improve situations exactly like this is effectively disabled. And to make it worse, it appears that your controller shits itself on this trivially simple pattern. Your hack makes a baby step in the direction of per *process* request limits, which I happen to be an advocate of. As it stands though, I don't like it. Jens has the final say when it comes to the block layer though. ^ permalink raw reply [flat|nested] 59+ messages in thread
* [linux-lvm] Re: IO scheduler, queue depth, nr_requests @ 2004-02-19 9:11 ` Nick Piggin 0 siblings, 0 replies; 59+ messages in thread From: Nick Piggin @ 2004-02-19 9:11 UTC (permalink / raw) To: Miquel van Smoorenburg Cc: Andrew Morton, axboe, linux-lvm, linux-kernel, thornber Miquel van Smoorenburg wrote: > >No, I'm actually referring to a struct request. I'm logging this in the >SCSI layer, in scsi_request_fn(), just after elv_next_request(). I have >in fact logged all the bio's submitted to __make_request, and the output >of the elevator from elv_next_request(). The bio's are submitted sequentially, >the resulting requests aren't. But this is because nr_requests is 128, while >the 3ware device has a queue of 254 entries (no tagging though). Upping >nr_requests to 512 makes this go away .. > >That shouldn't be necessary though. I only see this with LVM over 3ware-raid5, >not on the 3ware-raid5 array directly (/dev/sda1). And it gets less troublesome >with a lot of debugging (unless I set nr_requests lower again), which points >to a timing issue. > > So the problem you are seeing is due to "unlucky" timing between two processes submitting IO. And the very efficient mechanisms (merging, sorting) we have to improve situations exactly like this is effectively disabled. And to make it worse, it appears that your controller shits itself on this trivially simple pattern. Your hack makes a baby step in the direction of per *process* request limits, which I happen to be an advocate of. As it stands though, I don't like it. Jens has the final say when it comes to the block layer though. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: IO scheduler, queue depth, nr_requests 2004-02-19 9:11 ` [linux-lvm] " Nick Piggin @ 2004-02-19 10:26 ` Jens Axboe -1 siblings, 0 replies; 59+ messages in thread From: Jens Axboe @ 2004-02-19 10:21 UTC (permalink / raw) To: Nick Piggin Cc: Miquel van Smoorenburg, Andrew Morton, linux-lvm, linux-kernel, thornber On Thu, Feb 19 2004, Nick Piggin wrote: > > > Miquel van Smoorenburg wrote: > > > > >No, I'm actually referring to a struct request. I'm logging this in the > >SCSI layer, in scsi_request_fn(), just after elv_next_request(). I have > >in fact logged all the bio's submitted to __make_request, and the output > >of the elevator from elv_next_request(). The bio's are submitted > >sequentially, > >the resulting requests aren't. But this is because nr_requests is 128, > >while > >the 3ware device has a queue of 254 entries (no tagging though). Upping > >nr_requests to 512 makes this go away .. > > > >That shouldn't be necessary though. I only see this with LVM over > >3ware-raid5, > >not on the 3ware-raid5 array directly (/dev/sda1). And it gets less > >troublesome > >with a lot of debugging (unless I set nr_requests lower again), which > >points > >to a timing issue. > > > > > > So the problem you are seeing is due to "unlucky" timing between > two processes submitting IO. And the very efficient mechanisms > (merging, sorting) we have to improve situations exactly like this > is effectively disabled. And to make it worse, it appears that your > controller shits itself on this trivially simple pattern. > > Your hack makes a baby step in the direction of per *process* > request limits, which I happen to be an advocate of. As it stands > though, I don't like it. I'm very much an advocate for per process request limits as well. Would be trivial to add... Miquels patch is horrible, I appreciate it being posted as a cry for help. -- Jens Axboe ^ permalink raw reply [flat|nested] 59+ messages in thread
* [linux-lvm] Re: IO scheduler, queue depth, nr_requests @ 2004-02-19 10:26 ` Jens Axboe 0 siblings, 0 replies; 59+ messages in thread From: Jens Axboe @ 2004-02-19 10:26 UTC (permalink / raw) To: Nick Piggin Cc: Miquel van Smoorenburg, Andrew Morton, linux-lvm, linux-kernel, thornber On Thu, Feb 19 2004, Nick Piggin wrote: > > > Miquel van Smoorenburg wrote: > > > > >No, I'm actually referring to a struct request. I'm logging this in the > >SCSI layer, in scsi_request_fn(), just after elv_next_request(). I have > >in fact logged all the bio's submitted to __make_request, and the output > >of the elevator from elv_next_request(). The bio's are submitted > >sequentially, > >the resulting requests aren't. But this is because nr_requests is 128, > >while > >the 3ware device has a queue of 254 entries (no tagging though). Upping > >nr_requests to 512 makes this go away .. > > > >That shouldn't be necessary though. I only see this with LVM over > >3ware-raid5, > >not on the 3ware-raid5 array directly (/dev/sda1). And it gets less > >troublesome > >with a lot of debugging (unless I set nr_requests lower again), which > >points > >to a timing issue. > > > > > > So the problem you are seeing is due to "unlucky" timing between > two processes submitting IO. And the very efficient mechanisms > (merging, sorting) we have to improve situations exactly like this > is effectively disabled. And to make it worse, it appears that your > controller shits itself on this trivially simple pattern. > > Your hack makes a baby step in the direction of per *process* > request limits, which I happen to be an advocate of. As it stands > though, I don't like it. I'm very much an advocate for per process request limits as well. Would be trivial to add... Miquels patch is horrible, I appreciate it being posted as a cry for help. -- Jens Axboe ^ permalink raw reply [flat|nested] 59+ messages in thread
end of thread, other threads:[~2004-02-24 15:58 UTC | newest] Thread overview: 59+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2004-02-16 9:11 [linux-lvm] IO scheduler, queue depth, nr_requests Miquel van Smoorenburg 2004-02-16 8:30 ` [linux-lvm] " Jens Axboe 2004-02-16 9:01 ` Joe Thornber 2004-02-16 19:32 ` Kevin P. Fleming 2004-02-17 1:46 ` Kevin P. Fleming 2004-02-18 10:29 ` Miquel van Smoorenburg 2004-02-18 23:52 ` Miquel van Smoorenburg 2004-02-19 8:51 ` [linux-lvm] " Miquel van Smoorenburg 2004-02-19 1:24 ` Nick Piggin 2004-02-19 8:51 ` [linux-lvm] " Nick Piggin 2004-02-19 1:52 ` Miquel van Smoorenburg 2004-02-19 9:00 ` [linux-lvm] " Miquel van Smoorenburg 2004-02-19 2:01 ` Nick Piggin 2004-02-19 9:00 ` [linux-lvm] " Nick Piggin 2004-02-19 1:26 ` Andrew Morton 2004-02-19 8:50 ` [linux-lvm] " Andrew Morton 2004-02-19 2:11 ` Miquel van Smoorenburg 2004-02-19 9:08 ` [linux-lvm] " Miquel van Smoorenburg 2004-02-19 2:26 ` Andrew Morton 2004-02-19 9:08 ` [linux-lvm] " Andrew Morton 2004-02-19 10:15 ` Miquel van Smoorenburg 2004-02-19 10:23 ` [linux-lvm] " Miquel van Smoorenburg 2004-02-19 10:19 ` Jens Axboe 2004-02-19 10:26 ` [linux-lvm] " Jens Axboe 2004-02-19 15:58 ` Miquel van Smoorenburg 2004-02-19 20:59 ` Miquel van Smoorenburg 2004-02-19 17:52 ` [linux-lvm] " Nick Piggin 2004-02-19 22:52 ` Nick Piggin 2004-02-19 18:52 ` [linux-lvm] " Miquel van Smoorenburg 2004-02-19 23:53 ` Miquel van Smoorenburg 2004-02-19 19:15 ` [linux-lvm] " Nick Piggin 2004-02-20 0:15 ` Nick Piggin 2004-02-19 20:16 ` [linux-lvm] [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests) Nick Piggin 2004-02-20 1:12 ` Nick Piggin 2004-02-19 20:25 ` [linux-lvm] " Andrew Morton 2004-02-20 1:26 ` Andrew Morton 2004-02-19 20:40 ` [linux-lvm] " Nick Piggin 2004-02-20 1:40 ` Nick Piggin 2004-02-19 21:32 ` [linux-lvm] " Andrew Morton 2004-02-20 2:32 ` Andrew Morton 2004-02-20 14:40 ` [PATCH] bdi_congestion_funp (was: Re: [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests)) Miquel van Smoorenburg 2004-02-23 8:41 ` [linux-lvm] " Miquel van Smoorenburg 2004-02-20 9:56 ` [linux-lvm] " Joe Thornber 2004-02-20 14:59 ` Joe Thornber 2004-02-20 9:59 ` [linux-lvm] " Jens Axboe 2004-02-20 15:00 ` Jens Axboe 2004-02-22 14:02 ` Miquel van Smoorenburg 2004-02-23 8:45 ` [linux-lvm] " Miquel van Smoorenburg 2004-02-22 14:55 ` Andrew Morton 2004-02-22 19:55 ` Andrew Morton 2004-02-20 9:57 ` [linux-lvm] " Jens Axboe 2004-02-20 14:57 ` Jens Axboe 2004-02-24 12:54 ` [linux-lvm] Queue congestion: passing down vs passing up [PATCH] Miquel van Smoorenburg 2004-02-19 20:52 ` [linux-lvm] Re: [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests) Nick Piggin 2004-02-20 1:45 ` Nick Piggin 2004-02-19 2:51 ` IO scheduler, queue depth, nr_requests Nick Piggin 2004-02-19 9:11 ` [linux-lvm] " Nick Piggin 2004-02-19 10:21 ` Jens Axboe 2004-02-19 10:26 ` [linux-lvm] " Jens Axboe
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.