[linux-lvm] IO scheduler, queue depth, nr

All of lore.kernel.org
 help / color / mirror / Atom feed

* [linux-lvm] IO scheduler, queue depth, nr_requests
@ 2004-02-16  9:11 Miquel van Smoorenburg
  2004-02-16  8:30 ` [linux-lvm] " Jens Axboe
  0 siblings, 1 reply; 59+ messages in thread
From: Miquel van Smoorenburg @ 2004-02-16  9:11 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-lvm

Hello,

	as you might have seen from the linux-kernel mailinglist
I have been testing for months now with a fileserver set up to
use XFS over LVM2 on a 3ware RAID5 controller.

I asked for help several times on the list, but nobody really
replied, so now I'm taking a shot at mailing you directly, since
you appear to be the I/O request queueing guru of the kernel ;)
Cc: sent to linux-lvm@sistina.com. Any hint appreciated.

For some reason, when using LVM, write requests get queued out
of order to the 3ware controller, which results in quite a bit
of seeking and thus performance loss.

The default queue depth of the 3ware controller is 254. I found
out that lowering it to 64 in the driver fixed my problems, and
I advised 3ware support about this. They weren't really convinced..

By fiddling about today I just found that changing
/sys/block/sda/queue/nr_requests from 128 to something above
the queue depth of the 3ware controller (256 doesn't work,
384 and up do) also fixes the problem.

Does that actually make sense ?

Ah yes, I'm currently using 2.6.2 with a 3ware 8506-8 in
hardware raid5 mode, deadline scheduler, PIV 3.0 Ghz, 2 GB RAM.

Debug output ("mydd" works just like "dd", but has an fsync option):

- /mnt is an XFS filesystem on a LVM2 volume on the 3ware
- /mnt2 is an XFS filesystem directly on /dev/sda1 of the 3ware

- First on /mnt, the LVM partition. Note that a small "dd" runs
  fast, a larger one runs slower:

# cd /mnt
# cat /sys/block/sda/device/queue_depth
254
# cat /sys/block/sda/queue/nr_requests
128
# ~/mydd --if /dev/zero --of file --bs 4096 --count 50000 --fsync
204800000 bytes transferred in 2.679812 seconds (76423271 bytes/sec)
# ~/mydd --if /dev/zero --of file --bs 4096 --count 100000 --fsync
409600000 bytes transferred in 9.501549 seconds (43108760 bytes/sec)

- Now I set the nr_requests to 512:
# echo 512 > /sys/block/sda/queue/nr_requests
# ~/mydd --if /dev/zero --of file --bs 4096 --count 100000 --fsync
409600000 bytes transferred in 5.374437 seconds (76212634 bytes/sec)

See that ? Weird thing is, it's only on LVM, directly on /dev/sda1
no problem at all:

# cat /sys/block/sda/device/queue_depth
254
# cat /sys/block/sda/queue/nr_requests
128
# ~/mydd --if /dev/zero --of file --bs 4096 --count 100000 --fsync
409600000 bytes transferred in 5.135642 seconds (79756338 bytes/sec)

Somehow, LVM is causing the requests to the underlying 3ware
device to get out of order, and increasing nr_requests to be
larger than the queue_depth of the device fixes this.

I tried the latest dm-patches in -mm (applied those to vanilla
2.6.2), which include a patch called dm-04-maintain-bio-ordering.patch
but that doesn't really help (at first I though otherwise, but the
tests scripts I used lowered the queue_depth of the 3ware to 64
by accident) - if anything, it makes things worse.

# ~/mydd --if /dev/zero --of file --bs 4096 --count 100000 --fsync
409600000 bytes transferred in 13.138224 seconds (31176208 bytes/sec)

Setting nr_requests to 512 fixes things up again.

Mike.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [linux-lvm] Re: IO scheduler, queue depth, nr_requests
  2004-02-16  9:11 [linux-lvm] IO scheduler, queue depth, nr_requests Miquel van Smoorenburg
@ 2004-02-16  8:30 ` Jens Axboe
  2004-02-16  9:01   ` Joe Thornber
                     ` (3 more replies)
  0 siblings, 4 replies; 59+ messages in thread
From: Jens Axboe @ 2004-02-16  8:30 UTC (permalink / raw)
  To: Miquel van Smoorenburg; +Cc: linux-lvm

On Mon, Feb 16 2004, Miquel van Smoorenburg wrote:
> Hello,
> 
> 	as you might have seen from the linux-kernel mailinglist
> I have been testing for months now with a fileserver set up to
> use XFS over LVM2 on a 3ware RAID5 controller.
> 
> I asked for help several times on the list, but nobody really
> replied, so now I'm taking a shot at mailing you directly, since
> you appear to be the I/O request queueing guru of the kernel ;)
> Cc: sent to linux-lvm@sistina.com. Any hint appreciated.
> 
> For some reason, when using LVM, write requests get queued out
> of order to the 3ware controller, which results in quite a bit
> of seeking and thus performance loss.
> 
> The default queue depth of the 3ware controller is 254. I found
> out that lowering it to 64 in the driver fixed my problems, and
> I advised 3ware support about this. They weren't really convinced..
> 
> By fiddling about today I just found that changing
> /sys/block/sda/queue/nr_requests from 128 to something above
> the queue depth of the 3ware controller (256 doesn't work,
> 384 and up do) also fixes the problem.
> 
> Does that actually make sense ?

Yes, it makes perfect sense, I've been aware of this problem for quite
some time. If you look init_tag_map() in ll_rw_blk.c:

	if (depth > q->nr_requests / 2) {
		q->nr_requests = depth * 2;
		printk(KERN_INFO "%s: large TCQ depth: adjusted nr_requests "
                         "to %lu\n", __FUNCTION__, q->nr_requests);
	}

it pretty much matches the problem you outlined. Unfortunately, the
tagging depth of SCSI drivers cannot be controlled unless they use the
generic block tagging helpers, and to my knowledge only a single driver
does...

> Ah yes, I'm currently using 2.6.2 with a 3ware 8506-8 in
> hardware raid5 mode, deadline scheduler, PIV 3.0 Ghz, 2 GB RAM.
> 
> Debug output ("mydd" works just like "dd", but has an fsync option):
> 
> - /mnt is an XFS filesystem on a LVM2 volume on the 3ware
> - /mnt2 is an XFS filesystem directly on /dev/sda1 of the 3ware
> 
> - First on /mnt, the LVM partition. Note that a small "dd" runs
>   fast, a larger one runs slower:
> 
> # cd /mnt
> # cat /sys/block/sda/device/queue_depth
> 254
> # cat /sys/block/sda/queue/nr_requests
> 128
> # ~/mydd --if /dev/zero --of file --bs 4096 --count 50000 --fsync
> 204800000 bytes transferred in 2.679812 seconds (76423271 bytes/sec)
> # ~/mydd --if /dev/zero --of file --bs 4096 --count 100000 --fsync
> 409600000 bytes transferred in 9.501549 seconds (43108760 bytes/sec)
> 
> - Now I set the nr_requests to 512:
> # echo 512 > /sys/block/sda/queue/nr_requests
> # ~/mydd --if /dev/zero --of file --bs 4096 --count 100000 --fsync
> 409600000 bytes transferred in 5.374437 seconds (76212634 bytes/sec)
> 
> See that ? Weird thing is, it's only on LVM, directly on /dev/sda1
> no problem at all:
> 
> # cat /sys/block/sda/device/queue_depth
> 254
> # cat /sys/block/sda/queue/nr_requests
> 128
> # ~/mydd --if /dev/zero --of file --bs 4096 --count 100000 --fsync
> 409600000 bytes transferred in 5.135642 seconds (79756338 bytes/sec)
> 
> Somehow, LVM is causing the requests to the underlying 3ware
> device to get out of order, and increasing nr_requests to be
> larger than the queue_depth of the device fixes this.
> 
> I tried the latest dm-patches in -mm (applied those to vanilla
> 2.6.2), which include a patch called dm-04-maintain-bio-ordering.patch
> but that doesn't really help (at first I though otherwise, but the
> tests scripts I used lowered the queue_depth of the 3ware to 64
> by accident) - if anything, it makes things worse.
> 
> # ~/mydd --if /dev/zero --of file --bs 4096 --count 100000 --fsync
> 409600000 bytes transferred in 13.138224 seconds (31176208 bytes/sec)
> 
> Setting nr_requests to 512 fixes things up again.

Seems there's an extra problem here, the nr_requests vs depth problem
should not be too problematic unless you have heavy random io. Doesn't
look like dm is reordering (bio_list_add() adds to tail,
flush_deferred_io() processes from head. direct queueing doesn't look
like it's reordering). Can the dm folks verify this?

Or, you are just being hit by the problem first listed - requests get no
hold time in the io scheduler for merging, because the driver drains
them too quickly because of this artificially huge queue depth. If you
did some stats on average request size and io/sec rate that should tell
you for sure. I don't know what you have behind the 3ware, but it's
generally not advised to use more than 4 tags per spindle.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [linux-lvm] Re: IO scheduler, queue depth, nr_requests
  2004-02-16  8:30 ` [linux-lvm] " Jens Axboe
@ 2004-02-16  9:01   ` Joe Thornber
  2004-02-16 19:32   ` Kevin P. Fleming
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 59+ messages in thread
From: Joe Thornber @ 2004-02-16  9:01 UTC (permalink / raw)
  To: linux-lvm, Jens Axboe; +Cc: Miquel van Smoorenburg, linux-lvm

On Mon, Feb 16, 2004 at 02:30:47PM +0100, Jens Axboe wrote:
> Seems there's an extra problem here, the nr_requests vs depth problem
> should not be too problematic unless you have heavy random io. Doesn't
> look like dm is reordering (bio_list_add() adds to tail,
> flush_deferred_io() processes from head. direct queueing doesn't look
> like it's reordering). Can the dm folks verify this?

Ordering is certainly maintained for the simple targets (linear,
striped).  Snapshots and mirroring still need some work in this
regard, but I don't think these are begin used for the tests.

The place that I am expecting dm to cause poor performance is due to
this bit of stupidity in dm-table.c.  It's on my TODO list, but is not
getting to the op since no-one seems to be complaining about it yet.

                /* FIXME: Device-Mapper on top of RAID-0 breaks because DM
                 *        currently doesn't honor MD's merge_bvec_fn routine.
                 *        In this case, we'll force DM to use PAGE_SIZE or
                 *        smaller I/O, just to be safe. A better fix is in the
                 *        works, but add this for the time being so it will at
                 *        least operate correctly.
                 */
                if (q->merge_bvec_fn)
                        rs->max_sectors =
                                min_not_zero(rs->max_sectors,
                                             (unsigned short)(PAGE_SIZE >> 9));



- Joe

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [linux-lvm] Re: IO scheduler, queue depth, nr_requests
  2004-02-16  8:30 ` [linux-lvm] " Jens Axboe
  2004-02-16  9:01   ` Joe Thornber
@ 2004-02-16 19:32   ` Kevin P. Fleming
  2004-02-17  1:46   ` Kevin P. Fleming
  2004-02-18 10:29   ` Miquel van Smoorenburg
  3 siblings, 0 replies; 59+ messages in thread
From: Kevin P. Fleming @ 2004-02-16 19:32 UTC (permalink / raw)
  To: linux-lvm; +Cc: Miquel van Smoorenburg

Jens Axboe wrote:
 > On Mon, Feb 16 2004, Miquel van Smoorenburg wrote:
 >
 >>Hello,
 >>
 >>	as you might have seen from the linux-kernel mailinglist
 >>I have been testing for months now with a fileserver set up to
 >>use XFS over LVM2 on a 3ware RAID5 controller.
 >>

I too have been fighting lower-than-expected performance with this 
identical combination, other than using a P4 2.8GHz instead of 3.0GHz, 
and 1GB RAM instead of 2GB. My 3ware 8506-8 has six Seagate Barracuda 
SATA drives behind it, which are very fast on their own :-) Try as I 
might, with the 2.6 kernel I have not been able to generate a bonnie++ 
or iozone run that provided performance better than a single disk, even 
with the 3ware card in JBOD, RAID-0 or RAID-1 mode. I had been 
conversing with 3ware about it, but to no real effect given than they 
don't support 2.6 yet.

 >>By fiddling about today I just found that changing
 >>/sys/block/sda/queue/nr_requests from 128 to something above
 >>the queue depth of the 3ware controller (256 doesn't work,
 >>384 and up do) also fixes the problem.

I will try this as well this evening and report my results.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [linux-lvm] Re: IO scheduler, queue depth, nr_requests
  2004-02-16  8:30 ` [linux-lvm] " Jens Axboe
  2004-02-16  9:01   ` Joe Thornber
  2004-02-16 19:32   ` Kevin P. Fleming
@ 2004-02-17  1:46   ` Kevin P. Fleming
  2004-02-18 10:29   ` Miquel van Smoorenburg
  3 siblings, 0 replies; 59+ messages in thread
From: Kevin P. Fleming @ 2004-02-17  1:46 UTC (permalink / raw)
  To: linux-lvm; +Cc: Miquel van Smoorenburg

Jens Axboe wrote:

>>By fiddling about today I just found that changing
>>/sys/block/sda/queue/nr_requests from 128 to something above
>>the queue depth of the 3ware controller (256 doesn't work,
>>384 and up do) also fixes the problem.

OK, I have run three sets of bonnie++ tests on my system. This was using 
the 2.6-bk kernel as of about midnight GMT of 2004-02-17.

System is single P4HT 2.8GHz (SMP kernel), 1GB RAM (4GB highmem 
enabled), Intel 865G chipset, 3ware 8506-8 with six Seagate 160GB 
Barracuda SATA disks in a single RAID-5. bonnie++ tests run on an XFS 
filesystem (made with default mkfs.xfs parameters using a recent 
xfstools package) over a 200GB LVM2 volume (dm-linear).

bonnie++ -r 512 -s 40960 -f -b

Anticipatory scheduler, nr_requests=128 (default)

Seq. Write   53772
Rand. Write  15557
Seq. Read    47730
Rand. Seek   146.5

Anticipatory scheduler, nr_requests=384

Seq. Write   53843
Rand. Write  17595
Seq. Read    44663
Rand. Seek   143.1

Deadline scheduler, nr_requests=384

Seq. Write   54973
Rand. Write  18897
Seq. Read    41476
Rand. Seek   227.3 (!)

Miquel's suggestion has a definite positive effect, except on sequential 
reads. This system still doesn't produce anywhere near the throughput 
I'd expect though, and still doesn't come close to 2.4 numbers either 
(but those are somewhat bogus, as they rely on setting extremely large 
readahead settings). There may be some sort of hardware problem that is 
limiting me to ~60MB/s, although with 2.4 I was able to get sequential 
read numbers around 95MB/s, so I think the hardware is OK.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [linux-lvm] Re: IO scheduler, queue depth, nr_requests
  2004-02-16  8:30 ` [linux-lvm] " Jens Axboe
                     ` (2 preceding siblings ...)
  2004-02-17  1:46   ` Kevin P. Fleming
@ 2004-02-18 10:29   ` Miquel van Smoorenburg
  2004-02-19  8:51       ` [linux-lvm] " Miquel van Smoorenburg
  3 siblings, 1 reply; 59+ messages in thread
From: Miquel van Smoorenburg @ 2004-02-18 10:29 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Miquel van Smoorenburg, linux-lvm, Joe Thornber

On 2004.02.16 14:30, Jens Axboe wrote:
> On Mon, Feb 16 2004, Miquel van Smoorenburg wrote:
>
> > By fiddling about today I just found that changing
> > /sys/block/sda/queue/nr_requests from 128 to something above
> > the queue depth of the 3ware controller (256 doesn't work,
> > 384 and up do) also fixes the problem.
> > 

[....]

> > See that ? Weird thing is, it's only on LVM, directly on /dev/sda1
> > no problem at all:
> > 
> > # cat /sys/block/sda/device/queue_depth
> > 254
> > # cat /sys/block/sda/queue/nr_requests
> > 128
> > # ~/mydd --if /dev/zero --of file --bs 4096 --count 100000 --fsync
> > 409600000 bytes transferred in 5.135642 seconds (79756338 bytes/sec)
> > 
> > Somehow, LVM is causing the requests to the underlying 3ware
> > device to get out of order, and increasing nr_requests to be
> > larger than the queue_depth of the device fixes this.
>
> Seems there's an extra problem here, the nr_requests vs depth problem
> should not be too problematic unless you have heavy random io. Doesn't
> look like dm is reordering (bio_list_add() adds to tail,
> flush_deferred_io() processes from head. direct queueing doesn't look
> like it's reordering). Can the dm folks verify this?
> 
> Or, you are just being hit by the problem first listed - requests get no
> hold time in the io scheduler for merging, because the driver drains
> them too quickly because of this artificially huge queue depth. If you
> did some stats on average request size and io/sec rate that should tell
> you for sure. I don't know what you have behind the 3ware, but it's
> generally not advised to use more than 4 tags per spindle.

Okay I repeated some earlier tests, and I added some debug code in
several places.

I added logging to tw_scsi_queue() in the 3ware driver to log the
start sector and length of each request. It logs something like:
3wdbg: id 119, lba = 0x2330bc33, num_sectors = 256

With a perl script, I can check if the requests are sent to the
host in order. That outputs something like this:

Consecutive: start 1180906348, length 7936 sec (3968 KB), requests: 31
Consecutive: start 1180906340, length 8 sec (4 KB), requests: 1
Consecutive: start 1180914292, length 7936 sec (3968 KB), requests: 31
Consecutive: start 1180914284, length 8 sec (4 KB), requests: 1
Consecutive: start 1180922236, length 7936 sec (3968 KB), requests: 31
Consecutive: start 1180922228, length 8 sec (4 KB), requests: 1
Consecutive: start 1180930180, length 7936 sec (3968 KB), requests: 31

See, 31 requests in order, then one request "backwards", then 31 in order, etc.

I added some queue debug code as well, both the LVM2 queue and 3ware
queue have the following settings:

max_sectors: 256
max_phys_segments: 128
max_hw_segments: 62
hardsect_size: 512
max_segment_size: 65536
seg_boundary_mask: ffffffff

Now 31 * 2 == 62 == max_hw_segments .. coincidence ?

Weird thing is, still still only happens with LVM over 3ware raid5, not
on /dev/sda1 of the 3ware directly.

I added some printk's to scsi_request_fn() in scsi_lib.c to see if a
requests was getting requeued - but no.

Upping nr_requests to 2 * queue_depth does still fix things, but as you
said that should not be necessary. This bugs me, I want to find out why
this only happens with LVM and not without ... 

Mike.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: IO scheduler, queue depth, nr_requests
  2004-02-18 10:29   ` Miquel van Smoorenburg
@ 2004-02-19  8:51       ` Miquel van Smoorenburg
  0 siblings, 0 replies; 59+ messages in thread
From: Miquel van Smoorenburg @ 2004-02-18 23:52 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Miquel van Smoorenburg, linux-lvm, linux-kernel, Joe Thornber

On Tue, 17 Feb 2004 15:57:16, Miquel van Smoorenburg wrote:
> For some reason, when using LVM, write requests get queued out
> of order to the 3ware controller, which results in quite a bit
> of seeking and thus performance loss.
[..]
> Okay I repeated some earlier tests, and I added some debug code in
> several places.
> 
> I added logging to tw_scsi_queue() in the 3ware driver to log the
> start sector and length of each request. It logs something like:
> 3wdbg: id 119, lba = 0x2330bc33, num_sectors = 256
> 
> With a perl script, I can check if the requests are sent to the
> host in order. That outputs something like this:
> 
> Consecutive: start 1180906348, length 7936 sec (3968 KB), requests: 31
> Consecutive: start 1180906340, length 8 sec (4 KB), requests: 1
> Consecutive: start 1180914292, length 7936 sec (3968 KB), requests: 31
> Consecutive: start 1180914284, length 8 sec (4 KB), requests: 1
> Consecutive: start 1180922236, length 7936 sec (3968 KB), requests: 31
> Consecutive: start 1180922228, length 8 sec (4 KB), requests: 1
> Consecutive: start 1180930180, length 7936 sec (3968 KB), requests: 31
> 
> See, 31 requests in order, then one request "backwards", then 31 in order, etc.

I found out what causes this. It's get_request_wait().

When the request queue is full, and a new request needs to be created,
__make_request() blocks in get_request_wait().

Another process wakes up first (pdflush / process submitting I/O itself /
xfsdatad / etc) and sends the next bio's to __make_request().
In the mean time some free requests have become available, and the bios
are merged into a new request. Those requests are submitted to the device.

Then, get_request_wait() returns but the bio is not mergeable anymore -
and that results in a backwards seek, severely limiting the I/O rate.

Wouldn't it be better to allow the request allocation and queue the
request, and /then/ put the process to sleep ? The queue will grow larger
than nr_requests, but it does that anyway.

Mike.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [linux-lvm] Re: IO scheduler, queue depth, nr_requests
@ 2004-02-19  8:51       ` Miquel van Smoorenburg
  0 siblings, 0 replies; 59+ messages in thread
From: Miquel van Smoorenburg @ 2004-02-19  8:51 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Miquel van Smoorenburg, linux-lvm, linux-kernel, Joe Thornber

On Tue, 17 Feb 2004 15:57:16, Miquel van Smoorenburg wrote:
> For some reason, when using LVM, write requests get queued out
> of order to the 3ware controller, which results in quite a bit
> of seeking and thus performance loss.
[..]
> Okay I repeated some earlier tests, and I added some debug code in
> several places.
> 
> I added logging to tw_scsi_queue() in the 3ware driver to log the
> start sector and length of each request. It logs something like:
> 3wdbg: id 119, lba = 0x2330bc33, num_sectors = 256
> 
> With a perl script, I can check if the requests are sent to the
> host in order. That outputs something like this:
> 
> Consecutive: start 1180906348, length 7936 sec (3968 KB), requests: 31
> Consecutive: start 1180906340, length 8 sec (4 KB), requests: 1
> Consecutive: start 1180914292, length 7936 sec (3968 KB), requests: 31
> Consecutive: start 1180914284, length 8 sec (4 KB), requests: 1
> Consecutive: start 1180922236, length 7936 sec (3968 KB), requests: 31
> Consecutive: start 1180922228, length 8 sec (4 KB), requests: 1
> Consecutive: start 1180930180, length 7936 sec (3968 KB), requests: 31
> 
> See, 31 requests in order, then one request "backwards", then 31 in order, etc.

I found out what causes this. It's get_request_wait().

When the request queue is full, and a new request needs to be created,
__make_request() blocks in get_request_wait().

Another process wakes up first (pdflush / process submitting I/O itself /
xfsdatad / etc) and sends the next bio's to __make_request().
In the mean time some free requests have become available, and the bios
are merged into a new request. Those requests are submitted to the device.

Then, get_request_wait() returns but the bio is not mergeable anymore -
and that results in a backwards seek, severely limiting the I/O rate.

Wouldn't it be better to allow the request allocation and queue the
request, and /then/ put the process to sleep ? The queue will grow larger
than nr_requests, but it does that anyway.

Mike.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: IO scheduler, queue depth, nr_requests
  2004-02-19  8:51       ` [linux-lvm] " Miquel van Smoorenburg
@ 2004-02-19  8:51         ` Nick Piggin
  -1 siblings, 0 replies; 59+ messages in thread
From: Nick Piggin @ 2004-02-19  1:24 UTC (permalink / raw)
  To: Miquel van Smoorenburg; +Cc: Jens Axboe, linux-lvm, linux-kernel, Joe Thornber



Miquel van Smoorenburg wrote:

>On Tue, 17 Feb 2004 15:57:16, Miquel van Smoorenburg wrote:
>
>>For some reason, when using LVM, write requests get queued out
>>of order to the 3ware controller, which results in quite a bit
>>of seeking and thus performance loss.
>>
>[..]
>
>>Okay I repeated some earlier tests, and I added some debug code in
>>several places.
>>
>>I added logging to tw_scsi_queue() in the 3ware driver to log the
>>start sector and length of each request. It logs something like:
>>3wdbg: id 119, lba = 0x2330bc33, num_sectors = 256
>>
>>With a perl script, I can check if the requests are sent to the
>>host in order. That outputs something like this:
>>
>>Consecutive: start 1180906348, length 7936 sec (3968 KB), requests: 31
>>Consecutive: start 1180906340, length 8 sec (4 KB), requests: 1
>>Consecutive: start 1180914292, length 7936 sec (3968 KB), requests: 31
>>Consecutive: start 1180914284, length 8 sec (4 KB), requests: 1
>>Consecutive: start 1180922236, length 7936 sec (3968 KB), requests: 31
>>Consecutive: start 1180922228, length 8 sec (4 KB), requests: 1
>>Consecutive: start 1180930180, length 7936 sec (3968 KB), requests: 31
>>
>>See, 31 requests in order, then one request "backwards", then 31 in order, etc.
>>
>
>I found out what causes this. It's get_request_wait().
>
>When the request queue is full, and a new request needs to be created,
>__make_request() blocks in get_request_wait().
>
>Another process wakes up first (pdflush / process submitting I/O itself /
>xfsdatad / etc) and sends the next bio's to __make_request().
>In the mean time some free requests have become available, and the bios
>are merged into a new request. Those requests are submitted to the device.
>
>Then, get_request_wait() returns but the bio is not mergeable anymore -
>and that results in a backwards seek, severely limiting the I/O rate.
>
>Wouldn't it be better to allow the request allocation and queue the
>request, and /then/ put the process to sleep ? The queue will grow larger
>than nr_requests, but it does that anyway.
>
>

The "batching" logic there should allow a process to submit
a number of requests even above the nr_requests limit to
prevent this interleave and context switching.

Are you using tagged command queueing? What depth?


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [linux-lvm] Re: IO scheduler, queue depth, nr_requests
@ 2004-02-19  8:51         ` Nick Piggin
  0 siblings, 0 replies; 59+ messages in thread
From: Nick Piggin @ 2004-02-19  8:51 UTC (permalink / raw)
  To: Miquel van Smoorenburg; +Cc: Jens Axboe, linux-lvm, linux-kernel, Joe Thornber


Miquel van Smoorenburg wrote:

>On Tue, 17 Feb 2004 15:57:16, Miquel van Smoorenburg wrote:
>
>>For some reason, when using LVM, write requests get queued out
>>of order to the 3ware controller, which results in quite a bit
>>of seeking and thus performance loss.
>>
>[..]
>
>>Okay I repeated some earlier tests, and I added some debug code in
>>several places.
>>
>>I added logging to tw_scsi_queue() in the 3ware driver to log the
>>start sector and length of each request. It logs something like:
>>3wdbg: id 119, lba = 0x2330bc33, num_sectors = 256
>>
>>With a perl script, I can check if the requests are sent to the
>>host in order. That outputs something like this:
>>
>>Consecutive: start 1180906348, length 7936 sec (3968 KB), requests: 31
>>Consecutive: start 1180906340, length 8 sec (4 KB), requests: 1
>>Consecutive: start 1180914292, length 7936 sec (3968 KB), requests: 31
>>Consecutive: start 1180914284, length 8 sec (4 KB), requests: 1
>>Consecutive: start 1180922236, length 7936 sec (3968 KB), requests: 31
>>Consecutive: start 1180922228, length 8 sec (4 KB), requests: 1
>>Consecutive: start 1180930180, length 7936 sec (3968 KB), requests: 31
>>
>>See, 31 requests in order, then one request "backwards", then 31 in order, etc.
>>
>
>I found out what causes this. It's get_request_wait().
>
>When the request queue is full, and a new request needs to be created,
>__make_request() blocks in get_request_wait().
>
>Another process wakes up first (pdflush / process submitting I/O itself /
>xfsdatad / etc) and sends the next bio's to __make_request().
>In the mean time some free requests have become available, and the bios
>are merged into a new request. Those requests are submitted to the device.
>
>Then, get_request_wait() returns but the bio is not mergeable anymore -
>and that results in a backwards seek, severely limiting the I/O rate.
>
>Wouldn't it be better to allow the request allocation and queue the
>request, and /then/ put the process to sleep ? The queue will grow larger
>than nr_requests, but it does that anyway.
>
>

The "batching" logic there should allow a process to submit
a number of requests even above the nr_requests limit to
prevent this interleave and context switching.

Are you using tagged command queueing? What depth?

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: IO scheduler, queue depth, nr_requests
  2004-02-19  8:51         ` [linux-lvm] " Nick Piggin
@ 2004-02-19  9:00           ` Miquel van Smoorenburg
  -1 siblings, 0 replies; 59+ messages in thread
From: Miquel van Smoorenburg @ 2004-02-19  1:52 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Miquel van Smoorenburg, Jens Axboe, linux-lvm, linux-kernel,
	Joe Thornber

On Thu, 19 Feb 2004 02:24:31, Nick Piggin wrote:
> Miquel van Smoorenburg wrote:
>
> >I found out what causes this. It's get_request_wait().
> >
> >When the request queue is full, and a new request needs to be created,
> >__make_request() blocks in get_request_wait().
> >
> >Another process wakes up first (pdflush / process submitting I/O itself /
> >xfsdatad / etc) and sends the next bio's to __make_request().
> >In the mean time some free requests have become available, and the bios
> >are merged into a new request. Those requests are submitted to the device.
> >
> >Then, get_request_wait() returns but the bio is not mergeable anymore -
> >and that results in a backwards seek, severely limiting the I/O rate.
> >
> 
> The "batching" logic there should allow a process to submit
> a number of requests even above the nr_requests limit to
> prevent this interleave and context switching.
> 
> Are you using tagged command queueing? What depth?

No, I'm not using tagged command queueing. The 3ware controller is not a
real scsi controller, the driver just emulates one. It's a raid5 controller
that drives SATA disks. It has an internal request queue ("can_queu")
of 254 outstanding commands. Because that is way bigger than nr_requests
this happens - if I set nr_requests to 512, the problem goes away. But
that shouldn't happen ;)

I'm preparing a proof-of-concept patch now, if it works and I don't wedge
the remote machine I'm testing this on I'll post it in a few minutes.

Mike.
> 
> 
> 

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [linux-lvm] Re: IO scheduler, queue depth, nr_requests
@ 2004-02-19  9:00           ` Miquel van Smoorenburg
  0 siblings, 0 replies; 59+ messages in thread
From: Miquel van Smoorenburg @ 2004-02-19  9:00 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Miquel van Smoorenburg, Jens Axboe, linux-lvm, linux-kernel,
	Joe Thornber

On Thu, 19 Feb 2004 02:24:31, Nick Piggin wrote:
> Miquel van Smoorenburg wrote:
>
> >I found out what causes this. It's get_request_wait().
> >
> >When the request queue is full, and a new request needs to be created,
> >__make_request() blocks in get_request_wait().
> >
> >Another process wakes up first (pdflush / process submitting I/O itself /
> >xfsdatad / etc) and sends the next bio's to __make_request().
> >In the mean time some free requests have become available, and the bios
> >are merged into a new request. Those requests are submitted to the device.
> >
> >Then, get_request_wait() returns but the bio is not mergeable anymore -
> >and that results in a backwards seek, severely limiting the I/O rate.
> >
> 
> The "batching" logic there should allow a process to submit
> a number of requests even above the nr_requests limit to
> prevent this interleave and context switching.
> 
> Are you using tagged command queueing? What depth?

No, I'm not using tagged command queueing. The 3ware controller is not a
real scsi controller, the driver just emulates one. It's a raid5 controller
that drives SATA disks. It has an internal request queue ("can_queu")
of 254 outstanding commands. Because that is way bigger than nr_requests
this happens - if I set nr_requests to 512, the problem goes away. But
that shouldn't happen ;)

I'm preparing a proof-of-concept patch now, if it works and I don't wedge
the remote machine I'm testing this on I'll post it in a few minutes.

Mike.
> 
> 
> 

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: IO scheduler, queue depth, nr_requests
  2004-02-19  9:00           ` [linux-lvm] " Miquel van Smoorenburg
@ 2004-02-19  9:00             ` Nick Piggin
  -1 siblings, 0 replies; 59+ messages in thread
From: Nick Piggin @ 2004-02-19  2:01 UTC (permalink / raw)
  To: Miquel van Smoorenburg; +Cc: Jens Axboe, linux-lvm, linux-kernel, Joe Thornber



Miquel van Smoorenburg wrote:

>On Thu, 19 Feb 2004 02:24:31, Nick Piggin wrote:
>
>>Miquel van Smoorenburg wrote:
>>
>>
>>>I found out what causes this. It's get_request_wait().
>>>
>>>When the request queue is full, and a new request needs to be created,
>>>__make_request() blocks in get_request_wait().
>>>
>>>Another process wakes up first (pdflush / process submitting I/O itself /
>>>xfsdatad / etc) and sends the next bio's to __make_request().
>>>In the mean time some free requests have become available, and the bios
>>>are merged into a new request. Those requests are submitted to the device.
>>>
>>>Then, get_request_wait() returns but the bio is not mergeable anymore -
>>>and that results in a backwards seek, severely limiting the I/O rate.
>>>
>>>
>>The "batching" logic there should allow a process to submit
>>a number of requests even above the nr_requests limit to
>>prevent this interleave and context switching.
>>
>>Are you using tagged command queueing? What depth?
>>
>
>No, I'm not using tagged command queueing. The 3ware controller is not a
>real scsi controller, the driver just emulates one. It's a raid5 controller
>that drives SATA disks. It has an internal request queue ("can_queu")
>of 254 outstanding commands.
>

This is what I mean by tagged command queueing.

> Because that is way bigger than nr_requests
>this happens - if I set nr_requests to 512, the problem goes away. But
>that shouldn't happen ;)
>
>

What shouldn't happen?

>I'm preparing a proof-of-concept patch now, if it works and I don't wedge
>the remote machine I'm testing this on I'll post it in a few minutes.
>
>

I'm not very happy with forcing a process to sleep _after_ it
has submitted a request... but I'd be interested to see exactly
what your patch does.

By far the best option is to use appropriately sized queues.
The below patch is a start, but it unfortunately doesn't help
drivers which use private queueing implementations.

http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.3/2.6.3-mm1/broken-out/scale-nr_requests.patch


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [linux-lvm] Re: IO scheduler, queue depth, nr_requests
@ 2004-02-19  9:00             ` Nick Piggin
  0 siblings, 0 replies; 59+ messages in thread
From: Nick Piggin @ 2004-02-19  9:00 UTC (permalink / raw)
  To: Miquel van Smoorenburg; +Cc: Jens Axboe, linux-lvm, linux-kernel, Joe Thornber


Miquel van Smoorenburg wrote:

>On Thu, 19 Feb 2004 02:24:31, Nick Piggin wrote:
>
>>Miquel van Smoorenburg wrote:
>>
>>
>>>I found out what causes this. It's get_request_wait().
>>>
>>>When the request queue is full, and a new request needs to be created,
>>>__make_request() blocks in get_request_wait().
>>>
>>>Another process wakes up first (pdflush / process submitting I/O itself /
>>>xfsdatad / etc) and sends the next bio's to __make_request().
>>>In the mean time some free requests have become available, and the bios
>>>are merged into a new request. Those requests are submitted to the device.
>>>
>>>Then, get_request_wait() returns but the bio is not mergeable anymore -
>>>and that results in a backwards seek, severely limiting the I/O rate.
>>>
>>>
>>The "batching" logic there should allow a process to submit
>>a number of requests even above the nr_requests limit to
>>prevent this interleave and context switching.
>>
>>Are you using tagged command queueing? What depth?
>>
>
>No, I'm not using tagged command queueing. The 3ware controller is not a
>real scsi controller, the driver just emulates one. It's a raid5 controller
>that drives SATA disks. It has an internal request queue ("can_queu")
>of 254 outstanding commands.
>

This is what I mean by tagged command queueing.

> Because that is way bigger than nr_requests
>this happens - if I set nr_requests to 512, the problem goes away. But
>that shouldn't happen ;)
>
>

What shouldn't happen?

>I'm preparing a proof-of-concept patch now, if it works and I don't wedge
>the remote machine I'm testing this on I'll post it in a few minutes.
>
>

I'm not very happy with forcing a process to sleep _after_ it
has submitted a request... but I'd be interested to see exactly
what your patch does.

By far the best option is to use appropriately sized queues.
The below patch is a start, but it unfortunately doesn't help
drivers which use private queueing implementations.

http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.3/2.6.3-mm1/broken-out/scale-nr_requests.patch

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: IO scheduler, queue depth, nr_requests
  2004-02-19  8:51       ` [linux-lvm] " Miquel van Smoorenburg
@ 2004-02-19  8:50         ` Andrew Morton
  -1 siblings, 0 replies; 59+ messages in thread
From: Andrew Morton @ 2004-02-19  1:26 UTC (permalink / raw)
  To: Miquel van Smoorenburg; +Cc: axboe, miquels, linux-lvm, linux-kernel, thornber

Miquel van Smoorenburg <miquels@cistron.nl> wrote:
>
> On Tue, 17 Feb 2004 15:57:16, Miquel van Smoorenburg wrote:
> > For some reason, when using LVM, write requests get queued out
> > of order to the 3ware controller, which results in quite a bit
> > of seeking and thus performance loss.
> [..]
> > Okay I repeated some earlier tests, and I added some debug code in
> > several places.
> > 
> > I added logging to tw_scsi_queue() in the 3ware driver to log the
> > start sector and length of each request. It logs something like:
> > 3wdbg: id 119, lba = 0x2330bc33, num_sectors = 256
> > 
> > With a perl script, I can check if the requests are sent to the
> > host in order. That outputs something like this:
> > 
> > Consecutive: start 1180906348, length 7936 sec (3968 KB), requests: 31
> > Consecutive: start 1180906340, length 8 sec (4 KB), requests: 1
> > Consecutive: start 1180914292, length 7936 sec (3968 KB), requests: 31
> > Consecutive: start 1180914284, length 8 sec (4 KB), requests: 1
> > Consecutive: start 1180922236, length 7936 sec (3968 KB), requests: 31
> > Consecutive: start 1180922228, length 8 sec (4 KB), requests: 1
> > Consecutive: start 1180930180, length 7936 sec (3968 KB), requests: 31

3968 / 31 = 128 exactly.  I think when you say "requests: 31" you are
actually referring to a single request which has 31 128k BIOs in it, yes?

> > 
> > See, 31 requests in order, then one request "backwards", then 31 in order, etc.
> 
> I found out what causes this. It's get_request_wait().
> 
> When the request queue is full, and a new request needs to be created,
> __make_request() blocks in get_request_wait().
> 
> Another process wakes up first (pdflush / process submitting I/O itself /
> xfsdatad / etc) and sends the next bio's to __make_request().
> In the mean time some free requests have become available, and the bios
> are merged into a new request. Those requests are submitted to the device.
> 
> Then, get_request_wait() returns but the bio is not mergeable anymore -
> and that results in a backwards seek, severely limiting the I/O rate.
> 
> Wouldn't it be better to allow the request allocation and queue the
> request, and /then/ put the process to sleep ? The queue will grow larger
> than nr_requests, but it does that anyway.

That would help, but is a bit kludgy.

What _should_ have happened was:

a) The queue gets plugged

b) The maximal-sized 31-bio request is queued

c) The 4k request gets inserted *before* the 3968k request

d) The queue gets unplugged, and both requests stream smoothly.

(And this assumes that the queue was initially empty.  ALmost certainly
that was not the case).

So the question is, why did the little, later 4k request not get itself
inserted in front of the earlier, large 3968k request?

Are you using md as well?  If so, try removing the blk_run_queues()
statements from md_thread() and md_do_sync().  And generally hunt around
and try nuking blk_run_queues() statements from the device driver/md/dm
layer - they may well be wrong.

Is this problem specific to md?  Can it be reproduced on a boring-old-disk?

Does elevator=deadline change anything?



^ permalink raw reply	[flat|nested] 59+ messages in thread

* [linux-lvm] Re: IO scheduler, queue depth, nr_requests
@ 2004-02-19  8:50         ` Andrew Morton
  0 siblings, 0 replies; 59+ messages in thread
From: Andrew Morton @ 2004-02-19  8:50 UTC (permalink / raw)
  To: Miquel van Smoorenburg; +Cc: axboe, linux-lvm, linux-kernel, thornber

Miquel van Smoorenburg <miquels@cistron.nl> wrote:
>
> On Tue, 17 Feb 2004 15:57:16, Miquel van Smoorenburg wrote:
> > For some reason, when using LVM, write requests get queued out
> > of order to the 3ware controller, which results in quite a bit
> > of seeking and thus performance loss.
> [..]
> > Okay I repeated some earlier tests, and I added some debug code in
> > several places.
> > 
> > I added logging to tw_scsi_queue() in the 3ware driver to log the
> > start sector and length of each request. It logs something like:
> > 3wdbg: id 119, lba = 0x2330bc33, num_sectors = 256
> > 
> > With a perl script, I can check if the requests are sent to the
> > host in order. That outputs something like this:
> > 
> > Consecutive: start 1180906348, length 7936 sec (3968 KB), requests: 31
> > Consecutive: start 1180906340, length 8 sec (4 KB), requests: 1
> > Consecutive: start 1180914292, length 7936 sec (3968 KB), requests: 31
> > Consecutive: start 1180914284, length 8 sec (4 KB), requests: 1
> > Consecutive: start 1180922236, length 7936 sec (3968 KB), requests: 31
> > Consecutive: start 1180922228, length 8 sec (4 KB), requests: 1
> > Consecutive: start 1180930180, length 7936 sec (3968 KB), requests: 31

3968 / 31 = 128 exactly.  I think when you say "requests: 31" you are
actually referring to a single request which has 31 128k BIOs in it, yes?

> > 
> > See, 31 requests in order, then one request "backwards", then 31 in order, etc.
> 
> I found out what causes this. It's get_request_wait().
> 
> When the request queue is full, and a new request needs to be created,
> __make_request() blocks in get_request_wait().
> 
> Another process wakes up first (pdflush / process submitting I/O itself /
> xfsdatad / etc) and sends the next bio's to __make_request().
> In the mean time some free requests have become available, and the bios
> are merged into a new request. Those requests are submitted to the device.
> 
> Then, get_request_wait() returns but the bio is not mergeable anymore -
> and that results in a backwards seek, severely limiting the I/O rate.
> 
> Wouldn't it be better to allow the request allocation and queue the
> request, and /then/ put the process to sleep ? The queue will grow larger
> than nr_requests, but it does that anyway.

That would help, but is a bit kludgy.

What _should_ have happened was:

a) The queue gets plugged

b) The maximal-sized 31-bio request is queued

c) The 4k request gets inserted *before* the 3968k request

d) The queue gets unplugged, and both requests stream smoothly.

(And this assumes that the queue was initially empty.  ALmost certainly
that was not the case).

So the question is, why did the little, later 4k request not get itself
inserted in front of the earlier, large 3968k request?

Are you using md as well?  If so, try removing the blk_run_queues()
statements from md_thread() and md_do_sync().  And generally hunt around
and try nuking blk_run_queues() statements from the device driver/md/dm
layer - they may well be wrong.

Is this problem specific to md?  Can it be reproduced on a boring-old-disk?

Does elevator=deadline change anything?

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: IO scheduler, queue depth, nr_requests
  2004-02-19  8:50         ` [linux-lvm] " Andrew Morton
@ 2004-02-19  9:08           ` Miquel van Smoorenburg
  -1 siblings, 0 replies; 59+ messages in thread
From: Miquel van Smoorenburg @ 2004-02-19  2:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Miquel van Smoorenburg, axboe, miquels, linux-lvm, linux-kernel,
	thornber

On Thu, 19 Feb 2004 02:26:22, Andrew Morton wrote:
> Miquel van Smoorenburg <miquels@cistron.nl> wrote:
> >
> > On Tue, 17 Feb 2004 15:57:16, Miquel van Smoorenburg wrote:
> > > For some reason, when using LVM, write requests get queued out
> > > of order to the 3ware controller, which results in quite a bit
> > > of seeking and thus performance loss.
> > [..]
> > > Okay I repeated some earlier tests, and I added some debug code in
> > > several places.
> > > 
> > > I added logging to tw_scsi_queue() in the 3ware driver to log the
> > > start sector and length of each request. It logs something like:
> > > 3wdbg: id 119, lba = 0x2330bc33, num_sectors = 256
> > > 
> > > With a perl script, I can check if the requests are sent to the
> > > host in order. That outputs something like this:
> > > 
> > > Consecutive: start 1180906348, length 7936 sec (3968 KB), requests: 31
> > > Consecutive: start 1180906340, length 8 sec (4 KB), requests: 1
> > > Consecutive: start 1180914292, length 7936 sec (3968 KB), requests: 31
> > > Consecutive: start 1180914284, length 8 sec (4 KB), requests: 1
> > > Consecutive: start 1180922236, length 7936 sec (3968 KB), requests: 31
> > > Consecutive: start 1180922228, length 8 sec (4 KB), requests: 1
> > > Consecutive: start 1180930180, length 7936 sec (3968 KB), requests: 31
> 
> 3968 / 31 = 128 exactly.  I think when you say "requests: 31" you are
> actually referring to a single request which has 31 128k BIOs in it, yes?

No, I'm actually referring to a struct request. I'm logging this in the
SCSI layer, in scsi_request_fn(), just after elv_next_request(). I have
in fact logged all the bio's submitted to __make_request, and the output
of the elevator from elv_next_request(). The bio's are submitted sequentially,
the resulting requests aren't. But this is because nr_requests is 128, while
the 3ware device has a queue of 254 entries (no tagging though). Upping
nr_requests to 512 makes this go away ..

That shouldn't be necessary though. I only see this with LVM over 3ware-raid5,
not on the 3ware-raid5 array directly (/dev/sda1). And it gets less troublesome
with a lot of debugging (unless I set nr_requests lower again), which points
to a timing issue.


> > > See, 31 requests in order, then one request "backwards", then 31 in order, etc.
> > 
> > I found out what causes this. It's get_request_wait().
> > 
> > When the request queue is full, and a new request needs to be created,
> > __make_request() blocks in get_request_wait().
> > 
> > Another process wakes up first (pdflush / process submitting I/O itself /
> > xfsdatad / etc) and sends the next bio's to __make_request().
> > In the mean time some free requests have become available, and the bios
> > are merged into a new request. Those requests are submitted to the device.
> > 
> > Then, get_request_wait() returns but the bio is not mergeable anymore -
> > and that results in a backwards seek, severely limiting the I/O rate.
> > 
> > Wouldn't it be better to allow the request allocation and queue the
> > request, and /then/ put the process to sleep ? The queue will grow larger
> > than nr_requests, but it does that anyway.
> 
> That would help, but is a bit kludgy.

Patch that does exactly this (it's probably butt-ugly, but it proves the
point) below.

> What _should_ have happened was:
> 
> a) The queue gets plugged
> 
> b) The maximal-sized 31-bio request is queued
> 
> c) The 4k request gets inserted *before* the 3968k request
> 
> d) The queue gets unplugged, and both requests stream smoothly.
> 
> (And this assumes that the queue was initially empty.  ALmost certainly
> that was not the case).

The thing is, the bio's are submitted perfectly sequentially. It's just that
every so often a request (with just its initial bio) gets stuck for a while.
Because the bio's after it are merged and sent to the device, it's not
possible to merge that stuck request later on when it gets unstuck, because
the other bio's have already left the building so to speak.

I'm not using MD at all, and I tried all elevators. All the same results,
so I stayed with elevator=noop because it's so much easier to understand ;)

This proof-of-concept patch fixes things:

ll_rw_blk.c.patch

(Hmm, 03:11, time to go to bed. I'm happy I finally found this though!)

--- linux-2.6.3/drivers/block/ll_rw_blk.c.ORIG	2004-02-04 04:43:10.000000000 +0100
+++ linux-2.6.3/drivers/block/ll_rw_blk.c	2004-02-19 02:51:06.000000000 +0100
@@ -1545,7 +1545,8 @@
 /*
  * Get a free request, queue_lock must not be held
  */
-static struct request *get_request(request_queue_t *q, int rw, int gfp_mask)
+static struct request *get_request(request_queue_t *q, int rw,
+					int gfp_mask, int *qfull)
 {
 	struct request *rq = NULL;
 	struct request_list *rl = &q->rq;
@@ -1569,10 +1570,15 @@
 			&& !ioc_batching(ioc) && !elv_may_queue(q, rw)) {
 		/*
 		 * The queue is full and the allocating process is not a
-		 * "batcher", and not exempted by the IO scheduler
+		 * "batcher", and not exempted by the IO scheduler.
+		 * If "qfull" is a valid pointer, set it to 1 to return
+		 * this info to the caller so that it can sleep.
 		 */
-		spin_unlock_irq(q->queue_lock);
-		goto out;
+		if (qfull == NULL) {
+			spin_unlock_irq(q->queue_lock);
+			goto out;
+		} else
+			*qfull = 1;
 	}
 
 	rl->count[rw]++;
@@ -1627,7 +1633,7 @@
  * No available requests for this queue, unplug the device and wait for some
  * requests to become available.
  */
-static struct request *get_request_wait(request_queue_t *q, int rw)
+static struct request *get_request_wait(request_queue_t *q, int rw, int wantreq)
 {
 	DEFINE_WAIT(wait);
 	struct request *rq;
@@ -1639,7 +1645,7 @@
 		prepare_to_wait_exclusive(&rl->wait[rw], &wait,
 				TASK_UNINTERRUPTIBLE);
 
-		rq = get_request(q, rw, GFP_NOIO);
+		rq = wantreq ? get_request(q, rw, GFP_NOIO, NULL) : NULL;
 
 		if (!rq) {
 			struct io_context *ioc;
@@ -1657,7 +1663,7 @@
 			put_io_context(ioc);
 		}
 		finish_wait(&rl->wait[rw], &wait);
-	} while (!rq);
+	} while (!rq && wantreq);
 
 	return rq;
 }
@@ -1669,9 +1675,9 @@
 	BUG_ON(rw != READ && rw != WRITE);
 
 	if (gfp_mask & __GFP_WAIT)
-		rq = get_request_wait(q, rw);
+		rq = get_request_wait(q, rw, 1);
 	else
-		rq = get_request(q, rw, gfp_mask);
+		rq = get_request(q, rw, gfp_mask, NULL);
 
 	return rq;
 }
@@ -2002,6 +2008,7 @@
 	struct request *req, *freereq = NULL;
 	int el_ret, rw, nr_sectors, cur_nr_sectors, barrier, ra;
 	sector_t sector;
+	int qfull;
 
 	sector = bio->bi_sector;
 	nr_sectors = bio_sectors(bio);
@@ -2023,6 +2030,7 @@
 	ra = bio->bi_rw & (1 << BIO_RW_AHEAD);
 
 again:
+	qfull = 0;
 	spin_lock_irq(q->queue_lock);
 
 	if (elv_queue_empty(q)) {
@@ -2096,14 +2104,15 @@
 		freereq = NULL;
 	} else {
 		spin_unlock_irq(q->queue_lock);
-		if ((freereq = get_request(q, rw, GFP_ATOMIC)) == NULL) {
+		freereq = get_request(q, rw, GFP_ATOMIC, ra ? NULL : &qfull);
+		if (freereq == NULL) {
 			/*
 			 * READA bit set
 			 */
 			if (ra)
 				goto end_io;
 	
-			freereq = get_request_wait(q, rw);
+			freereq = get_request_wait(q, rw, 1);
 		}
 		goto again;
 	}
@@ -2141,6 +2150,13 @@
 	req->start_time = jiffies;
 
 	add_request(q, req);
+
+	if (qfull) {
+		spin_unlock_irq(q->queue_lock);
+		get_request_wait(q, rw, 0);
+		spin_lock_irq(q->queue_lock);
+	}
+
 out:
 	if (freereq)
 		__blk_put_request(q, freereq);

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [linux-lvm] Re: IO scheduler, queue depth, nr_requests
@ 2004-02-19  9:08           ` Miquel van Smoorenburg
  0 siblings, 0 replies; 59+ messages in thread
From: Miquel van Smoorenburg @ 2004-02-19  9:08 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Miquel van Smoorenburg, axboe

On Thu, 19 Feb 2004 02:26:22, Andrew Morton wrote:
> Miquel van Smoorenburg <miquels@cistron.nl> wrote:
> >
> > On Tue, 17 Feb 2004 15:57:16, Miquel van Smoorenburg wrote:
> > > For some reason, when using LVM, write requests get queued out
> > > of order to the 3ware controller, which results in quite a bit
> > > of seeking and thus performance loss.
> > [..]
> > > Okay I repeated some earlier tests, and I added some debug code in
> > > several places.
> > > 
> > > I added logging to tw_scsi_queue() in the 3ware driver to log the
> > > start sector and length of each request. It logs something like:
> > > 3wdbg: id 119, lba = 0x2330bc33, num_sectors = 256
> > > 
> > > With a perl script, I can check if the requests are sent to the
> > > host in order. That outputs something like this:
> > > 
> > > Consecutive: start 1180906348, length 7936 sec (3968 KB), requests: 31
> > > Consecutive: start 1180906340, length 8 sec (4 KB), requests: 1
> > > Consecutive: start 1180914292, length 7936 sec (3968 KB), requests: 31
> > > Consecutive: start 1180914284, length 8 sec (4 KB), requests: 1
> > > Consecutive: start 1180922236, length 7936 sec (3968 KB), requests: 31
> > > Consecutive: start 1180922228, length 8 sec (4 KB), requests: 1
> > > Consecutive: start 1180930180, length 7936 sec (3968 KB), requests: 31
> 
> 3968 / 31 = 128 exactly.  I think when you say "requests: 31" you are
> actually referring to a single request which has 31 128k BIOs in it, yes?

No, I'm actually referring to a struct request. I'm logging this in the
SCSI layer, in scsi_request_fn(), just after elv_next_request(). I have
in fact logged all the bio's submitted to __make_request, and the output
of the elevator from elv_next_request(). The bio's are submitted sequentially,
the resulting requests aren't. But this is because nr_requests is 128, while
the 3ware device has a queue of 254 entries (no tagging though). Upping
nr_requests to 512 makes this go away ..

That shouldn't be necessary though. I only see this with LVM over 3ware-raid5,
not on the 3ware-raid5 array directly (/dev/sda1). And it gets less troublesome
with a lot of debugging (unless I set nr_requests lower again), which points
to a timing issue.


> > > See, 31 requests in order, then one request "backwards", then 31 in order, etc.
> > 
> > I found out what causes this. It's get_request_wait().
> > 
> > When the request queue is full, and a new request needs to be created,
> > __make_request() blocks in get_request_wait().
> > 
> > Another process wakes up first (pdflush / process submitting I/O itself /
> > xfsdatad / etc) and sends the next bio's to __make_request().
> > In the mean time some free requests have become available, and the bios
> > are merged into a new request. Those requests are submitted to the device.
> > 
> > Then, get_request_wait() returns but the bio is not mergeable anymore -
> > and that results in a backwards seek, severely limiting the I/O rate.
> > 
> > Wouldn't it be better to allow the request allocation and queue the
> > request, and /then/ put the process to sleep ? The queue will grow larger
> > than nr_requests, but it does that anyway.
> 
> That would help, but is a bit kludgy.

Patch that does exactly this (it's probably butt-ugly, but it proves the
point) below.

> What _should_ have happened was:
> 
> a) The queue gets plugged
> 
> b) The maximal-sized 31-bio request is queued
> 
> c) The 4k request gets inserted *before* the 3968k request
> 
> d) The queue gets unplugged, and both requests stream smoothly.
> 
> (And this assumes that the queue was initially empty.  ALmost certainly
> that was not the case).

The thing is, the bio's are submitted perfectly sequentially. It's just that
every so often a request (with just its initial bio) gets stuck for a while.
Because the bio's after it are merged and sent to the device, it's not
possible to merge that stuck request later on when it gets unstuck, because
the other bio's have already left the building so to speak.

I'm not using MD at all, and I tried all elevators. All the same results,
so I stayed with elevator=noop because it's so much easier to understand ;)

This proof-of-concept patch fixes things:

ll_rw_blk.c.patch

(Hmm, 03:11, time to go to bed. I'm happy I finally found this though!)

--- linux-2.6.3/drivers/block/ll_rw_blk.c.ORIG	2004-02-04 04:43:10.000000000 +0100
+++ linux-2.6.3/drivers/block/ll_rw_blk.c	2004-02-19 02:51:06.000000000 +0100
@@ -1545,7 +1545,8 @@
 /*
  * Get a free request, queue_lock must not be held
  */
-static struct request *get_request(request_queue_t *q, int rw, int gfp_mask)
+static struct request *get_request(request_queue_t *q, int rw,
+					int gfp_mask, int *qfull)
 {
 	struct request *rq = NULL;
 	struct request_list *rl = &q->rq;
@@ -1569,10 +1570,15 @@
 			&& !ioc_batching(ioc) && !elv_may_queue(q, rw)) {
 		/*
 		 * The queue is full and the allocating process is not a
-		 * "batcher", and not exempted by the IO scheduler
+		 * "batcher", and not exempted by the IO scheduler.
+		 * If "qfull" is a valid pointer, set it to 1 to return
+		 * this info to the caller so that it can sleep.
 		 */
-		spin_unlock_irq(q->queue_lock);
-		goto out;
+		if (qfull == NULL) {
+			spin_unlock_irq(q->queue_lock);
+			goto out;
+		} else
+			*qfull = 1;
 	}
 
 	rl->count[rw]++;
@@ -1627,7 +1633,7 @@
  * No available requests for this queue, unplug the device and wait for some
  * requests to become available.
  */
-static struct request *get_request_wait(request_queue_t *q, int rw)
+static struct request *get_request_wait(request_queue_t *q, int rw, int wantreq)
 {
 	DEFINE_WAIT(wait);
 	struct request *rq;
@@ -1639,7 +1645,7 @@
 		prepare_to_wait_exclusive(&rl->wait[rw], &wait,
 				TASK_UNINTERRUPTIBLE);
 
-		rq = get_request(q, rw, GFP_NOIO);
+		rq = wantreq ? get_request(q, rw, GFP_NOIO, NULL) : NULL;
 
 		if (!rq) {
 			struct io_context *ioc;
@@ -1657,7 +1663,7 @@
 			put_io_context(ioc);
 		}
 		finish_wait(&rl->wait[rw], &wait);
-	} while (!rq);
+	} while (!rq && wantreq);
 
 	return rq;
 }
@@ -1669,9 +1675,9 @@
 	BUG_ON(rw != READ && rw != WRITE);
 
 	if (gfp_mask & __GFP_WAIT)
-		rq = get_request_wait(q, rw);
+		rq = get_request_wait(q, rw, 1);
 	else
-		rq = get_request(q, rw, gfp_mask);
+		rq = get_request(q, rw, gfp_mask, NULL);
 
 	return rq;
 }
@@ -2002,6 +2008,7 @@
 	struct request *req, *freereq = NULL;
 	int el_ret, rw, nr_sectors, cur_nr_sectors, barrier, ra;
 	sector_t sector;
+	int qfull;
 
 	sector = bio->bi_sector;
 	nr_sectors = bio_sectors(bio);
@@ -2023,6 +2030,7 @@
 	ra = bio->bi_rw & (1 << BIO_RW_AHEAD);
 
 again:
+	qfull = 0;
 	spin_lock_irq(q->queue_lock);
 
 	if (elv_queue_empty(q)) {
@@ -2096,14 +2104,15 @@
 		freereq = NULL;
 	} else {
 		spin_unlock_irq(q->queue_lock);
-		if ((freereq = get_request(q, rw, GFP_ATOMIC)) == NULL) {
+		freereq = get_request(q, rw, GFP_ATOMIC, ra ? NULL : &qfull);
+		if (freereq == NULL) {
 			/*
 			 * READA bit set
 			 */
 			if (ra)
 				goto end_io;
 	
-			freereq = get_request_wait(q, rw);
+			freereq = get_request_wait(q, rw, 1);
 		}
 		goto again;
 	}
@@ -2141,6 +2150,13 @@
 	req->start_time = jiffies;
 
 	add_request(q, req);
+
+	if (qfull) {
+		spin_unlock_irq(q->queue_lock);
+		get_request_wait(q, rw, 0);
+		spin_lock_irq(q->queue_lock);
+	}
+
 out:
 	if (freereq)
 		__blk_put_request(q, freereq);

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: IO scheduler, queue depth, nr_requests
  2004-02-19  9:08           ` [linux-lvm] " Miquel van Smoorenburg
@ 2004-02-19  9:08             ` Andrew Morton
  -1 siblings, 0 replies; 59+ messages in thread
From: Andrew Morton @ 2004-02-19  2:26 UTC (permalink / raw)
  To: Miquel van Smoorenburg; +Cc: miquels, axboe, linux-lvm, linux-kernel, thornber

Miquel van Smoorenburg <miquels@cistron.nl> wrote:
>
> The thing is, the bio's are submitted perfectly sequentially. It's just that
>  every so often a request (with just its initial bio) gets stuck for a while.
>  Because the bio's after it are merged and sent to the device, it's not
>  possible to merge that stuck request later on when it gets unstuck, because
>  the other bio's have already left the building so to speak.

Oh.  So the raid controller's queue depth is larger than the kernel's.  So
everything gets immediately shovelled into the device and the kernel is
left with nothing to merge the little request against.

Shouldn't the controller itself be performing the insertion?

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [linux-lvm] Re: IO scheduler, queue depth, nr_requests
@ 2004-02-19  9:08             ` Andrew Morton
  0 siblings, 0 replies; 59+ messages in thread
From: Andrew Morton @ 2004-02-19  9:08 UTC (permalink / raw)
  To: Miquel van Smoorenburg; +Cc: axboe, linux-lvm, linux-kernel, thornber

Miquel van Smoorenburg <miquels@cistron.nl> wrote:
>
> The thing is, the bio's are submitted perfectly sequentially. It's just that
>  every so often a request (with just its initial bio) gets stuck for a while.
>  Because the bio's after it are merged and sent to the device, it's not
>  possible to merge that stuck request later on when it gets unstuck, because
>  the other bio's have already left the building so to speak.

Oh.  So the raid controller's queue depth is larger than the kernel's.  So
everything gets immediately shovelled into the device and the kernel is
left with nothing to merge the little request against.

Shouldn't the controller itself be performing the insertion?

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: IO scheduler, queue depth, nr_requests
  2004-02-19  9:08             ` [linux-lvm] " Andrew Morton
@ 2004-02-19 10:23               ` Miquel van Smoorenburg
  -1 siblings, 0 replies; 59+ messages in thread
From: Miquel van Smoorenburg @ 2004-02-19 10:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Miquel van Smoorenburg, miquels, axboe, linux-lvm, linux-kernel,
	thornber

On Thu, 19 Feb 2004 03:26:28, Andrew Morton wrote:
> Miquel van Smoorenburg <miquels@cistron.nl> wrote:
> >
> > The thing is, the bio's are submitted perfectly sequentially. It's just that
> >  every so often a request (with just its initial bio) gets stuck for a while.
> >  Because the bio's after it are merged and sent to the device, it's not
> >  possible to merge that stuck request later on when it gets unstuck, because
> >  the other bio's have already left the building so to speak.
> 
> Oh.  So the raid controller's queue depth is larger than the kernel's.  So
> everything gets immediately shovelled into the device and the kernel is
> left with nothing to merge the little request against.

Well, the request queue of the kernel is max'ed out too, otherwise
get_request_wait() wouldn't be called. It's just an unfortunate timing issue.

> Shouldn't the controller itself be performing the insertion?

Well, you would indeed expect the 3ware hardware to be smarter than that,
but in its defence, the driver doesn't set sdev->simple_tags or sdev->ordered_tags
at all. It just has a large queue on the host, in hardware.

Perhaps this info should be exported into the request queue of the device,
so that ll_rw_blk knows about this and can do something similar to the
hack I posted ?

Note that AFAICS nothing in drivers/scsi uses the tagging stuff in ll_rw_blk.c.
blk_queue_init_tags() is only called by scsi_activate_tcq(), and nothing
ever calls that (except the 53c700.c driver). So you can't just check for
QUEUE_FLAG_QUEUED. Hmm, nothing in drivers/block calls it either. It's
not being used at all yet ? Or am I being dense ?

Mike.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [linux-lvm] Re: IO scheduler, queue depth, nr_requests
@ 2004-02-19 10:23               ` Miquel van Smoorenburg
  0 siblings, 0 replies; 59+ messages in thread
From: Miquel van Smoorenburg @ 2004-02-19 10:23 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Miquel van Smoorenburg

On Thu, 19 Feb 2004 03:26:28, Andrew Morton wrote:
> Miquel van Smoorenburg <miquels@cistron.nl> wrote:
> >
> > The thing is, the bio's are submitted perfectly sequentially. It's just that
> >  every so often a request (with just its initial bio) gets stuck for a while.
> >  Because the bio's after it are merged and sent to the device, it's not
> >  possible to merge that stuck request later on when it gets unstuck, because
> >  the other bio's have already left the building so to speak.
> 
> Oh.  So the raid controller's queue depth is larger than the kernel's.  So
> everything gets immediately shovelled into the device and the kernel is
> left with nothing to merge the little request against.

Well, the request queue of the kernel is max'ed out too, otherwise
get_request_wait() wouldn't be called. It's just an unfortunate timing issue.

> Shouldn't the controller itself be performing the insertion?

Well, you would indeed expect the 3ware hardware to be smarter than that,
but in its defence, the driver doesn't set sdev->simple_tags or sdev->ordered_tags
at all. It just has a large queue on the host, in hardware.

Perhaps this info should be exported into the request queue of the device,
so that ll_rw_blk knows about this and can do something similar to the
hack I posted ?

Note that AFAICS nothing in drivers/scsi uses the tagging stuff in ll_rw_blk.c.
blk_queue_init_tags() is only called by scsi_activate_tcq(), and nothing
ever calls that (except the 53c700.c driver). So you can't just check for
QUEUE_FLAG_QUEUED. Hmm, nothing in drivers/block calls it either. It's
not being used at all yet ? Or am I being dense ?

Mike.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: IO scheduler, queue depth, nr_requests
  2004-02-19 10:23               ` [linux-lvm] " Miquel van Smoorenburg
@ 2004-02-19 10:26                 ` Jens Axboe
  -1 siblings, 0 replies; 59+ messages in thread
From: Jens Axboe @ 2004-02-19 10:19 UTC (permalink / raw)
  To: Miquel van Smoorenburg; +Cc: Andrew Morton, linux-lvm, linux-kernel, thornber

On Thu, Feb 19 2004, Miquel van Smoorenburg wrote:
> On Thu, 19 Feb 2004 03:26:28, Andrew Morton wrote:
> > Miquel van Smoorenburg <miquels@cistron.nl> wrote:
> > >
> > > The thing is, the bio's are submitted perfectly sequentially. It's just that
> > >  every so often a request (with just its initial bio) gets stuck for a while.
> > >  Because the bio's after it are merged and sent to the device, it's not
> > >  possible to merge that stuck request later on when it gets unstuck, because
> > >  the other bio's have already left the building so to speak.
> > 
> > Oh.  So the raid controller's queue depth is larger than the kernel's.  So
> > everything gets immediately shovelled into the device and the kernel is
> > left with nothing to merge the little request against.
> 
> Well, the request queue of the kernel is max'ed out too, otherwise
> get_request_wait() wouldn't be called. It's just an unfortunate timing
> issue.

Indeed

> > Shouldn't the controller itself be performing the insertion?
> 
> Well, you would indeed expect the 3ware hardware to be smarter than
> that, but in its defence, the driver doesn't set sdev->simple_tags or
> sdev->ordered_tags at all. It just has a large queue on the host, in
> hardware.

A too large queue. IMHO the simple and correct solution to your problem
is to diminish the host queue (sane solution), or bump the block layer
queue size (dumb solution).

> Perhaps this info should be exported into the request queue of the device,
> so that ll_rw_blk knows about this and can do something similar to the
> hack I posted ?
> 
> Note that AFAICS nothing in drivers/scsi uses the tagging stuff in
> ll_rw_blk.c.  blk_queue_init_tags() is only called by
> scsi_activate_tcq(), and nothing ever calls that (except the 53c700.c
> driver). So you can't just check for QUEUE_FLAG_QUEUED. Hmm, nothing
> in drivers/block calls it either. It's not being used at all yet ? Or
> am I being dense ?

No you are correct, I already outlined that to you explicitly in the
very first mail in this thread. Hopefully this will change with 2.7 so
we have some block layer control over tagging in general.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [linux-lvm] Re: IO scheduler, queue depth, nr_requests
@ 2004-02-19 10:26                 ` Jens Axboe
  0 siblings, 0 replies; 59+ messages in thread
From: Jens Axboe @ 2004-02-19 10:26 UTC (permalink / raw)
  To: Miquel van Smoorenburg; +Cc: Andrew Morton, linux-lvm, linux-kernel, thornber

On Thu, Feb 19 2004, Miquel van Smoorenburg wrote:
> On Thu, 19 Feb 2004 03:26:28, Andrew Morton wrote:
> > Miquel van Smoorenburg <miquels@cistron.nl> wrote:
> > >
> > > The thing is, the bio's are submitted perfectly sequentially. It's just that
> > >  every so often a request (with just its initial bio) gets stuck for a while.
> > >  Because the bio's after it are merged and sent to the device, it's not
> > >  possible to merge that stuck request later on when it gets unstuck, because
> > >  the other bio's have already left the building so to speak.
> > 
> > Oh.  So the raid controller's queue depth is larger than the kernel's.  So
> > everything gets immediately shovelled into the device and the kernel is
> > left with nothing to merge the little request against.
> 
> Well, the request queue of the kernel is max'ed out too, otherwise
> get_request_wait() wouldn't be called. It's just an unfortunate timing
> issue.

Indeed

> > Shouldn't the controller itself be performing the insertion?
> 
> Well, you would indeed expect the 3ware hardware to be smarter than
> that, but in its defence, the driver doesn't set sdev->simple_tags or
> sdev->ordered_tags at all. It just has a large queue on the host, in
> hardware.

A too large queue. IMHO the simple and correct solution to your problem
is to diminish the host queue (sane solution), or bump the block layer
queue size (dumb solution).

> Perhaps this info should be exported into the request queue of the device,
> so that ll_rw_blk knows about this and can do something similar to the
> hack I posted ?
> 
> Note that AFAICS nothing in drivers/scsi uses the tagging stuff in
> ll_rw_blk.c.  blk_queue_init_tags() is only called by
> scsi_activate_tcq(), and nothing ever calls that (except the 53c700.c
> driver). So you can't just check for QUEUE_FLAG_QUEUED. Hmm, nothing
> in drivers/block calls it either. It's not being used at all yet ? Or
> am I being dense ?

No you are correct, I already outlined that to you explicitly in the
very first mail in this thread. Hopefully this will change with 2.7 so
we have some block layer control over tagging in general.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [linux-lvm] Re: IO scheduler, queue depth, nr_requests
  2004-02-19 10:26                 ` [linux-lvm] " Jens Axboe
@ 2004-02-19 20:59                   ` Miquel van Smoorenburg
  -1 siblings, 0 replies; 59+ messages in thread
From: Miquel van Smoorenburg @ 2004-02-19 15:58 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Miquel van Smoorenburg, Andrew Morton, linux-lvm, linux-kernel,
	thornber

On Thu, 19 Feb 2004 11:19:15, Jens Axboe wrote:
> On Thu, Feb 19 2004, Miquel van Smoorenburg wrote:
>
> > > Shouldn't the controller itself be performing the insertion?
> > 
> > Well, you would indeed expect the 3ware hardware to be smarter than
> > that, but in its defence, the driver doesn't set sdev->simple_tags or
> > sdev->ordered_tags at all. It just has a large queue on the host, in
> > hardware.
> 
> A too large queue. IMHO the simple and correct solution to your problem
> is to diminish the host queue (sane solution), or bump the block layer
> queue size (dumb solution).

Well, I did that. Lowering the queue size of the 3ware controller to 64
does help a bit, but performance is still not optimal - leaving it at 254
and increasing the nr_requests of the queue to 512 helps the most.

But the patch I posted does just as well, without any tuning. I changed
it a little though - it only has the "new" behaviour (instead of blocking
on allocating a request, allocate it, queue it, _then_ block) for WRITEs.
That results in the best performance I've seen, by far.

Now the style of my patch might be ugly, but what is conceptually wrong
with allocating the request and queueing it, then block if the queue is
full, versus blocking on allocating the request and keeping a bio
"stuck" for quite some time, resulting in out-of-order requests to the
hardware ?

Note that this is not an issue of '2 processes writing to 1 file', really.
It's one process and pdflush writing the same dirty pages of the same file.

Mike.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: IO scheduler, queue depth, nr_requests
@ 2004-02-19 20:59                   ` Miquel van Smoorenburg
  0 siblings, 0 replies; 59+ messages in thread
From: Miquel van Smoorenburg @ 2004-02-19 20:59 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Miquel van Smoorenburg, Andrew Morton, linux-lvm, linux-kernel,
	thornber

On Thu, 19 Feb 2004 11:19:15, Jens Axboe wrote:
> On Thu, Feb 19 2004, Miquel van Smoorenburg wrote:
>
> > > Shouldn't the controller itself be performing the insertion?
> > 
> > Well, you would indeed expect the 3ware hardware to be smarter than
> > that, but in its defence, the driver doesn't set sdev->simple_tags or
> > sdev->ordered_tags at all. It just has a large queue on the host, in
> > hardware.
> 
> A too large queue. IMHO the simple and correct solution to your problem
> is to diminish the host queue (sane solution), or bump the block layer
> queue size (dumb solution).

Well, I did that. Lowering the queue size of the 3ware controller to 64
does help a bit, but performance is still not optimal - leaving it at 254
and increasing the nr_requests of the queue to 512 helps the most.

But the patch I posted does just as well, without any tuning. I changed
it a little though - it only has the "new" behaviour (instead of blocking
on allocating a request, allocate it, queue it, _then_ block) for WRITEs.
That results in the best performance I've seen, by far.

Now the style of my patch might be ugly, but what is conceptually wrong
with allocating the request and queueing it, then block if the queue is
full, versus blocking on allocating the request and keeping a bio
"stuck" for quite some time, resulting in out-of-order requests to the
hardware ?

Note that this is not an issue of '2 processes writing to 1 file', really.
It's one process and pdflush writing the same dirty pages of the same file.

Mike.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [linux-lvm] Re: IO scheduler, queue depth, nr_requests
  2004-02-19 20:59                   ` Miquel van Smoorenburg
@ 2004-02-19 22:52                     ` Nick Piggin
  -1 siblings, 0 replies; 59+ messages in thread
From: Nick Piggin @ 2004-02-19 17:52 UTC (permalink / raw)
  To: Miquel van Smoorenburg
  Cc: Jens Axboe, Andrew Morton, linux-lvm, linux-kernel, thornber


Miquel van Smoorenburg wrote:

>On Thu, 19 Feb 2004 11:19:15, Jens Axboe wrote:
>
>>On Thu, Feb 19 2004, Miquel van Smoorenburg wrote:
>>
>>
>>>>Shouldn't the controller itself be performing the insertion?
>>>>
>>>Well, you would indeed expect the 3ware hardware to be smarter than
>>>that, but in its defence, the driver doesn't set sdev->simple_tags or
>>>sdev->ordered_tags at all. It just has a large queue on the host, in
>>>hardware.
>>>
>>A too large queue. IMHO the simple and correct solution to your problem
>>is to diminish the host queue (sane solution), or bump the block layer
>>queue size (dumb solution).
>>
>
>Well, I did that. Lowering the queue size of the 3ware controller to 64
>does help a bit, but performance is still not optimal - leaving it at 254
>and increasing the nr_requests of the queue to 512 helps the most.
>
>But the patch I posted does just as well, without any tuning. I changed
>it a little though - it only has the "new" behaviour (instead of blocking
>on allocating a request, allocate it, queue it, _then_ block) for WRITEs.
>That results in the best performance I've seen, by far.
>
>

That's because you are half introducing per-process limits.

>Now the style of my patch might be ugly, but what is conceptually wrong
>with allocating the request and queueing it, then block if the queue is
>full, versus blocking on allocating the request and keeping a bio
>"stuck" for quite some time, resulting in out-of-order requests to the
>hardware ?
>
>

Conceptually? The concept that you have everything you need to
continue and yet you block anyway is wrong.

>Note that this is not an issue of '2 processes writing to 1 file', really.
>It's one process and pdflush writing the same dirty pages of the same file.
>
>

pdflush is a process though, that's all that matters.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: IO scheduler, queue depth, nr_requests
@ 2004-02-19 22:52                     ` Nick Piggin
  0 siblings, 0 replies; 59+ messages in thread
From: Nick Piggin @ 2004-02-19 22:52 UTC (permalink / raw)
  To: Miquel van Smoorenburg
  Cc: Jens Axboe, Andrew Morton, linux-lvm, linux-kernel, thornber



Miquel van Smoorenburg wrote:

>On Thu, 19 Feb 2004 11:19:15, Jens Axboe wrote:
>
>>On Thu, Feb 19 2004, Miquel van Smoorenburg wrote:
>>
>>
>>>>Shouldn't the controller itself be performing the insertion?
>>>>
>>>Well, you would indeed expect the 3ware hardware to be smarter than
>>>that, but in its defence, the driver doesn't set sdev->simple_tags or
>>>sdev->ordered_tags at all. It just has a large queue on the host, in
>>>hardware.
>>>
>>A too large queue. IMHO the simple and correct solution to your problem
>>is to diminish the host queue (sane solution), or bump the block layer
>>queue size (dumb solution).
>>
>
>Well, I did that. Lowering the queue size of the 3ware controller to 64
>does help a bit, but performance is still not optimal - leaving it at 254
>and increasing the nr_requests of the queue to 512 helps the most.
>
>But the patch I posted does just as well, without any tuning. I changed
>it a little though - it only has the "new" behaviour (instead of blocking
>on allocating a request, allocate it, queue it, _then_ block) for WRITEs.
>That results in the best performance I've seen, by far.
>
>

That's because you are half introducing per-process limits.

>Now the style of my patch might be ugly, but what is conceptually wrong
>with allocating the request and queueing it, then block if the queue is
>full, versus blocking on allocating the request and keeping a bio
>"stuck" for quite some time, resulting in out-of-order requests to the
>hardware ?
>
>

Conceptually? The concept that you have everything you need to
continue and yet you block anyway is wrong.

>Note that this is not an issue of '2 processes writing to 1 file', really.
>It's one process and pdflush writing the same dirty pages of the same file.
>
>

pdflush is a process though, that's all that matters.


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [linux-lvm] Re: IO scheduler, queue depth, nr_requests
  2004-02-19 22:52                     ` Nick Piggin
@ 2004-02-19 23:53                       ` Miquel van Smoorenburg
  -1 siblings, 0 replies; 59+ messages in thread
From: Miquel van Smoorenburg @ 2004-02-19 18:52 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Miquel van Smoorenburg, Jens Axboe, Andrew Morton, linux-lvm,
	linux-kernel, thornber

On Thu, 19 Feb 2004 23:52:32, Nick Piggin wrote:
> 
> 
> Miquel van Smoorenburg wrote:
> 
> >On Thu, 19 Feb 2004 11:19:15, Jens Axboe wrote:
> >
> >>On Thu, Feb 19 2004, Miquel van Smoorenburg wrote:
> >>
> >>
> >>>>Shouldn't the controller itself be performing the insertion?
> >>>>
> >>>Well, you would indeed expect the 3ware hardware to be smarter than
> >>>that, but in its defence, the driver doesn't set sdev->simple_tags or
> >>>sdev->ordered_tags at all. It just has a large queue on the host, in
> >>>hardware.
> >>>
> >>A too large queue. IMHO the simple and correct solution to your problem
> >>is to diminish the host queue (sane solution), or bump the block layer
> >>queue size (dumb solution).
> >>
> >
> >Well, I did that. Lowering the queue size of the 3ware controller to 64
> >does help a bit, but performance is still not optimal - leaving it at 254
> >and increasing the nr_requests of the queue to 512 helps the most.
> >
> >But the patch I posted does just as well, without any tuning. I changed
> >it a little though - it only has the "new" behaviour (instead of blocking
> >on allocating a request, allocate it, queue it, _then_ block) for WRITEs.
> >That results in the best performance I've seen, by far.
> >
> >
> 
> That's because you are half introducing per-process limits.
> 
> >Now the style of my patch might be ugly, but what is conceptually wrong
> >with allocating the request and queueing it, then block if the queue is
> >full, versus blocking on allocating the request and keeping a bio
> >"stuck" for quite some time, resulting in out-of-order requests to the
> >hardware ?
> >
> >
> 
> Conceptually? The concept that you have everything you need to
> continue and yet you block anyway is wrong.

For reading, I agree. For writing .. ah well, English is not my first
language, let's not argue about language semantics.

> >Note that this is not an issue of '2 processes writing to 1 file', really.
> >It's one process and pdflush writing the same dirty pages of the same file.
> 
> pdflush is a process though, that's all that matters.

I understand that when the two processes are unrelated, the patch as I
sent it will do the wrong thing.

But the thing is, you get this:

- "dd" process writes requests
- pdflush triggers to write dirty pages
- too many pages are dirty so "dd" blocks as well to write synchronously
- "dd" process triggers "queue full" but gets marked as "batching" so
  can continue (get_request)
- pdflush tries to submit one bio and gets blocked (get_request_wait)
- "dd" continues, but that one bio from pdflush remains stuck for a while

That's stupid, that one bio from pdflush should really be allowed on
the queue, since "dd" is adding requests from the same source to it
anyway.

Perhaps writes from pdflush should be handled differently to prevent
this specific case ?

Say, if pdflush adds request #128, don't mark it as batching, but
let it block. The next process will be the one marked as batching
and can continue. If pdflush tries to add a request > 128, allow it,
but _then_ block it.

Would something like that work ? Would it be a good idea to never mark
a pdflush process as batching, or would that have a negative impact
for some things ?

Mike.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: IO scheduler, queue depth, nr_requests
@ 2004-02-19 23:53                       ` Miquel van Smoorenburg
  0 siblings, 0 replies; 59+ messages in thread
From: Miquel van Smoorenburg @ 2004-02-19 23:53 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Miquel van Smoorenburg, Jens Axboe, Andrew Morton, linux-lvm,
	linux-kernel, thornber

On Thu, 19 Feb 2004 23:52:32, Nick Piggin wrote:
> 
> 
> Miquel van Smoorenburg wrote:
> 
> >On Thu, 19 Feb 2004 11:19:15, Jens Axboe wrote:
> >
> >>On Thu, Feb 19 2004, Miquel van Smoorenburg wrote:
> >>
> >>
> >>>>Shouldn't the controller itself be performing the insertion?
> >>>>
> >>>Well, you would indeed expect the 3ware hardware to be smarter than
> >>>that, but in its defence, the driver doesn't set sdev->simple_tags or
> >>>sdev->ordered_tags at all. It just has a large queue on the host, in
> >>>hardware.
> >>>
> >>A too large queue. IMHO the simple and correct solution to your problem
> >>is to diminish the host queue (sane solution), or bump the block layer
> >>queue size (dumb solution).
> >>
> >
> >Well, I did that. Lowering the queue size of the 3ware controller to 64
> >does help a bit, but performance is still not optimal - leaving it at 254
> >and increasing the nr_requests of the queue to 512 helps the most.
> >
> >But the patch I posted does just as well, without any tuning. I changed
> >it a little though - it only has the "new" behaviour (instead of blocking
> >on allocating a request, allocate it, queue it, _then_ block) for WRITEs.
> >That results in the best performance I've seen, by far.
> >
> >
> 
> That's because you are half introducing per-process limits.
> 
> >Now the style of my patch might be ugly, but what is conceptually wrong
> >with allocating the request and queueing it, then block if the queue is
> >full, versus blocking on allocating the request and keeping a bio
> >"stuck" for quite some time, resulting in out-of-order requests to the
> >hardware ?
> >
> >
> 
> Conceptually? The concept that you have everything you need to
> continue and yet you block anyway is wrong.

For reading, I agree. For writing .. ah well, English is not my first
language, let's not argue about language semantics.

> >Note that this is not an issue of '2 processes writing to 1 file', really.
> >It's one process and pdflush writing the same dirty pages of the same file.
> 
> pdflush is a process though, that's all that matters.

I understand that when the two processes are unrelated, the patch as I
sent it will do the wrong thing.

But the thing is, you get this:

- "dd" process writes requests
- pdflush triggers to write dirty pages
- too many pages are dirty so "dd" blocks as well to write synchronously
- "dd" process triggers "queue full" but gets marked as "batching" so
  can continue (get_request)
- pdflush tries to submit one bio and gets blocked (get_request_wait)
- "dd" continues, but that one bio from pdflush remains stuck for a while

That's stupid, that one bio from pdflush should really be allowed on
the queue, since "dd" is adding requests from the same source to it
anyway.

Perhaps writes from pdflush should be handled differently to prevent
this specific case ?

Say, if pdflush adds request #128, don't mark it as batching, but
let it block. The next process will be the one marked as batching
and can continue. If pdflush tries to add a request > 128, allow it,
but _then_ block it.

Would something like that work ? Would it be a good idea to never mark
a pdflush process as batching, or would that have a negative impact
for some things ?

Mike.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [linux-lvm] Re: IO scheduler, queue depth, nr_requests
  2004-02-19 23:53                       ` Miquel van Smoorenburg
@ 2004-02-20  0:15                         ` Nick Piggin
  -1 siblings, 0 replies; 59+ messages in thread
From: Nick Piggin @ 2004-02-19 19:15 UTC (permalink / raw)
  To: Miquel van Smoorenburg
  Cc: Jens Axboe, Andrew Morton, linux-lvm, linux-kernel, thornber


Miquel van Smoorenburg wrote:

>On Thu, 19 Feb 2004 23:52:32, Nick Piggin wrote:
>
>>
>>Miquel van Smoorenburg wrote:
>>
>
...

>>>Note that this is not an issue of '2 processes writing to 1 file', really.
>>>It's one process and pdflush writing the same dirty pages of the same file.
>>>
>>pdflush is a process though, that's all that matters.
>>
>
>I understand that when the two processes are unrelated, the patch as I
>sent it will do the wrong thing.
>
>But the thing is, you get this:
>
>- "dd" process writes requests
>- pdflush triggers to write dirty pages
>- too many pages are dirty so "dd" blocks as well to write synchronously
>- "dd" process triggers "queue full" but gets marked as "batching" so
>  can continue (get_request)
>- pdflush tries to submit one bio and gets blocked (get_request_wait)
>- "dd" continues, but that one bio from pdflush remains stuck for a while
>
>

The batching logic can probably all be ripped out with per
process limits. It's too complex anyway really.

>That's stupid, that one bio from pdflush should really be allowed on
>the queue, since "dd" is adding requests from the same source to it
>anyway.
>
>

But the whole reason it is getting blocked in the first place
is because your controller is sucking up all your requests.
The whole problem is not a problem if you use properly sized
queues.

I'm a bit surprised that it wasn't working well with a controller
queue depth of 64 and 128 nr_requests. I'll give you a per process
request limit patch to try in a minute.

>Perhaps writes from pdflush should be handled differently to prevent
>this specific case ?
>
>Say, if pdflush adds request #128, don't mark it as batching, but
>let it block. The next process will be the one marked as batching
>and can continue. If pdflush tries to add a request > 128, allow it,
>but _then_ block it.
>
>Would something like that work ? Would it be a good idea to never mark
>a pdflush process as batching, or would that have a negative impact
>for some things ?
>
>

It's hard to know. Maybe a better solution would be to allow pdflush
to be exempt from the limits entirely as long as it tries not to write
to congested queues (which is what it does)...

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: IO scheduler, queue depth, nr_requests
@ 2004-02-20  0:15                         ` Nick Piggin
  0 siblings, 0 replies; 59+ messages in thread
From: Nick Piggin @ 2004-02-20  0:15 UTC (permalink / raw)
  To: Miquel van Smoorenburg
  Cc: Jens Axboe, Andrew Morton, linux-lvm, linux-kernel, thornber



Miquel van Smoorenburg wrote:

>On Thu, 19 Feb 2004 23:52:32, Nick Piggin wrote:
>
>>
>>Miquel van Smoorenburg wrote:
>>
>
...

>>>Note that this is not an issue of '2 processes writing to 1 file', really.
>>>It's one process and pdflush writing the same dirty pages of the same file.
>>>
>>pdflush is a process though, that's all that matters.
>>
>
>I understand that when the two processes are unrelated, the patch as I
>sent it will do the wrong thing.
>
>But the thing is, you get this:
>
>- "dd" process writes requests
>- pdflush triggers to write dirty pages
>- too many pages are dirty so "dd" blocks as well to write synchronously
>- "dd" process triggers "queue full" but gets marked as "batching" so
>  can continue (get_request)
>- pdflush tries to submit one bio and gets blocked (get_request_wait)
>- "dd" continues, but that one bio from pdflush remains stuck for a while
>
>

The batching logic can probably all be ripped out with per
process limits. It's too complex anyway really.

>That's stupid, that one bio from pdflush should really be allowed on
>the queue, since "dd" is adding requests from the same source to it
>anyway.
>
>

But the whole reason it is getting blocked in the first place
is because your controller is sucking up all your requests.
The whole problem is not a problem if you use properly sized
queues.

I'm a bit surprised that it wasn't working well with a controller
queue depth of 64 and 128 nr_requests. I'll give you a per process
request limit patch to try in a minute.

>Perhaps writes from pdflush should be handled differently to prevent
>this specific case ?
>
>Say, if pdflush adds request #128, don't mark it as batching, but
>let it block. The next process will be the one marked as batching
>and can continue. If pdflush tries to add a request > 128, allow it,
>but _then_ block it.
>
>Would something like that work ? Would it be a good idea to never mark
>a pdflush process as batching, or would that have a negative impact
>for some things ?
>
>

It's hard to know. Maybe a better solution would be to allow pdflush
to be exempt from the limits entirely as long as it tries not to write
to congested queues (which is what it does)...


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [linux-lvm] [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests)
  2004-02-19 23:53                       ` Miquel van Smoorenburg
@ 2004-02-20  1:12                         ` Nick Piggin
  -1 siblings, 0 replies; 59+ messages in thread
From: Nick Piggin @ 2004-02-19 20:16 UTC (permalink / raw)
  To: Miquel van Smoorenburg
  Cc: Jens Axboe, Andrew Morton, linux-lvm, linux-kernel, thornber

[-- Attachment #1: Type: text/plain, Size: 152 bytes --]

Hi Miquel,
Can you see if this patch helps you?

Even with this patch, it might still be a good idea to allow
pdflush to disregard the limits...

Nick


[-- Attachment #2: blk-per-process-rqlim.patch --]
[-- Type: text/plain, Size: 7564 bytes --]

 linux-2.6-npiggin/drivers/block/ll_rw_blk.c |  113 ++++++++--------------------
 linux-2.6-npiggin/include/linux/blkdev.h    |    8 -
 2 files changed, 36 insertions(+), 85 deletions(-)

diff -puN include/linux/blkdev.h~blk-per-process-rqlim include/linux/blkdev.h
--- linux-2.6/include/linux/blkdev.h~blk-per-process-rqlim	2004-02-20 09:53:22.000000000 +1100
+++ linux-2.6-npiggin/include/linux/blkdev.h	2004-02-20 10:12:04.000000000 +1100
@@ -25,6 +25,7 @@ struct request_pm_state;
 
 #define BLKDEV_MIN_RQ	4
 #define BLKDEV_MAX_RQ	128	/* Default maximum */
+#define BLKDEV_MIN_PROC_RQ	64
 
 /*
  * This is the per-process anticipatory I/O scheduler state.
@@ -61,11 +62,7 @@ struct io_context {
 	atomic_t refcount;
 	pid_t pid;
 
-	/*
-	 * For request batching
-	 */
-	unsigned long last_waited; /* Time last woken after wait for request */
-	int nr_batch_requests;     /* Number of requests left in the batch */
+	unsigned int nr_requests;	/* Outstanding requests */
 
 	struct as_io_context *aic;
 };
@@ -141,6 +138,7 @@ struct request {
 	int ref_count;
 	request_queue_t *q;
 	struct request_list *rl;
+	struct io_context *ioc;
 
 	struct completion *waiting;
 	void *special;
diff -puN drivers/block/ll_rw_blk.c~blk-per-process-rqlim drivers/block/ll_rw_blk.c
--- linux-2.6/drivers/block/ll_rw_blk.c~blk-per-process-rqlim	2004-02-20 09:53:25.000000000 +1100
+++ linux-2.6-npiggin/drivers/block/ll_rw_blk.c	2004-02-20 10:40:43.000000000 +1100
@@ -55,12 +55,6 @@ unsigned long blk_max_low_pfn, blk_max_p
 EXPORT_SYMBOL(blk_max_low_pfn);
 EXPORT_SYMBOL(blk_max_pfn);
 
-/* Amount of time in which a process may batch requests */
-#define BLK_BATCH_TIME	(HZ/50UL)
-
-/* Number of requests a "batching" process may submit */
-#define BLK_BATCH_REQ	32
-
 /*
  * Return the threshold (number of used requests) at which the queue is
  * considered to be congested.  It include a little hysteresis to keep the
@@ -1495,57 +1489,27 @@ static inline struct request *blk_alloc_
 }
 
 /*
- * ioc_batching returns true if the ioc is a valid batching request and
- * should be given priority access to a request.
- */
-static inline int ioc_batching(struct io_context *ioc)
-{
-	if (!ioc)
-		return 0;
-
-	/*
-	 * Make sure the process is able to allocate at least 1 request
-	 * even if the batch times out, otherwise we could theoretically
-	 * lose wakeups.
-	 */
-	return ioc->nr_batch_requests == BLK_BATCH_REQ ||
-		(ioc->nr_batch_requests > 0
-		&& time_before(jiffies, ioc->last_waited + BLK_BATCH_TIME));
-}
-
-/*
- * ioc_set_batching sets ioc to be a new "batcher" if it is not one. This
- * will cause the process to be a "batcher" on all queues in the system. This
- * is the behaviour we want though - once it gets a wakeup it should be given
- * a nice run.
- */
-void ioc_set_batching(struct io_context *ioc)
-{
-	if (!ioc || ioc_batching(ioc))
-		return;
-
-	ioc->nr_batch_requests = BLK_BATCH_REQ;
-	ioc->last_waited = jiffies;
-}
-
-/*
  * A request has just been released.  Account for it, update the full and
  * congestion status, wake up any waiters.   Called under q->queue_lock.
  */
-static void freed_request(request_queue_t *q, int rw)
+static void freed_request(request_queue_t *q, struct io_context *ioc, int rw)
 {
 	struct request_list *rl = &q->rq;
 
+	ioc->nr_requests--;
 	rl->count[rw]--;
 	if (rl->count[rw] < queue_congestion_off_threshold(q))
 		clear_queue_congested(q, rw);
 	if (rl->count[rw]+1 <= q->nr_requests) {
-		smp_mb();
-		if (waitqueue_active(&rl->wait[rw]))
-			wake_up(&rl->wait[rw]);
 		if (!waitqueue_active(&rl->wait[rw]))
 			blk_clear_queue_full(q, rw);
 	}
+
+	if (rl->count[rw]-31 <= q->nr_requests
+			|| ioc->nr_requests <= BLKDEV_MIN_PROC_RQ - 32) {
+		if (waitqueue_active(&rl->wait[rw]))
+			wake_up(&rl->wait[rw]);
+	}
 }
 
 #define blkdev_free_rq(list) list_entry((list)->next, struct request, queuelist)
@@ -1556,32 +1520,34 @@ static struct request *get_request(reque
 {
 	struct request *rq = NULL;
 	struct request_list *rl = &q->rq;
-	struct io_context *ioc = get_io_context(gfp_mask);
+	struct io_context *ioc;
 
 	spin_lock_irq(q->queue_lock);
 	if (rl->count[rw]+1 >= q->nr_requests) {
 		/*
 		 * The queue will fill after this allocation, so set it as
-		 * full, and mark this process as "batching". This process
-		 * will be allowed to complete a batch of requests, others
-		 * will be blocked.
+		 * full.
 		 */
-		if (!blk_queue_full(q, rw)) {
-			ioc_set_batching(ioc);
+		if (!blk_queue_full(q, rw))
 			blk_set_queue_full(q, rw);
-		}
 	}
 
+	ioc = get_io_context(gfp_mask);
+	if (unlikely(ioc == NULL))
+		goto out;
+
 	if (blk_queue_full(q, rw)
-			&& !ioc_batching(ioc) && !elv_may_queue(q, rw)) {
+			&& ioc->nr_requests >= BLKDEV_MIN_PROC_RQ
+			&& !elv_may_queue(q, rw)) {
 		/*
-		 * The queue is full and the allocating process is not a
-		 * "batcher", and not exempted by the IO scheduler
+		 * The queue is full and the allocating process is over
+		 * it's limit, and not exempted by the IO scheduler
 		 */
 		spin_unlock_irq(q->queue_lock);
-		goto out;
+		goto out_ioc;
 	}
 
+	ioc->nr_requests++;
 	rl->count[rw]++;
 	if (rl->count[rw] >= queue_congestion_on_threshold(q))
 		set_queue_congested(q, rw);
@@ -1597,14 +1563,11 @@ static struct request *get_request(reque
 		 * wait queue, but this is pretty rare.
 		 */
 		spin_lock_irq(q->queue_lock);
-		freed_request(q, rw);
+		freed_request(q, ioc, rw);
 		spin_unlock_irq(q->queue_lock);
-		goto out;
+		goto out_ioc;
 	}
 
-	if (ioc_batching(ioc))
-		ioc->nr_batch_requests--;
-	
 	INIT_LIST_HEAD(&rq->queuelist);
 
 	/*
@@ -1620,13 +1583,15 @@ static struct request *get_request(reque
 	rq->ref_count = 1;
 	rq->q = q;
 	rq->rl = rl;
+	rq->ioc = ioc;
 	rq->waiting = NULL;
 	rq->special = NULL;
 	rq->data = NULL;
 	rq->sense = NULL;
 
-out:
+out_ioc:
 	put_io_context(ioc);
+out:
 	return rq;
 }
 
@@ -1644,25 +1609,12 @@ static struct request *get_request_wait(
 		struct request_list *rl = &q->rq;
 
 		prepare_to_wait_exclusive(&rl->wait[rw], &wait,
-				TASK_UNINTERRUPTIBLE);
+						TASK_UNINTERRUPTIBLE);
 
 		rq = get_request(q, rw, GFP_NOIO);
-
-		if (!rq) {
-			struct io_context *ioc;
-
+		if (!rq)
 			io_schedule();
 
-			/*
-			 * After sleeping, we become a "batching" process and
-			 * will be able to allocate at least one request, and
-			 * up to a big batch of them for a small period time.
-			 * See ioc_batching, ioc_set_batching
-			 */
-			ioc = get_io_context(GFP_NOIO);
-			ioc_set_batching(ioc);
-			put_io_context(ioc);
-		}
 		finish_wait(&rl->wait[rw], &wait);
 	} while (!rq);
 
@@ -1854,6 +1806,7 @@ void __blk_put_request(request_queue_t *
 	 * it didn't come out of our reserved rq pools
 	 */
 	if (rl) {
+		struct io_context *ioc = req->ioc;
 		int rw = rq_data_dir(req);
 
 		elv_completed_request(q, req);
@@ -1861,7 +1814,8 @@ void __blk_put_request(request_queue_t *
 		BUG_ON(!list_empty(&req->queuelist));
 
 		blk_free_request(q, req);
-		freed_request(q, rw);
+		freed_request(q, ioc, rw);
+		put_io_context(ioc);
 	}
 }
 
@@ -2721,7 +2675,7 @@ int __init blk_dev_init(void)
  */
 void put_io_context(struct io_context *ioc)
 {
-	if (ioc == NULL)
+	if (unlikely(ioc == NULL))
 		return;
 
 	BUG_ON(atomic_read(&ioc->refcount) == 0);
@@ -2772,8 +2726,7 @@ struct io_context *get_io_context(int gf
 		if (ret) {
 			atomic_set(&ret->refcount, 1);
 			ret->pid = tsk->pid;
-			ret->last_waited = jiffies; /* doesn't matter... */
-			ret->nr_batch_requests = 0; /* because this is 0 */
+			ret->nr_requests = 0;
 			ret->aic = NULL;
 			tsk->io_context = ret;
 		}

_

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests)
@ 2004-02-20  1:12                         ` Nick Piggin
  0 siblings, 0 replies; 59+ messages in thread
From: Nick Piggin @ 2004-02-20  1:12 UTC (permalink / raw)
  To: Miquel van Smoorenburg
  Cc: Jens Axboe, Andrew Morton, linux-lvm, linux-kernel, thornber

[-- Attachment #1: Type: text/plain, Size: 152 bytes --]

Hi Miquel,
Can you see if this patch helps you?

Even with this patch, it might still be a good idea to allow
pdflush to disregard the limits...

Nick


[-- Attachment #2: blk-per-process-rqlim.patch --]
[-- Type: text/plain, Size: 7564 bytes --]

 linux-2.6-npiggin/drivers/block/ll_rw_blk.c |  113 ++++++++--------------------
 linux-2.6-npiggin/include/linux/blkdev.h    |    8 -
 2 files changed, 36 insertions(+), 85 deletions(-)

diff -puN include/linux/blkdev.h~blk-per-process-rqlim include/linux/blkdev.h
--- linux-2.6/include/linux/blkdev.h~blk-per-process-rqlim	2004-02-20 09:53:22.000000000 +1100
+++ linux-2.6-npiggin/include/linux/blkdev.h	2004-02-20 10:12:04.000000000 +1100
@@ -25,6 +25,7 @@ struct request_pm_state;
 
 #define BLKDEV_MIN_RQ	4
 #define BLKDEV_MAX_RQ	128	/* Default maximum */
+#define BLKDEV_MIN_PROC_RQ	64
 
 /*
  * This is the per-process anticipatory I/O scheduler state.
@@ -61,11 +62,7 @@ struct io_context {
 	atomic_t refcount;
 	pid_t pid;
 
-	/*
-	 * For request batching
-	 */
-	unsigned long last_waited; /* Time last woken after wait for request */
-	int nr_batch_requests;     /* Number of requests left in the batch */
+	unsigned int nr_requests;	/* Outstanding requests */
 
 	struct as_io_context *aic;
 };
@@ -141,6 +138,7 @@ struct request {
 	int ref_count;
 	request_queue_t *q;
 	struct request_list *rl;
+	struct io_context *ioc;
 
 	struct completion *waiting;
 	void *special;
diff -puN drivers/block/ll_rw_blk.c~blk-per-process-rqlim drivers/block/ll_rw_blk.c
--- linux-2.6/drivers/block/ll_rw_blk.c~blk-per-process-rqlim	2004-02-20 09:53:25.000000000 +1100
+++ linux-2.6-npiggin/drivers/block/ll_rw_blk.c	2004-02-20 10:40:43.000000000 +1100
@@ -55,12 +55,6 @@ unsigned long blk_max_low_pfn, blk_max_p
 EXPORT_SYMBOL(blk_max_low_pfn);
 EXPORT_SYMBOL(blk_max_pfn);
 
-/* Amount of time in which a process may batch requests */
-#define BLK_BATCH_TIME	(HZ/50UL)
-
-/* Number of requests a "batching" process may submit */
-#define BLK_BATCH_REQ	32
-
 /*
  * Return the threshold (number of used requests) at which the queue is
  * considered to be congested.  It include a little hysteresis to keep the
@@ -1495,57 +1489,27 @@ static inline struct request *blk_alloc_
 }
 
 /*
- * ioc_batching returns true if the ioc is a valid batching request and
- * should be given priority access to a request.
- */
-static inline int ioc_batching(struct io_context *ioc)
-{
-	if (!ioc)
-		return 0;
-
-	/*
-	 * Make sure the process is able to allocate at least 1 request
-	 * even if the batch times out, otherwise we could theoretically
-	 * lose wakeups.
-	 */
-	return ioc->nr_batch_requests == BLK_BATCH_REQ ||
-		(ioc->nr_batch_requests > 0
-		&& time_before(jiffies, ioc->last_waited + BLK_BATCH_TIME));
-}
-
-/*
- * ioc_set_batching sets ioc to be a new "batcher" if it is not one. This
- * will cause the process to be a "batcher" on all queues in the system. This
- * is the behaviour we want though - once it gets a wakeup it should be given
- * a nice run.
- */
-void ioc_set_batching(struct io_context *ioc)
-{
-	if (!ioc || ioc_batching(ioc))
-		return;
-
-	ioc->nr_batch_requests = BLK_BATCH_REQ;
-	ioc->last_waited = jiffies;
-}
-
-/*
  * A request has just been released.  Account for it, update the full and
  * congestion status, wake up any waiters.   Called under q->queue_lock.
  */
-static void freed_request(request_queue_t *q, int rw)
+static void freed_request(request_queue_t *q, struct io_context *ioc, int rw)
 {
 	struct request_list *rl = &q->rq;
 
+	ioc->nr_requests--;
 	rl->count[rw]--;
 	if (rl->count[rw] < queue_congestion_off_threshold(q))
 		clear_queue_congested(q, rw);
 	if (rl->count[rw]+1 <= q->nr_requests) {
-		smp_mb();
-		if (waitqueue_active(&rl->wait[rw]))
-			wake_up(&rl->wait[rw]);
 		if (!waitqueue_active(&rl->wait[rw]))
 			blk_clear_queue_full(q, rw);
 	}
+
+	if (rl->count[rw]-31 <= q->nr_requests
+			|| ioc->nr_requests <= BLKDEV_MIN_PROC_RQ - 32) {
+		if (waitqueue_active(&rl->wait[rw]))
+			wake_up(&rl->wait[rw]);
+	}
 }
 
 #define blkdev_free_rq(list) list_entry((list)->next, struct request, queuelist)
@@ -1556,32 +1520,34 @@ static struct request *get_request(reque
 {
 	struct request *rq = NULL;
 	struct request_list *rl = &q->rq;
-	struct io_context *ioc = get_io_context(gfp_mask);
+	struct io_context *ioc;
 
 	spin_lock_irq(q->queue_lock);
 	if (rl->count[rw]+1 >= q->nr_requests) {
 		/*
 		 * The queue will fill after this allocation, so set it as
-		 * full, and mark this process as "batching". This process
-		 * will be allowed to complete a batch of requests, others
-		 * will be blocked.
+		 * full.
 		 */
-		if (!blk_queue_full(q, rw)) {
-			ioc_set_batching(ioc);
+		if (!blk_queue_full(q, rw))
 			blk_set_queue_full(q, rw);
-		}
 	}
 
+	ioc = get_io_context(gfp_mask);
+	if (unlikely(ioc == NULL))
+		goto out;
+
 	if (blk_queue_full(q, rw)
-			&& !ioc_batching(ioc) && !elv_may_queue(q, rw)) {
+			&& ioc->nr_requests >= BLKDEV_MIN_PROC_RQ
+			&& !elv_may_queue(q, rw)) {
 		/*
-		 * The queue is full and the allocating process is not a
-		 * "batcher", and not exempted by the IO scheduler
+		 * The queue is full and the allocating process is over
+		 * it's limit, and not exempted by the IO scheduler
 		 */
 		spin_unlock_irq(q->queue_lock);
-		goto out;
+		goto out_ioc;
 	}
 
+	ioc->nr_requests++;
 	rl->count[rw]++;
 	if (rl->count[rw] >= queue_congestion_on_threshold(q))
 		set_queue_congested(q, rw);
@@ -1597,14 +1563,11 @@ static struct request *get_request(reque
 		 * wait queue, but this is pretty rare.
 		 */
 		spin_lock_irq(q->queue_lock);
-		freed_request(q, rw);
+		freed_request(q, ioc, rw);
 		spin_unlock_irq(q->queue_lock);
-		goto out;
+		goto out_ioc;
 	}
 
-	if (ioc_batching(ioc))
-		ioc->nr_batch_requests--;
-	
 	INIT_LIST_HEAD(&rq->queuelist);
 
 	/*
@@ -1620,13 +1583,15 @@ static struct request *get_request(reque
 	rq->ref_count = 1;
 	rq->q = q;
 	rq->rl = rl;
+	rq->ioc = ioc;
 	rq->waiting = NULL;
 	rq->special = NULL;
 	rq->data = NULL;
 	rq->sense = NULL;
 
-out:
+out_ioc:
 	put_io_context(ioc);
+out:
 	return rq;
 }
 
@@ -1644,25 +1609,12 @@ static struct request *get_request_wait(
 		struct request_list *rl = &q->rq;
 
 		prepare_to_wait_exclusive(&rl->wait[rw], &wait,
-				TASK_UNINTERRUPTIBLE);
+						TASK_UNINTERRUPTIBLE);
 
 		rq = get_request(q, rw, GFP_NOIO);
-
-		if (!rq) {
-			struct io_context *ioc;
-
+		if (!rq)
 			io_schedule();
 
-			/*
-			 * After sleeping, we become a "batching" process and
-			 * will be able to allocate at least one request, and
-			 * up to a big batch of them for a small period time.
-			 * See ioc_batching, ioc_set_batching
-			 */
-			ioc = get_io_context(GFP_NOIO);
-			ioc_set_batching(ioc);
-			put_io_context(ioc);
-		}
 		finish_wait(&rl->wait[rw], &wait);
 	} while (!rq);
 
@@ -1854,6 +1806,7 @@ void __blk_put_request(request_queue_t *
 	 * it didn't come out of our reserved rq pools
 	 */
 	if (rl) {
+		struct io_context *ioc = req->ioc;
 		int rw = rq_data_dir(req);
 
 		elv_completed_request(q, req);
@@ -1861,7 +1814,8 @@ void __blk_put_request(request_queue_t *
 		BUG_ON(!list_empty(&req->queuelist));
 
 		blk_free_request(q, req);
-		freed_request(q, rw);
+		freed_request(q, ioc, rw);
+		put_io_context(ioc);
 	}
 }
 
@@ -2721,7 +2675,7 @@ int __init blk_dev_init(void)
  */
 void put_io_context(struct io_context *ioc)
 {
-	if (ioc == NULL)
+	if (unlikely(ioc == NULL))
 		return;
 
 	BUG_ON(atomic_read(&ioc->refcount) == 0);
@@ -2772,8 +2726,7 @@ struct io_context *get_io_context(int gf
 		if (ret) {
 			atomic_set(&ret->refcount, 1);
 			ret->pid = tsk->pid;
-			ret->last_waited = jiffies; /* doesn't matter... */
-			ret->nr_batch_requests = 0; /* because this is 0 */
+			ret->nr_requests = 0;
 			ret->aic = NULL;
 			tsk->io_context = ret;
 		}

_

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [linux-lvm] Re: [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests)
  2004-02-20  1:12                         ` Nick Piggin
@ 2004-02-20  1:26                           ` Andrew Morton
  -1 siblings, 0 replies; 59+ messages in thread
From: Andrew Morton @ 2004-02-19 20:25 UTC (permalink / raw)
  To: Nick Piggin; +Cc: miquels, axboe, linux-lvm, linux-kernel, thornber

Nick Piggin <piggin@cyberone.com.au> wrote:
>
> Even with this patch, it might still be a good idea to allow
> pdflush to disregard the limits...

Has it been confirmed that pdflush is blocking in get_request_wait()?  I
guess that can happen very occasionally because we don't bother with any
locking around there but if it's happening a lot then something is bust.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests)
@ 2004-02-20  1:26                           ` Andrew Morton
  0 siblings, 0 replies; 59+ messages in thread
From: Andrew Morton @ 2004-02-20  1:26 UTC (permalink / raw)
  To: Nick Piggin; +Cc: miquels, axboe, linux-lvm, linux-kernel, thornber

Nick Piggin <piggin@cyberone.com.au> wrote:
>
> Even with this patch, it might still be a good idea to allow
> pdflush to disregard the limits...

Has it been confirmed that pdflush is blocking in get_request_wait()?  I
guess that can happen very occasionally because we don't bother with any
locking around there but if it's happening a lot then something is bust.


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [linux-lvm] Re: [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests)
  2004-02-20  1:26                           ` Andrew Morton
@ 2004-02-20  1:40                             ` Nick Piggin
  -1 siblings, 0 replies; 59+ messages in thread
From: Nick Piggin @ 2004-02-19 20:40 UTC (permalink / raw)
  To: Andrew Morton; +Cc: miquels, axboe, linux-lvm, linux-kernel, thornber


Andrew Morton wrote:

>Nick Piggin <piggin@cyberone.com.au> wrote:
>
>>Even with this patch, it might still be a good idea to allow
>>pdflush to disregard the limits...
>>
>
>Has it been confirmed that pdflush is blocking in get_request_wait()?  I
>guess that can happen very occasionally because we don't bother with any
>locking around there but if it's happening a lot then something is bust.
>
>

Miquel's analysis is pretty plausible, but I'm not sure if
he's confirmed it or not, Miquel? Even if it isn't happening
a lot, and something isn't bust it might be a good idea to
do this.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests)
@ 2004-02-20  1:40                             ` Nick Piggin
  0 siblings, 0 replies; 59+ messages in thread
From: Nick Piggin @ 2004-02-20  1:40 UTC (permalink / raw)
  To: Andrew Morton; +Cc: miquels, axboe, linux-lvm, linux-kernel, thornber



Andrew Morton wrote:

>Nick Piggin <piggin@cyberone.com.au> wrote:
>
>>Even with this patch, it might still be a good idea to allow
>>pdflush to disregard the limits...
>>
>
>Has it been confirmed that pdflush is blocking in get_request_wait()?  I
>guess that can happen very occasionally because we don't bother with any
>locking around there but if it's happening a lot then something is bust.
>
>

Miquel's analysis is pretty plausible, but I'm not sure if
he's confirmed it or not, Miquel? Even if it isn't happening
a lot, and something isn't bust it might be a good idea to
do this.



^ permalink raw reply	[flat|nested] 59+ messages in thread

* [linux-lvm] Re: [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests)
  2004-02-20  1:40                             ` Nick Piggin
@ 2004-02-20  2:32                               ` Andrew Morton
  -1 siblings, 0 replies; 59+ messages in thread
From: Andrew Morton @ 2004-02-19 21:32 UTC (permalink / raw)
  To: Nick Piggin; +Cc: miquels, axboe, linux-lvm, linux-kernel, thornber

Nick Piggin <piggin@cyberone.com.au> wrote:
>
> 
> 
> Andrew Morton wrote:
> 
> >Nick Piggin <piggin@cyberone.com.au> wrote:
> >
> >>Even with this patch, it might still be a good idea to allow
> >>pdflush to disregard the limits...
> >>
> >
> >Has it been confirmed that pdflush is blocking in get_request_wait()?  I
> >guess that can happen very occasionally because we don't bother with any
> >locking around there but if it's happening a lot then something is bust.
> >
> >
> 
> Miquel's analysis is pretty plausible, but I'm not sure if
> he's confirmed it or not, Miquel? Even if it isn't happening
> a lot, and something isn't bust it might be a good idea to
> do this.

Seems OK from a quick check.  pdflush will block in get_request_wait()
occasionally, but not at all often.  Perhaps we could move the
write_congested test into the mpage_writepages() inner loop but it hardly
seems worth the risk.

Maybe things are different on Miquel's clockwork controller.



 drivers/block/ll_rw_blk.c |    2 ++
 fs/fs-writeback.c         |    2 ++
 2 files changed, 4 insertions(+)

diff -puN drivers/block/ll_rw_blk.c~pdflush-blockage-check drivers/block/ll_rw_blk.c
--- 25/drivers/block/ll_rw_blk.c~pdflush-blockage-check	2004-02-19 18:16:33.000000000 -0800
+++ 25-akpm/drivers/block/ll_rw_blk.c	2004-02-19 18:16:33.000000000 -0800
@@ -1651,6 +1651,8 @@ static struct request *get_request_wait(
 		if (!rq) {
 			struct io_context *ioc;
 
+			WARN_ON(current_is_pdflush());
+
 			io_schedule();
 
 			/*
diff -puN fs/fs-writeback.c~pdflush-blockage-check fs/fs-writeback.c
--- 25/fs/fs-writeback.c~pdflush-blockage-check	2004-02-19 18:22:25.000000000 -0800
+++ 25-akpm/fs/fs-writeback.c	2004-02-19 18:22:43.000000000 -0800
@@ -279,6 +279,8 @@ sync_sb_inodes(struct super_block *sb, s
 		}
 
 		if (wbc->nonblocking && bdi_write_congested(bdi)) {
+			if (current_is_pdflush())
+				printk("saved pdflush\n");
 			wbc->encountered_congestion = 1;
 			if (sb != blockdev_superblock)
 				break;		/* Skip a congested fs */

_

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests)
@ 2004-02-20  2:32                               ` Andrew Morton
  0 siblings, 0 replies; 59+ messages in thread
From: Andrew Morton @ 2004-02-20  2:32 UTC (permalink / raw)
  To: Nick Piggin; +Cc: miquels, axboe, linux-lvm, linux-kernel, thornber

Nick Piggin <piggin@cyberone.com.au> wrote:
>
> 
> 
> Andrew Morton wrote:
> 
> >Nick Piggin <piggin@cyberone.com.au> wrote:
> >
> >>Even with this patch, it might still be a good idea to allow
> >>pdflush to disregard the limits...
> >>
> >
> >Has it been confirmed that pdflush is blocking in get_request_wait()?  I
> >guess that can happen very occasionally because we don't bother with any
> >locking around there but if it's happening a lot then something is bust.
> >
> >
> 
> Miquel's analysis is pretty plausible, but I'm not sure if
> he's confirmed it or not, Miquel? Even if it isn't happening
> a lot, and something isn't bust it might be a good idea to
> do this.

Seems OK from a quick check.  pdflush will block in get_request_wait()
occasionally, but not at all often.  Perhaps we could move the
write_congested test into the mpage_writepages() inner loop but it hardly
seems worth the risk.

Maybe things are different on Miquel's clockwork controller.



 drivers/block/ll_rw_blk.c |    2 ++
 fs/fs-writeback.c         |    2 ++
 2 files changed, 4 insertions(+)

diff -puN drivers/block/ll_rw_blk.c~pdflush-blockage-check drivers/block/ll_rw_blk.c
--- 25/drivers/block/ll_rw_blk.c~pdflush-blockage-check	2004-02-19 18:16:33.000000000 -0800
+++ 25-akpm/drivers/block/ll_rw_blk.c	2004-02-19 18:16:33.000000000 -0800
@@ -1651,6 +1651,8 @@ static struct request *get_request_wait(
 		if (!rq) {
 			struct io_context *ioc;
 
+			WARN_ON(current_is_pdflush());
+
 			io_schedule();
 
 			/*
diff -puN fs/fs-writeback.c~pdflush-blockage-check fs/fs-writeback.c
--- 25/fs/fs-writeback.c~pdflush-blockage-check	2004-02-19 18:22:25.000000000 -0800
+++ 25-akpm/fs/fs-writeback.c	2004-02-19 18:22:43.000000000 -0800
@@ -279,6 +279,8 @@ sync_sb_inodes(struct super_block *sb, s
 		}
 
 		if (wbc->nonblocking && bdi_write_congested(bdi)) {
+			if (current_is_pdflush())
+				printk("saved pdflush\n");
 			wbc->encountered_congestion = 1;
 			if (sb != blockdev_superblock)
 				break;		/* Skip a congested fs */

_


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH] bdi_congestion_funp (was: Re: [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests))
  2004-02-20  2:32                               ` Andrew Morton
@ 2004-02-23  8:41                                 ` Miquel van Smoorenburg
  -1 siblings, 0 replies; 59+ messages in thread
From: Miquel van Smoorenburg @ 2004-02-20 14:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nick Piggin, miquels, axboe, linux-lvm, linux-kernel, thornber

On 2004.02.20 03:32, Andrew Morton wrote:
> Nick Piggin <piggin@cyberone.com.au> wrote:
> >
> > 
> > 
> > Andrew Morton wrote:
> > 
> > >Nick Piggin <piggin@cyberone.com.au> wrote:
> > >
> > >>Even with this patch, it might still be a good idea to allow
> > >>pdflush to disregard the limits...
> > >>
> > >
> > >Has it been confirmed that pdflush is blocking in get_request_wait()?  I
> > >guess that can happen very occasionally because we don't bother with any
> > >locking around there but if it's happening a lot then something is bust.
> > >
> > >
> > 
> > Miquel's analysis is pretty plausible, but I'm not sure if
> > he's confirmed it or not, Miquel?

Yes, I've added current->comm to my debug printk's, and it's dd/pdflush
that both block in get_request_wait (alternating between the two).

> > Even if it isn't happening
> > a lot, and something isn't bust it might be a good idea to
> > do this.
> 
> Seems OK from a quick check.  pdflush will block in get_request_wait()
> occasionally, but not at all often.  Perhaps we could move the
> write_congested test into the mpage_writepages() inner loop but it hardly
> seems worth the risk.
> 
> Maybe things are different on Miquel's clockwork controller.

I haven't tested it yet because of the "This patch isn't actually so good"
comment, but I found another explanation.

>  drivers/block/ll_rw_blk.c |    2 ++
>  fs/fs-writeback.c         |    2 ++
>  2 files changed, 4 insertions(+)

*Lightbulb on* .. I just read fs-writeback.c. As I said, this happens with an
LVM device. Could it be that because LVM and the actual device have different struct
request_queue's things go awry ?

In fs-writeback.c, your're looking at the LVM device (and its request_queue, and
its backing_dev_info). In__make_request, you're looking at the SCSI device.

So that's why I saw "dd" and pdflush fighting over get_request. pdflush isn't
supposed to run when the blockdev is congested, but ofcourse the LVM device
is never marked as congested.

The same can happen with MD devices, I think. Here's another proof-of-concept
patch that fixes things for LVM. With this patch applied, I never see pdflush
running when the queue is congested. You'll probably mark it as "horrible",
but hey, it shows the problem clear enough..

This patch adds a function pointer to the backing_dev_info struct that can
be called to check on the congestion if it's anything other than a simple device.
Dm.c now sets this pointer to a function defined in dm-table.c that checks
the congestion bits on all devices part of the logical volume.

I'm not sure if it should be the other way around (ll_rw_blk setting the
congested bit in the LVM device instead). Also if any queue of any device
under LVM is congested, pdflush won't run on the whole device - but I think
that harms less than running on a device with a congested queue.

bdi-congestion-funp.patch

--- linux-2.6.3/include/linux/backing-dev.h.ORIG	2004-02-04 04:43:38.000000000 +0100
+++ linux-2.6.3/include/linux/backing-dev.h	2004-02-20 15:17:52.000000000 +0100
@@ -24,6 +24,8 @@
 	unsigned long ra_pages;	/* max readahead in PAGE_CACHE_SIZE units */
 	unsigned long state;	/* Always use atomic bitops on this */
 	int memory_backed;	/* Cannot clean pages with writepage */
+	int (*congested)(int, void *); /* Function pointer if device is md/dm */
+	void *aux;		/* Pointer to aux data for congested func */
 };
 
 extern struct backing_dev_info default_backing_dev_info;
@@ -34,11 +36,15 @@
 
 static inline int bdi_read_congested(struct backing_dev_info *bdi)
 {
+	if (bdi->congested)
+		return bdi->congested(BDI_read_congested, bdi->aux);
 	return test_bit(BDI_read_congested, &bdi->state);
 }
 
 static inline int bdi_write_congested(struct backing_dev_info *bdi)
 {
+	if (bdi->congested)
+		return bdi->congested(BDI_write_congested, bdi->aux);
 	return test_bit(BDI_write_congested, &bdi->state);
 }
 
--- linux-2.6.3/drivers/md/dm.h.ORIG	2004-02-20 15:15:16.000000000 +0100
+++ linux-2.6.3/drivers/md/dm.h	2004-02-20 15:12:30.000000000 +0100
@@ -115,6 +115,7 @@
 int dm_table_get_mode(struct dm_table *t);
 void dm_table_suspend_targets(struct dm_table *t);
 void dm_table_resume_targets(struct dm_table *t);
+int dm_table_any_congested(int bdi_state, void *aux);
 
 /*-----------------------------------------------------------------
  * A registry of target types.
--- linux-2.6.3/drivers/md/dm-table.c.ORIG	2004-02-04 04:44:59.000000000 +0100
+++ linux-2.6.3/drivers/md/dm-table.c	2004-02-20 15:14:35.000000000 +0100
@@ -857,6 +857,30 @@
 	}
 }
 
+/*
+ *	See if any device in the logical volume has a queue
+ *	that is congested - in that case, pdflush skips it.
+ */
+int dm_table_any_congested(int bdi_state, void *aux)
+{
+	struct mapped_device *md = aux;
+	struct dm_table *t;
+	struct list_head *d, *devices;
+	int r = 0;
+										
+	if ((t = dm_get_table(md)) == NULL)
+		return 0;
+										
+	devices = dm_table_get_devices(t);
+	for (d = devices->next; d != devices; d = d->next) {
+		struct dm_dev *dd = list_entry(d, struct dm_dev, list);
+		request_queue_t *q = bdev_get_queue(dd->bdev);
+		r |= test_bit(bdi_state, &(q->backing_dev_info.state));
+	}
+	dm_table_put(t);
+										
+	return r;
+}
 
 EXPORT_SYMBOL(dm_get_device);
 EXPORT_SYMBOL(dm_put_device);
--- linux-2.6.3/drivers/md/dm.c.ORIG	2004-02-20 15:15:02.000000000 +0100
+++ linux-2.6.3/drivers/md/dm.c	2004-02-20 15:14:51.000000000 +0100
@@ -608,6 +608,8 @@
 	}
 
 	md->queue->queuedata = md;
+	md->queue->backing_dev_info.congested = dm_table_any_congested;
+	md->queue->backing_dev_info.aux = md;
 	blk_queue_make_request(md->queue, dm_request);
 
 	md->io_pool = mempool_create(MIN_IOS, mempool_alloc_slab,

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [linux-lvm] [PATCH] bdi_congestion_funp (was: Re: [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests))
@ 2004-02-23  8:41                                 ` Miquel van Smoorenburg
  0 siblings, 0 replies; 59+ messages in thread
From: Miquel van Smoorenburg @ 2004-02-23  8:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nick Piggin, miquels, axboe, linux-lvm, linux-kernel, thornber

On 2004.02.20 03:32, Andrew Morton wrote:
> Nick Piggin <piggin@cyberone.com.au> wrote:
> >
> > 
> > 
> > Andrew Morton wrote:
> > 
> > >Nick Piggin <piggin@cyberone.com.au> wrote:
> > >
> > >>Even with this patch, it might still be a good idea to allow
> > >>pdflush to disregard the limits...
> > >>
> > >
> > >Has it been confirmed that pdflush is blocking in get_request_wait()?  I
> > >guess that can happen very occasionally because we don't bother with any
> > >locking around there but if it's happening a lot then something is bust.
> > >
> > >
> > 
> > Miquel's analysis is pretty plausible, but I'm not sure if
> > he's confirmed it or not, Miquel?

Yes, I've added current->comm to my debug printk's, and it's dd/pdflush
that both block in get_request_wait (alternating between the two).

> > Even if it isn't happening
> > a lot, and something isn't bust it might be a good idea to
> > do this.
> 
> Seems OK from a quick check.  pdflush will block in get_request_wait()
> occasionally, but not at all often.  Perhaps we could move the
> write_congested test into the mpage_writepages() inner loop but it hardly
> seems worth the risk.
> 
> Maybe things are different on Miquel's clockwork controller.

I haven't tested it yet because of the "This patch isn't actually so good"
comment, but I found another explanation.

>  drivers/block/ll_rw_blk.c |    2 ++
>  fs/fs-writeback.c         |    2 ++
>  2 files changed, 4 insertions(+)

*Lightbulb on* .. I just read fs-writeback.c. As I said, this happens with an
LVM device. Could it be that because LVM and the actual device have different struct
request_queue's things go awry ?

In fs-writeback.c, your're looking at the LVM device (and its request_queue, and
its backing_dev_info). In__make_request, you're looking at the SCSI device.

So that's why I saw "dd" and pdflush fighting over get_request. pdflush isn't
supposed to run when the blockdev is congested, but ofcourse the LVM device
is never marked as congested.

The same can happen with MD devices, I think. Here's another proof-of-concept
patch that fixes things for LVM. With this patch applied, I never see pdflush
running when the queue is congested. You'll probably mark it as "horrible",
but hey, it shows the problem clear enough..

This patch adds a function pointer to the backing_dev_info struct that can
be called to check on the congestion if it's anything other than a simple device.
Dm.c now sets this pointer to a function defined in dm-table.c that checks
the congestion bits on all devices part of the logical volume.

I'm not sure if it should be the other way around (ll_rw_blk setting the
congested bit in the LVM device instead). Also if any queue of any device
under LVM is congested, pdflush won't run on the whole device - but I think
that harms less than running on a device with a congested queue.

bdi-congestion-funp.patch

--- linux-2.6.3/include/linux/backing-dev.h.ORIG	2004-02-04 04:43:38.000000000 +0100
+++ linux-2.6.3/include/linux/backing-dev.h	2004-02-20 15:17:52.000000000 +0100
@@ -24,6 +24,8 @@
 	unsigned long ra_pages;	/* max readahead in PAGE_CACHE_SIZE units */
 	unsigned long state;	/* Always use atomic bitops on this */
 	int memory_backed;	/* Cannot clean pages with writepage */
+	int (*congested)(int, void *); /* Function pointer if device is md/dm */
+	void *aux;		/* Pointer to aux data for congested func */
 };
 
 extern struct backing_dev_info default_backing_dev_info;
@@ -34,11 +36,15 @@
 
 static inline int bdi_read_congested(struct backing_dev_info *bdi)
 {
+	if (bdi->congested)
+		return bdi->congested(BDI_read_congested, bdi->aux);
 	return test_bit(BDI_read_congested, &bdi->state);
 }
 
 static inline int bdi_write_congested(struct backing_dev_info *bdi)
 {
+	if (bdi->congested)
+		return bdi->congested(BDI_write_congested, bdi->aux);
 	return test_bit(BDI_write_congested, &bdi->state);
 }
 
--- linux-2.6.3/drivers/md/dm.h.ORIG	2004-02-20 15:15:16.000000000 +0100
+++ linux-2.6.3/drivers/md/dm.h	2004-02-20 15:12:30.000000000 +0100
@@ -115,6 +115,7 @@
 int dm_table_get_mode(struct dm_table *t);
 void dm_table_suspend_targets(struct dm_table *t);
 void dm_table_resume_targets(struct dm_table *t);
+int dm_table_any_congested(int bdi_state, void *aux);
 
 /*-----------------------------------------------------------------
  * A registry of target types.
--- linux-2.6.3/drivers/md/dm-table.c.ORIG	2004-02-04 04:44:59.000000000 +0100
+++ linux-2.6.3/drivers/md/dm-table.c	2004-02-20 15:14:35.000000000 +0100
@@ -857,6 +857,30 @@
 	}
 }
 
+/*
+ *	See if any device in the logical volume has a queue
+ *	that is congested - in that case, pdflush skips it.
+ */
+int dm_table_any_congested(int bdi_state, void *aux)
+{
+	struct mapped_device *md = aux;
+	struct dm_table *t;
+	struct list_head *d, *devices;
+	int r = 0;
+										
+	if ((t = dm_get_table(md)) == NULL)
+		return 0;
+										
+	devices = dm_table_get_devices(t);
+	for (d = devices->next; d != devices; d = d->next) {
+		struct dm_dev *dd = list_entry(d, struct dm_dev, list);
+		request_queue_t *q = bdev_get_queue(dd->bdev);
+		r |= test_bit(bdi_state, &(q->backing_dev_info.state));
+	}
+	dm_table_put(t);
+										
+	return r;
+}
 
 EXPORT_SYMBOL(dm_get_device);
 EXPORT_SYMBOL(dm_put_device);
--- linux-2.6.3/drivers/md/dm.c.ORIG	2004-02-20 15:15:02.000000000 +0100
+++ linux-2.6.3/drivers/md/dm.c	2004-02-20 15:14:51.000000000 +0100
@@ -608,6 +608,8 @@
 	}
 
 	md->queue->queuedata = md;
+	md->queue->backing_dev_info.congested = dm_table_any_congested;
+	md->queue->backing_dev_info.aux = md;
 	blk_queue_make_request(md->queue, dm_request);
 
 	md->io_pool = mempool_create(MIN_IOS, mempool_alloc_slab,

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [linux-lvm] Re: [PATCH] bdi_congestion_funp (was: Re: [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests))
  2004-02-23  8:41                                 ` [linux-lvm] " Miquel van Smoorenburg
@ 2004-02-20 14:59                                   ` Joe Thornber
  -1 siblings, 0 replies; 59+ messages in thread
From: Joe Thornber @ 2004-02-20  9:56 UTC (permalink / raw)
  To: Miquel van Smoorenburg
  Cc: Andrew Morton, Nick Piggin, miquels, axboe, linux-lvm,
	linux-kernel, thornber

On Fri, Feb 20, 2004 at 03:40:42PM +0100, Miquel van Smoorenburg wrote:
> --- linux-2.6.3/drivers/md/dm-table.c.ORIG	2004-02-04 04:44:59.000000000 +0100
> +++ linux-2.6.3/drivers/md/dm-table.c	2004-02-20 15:14:35.000000000 +0100

<snip>

> +	if ((t = dm_get_table(md)) == NULL)
> +		return 0;

struct mapped_device has no business in this file.  You should move
this function to dm.c, and provide accessor fns in dm-table.c.

> +	devices = dm_table_get_devices(t);
> +	for (d = devices->next; d != devices; d = d->next) {
> +		struct dm_dev *dd = list_entry(d, struct dm_dev, list);
> +		request_queue_t *q = bdev_get_queue(dd->bdev);
> +		r |= test_bit(bdi_state, &(q->backing_dev_info.state));

Shouldn't this be calling your bdi_*_congested function rather than
assuming it is a real device under dm ? (often not true).

I'm also very slightly worried that or'ing together the congestion
results for all the seperate devices isn't always the right thing.
These devices include anything that the targets are using, exception
stores for snapshots, logs for mirror, all paths for multipath (or'ing
is most likely to be wrong for multipath).

- Joe

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] bdi_congestion_funp (was: Re: [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests))
@ 2004-02-20 14:59                                   ` Joe Thornber
  0 siblings, 0 replies; 59+ messages in thread
From: Joe Thornber @ 2004-02-20 14:59 UTC (permalink / raw)
  To: Miquel van Smoorenburg
  Cc: Andrew Morton, Nick Piggin, miquels, axboe, linux-lvm,
	linux-kernel, thornber

On Fri, Feb 20, 2004 at 03:40:42PM +0100, Miquel van Smoorenburg wrote:
> --- linux-2.6.3/drivers/md/dm-table.c.ORIG	2004-02-04 04:44:59.000000000 +0100
> +++ linux-2.6.3/drivers/md/dm-table.c	2004-02-20 15:14:35.000000000 +0100

<snip>

> +	if ((t = dm_get_table(md)) == NULL)
> +		return 0;

struct mapped_device has no business in this file.  You should move
this function to dm.c, and provide accessor fns in dm-table.c.

> +	devices = dm_table_get_devices(t);
> +	for (d = devices->next; d != devices; d = d->next) {
> +		struct dm_dev *dd = list_entry(d, struct dm_dev, list);
> +		request_queue_t *q = bdev_get_queue(dd->bdev);
> +		r |= test_bit(bdi_state, &(q->backing_dev_info.state));

Shouldn't this be calling your bdi_*_congested function rather than
assuming it is a real device under dm ? (often not true).

I'm also very slightly worried that or'ing together the congestion
results for all the seperate devices isn't always the right thing.
These devices include anything that the targets are using, exception
stores for snapshots, logs for mirror, all paths for multipath (or'ing
is most likely to be wrong for multipath).

- Joe

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [linux-lvm] Re: [PATCH] bdi_congestion_funp (was: Re: [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests))
  2004-02-20 14:59                                   ` Joe Thornber
@ 2004-02-20 15:00                                     ` Jens Axboe
  -1 siblings, 0 replies; 59+ messages in thread
From: Jens Axboe @ 2004-02-20  9:59 UTC (permalink / raw)
  To: Joe Thornber
  Cc: Miquel van Smoorenburg, Andrew Morton, Nick Piggin, miquels,
	linux-lvm, linux-kernel

On Fri, Feb 20 2004, Joe Thornber wrote:
> > +	devices = dm_table_get_devices(t);
> > +	for (d = devices->next; d != devices; d = d->next) {
> > +		struct dm_dev *dd = list_entry(d, struct dm_dev, list);
> > +		request_queue_t *q = bdev_get_queue(dd->bdev);
> > +		r |= test_bit(bdi_state, &(q->backing_dev_info.state));
> 
> Shouldn't this be calling your bdi_*_congested function rather than
> assuming it is a real device under dm ? (often not true).
> 
> I'm also very slightly worried that or'ing together the congestion
> results for all the seperate devices isn't always the right thing.
> These devices include anything that the targets are using, exception
> stores for snapshots, logs for mirror, all paths for multipath (or'ing
> is most likely to be wrong for multipath).

Yeah the patch is pretty much crap in that area, I don't think Miquel
was aiming for inclusion :)

I'd suggest making queue functions for congestion state as well so it
stacks properly.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] bdi_congestion_funp (was: Re: [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests))
@ 2004-02-20 15:00                                     ` Jens Axboe
  0 siblings, 0 replies; 59+ messages in thread
From: Jens Axboe @ 2004-02-20 15:00 UTC (permalink / raw)
  To: Joe Thornber
  Cc: Miquel van Smoorenburg, Andrew Morton, Nick Piggin, miquels,
	linux-lvm, linux-kernel

On Fri, Feb 20 2004, Joe Thornber wrote:
> > +	devices = dm_table_get_devices(t);
> > +	for (d = devices->next; d != devices; d = d->next) {
> > +		struct dm_dev *dd = list_entry(d, struct dm_dev, list);
> > +		request_queue_t *q = bdev_get_queue(dd->bdev);
> > +		r |= test_bit(bdi_state, &(q->backing_dev_info.state));
> 
> Shouldn't this be calling your bdi_*_congested function rather than
> assuming it is a real device under dm ? (often not true).
> 
> I'm also very slightly worried that or'ing together the congestion
> results for all the seperate devices isn't always the right thing.
> These devices include anything that the targets are using, exception
> stores for snapshots, logs for mirror, all paths for multipath (or'ing
> is most likely to be wrong for multipath).

Yeah the patch is pretty much crap in that area, I don't think Miquel
was aiming for inclusion :)

I'd suggest making queue functions for congestion state as well so it
stacks properly.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] bdi_congestion_funp (was: Re: [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests))
  2004-02-20 15:00                                     ` Jens Axboe
@ 2004-02-23  8:45                                       ` Miquel van Smoorenburg
  -1 siblings, 0 replies; 59+ messages in thread
From: Miquel van Smoorenburg @ 2004-02-22 14:02 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Joe Thornber, Miquel van Smoorenburg, Andrew Morton, Nick Piggin,
	miquels, linux-lvm, linux-kernel

On Fri, 20 Feb 2004 16:00:13, Jens Axboe wrote:
> On Fri, Feb 20 2004, Joe Thornber wrote:
> > > +	devices = dm_table_get_devices(t);
> > > +	for (d = devices->next; d != devices; d = d->next) {
> > > +		struct dm_dev *dd = list_entry(d, struct dm_dev, list);
> > > +		request_queue_t *q = bdev_get_queue(dd->bdev);
> > > +		r |= test_bit(bdi_state, &(q->backing_dev_info.state));
> > 
> > Shouldn't this be calling your bdi_*_congested function rather than
> > assuming it is a real device under dm ? (often not true).
> > 
> > I'm also very slightly worried that or'ing together the congestion
> > results for all the seperate devices isn't always the right thing.
> > These devices include anything that the targets are using, exception
> > stores for snapshots, logs for mirror, all paths for multipath (or'ing
> > is most likely to be wrong for multipath).
> 
> Yeah the patch is pretty much crap in that area, I don't think Miquel
> was aiming for inclusion :)
> 
> I'd suggest making queue functions for congestion state as well so it
> stacks properly.

I've been looking at this. If you want to keep the struct backing_dev_info
reasonably stand-alone (no function pointer and aux data like I did) you
probably need to add a list of parent_queues to every request_list. If
the queue get congested, you mark the parent queue(s) congested as well.
Congested state would need to be changed into a counter instead of a
bitfield, I think.

The pdflush-is-running-on-this-queue bit can probably remain as-is. It's
mostly meant to prevent 2 pdflush daemons from running the same queue.
I don't see much harm in pdflush #1 running on /dev/md0 which consists
of /dev/sda1 and /dev/sdb1 and pdflush #2 running on /dev/sdb2, right ?

Would that be the way to go? If so, I'll take a stab at it.

Mike.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [linux-lvm] Re: [PATCH] bdi_congestion_funp (was: Re: [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests))
@ 2004-02-23  8:45                                       ` Miquel van Smoorenburg
  0 siblings, 0 replies; 59+ messages in thread
From: Miquel van Smoorenburg @ 2004-02-23  8:45 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Joe Thornber, Miquel van Smoorenburg, Andrew Morton, Nick Piggin,
	miquels, linux-lvm, linux-kernel

On Fri, 20 Feb 2004 16:00:13, Jens Axboe wrote:
> On Fri, Feb 20 2004, Joe Thornber wrote:
> > > +	devices = dm_table_get_devices(t);
> > > +	for (d = devices->next; d != devices; d = d->next) {
> > > +		struct dm_dev *dd = list_entry(d, struct dm_dev, list);
> > > +		request_queue_t *q = bdev_get_queue(dd->bdev);
> > > +		r |= test_bit(bdi_state, &(q->backing_dev_info.state));
> > 
> > Shouldn't this be calling your bdi_*_congested function rather than
> > assuming it is a real device under dm ? (often not true).
> > 
> > I'm also very slightly worried that or'ing together the congestion
> > results for all the seperate devices isn't always the right thing.
> > These devices include anything that the targets are using, exception
> > stores for snapshots, logs for mirror, all paths for multipath (or'ing
> > is most likely to be wrong for multipath).
> 
> Yeah the patch is pretty much crap in that area, I don't think Miquel
> was aiming for inclusion :)
> 
> I'd suggest making queue functions for congestion state as well so it
> stacks properly.

I've been looking at this. If you want to keep the struct backing_dev_info
reasonably stand-alone (no function pointer and aux data like I did) you
probably need to add a list of parent_queues to every request_list. If
the queue get congested, you mark the parent queue(s) congested as well.
Congested state would need to be changed into a counter instead of a
bitfield, I think.

The pdflush-is-running-on-this-queue bit can probably remain as-is. It's
mostly meant to prevent 2 pdflush daemons from running the same queue.
I don't see much harm in pdflush #1 running on /dev/md0 which consists
of /dev/sda1 and /dev/sdb1 and pdflush #2 running on /dev/sdb2, right ?

Would that be the way to go? If so, I'll take a stab at it.

Mike.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [linux-lvm] Re: [PATCH] bdi_congestion_funp (was: Re: [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests))
  2004-02-23  8:45                                       ` [linux-lvm] " Miquel van Smoorenburg
@ 2004-02-22 19:55                                         ` Andrew Morton
  -1 siblings, 0 replies; 59+ messages in thread
From: Andrew Morton @ 2004-02-22 14:55 UTC (permalink / raw)
  To: Miquel van Smoorenburg
  Cc: axboe, thornber, piggin, miquels, linux-lvm, linux-kernel

Miquel van Smoorenburg <miquels@cistron.net> wrote:
>
> The pdflush-is-running-on-this-queue bit can probably remain as-is. It's
>  mostly meant to prevent 2 pdflush daemons from running the same queue.
>  I don't see much harm in pdflush #1 running on /dev/md0 which consists
>  of /dev/sda1 and /dev/sdb1 and pdflush #2 running on /dev/sdb2, right ?

Yes, having two pdflushes working the same spindle is a bit pointless but
is presumably fairly harmless.

The most important thing here is to prevent pdflush from blocking on
request exhaustion.  Because while pdflush sleeps on a particular disk, all
the others remain unserviced.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] bdi_congestion_funp (was: Re: [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests))
@ 2004-02-22 19:55                                         ` Andrew Morton
  0 siblings, 0 replies; 59+ messages in thread
From: Andrew Morton @ 2004-02-22 19:55 UTC (permalink / raw)
  To: Miquel van Smoorenburg
  Cc: axboe, thornber, miquels, piggin, miquels, linux-lvm,
	linux-kernel

Miquel van Smoorenburg <miquels@cistron.net> wrote:
>
> The pdflush-is-running-on-this-queue bit can probably remain as-is. It's
>  mostly meant to prevent 2 pdflush daemons from running the same queue.
>  I don't see much harm in pdflush #1 running on /dev/md0 which consists
>  of /dev/sda1 and /dev/sdb1 and pdflush #2 running on /dev/sdb2, right ?

Yes, having two pdflushes working the same spindle is a bit pointless but
is presumably fairly harmless.

The most important thing here is to prevent pdflush from blocking on
request exhaustion.  Because while pdflush sleeps on a particular disk, all
the others remain unserviced.


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [linux-lvm] Re: [PATCH] bdi_congestion_funp (was: Re: [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests))
  2004-02-23  8:41                                 ` [linux-lvm] " Miquel van Smoorenburg
@ 2004-02-20 14:57                                   ` Jens Axboe
  -1 siblings, 0 replies; 59+ messages in thread
From: Jens Axboe @ 2004-02-20  9:57 UTC (permalink / raw)
  To: Miquel van Smoorenburg
  Cc: Andrew Morton, Nick Piggin, miquels, linux-lvm, linux-kernel,
	thornber

On Fri, Feb 20 2004, Miquel van Smoorenburg wrote:
> > > Even if it isn't happening
> > > a lot, and something isn't bust it might be a good idea to
> > > do this.
> > 
> > Seems OK from a quick check.  pdflush will block in get_request_wait()
> > occasionally, but not at all often.  Perhaps we could move the
> > write_congested test into the mpage_writepages() inner loop but it hardly
> > seems worth the risk.
> > 
> > Maybe things are different on Miquel's clockwork controller.
> 
> I haven't tested it yet because of the "This patch isn't actually so good"
> comment, but I found another explanation.
> 
> >  drivers/block/ll_rw_blk.c |    2 ++
> >  fs/fs-writeback.c         |    2 ++
> >  2 files changed, 4 insertions(+)
> 
> *Lightbulb on* .. I just read fs-writeback.c. As I said, this happens
> with an LVM device. Could it be that because LVM and the actual device
> have different struct request_queue's things go awry ?
> 
> In fs-writeback.c, your're looking at the LVM device (and its
> request_queue, and its backing_dev_info). In__make_request, you're
> looking at the SCSI device.

In principle, the lvm/md queues themselves will never be congested. But
the underlying queues can be, of course.

Now this approach is _much_ better, imo. I don't particularly care very
much for how you solved it, though, I'd much rather just see both
setting and testing passed down (and kill the ->aux as well).

Regardless of the initial hw depth vs block depth (which is also a
generic device problem, not just dm related), this would be a good
addition to the congestion logic.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] bdi_congestion_funp (was: Re: [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests))
@ 2004-02-20 14:57                                   ` Jens Axboe
  0 siblings, 0 replies; 59+ messages in thread
From: Jens Axboe @ 2004-02-20 14:57 UTC (permalink / raw)
  To: Miquel van Smoorenburg
  Cc: Andrew Morton, Nick Piggin, miquels, linux-lvm, linux-kernel,
	thornber

On Fri, Feb 20 2004, Miquel van Smoorenburg wrote:
> > > Even if it isn't happening
> > > a lot, and something isn't bust it might be a good idea to
> > > do this.
> > 
> > Seems OK from a quick check.  pdflush will block in get_request_wait()
> > occasionally, but not at all often.  Perhaps we could move the
> > write_congested test into the mpage_writepages() inner loop but it hardly
> > seems worth the risk.
> > 
> > Maybe things are different on Miquel's clockwork controller.
> 
> I haven't tested it yet because of the "This patch isn't actually so good"
> comment, but I found another explanation.
> 
> >  drivers/block/ll_rw_blk.c |    2 ++
> >  fs/fs-writeback.c         |    2 ++
> >  2 files changed, 4 insertions(+)
> 
> *Lightbulb on* .. I just read fs-writeback.c. As I said, this happens
> with an LVM device. Could it be that because LVM and the actual device
> have different struct request_queue's things go awry ?
> 
> In fs-writeback.c, your're looking at the LVM device (and its
> request_queue, and its backing_dev_info). In__make_request, you're
> looking at the SCSI device.

In principle, the lvm/md queues themselves will never be congested. But
the underlying queues can be, of course.

Now this approach is _much_ better, imo. I don't particularly care very
much for how you solved it, though, I'd much rather just see both
setting and testing passed down (and kill the ->aux as well).

Regardless of the initial hw depth vs block depth (which is also a
generic device problem, not just dm related), this would be a good
addition to the congestion logic.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [linux-lvm] Queue congestion: passing down vs passing up [PATCH]
  2004-02-20 14:57                                   ` Jens Axboe
  (?)
@ 2004-02-24 12:54                                   ` Miquel van Smoorenburg
  -1 siblings, 0 replies; 59+ messages in thread
From: Miquel van Smoorenburg @ 2004-02-24 12:54 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-lvm

[weeded out the enormous Cc: list]

On 2004.02.20 15:57, Jens Axboe wrote:
> On Fri, Feb 20 2004, Miquel van Smoorenburg wrote:
> > > > Even if it isn't happening
> > > > a lot, and something isn't bust it might be a good idea to
> > > > do this.
> > > 
> > > Seems OK from a quick check.  pdflush will block in get_request_wait()
> > > occasionally, but not at all often.  Perhaps we could move the
> > > write_congested test into the mpage_writepages() inner loop but it hardly
> > > seems worth the risk.
> > > 
> > > Maybe things are different on Miquel's clockwork controller.
> > 
> > I haven't tested it yet because of the "This patch isn't actually so good"
> > comment, but I found another explanation.
> > 
> > >  drivers/block/ll_rw_blk.c |    2 ++
> > >  fs/fs-writeback.c         |    2 ++
> > >  2 files changed, 4 insertions(+)
> > 
> > *Lightbulb on* .. I just read fs-writeback.c. As I said, this happens
> > with an LVM device. Could it be that because LVM and the actual device
> > have different struct request_queue's things go awry ?
> > 
> > In fs-writeback.c, your're looking at the LVM device (and its
> > request_queue, and its backing_dev_info). In__make_request, you're
> > looking at the SCSI device.
> 
> In principle, the lvm/md queues themselves will never be congested. But
> the underlying queues can be, of course.
> 
> Now this approach is _much_ better, imo. I don't particularly care very
> much for how you solved it, though, I'd much rather just see both
> setting and testing passed down (and kill the ->aux as well).

I just sent a patch to linux-lvm / Joe Thornber that does this. It adds a
void *request_queue to the struct backing_dev_info, and a congestion_fn
to the request_queue_t struct. bdi_{read,write}_congested got moved to
ll_rw_blk.c. dm.c sets the congestion_fn to an internal function that
checks the congestion of all the underlying block devices.

See https://www.redhat.com/archives/linux-lvm/2004-February/msg00203.html
for the actual patch.

So that's testing being passed down.

The setting you cannot pass down - you need to pass it "up". I implemented that
too, added a list of "parent queues" to the request_queue, and a function so
that for example dm can register its queue as "parent" of an underlying device.

It needs to be a list, since different partitions on one device can be members
of different meta-devices.

When the congestion functions in ll_rw_blk.c are called, if a queue is congested,
that info is also passed "up" to all of the "parent" queues. backing_dev_info
uses a counter instead of a bit to mark read/write as congested.
Perhaps it would be better to call a function (congestion_fn).

I think that we only need one of the two - either testing being passed down
or setting being passed up. The first one is the easiest. The second one
keeps the struct backing_dev_info clean (no request_queue pointer).

What would you prefer ?

For reference, a first cut at the "passing up queue congestion" is below.

Mike.

queue-parent.patch

--- linux-2.6.3.orig/include/linux/backing-dev.h	2004-02-04 04:43:38.000000000 +0100
+++ linux-2.6.3-queue_parent/include/linux/backing-dev.h	2004-02-23 16:17:53.000000000 +0100
@@ -24,6 +24,8 @@
 	unsigned long ra_pages;	/* max readahead in PAGE_CACHE_SIZE units */
 	unsigned long state;	/* Always use atomic bitops on this */
 	int memory_backed;	/* Cannot clean pages with writepage */
+	int read_congested;	/* Write queue or child write queue congested */
+	int write_congested;	/* Read queue or child read queue congested */
 };
 
 extern struct backing_dev_info default_backing_dev_info;
@@ -34,12 +36,12 @@
 
 static inline int bdi_read_congested(struct backing_dev_info *bdi)
 {
-	return test_bit(BDI_read_congested, &bdi->state);
+	return bdi->read_congested;
 }
 
 static inline int bdi_write_congested(struct backing_dev_info *bdi)
 {
-	return test_bit(BDI_write_congested, &bdi->state);
+	return bdi->write_congested;
 }
 
 #endif		/* _LINUX_BACKING_DEV_H */
--- linux-2.6.3.orig/include/linux/blkdev.h	2004-02-22 13:52:17.000000000 +0100
+++ linux-2.6.3-queue_parent/include/linux/blkdev.h	2004-02-24 15:02:49.000000000 +0100
@@ -267,6 +267,11 @@
 	atomic_t refcnt;		/* map can be shared */
 };
 
+struct rqueue_list {
+	struct list_head	rqueue_list_head;
+	struct request_queue	*rqueue;
+};
+
 struct request_queue
 {
 	/*
@@ -301,6 +306,8 @@
 
 	struct backing_dev_info	backing_dev_info;
 
+	struct list_head	parents;
+
 	/*
 	 * The queue owner gets to use this for whatever they like.
 	 * ll_rw_blk doesn't touch it.
@@ -514,6 +521,8 @@
 extern void __blk_stop_queue(request_queue_t *q);
 extern void blk_run_queue(request_queue_t *q);
 extern void blk_queue_activity_fn(request_queue_t *, activity_fn *, void *);
+extern int blk_queue_add_parent(request_queue_t *q, request_queue_t *parent);
+extern void blk_queue_remove_parent(request_queue_t *q, request_queue_t *parent);
 
 static inline request_queue_t *bdev_get_queue(struct block_device *bdev)
 {
--- linux-2.6.3.orig/drivers/block/ll_rw_blk.c	2004-02-04 04:43:10.000000000 +0100
+++ linux-2.6.3-queue_parent/drivers/block/ll_rw_blk.c	2004-02-24 16:50:20.000000000 +0100
@@ -93,9 +93,31 @@
 }
 
 /*
- * A queue has just exitted congestion.  Note this in the global counter of
- * congested queues, and wake up anyone who was waiting for requests to be
- * put back.
+ * Increment/decrement the congestion counter of this queue
+ * and recurse over the parent queues.
+ */
+static void set_queues_congested(request_queue_t *q, int rw, int how)
+{
+	struct rqueue_list *p;
+	struct list_head *l;
+
+	if (rw == WRITE)
+		q->backing_dev_info.write_congested += how;
+	else
+		q->backing_dev_info.read_congested += how;
+
+	__list_for_each(l, &q->parents) {
+		p = list_entry(l, struct rqueue_list, rqueue_list_head);
+		spin_lock(p->rqueue->queue_lock);
+		set_queues_congested(p->rqueue, rw, how);
+		spin_unlock(p->rqueue->queue_lock);
+	}
+}
+
+/*
+ * A queue has just exitted congestion. Flag that in the queue's
+ * VM-visible state flags, and wake up anyone who was waiting
+ * for requests to be put back.
  */
 static void clear_queue_congested(request_queue_t *q, int rw)
 {
@@ -103,23 +125,91 @@
 	wait_queue_head_t *wqh = &congestion_wqh[rw];
 
 	bit = (rw == WRITE) ? BDI_write_congested : BDI_read_congested;
-	clear_bit(bit, &q->backing_dev_info.state);
+	if (!test_and_clear_bit(bit, &q->backing_dev_info.state))
+		return;
+	set_queues_congested(q, rw, -1);
 	if (waitqueue_active(wqh))
 		wake_up(wqh);
 }
 
 /*
- * A queue has just entered congestion.  Flag that in the queue's VM-visible
- * state flags and increment the global gounter of congested queues.
+ * A queue has just entered congestion. Flag that in the queue's
+ * VM-visible state flags.
  */
 static void set_queue_congested(request_queue_t *q, int rw)
 {
 	enum bdi_state bit;
 
 	bit = (rw == WRITE) ? BDI_write_congested : BDI_read_congested;
-	set_bit(bit, &q->backing_dev_info.state);
+	if (test_and_set_bit(bit, &q->backing_dev_info.state))
+		return;
+	set_queues_congested(q, rw, 1);
+}
+
+/**
+ * blk_queue_add_parent - add a queue to the list of parents.
+ * @q:		this queue
+ * @parent:	parent queue
+ *
+ * Adds a queue to the list of parents. Locks both queues.
+ */
+int blk_queue_add_parent(request_queue_t *q, request_queue_t *parent)
+{
+	struct rqueue_list	*r;
+
+	r = kmalloc(sizeof(*r), GFP_KERNEL);
+	if (!r)
+		return -ENOMEM;
+	INIT_LIST_HEAD(&r->rqueue_list_head);
+	r->rqueue = parent;
+
+	spin_lock(q->queue_lock);
+	spin_lock(parent->queue_lock);
+	list_add_tail(&r->rqueue_list_head, &q->parents);
+	parent->backing_dev_info.read_congested +=
+		q->backing_dev_info.read_congested;
+	parent->backing_dev_info.write_congested +=
+		q->backing_dev_info.write_congested;
+	spin_unlock(parent->queue_lock);
+	spin_unlock(q->queue_lock);
+
+	return 0;
 }
 
+EXPORT_SYMBOL(blk_queue_add_parent);
+
+/**
+ * blk_queue_remove_parent - remove a queue from the list of parents.
+ * @q:		this queue
+ * @parent:	parent queue
+ *
+ * Removes a queue from the list of parents. Locks both queues.
+ */
+void blk_queue_remove_parent(request_queue_t *q, request_queue_t *parent)
+{
+	struct list_head	*l;
+	struct rqueue_list	*r;
+
+	spin_lock(q->queue_lock);
+	__list_for_each(l, &q->parents) {
+		r = list_entry(l, struct rqueue_list, rqueue_list_head);
+		if (r->rqueue == parent) {
+			spin_lock(parent->queue_lock);
+			parent->backing_dev_info.read_congested -=
+				q->backing_dev_info.read_congested;
+			parent->backing_dev_info.write_congested -=
+				q->backing_dev_info.write_congested;
+			spin_unlock(parent->queue_lock);
+			list_del(l);
+			kfree(r);
+			break;
+		}
+	}
+	spin_unlock(q->queue_lock);
+}
+
+EXPORT_SYMBOL(blk_queue_remove_parent);
+
 /**
  * blk_get_backing_dev_info - get the address of a queue's backing_dev_info
  * @bdev:	device
@@ -1372,6 +1462,7 @@
 	memset(q, 0, sizeof(*q));
 	init_timer(&q->unplug_timer);
 	atomic_set(&q->refcnt, 1);
+	INIT_LIST_HEAD(&q->parents);
 	return q;
 }
 
@@ -2813,8 +2904,11 @@
 queue_requests_store(struct request_queue *q, const char *page, size_t count)
 {
 	struct request_list *rl = &q->rq;
+	int ret;
+
+	spin_lock(q->queue_lock);
 
-	int ret = queue_var_store(&q->nr_requests, page, count);
+	ret = queue_var_store(&q->nr_requests, page, count);
 	if (q->nr_requests < BLKDEV_MIN_RQ)
 		q->nr_requests = BLKDEV_MIN_RQ;
 
@@ -2841,6 +2935,9 @@
 		blk_clear_queue_full(q, WRITE);
 		wake_up(&rl->wait[WRITE]);
 	}
+
+	spin_unlock(q->queue_lock);
+
 	return ret;
 }
 
--- linux-2.6.3.orig/drivers/md/dm.h	2004-02-04 04:43:45.000000000 +0100
+++ linux-2.6.3-queue_parent/drivers/md/dm.h	2004-02-23 21:32:40.000000000 +0100
@@ -110,6 +110,8 @@
 struct dm_target *dm_table_get_target(struct dm_table *t, unsigned int index);
 struct dm_target *dm_table_find_target(struct dm_table *t, sector_t sector);
 void dm_table_set_restrictions(struct dm_table *t, struct request_queue *q);
+void dm_table_add_parent(struct dm_table *t, struct request_queue *q);
+void dm_table_remove_parent(struct dm_table *t, struct request_queue *q);
 unsigned int dm_table_get_num_targets(struct dm_table *t);
 struct list_head *dm_table_get_devices(struct dm_table *t);
 int dm_table_get_mode(struct dm_table *t);
--- linux-2.6.3.orig/drivers/md/dm-table.c	2004-02-04 04:44:59.000000000 +0100
+++ linux-2.6.3-queue_parent/drivers/md/dm-table.c	2004-02-24 15:17:06.000000000 +0100
@@ -818,6 +818,30 @@
 	q->seg_boundary_mask = t->limits.seg_boundary_mask;
 }
 
+void dm_table_add_parent(struct dm_table *t, struct request_queue *q)
+{
+	struct list_head *d, *devices;
+
+	devices = dm_table_get_devices(t);
+	for (d = devices->next; d != devices; d = d->next) {
+		struct dm_dev *dd = list_entry(d, struct dm_dev, list);
+		request_queue_t *bq = bdev_get_queue(dd->bdev);
+		blk_queue_add_parent(bq, q);
+	}
+}
+
+void dm_table_remove_parent(struct dm_table *t, struct request_queue *q)
+{
+	struct list_head *d, *devices;
+
+	devices = dm_table_get_devices(t);
+	for (d = devices->next; d != devices; d = d->next) {
+		struct dm_dev *dd = list_entry(d, struct dm_dev, list);
+		request_queue_t *bq = bdev_get_queue(dd->bdev);
+		blk_queue_remove_parent(bq, q);
+	}
+}
+
 unsigned int dm_table_get_num_targets(struct dm_table *t)
 {
 	return t->num_targets;
--- linux-2.6.3.orig/drivers/md/dm.c	2004-02-22 13:52:15.000000000 +0100
+++ linux-2.6.3-queue_parent/drivers/md/dm.c	2004-02-24 16:35:32.000000000 +0100
@@ -41,6 +41,7 @@
 struct mapped_device {
 	struct rw_semaphore lock;
 	atomic_t holders;
+	spinlock_t queue_lock;
 
 	unsigned long flags;
 
@@ -600,7 +601,7 @@
 	memset(md, 0, sizeof(*md));
 	init_rwsem(&md->lock);
 	atomic_set(&md->holders, 1);
-
+	md->queue_lock = SPIN_LOCK_UNLOCKED;
 	md->queue = blk_alloc_queue(GFP_KERNEL);
 	if (!md->queue) {
 		kfree(md);
@@ -608,6 +609,7 @@
 	}
 
 	md->queue->queuedata = md;
+	md->queue->queue_lock = &md->queue_lock;
 	blk_queue_make_request(md->queue, dm_request);
 
 	md->io_pool = mempool_create(MIN_IOS, mempool_alloc_slab,
@@ -695,6 +697,7 @@
 
 	dm_table_get(t);
 	dm_table_set_restrictions(t, q);
+	dm_table_add_parent(t, q);
 	return 0;
 }
 
@@ -704,6 +707,7 @@
 		return;
 
 	dm_table_event_callback(md->map, NULL, NULL);
+	dm_table_remove_parent(md->map, md->queue);
 	dm_table_put(md->map);
 	md->map = NULL;
 }

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [linux-lvm] Re: [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests)
  2004-02-20  1:12                         ` Nick Piggin
@ 2004-02-20  1:45                           ` Nick Piggin
  -1 siblings, 0 replies; 59+ messages in thread
From: Nick Piggin @ 2004-02-19 20:52 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Miquel van Smoorenburg, Jens Axboe, Andrew Morton, linux-lvm,
	linux-kernel, thornber


Nick Piggin wrote:

> Hi Miquel,
> Can you see if this patch helps you?
>

This patch isn't actually so good because it doesn't wake
a specific process when it gets below it's limit...

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests)
@ 2004-02-20  1:45                           ` Nick Piggin
  0 siblings, 0 replies; 59+ messages in thread
From: Nick Piggin @ 2004-02-20  1:45 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Miquel van Smoorenburg, Jens Axboe, Andrew Morton, linux-lvm,
	linux-kernel, thornber



Nick Piggin wrote:

> Hi Miquel,
> Can you see if this patch helps you?
>

This patch isn't actually so good because it doesn't wake
a specific process when it gets below it's limit...


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: IO scheduler, queue depth, nr_requests
  2004-02-19  9:08           ` [linux-lvm] " Miquel van Smoorenburg
@ 2004-02-19  9:11             ` Nick Piggin
  -1 siblings, 0 replies; 59+ messages in thread
From: Nick Piggin @ 2004-02-19  2:51 UTC (permalink / raw)
  To: Miquel van Smoorenburg
  Cc: Andrew Morton, axboe, linux-lvm, linux-kernel, thornber

Miquel van Smoorenburg wrote:

>
>No, I'm actually referring to a struct request. I'm logging this in the
>SCSI layer, in scsi_request_fn(), just after elv_next_request(). I have
>in fact logged all the bio's submitted to __make_request, and the output
>of the elevator from elv_next_request(). The bio's are submitted sequentially,
>the resulting requests aren't. But this is because nr_requests is 128, while
>the 3ware device has a queue of 254 entries (no tagging though). Upping
>nr_requests to 512 makes this go away ..
>
>That shouldn't be necessary though. I only see this with LVM over 3ware-raid5,
>not on the 3ware-raid5 array directly (/dev/sda1). And it gets less troublesome
>with a lot of debugging (unless I set nr_requests lower again), which points
>to a timing issue.
>
>

So the problem you are seeing is due to "unlucky" timing between
two processes submitting IO. And the very efficient mechanisms
(merging, sorting) we have to improve situations exactly like this
is effectively disabled. And to make it worse, it appears that your
controller shits itself on this trivially simple pattern.

Your hack makes a baby step in the direction of per *process*
request limits, which I happen to be an advocate of. As it stands
though, I don't like it.

Jens has the final say when it comes to the block layer though.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [linux-lvm] Re: IO scheduler, queue depth, nr_requests
@ 2004-02-19  9:11             ` Nick Piggin
  0 siblings, 0 replies; 59+ messages in thread
From: Nick Piggin @ 2004-02-19  9:11 UTC (permalink / raw)
  To: Miquel van Smoorenburg
  Cc: Andrew Morton, axboe, linux-lvm, linux-kernel, thornber

Miquel van Smoorenburg wrote:

>
>No, I'm actually referring to a struct request. I'm logging this in the
>SCSI layer, in scsi_request_fn(), just after elv_next_request(). I have
>in fact logged all the bio's submitted to __make_request, and the output
>of the elevator from elv_next_request(). The bio's are submitted sequentially,
>the resulting requests aren't. But this is because nr_requests is 128, while
>the 3ware device has a queue of 254 entries (no tagging though). Upping
>nr_requests to 512 makes this go away ..
>
>That shouldn't be necessary though. I only see this with LVM over 3ware-raid5,
>not on the 3ware-raid5 array directly (/dev/sda1). And it gets less troublesome
>with a lot of debugging (unless I set nr_requests lower again), which points
>to a timing issue.
>
>

So the problem you are seeing is due to "unlucky" timing between
two processes submitting IO. And the very efficient mechanisms
(merging, sorting) we have to improve situations exactly like this
is effectively disabled. And to make it worse, it appears that your
controller shits itself on this trivially simple pattern.

Your hack makes a baby step in the direction of per *process*
request limits, which I happen to be an advocate of. As it stands
though, I don't like it.

Jens has the final say when it comes to the block layer though.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: IO scheduler, queue depth, nr_requests
  2004-02-19  9:11             ` [linux-lvm] " Nick Piggin
@ 2004-02-19 10:26               ` Jens Axboe
  -1 siblings, 0 replies; 59+ messages in thread
From: Jens Axboe @ 2004-02-19 10:21 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Miquel van Smoorenburg, Andrew Morton, linux-lvm, linux-kernel,
	thornber

On Thu, Feb 19 2004, Nick Piggin wrote:
> 
> 
> Miquel van Smoorenburg wrote:
> 
> >
> >No, I'm actually referring to a struct request. I'm logging this in the
> >SCSI layer, in scsi_request_fn(), just after elv_next_request(). I have
> >in fact logged all the bio's submitted to __make_request, and the output
> >of the elevator from elv_next_request(). The bio's are submitted 
> >sequentially,
> >the resulting requests aren't. But this is because nr_requests is 128, 
> >while
> >the 3ware device has a queue of 254 entries (no tagging though). Upping
> >nr_requests to 512 makes this go away ..
> >
> >That shouldn't be necessary though. I only see this with LVM over 
> >3ware-raid5,
> >not on the 3ware-raid5 array directly (/dev/sda1). And it gets less 
> >troublesome
> >with a lot of debugging (unless I set nr_requests lower again), which 
> >points
> >to a timing issue.
> >
> >
> 
> So the problem you are seeing is due to "unlucky" timing between
> two processes submitting IO. And the very efficient mechanisms
> (merging, sorting) we have to improve situations exactly like this
> is effectively disabled. And to make it worse, it appears that your
> controller shits itself on this trivially simple pattern.
> 
> Your hack makes a baby step in the direction of per *process*
> request limits, which I happen to be an advocate of. As it stands
> though, I don't like it.

I'm very much an advocate for per process request limits as well.  Would
be trivial to add... Miquels patch is horrible, I appreciate it being
posted as a cry for help.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [linux-lvm] Re: IO scheduler, queue depth, nr_requests
@ 2004-02-19 10:26               ` Jens Axboe
  0 siblings, 0 replies; 59+ messages in thread
From: Jens Axboe @ 2004-02-19 10:26 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Miquel van Smoorenburg, Andrew Morton, linux-lvm, linux-kernel,
	thornber

On Thu, Feb 19 2004, Nick Piggin wrote:
> 
> 
> Miquel van Smoorenburg wrote:
> 
> >
> >No, I'm actually referring to a struct request. I'm logging this in the
> >SCSI layer, in scsi_request_fn(), just after elv_next_request(). I have
> >in fact logged all the bio's submitted to __make_request, and the output
> >of the elevator from elv_next_request(). The bio's are submitted 
> >sequentially,
> >the resulting requests aren't. But this is because nr_requests is 128, 
> >while
> >the 3ware device has a queue of 254 entries (no tagging though). Upping
> >nr_requests to 512 makes this go away ..
> >
> >That shouldn't be necessary though. I only see this with LVM over 
> >3ware-raid5,
> >not on the 3ware-raid5 array directly (/dev/sda1). And it gets less 
> >troublesome
> >with a lot of debugging (unless I set nr_requests lower again), which 
> >points
> >to a timing issue.
> >
> >
> 
> So the problem you are seeing is due to "unlucky" timing between
> two processes submitting IO. And the very efficient mechanisms
> (merging, sorting) we have to improve situations exactly like this
> is effectively disabled. And to make it worse, it appears that your
> controller shits itself on this trivially simple pattern.
> 
> Your hack makes a baby step in the direction of per *process*
> request limits, which I happen to be an advocate of. As it stands
> though, I don't like it.

I'm very much an advocate for per process request limits as well.  Would
be trivial to add... Miquels patch is horrible, I appreciate it being
posted as a cry for help.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 59+ messages in thread

end of thread, other threads:[~2004-02-24 15:58 UTC | newest]

Thread overview: 59+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2004-02-16  9:11 [linux-lvm] IO scheduler, queue depth, nr_requests Miquel van Smoorenburg
2004-02-16  8:30 ` [linux-lvm] " Jens Axboe
2004-02-16  9:01   ` Joe Thornber
2004-02-16 19:32   ` Kevin P. Fleming
2004-02-17  1:46   ` Kevin P. Fleming
2004-02-18 10:29   ` Miquel van Smoorenburg
2004-02-18 23:52     ` Miquel van Smoorenburg
2004-02-19  8:51       ` [linux-lvm] " Miquel van Smoorenburg
2004-02-19  1:24       ` Nick Piggin
2004-02-19  8:51         ` [linux-lvm] " Nick Piggin
2004-02-19  1:52         ` Miquel van Smoorenburg
2004-02-19  9:00           ` [linux-lvm] " Miquel van Smoorenburg
2004-02-19  2:01           ` Nick Piggin
2004-02-19  9:00             ` [linux-lvm] " Nick Piggin
2004-02-19  1:26       ` Andrew Morton
2004-02-19  8:50         ` [linux-lvm] " Andrew Morton
2004-02-19  2:11         ` Miquel van Smoorenburg
2004-02-19  9:08           ` [linux-lvm] " Miquel van Smoorenburg
2004-02-19  2:26           ` Andrew Morton
2004-02-19  9:08             ` [linux-lvm] " Andrew Morton
2004-02-19 10:15             ` Miquel van Smoorenburg
2004-02-19 10:23               ` [linux-lvm] " Miquel van Smoorenburg
2004-02-19 10:19               ` Jens Axboe
2004-02-19 10:26                 ` [linux-lvm] " Jens Axboe
2004-02-19 15:58                 ` Miquel van Smoorenburg
2004-02-19 20:59                   ` Miquel van Smoorenburg
2004-02-19 17:52                   ` [linux-lvm] " Nick Piggin
2004-02-19 22:52                     ` Nick Piggin
2004-02-19 18:52                     ` [linux-lvm] " Miquel van Smoorenburg
2004-02-19 23:53                       ` Miquel van Smoorenburg
2004-02-19 19:15                       ` [linux-lvm] " Nick Piggin
2004-02-20  0:15                         ` Nick Piggin
2004-02-19 20:16                       ` [linux-lvm] [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests) Nick Piggin
2004-02-20  1:12                         ` Nick Piggin
2004-02-19 20:25                         ` [linux-lvm] " Andrew Morton
2004-02-20  1:26                           ` Andrew Morton
2004-02-19 20:40                           ` [linux-lvm] " Nick Piggin
2004-02-20  1:40                             ` Nick Piggin
2004-02-19 21:32                             ` [linux-lvm] " Andrew Morton
2004-02-20  2:32                               ` Andrew Morton
2004-02-20 14:40                               ` [PATCH] bdi_congestion_funp (was: Re: [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests)) Miquel van Smoorenburg
2004-02-23  8:41                                 ` [linux-lvm] " Miquel van Smoorenburg
2004-02-20  9:56                                 ` [linux-lvm] " Joe Thornber
2004-02-20 14:59                                   ` Joe Thornber
2004-02-20  9:59                                   ` [linux-lvm] " Jens Axboe
2004-02-20 15:00                                     ` Jens Axboe
2004-02-22 14:02                                     ` Miquel van Smoorenburg
2004-02-23  8:45                                       ` [linux-lvm] " Miquel van Smoorenburg
2004-02-22 14:55                                       ` Andrew Morton
2004-02-22 19:55                                         ` Andrew Morton
2004-02-20  9:57                                 ` [linux-lvm] " Jens Axboe
2004-02-20 14:57                                   ` Jens Axboe
2004-02-24 12:54                                   ` [linux-lvm] Queue congestion: passing down vs passing up [PATCH] Miquel van Smoorenburg
2004-02-19 20:52                         ` [linux-lvm] Re: [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests) Nick Piggin
2004-02-20  1:45                           ` Nick Piggin
2004-02-19  2:51           ` IO scheduler, queue depth, nr_requests Nick Piggin
2004-02-19  9:11             ` [linux-lvm] " Nick Piggin
2004-02-19 10:21             ` Jens Axboe
2004-02-19 10:26               ` [linux-lvm] " Jens Axboe

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.