* Re: IO scheduler, queue depth, nr_requests
  [not found] ` <1r6S4-6cv-1@gated-at.bofh.it>
@ 2004-02-25 20:17 ` Bill Davidsen
  2004-02-25 21:39   ` Miquel van Smoorenburg
  2004-02-26  0:39   ` Nick Piggin
  0 siblings, 2 replies; 18+ messages in thread
From: Bill Davidsen @ 2004-02-25 20:17 UTC (permalink / raw)
To: Nick Piggin; +Cc: Linux Kernel Mailing List

Nick Piggin wrote:
> But the whole reason it is getting blocked in the first place
> is because your controller is sucking up all your requests.
> The whole problem is not a problem if you use properly sized
> queues.
>
> I'm a bit surprised that it wasn't working well with a controller
> queue depth of 64 and 128 nr_requests. I'll give you a per process
> request limit patch to try in a minute.

And there's the rub... he did try what you are calling correctly sized
queues, and his patch works better. I'm all in favor of having the
theory and then writing the code, but when something works I would
rather understand why and modify the theory.

In other words, given a patch which does help performance in this case,
it would be good to understand why, instead of favoring a solution
which is better in theory, but which has been tried and found inferior
in performance.

I am NOT saying we should just block, effective as Miquel's patch
seems, just that we should understand why it works well instead of
saying it is in theory bad. I agree, but it works! Hopefully
per-process limits solve this, but they "in theory" could result in
blocking a process in an otherwise idle system. Unless I misread what
you mean of course. Processes which calculate for a while and write
results are not uncommon, and letting such a process write a whole
bunch of data and then go calculate while it is written is certainly
the way it should work. I'm unconvinced that per-process limits are the
whole answer without considering the entire io load on the system.

Feel free to tell me I'm misreading your intent (and why).
-- bill on the road <davidsen@tmr.com> ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: IO scheduler, queue depth, nr_requests
  2004-02-25 20:17 ` IO scheduler, queue depth, nr_requests Bill Davidsen
@ 2004-02-25 21:39   ` Miquel van Smoorenburg
  2004-02-26  0:39   ` Nick Piggin
  1 sibling, 0 replies; 18+ messages in thread
From: Miquel van Smoorenburg @ 2004-02-25 21:39 UTC (permalink / raw)
To: linux-kernel

In article <403D02E3.4070208@tmr.com>,
Bill Davidsen <davidsen@tmr.com> wrote:
>Nick Piggin wrote:
>
>> But the whole reason it is getting blocked in the first place
>> is because your controller is sucking up all your requests.
>> The whole problem is not a problem if you use properly sized
>> queues.
>>
>> I'm a bit surprised that it wasn't working well with a controller
>> queue depth of 64 and 128 nr_requests. I'll give you a per process
>> request limit patch to try in a minute.
>
>And there's the rub... he did try what you are calling correctly sized
>queues, and his patch works better. I'm all in favor of having the
>theory and then writing the code, but when something works I would
>rather understand why and modify the theory.

Ah, but we now do understand why - it was pdflush and a process
fighting over get_request on basically the same device, before and
after LVM remapping. See:

http://www.ussg.iu.edu/hypermail/linux/kernel/0402.2/1522.html

More complete solutions:

https://www.redhat.com/archives/linux-lvm/2004-February/msg00203.html
https://www.redhat.com/archives/linux-lvm/2004-February/msg00215.html

Mike.

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: IO scheduler, queue depth, nr_requests
  2004-02-25 20:17 ` IO scheduler, queue depth, nr_requests Bill Davidsen
  2004-02-25 21:39 ` Miquel van Smoorenburg
@ 2004-02-26  0:39   ` Nick Piggin
  1 sibling, 0 replies; 18+ messages in thread
From: Nick Piggin @ 2004-02-26 0:39 UTC (permalink / raw)
To: bill davidsen; +Cc: Linux Kernel Mailing List

Bill Davidsen wrote:

> Nick Piggin wrote:
>
>> But the whole reason it is getting blocked in the first place
>> is because your controller is sucking up all your requests.
>> The whole problem is not a problem if you use properly sized
>> queues.
>>
>> I'm a bit surprised that it wasn't working well with a controller
>> queue depth of 64 and 128 nr_requests. I'll give you a per process
>> request limit patch to try in a minute.
>
>
> And there's the rub... he did try what you are calling correctly sized
> queues, and his patch works better. I'm all in favor of having the
> theory and then writing the code, but when something works I would
> rather understand why and modify the theory.
>
> In other words, given a patch which does help performance in this
> case, it would be good to understand why, instead of favoring a
> solution which is better in theory, but which has been tried and found
> inferior in performance.
>
> I am NOT saying we should just block, effective as Miquel's patch
> seems, just that we should understand why it works well instead of
> saying it is in theory bad. I agree, but it works! Hopefully
> per-process limits solve this, but they "in theory" could result in
> blocking a process in an otherwise idle system. Unless I misread what
> you mean of course. Processes which calculate for a while and write
> results are not uncommon, and letting such a process write a whole
> bunch of data and then go calculate while it is written is certainly
> the way it should work. I'm unconvinced that per-process limits are
> the whole answer without considering the entire io load on the system.
>
> Feel free to tell me I'm misreading your intent (and why).

No, I know Miquel's patch works and his analysis of what is happening
is correct (he proved it). The patch is a bit of a hack to get around
the specific problem he is seeing which should never happen with an
appropriately sized queue, and it might actually hurt in some cases.

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: IO scheduler, queue depth, nr_requests
  [not found] ` <20040217145716.GE30438@traveler.cistron.net>
@ 2004-02-18 23:52 ` Miquel van Smoorenburg
  2004-02-19  1:24   ` Nick Piggin
  2004-02-19  1:26   ` Andrew Morton
  0 siblings, 2 replies; 18+ messages in thread
From: Miquel van Smoorenburg @ 2004-02-18 23:52 UTC (permalink / raw)
To: Jens Axboe; +Cc: Miquel van Smoorenburg, linux-lvm, linux-kernel, Joe Thornber

On Tue, 17 Feb 2004 15:57:16, Miquel van Smoorenburg wrote:
> For some reason, when using LVM, write requests get queued out
> of order to the 3ware controller, which results in quite a bit
> of seeking and thus performance loss.
[..]
> Okay I repeated some earlier tests, and I added some debug code in
> several places.
>
> I added logging to tw_scsi_queue() in the 3ware driver to log the
> start sector and length of each request. It logs something like:
> 3wdbg: id 119, lba = 0x2330bc33, num_sectors = 256
>
> With a perl script, I can check if the requests are sent to the
> host in order. That outputs something like this:
>
> Consecutive: start 1180906348, length 7936 sec (3968 KB), requests: 31
> Consecutive: start 1180906340, length 8 sec (4 KB), requests: 1
> Consecutive: start 1180914292, length 7936 sec (3968 KB), requests: 31
> Consecutive: start 1180914284, length 8 sec (4 KB), requests: 1
> Consecutive: start 1180922236, length 7936 sec (3968 KB), requests: 31
> Consecutive: start 1180922228, length 8 sec (4 KB), requests: 1
> Consecutive: start 1180930180, length 7936 sec (3968 KB), requests: 31
>
> See, 31 requests in order, then one request "backwards", then 31 in
> order, etc.

I found out what causes this. It's get_request_wait().

When the request queue is full, and a new request needs to be created,
__make_request() blocks in get_request_wait().

Another process wakes up first (pdflush / process submitting I/O itself /
xfsdatad / etc) and sends the next bio's to __make_request(). In the
mean time some free requests have become available, and the bios are
merged into a new request. Those requests are submitted to the device.

Then, get_request_wait() returns but the bio is not mergeable anymore -
and that results in a backwards seek, severely limiting the I/O rate.

Wouldn't it be better to allow the request allocation and queue the
request, and /then/ put the process to sleep ? The queue will grow
larger than nr_requests, but it does that anyway.

Mike.

^ permalink raw reply	[flat|nested] 18+ messages in thread
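[The perl script referred to above was not posted to the thread; the
following is a rough Python equivalent of the check it is described as
doing, with all names invented. Fed request (lba, num_sectors) pairs in
the order the driver logged them, it groups them into consecutive runs,
reproducing output in the shape quoted above.]

```python
def consecutive_runs(requests):
    """Group (lba, nr_sectors) pairs into runs where each request
    starts at the sector where the previous one ended."""
    runs = []
    for lba, nsect in requests:
        if runs and runs[-1]["next"] == lba:
            # Extends the current run: dispatched in order.
            runs[-1]["length"] += nsect
            runs[-1]["count"] += 1
            runs[-1]["next"] += nsect
        else:
            # Discontinuity: a new run starts (possibly backwards).
            runs.append({"start": lba, "length": nsect, "count": 1,
                         "next": lba + nsect})
    return runs

# The pattern from the log: 31 merged 256-sector requests, then one
# 8-sector (4 KB) request that lies *before* them on disk.
log = [(1180906348 + 256 * i, 256) for i in range(31)] + [(1180906340, 8)]
for r in consecutive_runs(log):
    print("Consecutive: start %d, length %d sec (%d KB), requests: %d"
          % (r["start"], r["length"], r["length"] // 2, r["count"]))
```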
* Re: IO scheduler, queue depth, nr_requests 2004-02-18 23:52 ` Miquel van Smoorenburg @ 2004-02-19 1:24 ` Nick Piggin 2004-02-19 1:52 ` Miquel van Smoorenburg 2004-02-19 1:26 ` Andrew Morton 1 sibling, 1 reply; 18+ messages in thread From: Nick Piggin @ 2004-02-19 1:24 UTC (permalink / raw) To: Miquel van Smoorenburg; +Cc: Jens Axboe, linux-lvm, linux-kernel, Joe Thornber Miquel van Smoorenburg wrote: >On Tue, 17 Feb 2004 15:57:16, Miquel van Smoorenburg wrote: > >>For some reason, when using LVM, write requests get queued out >>of order to the 3ware controller, which results in quite a bit >>of seeking and thus performance loss. >> >[..] > >>Okay I repeated some earlier tests, and I added some debug code in >>several places. >> >>I added logging to tw_scsi_queue() in the 3ware driver to log the >>start sector and length of each request. It logs something like: >>3wdbg: id 119, lba = 0x2330bc33, num_sectors = 256 >> >>With a perl script, I can check if the requests are sent to the >>host in order. That outputs something like this: >> >>Consecutive: start 1180906348, length 7936 sec (3968 KB), requests: 31 >>Consecutive: start 1180906340, length 8 sec (4 KB), requests: 1 >>Consecutive: start 1180914292, length 7936 sec (3968 KB), requests: 31 >>Consecutive: start 1180914284, length 8 sec (4 KB), requests: 1 >>Consecutive: start 1180922236, length 7936 sec (3968 KB), requests: 31 >>Consecutive: start 1180922228, length 8 sec (4 KB), requests: 1 >>Consecutive: start 1180930180, length 7936 sec (3968 KB), requests: 31 >> >>See, 31 requests in order, then one request "backwards", then 31 in order, etc. >> > >I found out what causes this. It's get_request_wait(). > >When the request queue is full, and a new request needs to be created, >__make_request() blocks in get_request_wait(). > >Another process wakes up first (pdflush / process submitting I/O itself / >xfsdatad / etc) and sends the next bio's to __make_request(). 
>In the mean time some free requests have become available, and the bios >are merged into a new request. Those requests are submitted to the device. > >Then, get_request_wait() returns but the bio is not mergeable anymore - >and that results in a backwards seek, severely limiting the I/O rate. > >Wouldn't it be better to allow the request allocation and queue the >request, and /then/ put the process to sleep ? The queue will grow larger >than nr_requests, but it does that anyway. > > The "batching" logic there should allow a process to submit a number of requests even above the nr_requests limit to prevent this interleave and context switching. Are you using tagged command queueing? What depth? ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: IO scheduler, queue depth, nr_requests
  2004-02-19  1:24 ` Nick Piggin
@ 2004-02-19  1:52   ` Miquel van Smoorenburg
  2004-02-19  2:01   ` Nick Piggin
  0 siblings, 1 reply; 18+ messages in thread
From: Miquel van Smoorenburg @ 2004-02-19 1:52 UTC (permalink / raw)
To: Nick Piggin
Cc: Miquel van Smoorenburg, Jens Axboe, linux-lvm, linux-kernel, Joe Thornber

On Thu, 19 Feb 2004 02:24:31, Nick Piggin wrote:
> Miquel van Smoorenburg wrote:
>
> >I found out what causes this. It's get_request_wait().
> >
> >When the request queue is full, and a new request needs to be created,
> >__make_request() blocks in get_request_wait().
> >
> >Another process wakes up first (pdflush / process submitting I/O itself /
> >xfsdatad / etc) and sends the next bio's to __make_request().
> >In the mean time some free requests have become available, and the bios
> >are merged into a new request. Those requests are submitted to the device.
> >
> >Then, get_request_wait() returns but the bio is not mergeable anymore -
> >and that results in a backwards seek, severely limiting the I/O rate.
>
> The "batching" logic there should allow a process to submit
> a number of requests even above the nr_requests limit to
> prevent this interleave and context switching.
>
> Are you using tagged command queueing? What depth?

No, I'm not using tagged command queueing. The 3ware controller is not
a real scsi controller, the driver just emulates one. It's a raid5
controller that drives SATA disks. It has an internal request queue
("can_queue") of 254 outstanding commands. Because that is way bigger
than nr_requests this happens - if I set nr_requests to 512, the
problem goes away. But that shouldn't happen ;)

I'm preparing a proof-of-concept patch now, if it works and I don't
wedge the remote machine I'm testing this on I'll post it in a few
minutes.

Mike.

^ permalink raw reply	[flat|nested] 18+ messages in thread
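[A toy model of the timing described above, not kernel code; the
function and its bookkeeping are invented for illustration. One bio's
submitter blocks in get_request_wait() on the full queue; the bios
behind it are merged into one request and dispatched first, because the
controller's 254-entry queue drains the block layer immediately; the
stuck bio reaches the controller afterwards, as a backwards seek. The
numbers follow the log excerpt earlier in the thread.]

```python
def controller_order(bios):
    """bios: (start_sector, nr_sectors) pairs, submitted in sector
    order. Returns requests in the order the controller sees them
    under the race: the first bio's submitter sleeps on a full queue,
    the rest merge into one request and are dispatched ahead of it."""
    stuck = bios[0]                    # submitter slept in get_request_wait()
    merged = bios[1:]                  # merged meanwhile, dispatched first
    big = (merged[0][0], sum(n for _, n in merged))
    return [big, stuck]                # stuck bio is issued late

# One 4 KB bio followed by 31 sequential 128 KB (256-sector) bios.
bios = [(1180906340, 8)] + [(1180906348 + 256 * i, 256) for i in range(31)]
order = controller_order(bios)
backwards = [b for a, b in zip(order, order[1:]) if b[0] < a[0]]
print(order)      # [(1180906348, 7936), (1180906340, 8)]
print(backwards)  # [(1180906340, 8)]: the 4 KB request is a backwards seek
```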
* Re: IO scheduler, queue depth, nr_requests 2004-02-19 1:52 ` Miquel van Smoorenburg @ 2004-02-19 2:01 ` Nick Piggin 0 siblings, 0 replies; 18+ messages in thread From: Nick Piggin @ 2004-02-19 2:01 UTC (permalink / raw) To: Miquel van Smoorenburg; +Cc: Jens Axboe, linux-lvm, linux-kernel, Joe Thornber Miquel van Smoorenburg wrote: >On Thu, 19 Feb 2004 02:24:31, Nick Piggin wrote: > >>Miquel van Smoorenburg wrote: >> >> >>>I found out what causes this. It's get_request_wait(). >>> >>>When the request queue is full, and a new request needs to be created, >>>__make_request() blocks in get_request_wait(). >>> >>>Another process wakes up first (pdflush / process submitting I/O itself / >>>xfsdatad / etc) and sends the next bio's to __make_request(). >>>In the mean time some free requests have become available, and the bios >>>are merged into a new request. Those requests are submitted to the device. >>> >>>Then, get_request_wait() returns but the bio is not mergeable anymore - >>>and that results in a backwards seek, severely limiting the I/O rate. >>> >>> >>The "batching" logic there should allow a process to submit >>a number of requests even above the nr_requests limit to >>prevent this interleave and context switching. >> >>Are you using tagged command queueing? What depth? >> > >No, I'm not using tagged command queueing. The 3ware controller is not a >real scsi controller, the driver just emulates one. It's a raid5 controller >that drives SATA disks. It has an internal request queue ("can_queu") >of 254 outstanding commands. > This is what I mean by tagged command queueing. > Because that is way bigger than nr_requests >this happens - if I set nr_requests to 512, the problem goes away. But >that shouldn't happen ;) > > What shouldn't happen? >I'm preparing a proof-of-concept patch now, if it works and I don't wedge >the remote machine I'm testing this on I'll post it in a few minutes. 
> > I'm not very happy with forcing a process to sleep _after_ it has submitted a request... but I'd be interested to see exactly what your patch does. By far the best option is to use appropriately sized queues. The below patch is a start, but it unfortunately doesn't help drivers which use private queueing implementations. http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.3/2.6.3-mm1/broken-out/scale-nr_requests.patch ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: IO scheduler, queue depth, nr_requests 2004-02-18 23:52 ` Miquel van Smoorenburg 2004-02-19 1:24 ` Nick Piggin @ 2004-02-19 1:26 ` Andrew Morton 2004-02-19 2:11 ` Miquel van Smoorenburg 1 sibling, 1 reply; 18+ messages in thread From: Andrew Morton @ 2004-02-19 1:26 UTC (permalink / raw) To: Miquel van Smoorenburg; +Cc: axboe, miquels, linux-lvm, linux-kernel, thornber Miquel van Smoorenburg <miquels@cistron.nl> wrote: > > On Tue, 17 Feb 2004 15:57:16, Miquel van Smoorenburg wrote: > > For some reason, when using LVM, write requests get queued out > > of order to the 3ware controller, which results in quite a bit > > of seeking and thus performance loss. > [..] > > Okay I repeated some earlier tests, and I added some debug code in > > several places. > > > > I added logging to tw_scsi_queue() in the 3ware driver to log the > > start sector and length of each request. It logs something like: > > 3wdbg: id 119, lba = 0x2330bc33, num_sectors = 256 > > > > With a perl script, I can check if the requests are sent to the > > host in order. That outputs something like this: > > > > Consecutive: start 1180906348, length 7936 sec (3968 KB), requests: 31 > > Consecutive: start 1180906340, length 8 sec (4 KB), requests: 1 > > Consecutive: start 1180914292, length 7936 sec (3968 KB), requests: 31 > > Consecutive: start 1180914284, length 8 sec (4 KB), requests: 1 > > Consecutive: start 1180922236, length 7936 sec (3968 KB), requests: 31 > > Consecutive: start 1180922228, length 8 sec (4 KB), requests: 1 > > Consecutive: start 1180930180, length 7936 sec (3968 KB), requests: 31 3968 / 31 = 128 exactly. I think when you say "requests: 31" you are actually referring to a single request which has 31 128k BIOs in it, yes? > > > > See, 31 requests in order, then one request "backwards", then 31 in order, etc. > > I found out what causes this. It's get_request_wait(). 
>
> When the request queue is full, and a new request needs to be created,
> __make_request() blocks in get_request_wait().
>
> Another process wakes up first (pdflush / process submitting I/O itself /
> xfsdatad / etc) and sends the next bio's to __make_request().
> In the mean time some free requests have become available, and the bios
> are merged into a new request. Those requests are submitted to the device.
>
> Then, get_request_wait() returns but the bio is not mergeable anymore -
> and that results in a backwards seek, severely limiting the I/O rate.
>
> Wouldn't it be better to allow the request allocation and queue the
> request, and /then/ put the process to sleep ? The queue will grow larger
> than nr_requests, but it does that anyway.

That would help, but is a bit kludgy. What _should_ have happened was:

a) The queue gets plugged

b) The maximal-sized 31-bio request is queued

c) The 4k request gets inserted *before* the 3968k request

d) The queue gets unplugged, and both requests stream smoothly.

(And this assumes that the queue was initially empty. Almost certainly
that was not the case).

So the question is, why did the little, later 4k request not get itself
inserted in front of the earlier, large 3968k request?

Are you using md as well? If so, try removing the blk_run_queues()
statements from md_thread() and md_do_sync(). And generally hunt around
and try nuking blk_run_queues() statements from the device driver/md/dm
layer - they may well be wrong.

Is this problem specific to md? Can it be reproduced on a
boring-old-disk?

Does elevator=deadline change anything?

^ permalink raw reply	[flat|nested] 18+ messages in thread
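[A sketch of steps a) through d) above, not the real elevator code; the
queue representation and function name are invented. While the queue is
plugged, a later request with a lower start sector is inserted ahead of
an already-queued higher one, so unplugging streams both in ascending
order.]

```python
import bisect

queue = []  # (start_sector, nr_sectors), kept sorted by start sector

def elv_add_request(start, nr_sectors):
    """Insert a request into the plugged queue in sector order."""
    bisect.insort(queue, (start, nr_sectors))

elv_add_request(1180906348, 7936)  # step b: the maximal 31-bio request
elv_add_request(1180906340, 8)     # step c: the 4k request slots in *front*
print(queue)                       # [(1180906340, 8), (1180906348, 7936)]
```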
* Re: IO scheduler, queue depth, nr_requests 2004-02-19 1:26 ` Andrew Morton @ 2004-02-19 2:11 ` Miquel van Smoorenburg 2004-02-19 2:26 ` Andrew Morton 2004-02-19 2:51 ` Nick Piggin 0 siblings, 2 replies; 18+ messages in thread From: Miquel van Smoorenburg @ 2004-02-19 2:11 UTC (permalink / raw) To: Andrew Morton Cc: Miquel van Smoorenburg, axboe, miquels, linux-lvm, linux-kernel, thornber On Thu, 19 Feb 2004 02:26:22, Andrew Morton wrote: > Miquel van Smoorenburg <miquels@cistron.nl> wrote: > > > > On Tue, 17 Feb 2004 15:57:16, Miquel van Smoorenburg wrote: > > > For some reason, when using LVM, write requests get queued out > > > of order to the 3ware controller, which results in quite a bit > > > of seeking and thus performance loss. > > [..] > > > Okay I repeated some earlier tests, and I added some debug code in > > > several places. > > > > > > I added logging to tw_scsi_queue() in the 3ware driver to log the > > > start sector and length of each request. It logs something like: > > > 3wdbg: id 119, lba = 0x2330bc33, num_sectors = 256 > > > > > > With a perl script, I can check if the requests are sent to the > > > host in order. That outputs something like this: > > > > > > Consecutive: start 1180906348, length 7936 sec (3968 KB), requests: 31 > > > Consecutive: start 1180906340, length 8 sec (4 KB), requests: 1 > > > Consecutive: start 1180914292, length 7936 sec (3968 KB), requests: 31 > > > Consecutive: start 1180914284, length 8 sec (4 KB), requests: 1 > > > Consecutive: start 1180922236, length 7936 sec (3968 KB), requests: 31 > > > Consecutive: start 1180922228, length 8 sec (4 KB), requests: 1 > > > Consecutive: start 1180930180, length 7936 sec (3968 KB), requests: 31 > > 3968 / 31 = 128 exactly. I think when you say "requests: 31" you are > actually referring to a single request which has 31 128k BIOs in it, yes? No, I'm actually referring to a struct request. I'm logging this in the SCSI layer, in scsi_request_fn(), just after elv_next_request(). 
I have in fact logged all the bio's submitted to __make_request, and the output of the elevator from elv_next_request(). The bio's are submitted sequentially, the resulting requests aren't. But this is because nr_requests is 128, while the 3ware device has a queue of 254 entries (no tagging though). Upping nr_requests to 512 makes this go away .. That shouldn't be necessary though. I only see this with LVM over 3ware-raid5, not on the 3ware-raid5 array directly (/dev/sda1). And it gets less troublesome with a lot of debugging (unless I set nr_requests lower again), which points to a timing issue. > > > See, 31 requests in order, then one request "backwards", then 31 in order, etc. > > > > I found out what causes this. It's get_request_wait(). > > > > When the request queue is full, and a new request needs to be created, > > __make_request() blocks in get_request_wait(). > > > > Another process wakes up first (pdflush / process submitting I/O itself / > > xfsdatad / etc) and sends the next bio's to __make_request(). > > In the mean time some free requests have become available, and the bios > > are merged into a new request. Those requests are submitted to the device. > > > > Then, get_request_wait() returns but the bio is not mergeable anymore - > > and that results in a backwards seek, severely limiting the I/O rate. > > > > Wouldn't it be better to allow the request allocation and queue the > > request, and /then/ put the process to sleep ? The queue will grow larger > > than nr_requests, but it does that anyway. > > That would help, but is a bit kludgy. Patch that does exactly this (it's probably butt-ugly, but it proves the point) below. > What _should_ have happened was: > > a) The queue gets plugged > > b) The maximal-sized 31-bio request is queued > > c) The 4k request gets inserted *before* the 3968k request > > d) The queue gets unplugged, and both requests stream smoothly. > > (And this assumes that the queue was initially empty. 
> Almost certainly that was not the case).

The thing is, the bio's are submitted perfectly sequentially. It's just
that every so often a request (with just its initial bio) gets stuck
for a while. Because the bio's after it are merged and sent to the
device, it's not possible to merge that stuck request later on when it
gets unstuck, because the other bio's have already left the building so
to speak.

I'm not using MD at all, and I tried all elevators. All the same
results, so I stayed with elevator=noop because it's so much easier to
understand ;)

This proof-of-concept patch fixes things: ll_rw_blk.c.patch

(Hmm, 03:11, time to go to bed. I'm happy I finally found this though!)

--- linux-2.6.3/drivers/block/ll_rw_blk.c.ORIG	2004-02-04 04:43:10.000000000 +0100
+++ linux-2.6.3/drivers/block/ll_rw_blk.c	2004-02-19 02:51:06.000000000 +0100
@@ -1545,7 +1545,8 @@
 /*
  * Get a free request, queue_lock must not be held
  */
-static struct request *get_request(request_queue_t *q, int rw, int gfp_mask)
+static struct request *get_request(request_queue_t *q, int rw,
+				   int gfp_mask, int *qfull)
 {
 	struct request *rq = NULL;
 	struct request_list *rl = &q->rq;
@@ -1569,10 +1570,15 @@
 	    && !ioc_batching(ioc) && !elv_may_queue(q, rw)) {
 		/*
 		 * The queue is full and the allocating process is not a
-		 * "batcher", and not exempted by the IO scheduler
+		 * "batcher", and not exempted by the IO scheduler.
+		 * If "qfull" is a valid pointer, set it to 1 to return
+		 * this info to the caller so that it can sleep.
 		 */
-		spin_unlock_irq(q->queue_lock);
-		goto out;
+		if (qfull == NULL) {
+			spin_unlock_irq(q->queue_lock);
+			goto out;
+		} else
+			*qfull = 1;
 	}

 	rl->count[rw]++;
@@ -1627,7 +1633,7 @@
  * No available requests for this queue, unplug the device and wait for some
  * requests to become available.
  */
-static struct request *get_request_wait(request_queue_t *q, int rw)
+static struct request *get_request_wait(request_queue_t *q, int rw, int wantreq)
 {
 	DEFINE_WAIT(wait);
 	struct request *rq;
@@ -1639,7 +1645,7 @@
 		prepare_to_wait_exclusive(&rl->wait[rw], &wait,
 				TASK_UNINTERRUPTIBLE);

-		rq = get_request(q, rw, GFP_NOIO);
+		rq = wantreq ? get_request(q, rw, GFP_NOIO, NULL) : NULL;

 		if (!rq) {
 			struct io_context *ioc;
@@ -1657,7 +1663,7 @@
 			put_io_context(ioc);
 		}
 		finish_wait(&rl->wait[rw], &wait);
-	} while (!rq);
+	} while (!rq && wantreq);

 	return rq;
 }
@@ -1669,9 +1675,9 @@
 	BUG_ON(rw != READ && rw != WRITE);

 	if (gfp_mask & __GFP_WAIT)
-		rq = get_request_wait(q, rw);
+		rq = get_request_wait(q, rw, 1);
 	else
-		rq = get_request(q, rw, gfp_mask);
+		rq = get_request(q, rw, gfp_mask, NULL);

 	return rq;
 }
@@ -2002,6 +2008,7 @@
 	struct request *req, *freereq = NULL;
 	int el_ret, rw, nr_sectors, cur_nr_sectors, barrier, ra;
 	sector_t sector;
+	int qfull;

 	sector = bio->bi_sector;
 	nr_sectors = bio_sectors(bio);
@@ -2023,6 +2030,7 @@
 	ra = bio->bi_rw & (1 << BIO_RW_AHEAD);

 again:
+	qfull = 0;
 	spin_lock_irq(q->queue_lock);

 	if (elv_queue_empty(q)) {
@@ -2096,14 +2104,15 @@
 			freereq = NULL;
 		} else {
 			spin_unlock_irq(q->queue_lock);
-			if ((freereq = get_request(q, rw, GFP_ATOMIC)) == NULL) {
+			freereq = get_request(q, rw, GFP_ATOMIC, ra ? NULL : &qfull);
+			if (freereq == NULL) {
 				/*
 				 * READA bit set
 				 */
 				if (ra)
 					goto end_io;

-				freereq = get_request_wait(q, rw);
+				freereq = get_request_wait(q, rw, 1);
 			}
 			goto again;
 		}
@@ -2141,6 +2150,13 @@
 	req->start_time = jiffies;

 	add_request(q, req);
+
+	if (qfull) {
+		spin_unlock_irq(q->queue_lock);
+		get_request_wait(q, rw, 0);
+		spin_lock_irq(q->queue_lock);
+	}
+
 out:
 	if (freereq)
 		__blk_put_request(q, freereq);

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: IO scheduler, queue depth, nr_requests 2004-02-19 2:11 ` Miquel van Smoorenburg @ 2004-02-19 2:26 ` Andrew Morton 2004-02-19 10:15 ` Miquel van Smoorenburg 2004-02-19 2:51 ` Nick Piggin 1 sibling, 1 reply; 18+ messages in thread From: Andrew Morton @ 2004-02-19 2:26 UTC (permalink / raw) To: Miquel van Smoorenburg; +Cc: miquels, axboe, linux-lvm, linux-kernel, thornber Miquel van Smoorenburg <miquels@cistron.nl> wrote: > > The thing is, the bio's are submitted perfectly sequentially. It's just that > every so often a request (with just its initial bio) gets stuck for a while. > Because the bio's after it are merged and sent to the device, it's not > possible to merge that stuck request later on when it gets unstuck, because > the other bio's have already left the building so to speak. Oh. So the raid controller's queue depth is larger than the kernel's. So everything gets immediately shovelled into the device and the kernel is left with nothing to merge the little request against. Shouldn't the controller itself be performing the insertion? ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: IO scheduler, queue depth, nr_requests 2004-02-19 2:26 ` Andrew Morton @ 2004-02-19 10:15 ` Miquel van Smoorenburg 2004-02-19 10:19 ` Jens Axboe 0 siblings, 1 reply; 18+ messages in thread From: Miquel van Smoorenburg @ 2004-02-19 10:15 UTC (permalink / raw) To: Andrew Morton Cc: Miquel van Smoorenburg, miquels, axboe, linux-lvm, linux-kernel, thornber On Thu, 19 Feb 2004 03:26:28, Andrew Morton wrote: > Miquel van Smoorenburg <miquels@cistron.nl> wrote: > > > > The thing is, the bio's are submitted perfectly sequentially. It's just that > > every so often a request (with just its initial bio) gets stuck for a while. > > Because the bio's after it are merged and sent to the device, it's not > > possible to merge that stuck request later on when it gets unstuck, because > > the other bio's have already left the building so to speak. > > Oh. So the raid controller's queue depth is larger than the kernel's. So > everything gets immediately shovelled into the device and the kernel is > left with nothing to merge the little request against. Well, the request queue of the kernel is max'ed out too, otherwise get_request_wait() wouldn't be called. It's just an unfortunate timing issue. > Shouldn't the controller itself be performing the insertion? Well, you would indeed expect the 3ware hardware to be smarter than that, but in its defence, the driver doesn't set sdev->simple_tags or sdev->ordered_tags at all. It just has a large queue on the host, in hardware. Perhaps this info should be exported into the request queue of the device, so that ll_rw_blk knows about this and can do something similar to the hack I posted ? Note that AFAICS nothing in drivers/scsi uses the tagging stuff in ll_rw_blk.c. blk_queue_init_tags() is only called by scsi_activate_tcq(), and nothing ever calls that (except the 53c700.c driver). So you can't just check for QUEUE_FLAG_QUEUED. Hmm, nothing in drivers/block calls it either. It's not being used at all yet ? Or am I being dense ? Mike. 
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: IO scheduler, queue depth, nr_requests 2004-02-19 10:15 ` Miquel van Smoorenburg @ 2004-02-19 10:19 ` Jens Axboe 2004-02-19 20:59 ` Miquel van Smoorenburg 0 siblings, 1 reply; 18+ messages in thread From: Jens Axboe @ 2004-02-19 10:19 UTC (permalink / raw) To: Miquel van Smoorenburg; +Cc: Andrew Morton, linux-lvm, linux-kernel, thornber On Thu, Feb 19 2004, Miquel van Smoorenburg wrote: > On Thu, 19 Feb 2004 03:26:28, Andrew Morton wrote: > > Miquel van Smoorenburg <miquels@cistron.nl> wrote: > > > > > > The thing is, the bio's are submitted perfectly sequentially. It's just that > > > every so often a request (with just its initial bio) gets stuck for a while. > > > Because the bio's after it are merged and sent to the device, it's not > > > possible to merge that stuck request later on when it gets unstuck, because > > > the other bio's have already left the building so to speak. > > > > Oh. So the raid controller's queue depth is larger than the kernel's. So > > everything gets immediately shovelled into the device and the kernel is > > left with nothing to merge the little request against. > > Well, the request queue of the kernel is max'ed out too, otherwise > get_request_wait() wouldn't be called. It's just an unfortunate timing > issue. Indeed > > Shouldn't the controller itself be performing the insertion? > > Well, you would indeed expect the 3ware hardware to be smarter than > that, but in its defence, the driver doesn't set sdev->simple_tags or > sdev->ordered_tags at all. It just has a large queue on the host, in > hardware. A too large queue. IMHO the simple and correct solution to your problem is to diminish the host queue (sane solution), or bump the block layer queue size (dumb solution). > Perhaps this info should be exported into the request queue of the device, > so that ll_rw_blk knows about this and can do something similar to the > hack I posted ? > > Note that AFAICS nothing in drivers/scsi uses the tagging stuff in > ll_rw_blk.c. 
blk_queue_init_tags() is only called by > scsi_activate_tcq(), and nothing ever calls that (except the 53c700.c > driver). So you can't just check for QUEUE_FLAG_QUEUED. Hmm, nothing > in drivers/block calls it either. It's not being used at all yet ? Or > am I being dense ? No you are correct, I already outlined that to you explicitly in the very first mail in this thread. Hopefully this will change with 2.7 so we have some block layer control over tagging in general. -- Jens Axboe ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: IO scheduler, queue depth, nr_requests
  2004-02-19 10:19                   ` Jens Axboe
@ 2004-02-19 20:59                     ` Miquel van Smoorenburg
  2004-02-19 22:52                       ` Nick Piggin
  0 siblings, 1 reply; 18+ messages in thread

From: Miquel van Smoorenburg @ 2004-02-19 20:59 UTC (permalink / raw)
To: Jens Axboe
Cc: Miquel van Smoorenburg, Andrew Morton, linux-lvm, linux-kernel, thornber

On Thu, 19 Feb 2004 11:19:15, Jens Axboe wrote:
> On Thu, Feb 19 2004, Miquel van Smoorenburg wrote:
> > > Shouldn't the controller itself be performing the insertion?
> >
> > Well, you would indeed expect the 3ware hardware to be smarter than
> > that, but in its defence, the driver doesn't set sdev->simple_tags or
> > sdev->ordered_tags at all. It just has a large queue on the host, in
> > hardware.
>
> A too large queue. IMHO the simple and correct solution to your problem
> is to diminish the host queue (sane solution), or bump the block layer
> queue size (dumb solution).

Well, I did that. Lowering the queue size of the 3ware controller to 64
does help a bit, but performance is still not optimal - leaving it at 254
and increasing the nr_requests of the queue to 512 helps the most.

But the patch I posted does just as well, without any tuning. I changed
it a little though - it only has the "new" behaviour (instead of blocking
on allocating a request, allocate it, queue it, _then_ block) for WRITEs.
That results in the best performance I've seen, by far.

Now the style of my patch might be ugly, but what is conceptually wrong
with allocating the request and queueing it, then block if the queue is
full, versus blocking on allocating the request and keeping a bio
"stuck" for quite some time, resulting in out-of-order requests to the
hardware ?

Note that this is not an issue of '2 processes writing to 1 file', really.
It's one process and pdflush writing the same dirty pages of the same file.

Mike.
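The ordering problem Miquel describes can be sketched with a toy model (entirely my own simplification, not kernel code): hold one bio back while its submitter waits for a free request, and the sequential stream reaches the hardware out of order; allocate and queue it first and only block afterwards, and order is preserved:

```python
# Toy model of the two request-allocation strategies under discussion.
# The "stall" parameter stands in for how long the submitter sits in
# get_request_wait(); names and the model itself are illustrative only.

def block_then_allocate(bios, stuck_index, stall):
    """Stock behaviour: the submitter blocks before its bio reaches
    the queue, so later bios overtake it and dispatch out of order."""
    dispatched = []
    held = None
    for i, sector in enumerate(bios):
        if i == stuck_index:
            held = (sector, stall)       # blocked waiting for a request
            continue
        dispatched.append(sector)        # later bios keep flowing
        if held is not None:
            sector_held, remaining = held
            if remaining - 1 <= 0:       # request finally allocated
                dispatched.append(sector_held)
                held = None
            else:
                held = (sector_held, remaining - 1)
    if held is not None:
        dispatched.append(held[0])
    return dispatched

def allocate_then_block(bios, stuck_index, stall):
    """Miquel's approach: allocate and queue the request first, then
    block, so the sequential order is preserved on the queue."""
    return list(bios)

sectors = list(range(8))                     # perfectly sequential bios
print(block_then_allocate(sectors, 2, 3))    # [0, 1, 3, 4, 5, 2, 6, 7]
print(allocate_then_block(sectors, 2, 3))    # [0, 1, 2, 3, 4, 5, 6, 7]
```

The model only captures ordering; it ignores merging, which is precisely what the late-arriving request can no longer benefit from once its neighbours have left for the device.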
* Re: IO scheduler, queue depth, nr_requests
  2004-02-19 20:59                     ` Miquel van Smoorenburg
@ 2004-02-19 22:52                       ` Nick Piggin
  2004-02-19 23:53                         ` Miquel van Smoorenburg
  0 siblings, 1 reply; 18+ messages in thread

From: Nick Piggin @ 2004-02-19 22:52 UTC (permalink / raw)
To: Miquel van Smoorenburg
Cc: Jens Axboe, Andrew Morton, linux-lvm, linux-kernel, thornber

Miquel van Smoorenburg wrote:

> On Thu, 19 Feb 2004 11:19:15, Jens Axboe wrote:
>
> > On Thu, Feb 19 2004, Miquel van Smoorenburg wrote:
> >
> > > > Shouldn't the controller itself be performing the insertion?
> > >
> > > Well, you would indeed expect the 3ware hardware to be smarter than
> > > that, but in its defence, the driver doesn't set sdev->simple_tags or
> > > sdev->ordered_tags at all. It just has a large queue on the host, in
> > > hardware.
> >
> > A too large queue. IMHO the simple and correct solution to your problem
> > is to diminish the host queue (sane solution), or bump the block layer
> > queue size (dumb solution).
>
> Well, I did that. Lowering the queue size of the 3ware controller to 64
> does help a bit, but performance is still not optimal - leaving it at 254
> and increasing the nr_requests of the queue to 512 helps the most.
>
> But the patch I posted does just as well, without any tuning. I changed
> it a little though - it only has the "new" behaviour (instead of blocking
> on allocating a request, allocate it, queue it, _then_ block) for WRITEs.
> That results in the best performance I've seen, by far.

That's because you are half introducing per-process limits.

> Now the style of my patch might be ugly, but what is conceptually wrong
> with allocating the request and queueing it, then block if the queue is
> full, versus blocking on allocating the request and keeping a bio
> "stuck" for quite some time, resulting in out-of-order requests to the
> hardware ?

Conceptually? The concept that you have everything you need to
continue and yet you block anyway is wrong.

> Note that this is not an issue of '2 processes writing to 1 file', really.
> It's one process and pdflush writing the same dirty pages of the same file.

pdflush is a process though, that's all that matters.
* Re: IO scheduler, queue depth, nr_requests
  2004-02-19 22:52                       ` Nick Piggin
@ 2004-02-19 23:53                         ` Miquel van Smoorenburg
  2004-02-20  0:15                           ` Nick Piggin
  0 siblings, 1 reply; 18+ messages in thread

From: Miquel van Smoorenburg @ 2004-02-19 23:53 UTC (permalink / raw)
To: Nick Piggin
Cc: Miquel van Smoorenburg, Jens Axboe, Andrew Morton, linux-lvm, linux-kernel, thornber

On Thu, 19 Feb 2004 23:52:32, Nick Piggin wrote:
> Miquel van Smoorenburg wrote:
>
> > On Thu, 19 Feb 2004 11:19:15, Jens Axboe wrote:
> >
> > > On Thu, Feb 19 2004, Miquel van Smoorenburg wrote:
> > >
> > > > > Shouldn't the controller itself be performing the insertion?
> > > >
> > > > Well, you would indeed expect the 3ware hardware to be smarter than
> > > > that, but in its defence, the driver doesn't set sdev->simple_tags or
> > > > sdev->ordered_tags at all. It just has a large queue on the host, in
> > > > hardware.
> > >
> > > A too large queue. IMHO the simple and correct solution to your problem
> > > is to diminish the host queue (sane solution), or bump the block layer
> > > queue size (dumb solution).
> >
> > Well, I did that. Lowering the queue size of the 3ware controller to 64
> > does help a bit, but performance is still not optimal - leaving it at 254
> > and increasing the nr_requests of the queue to 512 helps the most.
> >
> > But the patch I posted does just as well, without any tuning. I changed
> > it a little though - it only has the "new" behaviour (instead of blocking
> > on allocating a request, allocate it, queue it, _then_ block) for WRITEs.
> > That results in the best performance I've seen, by far.
>
> That's because you are half introducing per-process limits.
>
> > Now the style of my patch might be ugly, but what is conceptually wrong
> > with allocating the request and queueing it, then block if the queue is
> > full, versus blocking on allocating the request and keeping a bio
> > "stuck" for quite some time, resulting in out-of-order requests to the
> > hardware ?
>
> Conceptually? The concept that you have everything you need to
> continue and yet you block anyway is wrong.

For reading, I agree. For writing .. ah well, English is not my first
language, let's not argue about language semantics.

> > Note that this is not an issue of '2 processes writing to 1 file', really.
> > It's one process and pdflush writing the same dirty pages of the same file.
>
> pdflush is a process though, that's all that matters.

I understand that when the two processes are unrelated, the patch as I
sent it will do the wrong thing.

But the thing is, you get this:

- "dd" process writes requests
- pdflush triggers to write dirty pages
- too many pages are dirty so "dd" blocks as well to write synchronously
- "dd" process triggers "queue full" but gets marked as "batching" so
  can continue (get_request)
- pdflush tries to submit one bio and gets blocked (get_request_wait)
- "dd" continues, but that one bio from pdflush remains stuck for a while

That's stupid, that one bio from pdflush should really be allowed on
the queue, since "dd" is adding requests from the same source to it
anyway.

Perhaps writes from pdflush should be handled differently to prevent
this specific case ?

Say, if pdflush adds request #128, don't mark it as batching, but
let it block. The next process will be the one marked as batching
and can continue. If pdflush tries to add a request > 128, allow it,
but _then_ block it.

Would something like that work ? Would it be a good idea to never mark
a pdflush process as batching, or would that have a negative impact
for some things ?

Mike.
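The trace above can be condensed into a toy model (my own simplification; the real batching logic in get_request()/get_request_wait() is more involved): once the queue fills, the first submitter to hit "queue full" is marked batching and keeps allocating, while pdflush's single request blocks behind it:

```python
# Toy model of the "batching" exemption described in the trace above.
# A tiny queue keeps the event log short; all names are illustrative.

NR_REQUESTS = 4

def run_trace(submissions):
    queue, batching, events = [], None, []
    for proc, req in submissions:
        if len(queue) >= NR_REQUESTS and proc != batching:
            if batching is None:
                batching = proc          # first to hit "queue full" may go on
            else:
                events.append((proc, req, "blocked"))
                continue                 # everyone else waits in line
        queue.append((proc, req))
        events.append((proc, req, "queued"))
    return events

# dd fills the queue, trips "queue full" and becomes the batcher;
# pdflush's lone bio then gets stuck while dd keeps going.
subs = [("dd", i) for i in range(5)] + [("pdflush", 99), ("dd", 5)]
for event in run_trace(subs):
    print(event)
```

The last two events are the point of the complaint: pdflush's request 99 is "blocked" while dd's request 5, submitted later, is "queued".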
* Re: IO scheduler, queue depth, nr_requests
  2004-02-19 23:53                         ` Miquel van Smoorenburg
@ 2004-02-20  0:15                           ` Nick Piggin
  0 siblings, 0 replies; 18+ messages in thread

From: Nick Piggin @ 2004-02-20 0:15 UTC (permalink / raw)
To: Miquel van Smoorenburg
Cc: Jens Axboe, Andrew Morton, linux-lvm, linux-kernel, thornber

Miquel van Smoorenburg wrote:

> On Thu, 19 Feb 2004 23:52:32, Nick Piggin wrote:
>
> > Miquel van Smoorenburg wrote:
> >
> ...
> > > Note that this is not an issue of '2 processes writing to 1 file', really.
> > > It's one process and pdflush writing the same dirty pages of the same file.
> >
> > pdflush is a process though, that's all that matters.
>
> I understand that when the two processes are unrelated, the patch as I
> sent it will do the wrong thing.
>
> But the thing is, you get this:
>
> - "dd" process writes requests
> - pdflush triggers to write dirty pages
> - too many pages are dirty so "dd" blocks as well to write synchronously
> - "dd" process triggers "queue full" but gets marked as "batching" so
>   can continue (get_request)
> - pdflush tries to submit one bio and gets blocked (get_request_wait)
> - "dd" continues, but that one bio from pdflush remains stuck for a while

The batching logic can probably all be ripped out with per process
limits. It's too complex anyway really.

> That's stupid, that one bio from pdflush should really be allowed on
> the queue, since "dd" is adding requests from the same source to it
> anyway.

But the whole reason it is getting blocked in the first place
is because your controller is sucking up all your requests.
The whole problem is not a problem if you use properly sized
queues.

I'm a bit surprised that it wasn't working well with a controller
queue depth of 64 and 128 nr_requests. I'll give you a per process
request limit patch to try in a minute.

> Perhaps writes from pdflush should be handled differently to prevent
> this specific case ?
>
> Say, if pdflush adds request #128, don't mark it as batching, but
> let it block. The next process will be the one marked as batching
> and can continue. If pdflush tries to add a request > 128, allow it,
> but _then_ block it.
>
> Would something like that work ? Would it be a good idea to never mark
> a pdflush process as batching, or would that have a negative impact
> for some things ?

It's hard to know. Maybe a better solution would be to allow pdflush
to be exempt from the limits entirely as long as it tries not to write
to congested queues (which is what it does)...
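Nick's per-process request limit idea can be sketched as a toy admission check (the constants and the function are hypothetical illustrations, not his patch): with one global pool, a heavy writer can hold every request and starve pdflush; a per-process cap leaves headroom:

```python
# Toy admission check contrasting a single global request pool with a
# per-process cap. Values are illustrative; 128 matches the default
# nr_requests mentioned in the thread, 32 is an arbitrary cap.

NR_REQUESTS = 128
PER_PROCESS_LIMIT = 32

def can_allocate(in_flight, proc, per_process=False):
    """in_flight maps each submitter to its count of held requests."""
    if sum(in_flight.values()) >= NR_REQUESTS:
        return False                       # global pool exhausted
    limit = PER_PROCESS_LIMIT if per_process else NR_REQUESTS
    return in_flight.get(proc, 0) < limit  # own-share check

# Global pool only: dd has swallowed all 128 requests, pdflush starves.
print(can_allocate({"dd": 128}, "pdflush"))                   # False

# Per-process cap: dd was already stopped at its own 32, so the pool
# always has room for pdflush's single request.
print(can_allocate({"dd": 32}, "pdflush", per_process=True))  # True
print(can_allocate({"dd": 32}, "dd", per_process=True))       # False
```

The cap blocks dd against its own share rather than against the whole queue, which is why the stuck-bio scenario in this sub-thread cannot arise under such a scheme.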
* Re: IO scheduler, queue depth, nr_requests
  2004-02-19  2:11           ` Miquel van Smoorenburg
  2004-02-19  2:26             ` Andrew Morton
@ 2004-02-19  2:51             ` Nick Piggin
  2004-02-19 10:21               ` Jens Axboe
  1 sibling, 1 reply; 18+ messages in thread

From: Nick Piggin @ 2004-02-19 2:51 UTC (permalink / raw)
To: Miquel van Smoorenburg
Cc: Andrew Morton, axboe, linux-lvm, linux-kernel, thornber

Miquel van Smoorenburg wrote:

> No, I'm actually referring to a struct request. I'm logging this in the
> SCSI layer, in scsi_request_fn(), just after elv_next_request(). I have
> in fact logged all the bio's submitted to __make_request, and the output
> of the elevator from elv_next_request(). The bio's are submitted sequentially,
> the resulting requests aren't. But this is because nr_requests is 128, while
> the 3ware device has a queue of 254 entries (no tagging though). Upping
> nr_requests to 512 makes this go away ..
>
> That shouldn't be necessary though. I only see this with LVM over 3ware-raid5,
> not on the 3ware-raid5 array directly (/dev/sda1). And it gets less troublesome
> with a lot of debugging (unless I set nr_requests lower again), which points
> to a timing issue.

So the problem you are seeing is due to "unlucky" timing between two
processes submitting IO. And the very efficient mechanisms (merging,
sorting) we have to improve situations exactly like this is effectively
disabled. And to make it worse, it appears that your controller shits
itself on this trivially simple pattern.

Your hack makes a baby step in the direction of per *process* request
limits, which I happen to be an advocate of. As it stands though, I
don't like it. Jens has the final say when it comes to the block layer
though.
* Re: IO scheduler, queue depth, nr_requests
  2004-02-19  2:51             ` Nick Piggin
@ 2004-02-19 10:21               ` Jens Axboe
  0 siblings, 0 replies; 18+ messages in thread

From: Jens Axboe @ 2004-02-19 10:21 UTC (permalink / raw)
To: Nick Piggin
Cc: Miquel van Smoorenburg, Andrew Morton, linux-lvm, linux-kernel, thornber

On Thu, Feb 19 2004, Nick Piggin wrote:
> Miquel van Smoorenburg wrote:
>
> > No, I'm actually referring to a struct request. I'm logging this in the
> > SCSI layer, in scsi_request_fn(), just after elv_next_request(). I have
> > in fact logged all the bio's submitted to __make_request, and the output
> > of the elevator from elv_next_request(). The bio's are submitted
> > sequentially, the resulting requests aren't. But this is because
> > nr_requests is 128, while the 3ware device has a queue of 254 entries
> > (no tagging though). Upping nr_requests to 512 makes this go away ..
> >
> > That shouldn't be necessary though. I only see this with LVM over
> > 3ware-raid5, not on the 3ware-raid5 array directly (/dev/sda1). And it
> > gets less troublesome with a lot of debugging (unless I set nr_requests
> > lower again), which points to a timing issue.
>
> So the problem you are seeing is due to "unlucky" timing between
> two processes submitting IO. And the very efficient mechanisms
> (merging, sorting) we have to improve situations exactly like this
> is effectively disabled. And to make it worse, it appears that your
> controller shits itself on this trivially simple pattern.
>
> Your hack makes a baby step in the direction of per *process*
> request limits, which I happen to be an advocate of. As it stands
> though, I don't like it.

I'm very much an advocate for per process request limits as well. Would
be trivial to add... Miquels patch is horrible, I appreciate it being
posted as a cry for help.

-- 
Jens Axboe
end of thread, other threads: [~2004-02-26  0:39 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <1qJVx-75K-15@gated-at.bofh.it>
[not found] ` <1qJVx-75K-17@gated-at.bofh.it>
[not found] ` <1qJVw-75K-11@gated-at.bofh.it>
[not found] ` <1qLb8-6m-27@gated-at.bofh.it>
[not found] ` <1qLXl-XV-17@gated-at.bofh.it>
[not found] ` <1qMgF-1dA-5@gated-at.bofh.it>
[not found] ` <1qTs3-7A2-51@gated-at.bofh.it>
[not found] ` <1qTBB-7Hh-7@gated-at.bofh.it>
[not found] ` <1r3AS-1hW-5@gated-at.bofh.it>
[not found] ` <1r5jD-2RQ-31@gated-at.bofh.it>
[not found] ` <1r6fH-3L8-11@gated-at.bofh.it>
[not found] ` <1r6S4-6cv-1@gated-at.bofh.it>
2004-02-25 20:17 ` IO scheduler, queue depth, nr_requests Bill Davidsen
2004-02-25 21:39 ` Miquel van Smoorenburg
2004-02-26 0:39 ` Nick Piggin
[not found] <20040216131609.GA21974@cistron.nl>
[not found] ` <20040216133047.GA9330@suse.de>
[not found] ` <20040217145716.GE30438@traveler.cistron.net>
2004-02-18 23:52 ` Miquel van Smoorenburg
2004-02-19 1:24 ` Nick Piggin
2004-02-19 1:52 ` Miquel van Smoorenburg
2004-02-19 2:01 ` Nick Piggin
2004-02-19 1:26 ` Andrew Morton
2004-02-19 2:11 ` Miquel van Smoorenburg
2004-02-19 2:26 ` Andrew Morton
2004-02-19 10:15 ` Miquel van Smoorenburg
2004-02-19 10:19 ` Jens Axboe
2004-02-19 20:59 ` Miquel van Smoorenburg
2004-02-19 22:52 ` Nick Piggin
2004-02-19 23:53 ` Miquel van Smoorenburg
2004-02-20 0:15 ` Nick Piggin
2004-02-19 2:51 ` Nick Piggin
2004-02-19 10:21 ` Jens Axboe